All you need is a World Model + Policy + Value Model; that's it.

What are the minimally sufficient skills required of an intelligent agent? Let's use playing chess as our running example.

  1. You need a model of what your opponent is going to do (a World Model: a system that takes in the current state S_t and an action sequence A_{t:n} and outputs a predicted state trajectory S_{t+1:n}). For chess, we can model the other player as ourselves, but we have to model them (the world) in some way.
  2. You need to score a candidate trajectory (or set of states) with a Value Model V(S_{t+1:n}). For example, in chess: if I move my pawn here, I will lose my pawn (bad), but I will take their queen (good). So the "valence" of that trajectory is net good.
  3. Then you need a policy π(A_{t:n} | S_t), because eventually you need to take an action, which is why we are not JUST world models. You need to make a choice eventually. Sometimes that choice may be thoughtless (in the beginning for sure, while the child is still learning), but it becomes more and more informed the more games are played, as AlphaGo showed us.
  4. Some actions the policy can take: (1) prompt the world model (evaluate a possible sequence of moves), (2) score a trajectory (given this sequence, is it net good or net bad?), (3) take an "external" action. So the policy is given a state S_t, generates a few candidate A_{t:n}'s to evaluate, prompts the world model with each one to get back a bunch of S_{t+1:n}'s, and then invokes the Value Model to score each one, V(S_{t+1:n}). Then it makes the final choice of "external" action that it estimates will maximize V (see the planning sketch after this list). If, for example, the world model gives us little confidence, we may just stand still in shock. That is what most people do when they don't have a built-up world model for a new task. See older people on a computer for evidence. :)
  5. Then (the most important step): how do we update all three models' weights? After we commit the action externally, we get back the true S_{t+1:n} and compare it to our prediction to update the world model (even HORRIBLE moves help us learn this, which is why babies "play"!). We also get the true valence V(S_{t+1:n}) (pain, sadness, embarrassment, happiness, or indifference) to update the Value Model (again, even BAD moves are still helpful in this way). Finally, after updating both, then and only then can we update the policy π(A_{t:n} | S_t) so that next time we make a better decision (see the update sketch after this list).
  6. I think that's it?
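
Here is a minimal sketch, in Python, of the three components and the planning step in items 1–4. Everything in it is a placeholder of my own invention (the class names, the toy number-line world, the hard-coded candidate action sequences); it is meant to show the shape of the loop, not an actual architecture.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WorldModel:
    """Takes S_t and an action sequence A_{t:n}; returns a predicted trajectory S_{t+1:n}."""
    step: Callable  # step(state, action) -> next_state

    def rollout(self, state, actions: Sequence) -> List:
        trajectory = []
        for action in actions:
            state = self.step(state, action)
            trajectory.append(state)
        return trajectory


@dataclass
class ValueModel:
    """Scores a trajectory: is it net good or net bad?"""
    score_state: Callable  # score_state(state) -> float

    def value(self, trajectory: Sequence) -> float:
        return sum(self.score_state(s) for s in trajectory)


@dataclass
class Policy:
    """Proposes candidate action sequences A_{t:n} given S_t, then picks one to commit."""
    propose: Callable  # propose(state) -> list of candidate action sequences

    def act(self, state, world_model: WorldModel, value_model: ValueModel):
        # (1) propose candidates, (2) prompt the world model with each,
        # (3) score each predicted trajectory, (4) commit to the first action
        # of the sequence with the highest estimated V.
        scored = []
        for actions in self.propose(state):
            trajectory = world_model.rollout(state, actions)
            scored.append((value_model.value(trajectory), actions, trajectory))
        best_value, best_actions, best_trajectory = max(scored, key=lambda c: c[0])
        return best_actions[0], best_trajectory, best_value


# Toy usage: the state is a number, actions are +1 / -1, and "good" means
# being close to 10. The policy proposes two hard-coded candidate sequences.
wm = WorldModel(step=lambda s, a: s + a)
vm = ValueModel(score_state=lambda s: -abs(10 - s))
pi = Policy(propose=lambda s: [[+1, +1, +1], [-1, -1, -1]])
action, predicted_traj, est_value = pi.act(0, wm, vm)  # picks +1
```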
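
And a sketch of the update step in item 5, in the same toy spirit. The learnable parameters (a scalar bias in the world model, a scalar weight in the value model) and the squared-error updates are assumptions of mine to make the ordering concrete; the policy update is left as a comment, since how to train all three together is the open question below.

```python
class ToyWorldModel:
    """Predicts S_{t+1:n} from S_t and A_{t:n} with toy dynamics s' = s + a + bias."""
    def __init__(self, bias=0.0):
        self.bias = bias  # the one learnable parameter in this toy

    def rollout(self, s, actions):
        trajectory = []
        for a in actions:
            s = s + a + self.bias
            trajectory.append(s)
        return trajectory

    def update(self, s, actions, true_trajectory, lr=0.05):
        # Squared prediction error vs. the observed truth; even a horrible
        # move produces this learning signal.
        pred = self.rollout(s, actions)
        grad = sum(2.0 * (p - t) * (i + 1)  # pred_i depends on bias (i + 1) times
                   for i, (p, t) in enumerate(zip(pred, true_trajectory))) / len(pred)
        self.bias -= lr * grad


class ToyValueModel:
    """Scores a trajectory as V(traj) = w * sum(states)."""
    def __init__(self, w=0.0):
        self.w = w  # the one learnable parameter in this toy

    def value(self, trajectory):
        return self.w * sum(trajectory)

    def update(self, true_trajectory, true_valence, lr=0.01):
        # Regress the predicted score toward the valence actually experienced
        # (pain, embarrassment, happiness, indifference).
        pred = self.value(true_trajectory)
        grad = 2.0 * (pred - true_valence) * sum(true_trajectory)
        self.w -= lr * grad


def learn_from_outcome(world_model, value_model, s, committed_actions,
                       true_trajectory, true_valence):
    # Ordering per item 5: world model first, value model second, and only
    # then the policy.
    world_model.update(s, committed_actions, true_trajectory)
    value_model.update(true_trajectory, true_valence)
    # policy update goes here -- training all three together is left open.
```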

We are not JUST a world model; we are a policy that has a very good world model to lean on. That's what makes us so sample efficient. Diffusion Policy is JUST a policy; it needs a world model and a valence model. Then it may not even need labels! Hmm... BUT how do you train all three networks at the same time??? That's the key... paper coming soon. :)

