(Train-Time) Recurrence as a Necessary Condition for General Intelligence

To get to AGI, we need an architecture that is "Turing complete", and a necessary condition for an architecture to be Turing complete is unbounded recurrence/iteration: the program needs to be able to call itself as many times as it wants. LLMs / diffusion models are said to be "Turing complete" [1]. However, this is only true at test time, NOT during training. Unbounded recursion is not possible in the single forward pass the model gets during training. To make this work at all, we have to explicitly teacher-force the intermediary goal states (one per time step) to achieve any amount of success (CoT for LLMs, a slightly-denoised noisy input for diffusion). However, this is much more limiting than most people realize. We train it this way and then cross our fingers and hope that the non-recurrent model can be called in a while-true loop and will magically become Turing complete.

In the most popular SOTA approach right now, what I call "teacher-forcing recurrence" + backprop, your task must be 1) decomposable into 'known' intermediary steps explicitly provided to the model (snapped into a discrete token space for LLMs, or kept in continuous pixel space for diffusion image gen), and 2) trainable from one 'known' intermediary result to the next 'known' intermediary result with just one forward pass of the model. If either is not true, more inference, model size, tokens, training steps, GPUs, etc. will provide no more value to you, and it won't get us to AGI. E.g. what if it's a problem where we don't know the intermediary results/decomposition? Or worse, we don't even know the final answer? Or the intermediary results are not decomposable into discrete tokens (e.g. visual simulation is better than text tokens)? In all of these cases this "smearing" over token space won't save you, no matter how much you train or test-time infer. Think of Ramanujan's "god gave me the answer", where he couldn't explain where the answer came from. How do you train a model to think like this? How do you solve any of the Millennium Prize problems? What's the trace?

To get the rest of the way there, we need to define distinct problem classes that highlight these issues with current architectures / training techniques, so we can try out new ones on tiny subproblems per class and, once we get some working, then worry about scaling up. Simple things like rolling sum, sin, sort, copy, DFS, BFS, factorial, all the CLRS algorithms, etc.: non-compressible sequential problems.
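
To make this concrete, here is a minimal sketch of one such toy problem class, the rolling sum. All names and shapes here are my own illustration, not from any particular benchmark:

```python
# A tiny, illustrative generator for one of the toy problem classes
# mentioned above (rolling sum). Batch sizes, vocab, etc. are arbitrary.
import numpy as np

def make_rolling_sum_batch(batch_size=32, seq_len=16, vocab=10, seed=None):
    """Inputs are random digit sequences; targets are the running (rolling) sum.

    The underlying 'program' is trivially recursive (state += x_t), which is
    exactly the kind of non-compressible sequential structure a train-time
    recurrent model should discover on its own, without being handed the
    intermediate states.
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(0, vocab, size=(batch_size, seq_len))
    y = np.cumsum(x, axis=1)  # the rolling sum at every time step
    return x, y

if __name__ == "__main__":
    x, y = make_rolling_sum_batch(batch_size=2, seq_len=5, seed=0)
    print("inputs:\n", x)
    print("rolling sums:\n", y)
```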

I have heard specious arguments that RL against 'enough' of these traces will get you there, and that the model will somehow learn recurrence programs beyond what it was trained on, e.g. train on more Lean proofs and it will learn how to solve unsolved problems. While there may be examples of this working when it reuses the same program we already had it learn, it will never learn a new way to prove theorems. Furthermore, this is NOT a path to intelligence. Intelligence was not created on Earth by training on a bunch of Lean proofs or code. In fact, the opposite! The Lean proofs were created BY THE INTELLIGENCE! So what process/training procedure/architecture created the intelligence?

To give a hard example: imagine mankind only knew about bubble sort (which is O(n^2)) and not merge sort (which is O(n log n)). Merge sort is just a few if statements and a recursive call, with each level of recursion being one level of hierarchy in the 'tree' of compute. Quite simple but powerful. Instead we teacher-force on bubble sort, which performs much worse and is awful at run time, but it's the "CoT" trace we trained on. In no way will merge sort "emerge" at test time, no matter how many monkeys you have typing. You just taught the model to recurse each time step via the bubble sort scheme, so how could it?
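
For reference, here is how little machinery merge sort actually needs (plain Python, just to underline the point about a few if statements and a recursive call):

```python
# Merge sort: a couple of comparisons and a recursive call, O(n log n).
# Each level of recursion is one level of the hierarchy in the 'tree' of compute.
def merge_sort(xs):
    if len(xs) <= 1:               # base case
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])    # recurse on each half
    right = merge_sort(xs[mid:])
    out, i, j = [], 0, 0           # merge the two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```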

By teacher-forcing "current SOTA solutions", we will never sample a "beyond SOTA solution" at test time. We must train with "adaptive compute time in latent space" (whether the latents are continuous or from a codebook). And IMO this is a necessary architectural feature to get the rest of the way there to AGI/ASI.

You need to let the model recurse as many times as it wants during training to enable new, cool programs to be found. You should not force it to do everything in only one forward pass, or tell it so explicitly what each intermediary state should be. This is just too limiting. When you relax this constraint, you let the model find super cool novel hierarchies and programs that we have never thought of! And even better, the programs could end up being really simple, i.e. a solution with a very small Kolmogorov complexity.

Another way to look at this is "self-boosting". We let the model learn to improve its own output by feeding its previous outputs back in as input and training it to improve upon them. This uses the correct input distribution for the model, because at test time this is what will really happen, and it is the inductive bias we are looking for. This is NOT what happens in diffusion today. We give the model an explicitly noised-up image (not sampled from the model itself) and have it learn to denoise it a bit, but at test time it has to take in its own output. And if there were shortcuts possible, it never learned them, because you teacher-forced the denoising schedule! We need to stop thinking in terms of "teacher forcing". We need uninhibited recurrence embedded into the architecture itself at train time for it to actually be Turing complete, not hope and pray that our teacher forcing will turn a non-Turing machine into a Turing machine with a while-true loop.
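
A minimal sketch of what I mean by self-boosting, using a toy residual refiner. The module, step counts, and toy target are all made-up assumptions; the only point is that the model refines its own previous output rather than a teacher-forced noised-up input:

```python
# Self-boosting sketch: the model repeatedly refines its OWN previous output,
# so the train-time input distribution matches what it will see at test time.
# All module sizes, step counts, and the toy target are arbitrary illustrations.
import torch
import torch.nn as nn

class Refiner(nn.Module):
    """Takes (conditioning, current guess) and outputs an improved guess."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, cond, guess):
        # Residual refinement of the current guess.
        return guess + self.net(torch.cat([cond, guess], dim=-1))

model = Refiner()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    cond = torch.randn(32, 16)
    target = torch.sin(cond)                      # toy ground truth the model should reach
    guess = torch.zeros_like(target)              # start from scratch, NOT from a noised target
    for _ in range(torch.randint(1, 5, ()).item()):
        guess = model(cond, guess)                # refine its OWN previous output
    loss = ((guess - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```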

Our brain is massively 'loopy' / recurrent, and that is embedded into the architecture itself, in both the "inner loops" (microcircuits) and the big "outer loops" (across the two lobes). This is not a random detail that we can circumvent and still achieve nice things. Going "bigger" will not solve this (perhaps theoretically, but not realistically).

We just need the algo to train it.

So.... what are some candidates to enable this "train time recurrence"?

  1. HRM (Hierarchical Reasoning Model): Some real magic has been demonstrated in this paper, and I think it is more important than people realize. They show you can do recurrence at train time and teach it with just TBPTT with T=1, and somehow nice things happen?? Can't believe it, but it's true. Very similar to my SLN (Strange Loop Networks), except they actually got it to train. There must be a few scaling laws in this paper still unexplored: scaling Nmax, scaling the number of frequencies maybe, scaling model size, scaling iterations, etc. The only thing I don't like about this paper is that there is no "search" at test time. So it needs a "world model", and the policy needs to invoke it during training. You want that embedded into the architecture. Side note: if TBPTT @ T=1 works for reasoning, it should work for diffusion too. Diffusion only iterates once on theta before incurring a loss. Perhaps we should use the same training procedure but allow N ponder steps, sampled between Nmin and Nmax, for diffusion/flow matching as well! Training not from randomly noised-up images but from the model's own biased sampling. I've never heard of training image/video gen like that. It should work! There will be many papers doing this with diffusion/flow matching soon, and it will work meaningfully better, with higher accuracy and MUCH fewer denoising steps for image generation/video generation/diffusion policy/etc. (A minimal sketch of the one-step-gradient training loop I have in mind follows this list.)
  2. RNNs + zero-order training: like SPSA or otherwise. Obviously... :) (A bare-bones SPSA sketch also follows below.)
  3. Mix of both?
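
For candidate 1, here is roughly the train-time recurrence trick I have in mind, sketched under my own assumptions (this is not HRM's actual code): recur for a sampled number of ponder steps with gradients off, then take one final step with gradients on, i.e. TBPTT with T=1.

```python
# One-step-gradient recurrence sketch: recur freely with no_grad, then take a
# single gradient step on the last iteration. Module sizes, the toy target, and
# the ponder-step range are all arbitrary assumptions.
import torch
import torch.nn as nn

class RecurrentCore(nn.Module):
    """Tiny stand-in for a recurrent reasoning module."""
    def __init__(self, dim=32):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)
        self.readout = nn.Linear(dim, dim)

    def step(self, x, h):
        return self.cell(x, h)

core = RecurrentCore()
opt = torch.optim.Adam(core.parameters(), lr=1e-3)
N_MIN, N_MAX = 2, 16                         # range of ponder steps to sample from

for it in range(1000):
    x = torch.randn(8, 32)
    target = torch.tanh(x.sum(dim=-1, keepdim=True)).expand(-1, 32)  # toy target
    h = torch.zeros(8, 32)
    n_ponder = torch.randint(N_MIN, N_MAX + 1, ()).item()
    with torch.no_grad():                    # recur as deep as sampled, no gradient bookkeeping
        for _ in range(n_ponder - 1):
            h = core.step(x, h)
    h = core.step(x, h.detach())             # one final step WITH gradients (TBPTT, T=1)
    loss = ((core.readout(h) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```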
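
And for candidate 2, a bare-bones SPSA update with arbitrary hyperparameters: perturb all parameters along a random ±1 direction, evaluate the loss twice, and estimate the gradient from the difference, so the recurrence depth is never backpropagated through at all.

```python
# Zero-order (SPSA) sketch for training an arbitrarily deep recurrence.
# The RNN cell, toy target, unroll depth, and step sizes are all illustrative.
import torch
import torch.nn as nn

def unrolled_loss(model, x, target, n_steps=64):
    """Unroll the recurrence as deep as we like: SPSA never backprops through it."""
    h = torch.zeros(x.size(0), model.hidden_size)
    for _ in range(n_steps):
        h = model(x, h)
    return ((h - target) ** 2).mean().item()

model = nn.RNNCell(16, 16)
params = list(model.parameters())
a, c = 1e-2, 1e-3                            # SPSA step size and perturbation scale (arbitrary)

for it in range(500):
    x = torch.randn(8, 16)
    target = torch.tanh(x)                   # toy target
    with torch.no_grad():
        deltas = [torch.sign(torch.randn_like(p)) for p in params]  # random ±1 directions
        for p, d in zip(params, deltas):     # evaluate loss at theta + c*delta
            p.add_(c * d)
        loss_plus = unrolled_loss(model, x, target)
        for p, d in zip(params, deltas):     # ... and at theta - c*delta
            p.sub_(2 * c * d)
        loss_minus = unrolled_loss(model, x, target)
        for p, d in zip(params, deltas):     # restore theta, then take the SPSA step
            p.add_(c * d)
            p.sub_(a * (loss_plus - loss_minus) / (2 * c) * d)
```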

Paper coming soon. I want to try a few things. First, I want to try HRM + SPSA. It should work. Second, I can't believe no one has tried this: take a "frozen" pretrained LLM, forward-pass it on Needle-in-a-Haystack and RULER, take those final hidden states as input to a new RNN module, and teacher-force it to see if it can learn general programs from this awesome pretrained latent space. I am less excited by this, but it should be a cool paper and it should work. This is so common in CV, yet no one has done it here. In CV you typically have a backbone model that you freeze (like a ResNet-50 or whatever), then you forward-pass it on MSCOCO, save the activations as your new input signal, and train an RPN on top to do object/instance detection, for example, touching only the RPN. Same premise. BUT this is still not going to be a train-time Turing-complete architecture, so let's focus on HRM + SPSA. That way it's trained to be fully recurrent during training vs. just the final state. Let's see!
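
A rough sketch of that frozen-backbone experiment. The model name, toy task, and head are all placeholder assumptions on my part (a real run would use long-context needle/RULER data and index the last non-pad position):

```python
# Frozen-backbone sketch: forward-pass a frozen pretrained LM, take its final
# hidden states, and train only a small RNN head on top (same premise as a
# frozen ResNet-50 feeding an RPN in CV). Names and task are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                         # gpt2 has no pad token by default
backbone = AutoModel.from_pretrained("gpt2").eval()   # frozen, never updated
for p in backbone.parameters():
    p.requires_grad_(False)

head = nn.GRU(backbone.config.hidden_size, 256, batch_first=True)  # the only trainable part
readout = nn.Linear(256, 2)                                        # e.g. needle found / not found
opt = torch.optim.Adam(list(head.parameters()) + list(readout.parameters()), lr=1e-4)

texts = ["the needle is 42 ... lots of haystack ...", "pure haystack, no needle here"]
labels = torch.tensor([1, 0])

batch = tok(texts, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = backbone(**batch).last_hidden_state      # (batch, seq, hidden) from the frozen LM

out, _ = head(hidden)
logits = readout(out[:, -1])                          # classify from the final RNN state
loss = nn.functional.cross_entropy(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()          # only the RNN head and readout move
```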
