Is Diffusion All You Need? Is Diffusion (Langevin Dynamics) the Free Lunch We Have All Been Waiting For?

Every month that passes, I grow more convinced that Diffusion (i.e. training with Langevin dynamics) is the free lunch we have all been searching for in AI. It's dominating subfield after subfield. Diffusion models can model multimodal distributions effectively, map very high-dimensional state spaces to very high-dimensional action spaces, and do so with incredible training stability. As many who have tried them say: "It just works." We've seen this dominance play out in multiple domains.

  1. Diffusion models took over image generation ~3 years ago.
  2. They took over video generation ~2 years ago.
  3. GNNs were SOTA for weather prediction (GraphCast), but last year they were dethroned by the diffusion-based GenCast [3].
  4. GNNs were SOTA for protein folding, but last year AlphaFold 3 leveraged diffusion to take SOTA [4].
  5. Also last year, Diffusion Policy dominated robotics [1], which IMO is a bigger deal than AlexNet! AlexNet dropped ImageNet top-5 error from ~26% to ~16%, but Diffusion Policy jumped robotics task success rates from ~20% to ~80%... a much bigger leap! This is the AlexNet moment for diffusion (Langevin dynamics) training. From their abstract: "This paper introduces Diffusion Policy, a new way of generating robot behavior by representing a robot's visuomotor policy as a conditional denoising diffusion process. We benchmark Diffusion Policy across 12 different tasks from 4 different robot manipulation benchmarks and find that it consistently outperforms existing state-of-the-art robot learning methods with an average improvement of 46.9%."
  6. And the long holdout in the autoregressive vs. diffusion battle was LLMs. Well, this past week LLaDA [2] dropped, and it gets comparable performance to AR-based LLMs! What's more, there has been no "Grad Student Descent" (or, nowadays, "VC Descent") applied to it yet. Just wait.

IMO the AI community is not making as big a deal about this as it should. Does the rest of the AI community not see it yet? It's just a matter of time until diffusion is used to solve Go, poker, MARL, and the other holdout subfields where it is not yet SOTA.

The Squint Test: Why Diffusion Feels More "Right" than AR

Here’s my squint test: if I squint at a plane and a bird, I can see how they’re similar. If I squint at an architecture of stacked transformers with 3 different training stages, and compare it to the brain, two giant lobes separated by a huge fissure (and a lot of inhibition), learning continuously with a single objective function the entire time, on remarkably small amounts of data, and still generalizing hugely, it does NOT pass my squint test. The biggest tell, IMO? Time to first token.

Humans don’t just start spewing out words instantly. We ponder. We naturally ponder more for harder problems and less for easier ones. We engage in a kind of iterative refinement before speaking. AR-LLMs don’t do this; they predict one token at a time, with fixed compute (and, generally, fixed latency) per token. I know test-time compute means just letting the model "think" for longer, and that is true. But it is not the same thing. There is a hard snapping back to token space at every step, which may not be right for the given problem. Smearing the compute across the time domain may work out to the same thing in the end, but for some tasks you do not have a CoT trace, and if you mandate that the model output nothing but the answer, it won't work. E.g. bit parity or a rolling sum without CoT.

Diffusion allows a more natural process: keep calling the model again and again until convergence, which is a far more natural stopping criterion than an AR-LLM having to emit an "eos" token. In diffusion, with more inference steps, non-convergence of the output simply means more "thinking" is needed. Very natural. And even more important IMO, that thinking is NOT broadcast back to token space; it stays in latent space. Think about how you ponder: you aren't necessarily thinking in token space. Maybe you are visualizing; maybe you aren't even aware of what space you are doing your thinking in. That is what diffusion models are doing. Depending on the implementation, there isn't necessarily a forcing back to token space; you can stay in latent space until you feel you are "done thinking". It is also natural to map this onto Type 1 vs. Type 2 thinking: Type 1 is just the first iteration of the diffusion at inference time; Type 2 is running it for longer. That aligns with how humans do Type 1/2 thinking, to me at least.
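To make the "keep calling the model until convergence" idea concrete, here is a minimal, hedged sketch. Everything in it, the denoise_step callable, the tolerance, the latent shape, is an illustrative assumption of mine, not any paper's actual implementation:

    import torch

    def ponder_until_converged(denoise_step, x, latent_shape, max_steps=512, tol=1e-3):
        # Keep refining a latent until the update is tiny: non-convergence == keep "thinking".
        z = torch.randn(latent_shape)                 # start from pure noise, conditioned on x
        for step in range(max_steps):
            z_next = denoise_step(x, z, step)         # one refinement pass; everything stays in latent space
            if (z_next - z).norm() < tol * (z.norm() + 1e-8):
                return z_next, step + 1               # converged: the model is "done thinking"
            z = z_next
        return z, max_steps                           # budget exhausted: a genuinely hard problem

Easy inputs exit after a handful of steps (Type 1); hard ones keep iterating (Type 2), and nothing gets decoded back to token space until the loop decides it is done.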

LLaDA: Diffusion Is Already Competitive with AR-LLMs

I want to include this snippet from LLaDA [2] that jumped out at me.

"LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task."

So not only is the performance comparable, it is also more "human-like": it doesn't fail in the ways AR-LLMs fail. Also think about the compute pattern in humans vs. AR-LLMs. AR-LLMs think by computing X FLOPs, emitting one token, then computing X FLOPs again for the next token, and so on. But we don't! We do a lot of "upfront" thinking and then generate structured output quickly. Diffusion allows that: more inference steps mean deeper pondering, leading to higher-quality generations.
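As a toy, hedged illustration of that compute pattern, counting model calls only; the function names and numbers below are made up for illustration, not measurements of any system:

    # Toy accounting of model CALLS, not a benchmark.
    def ar_calls(n_output_tokens):
        # autoregressive: fixed compute, emit a token, fixed compute, emit a token, ...
        return n_output_tokens                  # one forward pass per emitted token

    def diffusion_calls(n_refinement_steps):
        # diffusion: the pondering happens up front over the whole draft, then the
        # structured output appears at once; the number of calls tracks how hard
        # you choose to think, not how long the answer is
        return n_refinement_steps

    print(ar_calls(200), diffusion_calls(10), diffusion_calls(500))  # same answer length, adjustable pondering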

Diffusion is implicit "self-boosting"

Power of boosting. In machine learning, boosting is a simple yet effective technique for building a strong predictor by chaining many weak learners, each only slightly better than chance. Every learner focuses on the residual mistakes left by the previous ones, i.e. each successive model is trained on the mistakes of the models before it, driving the overall "ensemble" error down exponentially.
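Here is a minimal, hedged sketch of that idea, using depth-1 decision trees as the weak learners; the data, learning rate, and number of rounds are arbitrary choices for illustration:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)                  # noisy 1-D regression target

    prediction = np.zeros_like(y)
    learners, lr = [], 0.3
    for _ in range(50):
        residual = y - prediction                                     # what the ensemble still gets wrong
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)   # a weak learner, barely better than a constant
        prediction += lr * stump.predict(X)                           # each stump fixes a bit of the remaining error
        learners.append(stump)

    print("final MSE:", np.mean((y - prediction) ** 2))               # error shrinks steadily as weak learners accumulate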
Why diffusion can be cast as self-boosting. When training with denoising diffusion, the same network is applied for T timesteps, starting from pure noise and removing some of the noise at each step (or adding back signal, if predicting x or v instead of the error). So each step is a weak denoiser. Because every step conditions on the partially denoised sample produced one step earlier, the residual errors are automatically re-weighted, concentrating almost entirely on what earlier steps failed to fix. Analytically, the reverse-time SDE can be written as an iterative estimator whose fixed point is the data distribution, mirroring how boosting iteratively reduces bias. Empirically, work such as CVPR 2024's MASF shows that "ensembling" (or averaging, in that case) intermediate denoised states further reduces variance, directly echoing classic boosting theory. The diffusion process is therefore a single network bootstrapping ITSELF through time: hundreds of weak denoising updates accumulate into a high-fidelity sample, just as boosting turns weak learners into a strong one. Hence diffusion can be viewed as implicit "self-boosting", where the network becomes an ensemble of "weak" learners, and that is what makes it work so well as a training technique, IF the model has sufficient capacity.
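A toy numerical illustration of that error decay; this is obviously not a learned diffusion model, and the step quality and step count are made-up numbers, just to show the boosting-style geometric shrinkage of the residual when the SAME weak update is applied again and again:

    import torch

    target = torch.randn(1000)                          # the "clean" sample we want to reach
    estimate = torch.randn(1000)                        # start from pure noise
    step_quality = 0.15                                 # a weak step: only 15% of the residual removed per call
    initial_residual = (target - estimate).norm()

    for t in range(50):
        residual = target - estimate                    # exactly what all previous steps failed to fix
        estimate = estimate + step_quality * residual   # one weak denoising update, conditioned on the last estimate

    print((target - estimate).norm() / initial_residual)  # == (1 - 0.15) ** 50 ~ 3e-4: exponential error decay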

Vibe Check and is it biologically plausible?

I really hate that term, but I can't think of a better header. When I talk to people who have tried diffusion training for their problem, everyone says the same thing: "it just works." Like, no hyperparameter tuning. That is nuts! The other vibe check is that it is very easy to see how the brain could be doing diffusion. Diffusion is literally one of the most basic processes in physics; it is seen everywhere, and it obviously appears in the brain as well. You have one path which is a "clean" path, and then you have a "dirty" path, which is any other circuit with a few more neurons in the way, and you just want to get the dirty path to match the clean one. To "denoise" toward the clean signal. That's it. I don't pretend to be a neuroscientist, but it seems obvious.
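To make that hand-wavy picture slightly more concrete, here is a toy, hedged sketch of what "get the dirty path to match the clean path" could mean computationally. The two pathways, the noise level, and the objective are all illustrative assumptions of mine, not neuroscience:

    import torch
    import torch.nn as nn

    clean_path = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 16))   # short, direct circuit
    dirty_path = nn.Sequential(nn.Linear(16, 32), nn.Tanh(),                      # longer circuit with
                               nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 16))   # "a few more neurons in the way"
    opt = torch.optim.Adam(dirty_path.parameters(), lr=1e-3)

    for _ in range(200):
        x = torch.randn(64, 16)
        noisy_x = x + 0.3 * torch.randn_like(x)                   # the dirty route sees a corrupted signal
        target = clean_path(x).detach()                           # the clean route provides the denoising target
        loss = nn.functional.mse_loss(dirty_path(noisy_x), target)
        opt.zero_grad(); loss.backward(); opt.step()              # the noisy circuit learns to denoise toward the clean one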

Diffusion is one of the most Data Efficient training procedures

As shown in the Diffusion Policy paper [1], with just 50-100 demonstrations per task the model is able to generalize. That is crazy if you compare it to almost any other training procedure.

Diffusion-Based Learning is perhaps the simplest General-Purpose Training Method

The last argument I will make is Occam's razor. Diffusion is so, so, so simple! Just look at this (a hedged, DDPM-style sketch; the denoiser model, optimizer, and beta noise schedule are passed in). It can be used for supervised learning, unsupervised learning, RL, etc. And it always "just works"!

    import torch
    import torch.nn.functional as F

    def diffusion_train(model, optimizer, data, betas):
        # Noise the *target* y at a random timestep and train the model to predict that noise.
        alphas_bar = torch.cumprod(1.0 - betas, dim=0)
        for x, y in data:                                        # x = conditioning input, y = target (image, action, ...)
            t = torch.randint(0, len(betas), (y.shape[0],))      # a random timestep per sample
            noise = torch.randn_like(y)
            a = alphas_bar[t].view(-1, *([1] * (y.dim() - 1)))   # broadcastable signal level at step t
            y_t = a.sqrt() * y + (1 - a).sqrt() * noise          # closed-form forward (noising) process
            loss = F.mse_loss(model(x, y_t, t), noise)           # learn to undo the corruption
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    def diffusion_inference(model, x, shape, betas):
        # Start from pure noise and repeatedly call the model, removing a little noise at each step.
        alphas, alphas_bar = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)
        y = torch.randn(shape)
        for t in reversed(range(len(betas))):
            eps = model(x, y, torch.full((shape[0],), t))                             # predicted noise at step t
            y = (y - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()  # DDPM reverse update
            if t > 0:
                y = y + betas[t].sqrt() * torch.randn_like(y)                         # re-inject a bit of noise
        return y
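
And here is a toy, hedged usage example; the ToyDenoiser, the dummy data, and every hyperparameter below are made up purely to show the calling convention, not taken from any paper:

    import torch
    import torch.nn as nn

    class ToyDenoiser(nn.Module):
        # Predicts the noise added to a 2-D target y, conditioned on a 2-D input x and the timestep t.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 2))
        def forward(self, x, y_t, t):
            t_feat = t.float().unsqueeze(-1) / 100.0             # crude timestep "embedding"
            return self.net(torch.cat([x, y_t, t_feat], dim=-1))

    model = ToyDenoiser()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    betas = torch.linspace(1e-4, 0.02, 100)                      # a common linear noise schedule
    data = [(torch.randn(32, 2), torch.randn(32, 2)) for _ in range(100)]  # dummy (x, y) pairs
    diffusion_train(model, optimizer, data, betas)
    samples = diffusion_inference(model, torch.randn(32, 2), (32, 2), betas)

In the supervised-learning reading, x is the input and y the label; in the robotics reading, x is the observation and y the action; the loop is identical either way.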
    

And wouldn't evolution be more likely to find the simplest algorithm? Especially since "adding noise" is pretty easy for nature to achieve. I also don't believe backprop is the right solver, since neurons are one-way (you can't backprop through a neuron). So IMO the most exciting direction is Zero-Order RNNs + Diffusion Online RL. To me, that is the recipe to crank on.

Conclusion: Diffusion Is our Free Lunch! So go use it!

Diffusion models are hitting SOTA in more and more subfields, and I believe that trend will not stop. I think it's the most likely candidate for the free lunch we've all been waiting for. It's still early, but the results so far are already insane.


References

  1. Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv:2303.04137.
  2. Nie, S., et al. (2025). Large Language Diffusion Models. arXiv:2502.09992.
  3. Price, I., et al. (2023). GenCast: Diffusion-based Ensemble Forecasting for Medium-range Weather. arXiv:2312.15796.
  4. Abramson, J., et al. (2024). Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature.
