My Journey Going from First Order to Zeroth Order to Reduce the Cost of Intelligence a Million-Fold

Breaking Free from Gradients: The Pain of Zeroth-Order Optimization

I have been investigating zeroth-order optimizers for the past two months, and it has been a gruelling exercise. It's hard to describe how good we have it with backpropagation. It is extremely good at its job. As we train deeper and deeper neural networks, the parameter space grows, and the right direction to travel in high-dimensional R^n becomes harder and harder to find, especially at later stages of training where smaller steps are needed for improvement. As Sutton recently showed in his continual-learning paper, plasticity decays with every training step. Initially almost all directions reduce the loss (no degeneracy). With every iteration, more degenerate (singular) directions appear, shrinking the set of loss-reducing directions.

Backprop always hands you the steepest-descent direction, provided the set of loss-reducing directions isn't empty. Zeroth order, on the other hand, must "guess" a step: perturb the weights in a random direction, measure the change in loss, and step in that direction only if it reduces the loss. So it has to get "lucky", and the larger the model, the harder it is to get lucky. It doesn't get lucky, and that's why it doesn't work at scale. The only saving grace is to shrink what gets perturbed, or at least how the perturbation is stored. For example, MeZO perturbs parameters in place one layer at a time (regenerating the noise from a stored seed), DeepZero only shows results on ResNet-20 trained on CIFAR-10 / ImageNet-10, which are very small-scale experiments, and LeZO perturbs only select layers. Also, MeZO and LeZO are only tested for fine-tuning a model, not for training from scratch!

This is why backprop has been king for so long. But it comes with a lot of terms and conditions. It works, but we are restricted to continuous, differentiable operations, and we have to keep activations around to compute the chain of gradients. That's not terrible for most deep learning, but for sequence modeling with long contexts it is a complete deal-breaker. This is why LSTMs, NTMs, and DNCs lost to stacked Transformers. As models get larger and memory becomes the bottleneck, backpropagation through time (BPTT) for RNNs is an absolute killer: its VRAM requirement scales O(c), with c the context length. Transformers parallelized training, but they did not solve the memory problem; in fact, they made it worse, since naive attention scales VRAM O(c^2)! This is why Elon and the hyperscalers are procuring nuclear power plants to train and run their AI. That's how absurdly inefficient it is in terms of VRAM requirements. It's like we've invented a 0.1%-efficient Carnot engine and, instead of trying to make it more efficient, said "get me a gas tank the size of Rhode Island and make the engine the size of a skyscraper"! Note: Mamba and Native Sparse Attention (NSA) claim to solve this, getting us back to O(c), but I have not tested them myself.

Who cares? Because this is an alternative path to AGI, one that scales not with terawatts of electricity but with something closer to the roughly 100 watts a human runs on.

Enter zeroth-order (ZO) optimization: training without gradients. No BPTT, no storing activations, no exploding VRAM, so no giant datacenter! These methods scale VRAM O(1) in context length, so a context of one million tokens could cost a million times less memory. If we can get zeroth order working and use it to train giant DNCs / NTMs, it will bring down the cost of "intelligence" a million-fold! Sounds amazing, right? The only catch: ZO optimization doesn't work (yet) in high-dimensional parameter spaces. Minor detail. You start training with tons of plasticity; almost every random perturbation decreases the loss. Then, as training progresses, things slow down very fast. Suddenly, most perturbation directions stop helping. Learning stops. Almost all directions are degenerate, i.e. they no longer affect the output or the loss. Training dies so quickly that even memorizing a single batch becomes impossible. Why?

Let’s get into it. Below, I walk through the history of zeroth-order methods, the scaling challenges they hit, how they compare to first-order methods, and why Watanabe’s work on singular learning theory helps explain the weird things we’re seeing when pushing ZO to large models. Spoiler: it all comes down to singularities and the effective dimensionality of the optimization landscape.

1. Zeroth-Order Optimization: A Crash Course

1.1 Classical ZO Methods

The simplest way to estimate gradients without computing them directly is to use finite differences. Given a function \( L(\theta) \), we can approximate its gradient along coordinate \( \theta_i \) as:

\[ \frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta + \mu e_i) - L(\theta - \mu e_i)}{2\mu} \]

where \( e_i \) is a unit vector along the \( i \)-th coordinate. The catch is the cost: a full gradient estimate takes two loss evaluations per coordinate, i.e. \( 2n \) forward passes for \( n \) parameters, which is hopeless for modern network sizes.
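To make this concrete, here is a minimal sketch of the coordinate-wise estimator in NumPy; the function and parameter names are placeholders of my own, not from any particular library.

```python
import numpy as np

def fd_gradient(loss_fn, theta, mu=1e-4):
    """Central finite differences: costs 2 * len(theta) loss evaluations."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = 1.0  # unit vector along coordinate i
        grad[i] = (loss_fn(theta + mu * e) - loss_fn(theta - mu * e)) / (2 * mu)
    return grad

# Sanity check on a quadratic bowl, where the true gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
print(fd_gradient(lambda t: np.sum(t ** 2), theta))  # ~[ 2., -4.,  1.]
```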

Simultaneous Perturbation

To fix this, researchers came up with simultaneous perturbation methods like SPSA (Spall, 1992), which estimate gradients by perturbing all parameters at once using a single random vector:

\[ g_i = \frac{L(\theta + \sigma \Delta) - L(\theta - \sigma \Delta)}{2\sigma \Delta_i} \]

This drops function evaluations to just two per step, independent of dimension. Evolution Strategies (ES) takes this further by averaging over multiple perturbations, effectively performing gradient descent in a smoothed loss landscape.
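As a rough sketch (again in NumPy, with names of my own choosing), the two estimators look like this: the SPSA version uses a single ±1 perturbation vector, and the ES version averages several Gaussian ones.

```python
import numpy as np

def spsa_gradient(loss_fn, theta, sigma=1e-3, rng=None):
    """Spall-style estimate: two loss evaluations, regardless of dimension."""
    if rng is None:
        rng = np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    diff = loss_fn(theta + sigma * delta) - loss_fn(theta - sigma * delta)
    return diff / (2 * sigma) / delta  # g_i = diff / (2 * sigma * delta_i)

def es_gradient(loss_fn, theta, sigma=1e-3, n_samples=16, rng=None):
    """Antithetic ES: averages SPSA-like estimates over Gaussian directions,
    approximating the gradient of a Gaussian-smoothed loss."""
    if rng is None:
        rng = np.random.default_rng()
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        z = rng.standard_normal(theta.shape)
        diff = loss_fn(theta + sigma * z) - loss_fn(theta - sigma * z)
        grad += diff / (2 * sigma) * z
    return grad / n_samples
```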

The bottom line: classical ZO methods work, but they’re noisy and don’t scale well to large models unless you get creative.

1.2 Modern ZO Methods

Recently, researchers have adapted ZO methods for deep learning. Some highlights:

  - MeZO (Malladi et al., 2023): memory-efficient SPSA-style fine-tuning of large language models, perturbing parameters in place with noise regenerated from a stored seed.
  - DeepZero (Chen et al., 2024): ZO training from scratch, with results at ResNet-20 / CIFAR-10 scale.
  - LeZO: perturbs only a subset of layers per step, shrinking the effective search space.

These methods prove that ZO isn't just a toy algorithm. It can scale, but only if you respect the constraints and keep the effective number of perturbed parameters small.
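As an illustration, here is a rough sketch (in NumPy, not the actual MeZO code) of the seeded, in-place perturbation trick that memory-efficient ZO fine-tuners rely on: the random direction z is never stored, only its seed, so peak memory stays near one copy of the parameters. Real implementations do this layer by layer; the flat-vector version below is only for clarity.

```python
import numpy as np

def regen_z(shape, seed):
    # Regenerate the same perturbation from its seed instead of storing it.
    return np.random.default_rng(seed).standard_normal(shape)

def zo_sgd_step(loss_fn, theta, eps=1e-3, lr=1e-4, seed=0):
    theta += eps * regen_z(theta.shape, seed)        # theta -> theta + eps*z
    loss_plus = loss_fn(theta)
    theta -= 2 * eps * regen_z(theta.shape, seed)    # theta -> theta - eps*z
    loss_minus = loss_fn(theta)
    theta += eps * regen_z(theta.shape, seed)        # restore the original theta

    proj_grad = (loss_plus - loss_minus) / (2 * eps)      # scalar directional slope
    theta -= lr * proj_grad * regen_z(theta.shape, seed)  # step along z
    return theta
```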

2. Scaling Challenges: Why ZO Stops Working

Here's where things get weird. You'd expect that as you scale up a model, adding more parameters would give you more directions to escape local minima. More dimensions should mean better optimization, right? Nope. Instead, we see:

  - Early in training, almost every random perturbation reduces the loss.
  - As training progresses, the fraction of loss-reducing directions collapses, and it collapses far faster for ZO than for SGD.
  - Eventually most sampled directions are degenerate: they barely change the output or the loss, and learning stalls even on a single batch.

Turns out, this is all tied to singular learning theory (Watanabe, 2001). Neural networks have singularities—regions in parameter space where many different weight configurations produce the same output. In these regions, the Fisher Information Matrix is degenerate, meaning standard optimization theory breaks down.
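A toy example (my own, not one from Watanabe's paper) makes the degeneracy concrete. Take the smallest possible "deep" network, f(x) = a * b * x: the output depends on the weights only through the product a * b, so rescaling a up and b down is a flat, degenerate direction, and a = b = 0 is a genuine singularity where the gradient vanishes in every direction.

```python
import numpy as np

x = np.linspace(-1, 1, 50)
y = 0.5 * x                          # target: any weights with a * b = 0.5 fit perfectly

def loss(a, b):
    return np.mean((a * b * x - y) ** 2)

a, b, eps = 2.0, 0.25, 1e-3          # a * b = 0.5, so the loss is exactly zero here
flat = np.array([a, -b]) / np.hypot(a, b)    # tangent to the curve a * b = const
steep = np.array([b, a]) / np.hypot(a, b)    # direction that actually changes a * b

print(loss(a, b))                                      # 0.0
print(loss(a + eps * flat[0], b + eps * flat[1]))      # essentially still 0: degenerate direction
print(loss(a + eps * steep[0], b + eps * steep[1]))    # noticeably larger: this direction matters

# At the singular point a = b = 0 the gradient is zero in *every* direction:
# a small perturbation changes a * b only at second order (eps**2).
print(loss(0.0, 0.0), loss(eps, eps))                  # nearly identical values
```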

SGD eventually escapes these plateaus because gradients accumulate in the right directions. But ZO methods waste time perturbing flat directions, meaning they often never escape.

3. Where Do We Go from Here?

ZO methods aren't going to replace SGD, but they offer a promising alternative where backprop is infeasible. The key to making them work at scale is:

  - Shrinking the effective number of parameters perturbed per step (sparse or layer-wise schemes in the spirit of MeZO, LeZO, and DeepZero).
  - Averaging over many perturbations, ES-style, to cut the variance of the gradient estimate.
  - Respecting the low effective dimensionality of the loss landscape, so random directions are more likely to land in the few that still matter.

There's a lot more to explore. But for now, the takeaway is that ZO doesn't scale with model size... YET. We need to keep trying. If we do, we may find the most scalable path to AI. That would let all of us run datacenter-scale intelligence on our laptops, instead of concentrating that power in a handful of hyperscalers.


References

  1. Spall, J. (1992). Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation. IEEE Transactions on Automatic Control.
  2. Malladi et al. (2023). MeZO: Memory-Efficient Zeroth-Order Optimization. arXiv preprint.
  3. Chen et al. (2024). DeepZero: Training Deep Networks Without Gradients. arXiv preprint.
  4. Watanabe, S. (2001). Singular Learning Theory. Advances in Neural Information Processing Systems.
