By Francois Chaubard
For decades, computer scientists and mathematicians have tried to classify the complexity of algorithms and problems, using big-O notation and complexity classes like P, NP, and PSPACE, under the assumption that the model itself is the core entity we need to analyze. But when it comes to intelligence, focusing on the model in isolation overlooks critical factors like the training process (the solver) and the data distribution. For example, curriculum learning can greatly improve performance over training without a curriculum, and more training steps outperform fewer, all with the same model class. The model alone is not a useful measure: a 2-layer NN can, in principle, fit almost ANY function. So why in practice can't it? Because it's not possible to find the magical theta that gets you there. So the solver and its training dynamics must be taken into consideration for us to have a useful bound on intelligence and to start defining intelligence classes. This post proposes a new perspective: an agent's intelligence depends on the synergy of three components: model (architecture + capacity), training data (or experience, more broadly), and the solver (all the training dynamics: SGD, MeZO, Adam, the learning-rate schedule, weight decay, the number of iterations, etc.).
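To see why the model class alone underdetermines the outcome, here is a toy sketch (illustrative numbers only, nothing here comes from a real benchmark): the same linear model, fit on the same data, reaches very different losses depending solely on the solver budget.

```python
# Toy illustration: same model class, same data, different solver budgets.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=200)

def train(num_steps: int, lr: float = 0.01) -> float:
    """Plain gradient descent on mean-squared error; returns final training MSE."""
    w = np.zeros(5)
    for _ in range(num_steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE loss
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

print(train(num_steps=5))     # under-trained: loss far above the noise floor
print(train(num_steps=2000))  # same model + data, more steps: near the 0.01 noise floor
```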
We might call this integrated measure:
O_Intelligence(model + solver + data)
where each part plays a role analogous to how, in Einstein’s theory of relativity, matter and space-time cannot be treated as independent. In the same way, the solver, data, and model architecture are tightly interwoven and shape each other.
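As a minimal sketch of what such an integrated measure could look like in code (all names here, like `solver` and `tasks`, are hypothetical placeholders, not anything defined in this post):

```python
# Hedged sketch: intelligence as a property of the (model, solver, data)
# triple, not of the model alone.
from typing import Any, Callable, Sequence

def o_intelligence(model: Any,
                   solver: Callable[[Any, Any], Any],  # training dynamics: optimizer, schedule, #iters, ...
                   data: Any,
                   tasks: Sequence[Callable[[Any], float]]) -> float:
    """Train with the given solver on the given data, then average task performance."""
    trained_agent = solver(model, data)  # the solver maps (model, data) -> trained agent
    scores = [task(trained_agent) for task in tasks]
    return sum(scores) / len(scores)
```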
In the machine learning world, we often talk about model complexity—like the number of parameters in a neural network, or the depth/width of layers—as if that alone determines learning capacity. The classic big-O notation helps us discuss computational or memory costs of running or training the model, but it rarely captures the model’s actual ability to generalize.
Key takeaway: Looking at any one of these three (model, data, or solver) without the others is akin to analyzing a planet’s motion in purely Newtonian terms, ignoring the fact that mass and spacetime curvature are intertwined in Einstein’s relativity.
One way to define intelligence is in terms of average performance on a broad set of tasks. This echoes ideas from the Legg-Hutter measure of universal intelligence, which looks at how an agent performs across all computable environments.
If we let 𝕌 represent a set of tasks and Perf(agent, t) represent the agent's performance on a task t ∈ 𝕌, then a rough measure of intelligence might be:

Intelligence(agent) ≈ 𝔼_{t∈𝕌}[Perf(agent, t)]
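Since 𝕌 can't be enumerated, this expectation would in practice be estimated by sampling tasks. A minimal Monte Carlo sketch, where `sample_task` and `perf` are assumed placeholders for a task sampler and a scoring function:

```python
# Monte Carlo estimate of Intelligence(agent) ≈ 𝔼_{t∈𝕌}[Perf(agent, t)].
from typing import Any, Callable

def estimate_intelligence(agent: Any,
                          sample_task: Callable[[], Any],
                          perf: Callable[[Any, Any], float],
                          num_tasks: int = 100) -> float:
    """Average performance over tasks sampled from the task universe 𝕌."""
    return sum(perf(agent, sample_task()) for _ in range(num_tasks)) / num_tasks
```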
In Newtonian mechanics, we often see space and time as fixed backgrounds and treat objects (matter) as separate. Einstein’s theory of general relativity flips that picture: matter-energy warps spacetime, and spacetime curvature affects how matter moves. They are intertwined, not independent.
Classical big-O typically gives you something like:

O(N × model cost) for N training samples.
O(d × model cost) for d input features.
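A back-of-the-envelope sketch of what that accounting captures (the numbers are assumptions, chosen only to show the shape of the calculation):

```python
# Big-O accounting yields a compute budget, not a generalization guarantee:
# two runs with identical cost can generalize very differently depending on
# the solver and the data distribution.
def training_compute(num_samples: int, cost_per_example: float, epochs: int = 1) -> float:
    """O(N × model cost): total cost of passing N samples through the model."""
    return epochs * num_samples * cost_per_example

print(training_compute(num_samples=1_000_000, cost_per_example=1e9))  # e.g., FLOPs
```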
1. Isn't This Just Legg-Hutter or Universal Intelligence?
Similar spirit, different emphasis. Legg-Hutter scores a finished agent as a black box across all computable environments; the measure proposed here decomposes the agent into model, solver, and data, so the training process itself is part of what gets measured.
2. Measuring “All Possible Tasks” is Impossible
Indeed, we can’t measure literally everything.