Master Plan to Solve AI

My Research Plan to Solve AI

Today, we have four fundamental questions left to answer before we reach Artificial Superintelligence (ASI):

1) What’s the right architecture?

LLMs work exceptionally well for a broad range of tasks, but they are not universally optimal. Graph Neural Networks (GNNs) are SOTA for protein folding [Jumper et al., 2020]. Counterfactual Regret Minimization (CFR) is SOTA for poker [Brown & Sandholm, 2017]. Monte Carlo Tree Search (MCTS) and Deep Q-Networks (DQN) dominate in games and proof-based reasoning [Silver et al., 2016]. Ideally, a single model should achieve SOTA across all of these. It needs to pass the squint test!

2) What’s the right solver?

SGD today encourages approximately 90% memorization and only 10% generalization, which is fundamentally flawed. Current models fail to generalize reliably in many scenarios. We need a superior optimization method: perhaps Monte Carlo Tree Search (MCTS), zeroth-order optimization (e.g., MeZO), or a continual backprop approach? This is my current focus, and it is the biggest area the ML community is neglecting. Even the father of backprop doesn't believe in backprop: Hinton doesn't believe the brain does backprop, and has proposed the Forward-Forward algorithm instead. I have strong reason to believe such methods should work even better than first-order methods because they are less sensitive to local rockiness in the loss landscape, and that sensitivity is why RL doesn't work.
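To make the zeroth-order idea concrete, here is a minimal sketch of one update that uses only forward evaluations of the loss. It uses a two-point, SPSA-style estimate, which is the general kind of estimator that MeZO-like methods build on; it is not the MeZO implementation. The names `spsa_step`, `minimize`, and the toy loss `quadratic` are hypothetical and chosen only for illustration.

```python
import random


def spsa_step(loss, params, lr=0.05, eps=1e-3, rng=random):
    """One zeroth-order update from a two-point (SPSA-style) estimate."""
    # Draw one random perturbation direction z ~ N(0, I).
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Two forward evaluations estimate the directional derivative along z.
    g = (loss(plus) - loss(minus)) / (2.0 * eps)
    # Step against the estimated gradient g * z; no backward pass is used.
    return [p - lr * g * zi for p, zi in zip(params, z)]


def quadratic(w):
    """Hypothetical toy loss with its minimum at (3, -2)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2


def minimize(loss, params, steps=2000, seed=0):
    """Repeat the zeroth-order step with a seeded random generator."""
    rng = random.Random(seed)
    for _ in range(steps):
        params = spsa_step(loss, params, rng=rng)
    return params


if __name__ == "__main__":
    print(minimize(quadratic, [0.0, 0.0]))
```

Each step costs two loss evaluations regardless of the number of parameters, at the price of a noisy gradient estimate. Whether that trade-off wins on real models is exactly the open question raised above; this toy example shows only the mechanics.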

3) How do we train without overfitting?

What is the optimal training procedure? Do we need massive datasets, an online RL setup, batch-based training, single-sample updates, denoising techniques, data augmentations, or something like Gradient Agreement Filtering (GAF)? We don't know the best method yet.

4) How do we validate the learned model?

Current benchmarks are unreliable: static tests like GSM8K and MMLU are proving inadequate. Dynamic evaluation setups, such as LMSys, offer better real-world insight. How can we establish a rigorous and reliable validation framework?

A Problem Class to Infer Intelligence

We define a general intelligence test as follows: given any arbitrary black-box, Turing-computable function \( f \) (e.g., factorial, Fibonacci, sorting functions, polynomial mappings), an intelligent system should construct and update a procedure \( p \) that efficiently samples from \( f \) and finds a generalizable neural representation \( f^* \) such that:

\( \operatorname{HammingDistance}\big(f^*(x),\, f(x)\big) = 0 \quad \text{for all } x, \)

ensuring that the learned function perfectly maps the full domain without loss or introduction of information. Humans perform this task naturally—we are pattern-finders. Why can’t LLMs do the same?
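As a rough illustration of the criterion, the sketch below checks zero Hamming distance between a candidate and a black-box function over a finite set of inputs. It assumes non-negative integer outputs compared bit by bit; the names `hamming_distance`, `matches_exactly`, `black_box`, `candidate`, and `memorizer` are hypothetical. A finite sample can only falsify the "for all \( x \)" condition, never prove it.

```python
import math


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def matches_exactly(f_star, f, inputs) -> bool:
    """True if f_star and f agree bit for bit on every given input."""
    return all(hamming_distance(f_star(x), f(x)) == 0 for x in inputs)


def black_box(x: int) -> int:
    """Stand-in for the unknown target f (here: factorial)."""
    return math.factorial(x)


def candidate(x: int) -> int:
    """A learned procedure that generalizes: iterative factorial."""
    result = 1
    for k in range(2, x + 1):
        result *= k
    return result


def memorizer(x: int) -> int:
    """A lookup table that is correct only on the inputs it has seen."""
    return {0: 1, 1: 1, 2: 2, 3: 6, 4: 24}.get(x, 0)


if __name__ == "__main__":
    print(matches_exactly(candidate, black_box, range(20)))
    print(matches_exactly(memorizer, black_box, range(5)))
    print(matches_exactly(memorizer, black_box, range(20)))
```

The memorizer passes on the five inputs it has seen and fails once the test range extends past them, which is the failure mode the criterion is meant to expose.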

Transduction vs. Program Induction

There are two fundamental paradigms in how a model can learn to solve problems:

Transduction: The model itself becomes the mapping function from input to output. This approach, used in many neural network-based models, directly learns the transformation without explicitly constructing an underlying program.
Program Induction: The model learns to generate an explicit program that can produce the correct outputs. This is more analogous to how humans abstract patterns into mathematical formulas or symbolic programs, allowing greater flexibility and generalization.

Modern AI systems tend to operate in a transductive way, memorizing mappings rather than explicitly deriving general rules. To advance AI, we need to shift towards program induction, where the model discovers and expresses reusable, compositional programs.
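A toy contrast between the two paradigms, under strong simplifying assumptions: the transductive model is stood in for by a nearest-neighbour lookup, and program induction by a brute-force search over small quadratic programs. Both `fit_transducer` and `induce_program` are hypothetical names, and neither is meant as a realistic learner.

```python
def fit_transducer(pairs):
    """Transduction stand-in: answer with the output of the nearest seen input."""
    def predict(x):
        _, nearest_y = min(pairs, key=lambda p: abs(p[0] - x))
        return nearest_y
    return predict


def induce_program(pairs, max_coeff=5):
    """Program-induction stand-in: return source text of an explicit program.

    Searches programs of the form a*x*x + b*x + c with small non-negative
    integer coefficients and returns the first one consistent with all pairs.
    """
    for a in range(max_coeff + 1):
        for b in range(max_coeff + 1):
            for c in range(max_coeff + 1):
                if all(a * x * x + b * x + c == y for x, y in pairs):
                    return f"lambda x: {a}*x*x + {b}*x + {c}"
    return None


if __name__ == "__main__":
    pairs = [(0, 1), (1, 3), (2, 9)]    # samples of f(x) = 2*x*x + 1
    transducer = fit_transducer(pairs)
    source = induce_program(pairs)
    program = eval(source)              # safe here: we generated the text
    print(source)
    print(transducer(10), program(10))  # prediction far outside the samples
```

On an input far from the samples, the lookup repeats the nearest memorized output, while the induced program extrapolates because it represents the rule itself.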

Human Problem-Solving Search Procedure

Humans approach such tasks with an iterative process:

Sample a few input-output pairs from \( f \).
Iterate over hypothesis functions and test their fit against known pairs.
Prune hypotheses that fail, refine those that succeed.
If multiple solutions remain, refine sampling strategy to disambiguate.
Converge on the simplest hypothesis with minimum Kolmogorov complexity.
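The steps above can be sketched as a small search loop. This is a minimal illustration, assuming a fixed, hand-written hypothesis list and using description length as a crude proxy for Kolmogorov complexity (which is not computable in general). `HYPOTHESES`, `find_disagreement`, and `infer` are hypothetical names.

```python
import math

# Candidate hypotheses as (description, function) pairs.
HYPOTHESES = [
    ("x", lambda x: x),
    ("x+1", lambda x: x + 1),
    ("2*x", lambda x: 2 * x),
    ("x*x", lambda x: x * x),
    ("x**2", lambda x: x ** 2),
    ("2**x", lambda x: 2 ** x),
    ("x!", lambda x: math.factorial(x)),
]


def find_disagreement(survivors, probe_inputs):
    """Return an input on which surviving hypotheses disagree, else None."""
    for x in probe_inputs:
        if len({fn(x) for _, fn in survivors}) > 1:
            return x
    return None


def infer(f, hypotheses, initial_inputs=(1, 2), probe_inputs=range(3, 20)):
    # Sample a few input-output pairs from f.
    observed = {x: f(x) for x in initial_inputs}
    survivors = list(hypotheses)
    while True:
        # Test each hypothesis against the known pairs; prune failures.
        survivors = [(name, fn) for name, fn in survivors
                     if all(fn(x) == y for x, y in observed.items())]
        # If several remain, query f where they disagree to disambiguate.
        x = find_disagreement(survivors, probe_inputs)
        if x is None:
            break
        observed[x] = f(x)
    if not survivors:
        return None, observed
    # Converge on the simplest survivor (shortest description).
    name, _ = min(survivors, key=lambda h: len(h[0]))
    return name, observed


if __name__ == "__main__":
    print(infer(lambda x: 2 ** x, HYPOTHESES))
```

With the target `2**x`, the two initial samples leave both `2*x` and `2**x` standing, so the loop queries one extra input where they differ and prunes the loser. Every extra query removes at least one survivor, so the loop terminates.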

However, today's LLMs + SGD do not operate this way. Instead, they memorize large datasets in a batch-learning process, which hinders generalization. Even when they do generalize, it takes heavy manual intervention such as Chain-of-Thought (CoT) prompting, external tool use (e.g., SymPy, CVXPY), or huge hyperparameter sweeps.

Conclusion

Current AI models fail to generalize because of architectural, optimization, and training limitations. If we want true AGI, step 1 is better solvers, and that needs to be my focus right now. This research agenda is our path forward.