SGD+Backprop is WRONG! There, I said it.
Overfitting is the core issue when training deep learning models with far more parameters than training samples. Standard optimizers (SGD, SGD+Momentum, Adam, etc.) routinely drive training accuracy to 100% while test accuracy plateaus and then declines, forcing us to rely on early stopping. That is memorization, not generalization (Arpit et al., 2017) [1].
Deep networks have an enormous capacity to memorize, and this has been validated empirically: Arpit et al. (2017) showed that neural networks can fit randomly labeled training data, which means standard optimization does not inherently favor generalizable patterns over rote memorization [1].
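To see what this looks like in practice, here is a minimal sketch of the random-label experiment in PyTorch. The tiny MLP, synthetic data, and hyperparameters are my own illustrative choices, not the setup from [1]:

```python
import torch
import torch.nn as nn

# Fit a small MLP to inputs whose labels are assigned at random.
# There is no pattern to learn, so any drop in training loss is memorization.
torch.manual_seed(0)
X = torch.randn(512, 32)            # synthetic inputs
y = torch.randint(0, 10, (512,))    # random labels: nothing generalizable here

model = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# Training loss heads toward zero even though the labels carry no signal.
print(f"final training loss on random labels: {loss.item():.4f}")
```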
SGD is a powerful optimization technique, but it has a fundamental flaw: it settles for whatever solution minimizes training loss, which often means memorization rather than generalizable patterns. This is particularly evident in problems like factorial prediction, where the correct general solution lies in a steep, narrow region of the loss landscape, making it effectively inaccessible to SGD [2].
Regularizers such as L1/L2 weight decay, dropout, and batch normalization can discourage models from fully memorizing the training data [5].
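For concreteness, here is a sketch of how those three regularizers are typically wired up in PyTorch; the architecture and hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

# Dropout and batch normalization live inside the model...
model = nn.Sequential(
    nn.Linear(32, 256),
    nn.BatchNorm1d(256),   # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout
    nn.Linear(256, 10),
)

# ...while L2 weight decay is built into the optimizer.
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# L1 has no optimizer flag, so it is added to the loss by hand.
def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters())
```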
Early stopping, i.e., halting training once validation performance deteriorates, can limit overfitting [6].
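A minimal early-stopping loop might look like the following; train_epoch and evaluate are hypothetical placeholders for your own training and validation code:

```python
import torch

def fit_with_early_stopping(model, train_epoch, evaluate,
                            max_epochs=100, patience=5):
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch(model)                             # one pass over the training set
        val_loss = evaluate(model)                     # loss on a held-out set
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best.pt")  # checkpoint the best model so far
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                 # validation stopped improving
                break                                  # halt before memorization sets in
    return best_val
```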
One promising approach I’ve been working on is Gradient Agreement Filtering (GAF), which filters gradient updates based on their agreement across different batches (code here). This helps ensure that gradient steps move toward a generalizable solution instead of overfitting [7]; a minimal sketch follows below.
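To make the idea concrete, here is a sketch of GAF as described above, not the reference implementation: compute gradients on two independent batches, and only take a step when they agree, i.e., when their cosine distance is below a threshold. The helper names and the threshold value are illustrative:

```python
import torch
import torch.nn.functional as F

# Gradient of the loss w.r.t. all parameters, flattened into one vector.
def flat_grad(model, loss):
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def gaf_step(model, opt, loss_fn, batch_a, batch_b, max_cos_distance=0.97):
    xa, ya = batch_a
    xb, yb = batch_b
    ga = flat_grad(model, loss_fn(model(xa), ya))
    gb = flat_grad(model, loss_fn(model(xb), yb))
    # Cosine distance ranges over [0, 2]; 1 means the gradients are orthogonal.
    cos_distance = 1.0 - F.cosine_similarity(ga, gb, dim=0)
    if cos_distance > max_cos_distance:
        return False                   # batches disagree: skip this update entirely
    avg = 0.5 * (ga + gb)              # batches agree: apply the averaged gradient
    offset = 0
    for p in model.parameters():
        n = p.numel()
        p.grad = avg[offset:offset + n].view_as(p).clone()
        offset += n
    opt.step()
    opt.zero_grad()
    return True
```

The tradeoff is explicit: disagreeing batches produce no update at all, so GAF gives up some wall-clock progress in exchange for only taking steps that multiple batches endorse.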
SGD is fundamentally flawed for generalization. It is a memorization optimizer by default, and unless we change the optimization paradigm, models will continue to prefer memorization over learning true patterns.
The future of deep learning needs optimizers that actively discourage memorization and prioritize generalization. GAF is one step in that direction.