SGD+Backprop is WRONG! There, I said it.
Overfitting is the core issue when training deep learning models with far more parameters than training samples. Standard optimizers (SGD, SGD+Momentum, Adam, etc.) routinely drive training accuracy to 100% while test accuracy plateaus and then declines, forcing us to rely on early stopping. That is memorization, not generalization (Arpit et al., 2017) [1].
Deep networks have an enormous capacity to memorize, and this has been validated empirically: Arpit et al. (2017) showed that neural networks can fit randomly labeled training data, which means standard optimization does not inherently favor generalizable patterns over rote memorization [1].
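To see what this looks like in practice, here is a minimal sketch of the random-label experiment in PyTorch. The tiny MLP, synthetic data, and hyperparameters are my own illustrative choices, not the setup from [1]:

```python
import torch
import torch.nn as nn

# Fit a small MLP to inputs whose labels are assigned at random.
# There is no pattern to learn, so any drop in training loss is memorization.
torch.manual_seed(0)
X = torch.randn(512, 32)            # synthetic inputs
y = torch.randint(0, 10, (512,))    # random labels: nothing generalizable here

model = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# Training loss heads toward zero even though the labels carry no signal.
print(f"final training loss on random labels: {loss.item():.4f}")
```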
SGD is a powerful optimization technique, but it has a fundamental flaw: it settles for whatever solution minimizes training loss, which often means memorization rather than generalizable patterns. This is particularly evident in problems like factorial prediction, where the correct general solution lies in a steep, narrow region of the loss landscape, making it effectively inaccessible to SGD [2].
Regularizers such as L1/L2 weight decay, dropout, and batch normalization can discourage models from fully memorizing the training data [5].
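For concreteness, here is a sketch of how those three regularizers are typically wired up in PyTorch; the architecture and hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

# Dropout and batch normalization live inside the model...
model = nn.Sequential(
    nn.Linear(32, 256),
    nn.BatchNorm1d(256),   # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout
    nn.Linear(256, 10),
)

# ...while L2 weight decay is built into the optimizer.
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# L1 has no optimizer flag, so it is added to the loss by hand.
def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters())
```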
Early stopping, i.e., halting training once validation performance deteriorates, can limit overfitting [6].
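A minimal early-stopping loop might look like the following; train_epoch and evaluate are hypothetical placeholders for your own training and validation code:

```python
import torch

def fit_with_early_stopping(model, train_epoch, evaluate,
                            max_epochs=100, patience=5):
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch(model)                             # one pass over the training set
        val_loss = evaluate(model)                     # loss on a held-out set
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best.pt")  # checkpoint the best model so far
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                 # validation stopped improving
                break                                  # halt before memorization sets in
    return best_val
```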
One promising approach I’ve been working on is Gradient Agreement Filtering (GAF), which filters gradient updates based on their agreement across different batches (code here). This helps ensure that gradient steps move toward a generalizable solution instead of overfitting [7]; a minimal sketch follows below.
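To make the idea concrete, here is a sketch of GAF as described above, not the reference implementation: compute gradients on two independent batches, and only take a step when they agree, i.e., when their cosine distance is below a threshold. The helper names and the threshold value are illustrative:

```python
import torch
import torch.nn.functional as F

# Gradient of the loss w.r.t. all parameters, flattened into one vector.
def flat_grad(model, loss):
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def gaf_step(model, opt, loss_fn, batch_a, batch_b, max_cos_distance=0.97):
    xa, ya = batch_a
    xb, yb = batch_b
    ga = flat_grad(model, loss_fn(model(xa), ya))
    gb = flat_grad(model, loss_fn(model(xb), yb))
    # Cosine distance ranges over [0, 2]; 1 means the gradients are orthogonal.
    cos_distance = 1.0 - F.cosine_similarity(ga, gb, dim=0)
    if cos_distance > max_cos_distance:
        return False                   # batches disagree: skip this update entirely
    avg = 0.5 * (ga + gb)              # batches agree: apply the averaged gradient
    offset = 0
    for p in model.parameters():
        n = p.numel()
        p.grad = avg[offset:offset + n].view_as(p).clone()
        offset += n
    opt.step()
    opt.zero_grad()
    return True
```

The tradeoff is explicit: disagreeing batches produce no update at all, so GAF gives up some wall-clock progress in exchange for only taking steps that multiple batches endorse.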
SGD is fundamentally flawed for generalization. It is a memorization optimizer by default, and unless we change the optimization paradigm, models will continue to prefer memorization over learning true patterns.
The future of deep learning needs optimizers that actively discourage memorization and prioritize generalization. GAF is one step in that direction.