Created: July 18, 2023
Tags: Optimization, Gradient Descent, Adam, AdamW
Link: https://arxiv.org/pdf/1711.05101.pdf
Status: Reading
This paper is easy to read, and there are some fun comments on OpenReview
L2 regularization and weight decay regularization are equivalent for SGD (when the decay factor is rescaled by the learning rate), but this is not the case for adaptive gradient algorithms such as Adam (see the sketch below).
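A quick sketch of why, in my own notation (loosely following the paper: $\alpha$ learning rate, $\lambda'$ the L2 coefficient, $\lambda$ the decoupled decay factor):

```latex
\begin{align*}
% SGD on the L2-regularized loss f(\theta) + \tfrac{\lambda'}{2}\lVert\theta\rVert^2:
\theta_{t+1} &= \theta_t - \alpha\bigl(\nabla f(\theta_t) + \lambda'\theta_t\bigr)
              = (1 - \alpha\lambda')\,\theta_t - \alpha\nabla f(\theta_t) \\
% SGD with decoupled weight decay \lambda:
\theta_{t+1} &= (1 - \lambda)\,\theta_t - \alpha\nabla f(\theta_t)
\end{align*}
% Identical when \lambda = \alpha\lambda'. For Adam, the L2 term \lambda'\theta_t is folded
% into m_t and divided by \sqrt{\hat v_t} + \epsilon, so weights with a large gradient
% history get decayed less than under decoupled weight decay -- the equivalence breaks.
```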
Adam leads to worse generalization and stronger overfitting than SGD with momentum on classification tasks, despite its faster convergence.
Look at the colors one at a time! It’s confusing, I know!
This one is for Adam:
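For my own reference, a minimal NumPy sketch of what the two variants in Algorithm 2 boil down to (my variable and function names, not the paper's notation; I also drop the schedule multiplier $\eta_t$ the paper applies to both terms). The comments mark the lines that correspond to the highlighted L2 vs. decoupled-decay steps:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=1e-2, decoupled=True):
    """One parameter update: Adam with L2 regularization vs. AdamW (sketch)."""
    if not decoupled:
        # Adam + L2: the decay term is added to the gradient, so below it also
        # gets rescaled by 1 / (sqrt(v_hat) + eps) like everything else.
        grad = grad + wd * theta
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: decay applied directly to the weights, untouched by the
        # adaptive rescaling -- this is the "decoupling".
        theta = theta - wd * theta
    return theta, m, v
```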