Created: July 18, 2023

Tags: Optimization, Gradient Descent, Adam, AdamW

Link: https://arxiv.org/pdf/1711.05101.pdf

Status: Reading

This paper is easy to read, and there are some fun comments on OpenReview.

What?

L2 regularization and weight decay regularization are equivalent for SGD (when the weight decay factor is rescaled by the learning rate), but this is not the case for adaptive gradient algorithms such as Adam.
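A quick sketch of the SGD equivalence (my own derivation, not copied from the paper), with learning rate α and L2 strength λ:

```latex
\begin{aligned}
\theta_{t+1} &= \theta_t - \alpha \nabla\!\left( f(\theta_t) + \tfrac{\lambda}{2}\lVert\theta_t\rVert^2 \right) \\
             &= \theta_t - \alpha \nabla f(\theta_t) - \alpha\lambda\,\theta_t \\
             &= (1 - \alpha\lambda)\,\theta_t - \alpha \nabla f(\theta_t)
\end{aligned}
```

So L2 with strength λ is just weight decay with rate λ' = αλ. For Adam the λθ term is folded into the gradient and divided by the adaptive denominator, so parameters with large historical gradients get decayed less, and the equivalence breaks.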

Why?

Adam leads to worse generalization and stronger overfitting than SGD with momentum on classification tasks, despite its faster convergence.

How?

Look at the colors one at a time! It’s confusing, I know!

[Image: Algorithm 1 from the paper — SGD with momentum vs. SGDW (1_dVPcugWY3wyxmSGYdUyDYg.webp)]

This one is for Adam:

[Image: Algorithm 2 from the paper — Adam vs. AdamW (1_Q9W_oSHOK3c8K-ZliSWYhQ.webp)]
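To make the colored changes concrete, here is a minimal NumPy sketch (my own, not the paper's code; it follows the common PyTorch-style formulation and ignores the schedule multiplier η_t) of Adam with L2 regularization next to AdamW with decoupled weight decay:

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, lam=1e-2):
    """Adam with L2 regularization: the decay term is added to the gradient,
    so it also gets rescaled by the adaptive denominator sqrt(v_hat)."""
    g = grad + lam * theta                      # L2 term folded into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)                  # bias correction, t starts at 1
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """AdamW: weight decay is decoupled from the gradient and applied
    directly to the weights, untouched by the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# one step on a toy 2-parameter problem
theta = np.array([1.0, -2.0]); grad = np.array([0.1, 0.3])
m = np.zeros_like(theta); v = np.zeros_like(theta)
print(adamw_step(theta, grad, m, v, t=1))
```

The only difference is where the decay enters: with L2 it is folded into g and then divided by sqrt(v_hat) + eps; in AdamW it is applied directly to θ, outside the adaptive scaling.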