Created: July 18, 2023

Tags: Optimization, Gradient Descent, Adam, AdamW

Link: https://arxiv.org/pdf/1711.05101.pdf

Status: Reading

This paper is easy to read, and there are some fun comments on OpenReview.

What?

L2 regularization and weight decay regularization are equivalent for SGD (when the weight decay factor is rescaled by the learning rate), but this is not the case for adaptive gradient algorithms such as Adam.
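A quick sketch of the SGD equivalence (my own derivation, not copied from the paper), with learning rate α and L2 strength λ:

```latex
\begin{aligned}
\theta_{t+1} &= \theta_t - \alpha \nabla\!\left( f(\theta_t) + \tfrac{\lambda}{2}\lVert\theta_t\rVert^2 \right) \\
             &= \theta_t - \alpha \nabla f(\theta_t) - \alpha\lambda\,\theta_t \\
             &= (1 - \alpha\lambda)\,\theta_t - \alpha \nabla f(\theta_t)
\end{aligned}
```

So L2 with strength λ is just weight decay with rate λ' = αλ. For Adam the λθ term is folded into the gradient and divided by the adaptive denominator, so parameters with large historical gradients get decayed less, and the equivalence breaks.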

Why?

Adam leads to worse generalization and stronger overfitting than SGD with momentum on classification tasks, despite its faster convergence.

How?

Look at the colors one at a time! It’s confusing, I know!

[Image: Algorithm 1 from the paper — SGD with momentum vs. SGDW (1_dVPcugWY3wyxmSGYdUyDYg.webp)]

This one is for Adam:

[Image: Algorithm 2 from the paper — Adam vs. AdamW (1_Q9W_oSHOK3c8K-ZliSWYhQ.webp)]
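To make the colored changes concrete, here is a minimal NumPy sketch (my own, not the paper's code; it follows the common PyTorch-style formulation and ignores the schedule multiplier η_t) of Adam with L2 regularization next to AdamW with decoupled weight decay:

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, lam=1e-2):
    """Adam with L2 regularization: the decay term is added to the gradient,
    so it also gets rescaled by the adaptive denominator sqrt(v_hat)."""
    g = grad + lam * theta                      # L2 term folded into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)                  # bias correction, t starts at 1
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """AdamW: weight decay is decoupled from the gradient and applied
    directly to the weights, untouched by the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# one step on a toy 2-parameter problem
theta = np.array([1.0, -2.0]); grad = np.array([0.1, 0.3])
m = np.zeros_like(theta); v = np.zeros_like(theta)
print(adamw_step(theta, grad, m, v, t=1))
```

The only difference is where the decay enters: with L2 it is folded into g and then divided by sqrt(v_hat) + eps; in AdamW it is applied directly to θ, outside the adaptive scaling.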