AdaMax
AdaMax is an adaptive optimization algorithm for stochastic gradient descent that extends Adam by using the infinity norm of past gradients to scale the step size. It was introduced by Kingma and Ba in the same paper that proposed Adam. AdaMax maintains two moving quantities: the first moment m_t, an exponential average of past gradients, and u_t, an exponentially weighted infinity norm that acts as a decayed running maximum of past gradient magnitudes. The update rules are:
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
u_t = max(beta2 * u_{t-1}, |g_t|)
m_hat_t = m_t / (1 - beta1^t)
theta_t = theta_{t-1} - alpha * m_hat_t / (u_t + epsilon)
where alpha is the step size, beta1 and beta2 are the decay rates of the two moving quantities, m_hat_t is the bias-corrected first moment, and epsilon is a small constant added for numerical stability.
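As a rough illustration, the following NumPy sketch applies the update rules above one step at a time. The function name adamax_update and the toy quadratic objective are illustrative choices, not part of any particular library; the default hyperparameters (alpha = 0.002, beta1 = 0.9, beta2 = 0.999) follow the values suggested in the original paper, and epsilon plays the stabilizing role shown in the update rule.

import numpy as np

def adamax_update(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential average of past gradients.
    m = beta1 * m + (1 - beta1) * g
    # Infinity norm: decayed running maximum of past gradient magnitudes.
    u = np.maximum(beta2 * u, np.abs(g))
    # Bias-corrected first moment (t is the 1-based step count).
    m_hat = m / (1 - beta1 ** t)
    # Parameter update scaled by the infinity norm.
    theta = theta - alpha * m_hat / (u + eps)
    return theta, m, u

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
u = np.zeros_like(theta)
for t in range(1, 201):
    g = theta  # gradient of the toy objective
    theta, m, u = adamax_update(theta, g, m, u, t)
print(theta)  # entries have moved toward the minimum at 0

Because u_t appears in the denominator, each parameter's step is scaled by the largest (decayed) gradient magnitude seen so far rather than by a squared-gradient average.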
AdaMax inherits Adam's adaptive learning rate concept but replaces the second moment estimate with the L-infinity norm of past gradients, which gives a simpler bound on the magnitude of each parameter update (at most alpha) and can make the method less sensitive to occasional large gradients.