AdaMax
AdaMax is an adaptive optimization algorithm for stochastic gradient descent that extends Adam by using the infinity norm of past gradients to scale the step size. It was introduced by Kingma and Ba in the same paper that proposed Adam. AdaMax maintains two moving quantities: the first moment m_t, an exponential average of past gradients, and u_t, an exponentially weighted infinity norm that acts as a decayed running maximum of past gradient magnitudes. The update rules are:
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
u_t = max(beta2 * u_{t-1}, |g_t|)
m_hat_t = m_t / (1 - beta1^t)
theta_t = theta_{t-1} - alpha * m_hat_t / (u_t + epsilon)
where alpha is the step size, beta1 and beta2 are the decay rates of the two moving quantities, m_hat_t is the bias-corrected first moment, and epsilon is a small constant added for numerical stability.
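As a rough illustration, the following NumPy sketch applies the update rules above one step at a time. The function name adamax_update and the toy quadratic objective are illustrative choices, not part of any particular library; the default hyperparameters (alpha = 0.002, beta1 = 0.9, beta2 = 0.999) follow the values suggested in the original paper, and epsilon plays the stabilizing role shown in the update rule.

import numpy as np

def adamax_update(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential average of past gradients.
    m = beta1 * m + (1 - beta1) * g
    # Infinity norm: decayed running maximum of past gradient magnitudes.
    u = np.maximum(beta2 * u, np.abs(g))
    # Bias-corrected first moment (t is the 1-based step count).
    m_hat = m / (1 - beta1 ** t)
    # Parameter update scaled by the infinity norm.
    theta = theta - alpha * m_hat / (u + eps)
    return theta, m, u

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
u = np.zeros_like(theta)
for t in range(1, 201):
    g = theta  # gradient of the toy objective
    theta, m, u = adamax_update(theta, g, m, u, t)
print(theta)  # entries have moved toward the minimum at 0

Because u_t appears in the denominator, each parameter's step is scaled by the largest (decayed) gradient magnitude seen so far rather than by a squared-gradient average.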
AdaMax inherits Adam's adaptive learning rate concept but replaces the second moment estimate with the L-infinity norm of past gradients, which gives a simpler bound on the magnitude of each parameter update (at most alpha) and can make the method less sensitive to occasional large gradients.