Adam Optimizer: Adaptive Moment Estimation

optimization · deep-learning · algorithms

The Adam (Adaptive Moment Estimation) optimizer is a first-order stochastic method that combines momentum with per-coordinate adaptive step sizes. Proposed by Kingma and Ba (2014), Adam maintains exponentially weighted moving averages of both gradients and squared gradients. This pairing supplies a direction estimate with inertia and a scale estimate that normalizes by recent gradient magnitudes, yielding stable updates with minimal hyperparameter tuning. Empirically, Adam converges quickly on a wide range of nonconvex deep learning problems, particularly those with sparse or noisy gradients; theoretically, its convergence is more delicate: the original convex analysis was later shown to have gaps (see the discussion of AMSGrad below), and its behavior in nonconvex settings is only partially understood.

The Optimization Problem and Difficulties Adam Targets

We seek to minimize a (possibly nonconvex) objective

$$f(\theta)=\mathbb{E}_{\xi}\left[F(\theta;\xi)\right]$$

over parameters $\theta\in\mathbb{R}^d$ using stochastic gradients

$$g_t=\nabla_\theta F(\theta_{t-1};\xi_t).$$

Fixed-rate SGD applies the same learning rate to all coordinates, which can be too aggressive along steep directions and too timid along flat ones; moreover, curvature anisotropy induces zig-zagging, and plateaus or saddle points with small gradients slow progress. Momentum partially mitigates curvature-induced oscillations by smoothing directions, while RMS-style normalization stabilizes the step size by down-weighting coordinates that recently exhibited large gradient magnitudes. Adam integrates both ideas.
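
As a concrete instance of this setup, the sketch below estimates $g_t$ for a hypothetical least-squares objective $f(\theta)=\mathbb{E}[(x^\top\theta-y)^2]$ from a random minibatch; the data set, batch size, and noise level are illustrative assumptions, not part of the methods discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares problem: f(theta) = E[(x^T theta - y)^2] over a finite data set.
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=1000)

def stochastic_gradient(theta, batch_size=32):
    """Minibatch estimate g_t of the gradient of f at theta."""
    idx = rng.integers(0, X.shape[0], size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size
```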

Algorithmic Definition

Let

$$\alpha>0,\qquad \beta_1,\beta_2\in[0,1),\qquad \varepsilon>0,$$

where $\alpha$ is the base step size, $\beta_1,\beta_2$ control exponential averaging, and $\varepsilon$ ensures numerical stability. For $t\ge 1$,

$$
\begin{aligned}
g_t &= \nabla_\theta F(\theta_{t-1};\xi_t),\\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t,\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t\odot g_t,
\end{aligned}
$$

with $m_0=0$, $v_0=0$, and $\odot$ denoting element-wise multiplication. Because $m_t$ and $v_t$ are initialized at zero, their expectations are biased toward zero in early iterations; bias-corrected estimators are

$$\hat m_t=\frac{m_t}{1-\beta_1^t},\qquad \hat v_t=\frac{v_t}{1-\beta_2^t}.$$

The parameter update is performed element-wise:

$$\theta_t=\theta_{t-1}-\alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon},$$

where $\sqrt{\cdot}$ is applied element-wise. Typical defaults

$$(\alpha,\beta_1,\beta_2,\varepsilon)=(10^{-3},\,0.9,\,0.999,\,10^{-8})$$

work well in practice.
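
A minimal NumPy sketch of one Adam update, transcribed directly from the formulas above; the toy quadratic used in the usage example is an illustrative assumption.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update given the stochastic gradient g at iteration t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g              # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g          # second-moment EMA (element-wise square)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on a deterministic quadratic f(theta) = ||theta||^2 / 2 (gradient = theta).
theta = np.ones(3)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    g = theta                                    # stands in for a stochastic gradient g_t
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)  # approaches the minimizer at the origin
```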

Why Bias Correction Is Necessary

If $m_t$ is an exponential moving average with $m_0=0$, then

$$\mathbb{E}[m_t]=(1-\beta_1)\sum_{k=1}^t \beta_1^{t-k}\,\mathbb{E}[g_k].$$

When the gradient mean is stationary, this simplifies to

$$\mathbb{E}[m_t]=(1-\beta_1^t)\,\mathbb{E}[g],$$

so $m_t$ underestimates the mean by a factor $1-\beta_1^t$. Dividing by $1-\beta_1^t$ yields the unbiased estimator $\hat m_t$. An analogous argument applies to $v_t$. As $t\to\infty$, the factors $1-\beta_i^t$ approach 1, so the correction vanishes asymptotically but matters at initialization, when $t$ is small and $\beta_i$ is close to 1.
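
The effect is easy to check numerically: the sketch below feeds a constant gradient of 1 (an illustrative stand-in for a stationary gradient stream) into the first-moment average and compares the raw and corrected estimates at small $t$.

```python
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    g = 1.0                        # constant "gradient", so the true mean is 1
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))

# t=1: m = 0.10 but m_hat = 1.0
# t=2: m = 0.19 but m_hat = 1.0
# The raw EMA underestimates the mean by exactly the factor (1 - beta1**t).
```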

Interpretation and Effective Step Size

Adam can be viewed as momentum SGD preconditioned by a diagonal matrix formed from recent gradient magnitudes. Writing the update coordinate-wise,

$$\theta_{t,i}=\theta_{t-1,i}-\alpha\,\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}}+\varepsilon}.$$

The quantity $\alpha/(\sqrt{\hat v_{t,i}}+\varepsilon)$ acts as an adaptive per-coordinate step size: large recent gradients inflate $\hat v_{t,i}$ and shrink the step, while small recent gradients deflate $\hat v_{t,i}$ and enlarge the step. The momentum term $\hat m_{t,i}$ provides directionality and accelerates along persistent descent directions, reducing oscillations in ravines.
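
To make the scaling concrete, the following sketch evaluates the effective step $\alpha/(\sqrt{\hat v_{t,i}}+\varepsilon)$ for three coordinates with very different recent gradient magnitudes; the $\hat v$ values are illustrative.

```python
import numpy as np

alpha, eps = 1e-3, 1e-8

# Hypothetical bias-corrected second moments for three coordinates:
v_hat = np.array([1e2, 1.0, 1e-6])   # large, moderate, and tiny recent gradient magnitudes

effective_step = alpha / (np.sqrt(v_hat) + eps)
print(effective_step)  # ~[1e-4, 1e-3, 1.0]: small steps where gradients were large, and vice versa
```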

Relation to Classical Methods

Plain SGD uses neither momentum nor adaptivity. Momentum SGD replaces $g_t$ with an exponentially averaged direction, improving conditioning while keeping a global step size. RMSProp uses the raw second-moment accumulator $v_t$ to normalize $g_t$ without momentum. Adam composes these two mechanisms, which often shortens the transient needed to reach a performant region of the parameter space.
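
The sketch below lines up the four update rules as single NumPy statements to make the composition explicit; the gradient vector and hyperparameters are illustrative, and each rule is applied to the same starting point for one step.

```python
import numpy as np

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
theta = np.ones(3)
g = np.array([0.5, -2.0, 0.01])   # illustrative gradient
m = np.zeros(3); v = np.zeros(3); t = 1

# Plain SGD: global step, no memory.
theta_sgd = theta - alpha * g

# Momentum SGD: smoothed direction, still a global step size.
m = beta1 * m + (1 - beta1) * g
theta_mom = theta - alpha * m

# RMSProp-style step: per-coordinate normalization by recent magnitudes, no momentum.
v = beta2 * v + (1 - beta2) * g * g
theta_rms = theta - alpha * g / (np.sqrt(v) + eps)

# Adam: both mechanisms, plus the bias correction described above.
theta_adam = theta - alpha * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
```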

Variants and Their Theoretical Motivations

AdamW decouples weight decay from the adaptive gradient step. Classical $L_2$ regularization implemented as "weight decay via gradients" interacts with the adaptive preconditioner, effectively rescaling the decay per coordinate and coupling it to gradient history. AdamW instead applies the decay directly to parameters:

$$\theta_t \leftarrow (1-\alpha\lambda)\,\theta_{t-1}-\alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon},$$

with $\lambda\ge 0$ the decay coefficient. This preserves the intended meaning of weight decay as a uniform shrinkage independent of gradient statistics and has been shown to improve generalization in many settings.
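
A minimal sketch of one AdamW step under the update above; it differs from the plain Adam step only in the decoupled shrinkage term (`lam` stands for $\lambda$, and its default value here is illustrative).

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=1e-2):
    """One AdamW update: Adam's adaptive step plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decay is applied directly to the parameters, outside the adaptive preconditioner.
    theta = (1 - alpha * lam) * theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```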

AMSGrad addresses a failure mode of Adam exposed by convex counterexamples, enforcing a non-increasing effective learning rate per coordinate. It maintains

$$\tilde v_t=\max(\tilde v_{t-1},\,v_t)\quad\text{(element-wise)},$$

and updates

$$\theta_t=\theta_{t-1}-\alpha\,\frac{\hat m_t}{\sqrt{\tilde v_t}+\varepsilon}.$$

By never allowing the denominator to decrease, AMSGrad restores convergence guarantees under standard assumptions, whereas vanilla Adam can diverge on constructed convex counterexamples despite bounded gradients.
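
A minimal sketch of one AMSGrad step following the update above; the only change from Adam is the running element-wise maximum `v_max`, which keeps the denominator from ever decreasing.

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update with a monotone second-moment accumulator."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = np.maximum(v_max, v)            # element-wise max: the denominator never shrinks
    m_hat = m / (1 - beta1 ** t)            # bias-corrected momentum, as in the text
    theta = theta - alpha * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```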

Practical Guidance on Hyperparameters and Schedules

The defaults $\beta_1=0.9$ and $\beta_2=0.999$ balance responsiveness and smoothness of the moment estimators. Increasing $\beta_1$ improves inertia but slows reaction to regime shifts, while decreasing $\beta_2$ makes normalization more reactive but noisier. The base step $\alpha$ typically starts near $10^{-3}$ and benefits from warmup followed by cosine or exponential decay in large-scale training. The constant $\varepsilon$ prevents division by zero and subtly affects very small-gradient regimes; values in $[10^{-8},10^{-6}]$ are common, with larger $\varepsilon$ slightly de-emphasizing normalization.
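
One common way to realize the warmup-then-decay schedule mentioned above is sketched below; the warmup length, total step count, and floor value are illustrative choices.

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=100_000, min_lr=1e-5):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# The returned value is used as alpha for the corresponding Adam update.
```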

Known Failure Modes, Generalization, and Remedies

Despite fast optimization, Adam sometimes yields worse test performance than well-tuned momentum SGD, particularly in vision tasks. Two drivers are often implicated: overly aggressive adaptive steps in flat directions, and the coupling of adaptivity with implicit regularization. Remedies include using AdamW for principled decay, decaying α\alpha over training, increasing batch sizes late in training, switching from Adam to momentum SGD for the final epochs (an "optimizer switch"), or using AMSGrad to enforce conservative per-coordinate schedules. Gradient clipping and careful initialization remain beneficial in very noisy or extremely sparse regimes.
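
As one concrete (and deliberately minimal) way to combine several of these remedies, the sketch below pairs PyTorch's `AdamW` with gradient-norm clipping and a late switch to momentum SGD; the model, data, learning rates, and switch point are placeholders.

```python
import torch
from torch import nn

# Placeholder model and synthetic data standing in for a real training setup.
model = nn.Linear(10, 1)
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]
loss_fn = nn.MSELoss()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
switch_epoch = 80  # illustrative point for the "optimizer switch"

for epoch in range(100):
    if epoch == switch_epoch:
        # Hand the final epochs to momentum SGD; this learning rate is a placeholder.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    for x, y in data:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        # Clip the gradient norm to guard against very noisy batches.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```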

Typical Training Dynamics

Early iterations are dominated by bias correction as both moment estimates spin up from zero; updates can be comparatively large while $\hat v_t$ is still small. As training progresses, $\hat v_t$ stabilizes and Adam behaves like momentum SGD with a learned diagonal preconditioner, moving swiftly along well-identified directions and cautiously where curvature is high. Late training enters a fine-tuning phase with smaller, more isotropic steps; learning-rate decay often improves the final plateaus.

Conclusion

Adam is best understood as momentum SGD equipped with an adaptive, diagonal preconditioner derived from recent gradient magnitudes and corrected for initialization bias. This combination delivers rapid, stable progress on noisy, high-dimensional, and sparse problems with little tuning. When theoretical convergence or stronger regularization semantics are required, AMSGrad and AdamW provide principled modifications that preserve Adam’s practical strengths while addressing its limitations.

References

Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015.

Loshchilov, I., & Hutter, F. (2017). Decoupled Weight Decay Regularization. ICLR 2019.

Reddi, S. J., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR 2018.

Kleyton da Costa