Adam Optimizer: Adaptive Moment Estimation

optimization · deep-learning · algorithms

The Adam (Adaptive Moment Estimation) optimizer is a first-order stochastic method that combines momentum with per-coordinate adaptive step sizes. Proposed by Kingma and Ba (2014), Adam maintains exponentially weighted moving averages of both gradients and squared gradients. This pairing supplies a direction estimate with inertia and a scale estimate that normalizes by recent gradient magnitudes, yielding stable updates with minimal hyperparameter tuning. Empirically, Adam converges quickly on a wide range of nonconvex deep learning problems, particularly those with sparse or noisy gradients; theoretically, its convergence is more delicate: the original convex analysis was later shown to have gaps (see the discussion of AMSGrad below), and its behavior in nonconvex settings is only partially understood.

The Optimization Problem and Difficulties Adam Targets

We seek to minimize a (possibly nonconvex) objective

$$f(\theta)=\mathbb{E}_{\xi}\left[F(\theta;\xi)\right]$$

over parameters $\theta\in\mathbb{R}^d$ using stochastic gradients

$$g_t=\nabla_\theta F(\theta_{t-1};\xi_t).$$

Fixed-rate SGD applies the same learning rate to all coordinates, which can be too aggressive along steep directions and too timid along flat ones; moreover, curvature anisotropy induces zig-zagging, and plateaus or saddle points with small gradients slow progress. Momentum partially mitigates curvature-induced oscillations by smoothing directions, while RMS-style normalization stabilizes the step size by down-weighting coordinates that recently exhibited large gradient magnitudes. Adam integrates both ideas.
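
As a concrete instance of this setup, the sketch below estimates $g_t$ for a hypothetical least-squares objective $f(\theta)=\mathbb{E}[(x^\top\theta-y)^2]$ from a random minibatch; the data set, batch size, and noise level are illustrative assumptions, not part of the methods discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares problem: f(theta) = E[(x^T theta - y)^2] over a finite data set.
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=1000)

def stochastic_gradient(theta, batch_size=32):
    """Minibatch estimate g_t of the gradient of f at theta."""
    idx = rng.integers(0, X.shape[0], size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size
```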

Algorithmic Definition

Let

$$\alpha>0,\qquad \beta_1,\beta_2\in[0,1),\qquad \varepsilon>0,$$

where $\alpha$ is the base step size, $\beta_1,\beta_2$ control exponential averaging, and $\varepsilon$ ensures numerical stability. For $t\ge 1$,

$$
\begin{aligned}
g_t &= \nabla_\theta F(\theta_{t-1};\xi_t),\\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t,\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t\odot g_t,
\end{aligned}
$$

with $m_0=0$, $v_0=0$, and $\odot$ denoting element-wise multiplication. Because $m_t$ and $v_t$ are initialized at zero, their expectations are biased toward zero in early iterations; bias-corrected estimators are

$$\hat m_t=\frac{m_t}{1-\beta_1^t},\qquad \hat v_t=\frac{v_t}{1-\beta_2^t}.$$

The parameter update is performed element-wise:

$$\theta_t=\theta_{t-1}-\alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon},$$

where $\sqrt{\cdot}$ is applied element-wise. Typical defaults

$$(\alpha,\beta_1,\beta_2,\varepsilon)=(10^{-3},\,0.9,\,0.999,\,10^{-8})$$

work well in practice.
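
A minimal NumPy sketch of one Adam update, transcribed directly from the formulas above; the toy quadratic used in the usage example is an illustrative assumption.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update given the stochastic gradient g at iteration t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g              # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g          # second-moment EMA (element-wise square)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on a deterministic quadratic f(theta) = ||theta||^2 / 2 (gradient = theta).
theta = np.ones(3)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    g = theta                                    # stands in for a stochastic gradient g_t
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)  # approaches the minimizer at the origin
```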

Why Bias Correction Is Necessary

If $m_t$ is an exponential moving average with $m_0=0$, then

$$\mathbb{E}[m_t]=(1-\beta_1)\sum_{k=1}^t \beta_1^{t-k}\,\mathbb{E}[g_k].$$

When the gradient mean is stationary, this simplifies to

$$\mathbb{E}[m_t]=(1-\beta_1^t)\,\mathbb{E}[g],$$

so $m_t$ underestimates the mean by a factor $1-\beta_1^t$. Dividing by $1-\beta_1^t$ yields the unbiased estimator $\hat m_t$. An analogous argument applies to $v_t$. As $t\to\infty$, the factors $1-\beta_i^t$ approach 1, so the correction vanishes asymptotically but matters at initialization, when $t$ is small and $\beta_i$ is close to 1.
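
The effect is easy to check numerically: the sketch below feeds a constant gradient of 1 (an illustrative stand-in for a stationary gradient stream) into the first-moment average and compares the raw and corrected estimates at small $t$.

```python
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    g = 1.0                        # constant "gradient", so the true mean is 1
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))

# t=1: m = 0.10 but m_hat = 1.0
# t=2: m = 0.19 but m_hat = 1.0
# The raw EMA underestimates the mean by exactly the factor (1 - beta1**t).
```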

Interpretation and Effective Step Size

Adam can be viewed as momentum SGD preconditioned by a diagonal matrix formed from recent gradient magnitudes. Writing the update coordinate-wise,

$$\theta_{t,i}=\theta_{t-1,i}-\alpha\,\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}}+\varepsilon}.$$

The quantity $\alpha/(\sqrt{\hat v_{t,i}}+\varepsilon)$ acts as an adaptive per-coordinate step size: large recent gradients inflate $\hat v_{t,i}$ and shrink the step, while small recent gradients deflate $\hat v_{t,i}$ and enlarge the step. The momentum term $\hat m_{t,i}$ provides directionality and accelerates along persistent descent directions, reducing oscillations in ravines.
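
To make the scaling concrete, the following sketch evaluates the effective step $\alpha/(\sqrt{\hat v_{t,i}}+\varepsilon)$ for three coordinates with very different recent gradient magnitudes; the $\hat v$ values are illustrative.

```python
import numpy as np

alpha, eps = 1e-3, 1e-8

# Hypothetical bias-corrected second moments for three coordinates:
v_hat = np.array([1e2, 1.0, 1e-6])   # large, moderate, and tiny recent gradient magnitudes

effective_step = alpha / (np.sqrt(v_hat) + eps)
print(effective_step)  # ~[1e-4, 1e-3, 1.0]: small steps where gradients were large, and vice versa
```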

Relation to Classical Methods

Plain SGD uses neither momentum nor adaptivity. Momentum SGD replaces $g_t$ with an exponentially averaged direction, improving conditioning while keeping a global step size. RMSProp uses the raw second-moment accumulator $v_t$ to normalize $g_t$ without momentum. Adam composes these two mechanisms, which often shortens the transient needed to reach a performant region of the parameter space.
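
The sketch below lines up the four update rules as single NumPy statements to make the composition explicit; the gradient vector and hyperparameters are illustrative, and each rule is applied to the same starting point for one step.

```python
import numpy as np

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
theta = np.ones(3)
g = np.array([0.5, -2.0, 0.01])   # illustrative gradient
m = np.zeros(3); v = np.zeros(3); t = 1

# Plain SGD: global step, no memory.
theta_sgd = theta - alpha * g

# Momentum SGD: smoothed direction, still a global step size.
m = beta1 * m + (1 - beta1) * g
theta_mom = theta - alpha * m

# RMSProp-style step: per-coordinate normalization by recent magnitudes, no momentum.
v = beta2 * v + (1 - beta2) * g * g
theta_rms = theta - alpha * g / (np.sqrt(v) + eps)

# Adam: both mechanisms, plus the bias correction described above.
theta_adam = theta - alpha * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
```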

Variants and Their Theoretical Motivations

AdamW decouples weight decay from the adaptive gradient step. Classical $L_2$ regularization implemented as "weight decay via gradients" interacts with the adaptive preconditioner, effectively rescaling the decay per coordinate and coupling it to gradient history. AdamW instead applies the decay directly to parameters:

$$\theta_t \leftarrow (1-\alpha\lambda)\,\theta_{t-1}-\alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon},$$

with $\lambda\ge 0$ the decay coefficient. This preserves the intended meaning of weight decay as a uniform shrinkage independent of gradient statistics and has been shown to improve generalization in many settings.
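
A minimal sketch of one AdamW step under the update above; it differs from the plain Adam step only in the decoupled shrinkage term (`lam` stands for $\lambda$, and its default value here is illustrative).

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=1e-2):
    """One AdamW update: Adam's adaptive step plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decay is applied directly to the parameters, outside the adaptive preconditioner.
    theta = (1 - alpha * lam) * theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```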

AMSGrad addresses a failure mode of Adam exposed by convex counterexamples, enforcing a non-increasing effective learning rate per coordinate. It maintains

$$\tilde v_t=\max(\tilde v_{t-1},\,v_t)\quad\text{(element-wise)},$$

and updates

$$\theta_t=\theta_{t-1}-\alpha\,\frac{\hat m_t}{\sqrt{\tilde v_t}+\varepsilon}.$$

By never allowing the denominator to decrease, AMSGrad restores convergence guarantees under standard assumptions, whereas vanilla Adam can diverge on constructed convex counterexamples despite bounded gradients.
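
A minimal sketch of one AMSGrad step following the update above; the only change from Adam is the running element-wise maximum `v_max`, which keeps the denominator from ever decreasing.

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update with a monotone second-moment accumulator."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = np.maximum(v_max, v)            # element-wise max: the denominator never shrinks
    m_hat = m / (1 - beta1 ** t)            # bias-corrected momentum, as in the text
    theta = theta - alpha * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```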

Practical Guidance on Hyperparameters and Schedules

The defaults $\beta_1=0.9$ and $\beta_2=0.999$ balance responsiveness and smoothness of the moment estimators. Increasing $\beta_1$ improves inertia but slows reaction to regime shifts, while decreasing $\beta_2$ makes normalization more reactive but noisier. The base step $\alpha$ typically starts near $10^{-3}$ and benefits from warmup followed by cosine or exponential decay in large-scale training. The constant $\varepsilon$ prevents division by zero and subtly affects very small-gradient regimes; values in $[10^{-8},10^{-6}]$ are common, with larger $\varepsilon$ slightly de-emphasizing normalization.
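
One common way to realize the warmup-then-decay schedule mentioned above is sketched below; the warmup length, total step count, and floor value are illustrative choices.

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=100_000, min_lr=1e-5):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# The returned value is used as alpha for the corresponding Adam update.
```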

Known Failure Modes, Generalization, and Remedies

Despite fast optimization, Adam sometimes yields worse test performance than well-tuned momentum SGD, particularly in vision tasks. Two drivers are often implicated: overly aggressive adaptive steps in flat directions, and the coupling of adaptivity with implicit regularization. Remedies include using AdamW for principled decay, decaying α\alpha over training, increasing batch sizes late in training, switching from Adam to momentum SGD for the final epochs (an "optimizer switch"), or using AMSGrad to enforce conservative per-coordinate schedules. Gradient clipping and careful initialization remain beneficial in very noisy or extremely sparse regimes.
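
As one concrete (and deliberately minimal) way to combine several of these remedies, the sketch below pairs PyTorch's `AdamW` with gradient-norm clipping and a late switch to momentum SGD; the model, data, learning rates, and switch point are placeholders.

```python
import torch
from torch import nn

# Placeholder model and synthetic data standing in for a real training setup.
model = nn.Linear(10, 1)
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]
loss_fn = nn.MSELoss()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
switch_epoch = 80  # illustrative point for the "optimizer switch"

for epoch in range(100):
    if epoch == switch_epoch:
        # Hand the final epochs to momentum SGD; this learning rate is a placeholder.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    for x, y in data:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        # Clip the gradient norm to guard against very noisy batches.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```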

Typical Training Dynamics

Early iterations are dominated by bias correction as both moment estimates spin up from zero; updates can be comparatively large while $\hat v_t$ is still small. As training progresses, $\hat v_t$ stabilizes and Adam behaves like momentum SGD with a learned diagonal preconditioner, moving swiftly along well-identified directions and cautiously where curvature is high. Late training enters a fine-tuning phase with smaller, more isotropic steps; learning-rate decay often improves the final plateaus.

Conclusion

Adam is best understood as momentum SGD equipped with an adaptive, diagonal preconditioner derived from recent gradient magnitudes and corrected for initialization bias. This combination delivers rapid, stable progress on noisy, high-dimensional, and sparse problems with little tuning. When theoretical convergence or stronger regularization semantics are required, AMSGrad and AdamW provide principled modifications that preserve Adam’s practical strengths while addressing its limitations.

References

Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015.

Loshchilov, I., & Hutter, F. (2017). Decoupled Weight Decay Regularization. ICLR 2019.

Reddi, S. J., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR 2018.

Kleyton da Costa