The Adam (Adaptive Moment Estimation) optimizer combines the advantages of Momentum and RMSprop to adjust learning rates during training. It works well with large datasets and complex models because it is memory-efficient and adapts the learning rate of each parameter automatically.
How Does Adam Work?
Adam builds upon two key concepts in optimization:
1. Momentum
Momentum is used to accelerate the gradient descent process by incorporating an exponentially weighted moving average of past gradients. This helps smooth out the optimization trajectory, allowing the algorithm to converge faster by reducing oscillations.
The update rule with momentum is:
w_{t+1} = w_{t} - \alpha m_{t}
where:
- m_t is the moving average of the gradients at time t
- α is the learning rate
- w_t and w_{t+1} are the weights at time t and t+1, respectively
The momentum term m_t is updated recursively as:
m_{t} = \beta_1 m_{t-1} + (1 - \beta_1) \frac{\partial L}{\partial w_t}
where:
- \beta_1 is the momentum parameter (typically set to 0.9)
- \frac{\partial L}{\partial w_t} is the gradient of the loss function with respect to the weights at time t
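To make the update rule concrete, here is a minimal Python sketch of a single momentum step. The function name momentum_update and the default values are illustrative choices, not part of any particular library.

```python
def momentum_update(w, m, grad, lr=0.01, beta1=0.9):
    """One momentum step: m_t = beta1*m_{t-1} + (1-beta1)*grad; w_{t+1} = w_t - lr*m_t."""
    m = beta1 * m + (1 - beta1) * grad   # exponentially weighted average of gradients
    w = w - lr * m                       # step along the smoothed direction
    return w, m

# Example: one step on f(w) = w^2 (gradient = 2w), starting from w = 3 with zero momentum
w, m = momentum_update(w=3.0, m=0.0, grad=2 * 3.0)
print(w, m)   # m = 0.6, w = 3.0 - 0.01 * 0.6 = 2.994
```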
2. RMSprop (Root Mean Square Propagation)
RMSprop is an adaptive learning rate method that improves upon AdaGrad. While AdaGrad accumulates all past squared gradients, RMSprop uses an exponentially weighted moving average of squared gradients, which helps overcome AdaGrad's problem of a continually diminishing learning rate.
The update rule for RMSprop is:
w_{t+1} = w_{t} - \frac{\alpha}{\sqrt{v_t + \epsilon}} \frac{\partial L}{\partial w_t}
where:
- v_t is the exponentially weighted average of squared gradients:
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \frac{\partial L}{\partial w_t} \right)^2
- ϵ is a small constant (e.g., 10^{-8}) added to prevent division by zero
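A matching sketch of one RMSprop step is shown below. The helper name rmsprop_update is illustrative, and the decay value of 0.9 is a common choice for standalone RMSprop rather than a fixed standard.

```python
import math

def rmsprop_update(w, v, grad, lr=0.01, beta2=0.9, eps=1e-8):
    """One RMSprop step: v_t = beta2*v_{t-1} + (1-beta2)*grad^2; w_{t+1} = w_t - lr*grad/sqrt(v_t + eps)."""
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients
    w = w - lr * grad / math.sqrt(v + eps)    # per-parameter scaled step
    return w, v

# Example: one step on f(w) = w^2 (gradient = 2w), starting from w = 3 with v = 0
w, v = rmsprop_update(w=3.0, v=0.0, grad=2 * 3.0)
print(w, v)   # v = 3.6, step = 0.01 * 6 / sqrt(3.6) ≈ 0.032
```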
Combining Momentum and RMSprop to form Adam Optimizer
Adam optimizer combines the momentum and RMSprop techniques to provide a more balanced and efficient optimization process. The key equations governing Adam are as follows:
- First moment (mean) estimate:
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \frac{\partial L}{\partial w_t}
- Second moment (variance) estimate:
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \frac{\partial L}{\partial w_t} \right)^2
- Bias correction: Since both m_t and v_t are initialized at zero, they tend to be biased toward zero, especially during the initial steps. To correct this bias, Adam computes the bias-corrected estimates:
\hat{m_t} = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v_t} = \frac{v_t}{1 - \beta_2^t}
- Final weight update: The weights are then updated as:
w_{t+1} = w_t - \alpha \frac{\hat{m_t}}{\sqrt{\hat{v_t}} + \epsilon}
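Putting the four equations together, the following is a minimal sketch of the full Adam update, followed by a toy run on f(w) = w². It is an illustration of the math above under simple assumptions, not a production implementation.

```python
import math

def adam_update(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t starts at 1), following the equations above."""
    m = beta1 * m + (1 - beta1) * grad             # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment estimate
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                   # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # final weight update
    return w, m, v

# Toy run: minimize f(w) = w^2 (gradient = 2w), starting from w = 5
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_update(w, m, v, grad=2 * w, t=t, lr=0.01)
print(w)   # ends up close to the minimum at w = 0
```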
Key Parameters
- α: The learning rate or step size (default is 0.001)
- \beta_1 and \beta_2: Decay rates for the moving averages of the gradient and squared gradient, typically set to \beta_1 = 0.9 and \beta_2 = 0.999
- ϵ: A small positive constant (e.g., 10^{-8}) used to avoid division by zero when computing the final update
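In practice these updates are rarely written by hand; deep learning frameworks expose Adam with exactly these parameters. Below is a minimal PyTorch sketch (assuming PyTorch is available; the linear model is only a placeholder), with the defaults matching the values listed above.

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,            # alpha: the learning rate / step size
    betas=(0.9, 0.999),  # beta_1 and beta_2: decay rates of the moment estimates
    eps=1e-8,            # epsilon: small constant for numerical stability
)

# Typical training step (loss computation omitted for brevity):
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```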
Why Does Adam Work So Well?
Adam addresses several challenges of gradient descent optimization:
- Dynamic learning rates: Each parameter has its own adaptive learning rate based on past gradients and their magnitudes. This helps the optimizer avoid oscillations and get past local minima more effectively.
- Bias correction: Adam adjusts for the bias in the first and second moment estimates, which are initialized at zero, helping to prevent instability during the early steps of training.
- Efficient performance: Adam typically requires less hyperparameter tuning than optimizers such as SGD, making it a convenient choice for most problems.
Compared with optimizers such as SGD (Stochastic Gradient Descent) and momentum-based SGD, Adam often trains faster and reaches good solutions more reliably. Its ability to adjust the learning rate per parameter, combined with the bias-correction mechanism, leads to faster convergence and more stable optimization, as the short calculation below illustrates. This makes Adam especially useful for complex models trained on large datasets, where it helps avoid slow convergence and instability.
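The per-parameter adaptation can be seen in a small, hypothetical calculation: two parameters whose constant gradients differ by a factor of 100 still receive Adam steps of roughly the same magnitude (about α), while plain SGD steps scale directly with the raw gradients. The numbers below are chosen purely for illustration.

```python
import numpy as np

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
grads = np.array([10.0, 0.1])   # "steep" vs. "flat" direction, constant for simplicity

m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads ** 2
m_hat = m / (1 - beta1 ** 100)
v_hat = v / (1 - beta2 ** 100)

adam_step = lr * m_hat / (np.sqrt(v_hat) + eps)
sgd_step = lr * grads
print("Adam step sizes:", adam_step)   # both approximately 0.001
print("SGD step sizes: ", sgd_step)    # 0.01 vs. 0.0001 -- a 100x spread
```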
Figure: performance comparison of optimizers on training cost.
In practice, Adam often achieves superior results with minimal tuning, making it a go-to optimizer for deep learning tasks.
Quiz: Test Your Understanding
What does the Adam optimizer primarily combine?
- Momentum and Gradient Descent
- RMSProp and Gradient Descent
- Momentum and RMSProp
- Stochastic Gradient Descent and Adagrad
Explanation:
Adam optimizer combines Momentum (which helps in faster convergence) and RMSProp (which adjusts learning rates for each parameter). This makes Adam faster and more efficient than other optimizers.
What is the primary purpose of the beta1 parameter in the Adam optimizer?
- Control the decay rate of moving averages of gradients
- Control the regularization strength
Explanation:
Beta1 controls how much of the past gradients the optimizer remembers. It helps smooth out updates and avoid large jumps.
What is the main benefit of using the Adam optimizer over simple Gradient Descent?
- It adapts the learning rate for each parameter
- It uses a fixed learning rate
- It doesn't require hyperparameter tuning
- It computes gradients more efficiently
Explanation:
Adam adjusts the learning rate for each parameter individually based on its past gradients.
Which of the following best describes the role of the beta2 parameter in Adam?
- Controls the momentum decay
- Controls the smoothing factor of the moving average of squared gradients
- Controls the learning rate
- Controls the regularization of the loss function
Explanation:
Beta2 controls the moving average of squared gradients, helping to stabilize the learning rate.
What happens if the learning rate (lr) is too high in Adam?
- The model converges faster
- The model will train more accurately
- The model may overshoot the optimal solution
- There is no significant impact on convergence
Explanation:
A high learning rate can cause the optimizer to skip over the optimal point during training.
How does the Adam optimizer handle sparse gradients?
- It applies a constant gradient across all parameters
- It uses momentum to smooth out the gradients
- It ignores parameters with sparse gradients
- It adapts the learning rate based on each parameter's gradient history
Explanation:
Adam adjusts the learning rate based on the history of gradients, which helps with sparse gradients.