
Vanishing and Exploding Gradients Problems in Deep Learning


Training deep neural networks effectively requires managing the vanishing and exploding gradient problems. These issues occur during backpropagation, when gradients become too small or too large, making it difficult for the model to learn properly. Both problems directly affect the model's convergence and overall performance.

Vanishing Gradient Problem

Vanishing gradients occur when gradients become extremely small during backpropagation, causing early layers to learn very slowly or stop learning. During backpropagation the gradient of the loss L with respect to a weight ​w_i in layer i is calculated using the chain rule:

\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdot \frac{\partial a_{n-1}}{\partial a_{n-2}} \cdots \frac{\partial a_1}{\partial w_i}

where

  • L : Loss function.
  • w_i : Weight parameter in layer i.
  • a_n : Activation output of layer n.
  • \frac{\partial L}{\partial w_i} : Gradient of loss with respect to weight​.

When activation functions like Sigmoid or Tanh are used, their derivatives are at most 1 (at most 0.25 for Sigmoid). Repeated multiplication of these small factors across layers causes the gradient to shrink toward zero as it moves backwards, so the earliest layers learn extremely slowly or not at all.
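
This shrinkage can be seen with a short numerical sketch (plain NumPy, illustrative only; the 30-layer depth is an assumption): multiplying the Sigmoid derivative across many layers leaves almost no gradient for the earliest layers.
Python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Best case: pre-activation of 0, where the Sigmoid derivative reaches its maximum of 0.25
grad = 1.0
for _ in range(30):
    grad *= sigmoid_derivative(0.0)

print(grad)  # ~8.7e-19 -- the learning signal reaching the early layers is effectively zero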

Exploding Gradient Problem

Exploding gradients occur when gradients grow too large during backpropagation, leading to unstable weight updates and divergence in loss. When derivatives or weights are greater than 1, their repeated multiplication across layers leads to exponential growth.

\prod_{i=1}^{n} \frac{\partial a_i}{\partial a_{i-1}} \longrightarrow \infty

The gradient update rule in gradient descent is:

w_{t+1} = w_t - \eta \cdot \frac{\partial L}{\partial w_t}

where

  • w_t : Current weight value at time step t.
  • \eta : Learning rate.
  • \frac{\partial L}{\partial w_t} : Gradient of loss with respect to the weight.
  • w_{t+1} : Updated weight after applying gradient descent.

If \frac{\partial L}{\partial w_t} is too large, the weight updates become massive, causing the loss to oscillate or diverge.
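
The same kind of numerical sketch shows the opposite effect (illustrative NumPy; the per-layer factor of 1.5 is an assumption): when every layer contributes a factor larger than 1, the product grows exponentially.
Python
import numpy as np

# Assume each of 30 layers scales the backpropagated gradient by ~1.5
factors = np.full(30, 1.5)
grad = np.prod(factors)

print(grad)         # ~1.9e5 -- the gradient has grown by five orders of magnitude
print(0.01 * grad)  # even a small learning rate now produces a huge weight update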

Why Do Gradients Vanish or Explode?

  • Activation Functions: Sigmoid or Tanh have small derivatives that shrink gradients.
  • Weight Initialization: Too small or too large weights cause vanishing or exploding gradients.
  • Deep Networks: Many layers multiply gradients repeatedly leading to instability.
  • Learning Rate: High learning rate or unscaled inputs can make gradients explode.

Techniques to Fix Vanishing and Exploding Gradients

Vanishing and exploding gradients make training deep neural networks difficult. The following methods help stabilize gradient flow and improve learning.

1. Proper Weight Initialization

Choosing the right weight initialization, such as He initialization for ReLU layers or Glorot (Xavier) initialization for Sigmoid and Tanh layers, keeps gradient magnitudes balanced during backpropagation, as sketched below.
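
A minimal Keras sketch of this idea (the layer sizes are illustrative):
Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    # He initialization keeps activation variance stable for ReLU layers
    Dense(64, activation='relu', kernel_initializer='he_normal', input_dim=2),
    # Glorot (Xavier) initialization is the usual choice for Tanh / Sigmoid layers
    Dense(64, activation='tanh', kernel_initializer='glorot_uniform'),
    Dense(1, activation='sigmoid')
])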

2. Use Non-Saturating Activation Functions

Sigmoid and Tanh can shrink gradients. Using ReLU or one of its variants helps prevent vanishing gradients (see the sketch after this list):

  • ReLU: Basic rectified linear unit.
  • Leaky ReLU: Allows small gradients for negative inputs.
  • ELU / SELU: Smooth variants that help maintain self-normalizing properties.
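
A minimal Keras sketch of these variants (layer sizes are illustrative; Leaky ReLU is added as a separate layer):
Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU

model = Sequential([
    Dense(64, activation='relu', input_dim=2),   # plain ReLU
    Dense(64),                                   # no activation here ...
    LeakyReLU(),                                 # ... Leaky ReLU keeps a small gradient for negative inputs
    Dense(64, activation='elu'),                 # ELU: smooth, non-zero gradient for negative inputs
    Dense(64, activation='selu',
          kernel_initializer='lecun_normal'),    # SELU self-normalizes when paired with LeCun initialization
    Dense(1, activation='sigmoid')
])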

3. Apply Batch Normalization

Normalizes layer inputs to have zero mean and unit variance, stabilizing gradients and accelerating convergence.
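
A minimal Keras sketch of inserting BatchNormalization between Dense layers (layer sizes are illustrative):
Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    Dense(64, activation='relu', input_dim=2),
    BatchNormalization(),   # re-centres and re-scales this layer's outputs for each mini-batch
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
])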

4. Gradient Clipping

Limits gradients to a maximum threshold to prevent them from exploding and destabilizing training.
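
In Keras, clipping can be enabled directly on the optimizer through the clipnorm or clipvalue arguments (the thresholds below are illustrative):
Python
from tensorflow.keras.optimizers import SGD, Adam

# Rescale the whole gradient vector whenever its L2 norm exceeds 1.0
clipped_sgd = SGD(learning_rate=0.1, clipnorm=1.0)

# Or clip every gradient component to the range [-0.5, 0.5]
clipped_adam = Adam(learning_rate=0.001, clipvalue=0.5)

# model.compile(loss='binary_crossentropy', optimizer=clipped_sgd, metrics=['accuracy'])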

Step-By-Step Implementation

Here we compare how gradients behave in deep neural networks using Sigmoid and ReLU activations to visualize the vanishing gradient problem through loss curves.

Step 1 : Import Required Libraries

  • numpy: For numerical and array operations.
  • matplotlib: For plotting graphs and visualizations.
  • train_test_split: Splits data into training and test sets.
  • Sequential, Dense: Build the neural network layer by layer.
  • tensorflow: Framework for deep learning tasks.
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD
import tensorflow as tf
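
The training steps below use X_train and y_train. A minimal sketch for creating a matching two-feature binary dataset (make_moons, the sample count and the split ratio are assumptions here; any two-feature binary classification data works):
Python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Toy 2-feature binary classification problem, matching input_dim=2 used below
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)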

Step 2: Define Function for Vanishing Gradient

  • Build a deep neural network using the chosen activation function (sigmoid or relu).
  • Save the initial weights before training to approximate gradient magnitude.
  • Train the model and compute how much weights changed after training to estimate gradient size.
  • Return the average gradient magnitude and training loss for visualization.
Python
def Vanishing_gradients(activation='sigmoid', layers=10, epochs=100):
    model = Sequential()
    model.add(Dense(10, activation=activation, input_dim=2))
    for _ in range(layers - 1):
        model.add(Dense(10, activation=activation))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(learning_rate=0.001),
                  metrics=['accuracy'])
    
    old_weights = model.get_weights()[0]
    
    history = model.fit(X_train, y_train, epochs=epochs, verbose=0)
    
    new_weights = model.get_weights()[0]
    
    gradient = (old_weights - new_weights) / 0.001
    avg_grad = np.mean(np.abs(gradient))
    
    return avg_grad, history.history['loss'], model

Step 3: Demonstrating the Vanishing Gradient

  • Train one model with the Sigmoid activation and another with ReLU activation.
  • Both models use the same architecture, optimizer and learning rate for fairness.
  • Observe how the Sigmoid model trains more slowly due to vanishing gradients.
  • The ReLU model should show faster and more stable convergence.
Python
sigmoid_grad, sigmoid_loss, sigmoid_model = Vanishing_gradients('sigmoid', layers=10)
relu_grad, relu_loss, relu_model = Vanishing_gradients('relu', layers=10)

plt.figure(figsize=(8,5))
plt.plot(sigmoid_loss, label='Sigmoid Activation', linewidth=2, color='orange')
plt.plot(relu_loss, label='ReLU Activation', linewidth=2, color='green')
plt.title("Training Loss Comparison (Vanishing Gradient Effect)")
plt.xlabel("Epochs")
plt.ylabel("Binary Crossentropy Loss")
plt.legend()
plt.grid(True)
plt.show()

Output:

[Figure: Vanishing Gradient - training loss comparison of Sigmoid vs ReLU activations]

ReLU activation avoids the vanishing gradient problem and learns effectively, while Sigmoid activation stalls with almost no loss improvement.

Step 4: Define Function for Exploding Gradient

  • Create a deep network with many layers and large initial weights.
  • Use a very high learning rate to amplify gradient updates.
  • The combination of deep architecture, large initialization and high LR triggers gradient explosion.
  • Return loss history to visualize instability during training.
Python
def exploding_gradient(layers=50, epochs=100, lr=1.0, init_std=3.0):
    model = Sequential()
    initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=init_std)

    model.add(Dense(64, activation='relu', input_dim=2, kernel_initializer=initializer))
    for _ in range(layers - 1):
        model.add(Dense(64, activation='relu', kernel_initializer=initializer))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer=SGD(learning_rate=lr),
                  metrics=['accuracy'])
    
    history = model.fit(X_train, y_train, epochs=epochs, verbose=0)
    return history.history['loss']

Step 5: Demonstrating the Exploding Gradient

  • Train one model in a stable setup with small initialization and low learning rate.
  • Train another model with deep layers, high LR and large initialization to trigger explosion.
  • Compare the loss curves of stable vs exploding setups.
  • Exploding gradients cause the loss to rise rapidly or become unstable (often NaN) during training.
Python
stable_loss = exploding_gradient(layers=10, epochs=100, lr=0.01, init_std=0.05)   # small init_std (illustrative value)
exploding_loss = exploding_gradient(layers=50, epochs=100, lr=1.0, init_std=3.0)

plt.figure(figsize=(8,5))
plt.plot(stable_loss, label='Stable Training (Low LR, Small Init)', linewidth=2, color='green')
plt.plot(exploding_loss, label='Exploding Gradients (High LR, Large Init)', linewidth=2, color='red')
plt.title("Exploding Gradient Effect on Training Loss")
plt.xlabel("Epochs")
plt.ylabel("Binary Crossentropy Loss")
plt.legend()
plt.grid(True)
plt.show()

Output:

[Figure: Exploding Gradient - training loss for stable vs exploding setups]

High learning rates or large weight initialization cause exploding gradients, while smaller values lead to stable training.


Impact of Vanishing and Exploding Gradients on RNNs

Recurrent Neural Networks (RNNs) process sequences step by step, using past outputs as inputs for future steps, which makes them particularly sensitive to gradient issues during training.

  • Loss of Long-Term Memory: The model forgets earlier information because tiny gradients fail to carry learning signals backward.
  • Unstable Training: Large gradients make the network oscillate or diverge instead of converging smoothly.
  • Poor Sequential Performance: The RNN struggles to capture long range dependencies needed for complex sequence tasks.
  • Difficult Optimization: Training becomes unstable and highly dependent on learning rate, initialization and regularization choices.

To solve these problems, advanced models such as LSTM and GRU were introduced. They use gating mechanisms to control the flow of information and keep gradients stable during training. Techniques such as gradient clipping, proper weight initialization and normalization also make training smoother and help avoid instability.
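
A minimal Keras sketch of this combination (the sequence shape and layer sizes are illustrative): an LSTM layer supplies the gating, while clipnorm on the optimizer guards against exploding gradients.
Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

# Illustrative sequence model: 20 time steps, 8 features per step
model = Sequential([
    LSTM(32, input_shape=(20, 8)),   # gates control information flow and keep gradients stable
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=0.001, clipnorm=1.0),   # clip to avoid exploding updates
              metrics=['accuracy'])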

Challenges

  • Slow Learning: Vanishing gradients make early layers learn very slowly or not at all.
  • Unstable Training: Exploding gradients cause erratic weight updates and diverging loss.
  • Long Term Dependencies: Vanishing gradients limit memory in sequence models.
  • Poor Convergence: Gradient issues can lead to suboptimal or failed convergence.
  • Hyperparameter Tuning: Gradient instability requires careful tuning of learning rates and initialization.
