gradient_exploding_vanishing_problem_v2
In deep learning, the gradient exploding and gradient vanishing problems are two common issues
that occur during the training of neural networks, particularly deep networks.
The gradient vanishing problem happens when gradients become very small during
backpropagation, particularly in deep networks. This leads to the weights of earlier layers (closer to
the input) updating very slowly or not at all, making it difficult for the model to learn effectively. This
issue is most prominent when using activation functions like the sigmoid or tanh, which squash input
values into small ranges (e.g., between 0 and 1 for sigmoid), causing the gradients to shrink as they
propagate backward through the layers.
**Why it happens:**
- The gradient values computed through backpropagation are products of the derivatives of
activation functions and weights. In deep networks, this can lead to the gradients becoming
exceedingly small as they move from the output layer to the input layer.
- For example, the derivative of the sigmoid activation function is never larger than 0.25, so the
product of these per-layer factors quickly becomes very small in deeper layers.
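The chain-rule product described above can be demonstrated with a toy calculation (the 20-layer depth is an arbitrary illustrative choice, and weight factors are omitted so the shrinkage comes purely from the activation derivative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Multiply the per-layer derivative factors for a 20-layer network.
# Even at the most favorable input (x = 0), the gradient shrinks geometrically.
gradient = 1.0
for layer in range(20):
    gradient *= sigmoid_derivative(0.0)  # 0.25 per layer

print(gradient)  # 0.25**20 ≈ 9.1e-13
```

After only 20 layers the gradient reaching the input layer is on the order of 10⁻¹², far too small to produce meaningful weight updates.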
**Consequences:**
- The model may fail to capture complex features or patterns in the data.
**Mitigations:**
- Use ReLU (Rectified Linear Unit) or its variants (like Leaky ReLU) as activation functions; their
derivative is 1 for positive inputs, so they do not shrink the gradient the way sigmoid and tanh do.
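For reference, both activations are one-liners (the `alpha` slope for Leaky ReLU is a common but illustrative default):

```python
def relu(x):
    # Derivative is 1 for x > 0, so active units pass the gradient unchanged.
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small slope for negative inputs keeps a nonzero gradient,
    # avoiding "dead" units that ReLU can produce when x <= 0.
    return x if x > 0 else alpha * x
```

Leaky ReLU trades ReLU's exact zero for negative inputs against a small gradient everywhere, which can help units recover during training.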
The exploding gradient problem occurs when gradients become extremely large during
backpropagation, which causes the weights of the network to become very large and unstable. This
leads to a situation where the model's learning process diverges instead of converging.
**Why it happens:**
- In deep networks, if the gradients are excessively large due to factors like large weights or an
unsuitable activation function, the gradients can grow exponentially as they propagate back through
the layers.
- For example, using activation functions with large derivatives, or poor weight initialization, can
cause the gradients to grow without bound as they are multiplied layer by layer.
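The same toy calculation as for vanishing gradients shows the exploding case: if the per-layer factor exceeds 1, the product grows geometrically (the factor 1.5 and depth 50 are arbitrary illustrative values):

```python
# Each layer multiplies the backpropagated gradient by a factor of
# |weight| * |activation derivative|; here that factor is assumed > 1.
gradient = 1.0
weight_factor = 1.5  # illustrative per-layer factor
for layer in range(50):
    gradient *= weight_factor

print(gradient)  # 1.5**50 ≈ 6.4e8
```

A gradient this large produces weight updates that overshoot wildly, which is exactly the divergence described above.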
**Consequences:**
- Weight updates become too large, causing the model to overshoot and fail to converge to a good
solution.
- The model's training can become unstable, sometimes resulting in NaN values during computation.
**Mitigations:**
- Apply gradient clipping, which limits the gradient values to a certain threshold.
- Use weight regularization techniques (like L2 regularization) to prevent the weights from growing
too large.
- Use appropriate weight initialization methods (e.g., Xavier initialization for sigmoid or tanh, or He
initialization for ReLU).
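Gradient clipping by global norm can be sketched in a few lines of plain Python (the function name and threshold below are illustrative, not taken from any particular library):

```python
import math

def clip_by_global_norm(gradients, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in gradients]
    return list(gradients)

grads = [3.0, 4.0]  # L2 norm = 5.0
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped)      # elements scaled by 0.2, so the norm becomes 1.0
```

The direction of the update is preserved; only its magnitude is capped, which keeps a single oversized gradient from destabilizing training.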
**Summary:**
- **Vanishing gradients** make training deep networks slow and ineffective by causing gradients to
shrink toward zero in the earlier layers.
- **Exploding gradients** cause instability and prevent the model from converging by causing
weight updates that are far too large.
Both problems are especially significant in very deep neural networks, but various strategies (like
proper initialization, choice of activation functions, and gradient clipping) can help mitigate them.