Backpropagation in Convolutional Neural Networks
Last Updated: 08 Sep, 2024
Convolutional Neural Networks (CNNs) have become the backbone of many modern image processing systems. Their ability to learn hierarchical representations of visual data makes them exceptionally powerful. A critical component of training CNNs is backpropagation, the algorithm used for effectively updating the network's weights.
This article delves into the mathematical underpinnings of backpropagation within CNNs, explaining how it works and its crucial role in neural network training.
Understanding Backpropagation
Backpropagation, short for "backward propagation of errors," is an algorithm used to calculate the gradient of the loss function of a neural network with respect to its weights. It is essentially a method to update the weights to minimize the loss. Backpropagation is crucial because it tells us how to change our weights to improve our network’s performance.
Role of Backpropagation in CNNs
In a CNN, backpropagation plays a crucial role in fine-tuning the filters and weights during training, allowing the network to better differentiate features in the input data. CNNs typically consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers. Each of these layers has weights and biases that are adjusted via backpropagation.
Fundamentals of Backpropagation
Backpropagation, in essence, is an application of the chain rule from calculus used to compute the gradients (partial derivatives) of a loss function with respect to the weights of the network.
The process involves three main steps: the forward pass, loss calculation, and the backward pass.
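For instance, in a two-layer network with O_1 = \sigma(W_1 x) and O_2 = \sigma(W_2 O_1), the chain rule expands the gradient of the loss with respect to the first layer's weights as \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial O_2} \cdot \frac{\partial O_2}{\partial O_1} \cdot \frac{\partial O_1}{\partial W_1}. Each layer contributes one factor, which is why the gradient can be computed efficiently layer by layer, starting from the output and moving backward.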
The Forward Pass
During the forward pass, input data (e.g., an image) is passed through the network to compute the output. For a CNN, this involves several key operations (a short NumPy sketch follows the list):
- Convolutional Layers: Each convolutional layer applies a set of learnable filters to the input. For a given layer with filter F, input I, and bias b, the output O is given by: O = (I * F) + b. Here, * denotes the convolution operation.
- Activation Functions: After convolution, an activation function \sigma (e.g., ReLU) is applied element-wise to introduce non-linearity: O = \sigma((I * F) + b)
- Pooling Layers: Pooling (e.g., max pooling) reduces dimensionality, summarizing the features extracted by the convolutional layers.
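To make these operations concrete, here is a minimal NumPy sketch of the forward pass for a single-channel input and a single filter. It is illustrative only: the function names conv2d, relu, and max_pool2d are our own, and, like most deep learning libraries, the sliding-window product below is technically a cross-correlation (a true convolution would flip the kernel first).

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Valid sliding-window product of a 2D image with a 2D kernel, plus bias.
    Most deep learning libraries implement 'convolution' this way."""
    h, w = kernel.shape
    out_h = image.shape[0] - h + 1
    out_w = image.shape[1] - w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel) + bias
    return out

def relu(x):
    return np.maximum(0.0, x)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; assumes dimensions divide evenly."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Forward pass: convolution -> ReLU -> pooling
image = np.random.randn(6, 6)
F = np.random.randn(3, 3)
b = 0.1
O = max_pool2d(relu(conv2d(image, F, b)))
print(O.shape)  # (2, 2)
```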
Loss Calculation
After computing the output, a loss function L is calculated to assess the error in prediction. Common loss functions include mean squared error for regression tasks or cross-entropy loss for classification:
L = -\sum y \log(\hat{y})
Here, y is the true label, and \hat{y} is the predicted label.
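As a quick illustration, the cross-entropy formula above can be computed directly. This is a minimal sketch; the eps term is a common numerical guard against log(0), not part of the mathematical definition.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy loss L = -sum(y * log(y_hat)) for one-hot labels."""
    return -np.sum(y_true * np.log(y_pred + eps))

y = np.array([0, 1, 0])            # true label (one-hot)
y_hat = np.array([0.2, 0.7, 0.1])  # predicted probabilities
print(cross_entropy(y, y_hat))     # -log(0.7), roughly 0.357
```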
The Backward Pass (Backpropagation)
The backward pass computes the gradient of the loss function with respect to each weight in the network by applying the chain rule:
- Gradient with respect to output: First, calculate the gradient of the loss function with respect to the output of the final layer: \frac{\partial L}{\partial O}
- Gradient through activation function: Apply the chain rule through the activation function: \frac{\partial L}{\partial I} = \frac{\partial L}{\partial O} \frac{\partial O}{\partial I} For ReLU, \frac{\partial{O}}{\partial{I}} is 1 for I > 0 and 0 otherwise.
- Gradient with respect to filters in convolutional layers: Continue applying the chain rule to find the gradients with respect to the filters: \frac{\partial L}{\partial F} = \frac{\partial L}{\partial O} * rot180(I). Here, rot180(I) rotates the input by 180 degrees, aligning it for the convolution operation used to calculate the gradient with respect to the filter. (In frameworks that implement "convolution" as cross-correlation, this amounts to cross-correlating the input with the output gradient.) A sketch of these gradient computations follows this list.
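The gradient steps above can be sketched in NumPy for the single-channel case. This is again an illustration under the cross-correlation convention; the function names relu_grad and filter_grad are our own.

```python
import numpy as np

def relu_grad(pre_activation, grad_out):
    """Chain rule through ReLU: pass the gradient only where the
    pre-activation input was positive (derivative 1), else block it (0)."""
    return grad_out * (pre_activation > 0)

def filter_grad(image, grad_out):
    """dL/dF for a single-channel valid convolution: slide grad_out over
    the image and accumulate; equivalently, a cross-correlation of the
    input with the output gradient."""
    h = image.shape[0] - grad_out.shape[0] + 1
    w = image.shape[1] - grad_out.shape[1] + 1
    dF = np.zeros((h, w))
    for a in range(h):
        for b in range(w):
            region = image[a:a+grad_out.shape[0], b:b+grad_out.shape[1]]
            dF[a, b] = np.sum(region * grad_out)
    return dF

image = np.random.randn(6, 6)      # layer input
pre_act = np.random.randn(4, 4)    # pre-activation output of the convolution
dL_dO = np.random.randn(4, 4)      # gradient arriving from the next layer
dL_dF = filter_grad(image, relu_grad(pre_act, dL_dO))
print(dL_dF.shape)                 # (3, 3), matching the filter
```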
Weight Update
Using the gradients calculated, the weights are updated using an optimization algorithm such as stochastic gradient descent (SGD):
F_{new} = F_{old} - \eta \frac{\partial L}{\partial F}
Here, \eta is the learning rate, which controls the step size during the weight update.
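A minimal sketch of this update rule is shown below. In practice the gradient comes from the backward pass above, and optimizers such as momentum SGD or Adam modify this basic step.

```python
import numpy as np

eta = 0.01                       # learning rate
F = np.random.randn(3, 3)        # current filter
dL_dF = np.random.randn(3, 3)    # gradient from the backward pass
F_new = F - eta * dL_dF          # F_new = F_old - eta * dL/dF
```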
Challenges in Backpropagation
Vanishing Gradients
In deep networks, backpropagation can suffer from the vanishing gradient problem, where gradients become too small to make significant changes in weights, stalling the training. Advanced activation functions like ReLU and optimization techniques such as batch normalization are used to mitigate this issue.
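A toy calculation illustrates why this happens with saturating activations such as the sigmoid, whose derivative never exceeds 0.25:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)  # at most 0.25, at x = 0

# The backpropagated gradient picks up one activation-derivative factor
# per layer; with sigmoid, 20 layers shrink it by at least 0.25**20.
grad = 1.0
for _ in range(20):
    grad *= sigmoid_grad(0.0)  # the best case, 0.25
print(grad)  # ~9.1e-13: effectively zero

# ReLU's derivative is exactly 1 for positive inputs, which is one
# reason it mitigates this systematic shrinkage.
```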
Exploding Gradients
Conversely, gradients can grow excessively large as they propagate backward through the network; this is known as exploding gradients. It can be controlled by techniques such as gradient clipping, which rescales the gradient when its norm exceeds a threshold.
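Norm-based clipping is easy to sketch. The name clip_by_norm is our own illustrative choice; deep learning frameworks provide built-in gradient-norm clipping utilities.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50
print(clip_by_norm(g, 5.0))  # [3. 4.], norm 5
```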
Conclusion
Backpropagation in CNNs is a sophisticated yet elegantly mathematical process crucial for learning from vast amounts of visual data. Its effectiveness hinges on the intricate interplay of calculus, linear algebra, and numerical optimization techniques, which together enable CNNs to achieve remarkable performance in various applications ranging from autonomous driving to medical image analysis. Understanding and optimizing the backpropagation process is fundamental to pushing the boundaries of what neural networks can achieve.