Unit III
(CNN)
Introduction, CNN architecture overview, The Basic Structure of a Convolutional Network - Padding, Strides, Typical Settings, the ReLU layer, Pooling, Fully Connected Layers, The Interleaving between Layers, Local Response Normalization, Training a Convolutional Network
A convolutional neural network, also called a CNN or ConvNet, is a deep learning algorithm. It takes an input image, assigns learnable weights and biases to components of the image, and then classifies the entire image. With enough training, ConvNets can learn these filters and classifiers on their own, and the pre-processing they require is much lower than for other algorithms. The application of CNNs in image recognition has revolutionized various AI-driven tasks, enhancing accuracy in areas like medical diagnostics and
autonomous vehicles. Convolutional Neural Networks (CNNs) are specialized deep learning models designed to efficiently
process data with a grid-like structure, such as images, by preserving spatial relationships between pixels. Unlike traditional fully
connected networks, CNNs use localized filters and shared weights to detect features, making them highly efficient with
significantly fewer parameters. This architecture enables translation invariance, allowing CNNs to recognize objects regardless of
their position in the image. Additionally, CNNs learn features hierarchically—early layers capture low-level patterns like edges
and textures, middle layers detect shapes and parts, and deeper layers understand complex structures and entire objects—making
them powerful tools for image classification, object detection, and visual recognition tasks.
2. Convolutional Layers
This is the core building block of a CNN.
• It uses learnable filters (kernels) that slide (convolve) over the image.
• Each filter detects specific features (like vertical edges, curves, textures).
• It performs a dot product between the filter and input patch to produce a feature map.
One primary advantage of CNNs is their ability to learn and extract features from images automatically. Convolutional layers
apply filters to input images, convolving them with learned parameters to detect relevant patterns and features. This hierarchical
learning process allows the network to identify simple features at lower layers, such as edges and textures, and gradually learn
more complex features at higher layers, including shapes and objects. This feature extraction capability is critical for achieving
high accuracy in various computer vision tasks.
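As a minimal illustration of a convolutional layer producing feature maps, consider the following Keras sketch (the filter count and input shape are arbitrary assumptions):

import tensorflow as tf

# 16 learnable 3x3 filters slide over a 28x28 grayscale input;
# each filter yields one feature map, so the output has 16 channels
conv = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu')
x = tf.random.normal((1, 28, 28, 1))  # (batch, height, width, channels)
print(conv(x).shape)                  # (1, 26, 26, 16): no padding shrinks 28 to 26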
4. Pooling Layers
Pooling layers downsample the feature maps, reducing their spatial dimensions (width and height) while keeping important information. This layer reduces the spatial size of the image representation, which in turn reduces the amount of computation and processing in the neural network. It also extracts dominant features that are largely invariant to small positional changes and rotations.
One type of pooling uses the Max operation, which picks the maximum value from each neuron cluster in the prior layer. The other type is Average pooling, which returns the average value of the cluster. Because Max pooling also acts as a noise suppressant, it often performs better than Average pooling.
A typical CNN stacks multiple pooling layers in addition to convolutional layers. The greater the number of these layers, the more complex the features that can be extracted; however, the computational power expended also increases.
• Max Pooling: Takes the maximum value from each window (e.g., 2×2).
• Average Pooling: Takes the average value in the window.
Benefits:
• Reduces computational load.
• Controls overfitting.
• Adds spatial invariance (tolerates small shifts/distortions).
Once the image has passed through all of the convolutional and pooling layers, feature extraction is complete. It is then time to classify the image, a task carried out by the Fully Connected Layer.
6. Output Layer
The final fully connected layer connects to the output layer, which depends on the task:
• For classification, it uses Softmax activation to output a probability distribution over classes.
• For binary tasks, sigmoid activation may be used.
• For regression, a linear output is used.
The highest probability in classification becomes the model’s prediction.
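A short sketch of the three output heads described above (the layer sizes and feature dimension are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

features = tf.random.normal((1, 64))               # stand-in for flattened FC features

clf_head = layers.Dense(10, activation='softmax')  # multi-class probabilities
bin_head = layers.Dense(1, activation='sigmoid')   # binary-task probability
reg_head = layers.Dense(1, activation='linear')    # unbounded regression output

probs = clf_head(features)
print(tf.argmax(probs, axis=1))                    # highest probability = prediction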
Advantages
1. Automatically learn features without manual extraction.
2. Use shared weights, reducing parameters and improving efficiency.
3. Recognize patterns regardless of their position in the input.
4. Capture both low-level and high-level features effectively.
5. Versatile and applicable to images, audio, video, and text.
Disadvantages
1. Require high computational power and resources.
2. Need large amounts of labeled data for training.
3. Difficult to interpret how decisions are made.
4. Can overfit, especially with small datasets.
5. Require fixed input sizes, limiting flexibility.
Pooling, also known as subsampling or downsampling, is a technique used in CNNs to reduce the spatial dimensions of feature
maps while retaining essential information. It helps in controlling the model’s complexity, reducing overfitting, and improving
computational efficiency by reducing the number of parameters and computation required in subsequent layers.
The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarising the features lying within the region covered by the filter.
For a feature map with dimensions nh x nw x nc, the output of a pooling layer has dimensions:
floor((nh - f) / s + 1) x floor((nw - f) / s + 1) x nc
-> nh - height of the feature map
-> nw - width of the feature map
-> nc - number of channels in the feature map
-> f - size of the filter
-> s - stride length
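As a quick check of this formula, take a hypothetical 6x6x3 feature map pooled with a 2x2 filter (f = 2) and stride s = 2: floor((6 - 2) / 2 + 1) = 3, so the output is 3 x 3 x 3; the spatial dimensions are halved while the channel count is unchanged.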
Types of Pooling Layers:
1) Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output of a max-pooling layer is a feature map containing the most prominent features of the previous feature map.
2) Average Pooling
Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in a patch.
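The two operations can be contrasted with a minimal Keras sketch (the toy input values are made up):

import numpy as np
import tensorflow as tf

# Toy 4x4 single-channel input: (batch, height, width, channels)
x = tf.constant(np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1))

# A 2x2 window with stride 2 halves both spatial dimensions
max_pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2), strides=2)

print(max_pool(x).numpy().squeeze())  # most prominent value in each patch
print(avg_pool(x).numpy().squeeze())  # mean value in each patch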
2. Strides
Stride defines how far the filter shifts across the input image after each application. By default, the stride is (1, 1), meaning the filter moves one pixel at a time. When the stride is 2, the filter jumps two pixels at a time as we slide it around, and so on.
You can increase the stride value to have the filter skip over pixels, resulting in a smaller output spatial dimension.
Let’s say we have a 5x5 input matrix (representing a part of an image), and we’re applying a 3x3 filter using a stride of 1. The filter
starts from the top left corner of the image, and then moves one pixel to the right at each step until it hits the edge of the image, at
which point it moves back to the left edge and one pixel down.
But, if the stride were 2, the filter would move two pixels to the right at each step instead of one. This means the filter would be
applied fewer times, and the resulting output (often referred to as a feature map) would be smaller.
Example code using strides (a minimal Keras sketch; the input shape below is an assumption for illustration):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential()
# With strides=(2, 2) and padding='same', the 28x28 input is downsampled to 14x14
model.add(Conv2D(32, kernel_size=(3, 3), strides=(2, 2), padding='same',
                 activation='relu', input_shape=(28, 28, 1)))
Behavior:
• Stride = 1: Filter slides one pixel at a time → high-resolution output.
• Stride = 2 or more: Filter skips pixels → reduces output size → faster computation.
Example:
Input = 7×7, Filter = 3×3
• Stride 1 → Output = 5×5
• Stride 2 → Output = 3×3
• Higher stride = smaller, faster, but less detailed output
5. Pooling Layer
Pooling reduces the size of the feature maps while preserving the most important features.
Types:
• Max Pooling: Takes the maximum value in each window.
• Average Pooling: Computes the average value.
Purpose:
• Reduces overfitting by lowering feature map size.
• Introduces invariance to small shifts in the image.
6. Fully Connected Layer
Role:
• Combines extracted features to make final predictions.
• Handles classification or regression based on task.
Example:
• For 10-class image classification, the final FC layer has 10 output nodes (with softmax).
Local Response Normalization (LRN)
LRN normalizes each activation using activations at the same spatial position in neighbouring channels:

b^i_{x,y} = a^i_{x,y} / ( k + α * Σ (a^j_{x,y})^2 )^β,
where the sum runs over channels j = max(0, i - n/2) to min(N - 1, i + n/2).

Where:
a^i_{x,y}: input activation at location (x, y), channel i
b^i_{x,y}: normalized output
n: number of adjacent channels to normalize over
N: total number of channels
α, β, k: hyperparameters controlling normalization
Typical Values for Hyperparameters
α = 10^-4
β = 0.75
k = 2
n = 5
Benefits of LRN
• Encourages feature diversity by suppressing less useful activations.
• Improves generalization by reducing overfitting.
• Introduces lateral inhibition, mimicking biological neurons.
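As a minimal sketch, TensorFlow ships a built-in LRN op that implements this formula (the tensor shape below is an arbitrary assumption):

import tensorflow as tf

a = tf.random.normal((1, 8, 8, 16))   # activations: (batch, height, width, channels)

# depth_radius corresponds to n/2 (so depth_radius=2 gives n = 5)
# and bias plays the role of k in the formula above
b = tf.nn.local_response_normalization(a, depth_radius=2, bias=2.0,
                                       alpha=1e-4, beta=0.75)
print(b.shape)                        # unchanged: (1, 8, 8, 16)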
Training a Convolutional Network
1. Forward Propagation:
The input image is passed through the convolutional, pooling, and fully connected layers to produce a prediction.
2. Loss Calculation:
Once the network generates a prediction, the loss function is used to measure how far the network's
prediction is from the true label (ground truth). Common loss functions include:
• Cross-Entropy Loss: Used for classification tasks, where the goal is to minimize the difference between
predicted probabilities and actual labels.
• Mean Squared Error: Used for regression tasks, where the goal is to minimize the difference between
predicted values and actual values.
The loss function quantifies the network's error, and this value is essential for updating the model's
parameters.
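Both loss functions are available as Keras built-ins; a small sketch with made-up labels and predictions:

import tensorflow as tf

y_true = tf.constant([[0.0, 1.0, 0.0]])   # one-hot ground truth
y_pred = tf.constant([[0.1, 0.7, 0.2]])   # predicted class probabilities

cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_true, y_pred).numpy())        # -ln(0.7) ≈ 0.357

mse = tf.keras.losses.MeanSquaredError()
print(mse(tf.constant([2.0]), tf.constant([2.5])).numpy())  # (2.5 - 2)^2 = 0.25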
3. Backpropagation:
After calculating the loss, backpropagation is employed to adjust the weights of the network. The gradient of
the loss with respect to each weight is computed using the chain rule of calculus. This step involves:
• Calculating gradients: For each weight, the network computes the partial derivative of the loss with
respect to that weight.
• Propagating the error: These gradients are then propagated back through the network to update each
weight accordingly.
Backpropagation ensures that the model learns from its errors, improving its predictions over time.
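In TensorFlow, this gradient computation can be sketched with GradientTape (the model, data, and learning rate here are placeholder assumptions):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(3, activation='softmax')])
loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.normal((4, 8))              # dummy batch of inputs
y = tf.one_hot([0, 1, 2, 1], depth=3)     # dummy one-hot labels

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x))           # forward pass and loss
grads = tape.gradient(loss, model.trainable_variables)   # chain rule
optimizer.apply_gradients(zip(grads, model.trainable_variables))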
6. Regularization Techniques:
To prevent overfitting (when the model learns the training data too well but fails to generalize to new data),
regularization techniques are used during training:
• Dropout: Randomly drops units (nodes) during training to prevent reliance on specific neurons.
• Data Augmentation: Increases the diversity of the training data by applying random transformations
(e.g., rotation, scaling) to the images.
• Batch Normalization: Normalizes the activations of each layer to stabilize training and speed up
convergence.
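A hedged sketch of how these three techniques typically appear together in Keras code (the layer sizes and input shape are arbitrary):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Data augmentation: random transforms applied only during training
    layers.RandomFlip('horizontal', input_shape=(32, 32, 3)),
    layers.RandomRotation(0.1),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.BatchNormalization(),          # normalizes activations per mini-batch
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),                  # randomly drops half the units in training
    layers.Dense(10, activation='softmax'),
])
model.summary()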
Training a CNN is an iterative process that requires fine-tuning of parameters like learning rate, batch size, and
the number of layers to achieve optimal performance. By repeating the forward propagation, loss calculation,
backpropagation, and weight update steps, the network learns to classify or detect objects with high accuracy.
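Putting these steps together, the iterative loop is usually driven by compile() and fit() in Keras; a self-contained sketch with dummy data standing in for a real labeled dataset:

import numpy as np
import tensorflow as tf

# Dummy data standing in for a real labeled image dataset
x_train = np.random.rand(100, 32, 32, 3).astype('float32')
y_train = np.random.randint(0, 10, size=(100,))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# compile() fixes the optimizer, loss and metrics; each fit() epoch repeats the
# forward pass, loss calculation, backpropagation and weight-update steps
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5)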
Q) Define convolutional neural network (CNN), and how does it differ from other types of neural networks?
A Convolutional Neural Network (CNN) is a specialized type of deep neural network primarily used for image-related tasks
such as image classification, object detection, and facial recognition. CNNs are designed to automatically and adaptively learn
spatial hierarchies of features from input images using convolutional layers, pooling layers, and fully connected layers.
The core concept behind CNNs is the use of convolutional operations — small, learnable filters that slide over the input image
to detect local patterns (like edges, textures, or shapes). This allows CNNs to preserve spatial relationships between pixels and
reduce the number of parameters compared to fully connected networks.
How CNNs Differ from Other Neural Networks
Feature                | CNNs                                          | Traditional Neural Networks (e.g., MLPs)
Input Type             | Primarily images or spatial data              | Vector data
Layer Type             | Convolutional + Pooling + Dense               | Fully connected (dense) layers only
Parameter Sharing      | Yes (shared filters)                          | No (each weight is unique)
Local Connectivity     | Yes (filters scan small regions)              | No (each neuron connected to every input)
Number of Parameters   | Much fewer (due to shared weights)            | Very large (especially with high-dimensional data)
Best Use Cases         | Computer vision tasks                         | General tabular or simple pattern recognition
Translation Invariance | Yes (learns features regardless of position)  | No