
Unit III: Convolutional Neural Network (CNN) - 06 Hours

Introduction, CNN architecture overview, The Basic Structure of a Convolutional Network - Padding, Strides, Typical Settings, the ReLU layer, Pooling, Fully Connected Layers, The Interleaving between Layers, Local Response Normalization, Training a Convolutional Network
A convolutional neural network, also called CNN or ConvNet, is a Deep Learning algorithm. It takes an input image, assigns learnable weights and biases to the components of the image, and then classifies the entire image. With enough training, ConvNets learn these filters and classification rules on their own, and the pre-processing they require is lower than for other algorithms. The application of CNNs in image recognition has revolutionized various AI-driven tasks, enhancing accuracy in areas like medical diagnostics and autonomous vehicles. Convolutional Neural Networks (CNNs) are specialized deep learning models designed to efficiently process data with a grid-like structure, such as images, by preserving spatial relationships between pixels. Unlike traditional fully connected networks, CNNs use localized filters and shared weights to detect features, making them highly efficient with significantly fewer parameters. This architecture enables translation invariance, allowing CNNs to recognize objects regardless of their position in the image. Additionally, CNNs learn features hierarchically: early layers capture low-level patterns like edges and textures, middle layers detect shapes and parts, and deeper layers understand complex structures and entire objects, making them powerful tools for image classification, object detection, and visual recognition tasks.

II. CNN architecture overview:


1. Input Layer
The input layer receives raw image data, typically represented as a 3D array:
• Width × Height × Channels (e.g., 224 × 224 × 3 for a colour image).
• Each pixel’s value represents brightness or colour intensity.
• No computations are performed here; it simply passes the data into the network.

2. Convolutional Layers
This is the core building block of a CNN.
• It uses learnable filters (kernels) that slide (convolve) over the image.
• Each filter detects specific features (like vertical edges, curves, textures).
• It performs a dot product between the filter and input patch to produce a feature map.
One primary advantage of CNNs is their potential to learn and extract features from images automatically. Convolutional layers
apply filters to input images, convolving them with learned parameters to detect relevant patterns and features. This hierarchical
learning process allows the network to identify simple features at lower layers, such as edges and textures, and gradually learn
more complex features at higher layers, including shapes and objects. This feature extraction capability is critical for achieving
high accuracy in various computer vision tasks.
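As an illustration, here is a minimal NumPy sketch of the sliding dot product described above (the input values and the vertical-edge kernel are made-up examples, not taken from this document):

import numpy as np

def conv2d(image, kernel):
    # Valid convolution: slide the kernel over the image (stride 1, no padding)
    h, w = image.shape
    f = kernel.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f, j:j + f]
            out[i, j] = np.sum(patch * kernel)   # dot product with the local patch
    return out

image = np.random.rand(5, 5)                     # toy 5x5 grayscale input
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])           # simple vertical-edge detector
feature_map = conv2d(image, vertical_edge)       # produces a 3x3 feature map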

3. Activation Function (ReLU)


After convolution, the ReLU (Rectified Linear Unit) function is applied element-wise: f(x) = max(0, x)
• It introduces non-linearity, enabling the network to learn complex patterns.
• ReLU replaces negative values with zero, speeding up training and avoiding vanishing gradients.
Other activations: Leaky ReLU, ELU, but ReLU is most common.
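A one-line NumPy illustration of this element-wise operation (the input values are arbitrary):

import numpy as np
x = np.array([[-2.0, 0.5], [3.0, -0.1]])
relu = np.maximum(0, x)   # negatives become 0: [[0. , 0.5], [3. , 0. ]]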

4. Pooling Layers
Pooling layers downsample the feature maps, reducing their spatial dimensions (width and height) while keeping important information. This reduces the spatial size of the image representation and, with it, the amount of computation and processing in the network. Pooling also extracts dominant features that are approximately invariant to small shifts and distortions in the input.
• Max Pooling: Takes the maximum value from each window (e.g., 2×2) of the prior layer's feature map.
• Average Pooling: Takes the average value in the window.
Since max pooling also acts as a noise suppressant, it generally performs better than average pooling.
A typical architecture contains multiple pooling layers interleaved with convolutional layers. The more of these layers there are, the richer the hierarchy of features that can be extracted; however, the computational power expended also increases.
Once the image has passed through all the convolutional and pooling layers, feature extraction is complete and it is time for the classification of the image. The Fully Connected Layer carries out this task.
Benefits:
• Reduces computational load.
• Controls overfitting.
• Adds spatial invariance (tolerates small shifts/distortions).

5. Fully Connected Layers


After several convolution + pooling operations, the feature maps are flattened into a 1D vector and passed to
fully connected (dense) layers.
• As the final stage, the FC layers form a simple feed-forward neural network. The input to the first fully connected layer is the flattened output of the last pooling/convolutional layer; to flatten means that the 3-dimensional array is unrolled into a vector.
Each FC layer applies a learned transformation to this vector. After the vector has passed through all the fully connected layers, the softmax activation function is used in the final layer to compute the probability of the input belonging to a particular class.
Thus, the end result is a set of probabilities of the input image belonging to the different classes.
The process is repeated for different types of images and individual images within those types. This trains the network and teaches it to differentiate between a dog and a cat, or a rose and a sunflower.

6. Output Layer
The final fully connected layer connects to the output layer, which depends on the task:
• For classification, it uses Softmax activation to output a probability distribution over classes.
• For binary tasks, sigmoid activation may be used.
• For regression, a linear output is used.
The highest probability in classification becomes the model’s prediction.
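A hedged Keras sketch of these three output-layer choices (the layer sizes are illustrative):

from tensorflow.keras.layers import Dense

out_multiclass = Dense(10, activation='softmax')  # probability distribution over 10 classes
out_binary     = Dense(1, activation='sigmoid')   # single probability for binary tasks
out_regression = Dense(1)                         # linear output for regression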
Advantages
1. Automatically learn features without manual extraction.
2. Use shared weights, reducing parameters and improving efficiency.
3. Recognize patterns regardless of their position in the input.
4. Capture both low-level and high-level features effectively.
5. Versatile and applicable to images, audio, video, and text.
Disadvantages
1. Require high computational power and resources.
2. Need large amounts of labeled data for training.
3. Difficult to interpret how decisions are made.
4. Can overfit, especially with small datasets.
5. Require fixed input sizes, limiting flexibility.
Pooling, also known as subsampling or downsampling, is a technique used in CNNs to reduce the spatial dimensions of feature
maps while retaining essential information. It helps in controlling the model’s complexity, reducing overfitting, and improving
computational efficiency by reducing the number of parameters and computation required in subsequent layers.
The pooling operation involves sliding a two-dimensional filter over each channel of feature map and summarising the features
lying within the region covered by the filter.
For a feature map of dimensions nh x nw x nc, the output obtained after a pooling layer with filter size f and stride s has dimensions: ((nh - f)/s + 1) x ((nw - f)/s + 1) x nc
-> nh - height of the feature map
-> nw - width of the feature map
-> nc - number of channels in the feature map
-> f - size of the filter
-> s - stride length
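For example, a 4 x 4 x 3 feature map pooled with f = 2 and s = 2 gives (4 - 2)/2 + 1 = 2 along each spatial axis, i.e., a 2 x 2 x 3 output.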
Types of Pooling Layers:

1) Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max-pooling layer is a feature map containing the most prominent features of the previous feature map.

2) Average Pooling

Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features
present in a patch.
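A minimal NumPy sketch of both operations on a made-up 4x4 feature map (2x2 windows, stride 2):

import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 5, 6]], dtype=float)

# Split into non-overlapping 2x2 blocks, then reduce each block.
blocks = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)
max_pooled = blocks.max(axis=(2, 3))    # [[6. 4.] [7. 9.]]  - most prominent values
avg_pooled = blocks.mean(axis=(2, 3))   # [[3.75 2.25] [4. 5.25]] - smoothed values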

Why Pooling is Used in CNNs


1. Dimensionality Reduction: Reduces the size of the feature maps, which decreases computational complexity and
memory usage.
2. Translation Invariance: Helps the network become more robust to small translations, rotations, or distortions in the
input image.
3. Prevents Overfitting: By reducing the number of parameters and computations in the network.
4. Improves Generalization: Helps the model focus on the most prominent features.
Comparison of Max Pooling and Average Pooling:

Aspect | Max Pooling | Average Pooling
1. Definition | Selects the maximum value from the pooling window. | Computes the average of values in the pooling window.
2. Focus | Captures the strongest/most dominant features. | Captures overall trends or smooth patterns.
3. Noise Sensitivity | Less sensitive to noise; ignores minor activations. | More sensitive to noise due to averaging all values.
4. Feature Preservation | Retains sharp and prominent features. | Preserves subtle and background information.
5. Edge Detection | Better at detecting edges and contours. | Can blur edge information.
6. Effect on Sparsity | Produces sparser feature maps (more zeros or low values). | Produces denser feature maps.
7. Common Usage | Commonly used in modern CNN architectures. | Used less frequently; sometimes for smooth representations.
8. Generalization | Helps models generalize better on unseen data. | May retain unnecessary detail, leading to less generalization.
III. The Basic Structure of a Convolutional Network:
1. Padding: Padding is the technique of adding extra pixels (usually zeros) around the border of the input image or feature map. For an nxn input image and an fxf filter, the shape of the output feature map without padding is (n-f+1)x(n-f+1). To maintain the same spatial dimensions after convolution, the padded input of size mxm must satisfy m-f+1 = n, i.e., m = n+f-1, which means padding the input to size (n+f-1)x(n+f-1) (a border of (f-1)/2 pixels on each side for odd f).
Purpose:
• Maintains the spatial dimensions of the image after convolution.
• Ensures that features at the edges are not lost.
• Helps in building deeper networks without shrinking the feature map too quickly.
Types:
• Valid Padding ("No Padding"): The filter only slides within the boundaries of the image → output is
smaller than input. In this case, the filter is applied only to valid positions inside the image, not going beyond the border. This
results in smaller output dimensions.
• Same Padding ("Zero Padding"): Pads the input so that the output has the same spatial dimensions as
input. The image is padded with enough zeros around the border so that the output dimensions after the convolution operation are
the same as the input dimensions.
Example:
If input size = 5×5 and filter size = 3×3:
• Without padding → Output = 3×3
• With padding (1 pixel) → Output = 5×5
Some Keras code demonstrating padding (imports shown for completeness):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(32, kernel_size=(3,3), padding='valid', activation='relu', input_shape=(28,28,1)))
model.add(Conv2D(32, kernel_size=(3,3), padding='same', activation='relu'))

2. Strides
Stride defines how far the filter moves during convolution. In convolution operations, the stride defines how much the filter shifts
across the input image after each application. By default, the stride is (1, 1), meaning the filter shifts one pixel at a time. When the
stride is 1, the filter moves across the input matrix 1 pixel at a time. When the stride is 2, the filter jumps 2 pixels at a time as we
slide it around. And so on.
You can increase the stride value to have the filter skip over pixels, resulting in a smaller output spatial dimension.
Let's say we have a 5x5 input matrix (representing a part of an image), and we're applying a 3x3 filter using a stride of 1. The filter starts from the top-left corner of the image, then moves one pixel to the right at each step until it hits the edge of the image, at which point it moves back to the left edge and one pixel down.
But, if the stride were 2, the filter would move two pixels to the right at each step instead of one. This means the filter would be
applied fewer times, and the resulting output (often referred to as a feature map) would be smaller.
Example code using strides (imports as in the padding example; the input shape is illustrative):

model = Sequential()
model.add(Conv2D(32, kernel_size=(3,3), strides=(2,2), padding='same', activation='relu', input_shape=(28,28,1)))

Behavior:
• Stride = 1: Filter slides one pixel at a time → high-resolution output.
• Stride = 2 or more: Filter skips pixels → reduces output size → faster computation.
Example:
Input = 7×7, Filter = 3×3
• Stride 1 → Output = 5×5
• Stride 2 → Output = 3×3
• Higher stride = smaller, faster, but less detailed output

3. Typical Settings in CNN Layers


CNNs often follow standardized settings that are known to work well across various tasks.
Parameter | Typical Value
Filter Size | 3×3 or 5×5
Stride | 1 (sometimes 2)
Padding | "same"
Pooling Size | 2×2
Activation | ReLU
Pooling Type | Max pooling
• These settings help to balance computational cost, training time, and feature resolution.

4. ReLU Layer (Activation Function)


ReLU (Rectified Linear Unit) is a non-linear activation function applied after convolution.
Why ReLU?
• Adds non-linearity so that CNNs can learn complex features.
• Avoids the vanishing gradient problem common with sigmoid/tanh.
• Fast to compute (only zeroing negatives).
Applied after every convolutional layer, making the network capable of modeling real-world image patterns.

5. Pooling Layer
Pooling reduces the size of the feature maps while preserving the most important features.
Types:
• Max Pooling: Takes the maximum value in each window.
• Average Pooling: Computes the average value.
Purpose:
• Reduces overfitting by lowering feature map size.
• Introduces invariance to small shifts in the image.

6. Fully Connected (FC) Layers


These layers come after convolution + pooling.
Structure:
• Flatten the final 3D feature maps into a 1D vector.
• Connect every node to all nodes in the next layer (dense layer).
• Apply one or more FC layers, followed by the output layer.

Role:
• Combines extracted features to make final predictions.
• Handles classification or regression based on task.
Example:
• For 10-class image classification, the final FC layer has 10 output nodes (with softmax).

IV. The Interleaving between Layers:


In Convolutional Neural Networks (CNNs), interleaving between layers refers to the strategic arrangement of
different layer types—typically convolutional, activation (ReLU), and pooling layers—in a repeating sequence
throughout the network. This interleaving is crucial because it allows the network to progressively extract and
condense features at different levels of abstraction. For instance, a convolutional layer captures spatial patterns,
the ReLU layer introduces non-linearity to model complex relationships, and the pooling layer reduces spatial
dimensions while retaining essential features. Repeating this pattern (Conv → ReLU → Pool) multiple times
builds a deep, hierarchical model that can recognize intricate visual patterns.
Why Interleaving is Important
• Feature Hierarchy: Low- to high-level feature learning.
• Efficiency: Reduces data size before reaching dense layers.
• Modularity: Allows easy stacking and tuning of layers.
• Performance: Balances learning capacity and computational cost.
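A minimal Keras sketch of this repeating Conv → ReLU → Pool pattern (the filter counts, input shape, and class count are illustrative assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3,3), padding='same', activation='relu', input_shape=(64,64,3)),  # Conv + ReLU
    MaxPooling2D((2,2)),                                   # Pool: halves spatial size
    Conv2D(64, (3,3), padding='same', activation='relu'),  # repeat with more filters
    MaxPooling2D((2,2)),
    Flatten(),                                             # unroll feature maps into a vector
    Dense(10, activation='softmax')                        # final classification layer
])
model.summary()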

Local Response Normalization


In some architectures, layers like
Local Response Normalization (LRN) are inserted between convolution and activation layers. LRN mimics
biological neural behavior by normalizing the output of neurons based on the activity of their neighboring
neurons, encouraging competition and enhancing generalization. It is particularly useful in early layers to
sharpen activations and improve learning by emphasizing neurons with strong responses and suppressing
weaker ones. Although LRN is less commonly used in modern architectures (often replaced by Batch
Normalization), it played a key role in early CNNs like AlexNet.
LRN normalizes the output of a neuron based on the activity of its neighboring neurons. This means if one neuron has a very
high activation, it suppresses nearby neuron activations, creating a form of local competition.
Mathematical Expression of LRN

b^i_{x,y} = a^i_{x,y} / ( k + α · Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})² )^β

Where:
• a^i_{x,y}: input activation at location (x, y), channel i
• b^i_{x,y}: normalized output
• n: number of adjacent channels to normalize over
• N: total number of channels
• α, β, k: hyperparameters controlling normalization
Typical Values for Hyperparameters
• α = 10^-4
• β = 0.75
• k = 2
• n = 5
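TensorFlow exposes this operation directly; a minimal sketch follows (note that tf.nn.local_response_normalization takes the half-window depth_radius, so depth_radius=2 corresponds to n = 5, and bias corresponds to k; the input tensor shape is illustrative):

import tensorflow as tf

x = tf.random.normal([1, 8, 8, 16])   # one 8x8 feature map with 16 channels
y = tf.nn.local_response_normalization(
    x,
    depth_radius=2,   # half of the channel window (n = 2*2 + 1 = 5)
    bias=2.0,         # k
    alpha=1e-4,       # α
    beta=0.75,        # β
)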
Benefits of LRN
• Encourages feature diversity by suppressing less useful activations.
• Improves generalization by reducing overfitting.
• Introduces lateral inhibition, mimicking biological neurons.

V. Training a Convolutional Network:


Training a Convolutional Neural Network (CNN) involves several key steps, including forward propagation, loss
calculation, backpropagation, and optimization. The process can be broken down as follows:
1. Forward Propagation:
In this step, an input image is passed through the network layer by layer (convolutional, activation, pooling,
etc.). The image is progressively transformed by the filters in the convolutional layers, with activation functions
(like ReLU) applied to introduce non-linearity. Pooling layers reduce the spatial dimensions, and fully connected
layers process the final features. The output layer then produces a prediction (e.g., class probabilities for
classification tasks).

2. Loss Calculation:
Once the network generates a prediction, the loss function is used to measure how far the network's
prediction is from the true label (ground truth). Common loss functions include:
• Cross-Entropy Loss: Used for classification tasks, where the goal is to minimize the difference between
predicted probabilities and actual labels.
• Mean Squared Error: Used for regression tasks, where the goal is to minimize the difference between
predicted values and actual values.
The loss function quantifies the network's error, and this value is essential for updating the model's
parameters.
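As a small illustration, a Keras sketch of computing cross-entropy loss (the predicted probabilities are made-up numbers):

import tensorflow as tf

y_true = [1]                          # true class index
y_pred = [[0.1, 0.7, 0.2]]            # predicted class probabilities
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
loss = loss_fn(y_true, y_pred)        # -log(0.7) ≈ 0.357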

3. Backpropagation:
After calculating the loss, backpropagation is employed to adjust the weights of the network. The gradient of
the loss with respect to each weight is computed using the chain rule of calculus. This step involves:
• Calculating gradients: For each weight, the network computes the partial derivative of the loss with
respect to that weight.
• Propagating the error: These gradients are then propagated back through the network to update each
weight accordingly.
Backpropagation ensures that the model learns from its errors, improving its predictions over time.

4. Weight Update (Optimization):


After backpropagation, an optimization algorithm is used to update the network’s weights. The most common
optimization algorithms include:
• Stochastic Gradient Descent (SGD): Updates weights by moving in the opposite direction of the
gradient, with a learning rate that determines the step size.
• Adam Optimizer: An advanced optimization algorithm that adapts the learning rate for each weight,
using both the first and second moments of the gradient to improve convergence.
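As a sketch, attaching an optimizer and loss to a Keras model like the one built in the interleaving example (the learning rate is an illustrative value):

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])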

5. Epochs and Iterations:


The training process typically runs for several epochs (full passes through the entire dataset), with each epoch
consisting of multiple iterations (one pass over a batch of training data). The model weights are updated after
each batch, improving the model's performance progressively. Over time, the loss decreases as the network
becomes better at making predictions.
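Continuing the sketch with hypothetical training arrays x_train and y_train, each epoch below makes one full pass over the data in batches of 32:

model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1)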

6. Regularization Techniques:
To prevent overfitting (when the model learns the training data too well but fails to generalize to new data),
regularization techniques are used during training:
• Dropout: Randomly drops units (nodes) during training to prevent reliance on specific neurons.
• Data Augmentation: Increases the diversity of the training data by applying random transformations
(e.g., rotation, scaling) to the images.
• Batch Normalization: Normalizes the activations of each layer to stabilize training and speed up
convergence.
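A minimal Keras sketch showing where these techniques typically sit in a model (all rates, sizes, and shapes are illustrative assumptions):

from tensorflow.keras import layers, models

reg_model = models.Sequential([
    layers.RandomRotation(0.1, input_shape=(64,64,3)),  # data augmentation: random rotations
    layers.RandomZoom(0.1),                             # data augmentation: random zoom
    layers.Conv2D(32, (3,3), padding='same', activation='relu'),
    layers.BatchNormalization(),                        # stabilizes activations, speeds convergence
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),                                # randomly drops units during training
    layers.Dense(10, activation='softmax')
])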
Training a CNN is an iterative process that requires fine-tuning of parameters like learning rate, batch size, and
the number of layers to achieve optimal performance. By repeating the forward propagation, loss calculation,
backpropagation, and weight update steps, the network learns to classify or detect objects with high accuracy.

Q) How do you determine the number of filters in each convolutional layer?


Filters (or kernels) are small matrices used to extract features like edges, textures, shapes, etc., from the input image. Each filter
detects a specific kind of feature. More filters = the ability to detect more and richer features.
The number of filters in each convolutional layer is a hyperparameter — meaning it’s a design choice that can be adjusted based
on experimentation and domain knowledge.
Determining the number of filters in a convolutional neural network (CNN) involves balancing feature extraction capability,
computational cost, and overfitting risk. The number of filters controls how many distinct patterns the layer can learn from the
input. There is no fixed rule, but the choice is generally based on the following principles:
1. Layer Depth and Feature Hierarchy
• Shallow layers (closer to the input) typically use fewer filters (e.g., 16, 32) to capture basic patterns like edges or textures.
• Deeper layers require more filters (e.g., 64, 128, 256) to capture complex abstract features like shapes, objects, or parts.
This increasing trend reflects the hierarchical nature of feature learning.
2. Empirical Rules and Standard Practices
A common heuristic is to double the number of filters after each pooling layer, for example:
Conv Layer 1: 32 filters
Conv Layer 2: 64 filters
Conv Layer 3: 128 filters
This strategy balances feature richness with computational feasibility.
3. Input Data Complexity
• Simple tasks (e.g., digit recognition): fewer filters (8–32) may suffice.
• Complex datasets (e.g., natural images or medical images): need more filters (64–512+) to extract detailed features.
4. Model Capacity vs. Overfitting
• More filters = more learnable parameters = higher capacity.
• If the dataset is small, too many filters may lead to overfitting.
• Regularization techniques (like dropout) can help mitigate this if higher filter counts are needed.
5. Hardware and Time Constraints
More filters increase computation time and memory usage. On resource-limited systems (like edge devices or mobile
applications), fewer filters are preferable.
6. Tuning and Validation
Ultimately, the number of filters should be fine-tuned based on:
• Validation accuracy
• Training speed
• Overfitting indicators (validation loss)
Techniques like grid search, random search, or Bayesian optimization can be used to automate this process.
7. Research Insight or Reference-Based Design
In many research papers, the choice of filters is inspired by successful models (e.g., VGGNet, ResNet), or based on prior
empirical studies. For instance:
• VGG16: starts with 64 filters and increases to 512.
• ResNet: uses residual blocks with increasing filter depth.

Q) Define convolutional neural network (CNN), and how does it differ from other types of neural networks?
A Convolutional Neural Network (CNN) is a specialized type of deep neural network primarily used for image-related tasks
such as image classification, object detection, and facial recognition. CNNs are designed to automatically and adaptively learn
spatial hierarchies of features from input images using convolutional layers, pooling layers, and fully connected layers.
The core concept behind CNNs is the use of convolutional operations — small, learnable filters that slide over the input image
to detect local patterns (like edges, textures, or shapes). This allows CNNs to preserve spatial relationships between pixels and
reduce the number of parameters compared to fully connected networks.
How CNNs Differ from Other Neural Networks
Feature | CNNs | Traditional Neural Networks (e.g., MLPs)
Input Type | Primarily images or spatial data | Vector data
Layer Type | Convolutional + Pooling + Dense | Fully connected (dense) layers only
Parameter Sharing | Yes (shared filters) | No (each weight is unique)
Local Connectivity | Yes (filters scan small regions) | No (each neuron connected to every input)
Number of Parameters | Much fewer (due to shared weights) | Very large (especially with high-dimensional data)
Best Use Cases | Computer vision tasks | General tabular or simple pattern recognition
Translation Invariance | Yes (learns features regardless of position) | No

Key Advantages of CNNs


• Efficient Feature Extraction: Automatically detects patterns without manual feature engineering.
• Reduced Complexity: Fewer parameters make CNNs less prone to overfitting.
• Scalability: Can be scaled to very deep architectures (e.g., ResNet, VGG).
A CNN is a neural network architecture particularly suited for spatial data like images. Unlike traditional neural networks that
treat inputs as flat vectors, CNNs preserve the spatial structure and leverage local patterns through convolution and pooling
operations, making them extremely powerful in computer vision applications.

Q) Purpose of Using CNNs in Deep Learning


1. Efficient Feature Extraction
CNNs are specifically designed to automatically and efficiently extract spatial features (like edges, textures, shapes) from
image data using convolutional filters. This eliminates the need for manual feature engineering and makes them highly effective in
computer vision tasks.
2. Spatial Hierarchy Learning
CNNs learn a hierarchical representation of data — lower layers capture basic features (e.g., edges), while deeper layers capture
complex patterns (e.g., object parts or full objects). This mimics the way the human visual cortex processes images.
3. Parameter Efficiency through Weight Sharing
Unlike fully connected networks, CNNs reuse the same filter (kernel) across the entire input, drastically reducing the number
of parameters. This makes them more efficient and less prone to overfitting.
4. Translation Invariance
By using pooling and shared weights, CNNs gain the ability to recognize objects regardless of their position in the image — a
property known as translation invariance.
5. Best Suited for Image and Video Processing
CNNs are the go-to architecture for tasks involving visual data, such as:
• Image classification
• Object detection
• Face recognition
• Image segmentation
----------------------------------------------------------------------------------------------------------------------------------------------------------
