Unit-V
However, for black-and-white images there is only one channel, and the concept is the same.
Here, we have considered an input image of size 28x28x3 pixels. If we feed this into a fully
connected neural network, every single neuron in the first hidden layer already needs about
2,352 weights (28 × 28 × 3).
Any generic input image will have at least 200x200x3 pixels. The number of weights per neuron in the
first hidden layer then becomes a whopping 120,000. If this is just the first hidden layer,
imagine the number of parameters needed to process an entire complex image-set.
This leads to over-fitting and isn't practical. Hence, we cannot make use of fully
connected networks.
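As a quick back-of-the-envelope check of these numbers (a minimal sketch, counting the weights feeding a single neuron of a fully connected first hidden layer):

```python
# weights feeding ONE neuron of the first fully connected hidden layer
weights_small_image   = 28 * 28 * 3      # = 2,352 for a 28x28 RGB image
weights_typical_image = 200 * 200 * 3    # = 120,000 for a 200x200 RGB image
print(weights_small_image, weights_typical_image)
```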
The whole network still has a loss function, and all the tips and tricks we developed
for ordinary neural networks still apply to Convolutional Neural Networks.
These neurons learn how to convert input signals (e.g. picture of a cat) into
corresponding output signals (e.g. the label “cat”), forming the basis of automated
recognition.
Let’s take the example of automatic image recognition. The process
of determining whether a picture contains a cat involves an activation function. If
the picture resembles prior cat images the neurons have seen before, the
label “cat” would be activated.
Hence, the more labeled images the neurons are exposed to, the better they learn
how to recognize other, unlabelled images. We call this the process
of training neurons.
There are four layered concepts we should understand in Convolutional Neural Networks:
1. Convolution,
2. ReLu,
3. Pooling and
4. Full Connectedness (Fully Connected Layer).
Example of CNN:
Consider the image below:
Here, there are multiple renditions of X's and O's. This makes it tricky for the computer
to recognize them. The idea is that if the input signal looks like previous images the
network has seen before, the "image" reference signal will be mixed into,
or convolved with, the input signal. The resulting output signal is then passed on to
the next layer.
So, the computer understands every pixel. In this case, the white pixels are said to
be -1 while the black ones are 1. This is simply how we have chosen to represent
the pixels for a basic binary classification.
Now, if we simply searched and compared the pixel values of a standard image against
another rendition of 'X', we would get a lot of mismatched pixels.
We take small patches of the pixels called filters and try to match them in the
corresponding nearby locations to see if we get a match. By doing this, the
Convolutional Neural Network gets a lot better at seeing similarity than directly
trying to match the entire image.
Convolution of an Image
Convolution has the nice property of being translation invariant. Intuitively, this
means that each convolution filter represents a feature of interest (e.g., pixels in
letters), and the Convolutional Neural Network algorithm learns
which features make up the resulting reference (i.e., the letter).
Consider the above image – as you can see, we are done with the first 2 steps. We
considered a feature image (filter) and lined it up with a patch of the existing image.
We multiplied each image pixel by the corresponding feature pixel, and the products
are stored in another buffer feature image.
With this image, we completed the last 2 steps. We added the values, which gave us
the sum. We then divide this number by the total number of pixels in the feature
image. When that is done, the final value obtained is placed at the center of
the filtered image as shown below:
Now, we can move this filter around and do the same at any pixel in the image.
For better clarity, let’s consider another example:
As you can see, here after performing the first 4 steps we have the value 0.55! We
take this value and place it in the image as explained before. This is done in the
following image:
Similarly, we move the feature to every other position in the image and see how the
feature matches that area. So after doing this, we will get the output as:
Here we considered just one filter. Similarly, we will perform the same convolution
with every other filter to get the convolution output of each filter.
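As a concrete illustration, here is a minimal NumPy sketch of this filter-matching step. The 3×3 diagonal feature is a hypothetical example, and the pixel values are assumed to be -1 (white) or 1 (black) as described above:

```python
import numpy as np

def filter_match(image, feature):
    """Slide the feature over the image; at each position multiply the image
    patch by the feature pixel-by-pixel, add the products, and divide by the
    number of pixels in the feature."""
    ih, iw = image.shape
    fh, fw = feature.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + fh, c:c + fw]
            out[r, c] = np.sum(patch * feature) / feature.size
    return out

# hypothetical 3x3 diagonal feature taken from an 'X'
feature = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])
# score_map = filter_match(image, feature)   # image: 2-D array of -1/1 pixels
```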
The output signal strength does not depend on where the features are located, but
simply on whether the features are present. Hence, a letter could be sitting
in different positions and the Convolutional Neural Network algorithm would still
be able to recognize it.
ReLU Layer
ReLU is an activation function. But, what is an activation function?
The Rectified Linear Unit (ReLU) transform function only activates a node if the input
is above a certain threshold (zero): while the input is below zero, the output is zero,
and once the input rises above the threshold, the output has a linear relationship with
the input.
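In code, ReLU is just an element-wise maximum with zero; a minimal NumPy sketch:

```python
import numpy as np

def relu(x):
    # negative values become zero; positive values pass through unchanged
    return np.maximum(0, x)

print(relu(np.array([-0.55, 0.11, -1.0, 0.33])))   # [0.   0.11 0.   0.33]
```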
The main aim is to remove all the negative values from the convolution output. All the
positive values remain the same, but all the negative values get changed to zero as
shown below:
So after we process this particular feature we get the following output:
Now, similarly we do the same process to all the other feature images as well:
Inputs from the convolution layer can
be "smoothed" to reduce the sensitivity of
the filters to noise and variations. This smoothing process is
called subsampling, and it can be achieved by taking averages or taking
the maximum over a sample of the signal.
Pooling Layer
In this layer we shrink the image stack into a smaller size. Pooling is done after
passing through the activation layer. We do this by implementing the following 4
steps:
1. Pick a window size (usually 2 or 3),
2. Pick a stride (usually 2),
3. Walk the window across the filtered image and
4. From each window, take the maximum value.
Let us understand this with an example. Consider performing pooling with a window
size of 2 and stride being 2 as well.
So in this case, we took window size to be 2 and we got 4 values to choose from.
From those 4 values, the maximum value there is 1 so we pick 1. Also, note that
we started out with a 7×7 matrix but now the same matrix after pooling came down
to 4×4.
But we need to move the window across the entire image. The procedure is
exactly the same as above, and we need to repeat it for the entire image.
Do note that this is for one filter. We need to do it for 2 other filters as well. This is
done and we arrive at the following result:
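A minimal NumPy sketch of this max-pooling step with window size 2 and stride 2, assuming a 2-D feature map such as the 7×7 matrix above (the last window is allowed to hang over the edge, which is why 7×7 shrinks to 4×4):

```python
import numpy as np

def max_pool(x, window=2, stride=2):
    h, w = x.shape
    out_h = -(-h // stride)   # ceiling division: 7 -> 4
    out_w = -(-w // stride)
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # take the maximum value inside the current window
            patch = x[r * stride:r * stride + window,
                      c * stride:c * stride + window]
            out[r, c] = patch.max()
    return out
```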
Well the easy part of this process is over. Next up, we need to stack up all these
layers!
But can we further reduce the image from 4×4 to something smaller?
Yes, we can! We need to perform the three operations (convolution, ReLU, and pooling) in
another iteration after the first pass. So after the second pass we arrive at a 2×2 matrix
as shown below:
The last layers in the network are fully connected, meaning that neurons of
preceding layers are connected to every neuron in subsequent layers.
This mimics high level reasoning where all possible pathways from
the input to output are considered.
Also, the fully connected layer is the final layer, where the classification actually happens.
Here we take our filtered and shrunk images and put them into one single list (a vector) as
shown below:
So next, when we feed in an 'X' or an 'O', there will be some elements in the vector that
will be high. Consider the image below: as you can see, for 'X' certain
elements are high, and similarly, for 'O' different elements
are high:
Well, what did we understand from the above image?
When the 1st, 4th, 5th, 10th and 11th values are high, we can classify the image
as 'X'. The concept is similar for the other letters as well – when
certain values are arranged the way they are, they can be mapped to
an actual letter or a number which we require. Simple, right?
Well, it is really easy. We just added the values which we found to be high (1st,
4th, 5th, 10th and 11th) from the vector table of 'X' and got the sum 5. We
did the exact same thing with the input image and got the value 4.56.
Dividing 4.56 by 5 gives a match probability of 0.91 for 'X'. Doing the same with the
vector table of 'O' gives an output of 0.51. Since 0.51 is less than 0.91, the input
image is classified as 'X'.
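A small sketch of this voting step; the "high" positions and values below are placeholders taken from the example above:

```python
# positions that are high in the reference vector for 'X'
# (1st, 4th, 5th, 10th and 11th entries, written with 0-based indexing)
x_high_positions = [0, 3, 4, 9, 10]

def match_score(candidate_vector, high_positions):
    # add the candidate's values at the reference's high positions and divide
    # by the reference sum (each reference high value is 1, so len() works)
    return sum(candidate_vector[i] for i in high_positions) / len(high_positions)

# e.g. if the input image's values at those positions sum to 4.56,
# the score is 4.56 / 5 = 0.91, higher than the 0.51 score against 'O',
# so the image is classified as 'X'
```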
🔹 1. AlexNet (2012)
✅ Main Idea:
AlexNet was the first deep CNN to achieve high performance on a large-scale
dataset (ImageNet). It proved that deep learning could outperform traditional
computer vision methods.
🧠 Architecture Design:
Input size: 227×227×3 (image)
Layers: 8 total
o 5 convolutional layers
o 3 fully connected layers
Uses:
o ReLU activation (instead of sigmoid/tanh for faster training)
o Max pooling to reduce spatial size
o Dropout to reduce overfitting in FC layers
o Data augmentation
Trained on 2 GPUs in parallel
💡 Innovations:
ReLU → faster training
Dropout → prevents overfitting
GPU-based training → reduced training time
⚖️Pros and Cons:
✅ Powerful and fast for its time
❌ Very large number of parameters (~60 million)
❌ High memory usage
🔹 2. ZFNet (2013)
✅ Main Idea:
Improved AlexNet by tweaking hyperparameters and made the model more
interpretable using deconvolutional visualizations.
🧠 Architecture Design:
Similar to AlexNet in structure (8 layers)
Key change:
o First convolutional layer's filter size reduced from 11×11 to 7×7
o Reduced stride for finer feature maps
Visualized intermediate feature maps to understand what CNN is
learning
💡 Innovations:
DeconvNet to visualize what filters learn
Adjusted filter size and stride to improve accuracy
⚖️Pros and Cons:
✅ Better than AlexNet with minor changes
❌ Still limited by depth
🔹 3. VGGNet (2014)
✅ Main Idea:
VGG showed that deeper networks (16–19 layers) improve performance. It
introduced a very simple and uniform architecture using only 3×3
convolutional filters.
🧠 Architecture Design:
Input: 224×224×3
VGG-16:
o 13 convolutional layers
o 3 fully connected layers
o 2×2 max pooling after every few conv layers
Uses 3×3 filters throughout the network
Same padding to maintain size
💡 Innovations:
Uniform architecture: easier to implement and scale
Replaced large filters with stacked 3×3 ones → better feature extraction
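For intuition, a quick parameter comparison showing why two stacked 3×3 convolutions are cheaper than a single 5×5 convolution covering the same receptive field (the channel count is a hypothetical example):

```python
C = 64  # hypothetical number of input and output channels

params_one_5x5 = 5 * 5 * C * C       # 102,400 weights (ignoring biases)
params_two_3x3 = 2 * 3 * 3 * C * C   #  73,728 weights, same 5x5 receptive field
```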
⚖️Pros and Cons:
✅ Great accuracy
✅ Easy to use for transfer learning
❌ Huge number of parameters (≈138M)
❌ Computationally expensive
🔹 5. ResNet (2015)
✅ Main Idea:
Very deep networks suffer from vanishing gradients, so ResNet introduced
skip (residual) connections that let gradients flow directly.
🧠 Architecture Design:
Residual block (a code sketch follows this list):
Output = F(x) + x
where F(x) is some transformation (e.g., 2 conv layers)
Many versions:
o ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152
Input size: 224×224×3
Uses Batch Normalization, ReLU
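A minimal PyTorch sketch of the residual idea (an illustrative block, not the exact ResNet-50 bottleneck design):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Output = F(x) + x, where F is two 3x3 convolutions with batch norm."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection lets gradients flow past F
```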
💡 Innovations:
Residual learning to allow very deep models
First network to train 152 layers effectively
⚖️Pros and Cons:
✅ Solved vanishing gradient
✅ Enabled ultra-deep CNNs
✅ Excellent transfer learning performance
❌ Slightly complex to implement
📊 Summary Table
| Model | Year | Layers | Main Idea | Key Innovation | Params |
|---|---|---|---|---|---|
| AlexNet | 2012 | 8 | Deep CNN + GPU + ReLU | ReLU, Dropout, Data Augmentation | ~60M |
| ZFNet | 2013 | 8 | Improved AlexNet + Visualization | DeconvNet, smaller filters | ~62M |
| VGG-16 | 2014 | 16 | Deeper, uniform architecture | Stacked 3×3 conv layers | ~138M |
| GoogLeNet | 2014 | 22 | Multi-scale convs (Inception) | 1×1 conv, no FC layers | ~5M |
| ResNet-50 | 2015 | 50 | Very deep, residual blocks | Skip connections | ~25M |
🔍 What Are Pretrained Models?
Pretrained models are deep learning models that have already been trained
on large datasets (like ImageNet) and are available for reuse. Instead of
training a new model from scratch (which is time-consuming and requires a lot
of data), you can use or fine-tune these pretrained models for your specific
task.
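A minimal sketch of reusing a pretrained model with PyTorch/torchvision (assuming torchvision ≥ 0.13 and a hypothetical 10-class task):

```python
import torch.nn as nn
from torchvision import models

# ResNet-50 pretrained on ImageNet; weights are downloaded automatically
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# freeze the pretrained feature extractor (feature-extraction style transfer)
for param in model.parameters():
    param.requires_grad = False

# replace the final fully connected layer for the new task
model.fc = nn.Linear(model.fc.in_features, 10)
```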
🎯 Benefits of Using Pretrained Models
| Advantage | Explanation |
|---|---|
| 🔧 Saves time | You skip the training-from-scratch process |
| 🧠 Requires less data | Works well even with small datasets (using fine-tuning or feature extraction) |
| 🚀 High performance | Based on training on huge datasets (like ImageNet with 1.2M images) |
| 🔁 Transfer learning | You can transfer knowledge to new tasks (e.g., classification, detection) |
Convolutional Autoencoder (CAE):
A Convolutional Autoencoder (CAE) is a type of autoencoder specifically
designed to work with image data. Instead of using fully connected layers like
traditional autoencoders, CAEs use convolutional and pooling layers to learn
spatial hierarchies and preserve local features in images.
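A minimal PyTorch sketch of a convolutional autoencoder for 28×28 grayscale images (the layer sizes are illustrative assumptions):

```python
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: convolution + pooling shrink the image while keeping spatial structure
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        # decoder: transposed convolutions upsample back to the input size
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, kernel_size=2, stride=2), nn.ReLU(),     # 7x7 -> 14x14
            nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2), nn.Sigmoid(),  # 14x14 -> 28x28
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```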
2. 🎯 Object Detection
What it means:
Detects and locates different objects in an image. It answers:
What is in the image?
Where is it?
How CNN helps:
CNNs scan the image and detect key features at different locations and sizes.
They help draw bounding boxes around each object.
Example:
In a traffic camera image, CNN detects cars, bikes, and pedestrians and draws
boxes around them.
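As an illustration, a pretrained CNN-based detector can be used off the shelf; a minimal sketch with torchvision (assuming torchvision ≥ 0.13), where the random tensor stands in for a real traffic-camera frame:

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

# Faster R-CNN with a ResNet-50 backbone, pretrained on COCO
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

image = torch.rand(3, 480, 640)           # stand-in for a real image (C, H, W in [0, 1])
with torch.no_grad():
    prediction = model([image])[0]        # dict with 'boxes', 'labels', 'scores'
```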
🔄 Summary Table
| Application | What CNN Does | Example Use Case |
|---|---|---|
| Content-Based Image Retrieval | Finds similar images based on content | Fashion, medical, artwork search |
| Object Detection | Identifies and locates multiple objects in images | Self-driving cars, surveillance |
| Natural Language Processing | Understands and classifies text or language patterns | Sentiment analysis, spam detection |
| Sequence Learning | Finds patterns in sequential data (1D signals) | ECG analysis, DNA sequence prediction |