1 Introduction
Convolutional Neural Networks (CNNs) are a class of deep neural networks designed
specifically for tasks involving structured data such as images, video sequences, and time-
series data. Their architecture mimics the connectivity pattern of neurons in the human
brain, where small clusters of neurons respond to stimuli in their receptive field.
CNNs are widely used in tasks such as object detection.
2 Deep Learning vs Traditional Machine Learning
Deep learning and traditional machine learning differ mainly in:
• Scalability: Deep learning models keep improving as more data becomes available, whereas traditional models tend to plateau.
• Performance:
– Traditional ML models like SVM and Random Forests perform well on small datasets.
– Deep learning surpasses them in tasks involving unstructured data such as images and videos.
Figure 2: Convolution in 2D
3 Steps in a CNN
1. Convolution: Applies filters (kernels) to extract features such as edges, textures,
or complex patterns.
2. ReLU: Introduces non-linearity by applying f(x) = max(0, x), enabling the model to learn non-linear decision boundaries.
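A minimal sketch of these two steps, assuming PyTorch (the layer sizes and the random input are illustrative):

import torch
import torch.nn as nn

# One convolution + ReLU stage: 8 filters of size 3x3 over a grayscale input
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
x = torch.randn(1, 1, 28, 28)        # a batch of one 28x28 image
features = torch.relu(conv(x))       # convolution, then f(x) = max(0, x)
print(features.shape)                # torch.Size([1, 8, 26, 26])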
4 Convolution in 2D
Convolution is a mathematical operation that processes an input matrix using a smaller
matrix called a kernel (or filter). It involves sliding the kernel over the input matrix
and computing the weighted sum of the overlapping elements. The result is stored in an
output matrix. To understand the formula:
1. For each position (i, j) in the output matrix Y , place the kernel K over the input
matrix X, aligning the top-left corner of the kernel with the position (i, j) in X.
2. Multiply each element of the kernel K[m, n] by the corresponding element in the
input matrix X[i + m, j + n].
3. Sum all these multiplied values to compute a single number, which becomes the
value of Y [i, j].
4. Slide the kernel to the next position and repeat the process until all positions in Y
are filled.
In simpler terms, convolution applies a filter (kernel) to the input to detect patterns,
reduce dimensionality, or create new representations of the data. Mathematically, it can
be written as:
Y[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X[i + m, j + n] \cdot K[m, n]
Where:
• X: The input matrix, which represents the data to be processed (e.g., an image or
feature map).
• K: The kernel or filter, a small matrix containing weights used to extract specific
patterns from the input.
• Y : The output matrix, which stores the results of the convolution operation.
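As a direct translation of this formula into code, here is a minimal NumPy sketch (the function name conv2d_valid is mine):

import numpy as np

def conv2d_valid(X, K):
    """Valid 2D convolution as defined above (no padding, stride 1).

    Note: like most deep-learning libraries, this slides the kernel
    without flipping it, i.e. it is technically cross-correlation."""
    k = K.shape[0]
    out_h = X.shape[0] - k + 1
    out_w = X.shape[1] - k + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # weighted sum of the k x k window aligned at (i, j)
            Y[i, j] = np.sum(X[i:i+k, j:j+k] * K)
    return Y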
Step-by-step Calculation (for one 3 × 3 window):
(1 · 1) + (1 · 0) + (1 · 1) + (0 · 0) + (1 · 1) + (1 · 0) + (0 · 1) + (0 · 0) + (1 · 1)
= 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 = 4
Thus, the value of the convolution operation at this position is 4.
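Feeding the patch and kernel implied by these products into the conv2d_valid sketch above reproduces the result:

import numpy as np

X_patch = np.array([[1, 1, 1],
                    [0, 1, 1],
                    [0, 0, 1]])
K = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])
print(conv2d_valid(X_patch, K))  # [[4.]]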
4.2 Padding
Padding ensures the output size is preserved or adjusted based on the application:
• Valid Padding: No padding is applied, reducing output size.
• Same Padding: Adds zeros around the input to maintain the output size.
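The two schemes can be compared with np.pad and the conv2d_valid sketch above (the shapes here assume a 5 × 5 input and a 3 × 3 kernel):

import numpy as np

X = np.random.rand(5, 5)
K = np.random.rand(3, 3)

valid = conv2d_valid(X, K)             # no padding: output shrinks
same = conv2d_valid(np.pad(X, 1), K)   # zero-pad by (k - 1) / 2 = 1
print(valid.shape, same.shape)         # (3, 3) (5, 5)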
4.3 Stride
Stride refers to the step size with which the kernel moves. It impacts:
• Output Size: Larger strides result in smaller output dimensions.
• Computational Cost: Higher strides reduce computation at the cost of detail
loss.
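The interaction of padding and stride with the input size is captured by the standard output-size formula: for an n × n input, a k × k kernel, padding p, and stride s, each output dimension is

n_{\text{out}} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1

For example, a 7 × 7 input with a 3 × 3 kernel, no padding, and stride 2 gives \lfloor (7 - 3)/2 \rfloor + 1 = 3.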
5 Activation Functions
Activation functions introduce non-linearity into the model. Common types:
• Sigmoid: Squashes input into (0, 1). Useful for probabilities.
f(x) = \frac{1}{1 + e^{-x}}
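A one-line NumPy sketch of the sigmoid:

import numpy as np

def sigmoid(x):
    # squashes any real-valued input into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # approx. [0.119 0.5 0.881]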
6 Region of Interest (ROI)
Focusing on a region of interest offers two main benefits:
• Focus on Relevant Data: Not all parts of an image are equally important. For example, in a face-recognition task the face region carries far more information than the background.
• Improved Accuracy: Concentrating on the ROI reduces noise and irrelevant data, which can confuse the model.
7 Limitations of 2D Convolution
While 2D convolution is highly effective, it comes with certain limitations:
• Limited Receptive Field: A fixed kernel size may miss patterns that span a larger area, such as global textures or large-scale structures.
• Local Connectivity: Each output value depends only on a small neighborhood of the input. As a result, 2D convolutions may not capture relationships between distant parts of an image, which is essential for understanding the overall structure.
• Computational Cost: Convolving large inputs with many filters is expensive. This can lead to slower processing and the need for more powerful hardware.
• Boundary Effects: Pixels near the image border are covered by fewer kernel positions. Padding partially mitigates this issue, but it introduces artificial data (zeros or other values) into the computation.
• Lack of Rotation and Scale Invariance: Features such as edges may not be detected properly if the object is rotated or scaled.
8 Conclusion
Convolutional Neural Networks (CNNs) are a cornerstone of modern deep learning, ex-
celling in tasks involving structured data such as images and videos. Their architecture,
inspired by the human brain’s visual processing, allows them to learn hierarchical features
from raw data without manual feature engineering.
9 Appendix
9.1 Loss Functions
In supervised learning, the loss function serves as the guiding metric for model optimization. It calculates the error between the predicted output (y_pred) and the ground truth (y_true).
Examples include Mean Squared Error (MSE), typically used for regression, and Cross-Entropy, typically used for classification.
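As a minimal NumPy sketch of both (the sample values are illustrative):

import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # ground-truth labels
y_pred = np.array([0.9, 0.2, 0.7])   # model predictions

# Mean Squared Error: average squared difference
mse = np.mean((y_true - y_pred) ** 2)                 # approx. 0.047

# Binary cross-entropy: heavily penalizes confident wrong predictions
bce = -np.mean(y_true * np.log(y_pred)
               + (1 - y_true) * np.log(1 - y_pred))   # approx. 0.228

print(mse, bce)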