Convolution Neural Networks U2
CNN
CNN stands for Convolutional Neural Network, a specialized neural network for processing data that has a grid-like input shape, such as the 2D pixel matrix of an image. CNNs are typically used for image detection and classification: we run a CNN over the 2D matrix of pixels to recognize or classify the image, for example to identify whether an image shows a human being, a car, or just digits in an address.
Convolution Operation :
The process of applying a convolution kernel to the input data is referred to as the convolution operation. This
operation involves sliding the kernel across the input, computing the dot product at each position, and generating
an output feature map.
Features :
In the context of CNNs, features refer to the meaningful patterns or characteristics extracted from the input data
by the convolutional layers. These features represent different aspects of the input, such as edges, textures, or
more complex structures. Each convolutional kernel is responsible for detecting specific features in the input
data.
In the context of Convolutional Neural Networks (CNNs), a “filter” is another term used interchangeably with
“convolution kernel” or “convolution matrix.” It refers to a small matrix of weights that is convolved with the
input data to perform feature extraction. Filters are typically small, square matrices (e.g., 3x3 or 5x5) containing
learnable parameters. These parameters are adjusted during the training process through backpropagation to
capture relevant patterns or features in the input data. When a filter is applied to the input data using the
convolution operation, it slides or convolves across the input, computing the dot product between its weights
and the corresponding pixels of the input data. This process generates a feature map that highlights specific
patterns or characteristics present in the input.
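As a minimal sketch of this sliding dot-product operation, here is a NumPy implementation of 2D cross-correlation; the 5x5 image and the 3x3 edge-detector values are made up for illustration, not taken from the text.

import numpy as np

def cross_correlate2d(image, kernel):
    # Slide the kernel over the image and take the dot product at each position
    H, W = image.shape
    kh, kw = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 single-channel image
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                  # a simple vertical-edge detector
print(cross_correlate2d(image, kernel).shape)       # (3, 3)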
Like neural networks, CNNs also draw motivation from the brain. We use the object recognition model proposed by Hubel and Wiesel. The visual area V1 consists of simple cells and complex cells. Simple cells help with feature detection, and complex cells combine several such local features from a small spatial neighborhood. Spatial pooling helps with translation-invariant features.
When we see a new image, we scan it, perhaps left to right and top to bottom, to understand the different features of the image. Our next step is to combine the different local features that we scanned to classify the image. This is exactly how a CNN works. Convolution is a mathematical operation that takes an input I and an argument, a kernel K, and produces an output that expresses how the shape of one is modified by the other.
Let's explain this in terms of an image.
We have an image "x", which is a 2D array of pixels with different color channels (Red, Green and Blue: RGB), and we have a feature detector or kernel "w"; the output we get after applying this mathematical operation is called a feature map.
The way we implement this is through the convolutional layer. The convolutional layer is the core building block of a CNN; it performs feature detection. The kernel K is a set of learnable filters, each small spatially compared to the image but extending through the full depth of the input image.
An easy way to understand this: if you were a detective and came across a large picture in the dark, how would you identify the image?
You would use your flashlight and scan across the entire image. This is exactly what we do in a convolutional layer. The kernel K, which is a feature detector, is the equivalent of the flashlight on image I, and we try to detect features and create multiple feature maps to help us identify or classify the image.
We have multiple feature detectors to help with things like edge detection, identifying different shapes, bends or different colors, etc.
Let's take an image of 5 by 5 pixels with 3 channels (RGB) and a feature detector of 3 by 3 with 3 channels (RGB), and scan the feature detector over the image with a stride of 1.
The feature detector moves over the image one stride at a time.
What will be the dimension of the output matrix or feature map when I apply a feature detector over an image?
Dimension of the feature map as a function of the input image size(W), feature detector size(F), Stride(S)
and Zero Padding on image(P) is
(W−F+2P)/S+1
Input image size W in our case is 5.
Feature detector or receptive field size is F, which in our case is 3
Stride (S) is 1, and the amount of zero padding used (P) on the image is 0.
So our feature map dimension will be (5 − 3 + 0)/1 + 1 = 3.
Each 3 by 3 by 3 feature detector spans all three input channels and produces a single 3 by 3 feature map; applying several feature detectors produces several such 3 by 3 maps.
Fig 1: Feature map computed from the input image and feature detector using the cross-correlation function.
We see that the 5 by 5 input image is reduced to a 3 by 3 feature map. The depth of the output is determined by the number of feature detectors applied, one feature map per detector.
We use multiple feature detectors: some find edges, while others can be used to sharpen the image or to blur it.
If we do not want to reduce the feature map dimension then we can use zero padding of one as shown below
Fig 2: Applying zero padding of 1 to the 5 by 5 input image.
In that case, applying the same formula, we get
(W−F+2P)/S+1 => (5 − 3 + 2)/1 + 1 = 5,
so the spatial dimension of the output will now be 5 by 5 for each feature detector.
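The output-size formula can be checked with a couple of lines of Python; this is just the arithmetic from the text, not library code.

def feature_map_size(W, F, S=1, P=0):
    # (W - F + 2P) / S + 1, as derived above
    return (W - F + 2 * P) // S + 1

print(feature_map_size(W=5, F=3, S=1, P=0))  # 3 -> 3x3 feature map
print(feature_map_size(W=5, F=3, S=1, P=1))  # 5 -> 5x5 feature map with zero padding of 1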
Let's see all this in action.
If we have one feature detector or filter of 3 by 3 and one bias unit, we first apply a linear transformation as shown below:
output = input * weight + bias
Pooling
We now apply pooling to obtain translational invariance (remember the rose image).
Invariance to translation means that when we shift the input by a small amount, the pooled outputs do not change. This helps with detecting features that are common in the input, such as edges or colors in an image.
We apply the max pooling function, which generally performs better than min or average pooling. Because max pooling summarizes the output over a whole neighborhood, we end up with fewer units than in the feature map.
In our example, we scan over all the feature maps using a 2 by 2 box and take the maximum value.
Applying max pooling to the output using a 2 by 2 box. Highlighted region in yellow has a max value of 6
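A minimal NumPy sketch of 2 by 2 max pooling with stride 2 on a single feature map; the feature-map values below are illustrative, not the ones from the figure.

import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    # Take the maximum value inside each size x size window
    H, W = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [2., 1., 0., 2.],
                 [1., 2., 3., 4.]])
print(max_pool2d(fmap))   # [[6. 5.]
                          #  [2. 4.]]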
So we now know that a convolutional network consists of:
● Multiple convolutions performed in parallel, whose outputs are linear activations
● A nonlinear function such as ReLU applied to the convolutional outputs
● A pooling function such as max pooling that summarizes the statistics of nearby locations; this helps with "Translational Invariance"
● Flattening of the max-pooled output, which then becomes the input to a fully connected neural network
The diagram below shows the full convolutional neural network.
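As a hedged illustration of this pipeline (convolution + ReLU + max pooling + flatten + fully connected classifier), here is a minimal PyTorch sketch; the layer sizes, the 1-channel 28x28 input and the 10 output classes are assumptions for illustration, not values from the text.

import torch
import torch.nn as nn

# Assumed input: a batch of 1-channel 28x28 images; assumed output: 10 classes
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3),  # convolution (linear activation)
    nn.ReLU(),                                                # nonlinearity
    nn.MaxPool2d(kernel_size=2),                              # pooling for translational invariance
    nn.Flatten(),                                             # flatten the pooled feature maps
    nn.Linear(8 * 13 * 13, 10),                               # fully connected classifier
)

x = torch.randn(4, 1, 28, 28)      # a dummy batch of 4 images
print(model(x).shape)              # torch.Size([4, 10])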
Parameter Sharing is used to control the number of parameters or weights used in a CNN.
In traditional neural networks each weight is used exactly once; in a CNN, however, we assume that if one feature detector is useful at one spatial position, it can also be used at a different spatial position.
As we share parameters across the CNN, the number of parameters to be learnt is reduced, and so is the computational cost.
Equivariant representation
This means that object detection is invariant to changes in illumination or changes of position, while the internal representation is equivariant to these changes:
represent(rose) = represent(transform(rose)).
Sparse Interactions
Each output unit is connected to (affected by) only a subset of the input units.
Fig: Sparse connectivity (upper) vs full connectivity (lower). The grey shaded nodes in the input show the receptive field of the node in the first layer (source).
If there are m input units and n output units, a fully connected layer would require mn parameters (one per
connection) and correspondingly the number of operations would scale as O(mn). On the other hand, if each
output unit is sparsely connected to k input units, the layer requires kn parameters and O(kn) computations. In
general, for a convolutional layer, the number of output units is a function of the kernel size, stride and padding (discussed later). This effectively makes n a function of m. Keeping this in mind, O(mn) ~ O(m²) while O(kn) ~ O(km). By keeping k several orders of magnitude smaller than m, we see that the computational savings from sparse connections are huge.
Parameter Sharing
In the previous section, we saw that the output units are only connected to a small number of input units. In a
convolutional layer, each kernel weight is used at every input position (except maybe at boundaries where
different padding rules apply as discussed below), i.e. the parameters used to compute different output units are tied together. By tied together, we mean that their values are the same at all times. This means that even during training, they are updated by the same amount, by collecting the gradients from all output units.
Parameter sharing allows models to capture local connectivity while simultaneously computing the same
features at different spatial locations. We will see the use of this property soon.
Here we make a short detour to section 5 for discussing locally connected layers and tiled convolution.
● Locally connected layer/unshared convolution: The connectivity graph of convolution operation
and locally connected layer is the same. The only difference is that parameter sharing is not performed,
i.e. each output unit performs a linear operation on its neighbourhood but the parameters are not shared
across output units. This allows models to capture local connectivity while allowing different features
to be computed at different spatial locations. This, however, requires many more parameters than the
convolution operation.
● Tiled convolution is a sort of middle step between locally connected layer and traditional convolution.
It uses a set of kernels that are cycled through. This reduces the number of parameters in the model
while allowing for some freedom provided by unshared convolution.
Fig: Comparison of connectivity and parameters of locally connected (top), tiled (middle) and standard convolution (bottom) (source).
The parameter complexity and computation complexity can be obtained as below. Note that:
● m = number of input units
● n = number of output units
● k = kernel size
● l = number of kernels in the set (for tiled convolution)
You can see now that the quantity of ~451 thousand parameters corresponds to the locally connected layer. If we use a set of 200 kernels, the number of parameters for tiled convolution is 1.8 thousand. For a traditional convolution operation, this number is just 9 parameters (a single shared 3x3 kernel).
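The exact counts depend on the feature-map sizes, which are not shown here; the short sketch below assumes a 3x3 kernel (k = 9), roughly 224 x 224 output units and a set of l = 200 kernels, an assumption chosen because it reproduces the quoted numbers.

k = 3 * 3            # kernel size (9 weights)
n = 224 * 224        # assumed number of output units (~50k)
l = 200              # number of kernels in the set (tiled convolution)

print("locally connected:", k * n)   # ~451 thousand parameters (one kernel per output unit)
print("tiled convolution:", l * k)   # 1800 parameters (cycle through l kernels)
print("standard convolution:", k)    # 9 parameters (one shared kernel)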
Equivariance
A function f is said to be equivariant to a function g if
f(g(x)) = g(f(x))
i.e. if input changes, the output changes in the same way.
Parameter sharing in a convolutional network provides equivariance to translation. What this means is that
translation of the image results in corresponding translation in the output map (except maybe for boundary
pixels). The reason for this is very intuitive: the same feature is being computed at all input points.
Note that the convolution operation by itself is not equivariant to changes in scale or rotation. I've added a code snippet to demonstrate this; see the figure below for more clarity:
The output of a random 5x5 kernel on an image and its affine transforms is demonstrated. Note that the
histogram difference plots represent point-wise absolute difference between the outputs of convolution
applied to translated, rotated and scaled inputs and the corresponding transformation applied to the output of
convolution on the original image. This demonstrates the equivariance to translation but not to rotation and
scaling.
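The original snippet is not reproduced here, so below is a small stand-in using NumPy and SciPy that mirrors the experiment described above: a random image and a random 5x5 kernel are assumed, a translation commutes with convolution (away from the borders), while a rotation does not.

import numpy as np
from scipy.signal import correlate2d
from scipy.ndimage import shift, rotate

rng = np.random.default_rng(0)
img = rng.random((32, 32))        # random stand-in image
kernel = rng.random((5, 5))       # random 5x5 kernel

def conv(x):
    return correlate2d(x, kernel, mode="same")

# Translation commutes with convolution (away from the borders)
diff_shift = np.abs(conv(shift(img, (3, 0), order=0)) - shift(conv(img), (3, 0), order=0))
# Rotation does not
diff_rot = np.abs(conv(rotate(img, 45, reshape=False, order=0)) -
                  rotate(conv(img), 45, reshape=False, order=0))

# Compare only the interior, where padding/interpolation artefacts do not dominate
print(diff_shift[8:-8, 8:-8].max())   # ~0: equivariant to translation
print(diff_rot[8:-8, 8:-8].max())     # clearly non-zero: not equivariant to rotation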
Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately
parametrized convolutions, the features can learn which transformations to become invariant to (see figure
9.9). Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling
units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than
1 pixel apart.
A transposed convolutional layer is an upsampling layer that generates an output feature map larger than the input feature map. It is similar to a deconvolutional layer. A deconvolutional layer reverses a standard convolutional layer: if the output of a standard convolution layer is deconvolved with the deconvolutional layer, the result is the original input. A transposed convolution, in contrast, does not recover the original values; it only reverses the operation back to the same spatial dimensions.
Transposed convolutional layers are used in a variety of tasks, including image generation, image super-
resolution, and image segmentation. They are particularly useful for tasks that involve upsampling the input
data, such as converting a low-resolution image to a high-resolution one or generating an image from a set of
noise vectors.
The operation of a transposed convolutional layer is similar to that of a normal convolutional layer, except that it performs the convolution operation in the opposite direction. Instead of sliding the kernel over the input and performing element-wise multiplication and summation, a transposed convolutional layer spreads each input value over a kernel-sized patch of the output and sums the overlapping contributions. This results in an output that is larger than the input, and the size of the output can be controlled by the stride and padding parameters of the layer.
Example 1:
Suppose we have a grayscale image of size 2 X 2, and we want to upsample it using a transposed convolutional
layer with a kernel size of 2 x 2, a stride of 1, and zero padding (or no padding). The input image and the kernel
for the transposed convolutional layer would be as follows:
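The example's specific values are not shown here, so the sketch below assumes a 2x2 input and a 2x2 kernel for illustration. Each input value scales the whole kernel and the scaled kernels are added into the output grid offset by the stride, giving a 3x3 output for stride 1 and no padding.

import numpy as np

def transposed_conv2d(x, kernel, stride=1):
    # Scatter-and-add: every input element paints a scaled copy of the kernel
    H, W = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((H - 1) * stride + kh, (W - 1) * stride + kw))
    for i in range(H):
        for j in range(W):
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.array([[1., 2.],
              [3., 4.]])          # assumed 2x2 input
k = np.array([[1., 0.],
              [0., 1.]])          # assumed 2x2 kernel
print(transposed_conv2d(x, k))    # 3x3 upsampled output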
1. Reduced spatial resolution in the output feature map compared to the input feature map
2. Increased computational cost compared to regular convolutions with the same filter size and stride
An additional parameter l (the dilation factor) tells how much the kernel is expanded: based on the value of this parameter, (l − 1) pixels are skipped between successive kernel elements when the kernel is applied to the input. Fig 1 depicts the difference between normal and dilated convolution. In essence, normal convolution is just a 1-dilated convolution.
Formula involved:
(F ∗l k)(p) = Σ over s + l·t = p of F(s)·k(t)
where,
F(s) = Input
k(t) = Applied Filter
∗l = l-dilated convolution
(F ∗l k)(p) = Output
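A small sketch of l-dilated convolution built on the plain cross-correlation idea: the kernel taps are spaced l pixels apart (equivalently, l − 1 zeros are inserted between kernel elements). The input values are illustrative.

import numpy as np

def dilated_correlate2d(image, kernel, l=1):
    # Space the kernel taps l pixels apart; l = 1 is ordinary convolution
    kh, kw = kernel.shape
    eff_h, eff_w = (kh - 1) * l + 1, (kw - 1) * l + 1   # effective receptive field
    H, W = image.shape
    out = np.zeros((H - eff_h + 1, W - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + eff_h:l, j:j + eff_w:l]  # pick every l-th pixel
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(49, dtype=float).reshape(7, 7)
k = np.ones((3, 3))
print(dilated_correlate2d(img, k, l=1).shape)  # (5, 5): normal 3x3 convolution
print(dilated_correlate2d(img, k, l=2).shape)  # (3, 3): 2-dilated, 5x5 receptive field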
d) Maxout:
The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a piecewise linear function that returns the maximum of several linear transformations of the input, designed to be used in conjunction with the dropout regularization technique. Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron therefore enjoys all the benefits of a ReLU unit and does not have drawbacks like dying ReLU. However, it doubles the total number of parameters for each neuron, and hence a higher total number of parameters needs to be trained.
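A minimal NumPy sketch of a Maxout unit with an assumed number of pieces k = 2: the output is the maximum over k affine transformations of the input, so ReLU (max(0, w·x + b)) is the special case where one piece is fixed at zero. The shapes and random values are assumptions for illustration.

import numpy as np

def maxout(x, W, b):
    # W has shape (k, in_features, out_features), b has shape (k, out_features)
    # Output is the element-wise maximum over the k affine pieces
    z = np.einsum('kio,i->ko', W, x) + b      # k affine transformations of x
    return z.max(axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                    # 4 input features
W = rng.standard_normal((2, 4, 3))            # k = 2 pieces, 3 output units
b = rng.standard_normal((2, 3))
print(maxout(x, W, b))                        # 3 outputs, max over the 2 pieces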
e) ELU
The Exponential Linear Unit (ELU) is a function that tends to converge faster and produce more accurate results. Unlike other activation functions, ELU has an extra constant α, which should be a positive number. ELU is very similar to ReLU except for negative inputs: both have the identity form for non-negative inputs. For negative inputs, ELU smoothly saturates towards −α, whereas ReLU is sharply cut off at zero.
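A short NumPy sketch of ELU next to ReLU, showing the smooth saturation towards −α on the negative side (α = 1.0 is assumed here).

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Identity for x >= 0, smooth exponential approach to -alpha for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(x))   # [0. 0. 0. 1. 3.]
print(elu(x))    # approximately [-0.95 -0.63  0.    1.    3.  ]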
1. Regression
o MSE(Mean Squared Error)
o MAE(Mean Absolute Error)
o Huber loss
2. Classification
o Binary cross-entropy
o Categorical cross-entropy
3. AutoEncoder
o KL Divergence
4. GAN
o Discriminator loss
o Minimax GAN loss
5. Object detection
o Focal loss
6. Word embeddings
o Triplet loss
A. Regression Loss
1. Mean Squared Error (MSE)
Advantage
● Easy to interpret.
● Always differentiable because of the square.
● Has only one local minimum.
Disadvantage
● The error is in squared units, so it is harder to interpret in the original units of the target.
● Not robust to outliers.
2. Mean Absolute Error (MAE)
Disadvantage
● The loss is not differentiable at zero, so we cannot use gradient descent directly; a subgradient has to be calculated instead.
3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression, that is less sensitive to outliers in data
than the squared error loss.
Advantage
● Robust to outliers.
● It lies between MAE and MSE.
Disadvantage
● Its main disadvantage is the associated complexity: to maximize model accuracy, the hyperparameter δ also needs to be tuned, which increases the training requirements. A small code sketch of these regression losses follows.
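A minimal NumPy sketch of the three regression losses discussed above (MSE, MAE and Huber with its δ hyperparameter); the sample values are made up, with one outlier to show the difference in sensitivity.

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # Quadratic for small errors (|e| <= delta), linear for large ones
    e = np.abs(y - y_hat)
    return np.mean(np.where(e <= delta, 0.5 * e ** 2, delta * (e - 0.5 * delta)))

y     = np.array([1.0, 2.0, 3.0, 100.0])   # the 100.0 acts as an outlier
y_hat = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
# MSE explodes because of the outlier; Huber stays close to MAE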
B. Classification Loss
Categorical cross-entropy is defined as
L = − Σk yk · log(ŷk)
where
● k indexes the classes,
● y = actual value (one-hot encoded),
● ŷ (yhat) = Neural Network prediction.
Note – In multi-class classification, use the softmax activation function at the last layer.
Which is Faster?
Sparse categorical cross-entropy is faster than categorical cross-entropy because it takes integer class labels directly instead of one-hot encoded targets, as shown below.
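A short NumPy sketch showing that categorical cross-entropy on a one-hot target and sparse categorical cross-entropy on the integer label give the same value; sparse cross-entropy simply skips building the one-hot vector. The logits are made up for illustration.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])      # raw outputs of the last layer (3 classes)
probs = softmax(logits)                  # softmax at the last layer

y_onehot = np.array([0.0, 1.0, 0.0])     # one-hot target for class 1
y_index = 1                              # the same target as an integer label

categorical_ce = -np.sum(y_onehot * np.log(probs))   # -sum_k y_k log(y_hat_k)
sparse_categorical_ce = -np.log(probs[y_index])      # same value, no one-hot needed
print(categorical_ce, sparse_categorical_ce)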
Regularization works by adding a penalty or complexity term to the complex model. Let's consider the simple linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
In the above equation, y represents the value to be predicted,
x1, x2, …, xn are the features for y,
β1, …, βn are the weights or magnitudes attached to the features, β0 represents the bias of the model, and b represents the intercept.
Linear regression models try to optimize the coefficients and the intercept b to minimize the cost function. We then add this loss function and optimize the parameters to build a model that can predict accurate values of y. The loss function for linear regression is called RSS, or the Residual Sum of Squares:
RSS = Σi (yi − ŷi)²
where ŷi is the model's prediction for the i-th sample.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It
is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. We can calculate it by multiplying lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression is:
Cost = Σi (yi − ŷi)² + λ Σj βj²
o In the above equation, the penalty term regularizes the coefficients of the model; hence, ridge regression reduces the amplitude of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation becomes the cost
function of the linear regression model. Hence, for the minimum value of λ, the model will resemble
the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the absolute weights
instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression can only
shrink it near to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression is:
Cost = Σi (yi − ŷi)² + λ Σj |βj|
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, Lasso regression can help us to reduce overfitting in the model as well as perform feature selection. A small code sketch of these cost functions follows.
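A compact NumPy sketch of the three cost functions above (RSS, Ridge = RSS + λ·Σβ², Lasso = RSS + λ·Σ|β|) evaluated for an arbitrary weight vector; the synthetic data and λ value are illustrative assumptions.

import numpy as np

def rss(X, y, beta, b):
    residuals = y - (X @ beta + b)
    return np.sum(residuals ** 2)

def ridge_cost(X, y, beta, b, lam):
    return rss(X, y, beta, b) + lam * np.sum(beta ** 2)      # L2 penalty

def lasso_cost(X, y, beta, b, lam):
    return rss(X, y, beta, b) + lam * np.sum(np.abs(beta))   # L1 penalty

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(20)
beta, b, lam = np.array([0.8, 0.1, -1.5]), 0.0, 0.5
print(ridge_cost(X, y, beta, b, lam), lasso_cost(X, y, beta, b, lam))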
Learning Rate Scheduling: The learning rate determines the step size of the Gradient Descent algorithm. Learning rate scheduling involves changing the learning rate during the training process, for example decreasing the learning rate as the number of iterations increases. This technique helps the algorithm to converge faster and avoid overshooting the minimum. A simple example is sketched below.
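One simple form of learning rate scheduling is step decay, sketched below: the learning rate is cut by a fixed factor every few epochs. The initial rate, decay factor and step size are assumptions for illustration.

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every 10 epochs (values are illustrative)
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 9, 10, 25, 40):
    print(epoch, step_decay(0.1, epoch))
# 0.1 for epochs 0-9, then 0.05, 0.025 at epoch 25, 0.00625 at epoch 40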
Momentum-based Updates: The Momentum-based Gradient Descent technique involves adding a fraction of
the previous update to the current update. This technique helps the algorithm to overcome local minima and
accelerates convergence.
Batch Normalization: Batch Normalization is a technique used to normalize the inputs to each layer of the
neural network. This helps the Gradient Descent algorithm to converge faster and avoid vanishing or exploding
gradients.
Weight Decay: Weight Decay is a regularization technique that involves adding a penalty term to the cost
function proportional to the magnitude of the weights. This helps to prevent overfitting and improve the
generalization of the model.
Adaptive Learning Rates: Adaptive Learning Rate techniques involve adjusting the learning rate adaptively
during the training process. Examples include Adagrad, RMSprop, and Adam. These techniques adjust the
learning rate based on the historical gradient information, which can improve the convergence speed and
accuracy of the algorithm.
Second-Order Methods: Second-Order Methods use the second-order derivatives of the cost function to
update the parameters. Examples include Newton’s Method and Quasi-Newton Methods. These methods can
converge faster than Gradient Descent, but require more computation and may be less stable.
Gradient Descent is an iterative optimization algorithm used to find the minimum of a function. The general idea is to initialize the parameters to random values and then take small steps in the direction of the "slope" at each iteration. Gradient descent is widely used in supervised learning to minimize the error function and find the optimal values for the parameters. Various extensions have been designed for the gradient descent algorithm; some of them are discussed below:
Momentum method: This method is used to accelerate the gradient descent algorithm by taking into consideration the exponentially weighted average of the gradients. Using averages makes the algorithm converge towards the minimum faster, as the gradients in uncommon directions cancel out. The pseudocode for the momentum method is given below.
V = 0
for each iteration i:
    compute dW
    V = β·V + (1 − β)·dW
    W = W − α·V
V and dW are analogous to velocity and acceleration respectively. α is the learning rate, and β is analogous to momentum and is normally kept at 0.9. The physics interpretation is that the velocity of a ball rolling downhill builds up momentum according to the direction of the slope (gradient) of the hill, which helps the ball arrive at a minimum value (in our case, a minimum loss). A runnable version is sketched below.
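A runnable NumPy version of the momentum update above, minimizing a toy quadratic loss so the whole loop fits in a few lines; the loss function, α and β values are illustrative assumptions.

import numpy as np

def grad(W):
    # Gradient of the toy loss J(W) = (W - 3)^2
    return 2.0 * (W - 3.0)

W, V = 0.0, 0.0
alpha, beta = 0.1, 0.9
for i in range(100):
    dW = grad(W)
    V = beta * V + (1 - beta) * dW   # exponentially weighted average of gradients
    W = W - alpha * V                # momentum update
print(W)   # close to the minimum at W = 3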
RMSprop: RMSprop was proposed by the University of Toronto's Geoffrey Hinton. The intuition is to apply an exponentially weighted average to the second moment of the gradients (dW²). The pseudocode for this is as follows:
S = 0
for each iteration i:
    compute dW
    S = β·S + (1 − β)·dW²
    W = W − α·dW / (√S + ε)
Adam Optimization: Adam optimization algorithm incorporates the momentum method and RMSprop, along
with bias correction. The pseudocode for this approach is as follows:
V = 0
S = 0
for each iteration i:
    compute dW
    V = β1·V + (1 − β1)·dW
    S = β2·S + (1 − β2)·dW²
    V_corrected = V / (1 − β1^i)
    S_corrected = S / (1 − β2^i)
    W = W − α·V_corrected / (√S_corrected + ε)
Kingma and Ba, the proposers of Adam, recommended the following values for the hyperparameters.
α = 0.001
β1 = 0.9
β2 = 0.999
ε = 10⁻⁸
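A runnable NumPy sketch of the Adam update with bias correction, using the recommended hyperparameters above on the same toy quadratic loss as in the momentum example; the loss itself is just an illustrative assumption.

import numpy as np

def grad(W):
    return 2.0 * (W - 3.0)   # gradient of the toy loss (W - 3)^2

W, V, S = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
for i in range(1, 5001):
    dW = grad(W)
    V = beta1 * V + (1 - beta1) * dW          # first moment (momentum)
    S = beta2 * S + (1 - beta2) * dW ** 2     # second moment (RMSprop)
    V_hat = V / (1 - beta1 ** i)              # bias correction
    S_hat = S / (1 - beta2 ** i)
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)
print(W)   # approaches the minimum at W = 3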
Optimisers :
Optimizers are algorithms or methods used to change the attributes of your neural network such as weights
and learning rate in order to reduce the losses.
Gradient Computation :
Gradient Descent is the most basic but most used optimization algorithm. It’s used heavily in linear regression
and classification algorithms. Backpropagation in neural networks also uses a gradient descent algorithm.
Gradient descent is a first-order optimization algorithm which depends on the first-order derivative of the loss function. It calculates in which direction the weights should be altered so that the function can reach a minimum. Through backpropagation, the loss is propagated from one layer to another, and the model's parameters, also known as weights, are modified depending on the loss so that it can be minimized.
algorithm: θ=θ−α⋅∇J(θ)
Advantages:
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages:
1. Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is too large it may take a very long time to converge to the minimum.
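A short NumPy sketch of the update rule θ = θ − α·∇J(θ) applied to linear regression with an MSE loss; the synthetic data and learning rate are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5            # synthetic targets

theta = np.zeros(2)
b = 0.0
alpha = 0.1
for _ in range(200):
    y_hat = X @ theta + b
    # Gradients of the MSE loss J = mean((y_hat - y)^2)
    grad_theta = 2 * X.T @ (y_hat - y) / len(y)
    grad_b = 2 * np.mean(y_hat - y)
    theta, b = theta - alpha * grad_theta, b - alpha * grad_b   # θ = θ - α·∇J(θ)

print(theta, b)   # close to the true weights [2, -1] and intercept 0.5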