
UNIT II CONVOLUTIONAL NEURAL NETWORKS

Convolution Operation -- Sparse Interactions -- Parameter Sharing -- Equivariance -- Pooling -- Convolution Variants: Strided -- Tiled -- Transposed and Dilated Convolutions; CNN Learning: Nonlinearity Functions -- Loss Functions -- Regularization -- Optimizers -- Gradient Computation

CNN
CNN stands for Convolutional Neural Network, a specialized neural network for processing data that has a 2D matrix-like input shape, such as images. CNNs are typically used for image detection and classification. Images are 2D matrices of pixels on which we run a CNN to either recognize or classify the image, for example to identify whether an image is of a human being, a car, or just the digits of an address.

Convolution Operation :
The process of applying a convolution kernel to the input data is referred to as the convolution operation. This
operation involves sliding the kernel across the input, computing the dot product at each position, and generating
an output feature map.
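To make the sliding-window description concrete, here is a minimal NumPy sketch of the operation (in the unflipped, cross-correlation form that most deep learning libraries implement); the array values are made up purely for illustration.

import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image and take a dot product at each position.
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kH, j * stride:j * stride + kW]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and kernel
    return out

image = np.arange(25, dtype=float).reshape(5, 5)      # toy 5x5 "image"
kernel = np.array([[1., 0., -1.],                     # toy 3x3 edge-like filter
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(conv2d(image, kernel).shape)                    # (3, 3) output feature map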

Features :

In the context of CNNs, features refer to the meaningful patterns or characteristics extracted from the input data
by the convolutional layers. These features represent different aspects of the input, such as edges, textures, or
more complex structures. Each convolutional kernel is responsible for detecting specific features in the input
data.

In the context of Convolutional Neural Networks (CNNs), a “filter” is another term used interchangeably with “convolution kernel” or “convolution matrix.” It refers to a small matrix of weights that is convolved with the input data to perform feature extraction. Filters are typically small, square matrices (e.g., 3x3 or 5x5) containing learnable parameters. These parameters are adjusted during training through backpropagation to capture relevant patterns or features in the input data. When a filter is applied to the input using the convolution operation, it slides (convolves) across the input, computing the dot product between its weights and the corresponding pixels of the input. This process generates a feature map that highlights specific patterns or characteristics present in the input.
Like neural networks, CNNs also draw motivation from the brain. We use the object recognition model proposed by Hubel and Wiesel. The visual area V1 consists of simple cells and complex cells. Simple cells help with feature detection, and complex cells combine several such local features from a small spatial neighborhood. Spatial pooling helps produce translation-invariant features.
When we see a new image, we scan it, perhaps left to right and top to bottom, to understand its different features. Our next step is to combine the different local features that we scanned to classify the image. This is exactly how a CNN works. Convolution is a mathematical operation where we have an input I and an argument, the kernel K, and produce an output that expresses how the shape of one is modified by the other. Let's explain this in terms of an image.
We have an image “x”, which is a 2D array of pixels with different color channels (Red, Green and Blue: RGB), and a feature detector or kernel “w”; the output we get after applying the convolution operation is called a feature map.

The equation for the feature map is:
S(i, j) = (x * w)(i, j) = Σm Σn x(m, n) w(i − m, j − n)


Convolution function
The mathematical operation helps compute similarity of two signals.
we may have a feature detector or filter for identifying edges in the image, so convolution operation will help
us identify the edges in the image when we use such a filter on the image.
we usually assume that convolution functions are zero everywhere but the finite set of points for which we
store the values. This means that in practice we can implement the infinite summation as a summation over a
finite number of array elements.

Where I is a 2D array and K is the kernel, the convolution function is:
S(i, j) = (I * K)(i, j) = Σm Σn I(m, n) K(i − m, j − n)


Since convolution is commutative, we can rewrite the equation above with the kernel and input swapped:
S(i, j) = (K * I)(i, j) = Σm Σn I(i − m, j − n) K(m, n)
We do this for ease of implementation in machine learning, as there is less variation in the range of valid values for m and n. Most neural network libraries actually implement the closely related cross-correlation function, which does not flip the kernel:
S(i, j) = (K ⋆ I)(i, j) = Σm Σn I(i + m, j + n) K(m, n)

The way we implement this is through the convolutional layer. The convolutional layer is the core building block of a CNN; it helps with feature detection. The kernel K is a set of learnable filters, each small spatially compared to the image but extending through the full depth of the input image.
An easy way to understand this: if you were a detective and came across a large picture in the dark, how would you identify the image?

You would use your flashlight and scan across the entire image. This is exactly what we do in a convolutional layer. The kernel K, which is a feature detector, is the equivalent of the flashlight on image I; we are trying to detect features and create multiple feature maps to help us identify or classify the image.
We have multiple feature detectors to help with things like edge detection, identifying different shapes, bends or different colors.

Let's take an image that is a 5 by 5 matrix with 3 channels (RGB) and a feature detector of 3 by 3 with 3 channels (RGB), and scan the feature detector over the image with a stride of 1.
The feature detector moves over the image 1 stride at a time.
What will be the dimension of the output matrix or feature map when we apply a feature detector over an image? The dimension of the feature map as a function of the input image size (W), feature detector size (F), stride (S) and zero padding on the image (P) is
(W − F + 2P)/S + 1
The input image size W in our case is 5, the feature detector or receptive field size F is 3, the stride S is 1, and the amount of zero padding used on the image P is 0.
So our feature map dimension will be (5 − 3 + 0)/1 + 1 = 3, i.e. a 3 by 3 feature map, drawn with three channels (RGB) in the illustration.
Fig 1
Feature map based on the input image and feature detector using cross correlation function.
We see that the 5 by 5 input image is reduced to a 3 by 3 feature map; in the illustration the depth is kept at 3 (RGB). Note that in a standard convolutional layer the per-channel results are summed, so each filter produces a single-channel map, and the depth of the output equals the number of filters used.
We use multiple feature detectors, for example to find edges, to sharpen the image or to blur the image.
If we do not want to reduce the feature map dimension, we can use zero padding of one as shown below.
Fig 2: Applying a zero padding of 1 on the 5 by 5 input image
In that case, applying the same formula, we get (W − F + 2P)/S + 1 = (5 − 3 + 2)/1 + 1 = 5, so the dimension of the output will be 5 by 5 with 3 color channels (RGB).
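The output-size formula above is easy to check in code; this small helper is only a sketch of the same arithmetic.

def feature_map_size(W, F, S=1, P=0):
    # (W - F + 2P) / S + 1, using integer division for whole-pixel outputs
    return (W - F + 2 * P) // S + 1

print(feature_map_size(5, 3, S=1, P=0))   # 3 -> a 3 by 3 feature map
print(feature_map_size(5, 3, S=1, P=1))   # 5 -> zero padding of 1 keeps the 5 by 5 size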
Let's see all this in action.
If we have one feature detector or filter of 3 by 3 and one bias unit, then we first apply a linear transformation as shown below:
output = input * weight + bias

Pooling
We now apply pooling to obtain translational invariance (remember the rose image).
Invariance to translation means that when we change the input by a small amount, the pooled outputs do not change. This helps with detecting features that are common in the input, like edges or colors in an image.
We apply the max pooling function, which in practice usually performs better than min or average pooling. Because max pooling summarizes the output over a whole neighborhood, we end up with fewer units than in the feature map.
In our example, we scan over all the feature maps using a 2 by 2 box and take the maximum value.
Applying max pooling to the output using a 2 by 2 box; the highlighted region in yellow has a max value of 6.
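A minimal NumPy sketch of 2 by 2 max pooling is shown below; the 4 by 4 feature map values are made up for illustration and are not the ones from the figure.

import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    # Summarize each size-by-size window of the feature map by its maximum.
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 8., 3.],
                 [1., 4., 9., 0.]])
print(max_pool2d(fmap))   # [[6. 4.]
                          #  [7. 9.]]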
So now we know that a convolutional network consists of:
● Multiple convolutions performed in parallel, whose outputs are linear activations
● Applying a nonlinear function such as ReLU to the convolutional outputs
● Using a pooling function like max pooling to summarize the statistics of nearby locations, which helps with “translational invariance”
● Flattening the max-pooled output, which is then fed into a fully connected neural network
The diagram below shows the full convolutional neural network.

Fig Architecture Diagram of Convolutional Neural Network

Convolution uses three important ideas


● Sparse interactions
● Parameter sharing
● Equivariant representations
Sparse interactions (sparse weights) are implemented by using kernels or feature detectors smaller than the input image.
If we have an input image of size 256 by 256, the edges we want to detect may occupy only a small subset of its pixels, so it is difficult to detect them with a full-size detector. If we use smaller feature detectors, we can easily identify the edges because we focus on local feature identification.
One more advantage is that computing the output requires fewer operations, which makes the layer computationally and statistically efficient.

Parameter sharing is used to control the number of parameters or weights used in a CNN.
In traditional neural networks each weight is used exactly once; in a CNN, however, we assume that if a feature detector is useful at one spatial position, it is also useful at a different spatial position.
Because parameters are shared across the CNN, the number of parameters to be learnt is reduced, and so are the computational requirements.

Equivariant representation
Object detection should be invariant to changes such as illumination or position, i.e. represent(rose) = represent(transform(rose)), but the internal representation is equivariant to these changes: when the input is transformed, the representation changes in the same way,
represent(transform(rose)) = transform(represent(rose)).

1. The Convolution Operation


The convolution operates on the input with a kernel (weights) to produce an output map given by the continuous domain convolution:
(f * g)(t) = ∫ f(τ) g(t − τ) dτ


Let us break down the formula. The steps involved are:
1. Express each function in terms of a dummy variable τ
2. Reflect the function g, i.e. g(τ) → g(−τ)
3. Add a time offset, i.e. g(−τ) → g(t − τ). Adding the offset shifts the function to the right by t units (by convention, a negative offset shifts it to the left)
4. Multiply f and g point-wise and accumulate the results to get the output at instant t. Basically, we are calculating the area of overlap between f and the shifted g
For our application, we are interested in the discrete domain formulation.
1-D discrete convolution: (f * g)(t) = Στ f(τ) g(t − τ)
2-D discrete convolution: S(i, j) = (I * K)(i, j) = Σm Σn I(m, n) K(i − m, j − n)


When the kernel is not flipped in its domain, we obtain the cross-correlation operation. The basic difference
between the two operations is that convolution is commutative in nature, i.e. f and g can be interchanged
without changing the output. Cross-correlation is not commutative.
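The commutativity difference can be checked numerically with SciPy (assuming scipy is available); the arrays here are random and purely illustrative.

import numpy as np
from scipy.signal import convolve2d, correlate2d

f = np.random.rand(5, 5)
g = np.random.rand(3, 3)

# Convolution flips the kernel and is commutative: f * g == g * f.
print(np.allclose(convolve2d(f, g), convolve2d(g, f)))            # True

# Cross-correlation does not flip the kernel and is generally not commutative.
print(np.allclose(correlate2d(f, g), correlate2d(g, f)))          # generally False

# Cross-correlation with g equals convolution with the flipped kernel.
print(np.allclose(correlate2d(f, g), convolve2d(f, np.flip(g))))  # True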

Sparse Interactions
Each output unit is connected to (affected by) only a subset of the input units.
Fig 1: Sparse connectivity (upper) versus full connectivity (lower). The grey shaded nodes in the input show the receptive field of the node in the first layer (source).
If there are m input units and n output units, a fully connected layer would require mn parameters (one per connection), and correspondingly the number of operations would scale as O(mn). On the other hand, if each output unit is sparsely connected to k input units, the layer requires kn parameters and O(kn) computations. In general, for a convolutional layer, the number of output units is a function of the kernel size, stride and padding (discussed later), which makes n a function of m. Keeping this in mind, O(mn) ~ O(m²) while O(kn) ~ O(km). By keeping k several orders of magnitude smaller than m, we see that the computational savings from sparse connections are huge.

Parameter Sharing
In the previous section, we saw that the output units are only connected to a small number of input units. In a convolutional layer, each kernel weight is used at every input position (except perhaps at boundaries, where different padding rules apply as discussed below), i.e. the parameters used to compute different output units are tied together. By tied together, we mean that their values are the same at all times. This means that even during training they are updated by the same amount, computed by collecting the gradients from all output units.
Parameter sharing allows models to capture local connectivity while simultaneously computing the same features at different spatial locations. We will see the use of this property soon.
Here we make a short detour to section 5 to discuss locally connected layers and tiled convolution.
● Locally connected layer/unshared convolution: The connectivity graph of the convolution operation and the locally connected layer is the same. The only difference is that parameter sharing is not performed, i.e. each output unit performs a linear operation on its neighbourhood, but the parameters are not shared across output units. This allows models to capture local connectivity while allowing different features to be computed at different spatial locations. It does, however, require many more parameters than the convolution operation.
● Tiled convolution is a middle ground between a locally connected layer and traditional convolution. It uses a set of kernels that are cycled through. This reduces the number of parameters in the model while allowing for some of the freedom provided by unshared convolution.
Fig: Comparison of connectivity and parameters of locally connected (top), tiled (middle) and standard convolution (bottom) (source)
The parameter complexity and computation complexity can be obtained as below. Note that:
● m = number of input units
● n = number of output units
● k = kernel size
● l = number of kernels in the set (for tiled convolution)

You can see now that the quantity of roughly 451 thousand parameters corresponds to the locally connected layer. If we use a set of 200 kernels, the number of parameters for tiled convolution is 1.8 thousand, and for a traditional convolution operation this number is just 9 parameters.
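The quoted figures can be reproduced from the k·n, k·l and k parameter rules above. The 224 by 224 output map and the 3 by 3 single-channel kernel in this sketch are assumptions chosen only because they give the same numbers; the original figure may have used a different example.

k = 3 * 3        # weights per output unit (3x3 kernel)
n = 224 * 224    # number of output units (assumed)
l = 200          # number of kernel sets for tiled convolution

locally_connected = k * n   # every output unit has its own weights
tiled = k * l               # l kernels shared cyclically across positions
standard = k                # one kernel shared everywhere

print(locally_connected)    # 451584  (~451 thousand)
print(tiled)                # 1800    (1.8 thousand)
print(standard)             # 9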
Equivariance
A function f is said to be equivariant to a function g if
f(g(x)) = g(f(x))
i.e. if input changes, the output changes in the same way.
Parameter sharing in a convolutional network provides equivariance to translation. What this means is that
translation of the image results in corresponding translation in the output map (except maybe for boundary
pixels). The reason for this is very intuitive: the same feature is being computed at all input points.
Note that the convolution operation by itself is not equivariant to changes in scale or rotation. A code snippet demonstrating this is included below. See the figure below for more clarity on this:

The output of a random 5x5 kernel on an image and its affine transforms is demonstrated. Note that the
histogram difference plots represent point-wise absolute difference between the outputs of convolution
applied to translated, rotated and scaled inputs and the corresponding transformation applied to the output of
convolution on the original image. This demonstrates the equivariance to translation but not to rotation and
scaling.
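The original article's snippet is not reproduced here, but the same check can be sketched in NumPy/SciPy as below. A circular ('wrap') boundary is used so that the translation test holds exactly; the random image and kernel are illustrative only.

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.random((32, 32))
kernel = rng.random((5, 5))

def conv(x):
    # 'wrap' makes the operation circular, removing boundary effects.
    return correlate2d(x, kernel, mode='same', boundary='wrap')

shift = (3, 7)
shifted_then_conv = conv(np.roll(image, shift, axis=(0, 1)))
conv_then_shifted = np.roll(conv(image), shift, axis=(0, 1))
print(np.allclose(shifted_then_conv, conv_then_shifted))   # True: equivariant to translation

rotated_then_conv = conv(np.rot90(image))
conv_then_rotated = np.rot90(conv(image))
print(np.allclose(rotated_then_conv, conv_then_rotated))   # generally False: not equivariant to rotation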

A convolutional layer can be broken down into the following components:


● Convolution
● Activation (detector stage)
● Pooling
The pooling function calculates a summary statistic of the nearby pixels at the point of operation. Some
common statistics are max, mean, weighted average and L² norm of a surrounding rectangular window.
Pooling makes the representation slightly translation invariant, in that small translations in the input do not
cause large changes in the output map. It allows detection of a particular feature if we only care about its
existence, not its position in an image.

Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately
parametrized convolutions, the features can learn which transformations to become invariant to (see figure
9.9). Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling
units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than
1 pixel apart.

A transposed convolutional layer is an upsampling layer that generates an output feature map larger than the input feature map. It is similar to, but not the same as, a deconvolutional layer: a deconvolutional layer mathematically reverses a standard convolutional layer, so if the output of a standard convolution is deconvolved, the original values are recovered. A transposed convolution does not recover the original values, but it does recover the original spatial dimensions.

Transposed convolutional layers are used in a variety of tasks, including image generation, image super-
resolution, and image segmentation. They are particularly useful for tasks that involve upsampling the input
data, such as converting a low-resolution image to a high-resolution one or generating an image from a set of
noise vectors.
The operation of a transposed convolutional layer is similar to that of a normal convolutional
layer, except that it performs the convolution operation in the opposite direction. Instead of sliding the kernel
over the input and performing element-wise multiplication and summation, a transposed convolutional layer
slides the input over the kernel and performs element-wise multiplication and summation. This results in an
output that is larger than the input, and the size of the output can be controlled by the stride and padding
parameters of the layer.

Example 1:

Suppose we have a grayscale image of size 2 x 2, and we want to upsample it using a transposed convolutional layer with a kernel size of 2 x 2, a stride of 1, and no padding. The input image and the kernel for the transposed convolutional layer would be as follows:

The output will be:

Transposed Convolutional Stride = 1
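A NumPy sketch of the scatter-based view of transposed convolution is shown below; the 2 by 2 input and kernel values are made up, since the original figures are not reproduced here. Each input value scales the whole kernel, and the scaled kernels are added into overlapping positions of the output.

import numpy as np

def transposed_conv2d(x, kernel, stride=1):
    H, W = x.shape
    kH, kW = kernel.shape
    out = np.zeros((stride * (H - 1) + kH, stride * (W - 1) + kW))
    for i in range(H):
        for j in range(W):
            # scatter-add the scaled kernel into the output window
            out[i * stride:i * stride + kH, j * stride:j * stride + kW] += x[i, j] * kernel
    return out

x = np.array([[1., 2.],
              [3., 4.]])          # assumed 2x2 input
k = np.array([[1., 0.],
              [0., 1.]])          # assumed 2x2 kernel
print(transposed_conv2d(x, k, stride=1))   # 3x3 output: the 2x2 input is upsampled to 3x3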


Dilated Convolution: It is a technique that expands the kernel by inserting holes (gaps) between its consecutive elements. In simpler terms, it is the same as convolution, but it involves pixel skipping so as to cover a larger area of the input.
Dilated convolution, also known as atrous convolution, is a type of convolution operation used in
convolutional neural networks (CNNs) that enables the network to have a larger receptive field without
increasing the number of parameters.
In a regular convolution operation, a filter of a fixed size slides over the input feature map, and the
values in the filter are multiplied with the corresponding values in the input feature map to produce a single
output value. The receptive field of a neuron in the output feature map is defined as the area in the input feature
map that the filter can “see”. The size of the receptive field is determined by the size of the filter and the stride
of the convolution.
In contrast, in a dilated convolution operation, the filter is “dilated” by inserting gaps between the filter
values. The dilation rate determines the size of the gaps, and it is a hyperparameter that can be adjusted. When
the dilation rate is 1, the dilated convolution reduces to a regular convolution.
The dilation rate effectively increases the receptive field of the filter without increasing the number of
parameters, because the filter is still the same size, but with gaps between the values. This can be useful in
situations where a larger receptive field is needed, but increasing the size of the filter would lead to an increase
in the number of parameters and computational complexity.
Dilated convolutions have been used successfully in various applications, such as semantic
segmentation, where a larger context is needed to classify each pixel, and audio processing, where the network
needs to learn patterns with longer time dependencies.

Some advantages of dilated convolutions are:

1. Increased receptive field without increasing parameters


2. Can capture features at multiple scales
3. Reduced spatial resolution loss compared to regular convolutions with larger filters

Some disadvantages of dilated convolutions are:

1. Reduced spatial resolution in the output feature map compared to the input feature map
2. Increased computational cost compared to regular convolutions with the same filter size and stride

An additional parameter l (the dilation factor) tells how much the kernel is expanded. In other words, based on the value of this parameter, (l − 1) pixels are skipped between the kernel elements when it is mapped onto the input. Fig 1 depicts the difference between normal and dilated convolution. In essence, normal convolution is just a 1-dilated convolution.

Fig 1: Normal Convolution vs Dilated Convolution


Intuition:
Dilated convolution helps expand the area of the input image covered without pooling. The objective is to
cover more information from the output obtained with every convolution operation. This method offers a wider
field of view at the same computational cost. We determine the value of the dilation factor (l) by seeing how
much information is obtained with each convolution on varying values of l.
By using this method, we are able to obtain more information without increasing the number of kernel
parameters. In Fig 1, the image on the left depicts dilated convolution. On keeping the value of l = 2, we skip
1 pixel (l – 1 pixel) while mapping the filter onto the input, thus covering more information in each step.

Formula Involved:
(F *l k)(p) = Σ(s + l·t = p) F(s) k(t)
where,
F(s) = Input
k(t) = Applied Filter
*l = l-dilated convolution
(F *l k)(p) = Output
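A minimal NumPy sketch of an l-dilated convolution is given below: the kernel taps are spaced `dilation` pixels apart, so (l − 1) input pixels are skipped between taps. The 7 by 7 input is illustrative.

import numpy as np

def dilated_conv2d(image, kernel, dilation=1):
    kH, kW = kernel.shape
    span_h = dilation * (kH - 1) + 1     # effective receptive field height
    span_w = dilation * (kW - 1) + 1     # effective receptive field width
    H, W = image.shape
    out = np.zeros((H - span_h + 1, W - span_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input with step `dilation`, then take the dot product
            patch = image[i:i + span_h:dilation, j:j + span_w:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))
print(dilated_conv2d(image, kernel, dilation=1).shape)  # (5, 5): ordinary 3x3 convolution
print(dilated_conv2d(image, kernel, dilation=2).shape)  # (3, 3): same 9 weights, 5x5 receptive field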

Advantages of Dilated Convolution:


Using this method rather than normal convolution is better because of:
1. Larger receptive field (i.e. no loss of coverage)
2. Computational efficiency (it provides larger coverage at the same computational cost)
3. Lower memory consumption (as it skips the pooling step)
4. No loss of resolution of the output image (as we dilate instead of performing pooling)
5. The structure of this convolution helps in maintaining the order of the data.

Non-Linear Activation Functions


Non-linear functions are the most widely used activation functions. They make it easy for a neural network model to adapt to a variety of data and to differentiate between the outcomes.
These functions are mainly divided on the basis of their range or curves:

a) Sigmoid Activation Functions


Sigmoid takes a real value as the input and outputs another value between 0 and 1. The sigmoid activation
function translates the input ranged in (-∞,∞) to the range in (0,1)

b) Tanh Activation Functions


The tanh function is just another possible function that can be used as a non-linear activation function between
layers of a neural network. It shares a few things in common with the sigmoid activation function. Unlike a
sigmoid function that will map input values between 0 and 1, the Tanh will map values between -1 and 1.
Similar to the sigmoid function, one of the interesting properties of the tanh function is that the derivative of
tanh can be expressed in terms of the function itself.

c) ReLU Activation Functions


The formula is deceptively simple: max(0, z). Despite the name Rectified Linear Unit, it is not linear, and it provides benefits similar to sigmoid but generally with better performance.

(i) Leaky Relu


Leaky Relu is a variant of ReLU. Instead of being 0 when z<0, a leaky ReLU allows a small, non-zero, constant
gradient α (normally, α=0.01). However, the consistency of the benefit across tasks is presently unclear. Leaky
ReLUs attempt to fix the “dying ReLU” problem.

(ii) Parametric Relu


PReLU gives the neurons the ability to choose what slope is best in the negative region. They can become
ReLU or leaky ReLU with certain values of α.

d) Maxout:
The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a piecewise linear
function that returns the maximum of inputs, designed to be used in conjunction with the dropout regularization
technique. Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron, therefore, enjoys all
the benefits of a ReLU unit and does not have any drawbacks like dying ReLU. However, it doubles the total
number of parameters for each neuron, and hence, a higher total number of parameters need to be trained.
e) ELU
The Exponential Linear Unit or ELU is a function that tends to converge faster and produce more accurate
results. Unlike other activation functions, ELU has an extra alpha constant which should be a positive number.
ELU is very similar to ReLU except for negative inputs: both take the identity form for non-negative inputs. For negative inputs, ELU smoothly saturates towards −α, whereas ReLU cuts off sharply at zero.

f) Softmax Activation Functions


The softmax function calculates a probability distribution over ‘n’ different events. In general, this function calculates the probability of each target class over all possible target classes. The calculated probabilities then help determine the target class for the given inputs.
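The activation functions above can be written out directly in NumPy; this is only a reference sketch, with an illustrative input vector.

import numpy as np

def sigmoid(z):                    # maps (-inf, inf) to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                       # maps (-inf, inf) to (-1, 1)
    return np.tanh(z)

def relu(z):                       # max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):     # small constant slope alpha for z < 0
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):             # smoothly saturates to -alpha for very negative z
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def softmax(z):                    # probability distribution over the entries of z
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(z))                     # [0.  0.  0.  1.  3.]
print(softmax(z).sum())            # 1.0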

When to use which Activation Function in a Neural Network?


Specifically, it depends on the problem type and the value range of the expected output. For example, to predict values that are larger than 1, tanh or sigmoid are not suitable for the output layer; ReLU can be used instead. On the other hand, if the output values have to be in the range (0, 1) or (-1, 1), then ReLU is not a good choice, and sigmoid or tanh can be used. When performing a classification task and using the neural network to predict a probability distribution over mutually exclusive class labels, the softmax activation function should be used in the last layer. For the hidden layers, as a rule of thumb, use ReLU.
In the case of a binary classifier, the sigmoid activation function should be used in the output layer. The sigmoid and tanh activation functions tend to work poorly in hidden layers because they saturate; for hidden layers, ReLU or its improved variant, leaky ReLU, should be used. For a multiclass classifier, softmax is the most commonly used activation function. Though more activation functions exist, these are the most used.
Activation Functions and their Derivatives

Loss Function in Deep Learning

1. Regression
o MSE(Mean Squared Error)
o MAE(Mean Absolute Error)
o Huber loss
2. Classification
o Binary cross-entropy
o Categorical cross-entropy
3. AutoEncoder
o KL Divergence
4. GAN
o Discriminator loss
o Minimax GAN loss
5. Object detection
o Focal loss
6. Word embeddings
o Triplet loss

A. Regression Loss

1. Mean Squared Error/Squared loss/ L2 loss


The Mean Squared Error (MSE) is the simplest and most common loss function. To calculate the MSE, you
take the difference between the actual value and model prediction, square it, and average it across the whole
dataset.

Advantage

● 1. Easy to interpret.
● 2. Always differentiable because of the square.
● 3. Only one local minimum.

Disadvantage

● 1. The error is in squared units of the target, which makes it harder to interpret.
● 2. Not robust to outliers.

Note – In regression, use a linear activation function at the last neuron.

2. Mean Absolute Error/ L1 loss


The Mean Absolute Error (MAE) is another simple loss function. To calculate the MAE, you take the absolute difference between the actual value and the model prediction and average it across the whole dataset.
Advantage

● 1. Intuitive and easy.
● 2. The error unit is the same as that of the output column.
● 3. Robust to outliers.

Disadvantage

● 1. Not differentiable at zero, so we cannot use plain gradient descent directly; a subgradient has to be computed instead.

Note – In regression, use a linear activation function at the last neuron.

3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss. For a residual a = y − ŷ it is quadratic when |a| ≤ δ and linear beyond that point:
L_δ(a) = ½ a²  if |a| ≤ δ,  otherwise δ(|a| − ½ δ)
averaged over all n data points, where:
● n – the number of data points.
● y – the actual value of the data point, also known as the true value.
● ŷ – the predicted value of the data point, returned by the model.
● δ – defines the point where the Huber loss function transitions from quadratic to linear.

Advantage

● Robust to outliers.
● It lies between MAE and MSE.

Disadvantage

● Its main disadvantage is the associated complexity: to maximize model accuracy, the hyperparameter δ also needs to be tuned, which increases the training requirements.
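The three regression losses can be sketched in NumPy as follows; the target and prediction vectors are made up for illustration.

import numpy as np

def mse(y, y_hat):
    # Mean Squared Error (L2 loss)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error (L1 loss)
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # Quadratic for residuals smaller than delta, linear beyond it
    r = np.abs(y - y_hat)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))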

B. Classification Loss

1. Binary Cross Entropy/log loss


It is used in binary classification problems, i.e. problems with two classes, for example whether a person has covid or not, or whether my article becomes popular or not.
Binary cross-entropy compares each predicted probability to the actual class output, which can be either 0 or 1, and calculates a score that penalizes the probability based on its distance from the expected value, i.e. how close or far it is from the actual value:
L = −(1/n) Σi [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]
● yi – actual values
● ŷi – neural network prediction

Advantage –

● The cost function is differentiable.

Disadvantage –

● Multiple local minima
● Not intuitive

Note – In binary classification, use a sigmoid activation function at the last neuron.

2. Categorical Cross Entropy


Categorical cross-entropy is used for multiclass classification and softmax regression.
loss function: L = −Σ(j=1 to k) yj log(ŷj)
cost function: J = −(1/n) Σ(i=1 to n) Σ(j=1 to k) yij log(ŷij)

where

● k is the number of classes,
● y = actual value
● ŷ – neural network prediction

Note – In multiclass classification, use the softmax activation function at the last neuron.

If the problem statement has 3 classes, the softmax activation is
f(z1) = e^z1 / (e^z1 + e^z2 + e^z3)
When to use categorical cross-entropy and sparse categorical cross-entropy?
If the target column is one-hot encoded, like 0 0 1, 0 1 0, 1 0 0, use categorical cross-entropy. If the target column is numerically encoded, like 1, 2, 3, 4, ..., n, use sparse categorical cross-entropy.

Which is Faster?
Sparse categorical cross-entropy is generally faster and more memory efficient than categorical cross-entropy, since it works directly with integer class labels instead of one-hot vectors.
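The classification losses above can be sketched in NumPy as follows; the softmax outputs and labels are made up for illustration. Note that the categorical and sparse categorical versions return the same value; the sparse one simply skips the one-hot encoding.

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # y in {0, 1}, y_hat is the sigmoid output in (0, 1)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    # y_onehot is one-hot encoded, y_hat is the softmax output (rows sum to 1)
    return -np.mean(np.sum(y_onehot * np.log(np.clip(y_hat, eps, 1.0)), axis=1))

def sparse_categorical_cross_entropy(y_idx, y_hat, eps=1e-12):
    # same loss, but the target is the integer class index
    rows = np.arange(len(y_idx))
    return -np.mean(np.log(np.clip(y_hat[rows, y_idx], eps, 1.0)))

y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])
y_idx = np.array([0, 1])
print(categorical_cross_entropy(y_onehot, y_hat))
print(sparse_categorical_cross_entropy(y_idx, y_hat))   # same value, cheaper to compute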
Regularization
Regularization works by adding a penalty or complexity term to a complex model. Let's consider the simple linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
In the above equation, y represents the value to be predicted; x1, x2, ..., xn are the features for y; β0, β1, ..., βn are the weights or magnitudes attached to the features; and b represents the intercept (bias) of the model.
Linear regression models try to optimize the weights and b to minimize the cost function. The cost function for the linear model is given below:
Cost = Σi (yi − ŷi)²
We then add a penalty term and optimize the parameters so that the model can predict accurate values of y. This base loss function for linear regression is called the RSS or Residual Sum of Squares.

Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:

o Ridge Regression
o Lasso Regression

Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It
is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. It is calculated by multiplying lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:
Cost = Σi (yi − ŷi)² + λ Σj βj²
o In the above equation, the penalty term regularizes the coefficients of the model; hence, ridge regression reduces the amplitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation becomes the cost
function of the linear regression model. Hence, for the minimum value of λ, the model will resemble
the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains the absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a coefficient exactly to 0, whereas Ridge Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression will be:
Cost = Σi (yi − ŷi)² + λ Σj |βj|

o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well as the feature
selection.

Key Difference between Ridge Regression and Lasso Regression


o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients.

Lasso regression helps to reduce overfitting in the model and also performs feature selection.
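The contrast between the two penalties can be seen on synthetic data with scikit-learn (assuming it is installed); the data, the alpha values and the "true" coefficients below are made up for illustration.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set coefficients exactly to zero

print(np.round(ridge.coef_, 2))      # all features keep (small) nonzero weights
print(np.round(lasso.coef_, 2))      # irrelevant features tend to be driven to exactly 0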

Optimization techniques for Gradient Descent


Gradient Descent is a widely used optimization algorithm for machine learning models. However, there are
several optimization techniques that can be used to improve the performance of Gradient Descent. Here are
some of the most popular optimization techniques for Gradient Descent:

Learning Rate Scheduling: The learning rate determines the step size of the Gradient Descent algorithm. Learning rate scheduling involves changing the learning rate during the training process, such as decreasing the learning rate as the number of iterations increases. This technique helps the algorithm converge faster and avoid overshooting the minimum.

Momentum-based Updates: The Momentum-based Gradient Descent technique involves adding a fraction of
the previous update to the current update. This technique helps the algorithm to overcome local minima and
accelerates convergence.

Batch Normalization: Batch Normalization is a technique used to normalize the inputs to each layer of the
neural network. This helps the Gradient Descent algorithm to converge faster and avoid vanishing or exploding
gradients.

Weight Decay: Weight Decay is a regularization technique that involves adding a penalty term to the cost
function proportional to the magnitude of the weights. This helps to prevent overfitting and improve the
generalization of the model.

Adaptive Learning Rates: Adaptive Learning Rate techniques involve adjusting the learning rate adaptively
during the training process. Examples include Adagrad, RMSprop, and Adam. These techniques adjust the
learning rate based on the historical gradient information, which can improve the convergence speed and
accuracy of the algorithm.
Second-Order Methods: Second-Order Methods use the second-order derivatives of the cost function to
update the parameters. Examples include Newton’s Method and Quasi-Newton Methods. These methods can
converge faster than Gradient Descent, but require more computation and may be less stable.

Gradient Descent is an iterative optimization algorithm used to find the minimum of a function. The general idea is to initialize the parameters to random values and then take small steps in the direction of the negative “slope” (gradient) at each iteration. Gradient descent is widely used in supervised learning to minimize the error function and find the optimal values for the parameters. Various extensions have been designed for the gradient descent algorithm; some of them are discussed below.

Momentum method: This method is used to accelerate the gradient descent algorithm by taking into consideration the exponentially weighted average of the gradients. Using averages makes the algorithm converge towards the minimum faster, as the gradients in the uncommon directions are canceled out. The pseudocode for the momentum method is given below.
V = 0
for each iteration i:
    compute dW
    V = β V + (1 − β) dW
    W = W − α V
V and dW are analogous to velocity and acceleration respectively. α is the learning rate, and β is analogous to momentum, normally kept at 0.9. The physics interpretation is that the velocity of a ball rolling downhill builds up momentum according to the direction of the slope (gradient) of the hill, and this helps the ball arrive faster at a minimum value (in our case, at a minimum loss).
RMSprop: RMSprop was proposed by the University of Toronto's Geoffrey Hinton. The intuition is to apply an exponentially weighted average to the second moment of the gradients (dW²). The pseudocode for this is as follows:
S = 0
for each iteration i:
    compute dW
    S = β S + (1 − β) dW²
    W = W − α dW / (√S + ε)

Adam Optimization: The Adam optimization algorithm incorporates the momentum method and RMSprop, along with bias correction. The pseudocode for this approach is as follows:
V = 0
S = 0
for each iteration i:
    compute dW
    V = β1 V + (1 − β1) dW
    S = β2 S + (1 − β2) dW²
    V = V / (1 − β1^i)
    S = S / (1 − β2^i)
    W = W − α V / (√S + ε)
Kingma and Ba, the proposers of Adam, recommended the following values for the hyperparameters:
α = 0.001
β1 = 0.9
β2 = 0.999
ε = 10⁻⁸
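The three update rules above translate almost line for line into NumPy; this sketch applies one step to a single weight with an assumed gradient value.

import numpy as np

def momentum_update(W, dW, V, alpha=0.01, beta=0.9):
    V = beta * V + (1 - beta) * dW
    return W - alpha * V, V

def rmsprop_update(W, dW, S, alpha=0.01, beta=0.9, eps=1e-8):
    S = beta * S + (1 - beta) * dW ** 2
    return W - alpha * dW / (np.sqrt(S) + eps), S

def adam_update(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    V = beta1 * V + (1 - beta1) * dW
    S = beta2 * S + (1 - beta2) * dW ** 2
    V_hat = V / (1 - beta1 ** t)              # bias correction, t starts at 1
    S_hat = S / (1 - beta2 ** t)
    return W - alpha * V_hat / (np.sqrt(S_hat) + eps), V, S

W, V, S = 1.0, 0.0, 0.0
W, V, S = adam_update(W, 0.5, V, S, t=1)      # one Adam step with gradient dW = 0.5
print(W)                                      # weight nudged downhill by roughly alpha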

Optimisers :
Optimizers are algorithms or methods used to change the attributes of your neural network such as weights
and learning rate in order to reduce the losses.

Gradient Computation:

Gradient Descent is the most basic but most used optimization algorithm. It is used heavily in linear regression and classification algorithms; backpropagation in neural networks also uses a gradient descent algorithm.
Gradient descent is a first-order optimization algorithm, i.e. it depends on the first-order derivative of the loss function. It calculates which way the weights should be altered so that the function can reach a minimum. Through backpropagation, the loss is transferred from one layer to another, and the model's parameters (weights) are modified depending on the loss so that it can be minimized.

algorithm: θ = θ − α ⋅ ∇J(θ)

Advantages:

1. Easy computation.
2. Easy to implement.
3. Easy to understand.

Disadvantages:

1. May get trapped at local minima.
2. Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is very large it may take a very long time to converge to the minimum.
3. Requires large memory to calculate the gradient on the whole dataset.
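A minimal sketch of the update rule θ = θ − α ⋅ ∇J(θ) on a synthetic linear regression problem (data made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.05 * rng.normal(size=200)

theta = np.zeros(3)
alpha = 0.1                                   # learning rate
for _ in range(500):
    grad = X.T @ (X @ theta - y) / len(y)     # gradient of (mean squared error / 2)
    theta = theta - alpha * grad              # theta = theta - alpha * grad J(theta)
print(np.round(theta, 2))                     # close to [ 2.  -1.   0.5]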
