AD3501-DL-Unit 2
• Input Layers: It’s the layer in which we give input to our model. The number of neurons in this layer
is equal to the total number of features in our data (number of pixels in the case of an image).
• Hidden Layer: The input from the Input layer is then fed into the hidden layer. There can be many
hidden layers depending upon our model and data size. Each hidden layer can have a different number
of neurons, which is generally greater than the number of features. The output from each layer is
computed by matrix multiplication of the output of the previous layer with the learnable weights of that
layer, followed by the addition of learnable biases and then an activation function, which makes the
network nonlinear.
• Output Layer: The output from the hidden layer is then fed into a logistic function such as sigmoid or
softmax, which converts the output for each class into a probability score.
The data is fed into the model, and obtaining the output from each layer as in the steps above is called the feed-forward
pass. We then calculate the error using an error function; some common error functions are cross-entropy,
squared loss, etc. The error function measures how well the network is performing. After that, we back-propagate
through the model by calculating the derivatives. This step is called back-propagation, and it is used to
minimize the loss.
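As a minimal sketch of one feed-forward pass, error computation, and back-propagation step, here is a tiny two-layer network in NumPy. The layer sizes, input values, and learning rate are arbitrary illustration values, not taken from the notes:

```python
import numpy as np

# Toy network: 2 inputs -> 3 hidden units (ReLU) -> 1 output (sigmoid)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -1.2]])   # one training sample
y = np.array([[1.0]])         # its label

# Feed forward: matrix multiply, add bias, apply activation at each layer
h = np.maximum(0, x @ W1 + b1)          # hidden layer (ReLU)
y_hat = sigmoid(h @ W2 + b2)            # output layer (sigmoid)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # cross-entropy
print(loss.item())

# Back propagation: the chain rule gives gradients of the loss w.r.t. each weight
dz2 = y_hat - y                         # dL/dz for sigmoid + cross-entropy
dW2 = h.T @ dz2
dh = dz2 @ W2.T
dz1 = dh * (h > 0)                      # ReLU derivative
dW1 = x.T @ dz1

# One gradient-descent step to reduce the loss
lr = 0.1
W2 -= lr * dW2; b2 -= lr * dz2.sum(0)
W1 -= lr * dW1; b1 -= lr * dz1.sum(0)
```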
CNNs were first developed and deployed around the 1980s. At the time, a CNN could only detect handwritten
digits, and CNNs were primarily used to read ZIP and PIN codes in the postal industry. A common aspect of
any AI model is that it requires a massive amount of data to train. This was one of the biggest problems
CNNs faced at the time, and due to this, they were mostly confined to the postal industry.
Yann LeCun was the first to introduce convolutional neural networks.
Convolutional Neural Networks, commonly referred to as CNNs, are a specialized kind of neural network
architecture that is designed to process data with a grid-like topology. This makes them particularly well-suited
for dealing with spatial and temporal data, like images and videos that maintain a high degree of correlation
between adjacent elements.
CNNs are similar to other neural networks, but they have an added layer of complexity due to the fact that
they use a series of convolutional layers. Convolutional layers perform a mathematical operation called
convolution, a sort of specialized matrix multiplication, on the input data. The convolution operation helps
to preserve the spatial relationship between pixels by learning image features using small squares of input data.
The picture below represents a typical CNN architecture.
• Convolutional layers
Convolutional layers operate by sliding a set of ‘filters’ or ‘kernels’ across the input data. Each filter is designed
to detect a specific feature or pattern, such as edges, corners, or more complex shapes in the case of deeper
layers. As these filters move across the image, they generate a map that signifies the areas where those features
were found. The output of the convolutional layer is a feature map, which is a representation of the input
image with the filters applied. Convolutional layers can be stacked to create more complex models, which
can learn more intricate features from images. Simply speaking, convolutional layers are responsible for
extracting features from the input images. These features might include edges, corners, textures, or more
complex patterns.
• Pooling layers
Pooling layers follow the convolutional layers and are used to reduce the spatial dimension of the input, making
it easier to process and requiring less memory. In the context of images, “spatial dimensions” refer to the
width and height of the image. An image is made up of pixels, and you can think of it like a grid, with rows
and columns of tiny squares (pixels). By reducing the spatial dimensions, pooling layers help reduce the
number of parameters or weights in the network. This helps to combat over-fitting and helps train the model
faster. Max pooling reduces computational complexity, owing to the reduced size of the
feature map, and makes the model invariant to small translations. Without max pooling, the network would
not gain the ability to recognize features irrespective of small shifts or
rotations. This would make the model less robust to variations in object positioning within the image, possibly
affecting accuracy.
There are two main types of pooling: max pooling and average pooling. Max pooling takes the maximum value
within each pooling window of the feature map. For example, if the pooling window size is 2×2, it picks the pixel with the
highest value in that 2×2 region. Max pooling effectively captures the most prominent feature or characteristic
within the pooling window. Average pooling calculates the average of all values within the pooling window.
It provides a smooth, average feature representation.
The combination of a convolution layer followed by a max-pooling layer, repeated in similar sets, creates a hierarchy
of features. The first layer detects simple patterns, and subsequent layers build on those to detect more complex
patterns.
CNNs are often used for image recognition and classification tasks. For example, CNNs can be used to identify
objects in an image or to classify an image as being a cat or a dog. CNNs can also be used for more complex
tasks, such as generating descriptions of an image or identifying the points of interest in an image. Beyond
image data, CNNs can also handle time-series data, such as audio data or even text data, although other types
of networks like Recurrent Neural Networks (RNNs) or transformers are often preferred for these scenarios.
CNNs are a powerful tool for deep learning, and they have been used to achieve state-of-the-art results in
many different applications.
Fig. 3 Functions of CNN Layers
LeNet: LeNet is the first CNN architecture. It was developed in 1998 by Yann LeCun and his collaborators
for handwritten digit recognition problems. LeNet was one of the first successful CNNs
and is often considered the “Hello World” of deep learning. It is one of the earliest and most widely used CNN
architectures and has been successfully applied to tasks such as handwritten digit recognition. The LeNet-5
architecture consists of two convolutional layers, each followed by a pooling (subsampling) layer, and then
fully connected layers. LeNet was the beginning of CNNs in deep learning for computer vision problems.
Deep networks of this kind are difficult to train because of the vanishing-gradients problem; in LeNet, pooling
(subsampling) layers are used between convolutional layers to reduce the spatial size of the feature maps,
which helps prevent overfitting and allows CNNs to train more effectively. The diagram below represents
the LeNet-5 architecture.
The LeNet CNN is a simple yet powerful model that has been used for various tasks such as handwritten digit
recognition, traffic sign recognition, and face detection. Although LeNet was developed more than 20 years
ago, its architecture is still relevant today and continues to be used.
AlexNet: AlexNet is the deep learning architecture that popularized CNN. It was developed by Alex
Krizhevsky, Ilya Sutskever, and Geoff Hinton. AlexNet network had a very similar architecture to LeNet, but
was deeper, bigger, and featured Convolutional Layers stacked on top of each other. AlexNet was the first
large-scale CNN and was used to win the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in
2012. The AlexNet architecture was designed to be used with large-scale image datasets and it achieved state-
of-the-art results at the time of its publication. AlexNet is composed of 5 convolutional layers with a
combination of max-pooling layers, 3 fully connected layers, and 2 dropout layers. The activation function
used in all hidden layers is ReLU, while the output layer uses Softmax. The total number of
parameters in this architecture is around 60 million.
ZF Net: ZFnet is the CNN architecture that uses a combination of fully-connected layers and CNNs. ZF Net
was developed by Matthew Zeiler and Rob Fergus. It was the ILSVRC 2013 winner. The network has
relatively fewer parameters than AlexNet, but still outperforms it on ILSVRC 2012 classification task by
achieving top accuracy with only 1000 images per class. It was an improvement on AlexNet by tweaking the
architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and
making the stride and filter size on the first layer smaller. It is based on the Zeiler and Fergus model, which
was trained on the ImageNet dataset. The ZF Net architecture consists of a total of seven layers: a
convolutional layer, a max-pooling layer (downscaling), a concatenation layer, a convolutional layer with a linear
activation function and stride one, and dropout applied for regularization before the fully connected
output. This CNN model is computationally more efficient than AlexNet and introduces an approximate
inference stage through deconvolutional layers in the middle of the network.
GoogLeNet: GoogLeNet is the CNN architecture used by Google to win the ILSVRC 2014 classification task.
It was developed by Christian Szegedy and colleagues at Google. It has been shown to have a
notably reduced error rate in comparison with the previous winners AlexNet (ILSVRC 2012 winner) and ZF-Net
(ILSVRC 2013 winner), and its error is also significantly lower than that of VGG (the 2014 runner-up). It
achieves deeper architecture by employing a number of distinct techniques, including 1×1 convolution and
global average pooling. GoogleNet CNN architecture is computationally expensive. To reduce the parameters
that must be learned, it uses heavy unpooling layers on top of CNNs to remove spatial redundancy during
training and also features shortcut connections between the first two convolutional layers before adding new
filters in later CNN layers. Real-world applications/examples of GoogLeNet CNN architecture include Street
View House Number (SVHN) digit recognition task, which is often used as a proxy for roadside object
detection. Below is the simplified block diagram representing GoogLeNet CNN architecture:
VGGNet: VGGNet is the CNN architecture that was developed by Karen Simonyan and Andrew Zisserman
at Oxford University. VGG16 is a 16-layer CNN with roughly 138 million parameters, trained on the ImageNet
dataset of over one million images spanning 1000 classes. It takes input images of 224 x 224 pixels and produces
4096-dimensional feature vectors in its fully connected layers. CNNs with so many parameters are expensive to
train and require a lot of data, which is one reason why architectures like GoogLeNet can work better than VGGNet for
many image classification tasks where input images have a size between 100 x 100 and 350 x 350 pixels.
Real-world applications/examples of the VGGNet CNN architecture include the ILSVRC 2014 classification
task, in which VGGNet was the runner-up to GoogLeNet. Although computationally heavy, the VGG model
serves as a strong baseline for many applications in computer vision due to its applicability to numerous
tasks, including object detection. Its deep feature representations are used across multiple neural network
architectures like YOLO, SSD, etc. The diagram below represents the standard VGG16 network architecture
diagram:
MobileNets: MobileNets are CNNs that can fit on a mobile device to classify images or detect objects with
low latency. MobileNets were developed by Andrew G. Howard et al. They are usually very small CNN
architectures, which makes them easy to run in real time on embedded devices like smartphones and
drones. The architecture is also flexible: it has been tested on CNNs with 100-300 layers and still works
better than other architectures like VGGNet. Real-life examples of the MobileNets CNN architecture include
the CNNs built into Android phones to run Google’s Mobile Vision API, which can automatically identify
labels of popular objects in images.
GoogLeNet_DeepDream: GoogLeNet_DeepDream is a DeepDream CNN architecture that was developed by
Alexander Mordvintsev, Christopher Olah, et al. It uses the Inception network to generate images based
on CNN features. The architecture is often used with the ImageNet dataset to generate psychedelic images or
to create abstract artworks; work in this direction was presented at the ICLR 2017 workshop by David Ha et al.
To summarize the different types of CNN architectures described above in an easy to remember form, you can
use the following:
Table 1. Different Types of CNN Architectures
Architecture | Year | Key Features | Use Case
LeNet | 1998 | Early CNN with alternating convolutional and pooling layers | Handwritten digit recognition
AlexNet | 2012 | 5 convolutional + 3 fully connected layers, ReLU, dropout, ~60M parameters | ILSVRC 2012 winner; large-scale image classification
ZF Net | 2013 | AlexNet with tuned hyperparameters (smaller first-layer filters and stride) | ILSVRC 2013 winner
GoogLeNet | 2014 | Inception modules, 1×1 convolutions, global average pooling | ILSVRC 2014 winner; e.g., SVHN digit recognition
VGGNet | 2014 | 16-layer deep network with a very large number of parameters | ILSVRC 2014 runner-up; feature extractor for detection (YOLO, SSD)
MobileNets | 2017 | Very small, low-latency architectures for embedded devices | On-device vision, e.g., Google’s Mobile Vision API
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel
on it, with say, K outputs and representing them vertically. Now slide that neural network across the whole
image; as a result, we will get another image with a different width, height, and depth. Instead of just the R, G,
and B channels, we now have more channels but a smaller width and height. This operation is called
convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of
this small patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the
same depth as that of input volume (3 if the input layer is image input).
For example, if we run convolution on an image with dimensions 34x34x3, the possible filter sizes
are a x a x 3, where ‘a’ can be anything like 3, 5, or 7, but small compared to the image dimension.
During the forward pass, we slide each filter across the whole input volume step by step where each step is
called stride (which can have a value of 2, 3, or even 4 for high-dimensional images) and compute the dot
product between the kernel weights and patch from input volume.
As we slide our filters, we get a 2-D output for each filter; stacking these outputs together gives an
output volume with a depth equal to the number of filters. The network will learn all the filters.
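As a small sketch of this in PyTorch (the choice of 6 filters of size 5x5x3 with stride 1 is an illustrative assumption; the notes only require the filter to be smaller than the 34x34x3 image), the depth of the output volume equals the number of filters:

```python
import torch
import torch.nn as nn

# One 34x34 RGB image; PyTorch layout is (N, C, H, W)
x = torch.randn(1, 3, 34, 34)

# 6 learnable filters of size 5x5x3 (depth matches the input depth), stride 1
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1)

out = conv(x)
print(out.shape)   # torch.Size([1, 6, 30, 30]) -> depth = number of filters
```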
1.3.1 Layers used to build ConvNets
A complete Convolutional Neural Network architecture is also known as a ConvNet. A ConvNet is a
sequence of layers, and every layer transforms one volume to another through a differentiable function.
Let’s take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
• Input Layers: It’s the layer in which we give input to our model. In CNN, Generally, the input will be
an image or a sequence of images. This layer holds the raw input of the image with width 32, height
32, and depth 3.
• Convolutional Layers: This is the layer used to extract features from the input dataset.
It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are
small matrices, usually of shape 2×2, 3×3, or 5×5. Each kernel slides over the input image data and computes the
dot product between the kernel weights and the corresponding input image patch. The output of this layer
is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we’ll get an output
volume of dimension 32 x 32 x 12.
• Activation Layer: By adding an activation function to the output of the preceding layer, activation
layers add nonlinearity to the network. It applies an element-wise activation function to the output
of the convolution layer. Some common activation functions are ReLU, Tanh, Leaky ReLU, etc.
The volume remains unchanged; hence, the output volume will have dimensions 32 x 32 x 12.
• Pooling Layer: This layer is periodically inserted in the ConvNet, and its main function is to reduce the
size of the volume, which speeds up computation, reduces memory, and also prevents over-fitting.
Two common types of pooling layers are max pooling and average pooling. If we use a max pool
with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16x16x12.
• Flattening: The resulting feature maps are flattened into a one-dimensional vector after the
convolution and pooling layers so they can be passed into a fully connected layer for classification
or regression.
• Fully Connected Layers: It takes the input from the previous layer and computes the final
classification or regression task.
• Output Layer: The output from the fully connected layers is then fed into a logistic function, such as
sigmoid or softmax for classification tasks, which converts the output for each class into a probability
score. A minimal sketch of this layer stack is shown below.
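Here is a minimal PyTorch version of the stack described above, assuming 'same' padding in the convolution (so the 12-filter output stays 32x32, as in the example) and an arbitrary 10-class output; neither value is fixed by the notes:

```python
import torch
import torch.nn as nn

# Sketch of the layer stack described above for a 32x32x3 input
model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),  # conv: 12 filters -> 32x32x12
    nn.ReLU(),                                   # activation: shape unchanged
    nn.MaxPool2d(kernel_size=2, stride=2),       # pooling: -> 16x16x12
    nn.Flatten(),                                # flatten: -> 16*16*12 = 3072
    nn.Linear(16 * 16 * 12, 10),                 # fully connected layer
    nn.Softmax(dim=1),                           # class probability scores
)

x = torch.randn(1, 3, 32, 32)     # one 32x32 RGB image
print(model(x).shape)             # torch.Size([1, 10])
```

In practice the softmax is often folded into the loss function (e.g., cross-entropy on logits) during training, but it is kept here to mirror the output layer described in the notes.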
2. Convolution Operation:
A convolutional neural network, or ConvNet, is just a neural network that uses convolution. To understand the
principle, we are going to work with a 2-dimensional convolution first.
Convolution is a mathematical operation that allows the merging of two sets of information. In mathematics,
convolution between two functions produces a third function expressing how the shape of one function is
modified by the other. In the case of a CNN, convolution is applied to the input data to filter the information and
produce a feature map.
This filter is also called a kernel, or feature detector, and its dimensions can be, for example, 3x3. A kernel is
a small 2D matrix whose contents are based upon the operations to be performed. The kernel is applied to the input
image by element-wise multiplication and summation; the output obtained is of lower dimension and therefore
easier to work with.
Above is an example of a kernel for applying Gaussian blur (to smoothen the image before processing),
sharpening (to enhance the depth of edges), and edge detection. To perform convolution, the kernel goes over
the input image, doing element-wise multiplication and summation. The result for each receptive field (the area
where convolution takes place) is written down in the feature map.
The shape of a kernel is heavily dependent on the input shape of the image and the architecture of the entire
network; mostly, the kernel size is (M x M), i.e., a square matrix. The movement of a kernel is always from
left to right and top to bottom.
Stride defines the step by which the kernel moves; for example, a stride of 1 makes the kernel slide one
row/column at a time, while a stride of 2 moves the kernel by 2 rows/columns at each step. We continue sliding
the filter until the feature map is complete.
For input images with 3 or more channels such as RGB a filter is applied. Filters are one dimension higher
than kernels and can be seen as multiple kernels stacked on each other where every kernel is for a particular
channel. Therefore for an RGB image of (32x32) we have a filter of the shape say (5x5x3).
Here the input matrix has shape 4x4x1 and the kernel is of size 3x3. Since the shape of the input is larger than the
kernel, we can implement a sliding window protocol and apply the kernel over the entire input. The first entry
in the convolved result is calculated as:
45*0+12*(-1)+ 5*0+22*(-1)+10*5+35*(-1)+88*0+26*(-1)+51*0 = -45
We continue sliding the filter until the feature map is complete.
2.1 Sliding window protocol:
1. The kernel gets into position at the top-left corner of the input matrix.
2. Then it starts moving left to right, calculating the dot product and saving it to a new matrix until it
has reached the last column.
3. Next, the kernel resets its position at the first column, but now it slides one row down, thus
following the fashion left-to-right and top-to-bottom.
4. Steps 2 and 3 are repeated until the entire input has been processed.
For a 3D input matrix the movement of the kernel will be from front to back, left to right and top to bottom.
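A minimal NumPy sketch of this sliding-window computation is given below. The full 4x4 input is not reproduced in the notes, so the demo only applies the kernel implied by the arithmetic above (multipliers 0, -1, 0, -1, 5, -1, 0, -1, 0, i.e., a sharpen kernel) to the first 3x3 receptive field:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid cross-correlation: slide the kernel left-to-right, top-to-bottom."""
    h, w = image.shape
    f = kernel.shape[0]
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

# First 3x3 receptive field from the worked example above, with the sharpen kernel
patch = np.array([[45, 12,  5],
                  [22, 10, 35],
                  [88, 26, 51]])
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]])
print(conv2d(patch, kernel))   # [[-45.]] -> matches the hand computation
```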
In practice, we don’t explicitly define the filters that our convolutional layer will use; we instead parameterize
the filters and let the network learn the best filters to use during training. We do, however, define how many
filters we’ll use at each layer, a hyperparameter called the depth of the output volume.
Another hyperparameter is the stride, which defines how much we slide the filter over the data. For example,
if the stride is 1, we move the window by 1 pixel at a time over the image. When we use larger stride values
of 2 or 3, the filter jumps 2 or 3 pixels at a time, which significantly reduces the output size.
The last hyperparameter is the amount of zero-padding, as it is sometimes convenient to pad the input volume
with zeros around the border.
So now we can compute the spatial size of the output volume as a function of the input volume size (W), the
receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the
amount of zero padding used (P) on the border. The formula for calculating how many neurons “fit” along
one spatial dimension is given by
(W − F + 2P)/S + 1
In our previous example, for the 5x5 input (W=5) and the 2x2 filter (F=2) with stride 1 (S=1) and pad 0 (P=0),
we get (5 − 2 + 0)/1 + 1 = 4, i.e., a 4x4 x (number of filters) output for each network node.
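A tiny helper (the function name here is hypothetical, just for illustration) makes this formula concrete:

```python
def conv_output_size(W, F, S=1, P=0):
    """Output positions along one spatial dimension: floor((W - F + 2P)/S) + 1."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=5, F=2, S=1, P=0))   # 4 -> the 4x4 output above
print(conv_output_size(W=7, F=3, S=2, P=0))   # 3 -> the 7x7 example later in these notes
```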
2.2 Sparse Interactions
Traditional neural network layers use matrix multiplication by a matrix of parameters describing the interaction
between every input unit and every output unit; this means that every output unit interacts with every input unit.
However, convolutional neural networks have sparse interactions. This is achieved by making the kernel smaller
than the input: an image can have millions or thousands of pixels, but while processing it with a kernel
we can detect meaningful information that spans only tens or hundreds of pixels. This means that we need to store
fewer parameters, which not only reduces the memory requirement of the model but also improves the
statistical efficiency of the model.
2.3 Parameter (Weight) Sharing
If computing one feature at a spatial point (x1, y1) is useful then it should also be useful at some other spatial
point say (x2, y2). It means that for a single two-dimensional slice i.e., for creating one activation map, neurons
are constrained to use the same set of weights. In a traditional neural network, each element of the weight matrix
is used once and then never revisited, while a convolutional network has shared parameters, i.e., for computing
each output, the weights applied to one input location are the same as the weights applied elsewhere. Parameter
sharing is used in the convolutional layers to reduce the number of parameters in the network. For example, in the
first convolutional layer, suppose we have an output of 15x15x4, where 15 is the spatial size of the output and 4 is
the number of filters used in this layer. For every output node in that layer we use the same filter, thus dramatically
reducing the storage requirements of the model to the size of the filter.
The same filter weights (e.g., 1, 0, -1) are used across that entire layer.
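To make the saving concrete, here is a small comparison (a sketch with illustrative layer sizes, not taken from the notes) of the parameter counts of a shared-weight convolutional layer versus a fully connected layer acting on a 32x32x3 input:

```python
import torch.nn as nn

# A 3x3 convolution with 4 output channels: the 3x3x3 weights are shared everywhere
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))   # 4*3*3*3 + 4 = 112 parameters

# A fully connected layer mapping the 32x32x3 input to a 30x30x4 output
fc = nn.Linear(32 * 32 * 3, 30 * 30 * 4)
print(sum(p.numel() for p in fc.parameters()))     # 11,062,800 parameters
```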
2.4 Equivariant Representations
Equivariant means varying in a similar or equivalent proportion. Due to parameter sharing, the layers of a
convolutional neural network have the property of equivariance to translation: if we shift the input in a certain
way, the output shifts in the same way.
Equivariance to translation means that a translation of input features results in an equivalent translation of outputs.
(Note that convolution is equivariant to translation, not to rotation or scale changes.) This equivariance allows the
network to generalize edge, texture, and shape detection to different locations.
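A quick NumPy check of translation equivariance (a sketch using an arbitrary 1-D signal and a simple edge-detector kernel): shifting the input and then convolving gives the same result as convolving and then shifting, provided the signal has enough zero padding at the ends.

```python
import numpy as np

def conv1d_valid(x, k):
    """1-D valid cross-correlation."""
    return np.array([np.dot(x[i:i+len(k)], k) for i in range(len(x) - len(k) + 1)])

x = np.array([0., 0., 1., 2., 1., 0., 0., 0., 0., 0.])
k = np.array([1., 0., -1.])            # simple edge-detector kernel

shifted_x = np.roll(x, 2)              # translate the input by 2 positions
print(conv1d_valid(shifted_x, k))      # convolve the shifted input
print(np.roll(conv1d_valid(x, k), 2))  # shift the convolved output: same values
```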
In some cases, we may not wish to share parameters across the entire image:
• If image is cropped to be centered on a face, we may want different features from different parts of
the face
• Part of the network processing the top of the face looks for eyebrows
• Part of the network processing the bottom of the face looks for the chin
• Certain image operations such as scale and rotation are not equivariant to convolution
• Other mechanisms are needed for such transformations
2.5 Pooling
The pooling operation involves sliding a two-dimensional filter over each channel of feature map and
summarizing the features lying within the region covered by the filter.
For a feature map having dimensions nh x nw x nc, the dimensions of the output obtained after a pooling layer are
[(nh − f)/s + 1] x [(nw − f)/s + 1] x nc
where,
• nh – height of the feature map
• nw – width of the feature map
• nc – number of channels in the feature map
• f – size of the pooling filter
• s – stride length
A common CNN model architecture is to have a number of convolution and pooling layers stacked one after the
other.
Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters
to learn and the amount of computation performed in the network.
The pooling layer summarizes the features present in a region of the feature map generated by a convolution
layer. So, further operations are performed on summarized features instead of precisely positioned features
generated by the convolution layer. This makes the model more robust to variations in the position of the features
in the input image.
• Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered
by the filter. Thus, the output after max-pooling layer would be a feature map containing the most prominent
features of the previous feature map.
• Average Pooling
Average pooling computes the average of the elements present in the region of feature map covered by the filter.
Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average
pooling gives the average of features present in a patch.
• Global Pooling
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is
reduced to 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw i.e. the dimensions of
the feature map. Further, it can be either global max pooling or global average pooling.
• Global Average Pooling
Considering a tensor of shape h*w*n, the output of the Global Average Pooling layer is a single value across
h*w that summarizes the presence of the feature. Instead of downsizing the patches of the input feature map, the
Global Average Pooling layer downsizes the whole h*w into 1 value by taking the average.
• Global Max Pooling
With the tensor of shape h*w*n, the output of the Global Max Pooling layer is a single value across h*w that
summarizes the presence of a feature. Instead of downsizing the patches of the input feature map, the Global Max
Pooling layer downsizes the whole h*w into 1 value by taking the maximum.
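In PyTorch terms, these pooling variants look like the sketch below; the 12-channel, 16x16 feature map matches the earlier worked example, while the specific module choices are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 12, 16, 16)        # a feature map: 12 channels, 16x16

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
global_avg = nn.AdaptiveAvgPool2d(1)  # global average pooling -> 1x1 per channel

print(max_pool(x).shape)    # torch.Size([1, 12, 8, 8])
print(avg_pool(x).shape)    # torch.Size([1, 12, 8, 8])
print(global_avg(x).shape)  # torch.Size([1, 12, 1, 1])
```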
In convolutional neural networks (CNNs), the pooling layer is a common type of layer that is typically added
after convolutional layers. The pooling layer is used to reduce the spatial dimensions (i.e., the width and height)
of the feature maps, while preserving the depth (i.e., the number of channels).
• The pooling layer works by dividing the input feature map into a set of non-overlapping regions, called
pooling regions. Each pooling region is then transformed into a single output value, which represents the
presence of a particular feature in that region. The most common types of pooling operations are max
pooling and average pooling.
• In max pooling, the output value for each pooling region is simply the maximum value of the input values
within that region. This has the effect of preserving the most salient features in each pooling region, while
discarding less relevant information. Max pooling is often used in CNNs for object recognition tasks, as
it helps to identify the most distinctive features of an object, such as its edges and corners.
• In average pooling, the output value for each pooling region is the average of the input values within that
region. This has the effect of preserving more information than max pooling, but may also dilute the most
salient features. Average pooling is often used in CNNs for tasks such as image segmentation and object
detection, where a more fine-grained representation of the input is required.
Pooling layers are typically used in conjunction with convolutional layers in a CNN, with each pooling layer
reducing the spatial dimensions of the feature maps, while the convolutional layers extract increasingly complex
features from the input. The resulting feature maps are then passed to a fully connected layer, which performs the
final classification or regression task.
2.5.2 Advantages of Pooling Layer
Dimensionality reduction: The main advantage of pooling layers is that they help in reducing the spatial
dimensions of the feature maps. This reduces the computational cost and also helps in avoiding over-fitting by
reducing the number of parameters in the model.
Translation invariance: Pooling layers are also useful in achieving translation invariance in the feature maps. This
means that the position of an object in the image does not affect the classification result, as the same features are
detected regardless of the position of the object.
Feature selection: Pooling layers can also help in selecting the most important features from the input, as max
pooling selects the most salient features and average pooling preserves more information.
3. Convolution Variants
The goal of a CNN is to transform the input image into concise abstract representations of the original input. The
individual convolutional layers try to find more complex patterns from the previous layer’s observations. The
logic is that 10 curved lines would form two ellipses, which would make an eye.
To do this, each layer uses a kernel, usually a 2x2 or 3x3 matrix, that slides through the previous layer’s output
to generate a new output. The word convolve from convolution means to roll or slide.
The variants of convolution operations are as follows:
In this example we convolve a (7 x 7) matrix with a (3 x 3) matrix and we get a (3 x 3) output. The
input and output dimensions turn out to be governed by the following formula:
(n − f + 2p)/s + 1
If we have an (n x n) image convolved with an (f x f) filter, and if we use a padding (p) and a stride (s), in
this example (s = 2), then we end up with an output of size ((n − f + 2p)/s + 1) x ((n − f + 2p)/s + 1). Because
we are stepping (s) steps at a time instead of just one step at a time, we divide by (s) and add 1. In our example,
we have (7 − 3 + 0)/2 + 1 = 4/2 + 1 = 3, which is why we end up with this (3 x 3) output. Notice that in the formula
above, we round the value of the fraction, which in general might not be an integer, down to the nearest integer.
Tiled convolution: Each convolution operation is effectively learning an additional feature (or map), which is a
learned representation of the training data, and tiled layers, like convolutional layers, still have a relatively small
number of learned parameters. In essence, it is the pooling operation over these multiple "tiled" maps that allows
the network to learn invariances over scaling and rotation.
Next, we allow the model to use multiple “maps,” so as to learn highly overcomplete representations. A
map is a set of pooling units and simple units that collectively cover the entire image (see
Figure 8, right). When varying the tiling size, we change the degree of weight tying within each map;
for example, if k = 1, the simple units within each map will have the same weights. In our model,
simple units in different maps are never tied. By having units in different maps learn different features,
our model can learn a rich and diverse set of features. Tiled CNNs with multiple maps enjoy the twin
benefits of (i) being able to represent complex invariances, by pooling over (partially) untied weights, and
(ii) having a relatively small number of learnable parameters.
Transposed convolution: When downsampling and upsampling operations are applied to transposed convolutional
layers, their effects are reversed. The reason is that a network can use convolutional layers to compress the image
and then use transposed convolutional layers, with exactly the same downsampling and upsampling settings, to
reconstruct the image.
When padding is ‘added’ to a transposed convolutional layer, it behaves as if padding were removed from the
input, and the resulting output becomes smaller.
Without padding, the output is 7x7, but with padding on both sides, it is 5x5. When strides are used, they
affect the input instead of the output.
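A short PyTorch sketch (assuming a 5x5 single-channel input and a 3x3 kernel, matching the sizes quoted above) shows the 7x7 versus 5x5 behavior:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)   # 5x5 input

# Transposed convolution with a 3x3 kernel "reverses" a 3x3 convolution
up = nn.ConvTranspose2d(1, 1, kernel_size=3)             # no padding
up_padded = nn.ConvTranspose2d(1, 1, kernel_size=3, padding=1)

print(up(x).shape)         # torch.Size([1, 1, 7, 7]) -> output grows to 7x7
print(up_padded(x).shape)  # torch.Size([1, 1, 5, 5]) -> 'adding' padding shrinks it to 5x5
```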
Dilated convolution helps expand the area of the input image covered without pooling. The objective is to cover
more information from the output obtained with every convolution operation. This method offers a wider field
of view at the same computational cost. We determine the value of the dilation factor (l) by seeing how much
information is obtained with each convolution on varying values of l.
By using this method, we are able to obtain more information without increasing the number of kernel
parameters. In Fig 9, the image on the left depicts dilated convolution. On keeping the value of l = 2, we skip 1
pixel (l – 1 pixel) while mapping the filter onto the input, thus covering more information in each step.
• Formula Involved:
(F *l k)(p) = Σ_{s + l·t = p} F(s) · k(t)
where,
F(s) = input
k(t) = applied filter
*l = l-dilated convolution
(F *l k)(p) = output
• Advantages of Dilated Convolution:
Using this method rather than normal convolution is better as:
1. Larger receptive field (i.e. no loss of coverage)
2. Computationally efficient (as it provides a larger coverage on the same computation cost)
3. Lower memory consumption (as it skips the pooling step)
4. No loss of resolution of the output image (as we dilate instead of performing pooling)
5. Structure of this convolution helps in maintaining the order of the data.
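A small PyTorch sketch (using an arbitrary 9x9 single-channel input) shows how a dilation factor of 2 enlarges the region covered by a 3x3 kernel without adding parameters:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 9, 9)

normal  = nn.Conv2d(1, 1, kernel_size=3, dilation=1)  # 9 weights, covers a 3x3 region
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)  # same 9 weights, covers a 5x5 region

print(normal(x).shape)   # torch.Size([1, 1, 7, 7])
print(dilated(x).shape)  # torch.Size([1, 1, 5, 5]) -> effective kernel size is 5
```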
4. CNN Learning
A neural network without an activation function is essentially just a linear regression model. The activation
function does the non-linear transformation to the input making it capable to learn and perform more complex
tasks.
Now we will look at the derivative of ReLU using the graph above. Let us see the derivative at different
values of x, for both positive and negative values of x.
Fig. 12 Derivative function of RELU
As we know, the derivative of a function is defined as the slope of the function at a certain point. The ReLU
function is differentiable almost everywhere: if x is greater than 0 the derivative is 1, and if x is less than 0 the
derivative is 0. But at x = 0 the derivative does not exist. There are two ways to deal with this. First, you can
arbitrarily assign a value to the derivative of y = f(x) at x = 0. A second alternative is to use, instead of the
actual y = f(x) function, an approximation to ReLU that is differentiable for all values of x. So is ReLU linear
or nonlinear? Mathematically, a function is linear if its slope is constant over its entire domain. The ReLU
function is non-differentiable at 0, and its slope is either 0 (for negative values) or 1 (for positive values), so the
slope is not constant. That is why the ReLU function is nonlinear. Intuitively, ReLU is an activation function,
and the purpose of an activation function is to introduce nonlinearity into the neural network.
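A tiny NumPy sketch of this rule, arbitrarily assigning the derivative the value 0 at x = 0 (the first of the two options mentioned above):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # Slope is 0 for x < 0 and 1 for x > 0; at x = 0 we arbitrarily use 0
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```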
Loss Function | Cost Function
Measures the error between predicted and actual values in a machine learning model | Quantifies the overall cost or error of the model on the entire training set
Used to optimize the model during training | Used to guide the optimization process by minimizing the cost or error
Can be specific to individual samples | Aggregates the loss values over the entire training set
Examples include mean squared error (MSE), mean absolute error (MAE), and binary cross-entropy | Often the average or sum of individual loss values in the training set
Used to evaluate model performance | Used to determine the direction and magnitude of parameter updates during optimization
Different loss functions can be used for different tasks or problem domains | Typically derived from the loss function, but can include additional regularization terms or other considerations
Loss Function in Deep Learning
➢ Regression
• MSE(Mean Squared Error)
• MAE(Mean Absolute Error)
• Huber loss
➢ Classification
• Binary cross-entropy
• Categorical cross-entropy
A. Regression Loss
1. MSE (Mean Squared Error)
MSE is the mean of the squared differences between the actual and predicted values.
• Advantage
o Easy to interpret
o Always differential because of the square
o Only one local minima
• Disadvantage
o The error is in squared units, which are not easy to interpret
o Not robust to outliers
2. MAE (Mean Absolute Error)
MAE is the mean of the absolute differences between the actual and predicted values.
• Advantage
o Intuitive and easy
o Error Unit Same as the output column
o Robust to outlier
• Disadvantage
o The graph is not differentiable at zero, so we cannot use gradient descent directly; instead, we
use subgradient calculations.
Note – In regression, use a linear activation function at the last neuron.
3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression, that is less sensitive to outliers
in data.
• Advantage
o Robust to outlier
o It lies between MAE and MSE
• Disadvantage
o Its main disadvantage is the associated complexity. In order to maximize model
accuracy, the hyperparameter δ will also need to be optimized which increases the
training requirements.
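Minimal NumPy versions of these three regression losses (δ = 1 is used here only as an illustrative Huber threshold):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    err = y - y_hat
    quad = 0.5 * err ** 2                        # MSE-like region (small errors)
    lin = delta * (np.abs(err) - 0.5 * delta)    # MAE-like region (outliers)
    return np.mean(np.where(np.abs(err) <= delta, quad, lin))

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
```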
B. Classification Loss
1. Binary Cross Entropy / log loss
It is used in binary classification problems, i.e., problems with two classes: for example, whether a person has
COVID or not, or whether an article becomes popular or not.
Binary cross-entropy compares each of the predicted probabilities to the actual class output, which can
be either 0 or 1. It then calculates a score that penalizes the probabilities based on their distance from
the expected value; in other words, it measures how close or far the prediction is from the actual value.
For N samples, the loss is
BCE = −(1/N) Σ_i [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]
where,
y_i – actual values
ŷ_i – neural network prediction
• Advantage
o The cost function is differentiable
• Disadvantage
o Multiple local minima
o Not intuitive
2. Categorical Cross-Entropy
It is used in multi-class classification problems. For one sample, the loss is
CCE = − Σ_{j=1}^{k} y_j · log(ŷ_j)
where,
k – number of classes
y – actual value (one-hot encoded)
ŷ – neural network prediction
Note – In multi-class classification, use the softmax activation function at the last neuron.
If the problem statement has 3 classes:
softmax activation: f(z1) = e^z1 / (e^z1 + e^z2 + e^z3)
If the target column is one-hot encoded (e.g., 0 0 1, 0 1 0, 1 0 0), use categorical cross-entropy. If the target
column has numerical encoding of the classes like 1, 2, 3, 4, …, n, use sparse categorical
cross-entropy. Sparse categorical cross-entropy is faster than categorical cross-entropy.
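Minimal NumPy sketches of these classification losses (the example probabilities and labels below are arbitrary illustration values):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    # y_onehot: one-hot targets, y_hat: softmax probabilities, both of shape (N, k)
    return -np.mean(np.sum(y_onehot * np.log(np.clip(y_hat, eps, 1.0)), axis=1))

def sparse_categorical_cross_entropy(y_idx, y_hat, eps=1e-12):
    # y_idx: integer class labels (0..k-1); avoids building the one-hot matrix
    n = len(y_idx)
    return -np.mean(np.log(np.clip(y_hat[np.arange(n), y_idx], eps, 1.0)))

y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])
print(categorical_cross_entropy(y_onehot, y_hat))
print(sparse_categorical_cross_entropy(np.array([0, 1]), y_hat))  # same value
```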
2. Gradient Descent: Gradient descent is an optimization algorithm that uses the gradients of the loss
function to iteratively update the model's parameters in a way that reduces the loss. The basic idea is to take
steps in the opposite direction of the gradient to reach a local minimum of the loss function. This process is
repeated until the algorithm converges to a set of parameter values that hopefully result in a well-trained
model.
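A minimal NumPy sketch of gradient descent, fitting a 1-D linear model to synthetic data; the data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

# Gradient descent on MSE for a 1-D linear model y = w*x + b
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)   # synthetic data

w, b, lr = 0.0, 0.0, 0.1
for step in range(200):
    y_hat = w * x + b
    grad_w = np.mean(2 * (y_hat - y) * x)   # dLoss/dw
    grad_b = np.mean(2 * (y_hat - y))       # dLoss/db
    w -= lr * grad_w                        # step opposite to the gradient
    b -= lr * grad_b

print(w, b)   # close to the true values 3.0 and 0.5
```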