
Module-3 (Convolutional Neural Network)

Convolutional Neural Networks – convolution operation, motivation, pooling, Convolution and Pooling as an infinitely strong prior, variants of convolution functions, structured outputs, data types, efficient convolution algorithms.

Convolutional Neural Networks

Deep Learning has proved to be a very powerful tool because of its ability to handle large amounts of data. One of the most popular types of deep neural network is the Convolutional Neural Network (CNN).

Since the 1950s, the early days of AI, researchers have struggled to build systems that can understand visual data. CNNs were first developed and used around the 1980s. The most a CNN could do at that time was recognize handwritten digits, and it was mostly used in the postal sector to read zip codes, pin codes, etc. The important thing to remember about any deep learning model is that it requires a large amount of training data and substantial computing resources. This was a major drawback for CNNs at the time, so they remained limited to the postal sector and failed to enter the wider world of machine learning.

In the following years, this field came to be known as Computer Vision. In 2012, a group of researchers from the University of Toronto developed an AI model that surpassed the best image recognition algorithms by a large margin.

The AI system, which became known as AlexNet (named after its main creator, Alex Krizhevsky), won the 2012 ImageNet computer vision contest with an impressive 85 percent accuracy; the runner-up scored a modest 74 percent. At the heart of AlexNet was the Convolutional Neural Network, a special type of neural network that roughly imitates human vision.

Image filtering changes the appearance of an image by altering the colors of its pixels. Increasing the contrast and adding a variety of special effects to images are typical results of applying filters.

A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion, with pixel values denoting how bright and what color each pixel should be.

Imagine there is an image of a bird, and we want to identify whether it is really a bird or some other object. The first step is to feed the pixels of the image, in the form of arrays, to the input layer of the neural network. The hidden layers carry out feature extraction by performing different calculations and manipulations. There are multiple hidden layers, such as the convolution layer and the pooling layer, that perform feature extraction from the image, and finally there is a fully connected layer that identifies the object in the image.

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm that can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from the other. The pre-processing required in a ConvNet is much lower compared to other classification algorithms. ConvNets have the ability to learn filters/characteristics.

A ConvNet is able to successfully capture the spatial and temporal dependencies in an image through the application of relevant filters. The architecture achieves a better fit to the image dataset due to the reduction in the number of parameters involved and the reusability of weights.

Spatial dependency means a pixel's value is influenced by the values of nearby pixels, because neighboring pixels generally share the same color, belonging as they do to the same object. Temporal dependency arises in videos: when one frame changes to the next, if there is not much movement of objects, pixel values remain nearly the same.

Mathematically, convolution is the summation of the element-wise product of two matrices.

Let us consider an image 'X' and a filter 'Y', i.e. X and Y are matrices (image X being expressed in terms of its pixel values). Convolving the image 'X' with the filter 'Y' produces an output matrix 'Z'.

Finally, compute the sum of all the elements in ‘Z’ to get a scalar number, i.e.
3+4+0+6+0+0+0+45+2 = 60
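A minimal NumPy sketch of this single step; the matrices X and Y below are hypothetical values chosen so that their element-wise products reproduce the Z described above:

    import numpy as np

    # Hypothetical 3x3 image patch X and 3x3 filter Y (assumed values)
    X = np.array([[1, 2, 0],
                  [3, 0, 5],
                  [0, 9, 1]])
    Y = np.array([[3, 2, 4],
                  [2, 1, 0],
                  [7, 5, 2]])

    Z = X * Y        # element-wise product: [[3, 4, 0], [6, 0, 0], [0, 45, 2]]
    print(Z.sum())   # 60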

A Convolutional Neural Network (CNN) is made up of multiple layers, including convolutional layers, pooling layers, and fully connected layers. The convolutional layer is the first layer, while the FC layer is the last. From the convolutional layer to the FC layer, the complexity of the CNN increases. The layers are arranged so that they detect simpler patterns first (lines, curves, etc.) and more complex patterns (faces, objects, etc.) later. It is this increasing complexity that allows the CNN to successively identify larger portions and more complex features of an image until it finally identifies the object fully.

Convolutional layer. The majority of computations happen in the convolutional layer, which is the core building block of a CNN. A second convolutional layer can follow the initial convolutional layer. The process of convolution involves a kernel or filter inside this layer moving across the fields of the image, checking whether a feature is present. Filters are applied to the input image to extract features such as edges, textures, and shapes. The output of the convolutional layers is then passed on to subsequent layers.

A filter provides a measure of how closely a region of the input resembles a feature. A feature may be any prominent aspect: a vertical edge, a horizontal edge, an arch, a diagonal, etc. A filter acts as a template or pattern which, when convolved across the input, finds similarities between the stored template and different locations/regions in the input image. The filter is smaller than the input data.

A convolution is the simple application of a filter to an input that results in an activation. Repeated application of the same filter to an input produces a map of activations called a feature map, indicating the locations and strength of a detected feature in the input.

A convolution is a linear operation that involves the multiplication of a set of weights (the filter) with the input. Over multiple iterations, the kernel sweeps over the entire image. At each step, a dot product is calculated between the input pixels and the filter. The final output from this series of dot products is known as a feature map or convolved feature. A dot product is the element-wise multiplication between the input and filter, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the "scalar product". Ultimately, the image is converted into numerical values that allow the CNN to interpret the image and extract relevant patterns from it.
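As a sketch of how the feature map is produced, here is a plain NumPy implementation of the sliding-window operation described above (strictly speaking, CNN libraries compute cross-correlation, i.e. the kernel is not flipped; the vertical-edge filter shown is an illustrative choice):

    import numpy as np

    def conv2d_valid(image, kernel):
        """Slide the kernel over the image; each position yields one dot product."""
        kh, kw = kernel.shape
        oh = image.shape[0] - kh + 1
        ow = image.shape[1] - kw + 1
        feature_map = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                patch = image[i:i + kh, j:j + kw]
                feature_map[i, j] = np.sum(patch * kernel)  # the scalar product
        return feature_map

    image = np.random.rand(8, 8)
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]])           # a vertical-edge filter
    print(conv2d_valid(image, kernel).shape)  # (6, 6)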

Convolutional neural networks do not learn a single filter; they in fact learn multiple filters in parallel for a given input. For example, it is common for a convolutional layer to learn from 32 to 512 filters in parallel, giving the model 32, or even 512, different ways of extracting features from an input. Convolutional layers are not only applied to input data; they can also be applied to the output of other layers.

Pooling layer. The output of the convolutional layers is then passed through pooling layers. The pooling layer also sweeps a kernel or filter across the input. It reduces the number of parameters in the input and causes some information loss, but it retains the most important information. This layer reduces complexity and improves the efficiency of the CNN.

Fully connected layer. The output of the pooling layers is then passed through
one or more fully connected layers, which are used to make a prediction or
classify the image. Here, fully connected means that all the inputs or nodes
from one layer are connected to every activation unit or node of the next
layer.

Not all layers in the CNN are fully connected, because that would result in an unnecessarily dense network: it would increase losses, affect the output quality, and be computationally expensive.

A CNN uses parameter sharing. As a layer's filter moves across the image, its weights remain fixed, so the same set of weights is reused at every position. This makes the whole CNN system less computationally intensive.

Different Types of CNN Models:

LeNet, AlexNet, ResNet, GoogLeNet, MobileNet, VGG

Applications of CNN:

Decoding Facial Recognition

Understanding Climate

Collecting Historic and Environmental Elements

Motivation

A digital image is a 2D grid of pixels. Since a neural network expects a vector as input, one idea for dealing with images would be to flatten the image and feed the output of the flattening operation to the neural network.

But that flattened vector will not be the same for a translated image. The neural network would have to learn very different parameters in order to classify the objects, which is a difficult job since natural images are highly variable (lighting, translation, viewing angle, and so on). Also, the input vector would be relatively large (for RGB images), which can cause memory problems when using a neural network.

Natural images have 2 main characteristics

Locality: nearby pixels are more strongly correlated

Translation invariance: meaningful patterns can occur anywhere in the image.

The 3 characteristics of the CNN that help to solve the above problems are:

1) Sparse Connectivity: This is implemented by using kernels or feature detectors smaller than the input image. When processing an image, the input may have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means fewer parameters need to be stored, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters. If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters.

2) Parameter sharing: In a convolutional neural net, each member of the kernel is used at every position of the input. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, only one set is learned. This reduces the storage requirements of the model to k parameters.

3) Equivariance or Translational Equivariance: Translational equivariance is a very important property of convolutional neural networks, whereby the position of an object in the image need not be fixed for it to be detected by the CNN. It simply means that if the input shifts, the output shifts correspondingly: when processing images, if the input is moved 1 pixel to the right, its representation in the output also moves 1 pixel to the right. This property is achieved in CNNs through weight sharing: as the same weights are shared across the image, an object will be detected irrespective of its position, as the sketch below illustrates.
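A small PyTorch sketch (with illustrative random values) of translational equivariance: shifting the input one pixel to the right shifts the feature map one pixel to the right, because the same kernel weights are shared at every position.

    import torch
    import torch.nn.functional as F

    x = torch.rand(1, 1, 8, 8)                   # one 8x8 single-channel image
    k = torch.rand(1, 1, 3, 3)                   # one 3x3 kernel, shared everywhere

    y = F.conv2d(x, k)
    x_shifted = torch.roll(x, shifts=1, dims=3)  # move the image 1 pixel right
    y_shifted = F.conv2d(x_shifted, k)

    # Away from the border, the new feature map is the old one shifted right:
    print(torch.allclose(y[..., :-1], y_shifted[..., 1:]))  # True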

Pooling

A limitation of the feature maps output by convolutional layers is that they record the precise position of features in the input. This means that small movements in the position of a feature in the input image will result in a different feature map; this can happen with rotation, shifting, and other minor changes to the input image. The pooling layer operates upon each feature map separately to create a new set of the same number of pooled feature maps. The size of the pooling operation or filter is smaller than the size of the feature map; specifically, it is almost always 2×2 pixels applied with a stride of 2 pixels.

Pooling layers are used to reduce the dimensions of the feature maps. Thus, it
reduces the number of parameters to learn and the amount of computation
performed in the network.

The pooling layer summarises the features present in a region of the feature
map generated by a convolution layer. This makes the model more robust to
variations in the position of the features in the input image.

Types of Pooling Layers:

Max Pooling

Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max-pooling layer is a feature map containing the most prominent features of the previous feature map.

Average Pooling

Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in a patch.

Global Pooling

Global pooling reduces each channel in the feature map to a single value, so a feature map of size h × w × c becomes 1 × 1 × c. It can be either global max pooling or global average pooling.

Stride is a parameter of the neural network's filter that modifies the amount of movement over the image or video, i.e. how far the filter moves from one position to the next. If stride = 1, the filter moves one pixel at a time; if stride = 2, the filter moves two pixels.

Width, height, and depth of the input image: depth is the number of channels, or the number of filters applied in the previous layer. As the number of channels increases, more information can be identified from the input image.

The pooling layer takes a sliding window, or a certain region, that is moved in strides across the input, transforming the values into representative values. The transformation is performed either by taking the maximum value from the values observable in the window (called 'max pooling') or by taking their average. The operation is performed for each depth slice. For example, if the input is a volume of size 4x4x3 and the sliding window is of size 2×2, then for each color channel the values will be down-sampled to their representative maximum value if we perform the max pooling operation.
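A NumPy sketch of this 4x4x3 example (random values assumed), pooling each depth slice separately with a 2×2 window and stride 2:

    import numpy as np

    def max_pool(slice_2d, size=2, stride=2):
        """Max pooling over one depth slice."""
        oh = (slice_2d.shape[0] - size) // stride + 1
        ow = (slice_2d.shape[1] - size) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                window = slice_2d[i * stride:i * stride + size,
                                  j * stride:j * stride + size]
                out[i, j] = window.max()   # use window.mean() for average pooling
        return out

    volume = np.random.rand(4, 4, 3)       # the 4x4x3 input volume
    pooled = np.stack([max_pool(volume[:, :, c]) for c in range(3)], axis=-1)
    print(pooled.shape)                    # (2, 2, 3): each slice pooled separately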

Receptive Field

It is impractical to connect all neurons to all possible regions of the input volume: it would lead to too many weights to train and produce too high a computational complexity. Thus, instead of connecting each neuron to all possible pixels, we specify a 2-dimensional region called the 'receptive field' (say of size 5×5 units) extending to the entire depth of the input (5x5x3 for a 3-colour-channel input), within which the encompassed pixels are fully connected to the neural network's input layer. It is over these small regions that the network layer cross-sections (called 'depth columns') operate and produce the activation map.

Designing a Convolutional Neural Network

Fashion-MNIST is used: a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

For both conv layers, we will use kernels of spatial size 5 x 5 with stride 1 and padding of 2. For both pooling layers, we will use the max pool operation with kernel size 2, stride 2, and zero padding.
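A PyTorch sketch of this design; the kernel, stride, and padding sizes follow the description above, while the channel counts (16 and 32) are assumed for illustration:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),   # 28x28 -> 28x28
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),                  # 28x28 -> 14x14
        nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),  # 14x14 -> 14x14
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),                  # 14x14 -> 7x7
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 10),        # one score per Fashion-MNIST class
    )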

Let n be the input size, f the filter size, s the stride, and p the padding. Without padding, a convolution produces an output of size ((n − f)/s + 1); with padding, the output size is ((n + 2p − f)/s + 1). For example, with n = 4, p = 1, f = 3, and s = 1, the output size is (4 + 2 − 3)/1 + 1 = 4, so the padded convolution preserves the input size.
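The formula can be checked with a few lines of Python:

    def conv_output_size(n, f, s=1, p=0):
        """Spatial output size of a convolution: (n + 2p - f) / s + 1."""
        return (n + 2 * p - f) // s + 1

    print(conv_output_size(4, 3, p=1))   # 4: the n=4, p=1, f=3 example above
    print(conv_output_size(8, 3))        # 6: an unpadded 8x8 image shrinks to 6x6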

Variants of Convolution Function

Convolution in the context of neural networks actually means an operation that consists of many applications of convolution in parallel. This is because convolution with a single kernel can extract only one kind of feature; usually each layer of the network extracts many kinds of features, at many locations.

The input is usually not just a grid of real values, but a grid of vector-valued
observations. For example, a colour image has a red, green and blue intensity
at each pixel. In a multilayer convolutional network, the input to the second
layer is the output of the first layer, which usually has the output of many
different convolutions at each position.

Problem with Simple Convolution Layers

 For an (n x n) image and (f x f) filter/kernel, the dimension of the image resulting from a convolution operation is ((n − f)/s + 1) x ((n − f)/s + 1). For example, for an (8 x 8) image and a (3 x 3) filter with stride 1, the output resulting from the convolution operation would be of size (6 x 6). Thus, the image shrinks every time a convolution operation is performed. This places an upper limit on the number of times such an operation can be performed before the image reduces to nothing.

 Also, the pixels on the corners and the edges are used much less than those in the middle, i.e. the information at the borders of an image is not preserved as well as the information in the middle.

Padding is the process of adding layers of zeros to input images so as to avoid these problems.

Padding prevents shrinking: if p is the number of layers of zeros added to the border of the image, then applying a convolution with an (f x f) filter gives an output of ((n + 2p − f)/s + 1) x ((n + 2p − f)/s + 1). For example, adding one layer of padding to an (8 x 8) image and using a (3 x 3) filter, we get an (8 x 8) output after performing the convolution operation. Padding increases the contribution of the pixels at the border of the original image by bringing them into the middle of the padded image.

Strided Convolution
If a very large input image is to be convolved with an f x f filter, the computation is very expensive. In this situation, strides are used. The stride should be selected so that comparatively fewer computations are required while keeping information loss to a minimum.

Padding

If the padding values are zeros, it is called zero padding. Zero padding the input allows us to control the kernel width and the size of the output independently. The number of pixels to add for padding can be calculated from the size of the kernel and the desired output size.

Zero Padding Strategies

3 common zero padding strategies are:

1) Valid 2) Same 3) Full

Valid: Here the filter is applied only to the valid pixels of the input. The output is computed only at places where the entire kernel lies inside the input; essentially, no zero padding is performed. For a kernel of size k in any dimension, an input of size m becomes m − k + 1 in the output.

Same: The input is zero padded such that the spatial size of the input and output is the same.

Full: This will introduce zeroes in such a way that all the pixels are visited the
same number of times by the filter. This will increase the output size.
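A PyTorch sketch (illustrative values) of the three strategies for a 3x3 kernel on an 8x8 input; PyTorch expresses them through the padding argument rather than by name:

    import torch
    import torch.nn.functional as F

    x = torch.rand(1, 1, 8, 8)
    k = torch.rand(1, 1, 3, 3)

    valid = F.conv2d(x, k, padding=0)   # valid: 8 - 3 + 1 = 6
    same  = F.conv2d(x, k, padding=1)   # same:  spatial size preserved -> 8
    full  = F.conv2d(x, k, padding=2)   # full:  k - 1 zeros per side -> 10

    print(valid.shape[-1], same.shape[-1], full.shape[-1])   # 6 8 10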

Consider a comparison of locally connected layers, tiled convolution, and standard convolution. All three have the same sets of connections between units when the same kernel size is used; the difference between the methods lies in how they share parameters.

1) A locally connected layer has no sharing at all: each connection has its own weight (imagine labeling each connection with a unique letter).

2) Tiled convolution has a set of t different kernels; here t = 2. One of these kernels has edges labeled "a" and "b", while the other has edges labeled "c" and "d". Each time we move one pixel to the right in the output, we use a different kernel. This means that, like the locally connected layer, neighboring units in the output have different parameters. After going through all available kernels, we cycle back to the first kernel.

3) Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel, and it is applied everywhere, with the same weights "a" and "b" used at every position.

Dilated convolutions introduce a parameter to convolutional layers called the


dilation rate. This defines spacing between the values in a kernel. A 3x3 kernel
with a dilation rate of 2 will have the same field of view as a 5x5 kernel, while
only using 9 parameters.

This delivers a wider field of view at the same computational cost. Dilated convolutions are popular in the field of real-time segmentation. They are normally used when a wide field of view is required and multiple convolutions or larger kernels cannot be afforded.
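A short PyTorch sketch: the same 9 weights, applied with dilation 2, cover a 5x5 field of view (input values are illustrative):

    import torch
    import torch.nn.functional as F

    x = torch.rand(1, 1, 9, 9)
    k = torch.rand(1, 1, 3, 3)           # 9 parameters in both cases

    y1 = F.conv2d(x, k, dilation=1)      # 3x3 field of view -> 7x7 output
    y2 = F.conv2d(x, k, dilation=2)      # 5x5 field of view -> 5x5 output
    print(y1.shape, y2.shape)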

Downsampling reduces the dimensions of the input image to derive meaningful feature details from it. Upsampling increases the dimensions of a given input to match the required output dimensions.

Transposed Convolutions

The convolution operation reduces the spatial dimensions deeper down the network and creates an abstract representation of the input image. This feature of CNNs is very useful for tasks like image classification, where you only have to predict whether a particular object is present in the input image. But it can cause problems for tasks like object localization and segmentation, where the spatial dimensions of the object in the original image are necessary to predict the output bounding box or to segment the object.

Transposed Convolutions are used to upsample the input feature map to a


desired output feature map using some learnable parameters.

To form the upsampled output, take the upper-left element of the input feature map and multiply it with every element of the kernel. Do the same for all the remaining elements of the input feature map.

Some elements of the resulting upsampled feature maps overlap. To resolve this, simply add the elements at the overlapping positions.

The resulting output is the final upsampled feature map, having the required spatial dimensions of 3x3.
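A NumPy sketch of the steps just described, using a hypothetical 2x2 input and 2x2 kernel so the upsampled map comes out as 3x3:

    import numpy as np

    def transposed_conv2d(inp, kernel, stride=1):
        """Each input element scales the whole kernel; overlaps are summed."""
        ih, iw = inp.shape
        kh, kw = kernel.shape
        out = np.zeros(((ih - 1) * stride + kh, (iw - 1) * stride + kw))
        for i in range(ih):
            for j in range(iw):
                out[i * stride:i * stride + kh,
                    j * stride:j * stride + kw] += inp[i, j] * kernel
        return out

    inp = np.array([[1, 2],
                    [3, 4]])               # hypothetical input feature map
    kernel = np.array([[1, 0],
                       [0, 1]])            # hypothetical learnable kernel
    print(transposed_conv2d(inp, kernel))  # a 3x3 upsampled feature map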

The output of the transposed convolution operation can be made to have exactly the same shape as the input of the corresponding convolution operation; however, the original values are not restored.

Separable Convolution

A separable convolution is a process in which a single convolution is divided into two or more convolutions that together produce the same output. There are mainly two types of separable convolutions:

 Spatially Separable Convolutions.


 Depth-wise Separable Convolutions.

Spatial separable convolution

The spatial separable convolution is so named because it deals primarily with the spatial dimensions of an image and kernel: the width and the height. A spatial separable convolution simply divides a kernel into two smaller kernels. The most common case is to divide a 3x3 kernel into a 3x1 and a 1x3 kernel.

The output does not change, as the image still obeys the rules of matrix multiplication. Spatial separable convolution reduces the number of individual multiplications. One drawback is that not every kernel can be separated; because of this, the method is used less often.
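For example, the 3x3 Sobel kernel separates into a 3x1 and a 1x3 kernel, which a quick NumPy check confirms:

    import numpy as np

    col = np.array([[1], [2], [1]])    # 3x1 kernel
    row = np.array([[1, 0, -1]])       # 1x3 kernel

    print(col @ row)   # the outer product rebuilds the 3x3 Sobel kernel:
    # [[ 1  0 -1]
    #  [ 2  0 -2]
    #  [ 1  0 -1]]
    # Convolving with col and then row costs 3 + 3 = 6 multiplications per
    # position instead of 9 for the full 3x3 kernel.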

Depthwise separable convolution

Depthwise separable convolutions are used with filters that cannot be decomposed into smaller filters. Depthwise convolution is a type of convolution in which a single convolutional filter is applied to each input channel. MobileNet and Xception are two examples where depthwise separable convolution is used.

Depthwise separable convolution deals with the depth dimension as well; the depth dimension refers to the number of channels of an image. In a depthwise separable convolution, the kernel is split into two different kernels, known as the depthwise convolution and the pointwise convolution.

Consider an input layer of size 7 x 7 x 3 (height x width x channels) and a filter of size 3 x 3 x 3. After applying the convolution with one filter, we get a 5 x 5 x 1 output layer having only 1 channel. Suppose the number of filters is increased to 128; stacking all these output layers gives a layer of size 5 x 5 x 128. We have shrunk the spatial dimensions (from 7 x 7 to 5 x 5), while the depth has increased from 3 to 128.

In depthwise convolution, we use three separate kernels of size 3 x 3 x 1 instead of one 3 x 3 x 3 kernel. Each of the three kernels convolves with just one channel of the input layer, producing a map of size 5 x 5 x 1. Stacking these maps again generates a map of size 5 x 5 x 3.
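In PyTorch, this per-channel filtering is expressed with the groups argument; a sketch of the 7x7x3 example above (input values are illustrative):

    import torch
    import torch.nn as nn

    x = torch.rand(1, 3, 7, 7)   # the 7x7x3 input, channels-first

    # groups=3 applies one separate 3x3x1 kernel to each input channel
    depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)
    print(depthwise(x).shape)    # torch.Size([1, 3, 5, 5])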

Point Convolution

What if we want to increase the number of channels in the output image? Suppose (switching to a 12x12x3 input with 5x5 kernels) we want an output of size 8x8x256. With a standard convolution, we would create 256 kernels producing 256 8x8x1 maps and stack them together into an 8x8x256 output.

With depthwise convolution, each 5x5x1 kernel iterates over one channel of the image, giving an 8x8x1 map; stacking these maps together creates an 8x8x3 image.

The original convolution transformed a 12x12x3 image into an 8x8x256 image. So far, the depthwise convolution has transformed the 12x12x3 image into an 8x8x3 image. Now we need to increase the number of channels of each image.

The pointwise convolution is so named because it uses a 1x1 kernel, a kernel that iterates through every single point. This kernel has a depth equal to however many channels the input image has; in our case, 3. Therefore, we iterate a 1x1x3 kernel through our 8x8x3 image to get an 8x8x1 image.

We can create 256 of these 1x1x3 kernels, each outputting an 8x8x1 image, to get a final image of shape 8x8x256.
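A PyTorch sketch comparing the standard convolution with the depthwise-plus-pointwise pair on this 12x12x3 example; both produce an 8x8x256 output, but with very different parameter counts:

    import torch
    import torch.nn as nn

    x = torch.rand(1, 3, 12, 12)

    # Standard convolution: 256 kernels of size 5x5x3
    standard = nn.Conv2d(3, 256, kernel_size=5, bias=False)

    # Depthwise (three 5x5x1 kernels) followed by pointwise (256 1x1x3 kernels)
    separable = nn.Sequential(
        nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False),
        nn.Conv2d(3, 256, kernel_size=1, bias=False),
    )

    print(standard(x).shape, separable(x).shape)           # both [1, 256, 8, 8]
    print(sum(p.numel() for p in standard.parameters()))   # 19200
    print(sum(p.numel() for p in separable.parameters()))  # 843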

Efficient Convolution Algorithms

Two approaches can speed up convolution:

1. Parallel computation resources
2. Selecting appropriate algorithms

*Fourier transform: Performing convolution as pointwise multiplication in the frequency domain can provide a speed-up compared to direct computation. The result is converted back to the time domain using an inverse Fourier transform.

*When a higher-dimensional convolution kernel is separable, it can be decomposed into several lower-dimensional kernels. In this sense, a 2-D separable kernel can be split into two 1-D kernels. The input signal can then be convolved step by step, first with one 1-D kernel and then with the second; the result equals the convolution of the input signal with the original 2-D kernel. Gaussian, Difference of Gaussian, and Sobel kernels are representative separable kernels commonly used in signal and image processing.
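A quick SciPy check (with assumed random values) that FFT-based convolution matches direct convolution:

    import numpy as np
    from scipy import signal

    image = np.random.rand(256, 256)
    kernel = np.random.rand(15, 15)

    direct = signal.convolve2d(image, kernel, mode='same')   # direct computation
    viafft = signal.fftconvolve(image, kernel, mode='same')  # frequency domain
    print(np.allclose(direct, viafft))                       # True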

Parameters that define a convolutional layer.

Kernel Size: The kernel size defines the field of view of the convolution. A common choice for 2D is 3, that is, 3x3 pixels.

Stride: The stride defines the step size of the kernel when traversing the
image. While its default is usually 1, stride of 2 is also used.

Padding: The padding defines how the border of a sample is handled.

Input & Output Channels: A convolutional layer takes a certain number of
input channels (I) and calculates a specific number of output channels (O).

Structured Outputs

Convolutional networks can be trained to output a high-dimensional structured output rather than a single classification. A good example is the task of image segmentation, where each pixel needs to be associated with an object class. Here the output is the same size (spatially) as the input. The model outputs a tensor S where S[i,j,k] is the probability that pixel (j,k) belongs to class i.

To produce an output map of the same size as the input map, only same-padded convolutions can be stacked. Alternatively, a coarser segmentation map can be obtained by allowing the output map to shrink spatially.
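A minimal PyTorch sketch of such a structured-output network, assuming a 3-channel input and a hypothetical 5 object classes; stacking only same-padded convolutions keeps the output map the same spatial size as the input:

    import torch
    import torch.nn as nn

    n_classes = 5                   # hypothetical number of object classes
    seg_net = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, n_classes, kernel_size=3, padding=1),
    )

    x = torch.rand(1, 3, 64, 64)
    S = seg_net(x).softmax(dim=1)   # S[0, i, j, k]: probability that pixel
    print(S.shape)                  # (j, k) belongs to class i -> [1, 5, 64, 64]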

The process of image segmentation divides an image into different regions based on the characteristics of pixels. A binary large object (blob) is a collection of binary data; blobs are typically images, audio, or other multimedia objects. A coarse segmentation means large blobs covering each class without much detail, whereas a fine segmentation has a much higher level of detail, which can go down to the pixel level.

The output of the first labelling stage can be refined successively by another convolutional model. If the models use tied parameters, this gives rise to a type of recursive model (H¹, H², H³ sharing parameters).

The output can be further processed under the assumption that contiguous regions of pixels will tend to belong to the same label. Graphical models can describe this relationship.

Another model that has gained popularity for segmentation tasks (especially in
the medical imaging community) is the U-Net.

Datatypes

The data used with a convolutional network usually consist of several channels (e.g. RGB or CMY), each channel being the observation of a different quantity at some point. One advantage of convolutional networks is that they can also process inputs with varying spatial extents. (The size of the area on the surface that each pixel covers is known as the spatial resolution of the image.) When the output is correspondingly variable in size, no extra design change needs to be made. If, however, the output is of fixed size, as in a classification task, a pooling stage with kernel size proportional to the input size must be used.
Examples by dimension and number of channels:

1-D, single channel: Audio waveform. The time axis is discretized, and the amplitude of the waveform is measured once per time step.

1-D, multi-channel: Skeleton animation data. Skeletal animation is a technique in computer animation in which a character is represented in two parts: a surface representation used to draw the character (called the mesh or skin) and a hierarchical set of interconnected parts (called bones, collectively forming the skeleton).

2-D, single channel: Audio data that has been preprocessed with a Fourier transform: the audio waveform is transformed into a 2-D tensor, with different rows corresponding to different frequencies and different columns corresponding to different points in time.

2-D, multi-channel: Color image data. One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and the vertical axes of the image, conferring translation equivariance: the position of an object in the image need not be fixed for it to be detected, and if the input shifts, the output shifts too.

3-D, single channel: Volumetric data. A common source of this kind of data is medical imaging technology such as CT scans.

3-D, multi-channel: Color video data. One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.
