Variants of CNN, Structured Outputs, and Datatypes
Deep learning has proved to be a very powerful tool because of its ability to
handle large amounts of data. One of the most popular deep neural network
architectures is the Convolutional Neural Network (CNN).
Since the 1950s, the early days of AI, researchers have struggled to build a
system that can understand visual data. CNNs were first developed and used
around the 1980s. The most a CNN could do at that time was recognize
handwritten digits, so it was mostly used in the postal sector to read zip codes,
pin codes, etc. The important thing to remember about any deep learning
model is that it requires a large amount of training data and a lot of computing
resources. This was a major drawback for CNNs at that time, and hence CNNs
remained limited to the postal sector and failed to enter the wider world of
machine learning.
The AI system that became known as AlexNet (named after its main creator,
Alex Krizhevsky) won the 2012 ImageNet computer vision contest with an
impressive 85 percent accuracy; the runner-up scored a modest 74 percent on
the test. At the heart of AlexNet was the convolutional neural network, a
special type of neural network that roughly imitates human vision.
To a computer, an image is an array of numbers that indicate how
bright and what color each pixel should be.
Imagine there’s an image of a bird, and we want to identify whether it’s really
a bird or some other object. The first step is to feed the pixels of the image, in
the form of arrays, to the input layer of the neural network. The hidden layers
carry out feature extraction by performing different calculations and
manipulations. There are multiple hidden layers, such as the convolution layer
and the pooling layer, that perform feature extraction from the image; finally,
a fully connected layer identifies the object in the image.
The ConvNet architecture achieves a better fit to image datasets due to the
reduction in the number of parameters involved and the reusability of weights.
Let us consider an image ‘X’ and a filter ‘Y’, where X and Y are matrices (the
image X being expressed as its pixel values). When we convolve the image ‘X’
with the filter ‘Y’, an output matrix ‘Z’ is formed.
Finally, we compute the sum of all the elements in ‘Z’ to get a scalar number, i.e.
3 + 4 + 0 + 6 + 0 + 0 + 0 + 45 + 2 = 60
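As a minimal sketch in Python with numpy (the entries of X and Y below are assumptions chosen so that Z reproduces the sum above; the original matrices are not reproduced in these notes):

import numpy as np

# One position of a convolution: element-wise multiply a patch of the
# image X with the filter Y, then sum the products to get a scalar.
X = np.array([[1, 2, 0],
              [3, 0, 0],
              [0, 9, 1]])   # image patch (assumed values)
Y = np.array([[3, 2, 1],
              [2, 1, 5],
              [1, 5, 2]])   # filter (assumed values)

Z = X * Y                   # Z = [[3, 4, 0], [6, 0, 0], [0, 45, 2]]
print(Z.sum())              # 60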
A filter provides a measure of how closely a region of the input resembles a
feature. A feature may be any prominent aspect: a vertical edge, a horizontal
edge, an arch, a diagonal, etc. A filter acts as a template or pattern which,
when convolved across the input, finds similarities between the stored
template and different locations/regions in the input image. The filter is
smaller than the input data.
Convolutional neural networks do not learn a single filter; they in fact learn
multiple filters in parallel for a given input. For example, it is common for a
convolutional layer to learn from 32 to 512 filters in parallel, giving the model
32 to 512 different ways of extracting features from an input. Convolutional
layers are not only applied to the input data; they can also be applied to the
output of other layers.
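As a hedged sketch of filters learned in parallel (the layer sizes and the input shape here are assumptions, not taken from these notes), in PyTorch:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 28, 28)                     # one 28x28 RGB image (assumed size)

# 32 filters learned in parallel: each produces its own feature map.
conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
features = conv1(x)
print(features.shape)                             # torch.Size([1, 32, 26, 26])

# A convolutional layer applied to the output of another layer.
conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)
print(conv2(features).shape)                      # torch.Size([1, 64, 24, 24])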
Fully Connected Layer
The output of the pooling layers is then passed through one or more fully
connected layers, which are used to make a prediction or classify the image.
Here, fully connected means that every input or node from one layer is
connected to every activation unit or node of the next layer.
Not all the layers in a CNN are fully connected, because that would result in an
unnecessarily dense network: it would increase losses, affect the output
quality, and be computationally expensive.
Instead, a CNN uses parameter sharing. Each filter has an associated set of
weights, and as the filter moves across the image those weights remain fixed,
so every position in a feature map is computed with the same parameters. This
makes the whole CNN system less computationally intensive.
Applications of CNN:
Understanding Climate
Motivation
A plain neural network takes the image as one flattened vector of pixel values.
But that flattened vector will not be the same for a translated image. The
network would therefore have to learn very different parameters to classify
the same objects, which is a difficult job since natural images vary greatly
(lighting, translation, viewing angle, and so on). Also, the input vector would
be relatively big (for RGB images), which can cause memory problems when
using a plain neural network.
The three characteristics of the CNN that help solve the above problems are
sparse connectivity, parameter sharing, and equivariant representations.
Sparse connectivity: in a fully connected layer with m inputs and n outputs,
every output interacts with every input, requiring m × n parameters. If each
output is instead connected to only k inputs, with k much smaller than m,
then the sparsely connected approach requires only k × n parameters. For
example, with m = n = 1024 and k = 9 (a 3 x 3 filter), that is 9,216 parameters
instead of more than a million.
Pooling
Pooling layers are used to reduce the dimensions of the feature maps. They
thus reduce the number of parameters to learn and the amount of
computation performed in the network.
The pooling layer summarises the features present in a region of the feature
map generated by a convolution layer. This makes the model more robust to
variations in the position of the features in the input image.
Max Pooling
Max pooling is a pooling operation that selects the maximum element from
the region of the feature map covered by the filter. Thus, the output after the
max-pooling layer is a feature map containing the most prominent features
of the previous feature map.
Average Pooling
Average pooling computes the average of the elements present in the region
of the feature map covered by the filter. Thus, while max pooling gives the
most prominent feature in a particular patch of the feature map, average
pooling gives the average of the features present in that patch.
Global Pooling
Global pooling reduces each channel in the feature map to a single value; i.e.,
a feature map of dimensions h x w x c is reduced to 1 x 1 x c. Further, it can be
either global max pooling or global average pooling.
Stride is a parameter of the neural network's filter that modifies the amount of
movement over the image or video, i.e., how far the filter moves from one
position to the next. If stride = 1, the filter moves one pixel at a time; if
stride = 2, the filter moves two pixels.
The pooling layer takes a sliding window, or a certain region, that is moved in
strides across the input, transforming the values into representative values.
The transformation is performed either by taking the maximum of the values
visible in the window (called ‘max pooling’) or by taking their average.
The operation is performed for each depth slice. For example, if the input is a
volume of size 4x4x3, and the sliding window is of size 2×2, then for each color
channel, the values will be down-sampled to their representative maximum
value if we perform the max pooling operation.
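A minimal numpy sketch of this windowed down-sampling on a single channel (the input values and the helper name pool2d are ours, for illustration):

import numpy as np

def pool2d(x, size=2, stride=2, op=np.max):
    # Slide a size x size window across x in strides and keep one
    # representative value (max or mean) per window.
    h, w = x.shape
    out = np.empty(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = op(window)
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]], dtype=float)   # one 4x4 channel (assumed values)
print(pool2d(x, op=np.max))                 # [[6. 4.] [8. 9.]]  max pooling
print(pool2d(x, op=np.mean))                # average pooling of the same windows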
Receptive Field
It is impractical to connect all neurons to all possible regions of the input
volume: it would lead to too many weights to train and produce too high a
computational complexity. Thus, instead of connecting each neuron to all
possible pixels, we specify a two-dimensional region called the ‘receptive
field’ (say of size 5×5 units), extending through the entire depth of the input
(5x5x3 for a 3-colour-channel input), within which the encompassed pixels are
fully connected to a neuron of the layer. It is over these small regions that the
network layer cross-sections (called ‘depth columns’) operate and produce the
activation map.
Designing a Convolutional Neural Network
For both conv layers, we will use kernels of spatial size 5 x 5 with stride 1 and
padding of 2. For both pooling layers, we will use the max pool operation with
kernel size 2, stride 2, and padding of 0.
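A minimal PyTorch sketch of such a network; the channel counts, the input size, and the 10-class output are assumptions for illustration, since they are not specified above:

import torch
import torch.nn as nn

# Two conv layers (5x5 kernels, stride 1, padding 2) and two max-pool
# layers (kernel size 2, stride 2), as described above.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=2),    # 32x32 -> 32x32
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),   # 16x16 -> 16x16
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                               # 10 classes (assumed)
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image (assumed size)
print(net(x).shape)             # torch.Size([1, 10])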
The spatial size of the convolution output is ((N - f)/s) + 1, where
N = input size
f = filter size
s = stride
Padding
Before padding, an n x n input convolved with an f x f filter shrinks to
(n - f + 1) x (n - f + 1). After padding the border with p layers of zeros, the
output becomes (n + 2p - f + 1) x (n + 2p - f + 1). Here, with n = 4, p = 1, and
f = 3, the output size is 4 + 2 - 3 + 1 = 4, so the spatial size of the input is
preserved.
The input is usually not just a grid of real values, but a grid of vector-valued
observations. For example, a colour image has a red, green and blue intensity
at each pixel. In a multilayer convolutional network, the input to the second
layer is the output of the first layer, which usually has the output of many
different convolutions at each position.
This prevents shrinking, where p = number of layers of zeros added to the
border of the image. So, applying the convolution operation with an (f x f)
filter, the output will be ((n + 2p - f)/s + 1) x ((n + 2p - f)/s + 1). For
example, adding one layer of padding to an (8 x 8) image and using a (3 x
3) filter, we get an (8 x 8) output after performing the convolution
operation. This increases the contribution of the pixels at the border of the
original image by bringing them into the middle of the padded image.
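These formulas are easy to sanity-check with a small helper (conv_output_size is our hypothetical name):

def conv_output_size(n, f, s=1, p=0):
    # Spatial output size of a convolution: ((n + 2p - f) / s) + 1.
    return (n + 2 * p - f) // s + 1

print(conv_output_size(8, 3, s=1, p=1))   # 8: the padded (8 x 8) example above
print(conv_output_size(4, 3, s=1, p=1))   # 4: the n=4, p=1, f=3 example above
print(conv_output_size(8, 3, s=2))        # 3: a strided convolution shrinks faster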
Strided Convolution
If a very large input image is to be convolved with an f x f filter, the operation
is computationally very expensive. In this situation, strides are used. The
stride should be selected such that comparatively fewer computations are
required while the information loss remains minimal.
Padding
If the values used for padding are zeros, it is called zero padding. Zero padding
the input allows us to control the kernel width and the size of the output
independently. The number of pixels to be added for padding can be
calculated from the size of the kernel and the desired output size.
Valid: the filter is applied only to the valid pixels of the input, i.e., the output is
computed only at places where the entire kernel lies inside the input.
Essentially, no zero padding is performed. For a kernel of size k in any
dimension, an input of size m becomes m - k + 1 in the output.
Same: the input is zero padded such that the spatial sizes of the input and
output are the same.
Full: zeros are introduced in such a way that every pixel is visited the same
number of times by the filter. This increases the output size: for a kernel of
size k, an input of size m becomes m + k - 1 in the output.
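These three schemes mirror the modes of numpy's 1-D convolution, which gives a quick way to verify the output sizes (the input values are arbitrary):

import numpy as np

x = np.arange(1.0, 6.0)     # input of size m = 5
k = np.ones(3) / 3          # kernel of size k = 3 (a simple moving average)

print(len(np.convolve(x, k, mode='valid')))   # 3 = m - k + 1
print(len(np.convolve(x, k, mode='same')))    # 5 = m
print(len(np.convolve(x, k, mode='full')))    # 7 = m + k - 1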
1) A locally connected layer has no sharing at all: each connection has its own
weight, which can be shown by labeling each connection with a unique letter.
2) Tiled convolution has a set of t different kernels; here t = 2. One of these
kernels has edges labeled “a” and “b”, while the other has edges labeled “c”
and “d”. Each time we move one pixel to the right in the output, we use a
different kernel. This means that, like the locally connected layer, neighboring
units in the output have different parameters. After going through all available
kernels, we cycle back to the first kernel.
Dilated Convolution
A dilated convolution inserts gaps between the kernel elements, so a small
kernel covers a larger region of the input. This delivers a wider field of view at
the same computational cost. Dilated convolutions are popular in the field of
real-time segmentation; they are normally used when a wide field of view is
required and one cannot afford multiple convolutions or larger kernels.
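A short PyTorch sketch of the wider field of view (the input size is an assumption):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)           # single-channel input (assumed size)

# A 3x3 kernel with dilation 2 has an effective field of view of 5x5
# while still using only nine weights.
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)
print(dilated(x).shape)                 # torch.Size([1, 1, 12, 12]): 16 - 5 + 1 = 12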
Downsampling: reducing the dimensions of the input image in order to derive
meaningful feature details from it.
Transposed Convolutions
The convolution operation reduces the spatial dimensions deeper down the
network and creates an abstract representation of the input image. This
feature of CNNs is very useful for tasks like image classification, where you
only have to predict whether a particular object is present in the input image.
But it can cause problems for tasks like object localization and segmentation,
where the spatial dimensions of the object in the original image are necessary
to predict the output bounding box or segment the object.
To form the output, take the upper-left element of the input feature map and
multiply it with every element of the kernel. Similarly, do this for all the
remaining elements of the input feature map. Some elements of the resulting
upsampled feature maps overlap; to resolve this, we simply add the elements
at the overlapping positions. The resulting output is the final upsampled
feature map having the required spatial dimensions of 3x3.
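A numpy sketch of exactly this scatter-and-add procedure (the input and kernel values are assumed; a 2x2 input with a 2x2 kernel yields the 3x3 output described above):

import numpy as np

def transposed_conv2d(x, k):
    # Multiply each input element with the whole kernel, scatter the
    # result into the output, and add values at overlapping positions.
    (h, w), (kh, kw) = x.shape, k.shape
    out = np.zeros((h + kh - 1, w + kw - 1))
    for i in range(h):
        for j in range(w):
            out[i:i+kh, j:j+kw] += x[i, j] * k
    return out

x = np.array([[1, 2], [3, 4]])   # 2x2 input feature map (assumed values)
k = np.array([[1, 0], [0, 1]])   # 2x2 kernel (assumed values)
print(transposed_conv2d(x, k))   # 3x3 upsampled feature map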
Separable Convolution
A spatial separable convolution simply divides a kernel into two smaller
kernels. The most common case is dividing a 3x3 kernel into a 3x1 kernel and a
1x3 kernel. The output does not change, because convolving with the two
smaller kernels in sequence follows the same matrix multiplication rules, yet
spatial separable convolution reduces the number of individual
multiplications. One drawback is that not every kernel can be separated;
because of this, the method is used less often.
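The Sobel kernel (mentioned later in these notes as a separable kernel) makes a concrete check; scipy's convolve2d is used for the 2-D convolutions:

import numpy as np
from scipy.signal import convolve2d

# A 3x3 Sobel kernel factors into a 3x1 column and a 1x3 row.
col = np.array([[1], [2], [1]])
row = np.array([[-1, 0, 1]])
sobel = col @ row                       # the full 3x3 kernel

img = np.random.rand(6, 6)              # toy input (assumed)
full = convolve2d(img, sobel, mode='valid')
sep = convolve2d(convolve2d(img, col, mode='valid'), row, mode='valid')
print(np.allclose(full, sep))           # True: 9 multiplications per output
                                        # position reduced to 3 + 3 = 6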
Depthwise separable convolutions are used with filters that cannot be
decomposed into smaller spatial filters. Depthwise convolution is a type of
convolution where a single convolutional filter is applied to each input
channel. MobileNet and Xception are two examples where depthwise
separable convolution is used.
Depthwise separable convolution deals with the depth dimension as well; the
depth dimension refers to the number of channels of an image. In depthwise
separable convolution, the kernel is split into two different kernels, known as
the depthwise convolution and the pointwise convolution.
Consider an input layer of size 7 x 7 x 3 (height x width x channels) and a filter
of size 3 x 3 x 3. After applying convolution with one filter, we get a 5 x 5 x 1
output layer having only one channel.
Suppose the number of filters is increased to 128. Stacking all these output
layers gives one big layer of size 5 x 5 x 128. We have shrunk the spatial
dimensions, the height and width (from 7 x 7 to 5 x 5), while the depth has
increased from 3 to 128.
In the depthwise step, each 5x5x1 kernel iterates over one channel of the
image, giving an 8x8x1 feature map; stacking these maps together creates an
8x8x3 image.
Pointwise Convolution
With the original standard convolution, we had to create 256 kernels to
produce 256 8x8x1 images, then stack them up together to create an 8x8x256
output.
The original convolution transformed a 12x12x3 image into an 8x8x256 image.
So far, the depthwise convolution has transformed the 12x12x3 image into an
8x8x3 image. Now we need to increase the number of channels of each image:
we can create 256 1x1x3 kernels, each outputting an 8x8x1 image, to get a
final image of shape 8x8x256.
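A PyTorch sketch of this 12x12x3 -> 8x8x256 example; the groups argument implements the per-channel depthwise step:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 12, 12)                     # 12x12 RGB input

# Standard convolution: 256 kernels of size 5x5x3 -> 8x8x256.
standard = nn.Conv2d(3, 256, kernel_size=5)
print(standard(x).shape)                          # torch.Size([1, 256, 8, 8])

# Depthwise step: one 5x5 filter per input channel (groups=3) -> 8x8x3.
depthwise = nn.Conv2d(3, 3, kernel_size=5, groups=3)
# Pointwise step: 256 kernels of size 1x1x3 -> 8x8x256.
pointwise = nn.Conv2d(3, 256, kernel_size=1)
print(pointwise(depthwise(x)).shape)              # torch.Size([1, 256, 8, 8])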
Efficient convolution can be achieved through:
1. Parallel computation resources
2. Selecting appropriate algorithms
If a higher dimensional convolution kernel is separable, it can be decomposed
into several lower dimensional kernels; in this sense, a 2-D separable kernel
can be split into two 1-D kernels. The input signal can then be convolved step
by step, first with one 1-D kernel and then with the second. The result equals
the convolution of the input signal with the original 2-D kernel. Gaussian,
Difference of Gaussian, and Sobel kernels are representatives of the separable
kernels commonly used in signal and image processing.
Kernel Size: The kernel size defines the field of view of the convolution. A
common choice for 2D is 3, that is, 3x3 pixels.
Stride: The stride defines the step size of the kernel when traversing the
image. While its default is usually 1, a stride of 2 is also used.
Input & Output Channels: A convolutional layer takes a certain number of
input channels (I) and calculates a specific number of output channels (O).
Structured Outputs
To produce an output map of the same size as the input map, only same-
padded convolutions can be stacked. Alternatively, a coarser segmentation
map can be obtained by allowing the output map to shrink spatially.
The output can be further processed under the assumption that contiguous
regions of pixels will tend to belong to the same label. Graphical models can
describe this relationship.
Another model that has gained popularity for segmentation tasks (especially in
the medical imaging community) is the U-Net.
Datatypes
The data used with a convolutional network usually consist of several channels
(e.g., RGB or CMY), each channel being the observation of a different quantity
at some point. One advantage of convolutional networks is that they can also
process inputs with varying spatial extents. (The size of the area on the surface
that each pixel covers is known as the spatial resolution of the image.) When
the output is allowed to be correspondingly variable in size, no extra design
change needs to be made. If, however, the output must be of fixed size, as in a
classification task, a pooling stage with kernel size proportional to the input
size must be used.
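One way to realize such an input-proportional pooling stage is adaptive pooling, sketched here in PyTorch (the channel count and input sizes are assumptions):

import torch
import torch.nn as nn

# Adaptive average pooling picks its window size from the input size,
# so variable-sized inputs yield a fixed-size output.
pool = nn.AdaptiveAvgPool2d((1, 1))
for size in [(32, 32), (64, 48)]:
    x = torch.randn(1, 8, *size)
    print(pool(x).shape)            # always torch.Size([1, 8, 1, 1])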
Dimension: 2-D
Single channel: Audio data that has been preprocessed with a Fourier
transform: the audio waveform is transformed into a 2-D tensor, with different
rows corresponding to different frequencies and different columns
corresponding to different points in time.
Multi channel: Color image data: one channel contains the red pixels, one the
green pixels, and one the blue pixels. The convolution kernel moves over both
the horizontal and the vertical axes of the image, conferring translation
equivariance. (Translation equivariance means the position of an object in the
image need not be fixed for the CNN to detect it: if the input shifts, the output
shifts correspondingly.)
Dimension: 3-D
Single channel: Volumetric data: a common source of this kind of data is
medical imaging technology such as CT scans.
Multi channel: Color video data: one axis corresponds to time, one to the
height of the video frame, and one to the width of the video frame.