Introduction to Deep Learning
Topics to be covered…
Convolutional neural network
Convolution layer, ReLU Activation Function
Padding, Stride
Pooling layer
Flattening, Subsampling
Loss layer, Dense layer
1x1 convolution, Input channels
Inception network
Transfer learning, One shot learning
Dimensionality reduction
Implementation of CNN with TensorFlow, Keras
Introduction to Convolutional Neural Networks
Convolutional Neural Networks are very similar to ordinary Neural
Networks.
They are made up of neurons that have learnable weights and biases.
Convolutional neural networks, also called ConvNets, were first
introduced in the 1980s by Yann LeCun, a postdoctoral computer
science researcher.
The early version of CNNs, called LeNet (after LeCun), could recognize
handwritten digits. CNNs found a niche market in banking and postal
services, where they read digits on checks and zip codes on
envelopes.
But despite their ingenuity, ConvNets remained on the sidelines of
computer vision and artificial intelligence because they faced a serious
problem: They could not scale.
In 2012, AlexNet showed that perhaps the time had come to revisit
deep learning, the branch of AI that uses multi-layered neural
networks.
Introduction to Convolutional Neural Networks
A Convolutional Neural Network has an input layer, an output layer,
many hidden layers, and millions of parameters, giving it the ability
to learn complex objects and patterns.
Convolutional Neural Networks are a bit different. First of all, the
layers are organized in 3 dimensions: width, height and depth.
Further, the neurons in one layer do not connect to all the neurons
in the next layer but only to a small region of it. Lastly, the final
output will be reduced to a single vector of probability scores,
organized along the depth dimension.
The network subsamples the given input through convolution and
pooling operations, each followed by an activation function. All of
these are the hidden layers, which are partially connected; at the
very end are the fully connected layers that lead to the output layer.
With padding, the output of a convolution can retain a shape similar
to the input image dimensions.
Limitations of CNN
CNNs perform poorly when it comes to understanding the meaning
of the contents of images.
But despite the vast repositories of images and videos they’re
trained on, they still struggle to detect and block inappropriate
content.
In one case, Facebook’s content-moderation AI banned the photo of
a 30,000-year-old statue as nudity.
Several studies have shown that CNNs trained on ImageNet and
other popular datasets fail to detect objects when they see them
under different lighting conditions and from new angles.
A recent study by researchers at the MIT-IBM Watson AI Lab
highlights these shortcomings. It also introduces ObjectNet, a
dataset that better represents the different nuances of how objects
are seen in real life.
CNNs don't develop the mental models that humans have about
different objects, nor the human ability to imagine those objects in
previously unseen contexts.
Another problem with convolutional neural networks is their
inability to understand the relations between different objects.
Consider the following image, which is known as a “Bongard
problem”.
Adversarial attacks have become a major source of concern as
deep learning and especially CNNs have become an integral
component of many critical applications such as self-driving cars.
Convolution Layer
Convolution is one of the main building blocks of a CNN. The term
convolution refers to the mathematical combination of two
functions to produce a third function. It merges two sets of
information.
In the animation below, you can see the convolution operation: the
filter (the green square) slides over our input (the blue square), and
the sum of the element-wise products goes into the feature map
(the red square).
The area of our filter is also called the receptive field, named after
the neuron cells! The size of this filter is 3x3.
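To make the sliding-window idea concrete, here is a small NumPy sketch (illustrative, not from the slides) that computes a feature map by sliding a 3x3 filter over a 5x5 input:

import numpy as np

def convolve2d(image, kernel):
    # Each output cell is the sum of the element-wise product between
    # the kernel and the input patch it currently covers.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)   # hypothetical 5x5 input
kernel = np.ones((3, 3))                # 3x3 filter (the receptive field)
print(convolve2d(image, kernel).shape)  # (3, 3) feature map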
In reality, convolutions are performed in 3D: each image is
represented as a 3D matrix with dimensions for width, height, and
depth. Depth is a dimension because of the color channels used in
an image (RGB).
We perform numerous convolutions on our input, where each
operation uses a different filter. This results in different feature
maps.
In the end, we take all of these feature maps and put them
together as the final output of the convolution layer.
Just like any other Neural Network, we use an activation function to
make our output non-linear. In the case of a Convolutional Neural
Network, the output of the convolution will be passed through the
activation function. This could be the ReLU activation function, y =
max(0, x).
For example, given the matrix M = [[-3, 19, 5], [7, -6, 12], [4, -8, 17]],
ReLU converts it to [[0, 19, 5], [7, 0, 12], [4, 0, 17]].
ReLU is used most often because it is fast, computationally efficient,
and allows the network to converge very quickly.
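A quick NumPy check of the example above (a sketch, not part of the slides):

import numpy as np

M = np.array([[-3, 19, 5], [7, -6, 12], [4, -8, 17]])
print(np.maximum(0, M))  # ReLU: y = max(0, x), applied element-wise
# [[ 0 19  5]
#  [ 7  0 12]
#  [ 4  0 17]]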
Stride
Stride is the size of the step the convolution filter moves each time.
A stride size is usually 1, meaning the filter slides pixel by pixel.
By increasing the stride size, your filter slides over the input with a
larger interval and thus has less overlap between the cells.
The larger the stride, the smaller the resulting output, and vice versa
(see the sketch after the padding section below).
Padding
Because the size of the feature map is always smaller than the
input, we have to do something to prevent our feature map from
shrinking. This is where we use padding.
A layer of zero-valued pixels is added around the input, so that our
feature map will not shrink. Padding also improves performance and
makes sure that the kernel and stride size will fit the input.
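The interaction of input size n, filter size f, padding p, and stride s is captured by the standard output-size formula floor((n + 2p - f) / s) + 1; a small illustrative sketch (the example values are assumptions):

def output_size(n, f, p, s):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(output_size(n=32, f=3, p=0, s=1))  # 30: no padding, the map shrinks
print(output_size(n=32, f=3, p=1, s=1))  # 32: 'same' padding preserves the size
print(output_size(n=32, f=3, p=1, s=2))  # 16: a larger stride shrinks the output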
Pooling
After a convolution layer, it is common to add a pooling layer in
between CNN layers. The function of pooling is to continuously
reduce the dimensionality, cutting down the number of parameters
and the computation in the network. This shortens the training time
and controls overfitting.
There can be any number of convolution, ReLU, and pooling layers.
The initial convolution layers learn generic information, and the
later layers learn more specific/complex features.
Pooling can be done in the following ways (see the sketch after this list):
Max pooling: selects the maximum element from each patch of the
feature map. The resulting max-pooled layer holds the important
features of the feature map. It is the most common approach, as it
gives better results.
Average pooling: computes the average of each patch of the
feature map.
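A NumPy sketch of both pooling variants on a hypothetical 4x4 feature map (illustrative, not from the slides):

import numpy as np

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [2, 1, 9, 8],
                 [3, 0, 7, 4]])

# Split the 4x4 map into 2x2 patches (stride 2), then reduce each patch
patches = fmap.reshape(2, 2, 2, 2)
print(patches.max(axis=(1, 3)))   # max pooling:     [[6 5] [3 9]]
print(patches.mean(axis=(1, 3)))  # average pooling: [[3.5 2.] [1.5 7.]]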
Classification
After the convolution and pooling layers, our classification part
consists of a few fully connected layers. However, these fully
connected layers can only accept 1-dimensional data. To convert
our 3D data to 1D, we use a flatten operation, which arranges our
3D volume into a 1D vector.
The last layers of a Convolutional Neural Network are fully connected
layers. Neurons in a fully connected layer have full connections to all
the activations in the previous layer. This part is in principle the same
as a regular Neural Network.
Because most of the parameters sit in the fully connected layers,
they are prone to overfitting. Dropout is one of the techniques that
reduces overfitting.
Dropout is an approach used for regularization in neural networks.
It is a technique where randomly chosen nodes are ignored during
each stage of the training phase.
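As a sketch of how this part looks in Keras (the layer sizes and the input volume shape are assumptions, not from the slides):

from tensorflow.keras import layers, models

classifier = models.Sequential([
    # convolution/pooling layers would come before this point
    layers.Flatten(input_shape=(4, 4, 64)),  # 3D volume -> 1D vector (1024 values)
    layers.Dense(64, activation='relu'),     # fully connected layer
    layers.Dropout(0.5),                     # randomly ignore half the nodes while training
    layers.Dense(10, activation='softmax'),  # probability scores for 10 classes
])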
Softmax is an activation layer normally applied to the last layer of
the network, where it acts as a classifier: classification of the given
input into distinct classes takes place at this layer. The softmax
function is used to map the non-normalized output of a network to
a probability distribution.
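A minimal NumPy version of the softmax mapping (illustrative, with made-up scores):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()         # normalize so the outputs sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099]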
Loss Layer
In the context of an optimization algorithm, the function used to evaluate
a candidate solution (i.e., a set of weights) is referred to as the objective
function.
Typically, with neural networks, we seek to minimize the error. As such,
the objective function is often referred to as a cost function or a loss
function and the value calculated by the loss function is referred to as
simply “loss.”
The cost or loss function has an important job in that it must faithfully
distill all aspects of the model down into a single number in such a way
that improvements in that number are a sign of a better model.
The cost function reduces all the various good and bad aspects of a
possibly complex system down to a single number, a scalar value, which
allows candidate solutions to be ranked and compared.
It is important, therefore, that the function faithfully represent our design
goals. If we choose a poor error function and obtain unsatisfactory
results, the fault is ours for badly specifying the goal of the search.
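For instance, the cross-entropy loss used with softmax classifiers distills a whole prediction into one scalar; a small sketch (illustrative, not from the slides):

import numpy as np

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class
    return -np.log(probs[true_class])

print(cross_entropy(np.array([0.7, 0.2, 0.1]), 0))  # ~0.36: confident and correct
print(cross_entropy(np.array([0.1, 0.2, 0.7]), 0))  # ~2.30: confident and wrong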
Dense Layer
A dense layer is the regular, deeply connected neural network layer.
It is the most common and frequently used layer. A dense layer
performs the following operation on the input and returns the output:
output = activation(dot(input, kernel) + bias)
where:
input represents the input data,
kernel represents the weight matrix,
dot represents the dot product of the input and its corresponding weights,
bias represents a bias value used to optimize the model, and
activation represents the activation function.
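Spelled out in NumPy with ReLU as the activation (the shapes and values are assumptions):

import numpy as np

x = np.array([1.0, 2.0, 3.0])        # input: 3 features
W = np.array([[0.5, -1.0],           # kernel: 3 inputs x 2 units
              [0.25, 0.5],
              [-0.5, 1.0]])
b = np.array([0.1, -0.1])            # bias

output = np.maximum(0, np.dot(x, W) + b)  # activation(dot(input, kernel) + bias)
print(output)                             # [0.  2.9]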
1x1 Convolution
Convolution layers are lighter than fully connected ones, but they
still connect every input channel to every output channel for every
position in the kernel window.
A 1x1 convolution kernel acts as an embedding solution: it reduces
the size of the input vector, i.e., the number of channels, making it
more meaningful. The 1x1 convolutional layer is also called a
pointwise convolution.
A 1x1 convolution simply means the filter is of size 1x1 (yes, that
means a single number, as opposed to a matrix like a 3x3 filter).
This 1x1 filter convolves over the ENTIRE input image, pixel by pixel.
Staying with an example input of 64x64x3, if we choose a 1x1 filter
(which would be 1x1x3), then the output will have the same height
and width as the input but only one channel: 64x64x1.
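This is easy to verify in Keras (a sketch; the input is random and the filter weights are randomly initialized):

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 64, 64, 3))                 # one 64x64 RGB image
y = layers.Conv2D(filters=1, kernel_size=(1, 1))(x)  # a single 1x1x3 filter
print(y.shape)  # (1, 64, 64, 1): same height and width, one channel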
A 1x1 convolution is effectively used for:
Dimensionality Reduction/Augmentation
Reduce computational load by reducing parameter map
Add additional non-linearity to the network
Create deeper network through “Bottle-Neck” layer
Create smaller CNN network which retains higher degree of accuracy
Inception Network
The Inception network is used in convolutional neural networks to
allow more efficient computation and deeper networks through
dimensionality reduction with stacked 1x1 convolutions.
The modules were designed to solve the problem of computational
expense, as well as overfitting, among other issues. The solution, in
short, is to take multiple kernel filter sizes within the CNN, and
rather than stacking them sequentially, ordering them to operate
on the same level.
The most simplified version of an inception module works by
performing a convolution on an input with not one, but three
different sizes of filters (1x1, 3x3, 5x5).
Also, max pooling is performed. Then, the resulting outputs are
concatenated and sent to the next layer. By structuring the CNN to
perform its convolutions on the same level, the network gets
progressively wider, not deeper.
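A sketch of such a module in the Keras functional API (the channel counts are assumptions loosely modeled on GoogLeNet, not from the slides):

from tensorflow.keras import Input, Model, layers

inputs = Input(shape=(28, 28, 192))
b1 = layers.Conv2D(64, (1, 1), padding='same', activation='relu')(inputs)  # 1x1 branch
b2 = layers.Conv2D(96, (1, 1), padding='same', activation='relu')(inputs)  # 1x1 bottleneck...
b2 = layers.Conv2D(128, (3, 3), padding='same', activation='relu')(b2)     # ...then 3x3
b3 = layers.Conv2D(16, (1, 1), padding='same', activation='relu')(inputs)  # 1x1 bottleneck...
b3 = layers.Conv2D(32, (5, 5), padding='same', activation='relu')(b3)      # ...then 5x5
b4 = layers.MaxPooling2D((3, 3), strides=1, padding='same')(inputs)        # max pooling...
b4 = layers.Conv2D(32, (1, 1), padding='same', activation='relu')(b4)      # ...then 1x1
outputs = layers.concatenate([b1, b2, b3, b4])  # stack branches along the depth axis
module = Model(inputs, outputs)  # output: 28x28x(64+128+32+32) = 28x28x256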
Transfer Learning
Transfer learning is the idea of overcoming the isolated learning
paradigms and utilizing the knowledge acquired for one task to
solve related ones.
In transfer learning we first train a base network on a base dataset
and task, and then we repurpose the learned features, or transfer
them, to a second target network to be trained on a target dataset
and task. This process will tend to work if the features are general,
that is, suitable to both base and target tasks, instead of being
specific to the base task.
In practice, very few people train an entire Convolutional Network
from scratch because it is relatively rare to have a dataset of
sufficient size. Instead, it is common to pre-train a ConvNet on a
very large dataset (e.g. ImageNet, which contains 1.2 million
images with 1000 categories), and then use the ConvNet either as
an initialization or a fixed feature extractor for the task of interest.
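A sketch of the fixed-feature-extractor variant in Keras (the choice of base network, head sizes, and target class count are assumptions):

import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(weights='imagenet',  # pre-trained on ImageNet
                                   include_top=False,   # drop the original classifier
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the ConvNet: use it as a fixed feature extractor

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(10, activation='softmax'),  # hypothetical 10-class target task
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])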
One Shot Learning
Deep Convolutional Neural Networks have become the state-of-the-art
methods for image classification tasks. However, one of their biggest
limitations is that they require a lot of labeled data.
One-shot learning is a classification task where one, or a few, examples are
used to classify many new examples in the future.
This characterizes tasks seen in the field of face recognition, such as face
identification and face verification, where people must be classified
correctly with different facial expressions, lighting conditions, accessories,
and hairstyles given one or a few template photos.
Modern face recognition systems approach one-shot learning by
learning a rich low-dimensional feature representation, called a face
embedding, that can be calculated for faces easily and compared for
verification and identification tasks.
In one-shot classification, we require only one training example for
each class.
Historically, embeddings were learned for one-shot learning problems
using a Siamese network.
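A sketch of how verification works once an embedding has been learned (the embed function and the threshold are hypothetical placeholders, not from the slides):

import numpy as np

def verify(embed, template_img, query_img, threshold=0.5):
    # embed: a learned function mapping an image to a low-dimensional vector
    a = embed(template_img)  # the single stored example for the identity
    b = embed(query_img)     # the new face to check
    distance = np.linalg.norm(a - b)  # Euclidean distance in embedding space
    return distance < threshold       # close embeddings -> same identity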
Dimensionality Reduction
In machine learning classification problems, there are often too many
factors on the basis of which the final classification is done. These
factors are basically variables called features.
The higher the number of features, the harder it gets to visualize the
training set and then work on it.
Sometimes, most of these features are correlated, and hence
redundant. This is where dimensionality reduction algorithms come
into play.
Dimensionality reduction is the process of reducing the number of
random variables under consideration, by obtaining a set of principal
variables. It can be divided into feature selection and feature
extraction.
A classification problem that relies on both humidity and rainfall can
be collapsed into just one underlying feature, since both of them are
correlated to a high degree. Hence, we can reduce the number of
features in such problems.
A 3-D classification problem can be hard to visualize, whereas a 2-D
one can be mapped to a simple 2-dimensional space, and a 1-D
problem to a simple line.
The below figure illustrates this concept, where a 3-D feature space
is split into two 1-D feature spaces, and later, if found to be
correlated, the number of features can be reduced even further.
There are two components of dimensionality reduction:
Feature selection: In this, we try to find a subset of the original set of
variables, or features, to get a smaller subset which can be used to
model the problem. It usually involves three ways:
Filter
Wrapper
Embedded
Feature extraction: This reduces the data in a high-dimensional space
to a lower-dimensional space, i.e. a space with fewer dimensions.
Methods of Dimensionality Reduction: The various methods used for
dimensionality reduction include:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Generalized Discriminant Analysis (GDA)
Principal Component Analysis (PCA)
The Gist: PCA takes your high-dimensional data and squishes it into
a lower-dimensional space, capturing the most important
information but ditching the redundancy. Think of it like
summarizing a long lecture into key points – you lose some detail,
but the core meaning remains.
For example, let's assume that the scatter plot of our data set is as
shown below. Can we guess the first principal component?
It’s approximately the line that matches the purple marks because it
goes through the origin and it’s the line in which the projection of the
points (red dots) is the most spread out. Or mathematically speaking, it’s
the line that maximizes the variance (the average of the squared
distances from the projected points (red dots) to the origin).
Principal Component Analysis (PCA): Working
You choose how many PCs to keep based on how much information
you want to retain. Usually, the first few PCs capture most of the
variance in the data.
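A scikit-learn sketch of this choice (the data here is random and purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)   # hypothetical data: 100 samples, 10 features
pca = PCA(n_components=0.95)  # keep enough PCs to retain 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, k) with k < 10
print(pca.explained_variance_ratio_)  # variance captured by each kept PC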
Implementation of CNN with TensorFlow
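The deck starts at Step 3, so Steps 1 and 2 (importing the libraries and loading the data) are missing. A minimal reconstruction, assuming the standard CIFAR-10 dataset that the later code references:

Step 1: Import TensorFlow and helper libraries

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

Step 2: Download and prepare the CIFAR-10 dataset

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values from [0, 255] to the [0, 1] range
train_images, test_images = train_images / 255.0, test_images / 255.0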
Step 3: Verify the data

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    # The CIFAR labels happen to be arrays,
    # which is why you need the extra index
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()
Step 4: Create the convolutional base

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
Step 5: Compile and train the model

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# new_data is assumed to be prepared images of shape (batch, 32, 32, 3)
predictions = model.predict(new_data)
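The slides jump straight from compiling to predicting. In the usual Keras flow, a classification head is added to the convolutional base before the compile call above, and the model is trained before calling predict; a minimal sketch of the omitted pieces (the epoch count and layer sizes are assumptions):

model.add(layers.Flatten())                     # 3D feature maps -> 1D vector
model.add(layers.Dense(64, activation='relu'))  # fully connected layer
model.add(layers.Dense(10))                     # 10 CIFAR-10 classes, as logits

history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))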