402B Deep Learning
UNIT I
Basics Of Neural Networks: Basic concept of Neurons – Perceptron Algorithm – Feed
Forward and Back Propagation Networks.
UNIT II
Introduction To Deep Learning: Feed Forward Neural Networks – Gradient Descent – Back
Propagation Algorithm – Vanishing Gradient problem – Mitigation – ReLU Heuristics for
Avoiding Bad Local Minima – Heuristics for Faster Training – Nesterov's Accelerated Gradient
Descent – Regularization – Dropout.
UNIT III
Convolutional Neural Networks: CNN Architectures – Convolution – Pooling Layers –
Transfer Learning – Image Classification using Transfer Learning
UNIT IV
More Deep Learning Architectures: LSTM, GRU, Encoder/Decoder Architectures –
Autoencoders – Standard – Sparse – Denoising – Contractive – Variational Autoencoders –
Adversarial Generative Networks – Autoencoder and DBM
UNIT V
Applications Of Deep Learning: Image Segmentation – Object Detection – Automatic Image
Captioning – Image generation with Generative Adversarial Networks – Video to Text with
LSTM Models – Attention Models for Computer Vision – Case Study: Named Entity
Recognition – Opinion Mining using Recurrent
Neural Networks – Parsing and Sentiment Analysis using Recursive Neural Networks –
Sentence Classification using Convolutional Neural Networks – Dialogue Generation with
LSTMs.
Text Books:
1. Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2017.
2. Navin Kumar Manaswi, “Deep Learning with Applications Using Python”, Apress,
2018.
Reference Books
1. Francois Chollet, “Deep Learning with Python”, Manning Publications, 2018.
2. Phil Kim, “Matlab Deep Learning: With Machine Learning, Neural Networks and
Artificial Intelligence”, Apress , 2017.
3. Ragav Venkatesan, Baoxin Li, “Convolutional Neural Networks in Visual Computing”,
CRC Press, 2018.
Lecture Notes
Unit-1
Basic Concept of neurons
Neurons are the building blocks of the nervous system. They receive
and transmit signals to different parts of the body. This is carried out in
both physical and electrical forms. There are several different types of
neurons that facilitate the transmission of information.
The sensory neurons carry information from the sensory receptor cells
present throughout the body to the brain. Whereas, the motor neurons
transmit information from the brain to the muscles. The interneurons
transmit information between different neurons in the body.
Neuron Structure
A neuron varies in shape and size depending on its function and
location. All neurons have three different parts – dendrites, cell body and
axon.
Parts of Neuron
Following are the different parts of a neuron:
Dendrites
These are branch-like structures that receive messages from other
neurons and allow the transmission of messages to the cell body.
Cell Body
Each neuron has a cell body with a nucleus, Golgi body, endoplasmic
reticulum, mitochondria and other components.
Axon
Axon is a tube-like structure that carries electrical impulse from the cell
body to the axon terminals that pass the impulse to another neuron.
Synapse
It is the chemical junction between the terminal of one neuron and the
dendrites of another neuron.
Neuron Types
There are three different types of neurons:
Sensory Neurons
The sensory neurons convert signals from the external environment into
corresponding internal stimuli. The sensory inputs activate the sensory
neurons, which carry sensory information to the brain and spinal cord. They
are pseudounipolar in structure.
Motor Neurons
These are multipolar and are located in the central nervous system
extending their axons outside the central nervous system. This is the
most common type of neuron and transmits information from the brain to
the muscles of the body.
Interneurons
They are multipolar in structure. Their axons connect only to the nearby
sensory and motor neurons. They help in passing signals between two
neurons.
Neuron Functions
The important functions of a neuron are:
Chemical Synapse
In chemical synapses, the action potential affects other neurons through
a gap present between two neurons known as the synapse. The action
potential is carried along the axon to a presynaptic ending, where it initiates
the release of chemical messengers known as neurotransmitters. These
neurotransmitters excite the postsynaptic neurons, which generate an
action potential of their own.
Electrical Synapse
When two neurons are connected by a gap junction, it results in an
electrical synapse. These gaps include ion channels that help in the
direct transmission of a positive electrical signal. These are much faster
than chemical synapses.
Perceptron algorithm
Since AND gates are familiar, the AND gate is used here as an example
to explain how a perceptron works as a linear classifier. Its inputs fall
into two classes:
Class 1: Inputs having output as 0, which lie below the decision line.
Class 2: Inputs having output as 1, which lie above the decision line
or separator.
(Figure: classifying the inputs of an AND gate using a perceptron.)
So far we have seen that a linear perceptron can be used to classify
the input data set into two classes. But how does it actually classify the
data?
The first example of the Perceptron Learning Algorithm implements an
AND gate using a perceptron from scratch.
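A minimal sketch of such an implementation is given below. The step
activation, the learning rate of 1.0, the zero initial weights and the number
of epochs are assumptions made for illustration; the notes do not specify them.

```python
import numpy as np

def step(z):
    # Step activation: the neuron fires (1) when the weighted sum is >= 0.
    return 1 if z >= 0 else 0

def train_perceptron(X, y, lr=1.0, epochs=20):
    # Perceptron learning rule: w <- w + lr * (target - prediction) * x
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            error = target - step(np.dot(w, xi) + b)
            w += lr * error * xi
            b += lr * error
    return w, b

# Truth table of the AND gate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = [step(np.dot(w, xi) + b) for xi in X]
print("weights:", w, "bias:", b)
print("predictions:", predictions)
```

The learned line w1*x1 + w2*x2 + b = 0 is the decision line: only the input
(1, 1) lies on or above it, so only that input is placed in Class 2.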
A typical artificial neuron is described below. The initial weights,
which may be random, are multiplied by the feature vector and
summed in the neuron; the activation function then decides whether
the neuron fires or not.
(Figure: a typical neuron with inputs, weights, and an internal assembly containing
summation and activation.)
Let us look at the assumed values that are required initially for the
feed-forward and backpropagation passes. The hidden layer activation
function is assumed to be the sigmoid, and the weights are random
initially.
Sigmoid expression: $\sigma(x) = \frac{1}{1 + e^{-x}}$
After feeding the summed-up values (0.105 and 0.205) through the sigmoid,
the outputs of the hidden neurons become:
Oh1=0.526
Oh2=0.551
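The two hidden outputs can be checked with a few lines of Python. This is
only a verification sketch; the names `oh1` and `oh2` simply mirror Oh1 and
Oh2 above.

```python
import math

def sigmoid(x):
    # Logistic sigmoid: squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

oh1 = sigmoid(0.105)  # summed input of the first hidden neuron
oh2 = sigmoid(0.205)  # summed input of the second hidden neuron
print(round(oh1, 3), round(oh2, 3))
```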
These outputs are then multiplied by the weight matrix of the
next layer to give the inputs of the output layer.
We have taken the sigmoid function at the output node, but in
practice softmax is more appropriate, especially when it comes to
multiclass classification.
Backpropagation.
The derivative of the total error with respect to a weight is also called
the gradient, so finding the gradient of the total error with respect to
each weight is what backprop does.
The total error term E does not depend on any weight W directly. Hence,
to calculate the change in E with respect to W5, we have to apply the
chain rule through the intermediate quantities in reverse order.
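As a sketch of that reverse-order decomposition, assume W5 connects the first
hidden neuron to the first output neuron, and write $net_{o1}$ for the summed
input of that output neuron and $O_{o1}$ for its activation (these symbols are
assumptions; the notes only name E and W5). The chain rule then reads:

```latex
\frac{\partial E}{\partial W_5}
  = \frac{\partial E}{\partial O_{o1}}
    \cdot \frac{\partial O_{o1}}{\partial net_{o1}}
    \cdot \frac{\partial net_{o1}}{\partial W_5}
```

Each factor is computed starting from the error and moving back towards the
weight, which is why the procedure is called backpropagation.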
Long Questions:
1. Explain the concept of a neuron.
2. Explain the perceptron algorithm with an example.
3. What is the difference between Feed Forward and Back Propagation Networks?
Unit-2
Deep learning
Deep learning is a branch of machine learning. Since a neural network
mimics the human brain, deep learning is also a kind of mimicry of the
human brain. In deep learning, we don’t need to explicitly program
everything. The concept of deep learning is not new; it has been around
for several decades. It is popular nowadays because earlier we did not
have that much processing power or data. As processing power has
increased exponentially over the last 20 years, deep learning and machine
learning came into the picture. A formal definition of deep learning follows.
Deep Learning is a subset of Machine Learning that is based on artificial
neural networks (ANNs) with multiple layers, also known as deep neural
networks (DNNs). These neural networks are inspired by the structure and
function of the human brain, and they are designed to learn from large
amounts of data in a supervised, semi-supervised or unsupervised manner.
Deep Learning models are able to automatically learn features from the
data, which makes them well-suited for tasks such as image recognition,
speech recognition, and natural language processing. The most widely used
architectures in deep learning are feedforward neural networks,
convolutional neural networks (CNNs), and recurrent neural networks
(RNNs).
Feedforward neural networks (FNNs) are the simplest type of ANN, with a
linear flow of information through the network. FNNs have been widely
used for tasks such as image classification, speech recognition, and natural
language processing.
Convolutional Neural Networks (CNNs) are a special type of FNNs designed
specifically for image and video recognition tasks. CNNs are able to
automatically learn features from the images, which makes them well-
suited for tasks such as image classification, object detection, and image
segmentation.
Recurrent Neural Networks (RNNs) are a type of neural networks that are
able to process sequential data, such as time series and natural language.
RNNs are able to maintain an internal state that captures information about
the previous inputs, which makes them well-suited for tasks such as speech
recognition, natural language processing, and language translation.
Deep Learning models are trained using large amounts of labeled data and
require significant computational resources. With the increasing
availability of large amounts of data and computational resources, deep
learning has been able to achieve state-of-the-art performance in a wide
range of applications such as image and speech recognition, natural
language processing, and more.
Difference between Machine Learning and Deep Learning:
of Dataset.
Machine Learning: Dependent on low-end machines.
Deep Learning: Heavily dependent on high-end machines.
Machine Learning: Divides the task into sub-tasks, solves them individually
and finally combines the results.
Deep Learning: Solves the problem end to end.
Gradient Descent
Error = Y(Predicted)-Y(Actual)
2. Update values.
In this way, rather than computing each new step from scratch, we keep a
decaying average of the past steps. The older a step is, the more it has
decayed, and so the less effect it has on the current update. The more
consistent the history, the bigger the steps that will be taken.
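The averaging described above can be sketched as gradient descent with a
decaying average of past gradients (momentum). The quadratic objective
(w - 3)^2, the decay factor `beta` and the learning rate `lr` are assumptions
chosen only to make the example runnable.

```python
def grad(w):
    # Gradient of the illustrative objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def momentum_descent(w0, lr=0.1, beta=0.9, steps=300):
    w, v = w0, 0.0
    for _ in range(steps):
        # Decaying average: older gradients are scaled by beta at every step,
        # so their influence on the update shrinks over time.
        v = beta * v + (1.0 - beta) * grad(w)
        # Update the value by moving against the averaged gradient.
        w = w - lr * v
    return w

w_final = momentum_descent(0.0)
print(w_final)
```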
Backpropagation:
Features of Backpropagation:
Backpropagation Algorithm:
Step 3: Calculate the output of each neuron from the input layer to the
hidden layer to the output layer.
Step 4: Calculate the error in the outputs
Backpropagation Error= Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the
weights to reduce the error.
Step 6: Repeat the process until the desired output is achieved.
Parameters :
x = inputs training vector x=(x1,x2,…………xn).
t = target vector t=(t 1,t2……………tn).
δk = error at output unit.
δj = error at hidden layer.
α = learning rate.
V0j = bias of hidden unit j.
Training Algorithm :
Step 1: Initialize weight to small random values.
Step 2: While the stopping condition is false, do steps 3 to 10.
Step 3: For each training pair, do steps 4 to 9 (Feed-Forward).
Step 4: Each input unit receives the input signal xi and transmits it
to all the units in the hidden layer.
Step 5: Each hidden unit Zj (j = 1 to p) sums its weighted input signals to
calculate its net input
zinj = v0j + Σ xi vij (i = 1 to n)
It then applies the activation function, zj = f(zinj), and sends this signal to
all units in the layer above, i.e. the output units.
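Steps 4 and 5 can be sketched with NumPy as follows. The input values, the
weight matrix `V`, the bias vector `v0` and the choice of the sigmoid for f
are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])      # inputs x_i, n = 3
V = np.array([[0.1, 0.4],
              [0.2, -0.3],
              [-0.5, 0.6]])         # weights v_ij, shape (n, p) with p = 2
v0 = np.array([0.05, -0.05])        # biases v_0j of the hidden units

z_in = v0 + x @ V                   # net input: z_inj = v_0j + sum_i x_i * v_ij
z = sigmoid(z_in)                   # hidden outputs: z_j = f(z_inj)
print(z_in, z)
```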
The result is the general inability of models with many layers to learn on
a given dataset, or for models with many layers to prematurely converge
to a poor solution.
1. Multi-level hierarchy
2. Long short – term memory
3. Faster hardware
4. Residual neural networks (ResNets)
5. ReLU
Residual neural networks (ResNets)
One of the newest and most effective ways to resolve the vanishing
gradient problem is with residual neural networks, or ResNets (not to be
confused with recurrent neural networks). It was noted before ResNets
that a deeper network would have higher training error than the shallow
network.
ResNet
How can the local gradient be 1, i.e, the derivative of which function
would always be 1? The Identity function!
The ResNet architecture, shown below, should now make perfect sense
as to how it would not allow the vanishing gradient problem to occur.
ResNet stands for Residual Network.
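The effect of the identity path on the gradient can be illustrated
numerically. In this sketch each layer is a toy scalar function tanh(w*x) and
every layer sees the same input; both are simplifying assumptions, not the
actual ResNet block.

```python
import math

def local_grad_plain(x, w=0.1):
    # Derivative of a plain layer F(x) = tanh(w * x).
    return w * (1.0 - math.tanh(w * x) ** 2)

def local_grad_residual(x, w=0.1):
    # Derivative of a residual layer x + F(x): the identity contributes exactly 1.
    return 1.0 + local_grad_plain(x, w)

def chained_gradient(local_grad, x=0.5, depth=20):
    # Backpropagation multiplies the local gradients of all layers together.
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad(x)
    return grad

plain = chained_gradient(local_grad_plain)
residual = chained_gradient(local_grad_residual)
print(plain, residual)
```

The plain chain collapses towards zero after 20 layers, while the residual
chain stays at or above 1 because of the identity term.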
Mitigation
There are several attacks against deep learning models in the literature,
including fast-gradient sign method (FGSM), basic iterative
method (BIM) or momentum iterative method (MIM) attacks.
These attacks are the purest form of the gradient-based evasion
technique that attackers use to evade a classification model.
The term adversarial machine learning describes attacks on machine
learning models that try to mislead a model with malicious input
instances. The figure shows a typical adversarial machine learning
attack.
Our model is able to respond to the model attacks by hackers who use the
adversarial machine learning methods. The figure illustrates the system
architecture used to protect the model and to classify correctly.
(Figures: confusion matrix; adversarially trained model’s confusion matrix.)
The Rectified Linear Unit (ReLU) activation function can be described as:
f(x) = max(0, x)
The dying ReLU problem refers to the scenario when many ReLU
neurons only output values of 0. The red outline below shows that this
happens when the inputs are in the negative range.
When most of these neurons return output zero, the gradients fail to flow
during backpropagation, and the weights are not updated. Ultimately a
large part of the network becomes inactive, and it is unable to learn
further.
Because the slope of ReLU in the negative input range is also zero, once it
becomes dead (i.e., stuck in negative range and giving output 0), it is
likely to remain unrecoverable.
However, the dying ReLU problem does not happen all the time since the
optimizer (e.g., stochastic gradient descent) considers multiple input
values each time. As long as NOT all the inputs push ReLU to the
negative segment (i.e., some inputs are in the positive range), the
neurons can stay active, the weights can get updated, and the network
can continue learning.
Let us first look at the equation for the update step in backpropagation:
$w_{new} = w_{old} - \alpha \, \frac{\partial E}{\partial w_{old}}$
If our learning rate (α) is set too high, there is a significant chance that
our new weights will end up in the highly negative value range since our
old weights will be subtracted by a large number. These negative weights
result in negative inputs for ReLU, thereby causing the dying ReLU
problem to happen.
While we have mostly talked about weights so far, we must not forget that
the bias term is also passed along with the weights into the activation
function.
Bias is a constant value added to the product of inputs and weights. Given
its involvement, a large negative bias term can cause the ReLU activation
inputs to become negative. This, as already described, causes the neurons
to consistently output 0, leading to the dying ReLU problem.
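The effect of a large negative bias can be sketched as follows. The random
inputs, the weights and the bias of -10 are assumptions chosen so that every
pre-activation of the second neuron is negative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Slope of ReLU: 1 for positive inputs, 0 otherwise.
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 3))   # 100 inputs with 3 features each
w = np.array([0.5, -0.2, 0.1])

healthy_z = x @ w + 0.0      # bias 0: some pre-activations are positive
dead_z = x @ w - 10.0        # large negative bias: all pre-activations negative

print("active inputs (healthy):", int(relu_grad(healthy_z).sum()))
print("active inputs (dead):   ", int(relu_grad(dead_z).sum()))
```

For the dead neuron both the output and the gradient are zero for every
input, so no weight update can move it back into the positive range.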
A heuristic is, simply put, a shortcut. Heuristics are
strategies often used to find a solution that is not perfect, but is within an
acceptable degree of accuracy for the needs of the process. In
computing, heuristics are especially useful when finding an optimal
solution to a problem is impractical because of slow speed or processing
power limitations.
(The equations of this derivation, numbered (1) to (5), did not survive in
the source. The surviving steps are: a first result (1) is applied a second
time to give (2); (1) is multiplied by a factor and added to (2) to obtain
(3); (4) is then verified and (5) follows by definition; putting together
(3), (4) and (5) gives the final result.)
Regularization
Sometimes the machine learning model performs well with the training
data but does not perform well with the test data. It means the model is
not able to predict the output when it deals with unseen data, because it
has fitted the noise in the training data, and hence the model is called
overfitted. This problem can be dealt with by using a regularization
technique.
This technique allows us to keep all the variables or features in the
model while reducing the magnitude of their coefficients. Hence, it
maintains accuracy as well as the generalization of the model.
y= β0+β1x1+β2x2+β3x3+⋯+βnxn +b
Now, we will add a loss function and optimize parameter to make the
model that can predict the accurate value of Y. The loss function for the
linear regression is called as RSS or Residual sum of squares.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given
below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a
small amount of bias is introduced so that we can get better long-
term predictions.
o Ridge regression is a regularization technique, which is used to
reduce the complexity of the model. It is also called as L2
regularization.
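o In this technique, the cost function is altered by adding the penalty
term to it. The amount of bias added to the model is called the Ridge
Regression penalty. We can calculate it by multiplying lambda by the
squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:
$\text{Cost} = \sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$
where M is the number of training samples and $\hat{y}_i$ is the prediction
for sample i.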
Lasso Regression:
o Lasso regression is another regularization technique to reduce the
complexity of the model.
o It is similar to the Ridge Regression except that the penalty term
contains only the absolute weights instead of a square of weights.
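o Since it takes absolute values, it can shrink the slope to 0,
whereas Ridge Regression can only shrink it near to 0.
o It is also called L1 regularization. The equation for the cost
function of Lasso regression will be:
$\text{Cost} = \sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |\beta_j|$
where M is the number of training samples and $\hat{y}_i$ is the prediction
for sample i.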
o Some of the features in this technique are completely neglected
for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in
the model as well as the feature selection.
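The two penalised cost functions can be compared with a short sketch. The
sample targets, predictions, coefficients and lambda below are made-up
values, and the intercept is assumed to be excluded from `beta`, as is
common practice.

```python
import numpy as np

def rss(y, y_pred):
    # Residual sum of squares: the unpenalised linear regression loss.
    return float(np.sum((y - y_pred) ** 2))

def ridge_cost(y, y_pred, beta, lam):
    # L2 penalty: lambda times the sum of squared coefficients.
    return rss(y, y_pred) + lam * float(np.sum(beta ** 2))

def lasso_cost(y, y_pred, beta, lam):
    # L1 penalty: lambda times the sum of absolute coefficients.
    return rss(y, y_pred) + lam * float(np.sum(np.abs(beta)))

y = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 7.0])
beta = np.array([2.0, -1.0])     # slope coefficients only
lam = 0.1

print(rss(y, y_pred), ridge_cost(y, y_pred, beta, lam), lasso_cost(y, y_pred, beta, lam))
```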
Drop Out
The concept of Neural Networks is inspired by the neurons in the
human brain; scientists wanted a machine to replicate the same
process. This paved the way to one of the most important topics in
Artificial Intelligence. A Neural Network (NN) is based on a collection of
connected units or nodes called artificial neurons, which loosely model
the neurons in a biological brain. Since such a network is created
artificially in machines, we refer to it as an Artificial Neural Network
(ANN). This section assumes a working knowledge of ANNs. Let us now
look more closely at Dropout in ANNs.
Problem: When a fully-connected layer has a large number of neurons,
co-adaptation is more likely to happen. Co-adaptation refers to when
multiple neurons in a layer extract the same, or very similar, hidden
features from the input data. This can happen when the connection
weights for two different neurons are nearly identical.
4.What is Regularization?
Long Questions
Unit-3
CNN Architecture:
There are two main parts to a CNN architecture
Convolution Layers
There are three types of layers that make up the CNN: the convolutional
layers, pooling layers, and fully-connected (FC) layers. When these layers
are stacked, a CNN architecture is formed. In addition to these three
layers, there are two more important components, the dropout layer and
the activation function, which are defined below.
1. Convolutional Layer
This layer is the first layer that is used to extract the various features from
the input images. In this layer, the mathematical operation of convolution is
performed between the input image and a filter of a particular size MxM. By
sliding the filter over the input image, the dot product is taken between the
filter and the parts of the input image with respect to the size of the filter
(MxM).
The output is termed as the Feature map which gives us information about
the image such as the corners and edges. Later, this feature map is fed to
other layers to learn several other features of the input image.
The convolution layer in a CNN passes the result to the next layer after
applying the convolution operation to the input. Convolutional layers are
valuable because they keep the spatial relationship between the pixels
intact.
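The sliding dot product described above can be sketched in NumPy. The 4x4
image and the 2x2 filter are made-up values; note that, like most deep
learning libraries, this sketch does not flip the filter, so strictly
speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the filter over the image (stride 1, no padding) and take the
    # dot product between the filter and each image patch of the same size.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # pixel (i, j) has value 4*i + j
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # difference along the diagonal

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

A 4x4 input with a 2x2 filter gives a 3x3 feature map, and each output value
depends only on a small neighbourhood of pixels.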
2. Pooling Layer
In most cases, a Convolutional Layer is followed by a Pooling Layer.
The primary aim of this layer is to decrease the size of the convolved
feature map to reduce the computational costs. This is done by
decreasing the connections between layers; the pooling layer operates
independently on each feature map. Depending upon the method used, there
are several types of pooling operations. Pooling basically summarises the
features generated by a convolution layer.
In Max Pooling, the largest element is taken from feature map. Average
Pooling calculates the average of the elements in a predefined sized Image
section. The total sum of the elements in the predefined section is
computed in Sum Pooling. The Pooling Layer usually serves as a bridge
between the Convolutional Layer and the FC Layer.
This CNN model generalises the features extracted by the convolution
layer, and helps the networks to recognise the features independently.
With the help of this, the computations are also reduced in a network.
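The three pooling variants can be sketched with one helper. The 4x4 feature
map and the non-overlapping 2x2 window are assumptions made for
illustration.

```python
import numpy as np

def pool2d(fmap, size, mode="max"):
    # Split the feature map into non-overlapping size x size blocks and
    # summarise each block with a single number.
    h, w = fmap.shape
    out_h, out_w = h // size, w // size
    blocks = fmap[:out_h * size, :out_w * size].reshape(out_h, size, out_w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    if mode == "sum":
        return blocks.sum(axis=(1, 3))
    raise ValueError("mode must be 'max', 'avg' or 'sum'")

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]], dtype=float)

print(pool2d(fmap, 2, "max"))
print(pool2d(fmap, 2, "avg"))
print(pool2d(fmap, 2, "sum"))
```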
3. Fully Connected Layer
The Fully Connected (FC) layer consists of the weights and biases along
with the neurons and is used to connect the neurons between two different
layers. These layers are usually placed before the output layer and form the
last few layers of a CNN Architecture.
Here, the feature maps from the previous layers are flattened and fed to the
FC layer. The flattened vector then passes through a few more FC layers,
where the mathematical operations usually take place. In this stage, the
classification process begins to take place. Two layers are connected
because two fully connected layers generally perform better than a
single connected layer. These layers in a CNN reduce the human supervision.
4. Dropout
Usually, when all the features are connected to the FC layer, it can cause
overfitting on the training dataset. Overfitting occurs when a model fits
the training data so closely that its performance suffers when it is
used on new data.
To overcome this problem, a dropout layer is used, in which a few
neurons are dropped from the neural network during the training process,
effectively reducing the size of the model. With a dropout of 0.3, 30% of
the nodes are dropped out randomly from the neural network.
Dropout improves the performance of a machine learning model because it
prevents overfitting by making the network simpler. It drops neurons
from the neural network during training.
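A dropout of 0.3 can be sketched as a random mask over the activations. The
rescaling of the surviving activations by 1/(1 - rate), known as inverted
dropout, is an assumption added here so that the expected activation stays
the same; the notes only describe the dropping itself.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    # At inference time (training=False) the layer passes inputs through unchanged.
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    # Each node is kept with probability keep_prob and dropped otherwise.
    mask = rng.random(activations.shape) < keep_prob
    # Inverted dropout: rescale the kept nodes so the expected value is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
a = np.ones(10000)
dropped = dropout(a, 0.3, rng)
print("fraction dropped:", (dropped == 0).mean())
```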
5. Activation Functions
Finally, one of the most important components of the CNN model is the
activation function. Activation functions are used to learn and
approximate any kind of continuous and complex relationship between
variables of the network. In simple words, the activation function
decides which information should be passed in the forward direction
and which should not.
It adds non-linearity to the network. There are several commonly used
activation functions such as the ReLU, Softmax, tanh and the Sigmoid
functions. Each of these functions has a specific usage. For a binary
classification CNN model, sigmoid and softmax functions are preferred, and
for multi-class classification, softmax is generally used. In simple terms,
activation functions in a CNN model determine whether a neuron should be
activated or not, that is, whether the input to the neuron is important
for the prediction, using a mathematical operation.
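The four functions named above can be written in a few lines of NumPy. The
sample vector `z` is arbitrary, and subtracting the maximum inside softmax is
a standard numerical-stability step added here as an assumption.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print("relu:   ", relu(z))
print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))
print("softmax:", softmax(z))
```

Softmax turns the scores into probabilities that sum to 1, which is why it
suits multi-class outputs, while sigmoid maps a single score to (0, 1).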
Convolution
Convolution is a mathematical operation used to express the relation between
input and output of an LTI system. It relates input, output and impulse
response of an LTI system as
$y(t) = x(t) * h(t)$
Where y(t) = output of LTI
x(t) = input of LTI
h(t) = impulse response of LTI
There are two types of convolutions:
Continuous convolution
Discrete convolution
Continuous Convolution
$y(t) = x(t) * h(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t-\tau)\, d\tau$
(or)
$y(t) = \int_{-\infty}^{\infty} x(t-\tau)\, h(\tau)\, d\tau$
Discrete Convolution
$y(n) = x(n) * h(n) = \sum_{k=-\infty}^{\infty} x(k)\, h(n-k)$
(or)
$y(n) = \sum_{k=-\infty}^{\infty} x(n-k)\, h(k)$
By using convolution we can find zero state response of the system.
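For finite-length sequences the discrete convolution sum can be evaluated
directly. The sequences below are arbitrary examples, and samples outside the
given ranges are assumed to be zero.

```python
import numpy as np

def discrete_convolution(x, h):
    # y(n) = sum_k x(k) * h(n - k), with samples outside the arrays taken as 0.
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 3.0])
h = np.array([1.0, 1.0])

y = discrete_convolution(x, h)
print(y)
print(np.convolve(x, h))   # NumPy's built-in convolution, for comparison
```

Swapping the arguments gives the same result, which is the commutative
property of convolution.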
Deconvolution
Deconvolution is the reverse process of convolution and is widely used in
signal and image processing.
Properties of Convolution
Commutative Property
$x_1(t) * x_2(t) = x_2(t) * x_1(t)$
Distributive Property
$x_1(t) * [x_2(t) + x_3(t)] = [x_1(t) * x_2(t)] + [x_1(t) * x_3(t)]$
Associative Property
$x_1(t) * [x_2(t) * x_3(t)] = [x_1(t) * x_2(t)] * x_3(t)$
Shifting Property
If $x_1(t) * x_2(t) = y(t)$, then
$x_1(t) * x_2(t - t_0) = y(t - t_0)$
$x_1(t - t_0) * x_2(t) = y(t - t_0)$
$x_1(t - t_0) * x_2(t - t_1) = y(t - t_0 - t_1)$
Convolution with Impulse
$x_1(t) * \delta(t) = x_1(t)$
$x_1(t) * \delta(t - t_0) = x_1(t - t_0)$
Convolution of Unit Steps
$u(t) * u(t) = r(t)$
$u(t - T_1) * u(t - T_2) = r(t - T_1 - T_2)$
$u(n) * u(n) = [n + 1]\, u(n)$
Scaling Property
If $x(t) * h(t) = y(t)$
then $x(at) * h(at) = \frac{1}{|a|}\, y(at)$
Differentiation of Output
If $y(t) = x(t) * h(t)$
then $\frac{dy(t)}{dt} = \frac{dx(t)}{dt} * h(t)$
or
$\frac{dy(t)}{dt} = x(t) * \frac{dh(t)}{dt}$
Note:
Convolution of two causal sequences is causal.
Convolution of two anti causal sequences is anti causal.
Convolution of two unequal length rectangles results in a trapezium.
Convolution of two equal le
Pooling Layers
Max Pooling
1. Max pooling is a pooling operation that selects the maximum
element from the region of the feature map covered by the
filter. Thus, the output after max-pooling layer would be a
feature map containing the most prominent features of the
previous feature map.
Transfer Learning
categories), and use features from them to solve a new task. When
dealing with transfer learning, we come across a technique called
freezing of layers. A layer (it can be a CNN layer, a hidden layer, a block
of layers, or any subset of the set of all layers) is said to be frozen when
it is no longer available to train. Hence, the weights of frozen layers will
not be updated during training, while layers that are not frozen follow the
regular training procedure. When we use transfer learning to solve a
problem, we select a pre-trained model as our base model. There
are two possible approaches to using knowledge from the pre-trained
model. The first way is to freeze a few layers of the pre-trained model and
train the other layers on our new dataset for the new task. The second way
is to make a new model, but also take some features from the layers in
the pre-trained model and use them in the newly created model. In both
cases, we take some of the learned features and train the rest
of the model. This ensures that only the features that may be common
to both tasks are taken from the pre-trained model, and the rest
of the model is changed to fit the new dataset by training.
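The first approach, freezing the base and training only the new layers, can
be sketched without any deep learning library. Here `W_base` is a
hypothetical stand-in for pre-trained weights (random values in this sketch),
the data are synthetic, and the trainable head is a simple logistic
regression; all of these are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained base: its weights are frozen (never updated).
W_base = rng.normal(size=(4, 8))
W_base_before = W_base.copy()

def features(X):
    # Forward pass through the frozen base layer.
    return np.tanh(X @ W_base)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    # Binary cross-entropy loss, clipped to avoid log(0).
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Synthetic "new task" data.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Trainable head on top of the frozen features.
w_head, b_head = np.zeros(8), 0.0
F = features(X)
initial_loss = bce(sigmoid(F @ w_head + b_head), y)

for _ in range(500):
    p = sigmoid(F @ w_head + b_head)
    # Gradients are computed and applied for the head only.
    w_head -= 0.5 * (F.T @ (p - y)) / len(y)
    b_head -= 0.5 * float(np.mean(p - y))

final_loss = bce(sigmoid(F @ w_head + b_head), y)
print(initial_loss, final_loss)
```

Only `w_head` and `b_head` change during training; `W_base` keeps the values
it had before, which is what freezing means.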
5. Image classification using transfer learning
Image classification is one of the supervised machine learning
problems which aims to categorize the images of a dataset into their
respective categories or labels. Classification of images of various dog
breeds is a classic image classification problem. Because there is more
than one class to predict, it is called multi-class classification, and
that is the problem addressed here.
Transfer learning: Transfer learning is a popular deep learning method that
follows the approach of using the knowledge that was learned in some
task and applying it to solve the problem of a related target task. So,
instead of creating a neural network from scratch, we “transfer” the
learned features, which are basically the “weights” of the network. To
implement the concept of transfer learning, we make use of “pre-trained
models”.
Necessities for transfer learning: Low-level features from model A
(task A) should be helpful for learning model B (task B).
Pre-trained model: Pre-trained models are the deep learning models
which are trained on very large datasets, developed, and are made
available by other developers who want to contribute to this machine
learning community to solve similar types of problems. It contains the
biases and weights of the neural network representing the features of
the dataset it was trained on. The low-level features learned are often
transferable. For example, a model trained on a large dataset of flower
images will contain learned features such as corners, edges, shape,
color, etc.
InceptionResNetV2: InceptionResNetV2 is a convolutional neural
network that is 164 layers deep, trained on millions of images from the
ImageNet database, and can classify images into more than 1000
categories such as flowers, animals, etc. The input size of the images
is 299-by-299.
Short Questions
1. What are Convolutional Neural Networks?
2. Explain Convolution.
3. Explain Transfer Learning.
Long Questions
1.Explain about CNN Architectures?
2.Explain about pooling Layers?
3.Explain Image Classification using Transfer Learning?
Unit-4
LSTM
LSTM networks are an extension of recurrent neural networks (RNNs),
mainly introduced to handle situations where RNNs fail. An RNN is a
network that works on the present input by taking into consideration
the previous output (feedback) and storing it in its memory for a short
period of time (short-term memory). Out of its various applications, the
most popular ones are in the fields of speech processing, non-Markovian
control, and music composition.
Nevertheless, there are drawbacks to RNNs. First, an RNN fails to store
information for a longer period of time. At times, a reference to certain
information stored quite a long time ago is required to predict the
current output, but plain RNNs handle such “long-term dependencies”
poorly. Second, there is no finer control over which part of the context
needs to be carried forward and how much of the past needs to be
“forgotten”. Other issues with RNNs are exploding and vanishing
gradients (explained later), which occur during the training of a network
through backpropagation. Thus, Long Short-Term Memory (LSTM) was
brought into the picture. It has been designed so that the vanishing
gradient problem is almost completely removed, while the training model
is left unaltered. LSTMs bridge long time lags in certain problems, and
they also handle noise, distributed representations, and continuous
values. With LSTMs, there is no need to fix a finite number of states
beforehand, as required in the hidden Markov model (HMM). LSTMs work
well over a broad range of parameters such as learning rate, input gate
bias and output gate bias, so there is little need for fine adjustment.
The complexity to update each weight is reduced to O(1) with LSTMs,
similar to that of Back Propagation Through Time (BPTT), which is an
advantage.
Exploding and Vanishing Gradients:
During the training process of a network, the main goal is to minimize
the loss (in terms of error or cost) observed at the output when training
data is sent through it. We calculate the gradient, that is, the
derivative of the loss with respect to a particular set of weights,
adjust the weights accordingly, and repeat this process until we reach a
set of weights for which the loss is minimal. This is the concept of
backpropagation. Sometimes the gradient becomes almost negligible. The
gradient of a layer depends on certain components of the layers that
follow it. If some of these components are small (less than 1), their
product, which is the gradient, is even smaller. This is known as the
scaling effect. When this gradient is multiplied by the learning rate,
which is itself a small value typically between 0.001 and 0.1, the result
is smaller still. As a consequence, the change in the weights is tiny and
the network produces almost the same output as before. This is the
problem of vanishing gradients. Similarly, if the gradients are large
because the components are large (greater than 1), the weights are
updated to values beyond the optimum. This is known as the problem of
exploding gradients. To avoid this scaling effect, the neural network
unit was rebuilt so that the scaling factor is fixed to one. The cell was
then enriched with several gating units and called LSTM.
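The scaling effect can be illustrated with a few lines of Python. This is
only a sketch: it assumes that every step of backpropagation multiplies
the gradient by the same factor, and the factors 0.5, 1.5 and 1.0, the 50
steps, and the function name are chosen purely for illustration.

```python
def backpropagated_gradient(factor, steps, initial=1.0):
    """Multiply an initial gradient by the same scaling factor at every step."""
    grad = initial
    for _ in range(steps):
        grad *= factor
    return grad


steps = 50
vanishing = backpropagated_gradient(0.5, steps)  # factor below 1 shrinks the gradient
exploding = backpropagated_gradient(1.5, steps)  # factor above 1 blows it up
constant = backpropagated_gradient(1.0, steps)   # factor fixed to one keeps it unchanged

learning_rate = 0.01
print("vanishing gradient:", vanishing, "weight change:", learning_rate * vanishing)
print("exploding gradient:", exploding, "weight change:", learning_rate * exploding)
print("factor of one:     ", constant, "weight change:", learning_rate * constant)
```

With a factor below one the weight change becomes negligible, with a
factor above one it becomes enormous, and only a factor of exactly one
leaves the gradient intact over many steps.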
Architecture:
The basic difference between the architectures of RNNs and LSTMs is
that the hidden layer of an LSTM is a gated unit or gated cell. It
consists of four layers that interact with one another to produce the
output of that cell along with the cell state. These two things are then
passed on to the next cell. Unlike RNNs, which have only a single neural
net layer of tanh, LSTMs comprise three logistic sigmoid gates and one
tanh layer. Gates were introduced in order to limit the information that
is passed through the cell. They determine which part of the information
will be needed by the next cell and which part is to be discarded. The
output of a gate is in the range 0-1, where '0' means 'reject all' and
'1' means 'include all'.
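One step of a gated cell can be sketched as follows. This is a minimal
scalar illustration, not a full implementation: it assumes the usual
naming of the three sigmoid gates (forget, input, output), and the
parameter dictionary and its keys are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar input x, hidden state h_prev and cell state c_prev."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate, in (0, 1)
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate, in (0, 1)
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate, in (0, 1)
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # tanh layer, in (-1, 1)
    c = f * c_prev + i * g   # keep part of the old cell state, add part of the new input
    h = o * math.tanh(c)     # output of the cell
    return h, c


# Hypothetical parameter values, all set to 0.5 for the demonstration.
params = {k: 0.5 for k in
          ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print("output:", h, "cell state:", c)
```

Both the output and the cell state are returned, because both are passed
on to the next cell.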
Now, one may ask how to determine which layers to freeze and which
layers to train. The answer is simple: the more features you want to
inherit from a pre-trained model, the more layers you have to freeze.
For instance, suppose the pre-trained model detects some flower species
and we need to detect some new species. In such a case, the new dataset
contains a lot of features similar to those the pre-trained model has
already learned. Thus, we freeze a large number of layers so that we can
use most of its knowledge in the new model. Now consider another case:
there is a pre-trained model which detects humans in images, and we want
to use that knowledge to detect cars. Here the dataset is entirely
different, and it is not good to freeze lots of layers, because freezing
a large number of layers carries over not only low-level features but
also high-level features like nose, eyes, etc., which are useless for
the new dataset (car detection). Thus, we only copy low-level features
from the base network and train the entire network on the new dataset.
Let's consider the situations where the size and dataset of the target
task vary from the base network.
Target dataset is small and similar to the base network dataset: Since
the target dataset is small, we could fine-tune the pre-trained network
on the target dataset, but this may lead to overfitting. Also, the
number of classes in the target task may be different. So, in such a
case we remove the fully connected layers from the end, maybe one or
two, and add a new fully connected layer matching the number of new
classes. Now, we freeze the rest of the model and only
GRU
One of the lesser-known but equally effective variations is the Gated
Recurrent Unit (GRU) network.
Unlike an LSTM, a GRU does not maintain an internal cell state. The
information which an LSTM recurrent unit stores in its internal cell
state is incorporated into the hidden state of the Gated Recurrent Unit.
This collective information is passed on to the next Gated Recurrent
Unit. A GRU has two gates, plus a current memory component that is
sometimes counted as a third gate. They are described below:
1. Update Gate (z): It determines how much of the past
knowledge needs to be passed along into the future. Its role is
comparable to that of the Forget Gate and the Input Gate of an
LSTM recurrent unit taken together.
2. Reset Gate (r): It determines how much of the past knowledge
to forget when the current memory is computed.
3. Current Memory Gate: It is often overlooked in a typical
discussion of the Gated Recurrent Unit network. It is
incorporated into the Reset Gate, just as the Input Modulation
Gate is a sub-part of the Input Gate, and is used to introduce
some non-linearity into the input and to make the input
zero-mean. Another reason to make it a sub-part of the Reset
Gate is to reduce the effect that previous information has on
the current information that is being passed into the future.
When illustrated, the basic workflow of a Gated Recurrent Unit network
is similar to that of a basic Recurrent Neural Network. The main
difference between the two lies in the internal working of each
recurrent unit, since Gated Recurrent Unit networks consist of gates
which modulate the current input and the previous hidden state.
Note that, just like the workflow, the training process for a GRU
network is diagrammatically similar to that of a basic Recurrent Neural
Network and differs only in the internal working of each recurrent unit.
The Back-Propagation Through Time algorithm for a Gated Recurrent Unit
network is similar to that of a Long Short-Term Memory network and
differs only in the differential chain formation.
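The internal working of one recurrent unit can be sketched as below. This
is a scalar illustration under assumptions: the parameter names are
hypothetical, and the convention that an update gate value near 0 keeps
the old hidden state is only one of the two conventions in use.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU step for scalar input x and previous hidden state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Current memory: the reset gate scales how much of h_prev is used.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend the old hidden state with the current memory; no separate cell state.
    return (1.0 - z) * h_prev + z * h_cand


# Hypothetical parameter values, all set to 0.5 for the demonstration.
params = {k: 0.5 for k in ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}

h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, params)
print("hidden state:", h)
```

Only the hidden state is returned and passed on, in contrast to an LSTM
unit, which also passes on a cell state.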
Encoder/Decoder Architecture
The key benefits of the approach are the ability to train a single
end-to-end model directly on source and target sentences and the ability
to handle variable-length input and output sequences of text.
Two different research projects developed the same Encoder-Decoder
architecture at the same time in 2014 and achieved results that put the
spotlight on the approach.
Autoencoders
The encoder part of the network is used for encoding and sometimes
even for data compression purposes although it is not very effective
as compared to other general compression techniques like JPEG.
Encoding is achieved by the encoder part of the network which has
a decreasing number of hidden units in each layer. Thus this part is
forced to pick up only the most significant and representative features
of the data. The second half of the network performs the Decoding
function. This part has an increasing number of hidden units in
each layer and thus tries to reconstruct the original input from the
encoded data. Thus Auto-encoders are an unsupervised learning
technique.
Example: In an autoencoder, the training data is fitted to itself. That
is why, instead of fitting X_train to Y_train, X_train is used in both
places.
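A minimal sketch of this idea in plain Python is given below. It is not
the original example: it assumes a tiny linear autoencoder with one
hidden unit, trained by gradient descent on squared error, and all
function names, the data and the hyperparameters are hypothetical.

```python
import random


def init_weights(n_inputs, n_hidden, seed=0):
    rng = random.Random(seed)
    W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_inputs)]
    return W_enc, W_dec


def reconstruct(x, W_enc, W_dec):
    """Encode x into the hidden layer, then decode it back."""
    h = [sum(w * xi for w, xi in zip(row, x)) for row in W_enc]
    x_hat = [sum(w * hj for w, hj in zip(row, h)) for row in W_dec]
    return h, x_hat


def mean_error(X, W_enc, W_dec):
    total = 0.0
    for x in X:
        _, x_hat = reconstruct(x, W_enc, W_dec)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(X)


def fit(X_in, X_target, W_enc, W_dec, lr=0.1, epochs=500):
    """Gradient descent on the squared error between output and target."""
    n, hidden = len(W_dec), len(W_enc)
    for _ in range(epochs):
        for x, target in zip(X_in, X_target):
            h, x_hat = reconstruct(x, W_enc, W_dec)
            err = [a - b for a, b in zip(x_hat, target)]
            # Gradient with respect to the hidden layer, using the old decoder weights.
            grad_h = [sum(err[i] * W_dec[i][j] for i in range(n)) for j in range(hidden)]
            for i in range(n):
                for j in range(hidden):
                    W_dec[i][j] -= lr * err[i] * h[j]
            for j in range(hidden):
                for k in range(n):
                    W_enc[j][k] -= lr * grad_h[j] * x[k]


# Two-dimensional points lying on a line, so one hidden unit can encode them.
X_train = [[t, 0.5 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
W_enc, W_dec = init_weights(2, 1)
before = mean_error(X_train, W_enc, W_dec)
fit(X_train, X_train, W_enc, W_dec)  # X_train is both the input and the target
after = mean_error(X_train, W_enc, W_dec)
print("reconstruction error before:", before, "after:", after)
```

The call to fit receives X_train twice, once as the input and once as
the target, which is what makes the network an autoencoder.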
Training of an Auto-encoder for data compression: For a data
compression procedure, the most important aspect of the compression
is the reliability of the reconstruction of the compressed data. This
requirement dictates the structure of the Auto-encoder as a
bottleneck.
Step 1: Encoding the input data. The Auto-encoder first tries to
encode the data using the initialized weights and biases.
Sparse
Sparse expert models are a thirty-year-old concept re-emerging as a
popular architecture in deep learning. This class of architecture
encompasses Mixture-of-Experts, Switch Transformers, Routing
Networks, BASE layers, and others, all with the unifying idea that each
example is acted on by a subset of the parameters.
lot more like this. For example, returning to the original inspiration for
artificial neural networks (the brain), neurons (analogous to nodes) are
only connected to a handful of other neurons.
The above-described training process is reiterated several times until
an acceptable level of reconstruction is reached.
After the training process, only the encoder part of the Auto-encoder is
retained to encode a similar type of data used in the training process.
The different ways to constrain the network are:-
Keep small Hidden Layers: If the size of each hidden layer is
kept as small as possible, then the network will be forced to
pick up only the representative features of the data thus
encoding the data.
Regularization: In this method, a loss term is added to the
cost function which encourages the network to train in ways
other than copying the input.
Denoising: Another way of constraining the network is to add
noise to the input and teach the network how to remove the
noise from the data.
Tuning the Activation Functions: This method
involves changing the activation functions of various
nodes so that a majority of the nodes are dormant thus
effectively reducing the size of the hidden layers.
The different variations of Auto-encoders are:-
Denoising Auto-encoder: This type of auto-encoder works on
a partially corrupted input and trains to recover the original
undistorted image. As mentioned above, this method is an
effective way to constrain the network from simply copying the
input.
Sparse Auto-encoder: This type of auto-encoder typically
contains more hidden units than the input but only a few are
allowed to be active at once. This property is called the
sparsity of the network. The sparsity of the network can be
controlled by either manually zeroing the required hidden units,
tuning the activation functions or by adding a loss term to the
cost function.
Variational Auto-encoder: This type of auto-encoder makes
strong assumptions about the distribution of latent variables
and uses the Stochastic Gradient Variational
Bayes estimator in the training process. It assumes that the
data is generated by a Directed Graphical Model and tries to
learn an approximation $q_\phi(z|x)$ to the conditional
property $p_\theta(z|x)$, where $\phi$ and $\theta$ are the
parameters of the encoder and the decoder respectively.
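Two of the ideas listed above, corrupting the input for a denoising
auto-encoder and adding a sparsity loss term for a sparse auto-encoder,
can be sketched as follows. The Gaussian noise, the L1 form of the
penalty, and all names and values are assumptions made for illustration.

```python
import random


def corrupt(x, noise_std, rng):
    """Return a noisy copy of x; the clean x stays the training target."""
    return [xi + rng.gauss(0.0, noise_std) for xi in x]


def sparsity_penalty(hidden, weight):
    """L1 penalty on hidden activations; it is small when few units are active."""
    return weight * sum(abs(h) for h in hidden)


rng = random.Random(0)
clean = [0.2, 0.8, 0.5]
noisy = corrupt(clean, 0.1, rng)
# A denoising auto-encoder is trained on the pair (noisy input, clean target).
print("input:", noisy, "target:", clean)

few_active = [0.0, 0.9, 0.0, 0.0]
all_active = [0.9, 0.9, 0.9, 0.9]
# This term would be added to the reconstruction error in the cost function.
print("penalty, few units active:", sparsity_penalty(few_active, 0.1))
print("penalty, all units active:", sparsity_penalty(all_active, 0.1))
```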
Denoising
Denoising an image is a classical problem that researchers have been
trying to solve for decades. In earlier times, researchers used filters
to reduce the noise in images. These worked fairly well for images with
a reasonable level of noise. However, applying those filters would add a
blur to the image. And if the image is too noisy, the resultant image
would be so blurry that most of the critical details in the image are lost.
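The trade-off between noise reduction and blur can be seen with a simple
mean filter on a one-dimensional signal. This is only a sketch: the mean
filter stands in for the filters mentioned above, and the signal values
and function name are made up for the example.

```python
def mean_filter(signal, width=3):
    """Replace each sample by the average of its neighbourhood."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


# A sharp edge from 0 to 1 between index 4 and index 5, with small noise on top.
noisy = [0.1, -0.1, 0.1, -0.1, 0.1, 0.9, 1.1, 0.9, 1.1, 0.9]
smoothed = mean_filter(noisy, 3)
print("noisy:   ", noisy)
print("smoothed:", [round(v, 3) for v in smoothed])
```

In the flat regions the noise is reduced, but the sharp edge is spread
over neighbouring samples, which is the blur described above.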
Contractive
Contractive Autoencoder was proposed by researchers at the University
of Montreal in 2011 in the paper Contractive Auto-Encoders: Explicit
Invariance During Feature Extraction. The idea is to make the
autoencoder robust to small changes in the training data.
To deal with this challenge, which basic autoencoders face, the authors
proposed adding another penalty term to the loss function of the
autoencoder. We will discuss this loss function in detail.
The Loss function:
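A contractive autoencoder adds an extra term to the loss function of the
autoencoder. It is given as:
$$ L = \|x - \hat{x}\|^2 + \lambda \, \|J_h(x)\|_F^2, \qquad \|J_h(x)\|_F^2 = \sum_{i,j} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2 $$
where $h$ is the hidden representation, $\hat{x}$ is the reconstruction
of the input $x$, and $\lambda$ controls the strength of the penalty.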
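The extra term of a contractive autoencoder is commonly the squared
Frobenius norm of the Jacobian of the hidden layer with respect to the
input. The sketch below computes it under the assumption of a sigmoid
encoder; the function names, weights and input are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(W, b, x):
    """Sigmoid encoder: one hidden activation per row of W."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]


def contractive_penalty(W, b, x):
    """Squared Frobenius norm of the Jacobian of the hidden layer w.r.t. x.

    For a sigmoid encoder dh_j/dx_i = h_j * (1 - h_j) * W[j][i], so the sum of
    squared entries factorises per hidden unit.
    """
    h = encode(W, b, x)
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row) for hj, row in zip(h, W))


W = [[0.5, -0.3], [0.1, 0.8]]
b = [0.0, 0.1]
x = [0.2, -0.4]
print("contractive penalty:", contractive_penalty(W, b, x))
```

A small penalty means the hidden representation changes little when the
input is perturbed slightly.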
Variational AutoEncoders
The variational autoencoder was proposed in 2013 by Kingma and Welling
at the University of Amsterdam. A variational autoencoder (VAE)
provides a probabilistic manner for describing an observation in latent
space. Thus, rather than building an encoder that outputs a single
value to describe each latent state attribute, we'll formulate our
encoder to describe a probability distribution for each latent attribute.
It has many applications, such as data compression and synthetic data
creation.
Architecture:
Autoencoders are a type of neural network that learns data encodings
from a dataset in an unsupervised way. An autoencoder basically
contains two parts. The first is an encoder, which is similar to a
convolutional neural network except for the last layer. The aim of the
encoder is to learn an efficient encoding of the data and pass it into
a bottleneck layer. The other part is a decoder, which uses the latent
space in the bottleneck layer to regenerate images similar to those in
the dataset. The reconstruction error is backpropagated through the
network in the form of the loss function.
A variational autoencoder differs from an autoencoder in that it
provides a statistical manner for describing the samples of the dataset
in latent space. Therefore, in a variational autoencoder, the encoder
outputs a probability distribution in the bottleneck layer instead of a
single output value.
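The loss for a sample $x$ can be written as:
$$ \mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) $$
The first term represents the reconstruction likelihood and the other
term ensures that our learned distribution q is similar to the true prior
distribution p.
Thus our total loss consists of two terms, one is the reconstruction
error and the other is the KL-divergence loss:
$$ \text{Loss} = L(x, \hat{x}) + \sum_j D_{KL}\left(q_j(z|x) \,\|\, p(z)\right) $$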
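The two loss terms can be sketched in a few lines. The sketch assumes a
Gaussian encoder with diagonal covariance, a standard normal prior, and
squared error as the reconstruction term; these choices and all function
names are assumptions, not taken from the notes.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, exp(log_var)) and N(0, 1), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) for each latent attribute."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def vae_loss(x, x_hat, mu, log_var):
    """Total loss: reconstruction error plus KL-divergence loss."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
mu, log_var = [0.5, -0.2], [0.0, -1.0]  # the encoder outputs a distribution, not one value
z = sample_latent(mu, log_var, rng)
print("sampled latent vector:", z)
print("KL term:", kl_to_standard_normal(mu, log_var))
print("total loss:", vae_loss([1.0, 0.0], [0.9, 0.1], mu, log_var))
```

The KL term is zero only when the encoder outputs exactly the prior, that
is, zero mean and unit variance for every latent attribute.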
Implementation:
2. Training:
3. Applications:
Image synthesis
Text-to-Image synthesis
Image-to-Image translation
Anomaly detection
Data augmentation
4. Limitations:
Units within the layers are independent of each other but are
dependent on neighboring layers
Let's first discuss the similarities between DBN and DBM and then the
differences between DBN and DBM.
Short questions
1. What is LSTM?
3. Explain Sparse?
4. What is Denoising?
Unit-5
Object detection
Concept
Every object class has its own special features that help in classifying
the class. For example, all circles are round. Object class detection uses
these special features. For example, when looking for circles, objects
that are at a particular distance from a point (i.e. the center) are sought.
Similarly, when looking for squares, objects that are perpendicular at
corners and have equal side lengths are needed. A similar approach is
used for face identification.
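The circle feature described above can be turned into a small check. This
is only a sketch of the idea, not a practical detector: the function
name, the tolerance and the test points are assumptions.

```python
import math


def looks_like_circle(points, center, tolerance=0.05):
    """True if all points lie at nearly the same distance from the center."""
    distances = [math.hypot(px - center[0], py - center[1]) for px, py in points]
    mean_radius = sum(distances) / len(distances)
    return all(abs(d - mean_radius) <= tolerance * mean_radius for d in distances)


# Eight points on a circle of radius 2 around the origin.
circle = [(2 * math.cos(k * math.pi / 4), 2 * math.sin(k * math.pi / 4)) for k in range(8)]
# Corners and edge midpoints of a square centered on the origin.
square = [(1, 1), (1, -1), (-1, -1), (-1, 1), (1, 0), (0, 1), (-1, 0), (0, -1)]

print("circle:", looks_like_circle(circle, (0.0, 0.0)))
print("square:", looks_like_circle(square, (0.0, 0.0)))
```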
Automatic Image captioning
To make this generative and adversarial process simple, both of these
blocks are built from deep neural network based architectures, which can
be trained through forward and backward propagation techniques.
Since GANs were introduced, there has been tremendous advancement in
them. There are GAN architectures which are specifically made for
particular tasks.
Named entity recognition (NER) is one of the most popular data
preprocessing tasks. It involves the identification of key information
in the text and its classification into a set of predefined categories.
An entity is basically the thing that is consistently talked about or
referred to in the text.
NER is a form of NLP.
At its core, NER is just a two-step process. The two steps involved are:
Detecting the entities in the text
Classifying them into different categories
Some of the most important categories in NER are:
Person
Organization
Place/location
Other common tasks include classifying the following:
Date/time expression
Numeral measurement (money, percent, weight, etc.)
E-mail address
Ambiguity in NE
For a person, the category definition is intuitively quite clear,
but for computers, there is some ambiguity in classification.
Let's look at some ambiguous examples:
England (Organisation) won the 2019 world cup vs
The 2019 world cup happened in England (Location).
Washington (Location) is the capital of the US vs The
first president of the US was Washington (Person).
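The two steps, and the ambiguity in the examples above, can be sketched
with a simple dictionary lookup. This is a toy illustration, not a real
NER method: the gazetteer entries and function names are assumptions.

```python
# Hypothetical gazetteer: each known name maps to every category it can take.
GAZETTEER = {
    "England": ["Location", "Organisation"],
    "Washington": ["Location", "Person"],
    "US": ["Location"],
}


def detect_entities(text):
    """Step 1: find the tokens that are known entity names."""
    tokens = [token.strip(".,") for token in text.split()]
    return [token for token in tokens if token in GAZETTEER]


def classify(entity):
    """Step 2: return the candidate categories of a detected entity."""
    return GAZETTEER[entity]


for sentence in ("Washington is the capital of the US.",
                 "The first president of the US was Washington."):
    for entity in detect_entities(sentence):
        print(sentence, "->", entity, classify(entity))
```

A lookup alone returns both Location and Person for Washington in either
sentence, so the surrounding words are needed to choose the right
category.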
Methods of NER
7. What is Regularization?
Part-B
Answer any One Question from each unit 10*2=20
Unit-1
9. Explain the Perceptron algorithm with an example?
(or)
10. What is the difference between Feed Forward and Back Propagation Networks?
Unit-II
11. Explain ReLU heuristics for avoiding bad local minima?
(or)
2. Explain Convolution?
4. What is LSTM?
6. Explain Sparse?
7. What is Denoising?
8. Explain Contractive in Deep Learning?
Part-B
Answer any One Question from each unit 10*2=20
Unit-1
9. Explain CNN Architectures?
(or)
10. Explain Pooling Layers?
Unit-2
11. Explain briefly Variational Autoencoders?
(or)
Part-B
Answer FIVE Questions , choosing ONE question from each unit
each question carries 10 marks
5*10=50
Unit-1
2.
i) Explain the Perceptron algorithm with an example?
(or)
ii) What is the difference between Feed Forward and Back Propagation Networks?
Unit-II
3.
i) Explain ReLU heuristics for avoiding bad local minima?
(or)
Part-B
Answer any One Question from each unit 10*2=20
Unit-1
1.
i) Explain the Perceptron algorithm with an example?
(or)
ii) What is the difference between Feed Forward and Back Propagation Networks?
Unit-II
2.
i) Explain ReLU heuristics for avoiding bad local minima?
(or)
Unit-III
3.
Unit-IV
4.
i) Explain briefly Variational Autoencoders?
(or)
5.