Deeplearning Notes
their parallel processing only makes the brain’s abilities possible. Figure 1 represents a human
biological nervous unit. The various parts of a biological neural network (BNN) are marked in Figure 1.
Information flow in a neural cell
The input/output and the propagation of information are shown below.
1.3. Artificial neuron model
An artificial neuron is a mathematical function conceived as a simple model of a real (biological)
neuron.
The McCulloch-Pitts Neuron
This is a simplified model of real neurons, known as a Threshold Logic Unit.
A set of input connections brings in activations from other neurons.
A processing unit sums the inputs, and then applies a non-linear activation function (i.e.
squashing/transfer/threshold function).
An output line transmits the result to other neurons.
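The three parts listed above (input connections, a summing unit followed by a threshold, and an output line) can be sketched in a few lines. This is a minimal illustration, not the notes' own code: the function name `threshold_unit` and the AND-gate weights and threshold are assumptions chosen for the example.

```python
def threshold_unit(inputs, weights, threshold):
    """Sum the weighted inputs and fire (1) only if the sum reaches the threshold."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= threshold else 0

# Hypothetical weights and threshold that make the unit act as a 2-input AND gate.
and_weights = [1, 1]
and_threshold = 2

and_outputs = [threshold_unit([a, b], and_weights, and_threshold)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)  # inputs are visited in the order (0,0), (0,1), (1,0), (1,1)
```

Changing only the threshold to 1 would turn the same unit into an OR gate, which shows how the threshold controls the decision.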
1.3.1 Basic Elements of ANN:
A neuron consists of three basic components: weights, thresholds and a single activation
function. An artificial neural network (ANN) model based on biological neural systems is shown
in Figure 2.
The different training/learning procedures available in ANN are:
Supervised learning
Unsupervised learning
Reinforced learning
Hebbian learning
Gradient descent learning
Competitive learning
Stochastic learning
1.4.1. Requirements of Learning Laws:
• Learning Law should lead to convergence of weights
• Learning or training time should be short for capturing the information from the
training pairs
• Learning should use the local information
• The learning process should be able to capture the complex non-linear mapping available
between the input and output pairs
• Learning should be able to capture as many patterns as possible
• Storage of the pattern information gathered at the time of learning should be high for
the given network
Every input pattern that is used to train the network is associated with an output pattern which is
the target or the desired pattern.
A teacher is assumed to be present during the training process, when a comparison is made
between the network’s computed output and the correct expected output, to determine the error. The
error can then be used to change network parameters, which results in an improvement in
performance.
1.4.1.2 Unsupervised learning:
In this learning method the target output is not presented to the network. It is as if there is no
teacher to present the desired patterns, and hence the system learns on its own by discovering and
adapting to structural features in the input patterns.
1.4.1.3 Reinforced learning:
In this method a teacher, though available, does not present the expected answer but only
indicates whether the computed output is correct or incorrect. The information provided helps the network in
the learning process.
1.4.1.4 Hebbian learning:
This rule was proposed by Hebb and is based on correlative weight adjustment. This is the oldest
learning mechanism inspired by biology. In this, the input-output pattern pairs (𝑥𝑖, 𝑦𝑖) are
associated by the weight matrix W, known as the correlation matrix, which is computed as
$$W = \sum_{i} x_i \, y_i^{T} \qquad \text{eq(1)}$$
Here 𝑦𝑖𝑇 is the transpose of the associated output vector 𝑦𝑖. Numerous variants of the rule have
been proposed.
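As a hedged illustration of correlative weight adjustment, the sketch below accumulates the outer product of each input vector with its associated output vector, assuming the usual correlation-matrix form W = Σ xᵢ yᵢᵀ. The function name `hebbian_weights` and the two bipolar training pairs are made up for the example.

```python
def hebbian_weights(pairs):
    """Accumulate W[i][j] += x[i] * y[j] over all (x, y) training pairs."""
    n_in = len(pairs[0][0])
    n_out = len(pairs[0][1])
    W = [[0 for _ in range(n_out)] for _ in range(n_in)]
    for x, y in pairs:
        for i in range(n_in):
            for j in range(n_out):
                W[i][j] += x[i] * y[j]
    return W

# Hypothetical bipolar pattern pairs (input vector, output vector).
pairs = [([1, -1], [1]), ([-1, 1], [-1])]
W = hebbian_weights(pairs)
print(W)
```

Inputs that are consistently active together with an output end up with a large positive weight, which is the correlation idea behind the rule.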
1.4.1.5 Gradient descent learning:
This is based on the minimization of the error E defined in terms of the weights and the activation function
of the network. It is also required that the activation function employed by the network is
differentiable, as the weight update depends on the gradient of the error E.
Thus if ∆𝑤𝑖𝑗 is the weight update of the link connecting the 𝑖th and 𝑗th neurons of two
neighbouring layers, then ∆𝑤𝑖𝑗 is defined as
$$\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}} \qquad \text{eq(2)}$$
where η is the learning rate parameter and ∂E/∂𝑤𝑖𝑗 is the error gradient with reference to the
weight 𝑤𝑖𝑗. The negative sign moves each weight against the gradient, so that E decreases.
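A minimal sketch of this update on a single weight may help. It assumes a made-up error function E(w) = (w − 3)², whose gradient is 2(w − 3), and repeatedly steps the weight against the gradient; the names `gradient_descent` and `error_gradient` are illustrative only.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeatedly apply w <- w - eta * dE/dw, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def error_gradient(w):
    # Gradient of the assumed error E(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w_final = gradient_descent(error_gradient, w0=0.0, eta=0.1, steps=100)
print(w_final)  # approaches the minimizer w = 3
```

With a learning rate that is too large (here, above 1.0) the same loop would overshoot and diverge, which is why η is kept small.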
A perceptron network is capable of performing pattern classification into two or more
categories. The perceptron is trained using the perceptron learning rule. We will first consider
classification into two categories and then the general multiclass classification later. For
classification into only two categories, all we need is a single output neuron. Here we will use
bipolar neurons. The simplest architecture that could do the job consists of a layer of N input
neurons, an output layer with a single output neuron, and no hidden layers. This is the same
architecture as we saw before for Hebb learning. However, we will use a different transfer function
here for the output neuron, as given below in eq (7). Figure 7 represents a single layer perceptron network.
eq (7)
Equation 7 gives the bipolar activation function, which is the most common function used in
perceptron networks. The inputs arising from the problem space are collected by the sensors and
fed to the association units. Association units are responsible for associating the inputs based on
their similarities: each unit groups similar inputs, hence the name association unit. A single input
from each group is given to the summing unit. Weights are randomly fixed initially and assigned to
these inputs. The net value is calculated using the expression
x = Σ wiai – θ eq(8)
This value is given to the activation function unit to get the final output response. The actual
output is compared with the target or desired output. If they are the same, training can stop;
otherwise there is an error and the weights have to be updated. The error is given as δ = b – s,
where b is the desired/target output and s is the actual output of the machine. The weights are
updated based on the perceptron learning law as given in equation 9.
The weight change is given as Δw = η δ ai, so the new weight is
Wi (new) = Wi (old) + change in weight vector (Δw) eq(9)
1.5.2. Perceptron Algorithm
• Step 1: Initialize weights and bias. For simplicity, set weights and bias to zero. Set the learning
rate α in the range of zero to one.
• Step 2: While the stopping condition is false, do steps 3-7
• Step 3: For each training pair s:t, do steps 4-7
• Step 4: Set activations of input units xi = ai
• Step 5: Calculate the summing part value Net = Σ aiwi - θ
• Step 6: Compute the response y of the output unit based on the activation function
• Step 7: Update weights and bias if an error occurred for this pattern (if y is not equal to t)
wi(new) = wi(old) + α t xi, and b(new) = b(old) + α t
Else wi(new) = wi(old) and b(new) = b(old)
• Step 8: Test the stopping condition
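The steps above can be sketched as a short training loop. This is an illustration under stated assumptions rather than a definitive implementation: it uses a simple bipolar step (output +1 when the net input is non-negative), folds the threshold into the bias, stops when a full pass makes no update, and trains on a bipolar AND gate chosen as example data. The names `bipolar_step` and `train_perceptron` are hypothetical.

```python
def bipolar_step(net):
    """Bipolar step activation: +1 for a non-negative net input, otherwise -1."""
    return 1 if net >= 0 else -1

def train_perceptron(samples, alpha=1.0, max_epochs=100):
    """Train until one full pass over the samples produces no weight update."""
    weights = [0.0] * len(samples[0][0])   # Step 1: zero weights and bias
    bias = 0.0
    for _ in range(max_epochs):            # Step 2: loop until stopping condition
        changed = False
        for x, t in samples:               # Step 3: each training pair
            net = bias + sum(xi * wi for xi, wi in zip(x, weights))  # Step 5
            y = bipolar_step(net)          # Step 6
            if y != t:                     # Step 7: update only on error
                weights = [wi + alpha * t * xi for wi, xi in zip(weights, x)]
                bias += alpha * t
                changed = True
        if not changed:                    # Step 8: stop when nothing changed
            break
    return weights, bias

# Bipolar AND gate: output is +1 only when both inputs are +1.
and_samples = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
weights, bias = train_perceptron(and_samples)
predictions = [bipolar_step(bias + sum(xi * wi for xi, wi in zip(x, weights)))
               for x, _ in and_samples]
print(weights, bias, predictions)
```

Because AND is linearly separable, the loop terminates after a few passes; on XOR targets the same loop would keep updating until `max_epochs` is reached.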
1.5.3. Limitations of single layer perceptrons:
• Uses only a binary activation function
• Can be used only for linear networks
• Since it uses supervised learning, an optimal solution is provided
• Training time is long
• Cannot solve linearly inseparable problems
Figure 5: Multi-Layer Perceptron
1. Initialize the weights (Wi) and bias (B0) to small random values near zero
2. Set the learning rate η or α in the range of 0 to 1
3. Check for the stop condition. If the stop condition is false, do steps 4 to 8
4. For each training pair, do steps 5 to 8
5. Set activations of input units: xi = si for i = 1 to N
6. Calculate the output response
yin = b0 + Σ xiwi
7. Apply the activation function; the activation function used is the bipolar sigmoidal or bipolar step function
For multi-layer networks, steps 6 and 7 are repeated based on the number of layers
8. If the target is not equal to the actual output (Y), then update weights and bias
based on the perceptron learning law
Wi (new) = Wi (old) + Change in weight vector
Change in weight vector = ηtixi
Where η = Learning rate
ti = Target output of the ith unit
xi = ith input
b0(new) = b0 (old) + Change in bias
Change in bias = ηti
Else Wi (new) = Wi (old)
b0(new) = b0 (old)
9. Test for the stop condition
1.6. Linearly separable & linearly inseparable tasks:
Perceptrons are successful only on problems with a linearly separable solution space. Figure 9
represents both linearly separable and linearly inseparable problems. In particular, the perceptron
cannot handle tasks which are not linearly separable (known as linearly inseparable problems). Sets of
points in two-dimensional space are linearly separable if the sets can be separated by a straight
line. Generalizing, sets of points in n-dimensional space are called linearly separable if they can
be separated by a hyperplane, as represented in Figure 9.
A single layer perceptron can be used for linear separation, for example the AND gate. But it cannot be
used for non-linear, inseparable problems, for example the XOR gate. Consider Figure 10.
Convex regions can be created by multiple decision lines arising from multi-layer
networks. A single layer network cannot be used to solve an inseparable problem. Hence we go for
a multilayer network, thereby creating convex regions which solve the inseparable problem.
1.6.1 Convex Region:
Select any two points in a region and draw a straight line between these two points. If, for every
such pair, the selected points and the line joining them lie inside the region, then that region is
known as a convex region.
1.6.2. Types of convex regions
(a) Open Convex region (b) Closed Convex region
Figure 9 A: Circle - Closed convex region Figure 9 B: Triangle - Closed convex region
1.7. Logistic Regression
Logistic regression is a probabilistic model that classifies instances in terms of
probabilities. Because the classification is probabilistic, a natural method for optimizing the
parameters is to ensure that the predicted probability of the observed class for each training
instance is as large as possible. This goal is achieved by using the notion of maximum-likelihood
estimation in order to learn the parameters of the model. The likelihood of the training data is
defined as the product of the probabilities of the observed labels of each training instance. Clearly,
larger values of this objective function are better. By using the negative logarithm of this value, one
obtains a loss function in minimization form. Therefore, the output node uses the negative
log-likelihood as a loss function. This loss function replaces the squared error used in the Widrow-Hoff
method. The output layer can be formulated with the sigmoid activation function, which is very
common in neural network design.
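To make the negative log-likelihood concrete, the following sketch fits a one-feature logistic regression by plain gradient descent and compares the loss before and after training. The data set, the learning rate, the step count and the function names are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def negative_log_likelihood(w, b, xs, ys):
    """Sum of -log(predicted probability of the observed label)."""
    loss = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        loss -= math.log(p) if y == 1 else math.log(1.0 - p)
    return loss

def fit(xs, ys, eta=0.1, steps=200):
    """Gradient descent on the negative log-likelihood."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # For the sigmoid output, dLoss/dz is (p - y) for each instance.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= eta * sum(e * x for e, x in zip(errors, xs))
        b -= eta * sum(errors)
    return w, b

# Hypothetical one-feature data set with binary labels.
xs = [0.5, 1.5, 3.0, 4.0]
ys = [0, 0, 1, 1]

initial_loss = negative_log_likelihood(0.0, 0.0, xs, ys)
w, b = fit(xs, ys)
final_loss = negative_log_likelihood(w, b, xs, ys)
print(initial_loss, final_loss)
```

With zero parameters every instance gets probability 0.5, so the initial loss is 4·ln 2; training lowers it by pushing the predicted probabilities toward the observed labels.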
Logistic regression is another supervised learning algorithm which is
used to solve the classification problems. In classification problems, we
have dependent variables in a binary or discrete format such as 0 or 1.
1.8. Support Vector Machines
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning. The goal of the SVM
algorithm is to create the best line or decision boundary that can segregate n-dimensional space
into classes so that we can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors
that help in creating the hyperplane. These extreme cases are called support vectors, and
hence the algorithm is termed Support Vector Machine. Consider the below diagram in which
there are two different categories that are classified using a decision boundary or hyperplane:
Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a
dataset cannot be classified by using a straight line, then such data is termed non-linear data
and the classifier used is called a non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of
the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
1.8.2. Linear SVM:
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair (x1, x2) of
coordinates as either green or blue. Consider the below image, Figure 11. Since it is a 2-d space,
we can easily separate these two classes by just using a straight line. But there can
be multiple lines that can separate these classes. Consider the below image:
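Since several lines can separate the two classes, SVM prefers the one whose closest points (the support vectors) are farthest away. The sketch below does not train an SVM; it only compares two hand-picked candidate lines by their margin, under the assumption that the classes are labelled −1 and +1. The points, the candidate lines and the function name `margin` are invented for the illustration.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    The value is negative if at least one point is on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * x1 + w[1] * x2 + b) / norm
               for (x1, x2), label in zip(points, labels))

# Hypothetical two-class data: label -1 for one tag, +1 for the other.
points = [(1, 1), (2, 1), (4, 4), (5, 4)]
labels = [-1, -1, 1, 1]

# Two candidate separating lines given as (w, b): x1 = 3 and x1 + x2 = 5.5.
candidates = {"line A": ((1.0, 0.0), -3.0), "line B": ((1.0, 1.0), -5.5)}

margins = {name: margin(w, b, points, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both lines classify every point correctly, but the second leaves more room on each side, which is the property the maximum-margin hyperplane optimizes.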
1.9.1. Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
1.9.2. Stochastic Gradient Descent (SGD):
The word ‘stochastic’ means a system or a process that is linked with a random probability.
Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration. In Gradient Descent, there is a term called “batch” which denotes the
total number of samples from a dataset that is used for calculating the gradient for each iteration. In
typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the
whole dataset. Using the whole dataset is really useful for getting to the minimum in a less
noisy and less random manner, but a problem arises when the dataset gets big.
Suppose you have a million samples in your dataset. If you use a typical Gradient Descent
optimization technique, you will have to use all of the one million samples to complete one
iteration of Gradient Descent, and this has to be done for every iteration until the
minimum is reached. Hence, it becomes computationally very expensive to perform.
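The difference between the two variants is only in how many samples feed each gradient estimate. The sketch below is a toy comparison under stated assumptions: a one-parameter model y = w·x, noise-free synthetic data generated with w = 2, a squared-error loss, and a `batch_size` argument that selects either the whole data set (batch gradient descent) or one random sample per iteration (stochastic gradient descent). All names and constants are illustrative.

```python
import random

def mean_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over the given batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, eta=0.1, iterations=500, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = 0.0
    for _ in range(iterations):
        # Whole data set for batch GD, a random subset otherwise.
        batch = data if batch_size >= len(data) else rng.sample(data, batch_size)
        w -= eta * mean_gradient(w, batch)
    return w

# Synthetic data on the line y = 2x, with x between 0.01 and 1.00.
data = [(i / 100.0, 2.0 * i / 100.0) for i in range(1, 101)]

w_batch = train(data, batch_size=len(data))  # touches 100 samples per iteration
w_sgd = train(data, batch_size=1)            # touches 1 sample per iteration
print(w_batch, w_sgd)
```

Both runs end near w = 2, but the stochastic run evaluates one sample per iteration instead of one hundred, which is the saving that matters when the data set has a million samples.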
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks”, Prentice Hall Publications.
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks: Architectures, Algorithms and
Applications”, Prentice Hall Publications.
4. Cosma Rohilla Shalizi, “Advanced Data Analysis from an Elementary Point of View”, 2015.
5. Deng & Yu, “Deep Learning: Methods and Applications”, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.
7. Michael Nielsen, “Neural Networks and Deep Learning”, Determination Press, 2015.
UNIT II DEEP LEARNING NEURAL NETWORK
History of Deep Learning - A Probabilistic Theory of Deep Learning - Backpropagation and
regularization, batch normalization - VC Dimension and Neural Nets - Deep Vs Shallow Networks -
Convolutional Networks - Generative Adversarial Networks (GAN), Semi-supervised Learning
The chain rule that underlies the back-propagation algorithm was invented in
the seventeenth century (Leibniz, 1676; L’Hôpital, 1696)
Beginning in the 1940s, the function approximation techniques were used to motivate
machine learning models such as the perceptron
The earliest models were based on linear models. Critics including Marvin Minsky
pointed out several of the flaws of the linear model family, such as its inability to learn
the XOR function, which led to a backlash against the entire neural network approach
Efficient applications of the chain rule based on dynamic programming began to appear
in the 1960s and 1970s
Werbos (1981) proposed applying chain rule techniques for training artificial neural
networks. The idea was finally developed in practice after being independently
rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a)
Following the success of back-propagation, neural network research gained popularity
and reached a peak in the early 1990s. Afterwards, other machine learning techniques
became more popular until the modern deep learning renaissance that began in 2006
The core ideas behind modern feedforward networks have not changed substantially
since the 1980s. The same back-propagation algorithm and the same approaches to
gradient descent are still in use.
Most of the improvement in neural network performance from 1986 to 2015 can be
attributed to two factors. First, larger datasets have reduced the degree to which statistical
generalization is a challenge for neural networks. Second, neural networks have become
much larger, because of more powerful computers and better software infrastructure. A small
number of algorithmic changes have also improved the performance of neural networks
noticeably. One of these algorithmic changes was the replacement of mean squared error
with the cross-entropy family of loss functions. Mean squared error was popular in the
1980s and 1990s but was gradually replaced by cross-entropy losses and the principle of
maximum likelihood as ideas spread between the statistics community and the machine
learning community.
The other major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units. Rectification using the max{0, z} function was
introduced in early neural network models and dates back at least as far as the Cognitron
and Neo-Cognitron (Fukushima, 1975, 1980).
For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities
is even more important than learning the weights of the hidden layers. Random weights are
sufficient to propagate useful information through a rectified linear network, enabling the
classifier layer at the top to learn how to map different feature vectors to class identities.
When more data is available, learning begins to extract enough useful knowledge to exceed
the performance of randomly chosen parameters. Glorot et al. (2011a) showed that learning
is far easier in deep rectified linear networks than in deep networks that have curvature or
two-sided saturation in their activation functions.
When the modern resurgence of deep learning began in 2006, feedforward networks
continued to have a bad reputation. From about 2006 to 2012, it was widely believed that
feedforward networks would not perform well unless they were assisted by other models,
such as probabilistic models. It is now known that with the right resources and
engineering practices, feedforward networks perform very well. Today, gradient-based
learning in feedforward networks is used as a tool to develop probabilistic models.
Feedforward networks continue to have unfulfilled potential. In the future, we expect they
will be applied to many more tasks, and that advances in optimization algorithms and model
design will improve their performance even further.
2.2 Back Propagation Networks (BPN)
2.2.1. Need for Multilayer Networks
Single layer networks cannot be used to solve linearly inseparable problems and
can only be used to solve linearly separable problems
Single layer networks cannot solve complex problems
Single layer networks cannot be used when a large input-output data set is
available
Single layer networks cannot capture the complex information available in
the training pairs
Hence, to overcome the above limitations, we use multi-layer networks.
2.2.2. Multi-Layer Networks
Any neural network which has at least one layer in between the input and
output layers is called a multi-layer network
Layers present in between the input and output layers are called hidden layers
Input layer neural units just collect the inputs and forward them to the next
higher layer
Hidden layer and output layer neural units process the information fed to
them and produce an appropriate output
Multi-layer networks provide an optimal solution for arbitrary classification
problems
Multi-layer networks use linear discriminants, where the inputs are
non-linear
2.2.3. Back Propagation Networks (BPN)
Introduced by Rumelhart, Hinton, & Williams in 1986. BPN is a multi-layer
feedforward network in which the error is back propagated, hence the name Back
Propagation Network (BPN). It uses a supervised training process; it has a
systematic procedure for training the network and is used in error detection and
correction. The Generalized Delta Law / Continuous Perceptron Law / Gradient
Descent Law is used in this network. The generalized delta rule minimizes the mean
squared error between the target output and the output calculated by the network.
The delta law has a faster convergence rate when compared with the perceptron law.
It is the extended version of the perceptron training law. A limitation of this law
is the local minima problem. Due to this the convergence speed reduces, but it is
still better than the perceptron’s.
Figure 1 represents a BPN network architecture. Even though multi-level
perceptrons can be used, they are less flexible and efficient than BPN. In Figure 1 the
weights between the input and the hidden portion are denoted Wij and the weights
between the first hidden layer and the next layer are denoted Vjk. This network is valid
only for differentiable output functions. The training process used in
backpropagation involves three stages, which are listed below:
1. Feedforward of the input training pair
2. Calculation and backpropagation of the associated error
3. Adjustment of weights
Yk = f(yink)
III. Backpropagation of Errors
Step 7: Error term of output unit k: δk = (tk – Yk) f′(yink)
Step 8: Error propagated back to hidden unit j: δinj = Σk δkVjk, and δj = δinj f′(zinj),
where zinj is the net input of hidden unit j and Zj = f(zinj) is its output
IV. Updating of Weights & Biases
Step 9: Weight correction is ΔVjk = αδkZj (hidden to output) and Δwij = αδjxi (input to hidden)
Bias correction is ΔVok = αδk and Δwoj = αδj
New weight is
Wij(new) = Wij(old) + Δwij
Vjk(new) = Vjk(old) + ΔVjk
New bias is
Woj(new) = Woj(old) + Δwoj
Vok(new) = Vok(old) + ΔVok
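The three stages (feedforward, backpropagation of the error, weight update) can be sketched for a tiny network. The block below is a minimal illustration, assuming a binary sigmoid activation, one hidden layer, W and W0 for the input-to-hidden weights and biases, and V and V0 for the hidden-to-output weights and biases. The network size, the starting weights, the single training pattern and the function name `train_step` are all made up for the example.

```python
import math

def f(x):
    """Binary sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def f_prime(x):
    s = f(x)
    return s * (1.0 - s)

def train_step(x, t, W, W0, V, V0, alpha):
    """One feedforward / backpropagation / update cycle; returns the squared error."""
    n_in, n_hid, n_out = len(x), len(W0), len(V0)
    # Feedforward
    z_in = [W0[j] + sum(x[i] * W[i][j] for i in range(n_in)) for j in range(n_hid)]
    z = [f(v) for v in z_in]
    y_in = [V0[k] + sum(z[j] * V[j][k] for j in range(n_hid)) for k in range(n_out)]
    y = [f(v) for v in y_in]
    # Backpropagation of errors (computed before any weight is changed)
    delta_k = [(t[k] - y[k]) * f_prime(y_in[k]) for k in range(n_out)]
    delta_in = [sum(delta_k[k] * V[j][k] for k in range(n_out)) for j in range(n_hid)]
    delta_j = [delta_in[j] * f_prime(z_in[j]) for j in range(n_hid)]
    # Updating of weights and biases
    for j in range(n_hid):
        for k in range(n_out):
            V[j][k] += alpha * delta_k[k] * z[j]
        W0[j] += alpha * delta_j[j]
        for i in range(n_in):
            W[i][j] += alpha * delta_j[j] * x[i]
    for k in range(n_out):
        V0[k] += alpha * delta_k[k]
    return sum((t[k] - y[k]) ** 2 for k in range(n_out))

# Hypothetical 2-2-1 network with small fixed starting weights.
W = [[0.1, -0.2], [0.3, 0.2]]
W0 = [0.0, 0.0]
V = [[0.2], [-0.1]]
V0 = [0.0]

errors = [train_step([1.0, 0.0], [1.0], W, W0, V, V0, alpha=0.5) for _ in range(50)]
print(errors[0], errors[-1])
```

Repeating the step on the same pattern drives the squared error down, which is the behaviour the delta rule is designed to produce.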
2.2.5 Merits
• Has a smoothing effect on weight correction
• Computing time is less if the weights are small
• 100 times faster than the perceptron model
• Has a systematic weight updating procedure
2.2.6. Demerits
• Learning phase requires intensive calculations
• Selection of number of Hidden layer neurons is an issue
• Selection of number of Hidden layers is also an issue
• Network gets trapped in Local Minima
• Temporal Instability
• Network Paralysis
• Training time is more for Complex problems
2.3 Regularization
A fundamental problem in machine learning is how to make an algorithm that
will perform well not just on the training data, but also on new inputs. Many
strategies used in machine learning are explicitly designed to reduce the test error,
possibly at the expense of increased training error. These strategies are known
collectively as regularization.
Definition: - “any modification we make to a learning algorithm that is intended to
reduce its generalization error but not its training error.”
In the context of deep learning, most regularization strategies are based on
regularizing estimators.
Regularization of an estimator works by trading increased bias for reduced
variance.
An effective regularizer is one that makes a profitable trade, reducing variance
significantly while not overly increasing the bias.
Many regularization approaches are based on limiting the capacity of models, such as
neural networks, linear regression, or logistic regression, by adding a parameter norm
penalty Ω(θ) to the objective function J. We denote the regularized objective function
by J̃:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(\theta)$$
We can see that the addition of the weight decay term has modified the learning rule to
multiplicatively shrink the weight vector by a constant factor on each step, just before
performing the usual gradient update. This describes what happens in a single step.
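The multiplicative shrinking can be checked numerically. The sketch below assumes the L2 penalty Ω(w) = ½‖w‖², whose gradient is w, and a learning rate written as `epsilon`; it compares a step that follows the gradient of the regularized objective with a step that first shrinks the weights by (1 − εα) and then applies the usual gradient update. The weight and gradient values are arbitrary example numbers.

```python
def step_with_penalty_gradient(w, grad_J, epsilon, alpha):
    """w <- w - epsilon * (alpha * w + grad J): gradient step on J + alpha * 0.5 * ||w||^2."""
    return [wi - epsilon * (alpha * wi + gi) for wi, gi in zip(w, grad_J)]

def step_shrink_then_update(w, grad_J, epsilon, alpha):
    """w <- (1 - epsilon * alpha) * w - epsilon * grad J: shrink first, then update."""
    return [(1.0 - epsilon * alpha) * wi - epsilon * gi for wi, gi in zip(w, grad_J)]

# Arbitrary example values for the weights and the unregularized gradient.
w = [0.8, -1.5, 2.0]
grad_J = [0.1, -0.3, 0.05]

a = step_with_penalty_gradient(w, grad_J, epsilon=0.1, alpha=0.5)
b = step_shrink_then_update(w, grad_J, epsilon=0.1, alpha=0.5)
print(a, b)
```

The two results agree up to floating-point rounding, so weight decay can be read as scaling every weight by the constant factor (1 − εα) before each ordinary update.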
The approximation Ĵ is given by
The minimum of Ĵ occurs where its gradient ∇wĴ(w) = H(w − w∗) is equal to 0.
To study the effect of weight decay,
L1 regularization on the model parameter w is defined as the sum of absolute values of
the individual parameters:
$$\Omega(\theta) = \lVert w \rVert_1 = \sum_i |w_i|$$
L1 weight decay controls the strength of the regularization by scaling the penalty Ω using a
positive hyperparameter α. Thus, the regularized objective function J̃(w; X, y) is given by
$$\tilde{J}(w; X, y) = \alpha \lVert w \rVert_1 + J(w; X, y)$$
By inspecting equation 1, we can see immediately that the effect of L1 regularization is quite
different from that of L2 regularization. Specifically, we can see that the regularization
contribution to the gradient no longer scales linearly with each wi; instead it is a constant factor
with a sign equal to sign(wi).
L1 regularization adds the sum of the absolute values of the weights as the penalty term in the
cost function, whereas L2 regularization adds the sum of the squared values of the weights.
L1 regularization can be helpful in feature selection by eliminating unimportant
features, whereas L2 regularization is not recommended for feature selection.
L1 doesn’t have a closed-form solution, since the absolute value is not differentiable at zero,
while L2 has a closed-form solution as it is a square of a weight.
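The contrast between the two penalties shows up directly in their gradients. The sketch below assumes the L1 penalty α·Σ|wᵢ| and the L2 penalty (α/2)·Σwᵢ², so that the penalty gradients are α·sign(wᵢ) and α·wᵢ respectively; the example weights and function names are made up.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def l1_penalty_gradient(w, alpha):
    """Gradient of alpha * sum(|w_i|): a constant-size push toward zero."""
    return [alpha * sign(wi) for wi in w]

def l2_penalty_gradient(w, alpha):
    """Gradient of (alpha / 2) * sum(w_i^2): a push proportional to each weight."""
    return [alpha * wi for wi in w]

# Example weights: one large, one moderate, one already close to zero.
w = [3.0, -0.5, 0.01]
l1_grad = l1_penalty_gradient(w, alpha=0.1)
l2_grad = l2_penalty_gradient(w, alpha=0.1)
print(l1_grad, l2_grad)
```

For the weight 0.01 the L1 push is as strong as for the weight 3.0, while the L2 push has almost vanished. That constant pressure is why L1 tends to drive small weights exactly to zero and can act as a feature selector.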
Image Source: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
Even though the input X was normalized, the output is no longer on the same scale. The
data passes through multiple layers of the network, with (sigmoidal) activation functions
applied multiple times, which leads to an internal covariate shift in the data.
This motivates us to move towards Batch Normalization.
Normalization is the process of altering the input data to have a mean of zero and a standard
deviation of one.
2.4.1 Procedure to do Batch Normalization:
(1) Consider the batch input from layer h. For this layer we first calculate the mean of the hidden
activations.
(2) After calculating the mean, the next step is to calculate the standard deviation of the hidden
activations.
(3) Now we normalize the hidden activations using these mean and standard deviation values. To do
this, we subtract the mean from each input and divide the result by the sum of the standard
deviation and the smoothing term (ε).
(4) As the final stage, the re-scaling and offsetting of the input is performed. Here two components
of the BN algorithm are used, γ (gamma) and β (beta). These parameters are used for re-scaling (γ)
and shifting (β) the vector containing the values from the previous operations.
These two parameters are learnable. Hence, during the training of the neural
network, the optimal values of γ and β are obtained and used, and we get an accurate
normalization of each batch.
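The four steps can be sketched for a single hidden unit observed over one mini-batch. The block follows the description above and divides by the standard deviation plus ε; many formulations instead divide by the square root of the variance plus ε, which gives nearly the same result. The function name `batch_norm`, the default γ = 1 and β = 0, and the example activations are assumptions for the illustration. In a real network γ and β would be learned rather than fixed.

```python
import math

def batch_norm(activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's activations over a mini-batch, then re-scale and shift."""
    m = len(activations)
    mean = sum(activations) / m                                      # step (1)
    std = math.sqrt(sum((a - mean) ** 2 for a in activations) / m)   # step (2)
    normalized = [(a - mean) / (std + eps) for a in activations]     # step (3)
    return [gamma * n + beta for n in normalized]                    # step (4)

# Hypothetical activations of one hidden unit for a batch of four inputs.
batch = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(batch)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((o - out_mean) ** 2 for o in out) / len(out))
print(out, out_mean, out_std)
```

With γ = 1 and β = 0 the output has a mean of zero and a standard deviation very close to one; other values of γ and β let the network recover whatever scale and offset suit the next layer.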
2.5. Shallow Networks
A shallow neural network consists of only 1 or 2 hidden layers, and it gives us the basic
idea of a deep neural network. Understanding a shallow neural network gives us an
insight into what exactly is going on inside a deep neural network. A neural network
is built using various hidden layers. Now that we know the computations that occur in a
particular layer, let us understand how the whole neural network computes the output for a
given input X. These can also be called the forward-propagation equations.
2.5.1 Difference Between a Shallow Net & Deep Learning Net:
No.  Shallow Net                               Deep Learning Net
1    One hidden layer (or very few hidden      Deep nets have many hidden layers, with
     layers)                                   more neurons in each layer
2    Takes input only as vectors               DL can take raw data such as images and
                                               text as inputs
3    Shallow nets need more parameters to      DL can fit functions better with fewer
     achieve a good fit                        parameters than a shallow network
UNIT III CONVOLUTIONAL NEURAL NETWORKS
learning - Auto encoders and dimensionality reduction in networks - Introduction to
Convnet - Architectures – AlexNet, VGG, Inception, ResNet
- Training a Convnet: weights initialization, batch normalization, hyperparameter optimization.
Convolutional Neural Network (CNN) is an advanced version of artificial neural networks (ANNs), primarily designed
to extract features from grid-like matrix datasets. This is particularly useful for visual datasets such as images or videos,
where data patterns play a crucial role. CNNs are widely used in computer vision applications due to their effectiveness in
processing visual data.
CNNs consist of multiple layers like the input layer, Convolutional layer, pooling layer, and fully connected layers. Let's
learn more about CNNs in detail.
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel on it, with say, K
outputs and representing them vertically.
Now slide that neural network across the whole image; as a result, we get another image with a different width, height, and
depth. Instead of just R, G, and B channels, we now have more channels but a smaller width and height. This operation is
called Convolution. If the patch size is the same as that of the image, it will be a regular neural network. Because of this small
patch, we have fewer weights.
Image source: Deep Learning Udacity
Mathematical Overview of Convolution
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the same depth as that of
input volume (3 if the input layer is image input).
For example, suppose we have to run convolution on an image with dimensions 34x34x3.
The possible size of the filters can be a x a x 3, where ‘a’ can be anything like 3, 5, or 7, but smaller than the image
dimension.
During the forward pass, we slide each filter across the whole input volume step by step where each step is called stride (which
can have a value of 2, 3, or even 4 for high-dimensional images) and compute the dot product between the kernel weights and
patch from input volume.
As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together as a result, we’ll get output volume
having a depth equal to the number of filters. The network will learn all the filters.
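The sliding dot product can be sketched for the simplest case. The block below is a minimal illustration under stated assumptions: a single-channel image stored as a list of rows, one square kernel, no padding, and the sliding-window product without kernel flipping, as is usual in CNN layers. The function name `conv2d`, the 4x4 image and the 2x2 kernel are invented for the example.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2-D image and take the dot product at each step."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# 4x4 example image with values 1..16 and a 2x2 kernel that subtracts the
# lower-right neighbour from each pixel.
image = [[4 * r + c + 1 for c in range(4)] for r in range(4)]
kernel = [[1, 0], [0, -1]]

out_stride1 = conv2d(image, kernel, stride=1)  # 3x3 output
out_stride2 = conv2d(image, kernel, stride=2)  # 2x2 output
print(out_stride1, out_stride2)
```

Each filter yields one such 2-D map; stacking the maps of all filters gives the output volume, and a larger stride shrinks the map because fewer patch positions are visited.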
Layers Used to Build ConvNets
A complete Convolutional Neural Network architecture is also known as a ConvNet. A ConvNet is a sequence of layers, and every layer
transforms one volume to another through a differentiable function.
Let’s take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
Input Layers: It’s the layer in which we give input to our model. In CNN, Generally, the input will be an image or a sequence
of images. This layer holds the raw input of the image with width 32, height 32, and depth 3.
Convolutional Layers: This is the layer which is used to extract features from the input dataset. It applies a set of learnable
filters, known as kernels, to the input images. The filters/kernels are smaller matrices, usually of 2x2, 3x3, or 5x5 shape. Each filter slides
over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The
output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; with padding that preserves
the width and height, we’ll get an output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to
the network. This layer applies an element-wise activation function to the output of the convolution layer. Some common activation
functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume will have
dimensions 32 x 32 x 12.
Pooling layer: This layer is periodically inserted in the ConvNet and its main function is to reduce the size of the volume, which
makes the computation faster, reduces memory use and also helps prevent overfitting. Two common types of pooling layers are max
pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension
16x16x12.
Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so
they can be passed into a fully connected layer for classification or regression.
Fully Connected Layers: It takes the input from the previous layer and computes the final classification or regression task.
Output Layer: The output from the fully connected layers is then fed into a logistic function for classification tasks like
sigmoid or softmax which converts the output of each class into the probability score of each class.
Step:
import the necessary libraries
set the parameter
define the kernel
Load the image and plot it.
Reformat the image
Apply convolution layer operation and plot the output image.
Apply activation layer operation and plot the output image.
Apply pooling layer operation and plot the output image.
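The convolution, activation and pooling steps listed above can be sketched in plain Python without any image library. This is only an illustrative sketch: the 6 x 6 "image", the vertical-edge kernel and the function names are assumptions made for the example, and plotting is left out.

```python
# Minimal sketch of convolution -> ReLU -> max pooling on a tiny grayscale image.
# The image, kernel and helper names are illustrative assumptions.

def conv2d_same(image, kernel):
    """Slide the kernel over a zero-padded image (stride 1, 'same' padding).

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di - pad, j + dj - pad
                    if 0 <= r < h and 0 <= c < w:  # outside the image counts as 0
                        total += image[r][c] * kernel[di][dj]
            out[i][j] = total
    return out

def relu(feature_map):
    """Element-wise max(0, x); the shape is unchanged."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[i + di][j + dj] for di in range(size) for dj in range(size))
         for j in range(0, w - size + 1, stride)]
        for i in range(0, h - size + 1, stride)
    ]

# 6 x 6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_same(image, kernel)   # 6 x 6
activated = relu(feature_map)              # 6 x 6, negatives set to 0
pooled = max_pool(activated)               # 3 x 3
```

The shapes follow the same pattern as the 32 x 32 example: convolution with 'same' padding and the activation keep the spatial size, and 2 x 2 pooling with stride 2 halves it.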
Figure 3A: PCA for Data Representation Figure 3B: PCA Dimension Reduction
If the variation in a data set is caused by some natural property, or is caused by random
experimental error, then we may expect it to be normally distributed. In this case we show
the nominal extent of the normal distribution by a hyper-ellipse (the two-dimensional
ellipse in the example). The hyper ellipse encloses data points that are thought of as
belonging to a class. It is drawn at a distance beyond which the probability of a point
belonging to the class is low, and can be thought of as a class boundary.
If the variation in the data is caused by some other relationship, then PCA gives us a
way of reducing the dimensionality of a data set. Consider two variables that are nearly
related linearly as shown in figure 3B. As in figure 3A the principal direction in which the
data varies is shown by the U axis, and the secondary direction by the V axis. However, in
this case the V coordinates are all very close to zero. We may assume, for example, that
they are only non-zero because of experimental noise. Thus in the U V axis system we can
represent the data set by one variable U and discard V. Thus we have reduced the
dimensionality of the problem by 1.
Computing the Principal Components
Let A be an n × n matrix. The eigenvalues of A are defined as the roots of:
det(A − λI) = 0
where I is the n × n identity matrix. For each eigenvalue λ there is a non-zero vector x such that:
Ax = λx
The vector x is called an eigenvector of A associated with the eigenvalue λ. Notice that
there is no unique solution for x in the above equation. It is a direction vector only and can
be scaled to any magnitude. To find a numerical solution for x we need to set one of its
elements to an arbitrary value, say 1, which gives us a set of simultaneous equations to
solve for the other elements. If there is no solution, we repeat the process with another
element. Ordinarily we normalize the final values so that x has length one, that is x · x =
1.
Suppose we have a 3 × 3 matrix A with eigenvectors x1, x2, x3, and eigenvalues λ1, λ2, λ3
so:
Ax1 = λ1x1 Ax2 = λ2x2 Ax3 = λ3x3
Putting the eigenvectors as the columns of a matrix gives:
• Calculate the eigenvectors and eigenvalues
• Choose the principal components from the feature vectors
• Derive the new data set
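The steps above can be sketched for the two-variable case, where the eigenvalues of the 2 × 2 covariance matrix have a closed form. This is a sketch under assumptions: the five data points are made up for illustration, and the names U and V simply follow the axis names used in the text.

```python
import math

# Illustrative data: two variables that are nearly linearly related.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
n = len(points)

# Centre the data on its mean.
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2 x 2 covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
var_x = sum(dx * dx for dx, _ in centered) / (n - 1)
var_y = sum(dy * dy for _, dy in centered) / (n - 1)
cov_xy = sum(dx * dy for dx, dy in centered) / (n - 1)

# Eigenvalues: roots of det(C - lambda * I) = 0 for a symmetric 2 x 2 matrix.
half_trace = (var_x + var_y) / 2
radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
lambda_u, lambda_v = half_trace + radius, half_trace - radius

# Eigenvector for the largest eigenvalue (valid because cov_xy != 0 here),
# normalized to length one. This is the U axis.
ux, uy = cov_xy, lambda_u - var_x
length = math.hypot(ux, uy)
u_axis = (ux / length, uy / length)

# New data set: keep only the U coordinate of every point and discard V.
u_coords = [dx * u_axis[0] + dy * u_axis[1] for dx, dy in centered]

# Fraction of the total variance kept by the single U coordinate.
explained = lambda_u / (lambda_u + lambda_v)
```

For this data almost all of the variance lies along U, so dropping V reduces the dimensionality from 2 to 1 with very little loss.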
4. Improves Visualization:
Linear Discriminant Analysis as its name suggests is a linear model for classification and
dimensionality reduction. Most commonly used for feature extraction in pattern classification
problems.
3.4.1 Need for LDA:
Logistic regression performs well for binary classification but can become unstable in multi-class
problems with well-separated classes, whereas LDA handles these quite
efficiently.
LDA can also be used in data pre-processing to reduce the number of features, just as PCA does,
which reduces the computing cost significantly.
3.4.2. Limitations:
Linear decision boundaries may not effectively separate non-linearly separable classes.
More flexible boundaries are desired.
In cases where the number of features exceeds the number of observations, LDA might not
perform as desired. This is called the Small Sample Size (SSS) problem. Regularization is
required.
1. Simple prototype classifier: Distance to the class mean is used, it’s simple to interpret.
2. Decision boundary is linear: It’s simple to implement and the classification is robust.
3. Dimension reduction: It provides informative low-dimensional view on the data, which is
both useful for visualization and feature engineering.
Shortcomings of LDA:
1. Linear decision boundaries may not adequately separate the classes. Support for more
general boundaries is desired.
2. In a high-dimensional setting, LDA uses too many parameters. A regularized version of
LDA is desired.
3. Support for more complex prototype classification is desired.
3.5. Manifold Learnings:
Manifold learning for dimensionality reduction has recently gained much attention to
assist image processing tasks such as segmentation, registration, tracking,
recognition, and computational anatomy.
The drawbacks of PCA in handling dimensionality reduction problems for non-linear,
curved surfaces necessitated the development of more advanced
algorithms such as manifold learning.
There are different variants of manifold learning that address the problem of reducing
data dimensions and feature sets obtained from real-world problems whose data lie on
uneven, curved surfaces that linear methods represent sub-optimally.
Manifold learning assumes that the data points lie on a low-dimensional
manifold that is embedded in a high-dimensional space, in an attempt to
generalize linear frameworks like PCA.
Locally, a manifold looks like a flat space that behaves like Euclidean space.
Manifold learning problems are unsupervised: the algorithm learns the high-dimensional
structure of the data from the data itself, without the use of predetermined
classifications.
The goal of the manifold-learning algorithms is to recover the original domain
structure, up to some scaling and rotation. The nonlinearity of these algorithms allows
them to reveal the domain structure even when the manifold is not linearly embedded.
Manifold learning algorithms are divided into two categories:
Global methods: Map high-dimensional data to a low-dimensional
space such that the global properties are preserved.
Examples include Multidimensional Scaling (MDS) and Isomap, covered in the
following sections.
Local methods: Map high-dimensional data to a low-dimensional space
such that local properties are preserved. Examples are Locally Linear Embedding
(LLE), Laplacian Eigenmaps (LE), Local Tangent Space Alignment (LTSA), and
Hessian Eigenmapping (HLLE).
Three popular manifold learning algorithms:
IsoMap (Isometric Mapping)
Isomap seeks a lower-dimensional representation that maintains
‘geodesic distances’ between the points. A geodesic distance is a generalization
of distance for curved surfaces. Hence, instead of measuring distance in pure
Euclidean distance with the Pythagorean theorem-derived distance formula,
Isomap optimizes distances along a discovered manifold
Locally Linear Embeddings
Locally Linear Embeddings use a variety of tangent linear patches (as
demonstrated with the diagram above) to model a manifold. It can be thought of
as performing a PCA on each of these neighborhoods locally, producing a linear
hyperplane, then comparing the results globally to find the best nonlinear
embedding. The goal of LLE is to ‘unroll’ or ‘unpack’ in distorted fashion the
structure of the data, so often LLE will tend to have a high density in the center
with extending rays
t-SNE
t-SNE is one of the most popular choices for high-dimensional
visualization, and stands for t-distributed Stochastic Neighbor Embedding.
The algorithm converts pairwise relationships in the original space into probability distributions
(Gaussian in the original space, Student's t-distribution in the low-dimensional embedding)
and arranges the embedded points so that the two match. This makes t-SNE very sensitive to the local structure, a common
theme in manifold learning. It is often considered the go-to visualization method
for this reason.
3.6. Auto Encoders:
An AutoEncoder is an unsupervised artificial neural network that attempts
to encode the data by compressing it into lower dimensions (the bottleneck layer, or code)
and then decoding the data to reconstruct the original input. The bottleneck layer (or code)
holds the compressed representation of the input data. In an AutoEncoder the number of output
units must be equal to the number of input units, since we’re attempting to reconstruct
the input data.
AutoEncoders usually consist of an encoder and a decoder. The encoder encodes the
provided data into a lower dimension, which is the size of the bottleneck layer, and the
decoder decodes the compressed data into its original form. The number of neurons in the
layers of the encoder decreases as we move on to further layers, whereas the
number of neurons in the layers of the decoder increases as we
move on to further layers. There are three layers used in the encoder and decoder in the
following example. The encoder contains 32, 16, and 7 units in each layer respectively, and
the decoder contains 7, 16, and 32 units in each layer respectively. The code size (the number
of neurons in the bottleneck) must be less than the
number of features in the data. Before feeding the data into the AutoEncoder, the data should
be scaled between 0 and 1 using MinMaxScaler, since we are going to use a sigmoid
activation function in the output layer, which outputs values between 0 and 1. When we are
using AutoEncoders for dimensionality reduction, we extract the bottleneck layer and
use it to reduce the dimensions. This process can be viewed as feature extraction.
The type of AutoEncoder that we’re using is a Deep AutoEncoder, where the encoder and
the decoder are symmetrical. AutoEncoders do not necessarily need a symmetrical
encoder and decoder; the two can be non-symmetrical as well.
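The scaling step and the symmetric layer layout described above can be illustrated with a small sketch. The hand-written `min_max_scale` function is an assumption standing in for a library scaler such as MinMaxScaler, and the three-row data set is made up for the example.

```python
def min_max_scale(rows):
    """Scale every column of a list-of-rows data set to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled

# Illustrative data: two features on very different scales.
data = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
scaled = min_max_scale(data)

# Layer sizes from the text: the decoder mirrors the encoder,
# and the last encoder layer (7 units) is the bottleneck.
encoder_units = [32, 16, 7]
decoder_units = list(reversed(encoder_units))
```

After scaling, every feature lies in the same range as the output of a sigmoid unit, so the reconstruction error compares like with like.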
Deep Autoencoder
Sparse Autoencoder
Under complete Autoencoder
Variational Autoencoder
LSTM Autoencoder
3.7. AlexNet:
The AlexNet model was proposed in 2012 in the research paper “ImageNet
Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky and his colleagues.
Then comes the fourth convolution operation, with 384 filters of size 3X3. The stride and
the padding are both 1. The output size remains unchanged at 13X13X384.
After this, we have the final convolution layer, with 256 filters of size 3X3. The
stride and padding are set to 1, and the activation function is ReLU. The resulting feature
map is of shape 13X13X256.
If we look at the architecture now, the number of filters generally increases as we go
deeper. Hence more features are extracted as we move deeper into the architecture.
Also, the filter size decreases with depth, and the spatial size of the feature maps shrinks as well.
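The statement that a 3X3 convolution with stride 1 and padding 1 leaves a 13X13 map unchanged follows from the usual output-size formula, (W − F + 2P) / S + 1. A minimal sketch, with a helper name chosen only for illustration:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window along one axis."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# 3x3 filter, stride 1, padding 1 on a 13x13 input: the size is preserved.
same_size = conv_output_size(13, kernel_size=3, stride=1, padding=1)

# 2x2 window, stride 2, no padding on a 32x32 input: the size is halved.
halved = conv_output_size(32, kernel_size=2, stride=2, padding=0)
```

The depth of the output volume is not given by this formula; it equals the number of filters in the layer.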
3.8. VGG-16
The major shortcoming of AlexNet, its many hyper-parameters, was addressed by
VGG Net by replacing the large kernel-sized filters (11 and 5 in the first and second
convolution layers, respectively) with multiple 3×3 kernel-sized filters one after
another.
The architecture, developed by Simonyan and Zisserman, was the 1st runner-up of the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) of 2014.
The architecture consists of 3*3 convolutional filters with a stride of 1 and 2*2 max pooling layers with a
stride of 2.
Padding is kept 'same' to preserve the dimension.
There are 16 weight layers in the network. The input image is in RGB format with
dimension 224*224*3, followed by 5 blocks of convolution layers (filters: 64, 128,
256, 512, 512), each block followed by max pooling.
The output of these layers is fed into three fully connected layers and a softmax
function in the output layer.
In total there are about 138 million parameters in VGG Net.
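The figure of roughly 138 million parameters can be checked with a short count. The sketch below assumes details that the notes do not state but that match the standard VGG-16 configuration: 2, 2, 3, 3 and 3 convolution layers in the five blocks, fully connected layers of 4096, 4096 and 1000 units, and a bias term in every layer.

```python
# (number of conv layers, number of filters) per block; the per-block layer
# counts are an assumption taken from the standard VGG-16 configuration.
CONV_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_UNITS = (4096, 4096, 1000)  # assumed fully connected layer sizes

def count_vgg16_parameters(input_size=224, input_channels=3):
    total = 0
    in_channels = input_channels
    for n_layers, out_channels in CONV_BLOCKS:
        for _ in range(n_layers):
            # 3x3 kernel weights plus one bias per filter.
            total += 3 * 3 * in_channels * out_channels + out_channels
            in_channels = out_channels
    # Five 2x2 max pools with stride 2 reduce 224 to 7.
    spatial = input_size // 2 ** len(CONV_BLOCKS)
    fc_in = spatial * spatial * in_channels
    for fc_out in FC_UNITS:
        total += fc_in * fc_out + fc_out
        fc_in = fc_out
    return total

total_parameters = count_vgg16_parameters()
```

Most of the parameters sit in the first fully connected layer, which connects the flattened 7 x 7 x 512 volume to 4096 units.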
Figure 7: InceptionNet
The Inception network, also known as GoogLeNet, was proposed by researchers at Google
in “Going Deeper with Convolutions” in 2014. The motivation of InceptionNet comes from
the presence of sparse features: salient parts in the image can have a large variation in
size. Due to this, selecting the right kernel size becomes extremely difficult, as big kernels
are preferred for global features and small kernels for features that are locally located.
InceptionNet resolves this by applying multiple kernel sizes at the same level. Typically it uses
5*5, 3*3 and 1*1 filters in parallel.
3.11. Hyperparameter Optimization:
Hyperparameter optimization in machine learning intends to find the
hyperparameters of a given machine learning algorithm that deliver the best performance as
measured on a validation set. Hyperparameters, in contrast to model parameters, are set by the
machine learning engineer before training. The number of trees in a random forest is a
hyperparameter while the weights in a neural network are model parameters learned during
training. Hyperparameter optimization finds a combination of hyperparameters that returns
an optimal model, one that reduces a predefined loss function and in turn increases the accuracy on given
independent data.
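As an illustration of the idea, the sketch below runs a small grid search over a single hyperparameter. Everything in it is an assumption made for the example: the model is a one-parameter line y = w * x with an L2 penalty, the penalty strength `lam` is the hyperparameter, and the training and validation values are made up.

```python
def fit_slope(xs, ys, lam):
    """Closed-form slope minimizing sum((w*x - y)^2) + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mean_squared_error(xs, ys, w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Illustrative training and validation sets.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
val_x, val_y = [5.0, 6.0], [10.1, 11.9]

# Candidate hyperparameter values, fixed before training.
grid = [0.0, 0.1, 1.0, 10.0]

validation_loss = {}
for lam in grid:
    w = fit_slope(train_x, train_y, lam)                       # learned parameter
    validation_loss[lam] = mean_squared_error(val_x, val_y, w)  # held-out score

# The selected hyperparameter is the one with the lowest validation loss.
best_lam = min(validation_loss, key=validation_loss.get)
```

The weight `w` is a model parameter learned from the training set, while `lam` is chosen from outside by comparing scores on the validation set.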
3.11.1 Hyperparameter Optimization methods
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures, Algorithms and
Applications”, Prentice Hall publications.
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015.
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
UNIT IV
Recurrent networks, LSTM Recurrent Neural Network Language Models- Word-Level
RNNs & Deep Reinforcement Learning - Computational & Artificial Neuroscience.
4.1 Optimization in Deep Learning:
In deep learning, the performance of the model is estimated/evaluated with the help of a
loss function. This loss is used to train the network so that it performs better. Essentially, we try
to minimize the loss function. A lower loss means the model performs better. The process of
minimizing any mathematical function is called optimization.
Optimizers are algorithms or methods used to change the attributes of the neural network, such
as weights and learning rate, so that the loss is reduced. Optimizers are used to solve optimization
problems by minimizing the function.
The goal of an optimizer is to minimize the objective function (the loss function based on the
training data set). Simply put, optimization minimizes the training error.
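A minimal sketch of what an optimizer does is plain gradient descent on a one-parameter loss. The quadratic loss, the starting weight and the learning rate below are assumptions chosen so the behaviour is easy to follow.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # initial weight
learning_rate = 0.1   # step size
history = []

for _ in range(100):
    history.append(loss(w))
    w -= learning_rate * gradient(w)  # move against the gradient
```

Each update moves the weight a small step in the direction that lowers the loss, so the recorded losses shrink towards zero as `w` approaches 3.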
4.1.1 Need for Optimization:
Presence of local minima, which reduces the model performance
Presence of saddle points, where gradients become very small and training stalls, alongside vanishing or exploding gradient issues
To select appropriate weight values and other associated model parameters
To minimize the loss value (training error)
Figure 4.1: Convex Regions
Localisation Net:
With an input feature map U of width W, height H and C channels, the outputs
are θ, the parameters of the transformation Tθ. Tθ can be learnt as an affine transform, for example.
Grid Generator:
Suppose we have a regular grid G, a set of points with target
coordinates (xt_i, yt_i). We apply the transformation Tθ to G, i.e. Tθ(G).
The result of Tθ(G) is a set of points with source coordinates (xs_i, ys_i), one for each target point.
These points have been altered based on the transformation parameters. The transformation can be a
translation, scaling, rotation or more generic warping, depending on how we set θ.
Sampler:
Based on the new set of coordinates (xs_i, ys_i), we sample the input feature map U to generate a
transformed output feature map V. This V is a translated, scaled, rotated or otherwise warped
version of U. Note that an STN can be applied
not only to the input image, but also to intermediate feature maps.
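The grid generator step can be sketched for the affine case. This is a sketch under assumptions: θ is taken to be a 2 x 3 affine matrix, grid coordinates are normalized to the range [-1, 1], and the function name is chosen for the example. The localisation net that would predict θ and the sampler that would interpolate pixel values are not shown.

```python
def affine_grid(theta, height, width):
    """Map every point of a regular output grid through a 2x3 affine matrix.

    Returns, for each output position, the (x, y) coordinate in the input
    feature map from which the sampler would read. Requires height, width >= 2.
    """
    grid = []
    for i in range(height):
        y_t = -1.0 + 2.0 * i / (height - 1)      # normalized row coordinate
        row = []
        for j in range(width):
            x_t = -1.0 + 2.0 * j / (width - 1)   # normalized column coordinate
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid

identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
zoom_in = [[0.5, 0.0, 0.0],     # sample only the central half of the input
           [0.0, 0.5, 0.0]]

identity_grid = affine_grid(identity, 3, 3)
zoom_grid = affine_grid(zoom_in, 3, 3)
```

With the identity matrix the sampling points coincide with the regular grid, while scaling by 0.5 makes the output cover only the centre of the input, which is how the module can focus on a target object.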
STN is a mechanism that rotates or scales an input image or a feature
map in order to focus on the target object and to remove rotational variance.
One of the most notable features of STNs is their modularity (the module can
be injected into any part of the model) and their ability to be trained with a single backpropagation
algorithm without modification of the initial model.
4.4.1. Advantages:
Helps in learning explicit spatial transformations like translation, rotation, scaling,
cropping, non-rigid deformations, etc. of features.
Can be used in any networks and at any layer and learnt in an end-to-end trainable
manner.
Provides improvement in the performance of existing models.
4.5. Recurrent Neural Networks:
RNNs are very powerful, because they combine two properties:
Distributed hidden state that allows them to store a lot of information about
the past efficiently.
Non-linear dynamics that allows them to update their hidden state in
complicated ways.
With enough neurons and time, RNNs can compute anything that can be computed
by your computer.
4.5.1. Need for RNN:
Normal networks cannot handle sequential data
They consider only the current input
Normal neural networks cannot memorize previous inputs
The solution to these issues is the RNN
RNN works on the principle of saving the output of a particular layer and feeding
this back to the input in order to predict the output of the layer. We can convert a Feed-
Forward Neural Network into a Recurrent Neural Network as given below in figure 4.4.
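The feedback idea can be sketched with a single recurrent unit whose previous output is fed back in together with the current input. The scalar weights below are arbitrary assumptions, not trained values.

```python
import math

def rnn_step(x_t, h_prev, w_x=1.0, w_h=0.5, b=0.0):
    """One time step: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the unit and return the hidden state at each step."""
    h = h0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h)
        states.append(h)
    return states

# The same values in a different order give a different final state,
# because the hidden state carries information about earlier inputs.
early_pulse = run_rnn([1.0, 0.0, 0.0])
late_pulse = run_rnn([0.0, 0.0, 1.0])
```

A feed-forward unit given only the last input would produce the same output for both sequences whenever their last inputs match; the recurrent unit does not, which is the memory property described above.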
Figure 4.5 A: Recurrent Network
Figure 4.5 B: Fully Connected RNN
4.5.2. Providing Input to RNN:
We can specify inputs in several ways:
– Specify the initial states of all the units.
– Specify the initial states of a subset of the units.
– Specify the states of the same subset of the units at every time step.
4.5.3. Providing Targets to RNN:
We can specify targets in several ways:
– Specify desired final activities of all the units
– Specify desired activities of all units for the last few steps
• Good for learning attractors
• It is easy to add in extra error derivatives as we backpropagate.
– Specify the desired activity of a subset of the units
4.6. Long Short Term Memory Network’s ( LSTM):
LSTMs are a special kind of RNN, capable of learning long-term dependencies;
remembering information for long periods is their default behavior. All RNNs are in the form
of a chain of repeating modules of a neural network. In standard RNNs, this repeating
module has a very simple structure, such as a single tanh layer.
LSTMs also have a chain-like structure, but the repeating module has a different
structure. Instead of having a single neural network layer, there are four layers
interacting in a special way.
Hochreiter & Schmidhuber (1997) solved the problem of getting an RNN to
remember things for a long time (like hundreds of time steps). They designed a memory
cell using logistic and linear units with multiplicative interactions. Information gets into the cell
whenever its “write” gate is on. The information stays in the cell so long as its “keep” gate
is on. Information can be read from the cell by turning on its “read” gate (refer to Figure 4.6,
shown below).
To preserve information for a long time in the activities of an RNN, we use a circuit
that implements an analog memory cell.
– A linear unit that has a self-link with a weight of 1 will maintain its state.
– Information is stored in the cell by activating its write gate.
– Information is retrieved by activating the read gate.
– We can backpropagate through this circuit because logistic units have nice
derivatives.
Step 2: Decide how much this unit adds to the current state
In the second layer, there are two parts. One is the sigmoid function, and the other is
the tanh function. The sigmoid function decides which values to let through (a value between 0
and 1). The tanh function gives a weighting to the values which are passed, deciding their level of
importance (-1 to 1).
Step 3: Decide what part of the current cell state makes it to the output
The third step is to decide what the output will be. First, we run a sigmoid layer,
which decides what parts of the cell state make it to the output. Then, we put the cell state
through tanh to push the values to be between -1 and 1 and multiply it by the output of the
sigmoid gate.
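The steps above can be sketched as one update of a scalar LSTM cell. This is a sketch under assumptions: all weights are made-up scalars stored in a dictionary with hypothetical key names, and the forget ("keep") gate is included as in the standard LSTM formulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step for scalar input, hidden state and cell state."""
    # Keep (forget) gate: how much of the old cell state survives.
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])
    # Step 2: the sigmoid part decides how much to write, the tanh part what to write.
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])
    c = f * c_prev + i * g
    # Step 3: a sigmoid gate selects which part of tanh(cell state) is output.
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])
    h = o * math.tanh(c)
    return h, c

# Illustrative parameters: keep gate almost fully open, write gate almost closed.
params = {
    "w_f": 0.0, "u_f": 0.0, "b_f": 10.0,
    "w_i": 0.0, "u_i": 0.0, "b_i": -10.0,
    "w_g": 0.0, "u_g": 0.0, "b_g": 0.0,
    "w_o": 0.0, "u_o": 0.0, "b_o": 0.0,
}

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=params)
```

With these settings the cell state is carried over almost unchanged from one step to the next, which is the mechanism that lets an LSTM hold information over long time spans.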
4.6.2. Applications of LSTM include:
• Robot control
• Time series prediction
• Speech recognition
• Rhythm learning
• Music composition
• Grammar learning
• Handwriting recognition
4.7. Computational and Artificial Neuro-Science:
Computational neuroscience is the field of study in which mathematical tools and theories
are used to investigate brain function.
The term “computational neuroscience” has two different definitions:
1. using a computer to study the brain
2. studying the brain as a computer
Computational and Artificial Neuroscience deals with the study or understanding of how
signals are transmitted through and from the human brain. A better understanding of how
decisions are made in the human brain by processing data or signals will help us in
developing intelligent algorithms or programs to solve complex problems. Hence, we need
to understand the basics of Biological Neural Networks (BNN).
4.7.1. The Biological Neurons:
The human brain consists of a large number (more than a billion) of neural cells that
process information. Each cell works like a simple processor. Only the massive interaction
between all cells and their parallel processing makes the brain’s abilities possible.
Figure 4.7 represents a human biological nervous unit, with the various parts of the biological neural
network (BNN) marked.
Figure 4.7: Biological Neural Network
Dendrites are branching fibres that extend from the cell body or soma.
Soma or cell body of a neuron contains the nucleus and other structures that support
chemical processing and production of neurotransmitters.
Axon is a single fiber that carries information away from the soma to the synaptic
sites of other neurons (dendrites and somas), muscles, or glands.
Axon hillock is the site of summation for incoming information. At any moment,
the collective influence of all neurons that conduct impulses to a given neuron will
determine whether or not an action potential will be initiated at the axon hillock and
propagated along the axon.
Myelin sheath consists of fat-containing cells that insulate the axon from electrical
activity. This insulation acts to increase the rate of transmission of signals. A gap exists
between each myelin sheath cell along the axon. Since fat inhibits the propagation of
electricity, the signals jump from one gap to the next.
Nodes of Ranvier are the gaps (about 1 μm) between myelin sheath cells. Since fat
serves as a good insulator, the myelin sheaths speed the rate of transmission of an
electrical impulse along the axon.
Synapse is the point of connection between two neurons or a neuron and a muscle
or a gland. Electrochemical communication between neurons takes place at these junctions.
Terminal buttons of a neuron are the small knobs at the end of an axon that release
chemicals called neurotransmitters.
Information flow in a neural cell
The input/output and the propagation of information are shown below.
4.7.2. Artificial neuron model
An artificial neuron is a mathematical function conceived as a simple model of a real
(biological) neuron.
The McCulloch-Pitts Neuron
This is a simplified model of real neurons, known as a Threshold Logic Unit.
A set of input connections brings in activations from other neurons.
A processing unit sums the inputs, and then applies a non-linear activation function
(i.e. squashing/transfer/threshold function).
An output line transmits the result to other neurons.
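A threshold logic unit of this kind can be sketched in a few lines. The weights and thresholds below are assumptions chosen so that the unit behaves as a logical AND or OR gate.

```python
def threshold_unit(inputs, weights, threshold):
    """Sum the weighted inputs and fire (output 1) if the sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

binary_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# With unit weights, a threshold of 2 gives AND and a threshold of 1 gives OR.
and_outputs = [threshold_unit(x, weights=(1, 1), threshold=2) for x in binary_inputs]
or_outputs = [threshold_unit(x, weights=(1, 1), threshold=1) for x in binary_inputs]
```

The same unit computes different logical functions purely through the choice of weights and threshold.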
4.7.3. Basic Elements of ANN:
A neuron consists of three basic components: weights, thresholds and a single
activation function. An Artificial Neural Network (ANN) model based on biological
neural systems is shown in figure 4.8.
4.7.4. Applications of Computational Neuro Science:
UNIT V APPLICATIONS OF DEEP LEARNING
ImageNet is useful for many computer vision applications such as object recognition, image
classification and object localization. Prior to ImageNet, a researcher wrote one algorithm to
identify dogs, another to identify cats, and so on. After training with ImageNet, the same
algorithm could be used to identify different objects. The diversity and size of ImageNet meant
that a computer looked at and learned from many variations of the same object. These variations
could include camera angles, lighting conditions, and so on. Models built from such extensive
training were better at many computer vision tasks. ImageNet convinced researchers that
large datasets were important for algorithms and models to work well.
5.1.1. Technical details of Image Net:
ImageNet did not define these subcategories on its own but derived these from
WordNet. WordNet is a database of English words linked together by semantic relationships.
Words of similar meaning are grouped together into a synonym set, simply called synset.
Hypernyms are synsets that are more general. Thus, "organism" is a hypernym of "plant".
Hyponyms are synsets that are more specific. Thus, "aquatic" is a hyponym of "plant". This
hierarchy makes it useful for computer vision tasks. If the model is not sure about a
subcategory, it can simply classify the image higher up the hierarchy where the error probability is less. For
example, if the model is unsure that it's looking at a rabbit, it can simply classify it as a mammal.
While WordNet has 100K+ synsets, only the nouns have been considered by ImageNet.
Humans make mistakes and therefore we must have checks in place to overcome them.
Each human is given a task of 100 images. In each task, 6 "gold standard" images are placed
with known labels. At most 2 errors are allowed on these standard images, otherwise the task
has to be restarted.
In addition, the same image is labelled by three different humans. When there's
disagreement, such ambiguous images are resubmitted to another human with tighter quality
threshold (only one allowed error on the standard images).
For public access, ImageNet provides image thumbnails and URLs from where the original
images were downloaded. Researchers can use these URLs to download the original images.
However, those who wish to use the images for non-commercial or educational purposes can
create an account on ImageNet and request access. This will allow direct download of images
from ImageNet. This is useful when the original sources of images are no longer available.
The dataset can be explored via a browser-based user interface. Alternatively, there's also
an API. Researchers may want to read the API Documentation. This documentation also shares
how to download image features and bounding boxes.
Images are not uniformly distributed across subcategories. One research team found that,
among 200 subcategories considered, the top 11 had 50% of the images, followed
by a long tail.
When classifying people, ImageNet uses labels that are racist, misogynist and offensive.
People are treated as objects. Their photos have been used without their knowledge. About
5.8% labels are wrong. ImageNet lacks geodiversity. Most of the data represents North
America and Europe. China and India are represented in only 1% and 2.1% of the images
respectively. This implies that models trained on ImageNet may not work well when applied
to the developing world.
Another study from 2016 found that 30% of ImageNet's image URLs are broken. This is
about 4.4 million annotations lost. Copyright laws prevent caching and redistribution of these
images by ImageNet itself.
5.2. WaveNet:
WaveNet is a deep generative model of raw audio waveforms. Its authors show that WaveNets
are able to generate speech which mimics any human voice and which sounds more natural
than the best Text-to-Speech systems existing at the time, reducing the gap with human performance by
over 50%. Allowing people to converse with machines is a long-standing dream of
human-computer interaction. The ability of computers to understand natural speech has been
revolutionised in the last few years by the application of deep neural networks. However,
generating speech with computers — a process usually referred to as speech synthesis or text-
to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large
database of short speech fragments are recorded from a single speaker and then recombined to
form complete utterances. This makes it difficult to modify the voice (for example switching
to a different speaker, or altering the emphasis or emotion of their speech) without recording a
whole new database.
This has led to a great demand for parametric TTS, where all the information required to
generate the data is stored in the parameters of the model, and the contents and characteristics
of the speech can be controlled via the inputs to the model. So far, however, parametric TTS
has tended to sound less natural than concatenative. Existing parametric models typically
generate audio signals by passing their outputs through signal processing algorithms known
as vocoders. WaveNet changes this paradigm by directly modelling the raw waveform of the
audio signal, one sample at a time. As well as yielding more natural-sounding speech, using
raw waveforms means that WaveNet can model any kind of audio, including music.
WaveNet proposes autoregressive learning with the help of convolutional
networks and a few additional tricks. Basically, we have a convolution window sliding over the audio
data, and at each step it tries to predict the next sample value, which it has not yet seen. In other
words, it builds a network that learns the causal relationships between consecutive timesteps
(as shown in figure 5.1).
Typically, speech audio has a sampling rate of 22 kHz or 16 kHz. For a few seconds of speech,
this means there are more than 100K values for a single example, which is enormous for the network
to consume. Hence, we need to restrict the size, preferably to around 8K. At the end, the
values are predicted in Q channels (e.g. Q=256 or 65536), which are compared to the original
audio data compressed to Q distinct values. For that, mu-law quantization can be used:
it maps the values to the integer range [0, Q-1]. The loss can be computed either by
cross-entropy or by a discretized logistic mixture.
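Mu-law quantization can be sketched as follows. The sketch assumes audio samples normalized to the range [-1, 1] and Q = 256 levels; the function names are chosen for the example.

```python
import math

Q = 256       # number of quantization levels (assumed)
MU = Q - 1    # mu-law parameter

def mu_law_encode(x, mu=MU):
    """Compress a sample in [-1, 1] and map it to an integer level in [0, mu]."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1.0) / 2.0 * mu + 0.5)

def mu_law_decode(level, mu=MU):
    """Approximate inverse: map an integer level back to a sample in [-1, 1]."""
    compressed = 2.0 * level / mu - 1.0
    return math.copysign(((1.0 + mu) ** abs(compressed) - 1.0) / mu, compressed)

silence = mu_law_encode(0.0)
loudest = mu_law_encode(1.0)
quietest = mu_law_encode(-1.0)
round_trip = mu_law_decode(mu_law_encode(0.5))
```

The logarithmic compression spends more of the Q levels on small amplitudes, where most speech samples lie, so the network can treat the next-sample prediction as a Q-way classification.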
And the element-wise addition of a skip connection and output of causal 1D results in
the residual
The above diagram (Figure 5.3) shows the phases or logical steps involved in natural
language processing.
5.4. Word2Vec:
Word embedding is one of the most popular representations of document vocabulary. It
is capable of capturing the context of a word in a document, semantic and syntactic similarity,
relation with other words, etc. What are word embeddings exactly? Loosely speaking, they
are vector representations of a particular word. Having said this, what follows is how we generate
them. More importantly, how do they capture the context? Word2Vec is one of the most
popular techniques to learn word embeddings using a shallow neural network. It was developed
by Tomas Mikolov and colleagues at Google in 2013.
The purpose and usefulness of Word2vec is to group the vectors of similar words
together in vector space. That is, it detects similarities mathematically. Word2vec creates
vectors that are distributed numerical representations of word features, features such as the
context of individual words. It does so without human intervention.
Given enough data, usage and contexts, Word2vec can make highly accurate guesses
about a word’s meaning based on past appearances. Those guesses can be used to establish a
word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or
cluster documents and classify them by topic. Those clusters can form the basis of search,
sentiment analysis and recommendations in such diverse fields as scientific research, legal
discovery, e-commerce and customer relationship management. Similarity between word vectors
is measured with cosine similarity: no similarity (a value of 0) corresponds to a 90-degree
angle, while total similarity (a value of 1) corresponds to a 0-degree angle, that is,
complete overlap.
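The relationship between cosine similarity and the angle between two vectors can be checked with a short sketch. The vectors below are made-up toy values, not real word embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between u and v in degrees, clamped against rounding error."""
    c = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(c))

orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction, different length
```

Note that cosine similarity ignores vector length: the two "parallel" vectors above differ in magnitude but still count as completely similar.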
Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its
input is a text corpus and its output is a set of vectors: feature vectors that represent words in
that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form
that deep neural networks can understand.
Figure 5.4: Two models of Word2Vec (A- CBOW & B- Skip-Gram model)
Word2vec is similar to an autoencoder, encoding each word in a vector, but rather
than training against the input words through reconstruction, as a restricted Boltzmann machine
does, Word2vec trains words against other words that neighbour them in the input corpus. It
does so in one of two ways: either using context to predict a target word (a method known as
continuous bag of words, or CBOW), or using a word to predict a target context, which is
called skip-gram.
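The difference between the two training schemes is easiest to see in the training pairs each one produces. The sketch below assumes a symmetric context window and a pre-tokenized sentence; the helper names are hypothetical.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_pairs(tokens, window=2):
    """CBOW: the whole context is the input, the centre word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_pairs(tokens, window=2):
    """Skip-gram: the centre word is the input, each context word is a label."""
    return [(target, c) for target, context in context_windows(tokens, window) for c in context]

tokens = "the quick brown fox".split()
cbow = cbow_pairs(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)
```

With a window of 1, CBOW yields one example per word (for instance, the context ["the", "brown"] predicting "quick"), while skip-gram yields one example per (word, neighbour) pair.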
When the feature vector assigned to a word cannot be used to accurately predict that
word’s context, the components of the vector are adjusted. Each word’s context in the corpus
is the teacher sending error signals back to adjust the feature vector. The vectors of words
judged similar by their context are nudged closer together by adjusting the numbers in the
vector.
Similar things and ideas are shown to be “close”. Their relative meanings have been
translated to measurable distances. Qualities become quantities, and algorithms can do their
work. But similarity is just the basis of many associations that Word2vec can learn. For
example, it can gauge relations between words of one language, and map them to another.
The main idea of Word2Vec is to design a model whose parameters are the word
vectors. Then, train the model on a certain objective. At every iteration we run our model,
evaluate the errors, and follow an update rule that has some notion of penalizing the model
parameters that caused the error. Thus, we learn our word vectors.
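The train-evaluate-update loop described above can be sketched for the skip-gram case. This is a toy version under stated assumptions: it uses a full softmax over a tiny vocabulary and plain stochastic gradient descent, whereas practical Word2Vec implementations use approximations such as negative sampling or hierarchical softmax. The corpus, dimensions and learning rate are arbitrary illustrative choices.

```python
import math
import random

def train_skipgram(pairs, vocab_size, dim=8, lr=0.1, epochs=50, seed=0):
    """Learn word vectors by predicting a context word from a centre word."""
    rng = random.Random(seed)
    # w_in[w] is the vector for word w; w_out[c] scores c as a context word.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for center, context in pairs:
            h = w_in[center]
            scores = [sum(a * b for a, b in zip(h, row)) for row in w_out]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            probs = [e / z for e in exps]
            total += -math.log(probs[context])  # cross-entropy error for this pair
            # Gradient of the loss w.r.t. the scores: probs - one_hot(context).
            errs = [p - (1.0 if j == context else 0.0) for j, p in enumerate(probs)]
            grad_h = [sum(errs[j] * w_out[j][d] for j in range(vocab_size)) for d in range(dim)]
            for j in range(vocab_size):
                for d in range(dim):
                    w_out[j][d] -= lr * errs[j] * h[d]
            for d in range(dim):
                h[d] -= lr * grad_h[d]  # updates w_in[center] in place
        losses.append(total / len(pairs))
    return w_in, losses

tokens = "the king rules the land the queen rules the land".split()
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}
ids = [index[w] for w in tokens]
pairs = [(ids[i], ids[j]) for i in range(len(ids)) for j in (i - 1, i + 1) if 0 <= j < len(ids)]
vectors, losses = train_skipgram(pairs, len(vocab))
```

The rows of `vectors` are the learned word vectors; the average loss per epoch in `losses` should fall as the parameters responsible for the errors are adjusted.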
Content Source: (1) https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
(2) https://wiki.pathmind.com/word2vec
Figure 5.5: General representation of Bone Joint detection system
Figure 5.7: CNN based Knee Joint Detection Model
Figure 5.7 shows the full model of the joint detection procedure. The convolution
filter moves to the right by a fixed stride until it has covered the complete width. It then
hops down to the left edge of the image by the same stride and repeats
the process until the entire image is traversed. The kernel has the same depth as the
input image. The objective of the convolution operation is to extract features from the
input image: low-level features such as edges in the early layers, and higher-level
features in deeper layers. Stride is the number of pixels by which the filter shifts over
the input matrix. When the stride is 1, the filter moves 1 pixel at a time; when the
stride is 2, it moves 2 pixels at a time, and so on.
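The sliding of the filter and the effect of the stride can be sketched as below. The example assumes a single-channel image, no padding, and a hand-written vertical-edge kernel; the image values are made up for illustration.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over a single-channel image (no padding) and sum the products."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    output = []
    for r in range(out_h):          # move down by `stride` rows
        row = []
        for c in range(out_w):      # move right by `stride` columns
            top, left = r * stride, c * stride
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output

# Dark left half, bright right half: a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1],
               [-1, 1]]

stride1 = conv2d(image, edge_kernel, stride=1)
stride2 = conv2d(image, edge_kernel, stride=2)
```

With stride 1 the 4x4 image gives a 3x3 feature map whose middle column responds to the edge. With stride 2 the map shrinks to 2x2 and, in this particular case, the filter steps over the edge entirely, which shows the trade-off between a larger stride and spatial detail.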
Pooling layers reduce the size of the feature maps, and hence the number of parameters in
later layers, when the images are too large. Spatial pooling, also called subsampling or
downsampling, reduces the dimensionality of each map but retains the important information.
This decreases the computational power required to process the data.
Types of Pooling:
• Max Pooling
• Average Pooling
• Sum Pooling
After pooling:
• The pooled feature map is flattened into a column vector.
• The flattened output is fed to a feed-forward neural network, and backpropagation is
applied at every iteration of training.
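The three pooling types and the flattening step can be sketched together. The feature map values are made up, and the window size and stride of 2 are common but arbitrary choices.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce each size x size window of the feature map to a single value."""
    reducers = {
        "max": max,
        "average": lambda window: sum(window) / len(window),
        "sum": sum,
    }
    reduce_window = reducers[mode]
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            reduce_window([feature_map[r * stride + i][c * stride + j]
                           for i in range(size) for j in range(size)])
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def flatten(feature_map):
    """Unroll a 2D map row by row into a single vector."""
    return [value for row in feature_map for value in row]

fmap = [[1, 3, 2, 0],
        [5, 6, 1, 2],
        [7, 2, 4, 8],
        [0, 1, 3, 5]]

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="average")
sum_pooled = pool2d(fmap, mode="sum")
vector = flatten(max_pooled)
```

Each 2x2 window collapses to one number, so the 4x4 map becomes 2x2, and flattening turns it into the 4-element vector that would be passed to the feed-forward network.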
Over a series of epochs, the model learns to distinguish between dominating and
certain low-level features in images and to classify them using the softmax
classification technique. The feature map matrix is converted to a vector (x1, x2, x3,
…). These features are combined to create a model.
Finally, an activation function such as softmax or sigmoid is used to classify the outputs as
Normal or Abnormal.
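The final classification step can be sketched as follows. The raw scores (logits) are invented for illustration; in the real model they would come from the last fully connected layer.

```python
import math

CLASSES = ["Normal", "Abnormal"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Squash a single raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(logits):
    """Return the most probable class label and the class probabilities."""
    probs = softmax(logits)
    return CLASSES[probs.index(max(probs))], probs

label, probs = classify([2.0, 0.5])
```

For two classes the two functions agree: the softmax probability of the first class equals the sigmoid of the difference between the two scores, which is why either can be used for a Normal/Abnormal decision.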
5.5.1 Steps Involved:
• Provide the input image to the convolution layer.
• Choose parameters and apply filters with strides, and padding if required. Perform convolution
on the image and apply ReLU activation to the resulting matrix.
• Perform pooling to reduce the dimensionality.
• Add as many convolutional layers as needed.
• Flatten the output and feed it into a fully connected layer (FC layer).
• Output the class using an activation function (logistic regression with a cost function)
to classify the images.
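One way to sanity-check the steps above before building a model is to trace how the tensor shape changes from layer to layer. The sketch below uses the standard output-size formula, floor((n + 2p - k) / s) + 1; the 64x64 input and the layer settings are hypothetical and are not the configuration of the model in Figure 5.7.

```python
def output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

def trace_shapes(height, width, channels, layers):
    """Return the (height, width, channels) shape after each layer."""
    shapes = [(height, width, channels)]
    for layer in layers:
        k = layer["kernel"]
        s = layer.get("stride", 1)
        p = layer.get("padding", 0)
        height = output_size(height, k, s, p)
        width = output_size(width, k, s, p)
        if layer["type"] == "conv":
            channels = layer["filters"]  # one output channel per filter
        shapes.append((height, width, channels))
    return shapes

# Hypothetical stack: conv -> pool -> conv -> pool, then flatten.
layers = [
    {"type": "conv", "kernel": 3, "padding": 1, "filters": 8},
    {"type": "pool", "kernel": 2, "stride": 2},
    {"type": "conv", "kernel": 3, "filters": 16},
    {"type": "pool", "kernel": 2, "stride": 2},
]
shapes = trace_shapes(64, 64, 1, layers)
h, w, c = shapes[-1]
flattened_length = h * w * c  # input size of the fully connected layer
```

The flattened length is the number of inputs the fully connected layer must accept, so it has to be recomputed whenever a stride, padding or kernel size changes.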
5.6. Other Applications:
Similarly, for other applications such as facial recognition and scene
matching, appropriate deep learning based algorithms such as AlexNet,
VGG, Inception and ResNet, and/or deep learning based LSTM or RNN models, can be used. These
networks have to be explained with the necessary diagrams and appropriate explanations.
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures, Algorithms and
Applications”, Prentice Hall publications.
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015.
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.