DNN Hyperparameter Tuning

This document provides an introduction to machine learning and deep neural networks. It discusses key concepts like hyperparameters, model capacity, and regularization. Hyperparameters like batch size, epochs, activation functions, and optimizer algorithms must be tuned carefully for deep learning models. While deep neural networks can theoretically represent any function, practical challenges like vanishing gradients must be addressed. Overall, the document stresses the importance of hyperparameter tuning in developing accurate and generalizable deep learning models.


Introduction to machine learning

Deep Neural Networks


Hyperparameter Tuning


Objective of DNN models

1. The key success parameters for a DNN are:

   a. Validation accuracy / accuracy of the model in production
   b. Minimal computation requirements (CPU / GPU / TPU time, memory, storage)
   c. Ease of implementing the model in production

2. To succeed in designing and implementing a DNN model, an understanding of NN training as a non-convex optimization problem is a must.

3. For lack of understanding of these aspects of DNNs, practitioners often make arbitrary choices for the various hyperparameters, yielding potentially subpar performance.


Theoretical power of DNN

1. Theoretically, the deeper a neural network is, the more powerful it is.

2. Deeper neural networks learn more coefficients / weights and hence become better universal approximators.

3. Deep neural networks are more efficient in terms of coefficients than broad networks.


Practical challenges in DNN

1. When the neural network becomes deep, the neurons in the initial layers (closest to the input layer) become slow / sluggish in learning.

2. This is due to the vanishing gradient problem. Sometimes its opposite, exploding gradients, hampers the learnability of the neurons in the first few layers.

3. As a result, those neurons continue to hold on to their initial randomly assigned weights, or only slightly modified weights.

4. Though the neural network may perform acceptably in training, it will perform poorly in production.

5. This is because the first few layers are supposed to learn to extract the features from the input images. The learning process should optimize their weights.

6. Optimization of weights means changing the randomly assigned weights to optimal weights. This does not happen, due to the vanishing gradient problem.


Practical challenges in DNN

7. Neural networks can also easily overfit when there are too many weights to learn while the input data is small.

8. In such a situation, having too many layers, each with multiple neurons, while the input data is limited is akin to fitting too many parameters with too little data.

9. The models become complex, overfit, and will suffer in production from a high degree of variance error.

10. Thus, in summary, the challenge is how to create deep neural networks while avoiding these pitfalls simultaneously.

Solution: careful hyperparameter tuning.


DNN – Parameters

1. In every model, parameters are used to train the model and make predictions. There are, however, two types of parameters: those that govern learning (a.k.a. hyperparameters) and those that are learnt and used to predict (a.k.a. model parameters).

2. Hyperparameters come into play while the algorithm of choice is building the model. The objective is to make the model accurate and generalizable.

3. Model parameters are the result of learning / model building and represent the learnt relationships between the target and predictor variables. They are used only for prediction.


DNN Model Capacity


Capacity / Underfitting / Overfitting

1. Informally, a model's capacity is its ability to fit a wide variety of functions: the ability to express any kind of relation between the independent and target variables.

2. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that are of no use on test data.

3. ML algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with.


Capacity / Regularization
1. All machine learning algorithms have hyperparameter settings that we can use to control the algorithm's behavior.

2. The values of hyperparameters are not learnt from the data. They have to be fixed outside the model being built.

3. In a neural network, every node is a small model with its own associated parameters. At the network level, these can easily run into thousands.

4. One way to look at neural networks with fully-connected layers is that they define a family of functions that are parameterized by the weights of the network.

5. A natural question that arises is: what is the representational power of this family of functions? In particular, are there functions that cannot be modeled with a neural network?

6. It turns out that neural networks with at least one hidden layer are universal approximators.

Capacity / Regularization
1. That is, it can be shown (intuitive explanation from Michael Nielsen) that given any continuous function f(x) and some ϵ > 0, there exists a neural network g(x) with one hidden layer (with a reasonable choice of non-linearity, e.g. sigmoid) such that ∀x, |f(x) − g(x)| < ϵ.

2. In other words, a neural network with one hidden layer can approximate any continuous function. If so, why do we need deep neural networks?

3. It has been empirically demonstrated that deep networks are more efficient than single-hidden-layer networks in terms of the parameters needed to represent a function. With a given number of parameters, a larger family of functions can be approximated with a deep network.

4. The problem is, with such high capacity to represent any function, neural networks can easily overfit. Hence regularization for neural networks is a must.

5. Regularizing means controlling the behavior of the machine learning algorithm while it builds a neural network. Regularization is the task of finding the optimal values of the various hyperparameters which help the neural network best represent a given function.

6. Finding the optimal combination is like finding a needle in a huge haystack!


Deep Neural Network Hyperparameters

Hyperparameters

Neural network structure related:
1. Number of hidden layers
2. Weights initialization
3. Activation function
4. Dropout

Training algorithm related:
5. Epochs, iterations and batch size
6. Optimizer algorithm / loss function
7. Learning rate
8. Momentum


Batch size and epochs


Batch size –

1. A gradient descent hyperparameter that controls the number of training samples to work through before the model's internal parameters are updated.

2. At the end of each batch, the error is calculated and used to update the coefficients through backpropagation:
   a. Batch Gradient Descent: Batch Size = Size of Training Set
   b. Stochastic Gradient Descent: Batch Size = 1
   c. Mini-Batch Gradient Descent: 1 < Batch Size < Size of Training Set

Epochs –
a. A gradient descent hyperparameter that controls the number of complete passes through the training dataset.
b. One epoch means that each sample in the training dataset is used once to update the coefficients.
c. An epoch consists of one or more batches. An epoch that has one batch is called the batch gradient descent learning algorithm.
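As a concrete illustration, a minimal tf.keras sketch (the toy data and layer sizes here are assumptions for illustration, not from the slides) showing where batch_size and epochs enter:

    import tensorflow as tf

    # Toy data: 1000 samples, 20 features, binary labels (assumed)
    X = tf.random.normal((1000, 20))
    y = tf.cast(tf.random.uniform((1000, 1)) > 0.5, tf.float32)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # batch_size=128 -> mini-batch gradient descent: weights update every 128 samples
    # epochs=20      -> 20 complete passes over the training data
    model.fit(X, y, batch_size=128, epochs=20, validation_split=0.2)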


Batch size and epochs

Epochs | Batch 64                     | Batch 128                    | Batch 256                    | Batch 512
20     | loss 0.105, acc 0.982, ES 16 | loss 0.116, acc 0.982, ES 12 | loss 0.075, acc 0.984, ES 17 | loss 0.0827, acc 0.984, ES 13
30     | loss 0.11, acc 0.984, ES 14  | loss 0.078, acc 0.985, ES 14 | loss 0.084, acc 0.983, ES 19 | loss 0.073, acc 0.982, ES 20

(loss = test loss, acc = test accuracy, ES = epoch noted in the original, presumably where early stopping triggered)


Activation Functions
1. The last step in a neuron is the application of an activation function. The choice of activation function decides the ANN's learning capability. There are many different activation functions.

2. The entire power of an artificial neural network comes from the activation function. Without it, the power of the ANN collapses to that of a single neuron, which cannot handle non-linear classifications such as XOR.

3. Activation functions are non-linear, i.e. more than the simple addition and multiplication that are linear operations. It is this non-linearity that gives the artificial neural network its power to approximate any complex function.

4. An activation function (what helps a neuron fire / activate) is a mathematical operation that takes as input a floating point number and returns a new floating point number.

5. In the software implementation of an ANN, we can have a different activation function for each layer but not for each neuron.


Activation Functions – [Figure: plots of common activation functions]


Activation Functions
Basic forms of activation functions include –

Linear functions –
1. The input and output floating point numbers are linearly related, expressed in the form y = mx + c.

2. When activation functions are linear, they are only doing multiplication and addition and are of no use in an ANN. They collapse the network into one single neuron.

Stair step functions –
1. The function outputs one constant value for a given range of inputs and a different constant value for the next range, so the graph appears like steps.

2. It is a combination of linear pieces but still overall a non-linear function.

Step function –
1. A function similar to the stair step function but with only one step. Up until the threshold value the output is one value, and on reaching the threshold it changes to another.


Activation Functions (Contd…)


Basic forms of activation functions include –

Piecewise linear functions (ReLU) –

1. An activation function made of several pieces, each of which is a linear function, is called piecewise linear. Overall it is still a non-linear function.

2. The most popular one is ReLU (Rectified Linear Unit). The name comes from electronics, where a rectifier component prevents negative voltage passing through a circuit.

3. If the input is less than 0 (negative), the output is 0. Otherwise the output equals the input.

4. The output is a linear function of x after the kink.

5. Fast to train and easy to implement, but can run into problems for inputs < 0.

6. Some weight updates can push a neuron's input permanently to the negative side of 0.

7. That neuron will then stop firing anything other than 0; it is literally dead.

8. Variants of ReLU help overcome these limitations of ReLU.


Activation Functions (Contd…)


Leaky ReLU –
1. Changes the response for inputs less than 0. Rather than outputting 0 for negative inputs, it scales them down by a small fixed factor.
2. This factor can be changed from its default to other values in Parametric ReLU.

Advantages
Leaky ReLUs are one attempt to fix the "dying ReLU" problem by having a small negative slope (of 0.01, or so).

Disadvantages
Because of its near-linearity, it can struggle with complex classification tasks. It lags behind sigmoid and tanh for some use cases.
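A small NumPy sketch of ReLU and Leaky ReLU (the 0.01 slope follows the note above; it is one common choice, not the only one):

    import numpy as np

    def relu(x):
        # 0 for negative inputs, identity otherwise
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        # Negative inputs are scaled by a small slope instead of being zeroed,
        # so the neuron keeps a non-zero gradient and cannot "die"
        return np.where(x > 0, x, alpha * x)

    x = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu(x))        # [0.  0.  0.  1.5]
    print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]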


Activation Functions (Contd…)

Parametric ReLU-
1. Do not change the parameter to 1 as that will remove the kink


Activation Functions (Contd…)


Basic forms of activation functions include –

Shifted ReLU–
Moves the kink down and left

Note: Mathematically a concept called derivatives (rate of change of output for a unit change in input) is not defined at the
kink. However, there are some mathematical tricks that can be employed to overcome this problem with ReLU functions.


Activation Functions (Contd…)


Smooth functions – Softplus –
1. Smoothens out the ReLU function.
2. Needs no mathematical trick to find the derivative at the kink.

Exponential ReLU (ELU) –

1. A smoothened, shifted ReLU.


Activation Functions (Contd…)


Sigmoid / logistic function / curve –
1. It outputs a value less than 0.5 for every negative input and greater than 0.5 for every input greater than 0.
2. The least value it can output approaches 0 at negative infinity, and the max value it can output approaches 1 at positive infinity.
3. Suffers from the vanishing gradient and exploding gradient problems.

Advantages
1. It is non-linear in nature, i.e. it projects a linear data set into non-linear space.
2. It gives a continuous, differentiable activation, unlike the step function.
3. Suitable for classification, as it maps to the probability range 0 – 1.
4. Squashes input values into a range of 0 – 1, hence the output is never too large in magnitude.

Disadvantages
1. At the extremes of X, the output responds very little to changes in X (small gradients).
2. It gives rise to the problem of "vanishing gradients": sigmoids saturate and kill gradients.
3. As the gradient curve of the sigmoid is almost flat at the extremes, changes in the input there have very little impact on the output.
4. The network refuses to learn further or becomes drastically slow (depending on the use case, and until gradients / computation get hit by floating point value limits (explosion)).
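A quick NumPy check of this saturation behavior (a sketch; the printed values are exact properties of the sigmoid):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        # Derivative s(x) * (1 - s(x)); peaks at 0.25 when x = 0
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
        print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
    # The gradient is at most 0.25 and collapses toward 0 at the extremes,
    # which is the root of the vanishing gradient problem.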


Activation Functions (Contd…)


Hyperbolic tangent / tanh –
1. Similar to the logistic function; the main difference is that for negative inputs the least output value approaches -1 at negative infinity.

2. At input 0 the output is also 0. It too suffers from the vanishing gradient and exploding gradient problems.

3. The sigmoid and tanh functions squash the input values to a range of (0, 1) and (-1, 1) respectively.

Advantages
The gradient is stronger for tanh than for sigmoid (the derivatives are steeper).

Disadvantages
Tanh also has the vanishing gradient problem.


Activation Functions (Contd…)


Swish –
1. Combination of the sigmoid and basic ReLU function. In essence it is ReLU with a small bump to the left
of 0 which flattens out over more and more negative values

2. Drawback of smooth functions is they produce useful results in a small range of input values where the
gradients can be meaningfully calculated.

3. At extreme input values the gradients flatten out which leads to a problem in learning process


Activation Functions (Contd…)


Self-normalizing exponential linear unit (SELU) –

1. This function self-normalizes, which mitigates the problem of vanishing and exploding gradients.

2. It keeps network activations in a range of values defined by a mean and variance.

3. The mean is fixed to 0 and the variance to one unit.

4. The ELU function is scaled to keep the activations at that mean and variance.

5. α and λ are the scaling factors.

6. The scaling factors are approximately 1.6732632423543772848170429916717 and 1.0507009873554804934193349852946 respectively.

7. These scaling factors specifically ensure mean 0 and variance 1.

8. Weight initialization is another important aspect here. It is suggested to sample weights from a Gaussian distribution with mean 0 and variance 1/n, where n is the number of weights.
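A NumPy sketch of SELU using the α and λ constants quoted above:

    import numpy as np

    ALPHA = 1.6732632423543772   # alpha scaling factor
    LAMBDA = 1.0507009873554805  # lambda scaling factor

    def selu(x):
        # Scaled ELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise.
        # With suitable weight initialization this keeps activations near
        # mean 0 and variance 1 across layers.
        return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

    print(selu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])))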


Activation Functions (Contd…)

Note the derivative of the sigmoid: it peaks at 0.25 and is almost flat for a major part of its domain. That means small gradients for most inputs.

[Figure: sigmoid and its derivative. Fig Ref: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6]


Activation Functions (Contd…)


Which activation function to use?
1. The sigmoid function used to be popular at the beginning of the work on neural networks. However, due to the vanishing and exploding gradient problems, it has fallen out of favor.

2. Instead, ReLU and its variants are often used in the hidden layers in deep learning.

3. Use a softmax function for the output layer in classification problems, and a linear function if it is a regression problem.

Ref: https://keras.io/activations/

DL_Hyperparameter_Tuning.ipynb
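A minimal tf.keras sketch of these defaults (the layer sizes and the 10-class output are assumptions for illustration):

    import tensorflow as tf

    # ReLU in the hidden layers; softmax at the output for classification
    clf = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # For regression, the output layer would instead be a single linear unit:
    reg_output = tf.keras.layers.Dense(1, activation="linear")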


Activation Functions (Contd…)

1. The blue line is for a linear function, and surprisingly it is giving 92%+ accuracy!
2. However, all the other standard activations (ReLU, tanh, sigmoid) perform relatively much better.
3. All converge to roughly the same 98% accuracy score.
4. Any further increase in epochs may not help, as all of them stop improving beyond this point.
5. Sigmoid is the slowest learner, followed by tanh, while ReLU seems to learn fast... (why?)
6. The different non-linear activation functions deliver the same end result; the difference is in how fast they learn.
7. There is no one single best activation function. Each can give relatively better performance under different conditions.


Number of Layers
The Input Layer
1. Every NN has exactly one layer of input neurons, which are not really neurons. Some call them pass-through neurons.

2. The number of neurons comprising this layer is completely and uniquely determined by the number of attributes in the training data.

3. Other than these neurons, we may also have a neuron dedicated to holding the bias term.

The Output Layer

1. Like the input layer, every NN has exactly one output layer.

2. The number of neurons is completely determined by the problem domain: is it a classification or a regression case?

3. If a regression case, is the output a single value or a vector of values?

4. If a classifier doing binary classification, only one output node is needed.

5. For multi-class classification, we need as many nodes as there are classes.


Number of Hidden Layers

1. Conceptually, one hidden layer with multiple hidden nodes is equivalent to projection into a higher-dimensional space, and it has been shown that it can model most complex functions with reasonable results.

2. However, deep networks have a higher parameter efficiency than shallow ones. Parameter efficiency is the number of parameters required to represent a function. A deep neural network can represent the same complex function with fewer neurons.

3. The fewer the neurons, the greater the speed and the less demand on computational resources. Overall, the neural network can also get higher performance for the same number of neurons.

4. Deep neural networks take advantage of the hidden hierarchies in real-world data, where high-level features consist of lower-level features. E.g. an object consists of boundaries, a boundary consists of sub-boundaries, etc.

https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html


Number of Hidden Layers (Contd…)

5. If we consider high-level features as functions, these functions can be represented as sub-functions and sub-sub-functions... DNNs have a natural affinity for such structured data.

6. In many problems, we can start with just one or two hidden layers and get good results: e.g. 97% accuracy on the MNIST dataset with one hidden layer and a few hundred neurons, and 98% accuracy with the same number of neurons in two layers!

7. Gradually start increasing the number of layers and neurons while checking for overfitting of the data.

DL_Hyperparameter_Tuning.ipynb


Number of Neurons per Hidden Layer

1. With an increase in the number of layers we see a performance improvement from 93.8 to 95.3.
2. The training and validation accuracies converge; this may be true only for this data set, which is cleaned (de-noised) data.
3. What would happen if we increased the number of layers to, say, 15?


Number of Neurons per Hidden Layer

As the number of layers increases, there is a slight drop in test accuracy and the model seems to become relatively unstable.

Number of Neurons per Hidden Layer

1. Another knob we can turn is the number of nodes in each hidden layer.

2. This is called the width of the layer. Making layers wider tends to scale the number of parameters faster than adding more layers.

3. Every time we add a single node to layer i, we have to give that new node an edge to every node in layer i+1. This is akin to projecting data into a higher-dimensional space.

4. For the number of hidden layers there is no rule of thumb, except that the deeper the network, the more efficient the model becomes at representing complex functions.

5. But how many neurons per hidden layer, i.e. how should the neurons be distributed across the layers? Some use the funnel approach, where the mouth of the funnel is the first hidden layer and the nozzle is the last hidden layer.

6. The belief is that the many sub-features represented by the first few hidden layers can coalesce into higher-level features in the subsequent layers, hence fewer neurons in each subsequent layer.

7. Finding the right combination of the number of neurons in each layer is a challenge.


Number of Neurons per Hidden Layer (Theoretical approaches)

1. In 1957, Kolmogorov proved that any real-valued, continuous function f() defined on an n-dimensional unit cube can be represented as a sum of continuous functions of a single variable.

2. This is interpreted as a 3-layer NN (a common statement of the theorem is sketched after this list):

   a. The first layer is the input layer of n neurons.
   b. The second layer is the hidden layer consisting of 2n neurons (the outer loop). Each hidden neuron transforms the inner sum using a function g.
   c. The third layer is the output layer, represented by the function f(x1, x2, x3, ...).
   d. In this three-layer NN, λp is a constant and Ø is a linear function, p being the number of dimensions of the cube.

3. Hecht-Nielsen rephrased it to state that any continuous function f() defined on an n-dimensional unit cube can be implemented exactly by a 3-layer network with (2n + 1) hidden nodes.

4. However, the functions g and Ø are highly non-smooth, unlike the sigmoid functions used in NNs. Hence this theoretical interpretation should not be taken literally.

5. The theorem does not tell us how to determine the number of neurons. It only mentions the possibility of representing the function.
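For reference, a common modern statement of the superposition theorem (a reconstruction; the slide's original equation was an image and may have used slightly different notation):

    f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \psi_{q,p}(x_p)\right)

The inner sums play the role of the 2n + 1 hidden units, and the outer functions Φ_q the output transformation.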


Number of Neurons per Hidden Layer (Theoretical approaches)

6. More recently, Huang and Babri lowered these bounds, proving that a single-hidden-layer NN with at most Nh hidden neurons can learn Ns distinct samples with zero error. This is true for any bounded, non-linear activation function which has a limit at either infinity. Thus the upper bound for SLFNs is Nh ≤ Ns.

7. Huang later extended the work of Tamura and Tateishi to rigorously prove that the upper bound on the number of hidden nodes Nh for a two-hidden-layer NN with sigmoid activation is given by the equation below, where No is the number of outputs. Such networks can learn at least Ns distinct samples with any degree of precision.

8. It is well known that functions which are linearly separable require no hidden nodes at all. For the rest, the discussions above indicate that, within the constraints of their respective theorems, a function of any complexity can be reproduced with any degree of precision within the specified bounds.

9. However, if the training set contains noise, as is the case in most practical situations, exactly reproducing the training set will guarantee overfitting. This is undesirable.

10. In practice, therefore, the number of hidden nodes must necessarily be lower than these bounds and will depend on a number of specific factors, such as the degree of noise in the data, the complexity of the function we are modeling, the number of inputs and outputs, etc.
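The bound referenced in point 7 is commonly quoted from Huang's 2003 result as (a reconstruction; the slide's equation was an image):

    N_h \;=\; 2\sqrt{(N_o + 2)\, N_s}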

DL_Hyperparameter_Tuning.ipynb


Number of Neurons per Hidden Layer

1. With an increase in neurons for a fixed number of layers, the accuracy improves.
2. With an increase in layers it improves further.
3. Though in this data the training and validation accuracies seem to be converging, this cannot be taken as a general trend.
4. When models overfit, there will be a drop in the validation accuracy.


Learning Rate
1. In order for gradient descent to work efficiently, especially when the error surface is complex, we set λ (the learning rate). This parameter determines how quickly the optimal values of the coefficients are found.

2. The learning rate hyperparameter controls the magnitude of change in the parameter weights in the
gradient descent

3. A learning rate that is too small leads to slow convergence, while a learning rate that is too large can
hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge
(explode)

4. Adjusting the learning rate - adjust the learning rate during training by e.g. annealing, i.e. reducing
the learning rate according to a pre-defined schedule or when the change in objective between
epochs falls below a threshold.

5. These schedules and thresholds, however, have to be defined in advance and are thus unable to
adapt to a dataset’s characteristics

6. Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our
features have very different frequencies, we might not want to update all of them to the same extent,
but perform a larger update for rarely occurring features

7. Another key challenge of minimizing highly non-convex error functions common for neural networks
is avoiding getting trapped in their numerous suboptimal local minima.

Learning Rate - Tuning

1. Configure the learning rate with sensible defaults


2. Assess the impact of learning rate – is it acceptable or is it stuck (unable to learn)
3. Perform sensitivity analysis by jiggling the learning rate
4. Improve performance with learning rate schedules, momentum, and adaptive learning
rates.
DL_Hyperparameter_Tuning.ipynb

https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-
learning-d0d4059c1c10

https://arxiv.org/abs/1704.00109 (Stochastic weight averaging method)
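A sketch of a step-decay schedule with tf.keras (the halving factor and 10-epoch interval are assumptions for illustration):

    import tensorflow as tf

    def step_decay(epoch, lr):
        # Halve the learning rate every 10 epochs (assumed schedule)
        return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

    scheduler = tf.keras.callbacks.LearningRateScheduler(step_decay)
    # model.fit(X, y, epochs=50, callbacks=[scheduler])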



1. For a small learning rate, the accuracy is low and apparently increases with a higher learning rate.

2. A lower learning rate needs more epochs as the learning is slow. Hence, within the given number of epochs it appears to give low accuracy, which increases with a higher learning rate.

3. The error surface looks relatively smooth. With a higher learning rate, even with few epochs the network reaches high accuracy and the training and validation accuracies converge.

4. Given the cleansed data, this will not always be the case. Usually the validation and training scores will deviate due to overfitting.


Weights Initialization
1. This concerns the first set of weights assigned at the start of the training stage. These weights used to be randomly assigned in the early days of DNNs.

2. Weights were generated from a uniform random distribution, meaning all values between two extremes are taken with equal probability. Though the weights are assigned randomly, they are expected to converge through the learning process to optimal values.

3. Normal initialization uses the normal, also known as Gaussian, distribution of random numbers. In this approach, randomly drawn weights are likely to be close to the central value, i.e. 0 (0 is most likely!). The random values are replaced with optimal values through the learning process.

4. Three well-researched strategies for initial weight selection are the LeCun uniform, Glorot (a.k.a. Xavier) uniform, and He uniform algorithms, which draw from a uniform distribution, and similarly their normal-distribution counterparts: LeCun Normal, Glorot Normal, He Normal.

Ref:
https://keras.io/initializers/
https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404
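In Keras these strategies are selected per layer via kernel_initializer (a sketch; the layer sizes are assumptions):

    import tensorflow as tf

    layer_tanh = tf.keras.layers.Dense(
        64, activation="tanh",
        kernel_initializer="glorot_uniform",  # Xavier/Glorot, commonly paired with tanh
    )
    layer_relu = tf.keras.layers.Dense(
        64, activation="relu",
        kernel_initializer="he_normal",       # He init, commonly paired with ReLU
    )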


Weights Initialization and problem of vanishing and exploding gradients


1. It was empirically observed that DNNs often suffered from unstable gradients: they would suddenly grow too big (the hardware registers unable to hold them, leading to "NaN"), or the gradients would become close to 0 or even 0.

2. This behavior made DNNs difficult to train, and they were almost abandoned. In 2010, Xavier Glorot and Yoshua Bengio published a paper explaining the reason behind these problems: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

3. They suspected the then-popular non-linear activation functions (sigmoid and tanh) combined with random weight initialization using a normal distribution centered at 0 with a standard deviation of 1.

4. They showed that under this approach the variance of the output of a hidden layer becomes greater and greater than the variance of the input of that layer. Going forward in the network, the variance keeps increasing.

5. The activation functions in the last few / top layers saturate, i.e. reach the flat regions of their sigmoid curves, given the magnitude of their inputs.

6. As a result, when backpropagation kicks in, the error gradients in the top layers are small, and as these small error gradients are propagated back, they tend to become smaller and smaller!


Weights Initialization and problem of vanishing and exploding gradients


1. In each subsequent layer the activation function (sigmoid) approaches its saturation point, i.e. the inputs become large in magnitude (due to the increase in variance), driving the output close to a constant 0 or 1 (regions of very tiny gradients, i.e. dy/dx is almost 0).


Weights Initialization and problem of vanishing and exploding gradients

[Figure: gradient of a sigmoid function]

1. The sigmoid and tanh functions have a flat slope for most values of x. That means for most values of x the change in the function, i.e. dσ(x)/dx, will be close to 0, as can be seen in the red dashed curve (a plot of the derivative of the sigmoid function).
2. The max value of the derivative of the sigmoid is 0.25, i.e. 1/4, a small fraction.
3. In the flat regions, the derivative is very small, close to 0. When stuck in such regions, the change in weights (see the formula below) is very small or 0, i.e. learning stops.

W_new = W_old − η (dE/dW)


Weights Initialization and problem of vanishing and exploding gradients


Neuron Chain Equation for weight learning

a. One neuron applies a linear function LF() followed by a non-linear transformation NLF() to produce output O.
b. The error E is a function of O and Y. Since Y is constant (it comes from the data), only O can change.
c. Thus the rate of change of E with respect to O is dE/dO.
d. But O is a function of the non-linear transformation NLF(). Thus the rate of change of O is dO/dNLF().
e. The non-linear function depends on the linear function, hence the rate of change of NLF() is dNLF()/dLF().
f. The linear function depends on the input i.
g. Thus the overall impact on E of a change in i is dE/di.

A single neuron: input i → LF() → NLF() → output O → E = f(O, Y)

Applying the chain rule:

dE/di = (dLF/di) · (dNLF/dLF) · (dO/dNLF) · (dE/dO)


Weights Initialization and problem of vanishing and exploding gradients


Chains across layers

Two neurons in series: i1 → LF()1 → NLF()1 → O1 = i2 → LF()2 → NLF()2 → O → E = f(O, Y)

dE/di1 = (dLF1/di1) · (dNLF1/dLF1) · (di2/dNLF1) · (dLF2/di2) · (dNLF2/dLF2) · (dO/dNLF2) · (dE/dO)
a. As we add more and more layers to a neural network (make it deeper), the chain equation extends across layers as shown above.
b. The chain of equations with the multiplication operator helps estimate the effect on E of a change in i1.
c. i1 is the input to the first neuron, which is the input value multiplied by a coefficient / slope (not shown separately).
d. Similarly, i2 is the output of the first neuron multiplied by a coefficient (not shown separately).
e. As a result of this chain, fractions can appear in the equations, either because of the random slope values fixed in the first iteration or due to the derivative of the non-linear function.
f. If the non-linear function is a sigmoid or tanh function, fractions are guaranteed on differentiation.
g. Fractions multiply to give smaller fractions, and as a result dE/di1, i.e. the change in overall error for a change in the input of the first-layer neuron, becomes very small in magnitude, leading to no significant change in the coefficient.
h. Note: the coefficient change at any stage is W_new = W_old − η (dE/dW)


Weights Initialization and problem of vanishing and exploding gradients


Combination of chain effect
1. Replace the unitary NN with a multi-layer, multi-neuron network as shown below.
2. The error at every layer will be E (the same as the error in the final layer).
3. However, E will be proportionally assigned to each neuron in a layer.
4. Hence, we are interested in estimating the impact of a change in each neuron's input on the overall E at that level.
5. The chain rule will be structurally similar but will have more terms for each neuron at each layer, given the multiple connections from a node to nodes in the subsequent layer. E.g. at node 1,1:
dE/di1 = (dLF1/di1) · (dNLF1/dLF1) · (di2/dNLF1) · (dLF2/di2) · (dNLF2/dLF2) · (dO/dNLF2) · (dE/dO) + ...
dE/di2 = (dLF1/di2) · (dNLF1/dLF1) · (di2/dNLF1) · (dLF2/di2) · (dNLF2/dLF2) · (dO/dNLF2) · (dE/dO) + ...
dE/di3 = (dLF1/di3) · (dNLF1/dLF1) · (di2/dNLF1) · (dLF2/di2) · (dNLF2/dLF2) · (dO/dNLF2) · (dE/dO) + ...

(the trailing "+ ..." indicates one such product term per connection path from that input to the output)

[Figure: multi-layer, multi-neuron network with inputs i1, i2, i3 and error E]


Weights Initialization and problem of vanishing and exploding gradients


Vanishing gradient in long chains
1. Fractions can appear in the chain equation at any layer.
2. Fractions are caused by the derivatives of the non-linear functions (if the non-linear function is a tanh or sigmoid) when the derivative is computed in the flat regions of the function.
3. We end up in the flat regions of the sigmoid / tanh functions when the input to the non-linear transformation (the output of the linear transformation) is very small or very large in magnitude. Remember, in every neuron we first have a linear transformation whose output is passed to the non-linear function.
4. The output of the linear transformation can be very high or very low depending on the random weights assigned in the first iterations.
5. If the random weights are too large, leading to a large value for mx + c, this positions the output value in the flat region of the non-linear curve (near 1).
6. If the random initialization is a small (negative) number, it positions the output in the flat region of the non-linear transformation near 0.


Weights Initialization and problem of vanishing and exploding gradients

Xavier and He weight initialization methods

1. Glorot and Bengio argued that for the signal to flow properly in both directions (neither die out nor explode), we need the variance of the outputs of each layer to be equal to the variance of its inputs. The gradients should likewise have equal variance before and after a layer in backprop.

2. This is possible only if the layer has an equal number of input and output connections. Since this restrains the design, they proposed a compromise that works well in practice.

3. The connection weights must be initialized randomly as per the equation below, where ni and ni+1 are the number of input and output connections for the layer, also called fan-in and fan-out.
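The Glorot/Xavier rule referenced above is commonly written as (a reconstruction; the slide's equation was an image):

    W \sim U\!\left[-\sqrt{\tfrac{6}{n_i + n_{i+1}}},\; \sqrt{\tfrac{6}{n_i + n_{i+1}}}\right]
    \quad \text{or} \quad
    W \sim \mathcal{N}\!\left(0,\; \tfrac{2}{n_i + n_{i+1}}\right)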


Weights Initialization and problem of vanishing and exploding gradients


Xavier and He weight initialization methods

[Figure: activation histograms across layers with Xavier initialization. Credit: Glorot & Bengio.]

[Figure: activation histograms across layers without Xavier initialization. Credit: Glorot & Bengio.]


Weights Initialization and problem of vanishing and exploding gradients


Vanishing gradient – possible solutions to minimize chance of getting stuck in vanishing gradient

1. Do not initialize the weights with zeroes or with arbitrary random values:

   a. If the weights in a network start too small, the inputs fed into the non-linear functions become too small (as discussed earlier), which can lead to the flatter regions.
   b. If the weights in a network start too large, the inputs fed into the non-linear functions become too large, leading to the flatter regions again.

2. Use Xavier initialization technique - initializes the weights in your network by drawing them from a
distribution with zero mean and a specific variance

3. Fan_in is number of input connections into a neuron

4. It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(1 / fan_in)

5. Generally used with tanh activation


Weights Initialization and problem of vanishing and exploding gradients


Vanishing gradient – possible solutions to minimize chance of getting stuck in vanishing
gradient

6. Use He Normal: in this method, the weights are initialized keeping in mind the size of the previous layer, which helps in attaining the global minimum of the cost function faster and more efficiently.
7. Unlike the Xavier method, this method multiplies the variance ratio by 2.
8. The weights are still random but differ in range depending on the size of the previous layer of neurons.
9. This provides a controlled initialization and hence faster and more efficient gradient descent.
10. Generally used with ReLU activation.

11. Weight initialization for a neuron in a layer (see the reconstruction below) –
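The He rule is commonly written as (a reconstruction; the slide's formula was an image):

    W \sim \mathcal{N}\!\left(0,\; \tfrac{2}{n_{\text{in}}}\right),
    \qquad \text{i.e. } \sigma = \sqrt{2 / \text{fan\_in}}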


Weights Initialization and problem of vanishing and exploding gradients


Exploding Gradient

1. Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network weights during training.
2. The magnitudes become so large that they cannot even be held in the underlying hardware, leading to the coefficients becoming "NaN".
3. This can be due to random initialization of weights or due to the non-linear transformation functions used (ones other than sigmoid or tanh).
4. Larger coefficient magnitudes lead to an overfit, unstable neural network, which impacts the generalizability and predictive power of the model in test / production.
5. There are methods to fix exploding gradients, including gradient clipping, weight regularization, gradient scaling, etc.
6. Use of the Xavier and He initialization techniques also reduces the probability of exploding gradients.

Ref: https://mc.ai/xavier-and-he-normal-he-et-al-initialization/
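Gradient clipping is a one-line change in Keras (a sketch; the clipnorm value of 1.0 is an assumption):

    import tensorflow as tf

    # Rescale any gradient whose L2 norm exceeds 1.0 before the weight update
    opt = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
    # model.compile(optimizer=opt, loss="categorical_crossentropy")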


Loss Functions

Regression Loss Functions


Mean Squared Error Loss
Mean Squared Logarithmic Error Loss
Mean Absolute Error Loss

Binary Classification Loss Functions


Binary Cross-Entropy
Hinge Loss
Squared Hinge Loss

Multi-Class Classification Loss Functions


Multi-Class Cross-Entropy Loss
Sparse Multiclass Cross-Entropy Loss
Kullback Leibler Divergence Loss


Large Computational Requirement

4. Error function basics in one dimension. Objective: find a threshold that minimises misclassification.

   a. Global & local minima
   b. Local maxima
   c. Flat regions

[Figure: error magnitude vs. threshold, marking the global maximum, local maxima, local minima, and the global minimum & flat region]


Large Computational Requirement

5. Error function basics in two dimensions

[Figure: two error curves on the respective coefficients combine into a single error surface in two dimensions (coeff1, coeff2)]

Large Computational Requirement


6. We do not know what the error function looks like! We only know what the local gradients look like.
7. This is like feeling your way down to the valley from a hilltop in thick fog!


Large Computational Requirement


8. That means, depending on the starting point, the paths taken and the destination may be different!
9. If we could evaluate all the points and their gradients, we would get a complete picture of the error surface, but this is not possible.
10. Thus the objective is, irrespective of the starting point, to reach the absolute minimum on the invisible surface.


Large Computational Requirement

[Figure: error surface marking flat regions, local maxima, local minima, a saddle point, and the global minimum]

1. Flat regions have very low gradients. As a result, the search for the minimum slows down (sometimes significantly).
2. A local minimum is a point surrounded by points with higher error. Once that point is reached, there is no way to get out!
3. A saddle point is a point which is a local maximum in one direction and a local minimum in the other.
4. The global minimum is the sought goal, which may or may not be reached. It depends on where the journey began.

Gradients of the error function


Error Contours

1. Every ring on the error function represents a combination of coefficients (m1 and m2 in the image) which results in the same quantum of error, i.e. SSE.

2. Let us convert that to a 2-D contour plot. In the contour plot, every ring represents the same magnitude of error.

3. The innermost ring / bull's-eye is the combination of the coefficients that gives the least error.


Contours of non-convex error functions

Gradient Descent Steps – (from a randomly selected starting point)

1. First evaluate d(error)/d(weight) to find the direction of highest increase in error for a unit change in weight (blue arrow): the partial derivative w.r.t. the weight.

2. Next find d(error)/d(bias) to find the direction of highest increase in error for a unit change in bias (green arrow): the partial derivative w.r.t. the bias.

3. Partial derivatives give the gradient along each axis; the gradient itself is a vector.

4. Add the two vectors to get the direction of the gradient (black arrow), i.e. the direction of maximum increase in error.

5. We want to decrease the error, so take the negative of the gradient, i.e. opposite to the black arrow (orange arrow). The arrow tip gives the new values of bias and weight.

6. Recalculate the error at this combination and iterate from step 1 until movement in any direction only increases the error.


[Figure: gradient descent path from randomly selected starting coefficient values]



Loss function – Sum of Squared Errors

1. What is an optimization algorithm and what is its use? Optimization algorithms help us minimize (or maximize) an objective function (another name for the error function) E(x), which is simply a mathematical function of the model's internal learnable parameters, used in computing the target values (Y) from the set of predictors (X) in the model.

2. C = ½((wi·xi + b) − yi)². In this expression, xi and yi come from the data and are given. What the ML algorithm learns is the weight wi and bias b. Thus C = f(wi, b).

3. The optimizer algorithms try to estimate the values of wi and b which, when used, give the minimum or maximum C. In ML we look for the minimum.

Loss function – Cross Entropy

1. Cross-entropy loss (often called log loss) increases as the predicted output diverges from the actual output. The loss is summed over the entire dataset, i.e. i = 1 to n (see the reconstruction below).

2. A perfect prediction would have a log loss of 0.

3. Gradient descent tries to reduce this cross-entropy loss.
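The binary form of the loss, reconstructed (the slide's formula was an image):

    L = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \Big]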


Loss function – Error surface and gradients

Logistic regression or conditional log-likelihood cost function (− log P(y|x) coupled with
softmax outputs) worked much better (for classification problems) than the quadratic cost
which was traditionally used to train feedforward neural networks

Ref: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

1. Black surface – the logistic cost function, which has large gradients.

2. Red surface – the quadratic cost function, which is mostly flat.

Both loss functions will give similar accuracies, but the logistic cost function will reach the minimum faster than the quadratic function due to the difference in the flatness of the surfaces.

Image source - http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf


Optimizer for cost / loss function

1. There are two broad categories of optimization algorithms –

I. First Order Optimization Algorithms —


I. These algorithms minimize or maximize a loss function E(x) using the gradient of the cost function w.r.t. the learnable parameters.
II. Gradient descent is the default choice in most applications of machine learning models.
III. The first-order derivative tells us whether the cost function increases or decreases at a particular value of the learnable parameters.
IV. The first-order derivative basically gives us a line which is tangential to a point on the error surface.
V. A gradient is represented by a Jacobian matrix, which is simply a matrix consisting of first-order partial derivatives (gradients).

II. Second Order Optimization Algorithms—


I. Second-order methods use the second-order derivative, also called the Hessian, to minimize or maximize the loss function.
II. The second-order derivative tells us whether the cost function is at a maximum or a minimum at a point where the first-order derivative is 0.
III. This is not used as frequently as gradient descent due to its computation requirements.
https://web.stanford.edu/class/msande311/lecture13.pdf


Optimizer for cost / loss function

First Order Optimization Algorithms -

Batch Gradient Descent (default) –

   I. The most common, vanilla version is batch gradient descent.
   II. Computes the gradient of the cost function w.r.t. the model parameters over the entire data set before updating the parameters.
   III. Many data points may have errors; all the errors are averaged, and the change in the m parameter is the avg SSE / avg of x.
   IV. Batch gradient descent is slow and computationally intensive, but guaranteed to reach the absolute minimum on convex functions.
   V. We can introduce randomness in the gradients to speed things up, as in SGD.

for i in range(no_of_epochs):  # for every epoch
    # evaluate gradient over the ENTIRE dataset via partial differentiation
    parameterwise_gradient = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * parameterwise_gradient  # update parameters


Gradient Descent variants


Stochastic Gradient Descent (SGD) –
   I. In every iteration, a randomly chosen record is picked to refine the parameters.
   II. Updates the parameters for each training example xi, yi.
   III. Each update is relatively fast to compute, but the parameters fluctuate and may take long to converge.
   IV. You can imagine the point on the error function moving haphazardly towards the minimum.
   V. Non-convex error functions with saddle points and local minima become difficult to get out of.
   VI. Practically very time consuming and resource intensive compared to the mini-batch and batch gradient descent algorithms.

for i in range(no_of_epochs):  # for every epoch
    for record in data:  # one record at a time, in random order
        # evaluate gradient on a single record via partial differentiation
        parameterwise_gradient = evaluate_gradient(loss_function, record, params)
        params = params - learning_rate * parameterwise_gradient  # update parameters


Gradient Descent variants


b. Mini-Batch Gradient Descent –
   I. Takes the best of both batch and stochastic GD. Performs an update for a mini-batch of n input samples.
   II. Leads to faster and more stable convergence to the minimum compared to batch GD and stochastic GD on non-convex surfaces.
   III. Does not guarantee optimal convergence.

for i in range(no_of_epochs):  # for every epoch
    for batch in make_batch(data, batch_size=50):
        # evaluate gradient on the mini-batch via partial differentiation
        parameterwise_gradient = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * parameterwise_gradient  # update parameters


Error surface characteristics & challenges of the gradient descent algorithm

1. Local minima – for non-convex error functions, the gradient descent algorithm can get stuck in a local minimum, i.e. a sub-optimal combination of the parameters that gives the impression of having minimized the error as far as possible.

2. Saddle points – the algorithm may reach a saddle point where there are relatively no or very small gradients (flat regions), where the algorithm stops learning!

http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

3. Even with learning rates, such challenges may make gradient descent inefficient, as the learning rates are not determined from the data.

Gradient Descent optimization algorithms


1. Momentum –

   a. SGD finds it difficult to handle situations where the error surface gradient differs by direction: steep in one while shallow in another. In such cases it oscillates across the slopes of the ravine, making slow progress towards the local optimum.

   b. Momentum helps accelerate SGD towards the relevant minimum quickly. It adds a fraction of the update vector of the previous step to the current update step.

Momentum – a fraction of the previous step's gradient added to the new gradient.

The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
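The standard update equations, reconstructed (the slide's formula was an image; γ is the momentum fraction, typically around 0.9):

    v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta - v_t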


Gradient Descent Optimization Algorithms

2) Nesterov Accelerated Gradient (NAG) –

   a. Tries to build intelligence into the updating step based on approximate future parameter values.
   b. Unlike momentum, this algorithm first estimates the parameters based on the momentum from the previous step, and applies the gradient there to find the final position.
   c. It slows down the movement when approaching a higher gradient, i.e. an increasing slope in the error.

Source: G. Hinton's lecture 6c
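In equations, reconstructed (the lookahead gradient is evaluated at the approximate future position):

    v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta J\big(\theta - \gamma\, v_{t-1}\big), \qquad \theta \leftarrow \theta - v_t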


Gradient Descent Optimization Algorithms


3) Adagrad –

[Figure: contour plot with point P, directions D1 and D2, and learning steps lsd1, lsd2]

1. At point P, the gradient in direction D1 is much higher than in D2.
2. Why should the learning step be the same in both directions?
3. Would it not be advantageous to have larger learning steps in the D2 direction and small steps in D1?


Gradient Descent Optimization Algorithms


3) Adagrad –
a. Adapts the learning rate to the individual parameters rather than using a single learning rate for all.

b. Performs larger updates for infrequent parameters and smaller updates for frequent parameters.

c. Other methods perform an update for all parameters θ at once, as every parameter θi uses the same learning rate η.

d. Adagrad instead uses a different learning rate for every parameter θi at every time step t (see the reconstruction below):
   a. g(t,i) is the partial derivative of the objective function w.r.t. the parameter θi at time step t.
   b. The SGD update for every parameter θi at each time step t then follows.
   c. But Adagrad modifies the general learning rate η at each time step t for every parameter θi based on the past gradients that have been computed for θi.
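The referenced updates, reconstructed (the slide's equations were images; ε is a small smoothing term):

    g_{t,i} = \nabla_\theta J(\theta_{t,i}), \qquad
    \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}\; g_{t,i},
    \qquad G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2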

Source : http://www.deeplearningbook.org/contents/optimization.html


Gradient Descent Optimization Algorithms


3) Adagrad –

a. individually adapts the learning rates of all model parameters by scaling them inversely
proportional to the square root of the sum of all the historical squared values of the
gradient

b. The parameters with the largest partial derivative of the loss have a correspondingly
rapid decrease in their learning rate, while parameters with small partial derivatives
have a relatively small decrease in their learning rate.

c. The net effect is greater progress in the more gently sloped directions of parameter
space. In the context of convex optimization, the AdaGrad algorithm enjoys some
desirable theoretical properties.

d. Empirically, however, for training deep neural network models, the accumulation of
squared gradients from the beginning of training can result in a premature and
excessive decrease in the effective learning rate. AdaGrad performs well for some but
not all deep learning models
Source : http://www.deeplearningbook.org/contents/optimization.html


Gradient Descent Optimization Algorithms


3) Adagrad (Contd) –
a. Adagrad uses a different learning rate for every parameter θi at every time step t. Gt is a diagonal matrix over the parameter space, with element (i,i) being the sum of squared gradients of parameter pi up to time step t, e.g.:

   Gt(ii)   P1    P2
   P1       4.6   0
   P2       0     8.2

An error term ε is added to prevent divide-by-0. The accumulated square-root quantity is like the standard deviation of the gradient for pi.

The learning step is regularized from the default η based on this square-root quantity. A too-frequently updated parameter will need small steps, while a rarely changing parameter will need large steps (its square-root term will be close to the error term).

b. The drawback of Adagrad is a rapidly, monotonically shrinking learning rate for frequently updated parameters.


Gradient Descent Optimization Algorithms

AdaGrad – [Figure: AdaGrad algorithm pseudocode]

Source : http://www.deeplearningbook.org/contents/optimization.html


Gradient Descent Optimization Algorithms


RMSProp –

a. modifies AdaGrad to perform better in the nonconvex setting by changing the gradient
accumulation into an exponentially weighted moving average. AdaGrad is designed to
converge rapidly when applied to a convex function. When applied to a nonconvex
function to train a neural network, the learning trajectory may pass through many
different structures and eventually arrive at a region that is a locally convex bowl

b. AdaGrad shrinks the learning rate according to the entire history of the squared
gradient and may have made the learning rate too small before arriving at such a
convex structure

c. RMSprop and Adadelta came out independently around the same time, stemming from the need to resolve Adagrad's radically diminishing learning rates.

d. RMSprop (proposed by Geoff Hinton) is in fact identical to the first update vector of Adadelta.

Source : http://www.deeplearningbook.org/contents/optimization.html


Gradient Descent Optimization Algorithms


RMSProp (Contd) –

e. RMSProp uses an exponentially decaying average to discard history from the extreme
past, so that it can converge rapidly after finding a convex bowl, as if it were an
instance of the AdaGrad algorithm initialized within that bowl

f. RMSprop divides the learning rate by an exponentially decaying average of squared
gradients. Hinton suggests γ be set to 0.9, while a good default value for the
learning rate η is 0.001.
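
A minimal NumPy sketch of the RMSProp update with these defaults (illustrative, not Hinton's or any library's exact implementation):

import numpy as np

def rmsprop_update(theta, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    # E[g^2]_t: exponentially decaying average of squared gradients
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2
    # divide the learning rate by the RMS of recent gradients
    theta -= lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq

Unlike Adagrad, the decaying average forgets the distant past, so the effective step size does not shrink monotonically.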

Source : http://www.deeplearningbook.org/contents/optimization.html

[RMSProp algorithm pseudocode omitted – Source : http://www.deeplearningbook.org/contents/optimization.html]



4) Adadelta –
a. Adadelta attempts to overcome Adagrad's rapidly shrinking learning rate by defining
a time window w over which to accumulate past gradients, instead of taking all of them

b. Instead of inefficiently storing the last w gradients, the accumulated sum is defined
recursively as a decaying average of all past squared gradients

c. The running average E[g²]_t at time step t depends on a fraction γ of the previous
average and the current gradient:

   E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g_t²

d. Thus the change in parameters is given as:

   Δθ_t = −( RMS[Δθ]_{t−1} / RMS[g]_t ) · g_t,   where RMS[g]_t = √(E[g²]_t + ε)
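
A minimal NumPy sketch of the Adadelta update above (illustrative; note there is no global learning rate η – the step size comes from the RMS of previous updates):

import numpy as np

def adadelta_update(theta, grad, avg_sq_g, avg_sq_dx, gamma=0.9, eps=1e-6):
    avg_sq_g = gamma * avg_sq_g + (1 - gamma) * grad ** 2        # E[g^2]_t
    delta = -np.sqrt(avg_sq_dx + eps) / np.sqrt(avg_sq_g + eps) * grad
    avg_sq_dx = gamma * avg_sq_dx + (1 - gamma) * delta ** 2     # E[Δθ^2]_t
    return theta + delta, avg_sq_g, avg_sq_dx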



5) Adam –
a. Stands for adaptive moments
b. Adds a momentum term to the RMSProp rescaled gradients: it keeps exponentially decaying
averages of both past gradients (first moment) and past squared gradients (second moment),
with bias correction for their zero initialization

Source : http://www.deeplearningbook.org/contents/optimization.html
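
A minimal NumPy sketch of one Adam step with the commonly used defaults (illustrative, not the reference implementation):

import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment: momentum on the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: RMSProp-style rescaling
    m_hat = m / (1 - beta1 ** t)              # bias correction (t is the step count, from 1)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v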



[Figure: comparison of SGD, RMSProp and Adam]

Choosing the right optimization technique –
1. There is currently no consensus
2. Some empirical studies favor algorithms with adaptive learning rates, but no single
algorithm has emerged as the best
3. The most popular ones are RMSProp, RMSProp with momentum, AdaDelta, Adam, SGD with
momentum, and SGD


DROP OUT
1. Dropout is a regularization technique for neural network models

2. Dropout is a technique where randomly selected neurons are ignored during training stage.
They are “dropped-out” randomly.

3. They temporarily do not contribute to the activation of downstream neurons, i.e. they are
removed in the forward pass

4. They are not considered for any weight updates on the corresponding backward pass.

5. In a normal dense network, as the network learns the optimal neural weights, the neurons
become tuned to specific features (specialized).

6. Neighboring neurons rely on this specialization, to give the complete picture of the features
down the line. This specialization can result in a fragile model too specialized to the training
data.

7. This reliance on context during training is referred to as complex co-adaptation.

Ref: Dropout: A Simple Way to Prevent Neural Networks from Overfitting (download the PDF)

8. Dropout forces a neural network to learn more robust features that are useful in conjunction
with many different random subsets of the other neurons.

9. Dropout roughly doubles the number of iterations required to converge. However, training
time for each epoch is less.

10. With K hidden units, each of which can be dropped, we have 2^K possible models. This is
like an ensemble of neural networks! We do not drop neurons from the output layer

11. With N input records, each input effectively trains a different network. These networks
are not entirely independent, as they share many weights

12. In the testing phase, the entire network is used and each activation is scaled by a
factor p, where p is the probability of a neuron being temporarily dropped out during training

13. Suppose p = 50%: during testing, a neuron will on average be connected to twice as many
active input neurons as it was during training. To compensate, each neuron's input connection
weights are multiplied by 0.5

Ref: Dropout: A Simple Way to Prevent Neural Networks from Overfitting (download the PDF)
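
A minimal Keras sketch of these ideas (illustrative; the layer sizes, rates and 784-feature input are arbitrary assumptions). Note that Keras implements inverted dropout: activations are scaled up during training, so no manual rescaling of weights is needed at test time:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dropout(0.2, input_shape=(784,)),  # dropout on the visible (input) units
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),                      # dropout on hidden units, p = 50%
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),   # no dropout on the output layer
])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])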

[Dropout illustration omitted – Ref: Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville]



DROP OUT (Guidelines from original paper)

The original paper on Dropout provides experimental results on a suite of standard machine
learning problems, and shares useful heuristics to consider when applying dropout in practice.

1. Generally, use a small dropout value of 20%-50% of neurons with 20% providing a good starting point. A
probability too low has minimal effect and a value too high results in under-learning by the network

2. Use a larger network. You are likely to get better performance when dropout is used on a larger network,
giving the model more of an opportunity to learn independent representations

3. Use dropout on incoming (visible) as well as hidden units. Application of dropout at each layer of the
network has shown good results

4. Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10
to 100 and use a high momentum value of 0.9 or 0.99

5. Constrain the size of network weights. A large learning rate can result in very large network weights.
Imposing a constraint on the size of network weights such as max-norm regularization with a size of 4 or
5 has been shown to improve results.


Batch Normalization

Batch normalisation is a technique for improving the performance and stability of
neural networks.

The idea is to normalise the inputs of each layer in such a way that they have a mean
output activation of zero and a standard deviation of one.

This is analogous to how the inputs to networks are standardised.

Source - https://medium.com/deeper-learning/glossary-of-deep-learning-batch-normalisation-8266dcd2fa82


Batch Normalization - where to apply

1. Before the activation function (non-linearity)

2. After every dense layer, if required


Batch Normalization - benefits

1. Networks learn faster. Though each iteration will be slower due to extra
computations, overall convergence will be faster
2. Gradient descent steps could be larger compared to a case without batch
normalization.
3. Makes weights easier to initialize – weight initialization can be difficult in deeper
networks (due to vanishing gradient and exploding gradient problems). Batch
normalization helps reduce the sensitivity to initial starting weights
4. More activation functions become viable. Sigmoid, TanH and ReLU suffer from
vanishing / exploding gradient and dying neuron problems respectively. With
batch normalization these challenges are minimized
5. It also seems to work like a regularization technique
6. Overall simplifies creating a deep neural network


Batch Normalization

1. Due to these normalization "layers" between the fully connected layers, the range of the
input distribution of each layer stays the same, regardless of changes in the previous layer.

2. Consider a batch of activations at some layer, with x_k denoting the activation of the
k-th neuron.

3. To make each feature dimension unit Gaussian, use:

   x̂_k = (x_k − E[x_k]) / √(Var[x_k])

   where E[x_k] is the mean and √(Var[x_k]) is the standard deviation of x_k over the batch

Sourced from: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe and Szegedy, 2015


Batch Normalization – caution!

1. For sigmoid and tanh activations, the normalized region (around zero) is largely linear rather than nonlinear.

2. For relu activation, half of the inputs are zeroed out.

3. So, some transformation has to be done to move the distribution away from 0.

4. A scaling factor γ and shifting factor β are used to do this.


Batch Normalization – complete picture

Normalize:

   x̂ = (x − E[x]) / √(Var[x] + ε)

and then allow the network to squash the range if it wants to:

   y = γ · x̂ + β

where γ (scale) and β (shift) are learnable parameters

Sourced from: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe and Szegedy, 2015
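
A minimal Keras sketch with batch normalization placed before the non-linearity, per the placement guideline earlier (illustrative; the layer sizes and input shape are arbitrary assumptions):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(256, use_bias=False, input_shape=(784,)),  # bias is redundant: β shifts instead
    layers.BatchNormalization(),    # learns γ (scale) and β (shift)
    layers.Activation('relu'),      # non-linearity applied after normalization
    layers.Dense(10, activation='softmax'),
])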


Batch Normalization – Mini batches

• Improves gradient flow through the network
• Allows higher learning rates
• Reduces the strong dependence on initialization

Sourced from: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe and Szegedy, 2015


Early Stopping

1. Early Stopping is not only a regularization technique, but is also a mechanism for
preventing a waste of resources when the training is not going in the right direction.

2. Assess the impact of the learning rate – is it acceptable, or is the model stuck (unable to learn)?

In Keras, early stopping is implemented as a callback with the following arguments:
•monitor: quantity to be monitored.
•min_delta: minimum change in the monitored quantity to qualify as an improvement, i.e. an
absolute change of less than min_delta, will count as no improvement.
•patience: number of epochs with no improvement after which training will be stopped.
•verbose: verbosity mode.
•mode: one of {auto, min, max}. In min mode, training will stop when the quantity monitored has
stopped decreasing; in max mode it will stop when the quantity monitored has stopped increasing;
in auto mode, the direction is automatically inferred from the name of the monitored quantity.
•baseline: Baseline value for the monitored quantity to reach. Training will stop if the model
doesn't show improvement over the baseline.
•restore_best_weights: whether to restore model weights from the epoch with the best value of
the monitored quantity. If False, the model weights obtained at the last step of training are used.

keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=0, mode='auto', baseline=None, restore_best_weights=False)
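
A minimal usage sketch (illustrative; model, x_train, y_train, x_val and y_val are assumed to already exist):

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',         # quantity to watch
    min_delta=1e-4,             # smaller improvements count as no improvement
    patience=5,                 # stop after 5 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch's weights
)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=[early_stop])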


Softmax Classifier

● The softmax function is a multinomial logistic classifier, i.e. it can handle multiple classes

● Softmax is typically the last layer of a neural-network-based classifier

● The softmax function is itself an activation function, so it does not need to be combined
with another activation function


Softmax

The softmax function maps a vector s ∈ R^d of real-valued scores to a vector p of
probabilities:

   p_i = e^{s_i} / Σ_j e^{s_j}

Annotation: s – input, R – real numbers, d – dimension, p – probability

Softmax – worked example

Class    Unnormalized        exp →   Unnormalized    normalize →   Probability
         log probability             probability
one      3.15                        23.33                         0.104
two      5.3                         200.3                         0.895
three    −2.1                        0.122                         0.00

(Apply exp element-wise to the scores, then normalize by their sum, 23.33 + 200.3 + 0.122 ≈ 223.8.)
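
A minimal NumPy sketch reproducing the example above (the maximum score is subtracted before exponentiating for numerical stability; this does not change the result):

import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)  # guard against overflow for large scores
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([3.15, 5.3, -2.1])))  # ≈ [0.104, 0.895, 0.0005]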


Hyperparameter tuning

In a real neural network project, hyperparameters can be optimized manually (a difficult
task) or through the use of several third-party hyperparameter optimization tools.

If you use Keras, you can use these libraries for hyperparameter optimization:
Hyperopt, Kopt and Talos

If you use TensorFlow, you can use GPflowOpt for Bayesian optimization, as well as
commercial solutions like Google's Cloud Machine Learning Engine, which provide multiple
optimization options.

Ref: https://medium.com/@mikkokotila/a-comprehensive-list-of-hyperparameter-optimization-tuning-solutions-88e067f19d9


Grid Search

1. Grid search is a semi-automated way of tuning

2. It involves systematically testing multiple values of each hyperparameter, by
automatically retraining the model for each value of the parameter

3. For example, you can perform a grid search for the optimal batch size by automatically
training the model for batch sizes between 10 and 100 samples, in steps of 20

4. The model will run 5 times, and the batch size selected will be the one which yields the
highest accuracy

Pros: Maps out the hyperparameter space to the extent specified and helps locate the
most optimal combination without the need to list out all the permutations manually

Cons: Can be slow to run for large numbers of hyperparameter values, and the most optimal
combination may not be present in the grid!
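
A minimal sketch of this batch-size grid search as a plain training loop (illustrative; build_model is a hypothetical helper returning a fresh compiled Keras model, and the data arrays are assumed to exist):

best_acc, best_bs = 0.0, None
for bs in [10, 30, 50, 70, 90]:  # the 5 grid points from the example above
    model = build_model()        # hypothetical: returns a fresh compiled model
    model.fit(x_train, y_train, batch_size=bs, epochs=20, verbose=0)
    _, acc = model.evaluate(x_val, y_val, verbose=0)
    if acc > best_acc:
        best_acc, best_bs = acc, bs
print('best batch size:', best_bs, 'val accuracy:', best_acc)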


Random Search

1. Instead of testing systematically to cover "promising areas" of the hyperparameter
space, it is preferable to test random values drawn from the entire hyperparameter space

Pros: According to empirical studies, random search provides higher accuracy with fewer
training cycles for problems with high dimensionality

Cons: Execution time depends on the number of permutations, given the hyperparameter
range of values
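
A minimal sketch of random search over two hyperparameters (illustrative; build_model is the same hypothetical helper as in the grid-search sketch, here accepting a learning rate):

import random

best_acc, best_cfg = 0.0, None
for _ in range(10):  # 10 random trials drawn from the whole space
    cfg = {
        'lr': 10 ** random.uniform(-4, -2),              # log-uniform in [1e-4, 1e-2]
        'batch_size': random.choice([16, 32, 64, 128]),
    }
    model = build_model(lr=cfg['lr'])
    model.fit(x_train, y_train, batch_size=cfg['batch_size'], epochs=20, verbose=0)
    _, acc = model.evaluate(x_val, y_val, verbose=0)
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg
print(best_cfg, best_acc)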


Thank You
