DNN Hyperparameter Tuning
1. When a neural network becomes deep, the neurons in the initial layers (closest to the input layer) become slow / sluggish in learning
2. This is due to the vanishing gradient problem. Sometimes its opposite, exploding gradients, hampers the learnability of the neurons in the first few layers
3. Though the neural network may perform acceptably in training, it will perform poorly in production
4. This is because the first few layers are supposed to learn to extract the features from the input images, and the learning process should optimize their weights
5. Neural networks can also easily overfit when the number of weights to learn is large while the input data is small
6. In such a situation, having too many layers, each with multiple neurons, while the input data is limited amounts to too many parameters for too little data
7. The models become complex, overfit, and suffer in production from a high degree of variance error
8. Thus, in summary, the challenge is how to create deep neural networks while avoiding these problems
DNN – Parameters
1. In every model, parameters are used to train the model and make predictions. There are, however, two types of parameters: those that help learn (a.k.a. hyperparameters) and those that help predict (a.k.a. model parameters)
2. Hyperparameters come into play while the algorithm of choice is building the model. The objective is to make the model accurate and generalizable
3. Model parameters are the result of learning / model building and represent the learnt relationships between the target and predictor variables. They are used only for prediction
Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that are of no use on test data
Capacity / Regularization
1. All machine learning algorithms have hyperparameter settings that we can use to control the algorithm's behavior
2. The values of hyperparameters are not learnt from the data. They have to be fixed outside the model being built
3. One way to look at neural networks with fully-connected layers is that they define a family of functions that are parameterized by the weights of the network
4. A natural question that arises is: what is the representational power of this family of functions? In particular, are there functions that cannot be modeled with a neural network?
5. It turns out that neural networks with at least one hidden layer are universal approximators
Capacity / Regularization
1. That is, it can be shown (intuitive explanation from Michael Nielsen) that given any continuous function f(x) and some ϵ > 0, there exists a neural network g(x) with one hidden layer (with a reasonable choice of non-linearity, e.g. sigmoid) such that ∀x, |f(x) − g(x)| < ϵ
2. In other words, a neural network with one hidden layer can approximate any continuous function. If so, why do we need deep neural networks?
3. It has been empirically demonstrated that deep networks are more efficient than a single-hidden-layer network in terms of the parameters needed to represent a function. With a given number of parameters, a larger family of functions can be approximated by a deep network
4. The problem is that, with such high capacity to represent any function, neural networks can easily overfit. Hence regularization for neural networks is a must
5. Regularizing means controlling the behavior of the machine learning algorithm while it builds a neural network. Regularization is the task of finding the optimal values of the various parameters which help the neural network best represent a given function
6. Finding the optimal combination is like finding a needle in a huge haystack!
Hyperparameters
Batch Size –
1. A gradient descent hyperparameter that controls the number of training samples to work through before the model's internal parameters are updated
2. At the end of each batch, the error is calculated and used to update the coefficients through back propagation (see the sketch below)
a. Batch Gradient Descent: Batch Size = Size of Training Set
b. Stochastic Gradient Descent: Batch Size = 1
c. Mini-Batch Gradient Descent: 1 < Batch Size < Size of Training Set
Epochs –
a. A gradient descent hyperparameter that controls the number of complete passes through the training dataset
b. One epoch means that each sample in the training dataset is used once to update the coefficients
c. An epoch consists of one or more batches. An epoch that has one batch corresponds to the batch gradient descent learning algorithm
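A minimal Keras sketch of where these two hyperparameters are set when fitting a model; the dummy data, layer sizes and the model itself are illustrative assumptions, not the course notebook.

```python
import numpy as np
from tensorflow import keras

X_train = np.random.rand(1000, 20)          # 1000 samples, 20 features (dummy data)
y_train = np.random.randint(0, 2, (1000,))  # binary targets (dummy data)

model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

# batch_size = 1               -> stochastic gradient descent
# batch_size = len(X_train)    -> batch gradient descent
# 1 < batch_size < len(X_train) -> mini-batch gradient descent (used below)
model.fit(X_train, y_train, epochs=10, batch_size=32)
```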
Activation Functions
1. The last step in a neuron is the application of an activation function. The choice of activation function decides the ANN's learning capability. There are different activation functions
2. The entire power of an artificial neural network comes from the activation function. Without it, the power of the ANN collapses to that of a single neuron, which cannot handle non-linear classifications such as XOR
3. The activation functions are non-linear, i.e. more than the simple addition and multiplication, which are linear operations. It is this non-linearity that gives the artificial neural network its power to approximate any complex function
4. In software implementations of ANNs, we can set a different activation function for each layer but not for each neuron (see the sketch below)
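A hedged sketch of this per-layer activation choice in Keras, assuming an arbitrary 20-feature input; the layer sizes are illustrative only.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,)),  # hidden layer 1: ReLU
    keras.layers.Dense(32, activation='tanh'),                     # hidden layer 2: tanh
    keras.layers.Dense(10, activation='softmax'),                  # output layer: softmax
])
model.summary()
```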
Activation Functions -
Activation Functions
Basic forms of activation functions include –
Linear functions –
1. The input and output floating point numbers are linearly related and expressed in the form y = mx + c
2. When activation functions are linear, they only perform multiplication and addition, so they are of no use in an ANN: they collapse the network into one single neuron
Step function –
1. A function similar to a staircase function but with only one step. Up until the threshold value the output is one value, and on reaching the threshold it changes to another
ReLU –
1. The most popular non-linear activation is the ReLU (Rectified Linear Unit). The name comes from electronics, where a rectifier component prevents negative voltage from passing through a circuit
2. If the input is less than 0 (negative), the output is 0. Otherwise the output equals the input
3. Fast to train and easy to implement, but it can run into problems for inputs < 0: a neuron whose input stays negative outputs only 0
4. That neuron will stop firing anything other than 0; it is literally dead
Leaky ReLU –
Advantages
Leaky ReLUs are one attempt to fix the "dying ReLU" problem by having a small negative slope (of 0.01, or so); see the sketch below
Disadvantages
Because it retains (piecewise) linearity, it can struggle with complex classification tasks and lags behind the Sigmoid and Tanh for some use cases
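A possible way to use Leaky ReLU as a separate layer in Keras; the architecture is an illustrative assumption, and the argument name for the negative slope (alpha here) may differ across Keras versions.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, input_shape=(20,)),   # no activation on the Dense layer itself
    keras.layers.LeakyReLU(alpha=0.01),          # small negative slope keeps gradients alive for x < 0
    keras.layers.Dense(1, activation='sigmoid'),
])
```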
Parametric ReLU –
1. The negative-side slope is treated as a parameter. Do not set the parameter to 1, as that would remove the kink (and make the function linear)
Shifted ReLU –
Moves the kink down and to the left
Note: Mathematically, the derivative (rate of change of output for a unit change in input) is not defined at the kink. However, there are mathematical tricks that can be employed to overcome this problem with ReLU functions.
Exponential ReLU (ELU) –
Sigmoid –
Advantages
1. It is non-linear in nature, i.e. it projects a linear data set into a non-linear space
2. It gives a continuous, differentiable activation, unlike the step function
3. Suitable for classification as it maps to the probability range 0 – 1
4. Squashes input values into a range of 0 – 1, hence the output is never too large in magnitude
Disadvantages
1. At the extremes of X, the output responds very little to changes in X (small gradients)
2. It gives rise to the problem of "vanishing gradients": sigmoids saturate and kill gradients
3. As the gradient curve (blue) for the sigmoid is almost flat at the extremes, changes in the input have very little impact on the output
4. The network refuses to learn further or becomes drastically slow (depending on the use case, until the gradient / computation hits floating point limits and explodes)
Tanh –
1. At an input value of 0 the output is also 0. Tanh also suffers from the vanishing gradient and exploding gradient problems
2. The Sigmoid and Tanh functions squash the input values into the ranges (0, 1) and (−1, 1) respectively
Advantages
The gradient is stronger for tanh than for sigmoid (the derivatives are steeper)
Disadvantages
Tanh also has the vanishing gradient problem
1. A drawback of smooth functions is that they produce useful results only in a small range of input values where the gradients can be meaningfully calculated
2. At extreme input values the gradients flatten out, which hampers the learning process
1. These functions self-normalize, which mitigates the problem of vanishing and exploding gradients
2. The ELU function is scaled to keep the activations at that mean and variance
3. Weight initialization is also another important aspect. It is suggested to sample weights from a Gaussian distribution with mean 0 and variance 1/n, where n is the number of weights
1. Instead, ReLU and its variants are often used in the hidden layers in deep learning
2. Use the Softmax function for the output layer in classification problems and a linear function if it is a regression problem
Ref: https://keras.io/activations/
DL_Hyperparameter_Tuning.ipynb
1. The blue line is for a linear function and, surprisingly, it gives 92%+ accuracy!
2. However, all the other standard activations (ReLU, tanh, Sigmoid) perform relatively better
3. All converge to roughly the same 98% accuracy score
4. Any further increase in epochs may not help, as all of them stop improving beyond that point
5. Sigmoid is the slowest learner, followed by tanh, while ReLU seems to learn fast... (Why?)
6. The different non-linear activation functions deliver the same end result; the difference is in how fast they learn
7. There is no one single best activation function. Each can give relatively better performance under different conditions
Number of Layers
The Input Layer
1. Every NN has exactly one layer of input neurons, which are not really neurons. Some call them pass-through neurons
2. The number of neurons comprising this layer is completely and uniquely determined by the number of attributes in the training data
3. Besides these neurons, we may also have a neuron dedicated to holding the bias term
1. Conceptually, one hidden layer with multiple hidden nodes is equivalent to a projection into a higher dimensional space, and it has been shown that it can model most complex functions with reasonable results
2. However, deep networks have a higher parameter efficiency than shallow ones. Parameter efficiency is the number of parameters required to represent a function. A deep neural network can represent the same complex function with fewer neurons
3. The fewer the neurons, the faster the training and the less demanding it is on computational resources. Overall, the neural network can also reach higher performance for the same number of neurons
4. Deep neural networks take advantage of the hidden hierarchies in real-world data, where high-level features consist of lower-level features. E.g. an object consists of boundaries, a boundary consists of sub-boundaries, etc.
https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
5. If we consider high-level features as functions, these functions can be represented as sub-functions and sub-sub-functions... DNNs have a natural affinity towards such structured data
6. In many problems, we can start with just one or two hidden layers and get good results, e.g. 97% accuracy on the MNIST dataset with one hidden layer and a few hundred neurons, and 98% accuracy with the same number of neurons split across two layers!
7. Gradually start increasing the number of layers and neurons while checking for overfitting of the data
DL_Hyperparameter_Tuning.ipynb
As the number of layers increases, there is a slight drop in test accuracy and the model seems to become relatively unstable
1. Another knob we can turn is the number of nodes in each hidden layer
2. This is called the width of the layer. Making layers wider tends to scale the number of parameters faster than adding more layers
3. Every time we add a single node to layer i, we have to give that new node an edge to every node in layer i+1. This is akin to projecting data into a higher dimensional space
4. For the number of hidden layers, there is no rule of thumb except that the deeper the network, the more efficient the model becomes in terms of representing complex functions
5. But how many neurons per hidden layer? That is, how should the neurons be distributed across the different layers? Some use the funnel approach, where the mouth of the funnel is the first hidden layer and the nozzle is the last hidden layer (see the sketch below)
6. The belief is that the many sub-features represented by the first few hidden layers can coalesce into higher-level features in the subsequent layers, hence fewer neurons in subsequent layers
7. Finding the right combination of the number of neurons in each layer is a challenge
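A sketch of the funnel layout described above, assuming an MNIST-sized input of 784 features and 10 classes; the exact layer widths are illustrative, not a recommendation.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(256, activation='relu', input_shape=(784,)),  # funnel mouth: widest hidden layer
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(64, activation='relu'),                       # funnel nozzle: narrowest hidden layer
    keras.layers.Dense(10, activation='softmax'),                    # e.g. 10 output classes
])
```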
1. In 1957, Kolmogorov proved that any real-valued, continuous function f() defined on an n-dimensional unit cube can be represented as a sum of continuous functions of a single variable
2. Hecht-Nielsen rephrased it to state that any continuous function f() defined on an n-dimensional unit cube can be implemented exactly by a 3-layer network with (2n + 1) hidden nodes
3. However, the functions g and Ø involved are highly non-smooth, unlike the sigmoid functions used in NNs. Hence these theoretical results should not be taken literally
4. The theorem does not tell us how to determine the number of neurons. It only states that representing the function is possible
5. More recently, Huang and Babri lowered these bounds, proving that a single-hidden-layer NN with at most Nh hidden neurons can learn Ns distinct samples with zero error. This is true for any bounded, non-linear activation function which has a limit at one infinity. Thus the upper bound for SLFNs is Nh <= Ns
6. Huang later extended the work of Tamura and Tateishi to rigorously prove that the upper bound on the number of hidden nodes Nh for a two-hidden-layer NN with sigmoid activation function is given by the equation below, where No is the number of outputs. These can learn at least Ns distinct samples with any degree of precision
7. It is well known that functions which are linearly separable require no hidden nodes at all. For the rest, the discussions above indicate that, within the constraints of their respective theorems, a function of any complexity can be reproduced with any degree of precision within the specified bounds
8. However, if the training set contains noise, as is the case in most practical situations, exactly reproducing the training set will guarantee overfitting. This is undesirable
9. In practice, therefore, the number of hidden nodes must necessarily be lower than these bounds and will depend on a number of specific factors such as the degree of noise in the data, the complexity of the function we are modeling, the number of inputs and outputs, etc.
DL_Hyperparameter_Tuning.ipynb
Learning Rate
1. In order for gradient descent to work efficiently, especially when the error surface is complex, we set λ (the learning rate). This parameter determines how quickly the optimal values of the coefficients are found
2. The learning rate hyperparameter controls the magnitude of the change in the parameter weights during gradient descent
3. A learning rate that is too small leads to slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge (explode)
4. Adjusting the learning rate – adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold (see the callback sketch after this list)
5. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics
6. Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our
features have very different frequencies, we might not want to update all of them to the same extent,
but perform a larger update for rarely occurring features
7. Another key challenge of minimizing highly non-convex error functions common for neural networks
is avoiding getting trapped in their numerous suboptimal local minima.
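A sketch, assuming a Keras model and placeholder arrays X_train / y_train, of the two annealing strategies just described: a pre-defined schedule and a threshold on the change in the validation objective.

```python
from tensorflow import keras

def step_decay(epoch, lr):
    # pre-defined schedule: halve the learning rate every 10 epochs
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

callbacks = [
    keras.callbacks.LearningRateScheduler(step_decay),
    # threshold-based annealing: reduce the LR when val_loss stops improving
    keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3),
]
# model.fit(X_train, y_train, epochs=50, validation_split=0.2, callbacks=callbacks)
```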
https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10
1. For a small learning rate the accuracy is low, and it apparently increases with a higher learning rate
2. A lower learning rate needs more epochs as the learning is slow. Hence, within the given number of epochs it appears to give low accuracy, which increases with a higher learning rate
3. The error surface looks relatively smooth. With a higher learning rate, even with few epochs the network reaches high accuracy and the training and validation accuracies converge
4. Given cleansed data, this will not always be the case. Usually the validation and training scores will deviate due to overfitting
Weights Initialization
1. This concerns the first set of weights assigned at the start of the training stage. These weights used to be randomly assigned in the early days of DNNs
2. Weights were generated from a uniform random distribution, i.e. all values between two extremes are taken with equal probability. Though the weights are assigned randomly, they are expected to converge to optimal values through the learning process
3. Normal initialization uses the normal distribution – also known as the Gaussian distribution – of random numbers. In this approach randomly drawn weights are likely to be closer to the central value, i.e. 0 (0 is most likely). The random values are replaced with optimal values through the learning process
4. Three well-researched strategies for initial weight selection include the LeCun uniform, Glorot (a.k.a. Xavier) uniform and He uniform algorithms, which draw from a uniform distribution, and their normal-distribution counterparts, i.e. LeCun Normal, Glorot Normal, He Normal
Ref:
https://keras.io/initializers/
https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404
2. This behavior of DNNs made them difficult to train, and DNNs were almost abandoned. In 2010 Xavier Glorot and Yoshua Bengio published a paper explaining the reason behind these problems:
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
3. They suspected the combination of the then-popular non-linear activation functions (Sigmoid and Tanh) with random weight initialization from a normal distribution centered at 0 with a standard deviation of 1
4. They showed that under this approach the variance of the output of a hidden layer becomes greater
and greater than the variance of the input of that layer. Going forward in the network the variance
keeps increasing
5. The activation functions in the last few layers / top layers saturate i.e. reach the flat regions of their
sigmoid functions given the magnitude of the inputs
6. As a result, when the back propagation kicks in, the error gradients in the top layers are small and
when these small error gradients are propagated back, they tend to become smaller and smaller!
Gradient of a sigmoid function
1. The sigmoid and tanh functions have a flat slope for most values of x. That means for most values of x the rate of change of the function, i.e. dσ(x)/dx, will be close to 0, as can be seen in the red dashed curve (the plot of the derivative of the sigmoid function)
2. The maximum value of the derivative of the sigmoid is 0.25, i.e. 1/4th, a small fraction (a quick numerical check follows below)
3. In the flat regions the derivative is very small, close to 0, as the curve is almost flat there. When stuck in such regions, the change in weights (look at the formula below) is very small or 0, i.e. learning stops
W_new = W_old − η · dE/dW
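A quick NumPy check of the two claims above about the sigmoid derivative; it is a standalone illustration, not part of the course notebook.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
sigma = 1.0 / (1.0 + np.exp(-x))
dsigma = sigma * (1.0 - sigma)        # derivative of the sigmoid

print(dsigma.max())                   # ~0.25, reached at x = 0
print(dsigma[x > 5].max())            # < 0.01 in the saturated (flat) region
```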
[Figure: a single neuron as a chain – input I → linear function LF() → non-linear function NLF() → output O → error E = f(O, Y)]
[Figure: two neurons chained – input i1 → LF()1 → NLF()1 → output O1 = i2 → LF()2 → NLF()2 → output O → error E = f(O, Y)]
dE/di1 = dLF()1/di1 * dNLF()1/dLF()1 * di2/dNLF()1 * dLF()2/di2 * dNLF()2/dLF()2 * dO/dNLF()2 * dE/dO
a. As we add more and more layers to a neural network (make it deeper), the chain equation extends across layers as shown above
b. The chain of equations with the multiplication operator helps estimate the effect on E of a change in i1
c. i1 is the input to the first neuron, which is the input value multiplied by a coefficient / slope (not shown separately)
d. Similarly, i2 is the output of the first neuron multiplied by a coefficient (not shown separately)
e. As a result of this chain, there is a possibility of fractions appearing in the equation, either because of the random values of the slopes fixed in the first iteration or because of the derivative of the non-linear function
f. If the non-linear function is a sigmoid or tanh function, fractions are guaranteed on differentiation
g. Fractions multiply to give smaller fractions, and as a result dE/di1, i.e. the change in the overall error for a change in the input of the first-layer neuron, becomes very small in magnitude, leading to no significant change in the coefficient
h. Note: the change in a coefficient at any stage is W_new = W_old − η · dE/dW
1. Glorot and Bengio argued that for the signal to flow properly in both directions (neither die out nor explode) we need the variance of the outputs of each layer to be equal to the variance of its inputs. Also, the gradients should have equal variance before and after a layer in backprop
2. This is possible only if the layer has an equal number of input and output connections. Since this constrains the design, they proposed a compromise that works well in practice
3. The connection weights must be initialized randomly as per the equation below, where ni and ni+1 are the number of input and output connections for the layer, also called fan-in and fan-out
1. Use the Xavier initialization technique – it initializes the weights in your network by drawing them from a distribution with zero mean and a specific variance
2. It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(1 / fan_in)
3. Use He Normal – in this method, the weights are initialized keeping in mind the size of the previous layer, which helps in attaining the global minimum of the cost function faster and more efficiently
4. Unlike in the Xavier method, in this method we multiply the ratio by 2
5. The weights are still random but differ in range depending on the size of the previous layer of neurons
6. This provides a controlled initialization and hence faster and more efficient gradient descent
7. It is generally used with the ReLU activation (see the sketch below)
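A sketch of how the Glorot (Xavier) and He initializers are typically selected per layer in Keras; the architecture and the pairing of initializer with activation are illustrative assumptions.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='tanh', kernel_initializer='glorot_uniform',
                       input_shape=(20,)),                                   # Xavier, often paired with tanh/sigmoid
    keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal'),  # He, often paired with ReLU
    keras.layers.Dense(1, activation='sigmoid', kernel_initializer='glorot_normal'),
])
```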
1. Exploding gradients are a problem where large error gradients accumulate and result in very large updates to the neural network model weights during training
2. The magnitudes become so large that they cannot even be held by the underlying hardware, leading to the coefficients becoming NaN
3. This can be due to the random initialization of the weights or to the use of non-linear transformation functions other than Sigmoid or tanh
4. Coefficients of large magnitude lead to an overfit and unstable neural network, which impacts the generalizability and predictive power of the model in test / production
5. There are methods to fix exploding gradients, which include gradient clipping, weight regularization, gradient scaling, etc. (see the sketch below)
6. Use of the Xavier and He initialization techniques also reduces the probability of exploding gradients
Ref: https://mc.ai/xavier-and-he-normal-he-et-al-initialization/
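A minimal sketch of gradient clipping in Keras, one of the fixes listed above; the optimizer settings are illustrative assumptions. clipnorm rescales any gradient whose L2 norm exceeds the given value, while clipvalue would instead clip element-wise.

```python
from tensorflow import keras

# rescale gradients whose L2 norm exceeds 1.0 before applying the update
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)
# model.compile(optimizer=optimizer, loss='mse')
```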
Loss Functions
[Figure: error curve showing a local maximum, local minima, the global minimum, a flat region and a threshold]
[Figure: error surface showing flat regions, local maxima, local minima, a saddle point and the global minimum]
1. Flat regions have very low gradients. As a result, the search for the minimum slows down (sometimes significantly)
2. A local minimum is a point surrounded by points with higher error. Once that point is reached, there is no way to get out!
3. A saddle point is a point which is a local maximum in one direction and a local minimum in another
4. The global minimum is the sought goal, which may or may not be reached. It depends on where the journey began
Error Contours
Randomly selected starting point
Random starting coefficient values
1. What is an optimization algorithm and what is its use? Optimization algorithms help us minimize (or maximize) an objective function (another name for the error function) E(x), which is simply a mathematical function of the model's internal learnable parameters that are used in computing the target values (Y) from the set of predictors (X) used in the model
2. C = ½((wi·xi + b) − y)². In this expression xi and y come from the data and are given. What the ML algorithm learns is the weight wi and the bias b. Thus C = f(wi, b)
3. The optimizer algorithms try to estimate the values of wi and b which, when used, give the minimum or maximum C. In ML we look for the minimum
1. Cross-entropy loss (often called Log loss) increases as the predicted output diverges from
actual output
The logistic regression or conditional log-likelihood cost function (−log P(y|x), coupled with softmax outputs) worked much better for classification problems than the quadratic cost which was traditionally used to train feedforward neural networks (see the sketch below)
Ref: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
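A sketch contrasting the two loss choices in Keras: cross-entropy with softmax outputs for classification and the quadratic cost (MSE) with a linear output for regression. The tiny single-layer models are placeholders for illustration only.

```python
from tensorflow import keras

# classification: softmax output + categorical cross-entropy (log loss)
clf = keras.Sequential([keras.layers.Dense(10, activation='softmax', input_shape=(20,))])
clf.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# regression: linear output + quadratic (mean squared error) cost
reg = keras.Sequential([keras.layers.Dense(1, activation='linear', input_shape=(20,))])
reg.compile(optimizer='adam', loss='mse', metrics=['mae'])
```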
V. Non-convex error functions with saddle points and local minima are difficult to get out of
VI. Practically very time consuming and resource intensive compared to the mini-batch and batch gradient descent algorithms
2. Saddle points – the algorithm may reach a saddle point where there are little or no gradients (flat regions) and the algorithm stops learning!
http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
3. Even with learning rates, such challenges may make gradient descent inefficient, as the learning rates are not determined from the data
a. SGD finds it difficult to handle situations where the error surface gradient differs in each direction – steep in one while shallow in another. In such cases it oscillates across the slopes of the ravine, making slow progress towards the local optimum
b. Momentum helps accelerate SGD towards the relevant minimum quickly. It adds a fraction of the update vector (a combination of the gradients) of the previous step to the current update step (see the sketch below)
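A NumPy sketch (under assumed values, not the notebook's code) of the momentum update just described: a fraction mu of the previous update vector is added to the current step.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    velocity = mu * velocity - lr * grad   # keep a decaying memory of past update directions
    w = w + velocity                       # move along the accumulated direction
    return w, velocity

w = np.zeros(3)
velocity = np.zeros(3)
for grad in [np.array([1.0, 0.1, -0.5])] * 5:   # dummy repeated gradient
    w, velocity = sgd_momentum_step(w, grad, velocity)
print(w)   # steps grow along the consistent gradient direction
```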
a. Adagrad performs larger updates for infrequent parameters and smaller updates for frequent parameters
b. Other methods perform an update for all parameters θ at once, as every parameter θi uses the same learning rate η
c. Adagrad instead uses a different learning rate for every parameter θi at every time step t
d. g(t,i) is the partial derivative of the objective function w.r.t. the parameter θi at time step t
e. The SGD update for every parameter θi at each time step t then becomes…
f. Adagrad modifies the general learning rate η at each time step t for every parameter θi based on the past gradients that have been computed for θi
Source : http://www.deeplearningbook.org/contents/optimization.html
a. individually adapts the learning rates of all model parameters by scaling them inversely
proportional to the square root of the sum of all the historical squared values of the
gradient
b. The parameters with the largest partial derivative of the loss have a correspondingly
rapid decrease in their learning rate, while parameters with small partial derivatives
have a relatively small decrease in their learning rate.
c. The net effect is greater progress in the more gently sloped directions of parameter
space. In the context of convex optimization, the AdaGrad algorithm enjoys some
desirable theoretical properties.
d. Empirically, however, for training deep neural network models, the accumulation of
squared gradients from the beginning of training can result in a premature and
excessive decrease in the effective learning rate. AdaGrad performs well for some but
not all deep learning models
Source : http://www.deeplearningbook.org/contents/optimization.html
Gt(ii)   P1    P2
P1       4.6   0
P2       0     8.2
The learning step has to be scaled down from the default η based on the square-root quantity: a frequently updated parameter (large accumulated squared gradient) needs small steps, while a rarely changing parameter needs large steps (its square-root term stays close to the small error term). See the sketch below.
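A NumPy sketch of the per-parameter scaling idea behind Adagrad, using made-up gradients; the variable names and values are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.1, eps=1e-8):
    cache = cache + grad ** 2                    # per-parameter accumulated squared gradient, Gt(ii)
    w = w - lr * grad / (np.sqrt(cache) + eps)   # per-parameter effective learning rate
    return w, cache

w, cache = np.array([1.0, 1.0]), np.zeros(2)
grads = [np.array([2.0, 0.1]), np.array([1.5, 0.0]), np.array([2.2, 0.1])]
for g in grads:
    w, cache = adagrad_step(w, g, cache)
print(w, cache)   # the frequently updated parameter ends up taking smaller steps
```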
AdaGrad –
Source : http://www.deeplearningbook.org/contents/optimization.html
RMSProp –
a. Modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl
b. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure
c. RMSProp and Adadelta came out independently around the same time, stemming from the need to resolve AdaGrad's radically diminishing learning rates
Source : http://www.deeplearningbook.org/contents/optimization.html
f. RMSProp uses an exponentially decaying average to discard history from the extreme
past so that it can converge rapidly after finding a convex bowl, as if it were an
instance of the AdaGrad algorithm initialized within that bowl
Source : http://www.deeplearningbook.org/contents/optimization.html
RMSProp –
Source : http://www.deeplearningbook.org/contents/optimization.html
Source : http://www.deeplearningbook.org/contents/optimization.html
Adam
Choosing the right optimization technique –
1. There is currently no consensus
2. Some empirical studies favor algorithms with adaptive learning rates, but no single algorithm has emerged as the best
3. The most popular ones – RMSProp, RMSProp with momentum, AdaDelta, Adam, SGD with momentum, SGD (see the sketch below)
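A sketch of how the optimizers listed above are typically instantiated in Keras; the learning-rate values shown are common defaults used for illustration, not recommendations.

```python
from tensorflow import keras

optimizers = {
    'sgd':          keras.optimizers.SGD(learning_rate=0.01),
    'sgd_momentum': keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    'rmsprop':      keras.optimizers.RMSprop(learning_rate=0.001),
    'adadelta':     keras.optimizers.Adadelta(learning_rate=1.0),
    'adam':         keras.optimizers.Adam(learning_rate=0.001),
}
# model.compile(optimizer=optimizers['adam'], loss='categorical_crossentropy')
```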
DROP OUT
1. Dropout is a regularization technique for neural network models
2. Dropout is a technique where randomly selected neurons are ignored during the training stage. They are "dropped out" randomly
3. They are temporarily removed in the forward pass and do not contribute to the activation of downstream neurons
4. They are not considered for any weight updates on the corresponding backward pass
5. In a normal dense network, as the network learns the optimal neural weights, the neurons become tuned to specific features (specialized)
6. Neighboring neurons rely on this specialization to give the complete picture of the features down the line. This specialization can result in a fragile model that is too specialized to the training data
7. This reliance on context for a neuron during training is referred to as complex co-adaptation
Ref: Dropout: A Simple Way to Prevent Neural Networks from Overfitting (download the PDF)
DROP OUT
8. Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons
9. Dropout roughly doubles the number of iterations required to converge. However, the training time per epoch is less
10. With K hidden units, each of which can be dropped, we have 2^K possible models. This is like an ensemble of neural networks! We don't drop neurons from the output layer
11. With N input records, each input essentially trains a different network. These networks are not entirely independent as they share many weights
12. In the testing phase, the entire network is considered and each activation is reduced by a factor p, where p is the probability of a neuron being temporarily dropped out
13. Suppose p = 50%; during testing a neuron will be connected to twice as many input neurons as it was, on average, during training. To compensate for this, we need to multiply each neuron's input connection weights by 0.5
Ref: Dropout: A Simple Way to Prevent Neural Networks from Overfitting (download the PDF)
DROP OUT
1. Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability too low has minimal effect and a value too high results in under-learning by the network (a Keras sketch follows below)
2. Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations
3. Use dropout on incoming (visible) as well as hidden units. Application of dropout at each layer of the network has shown good results
4. Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99
5. Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights, such as max-norm regularization with a size of 4 or 5, has been shown to improve results
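A Keras sketch applying dropout to the visible and hidden units, roughly along the lines of the guidance above; sizes and rates are illustrative assumptions.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dropout(0.2, input_shape=(784,)),   # dropout on the incoming (visible) units
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dropout(0.5),                       # dropout on a hidden layer
    keras.layers.Dense(10, activation='softmax'),    # no dropout on the output layer
])
# Note: Keras implements inverted dropout (it rescales the kept activations at
# training time), so no manual weight scaling is needed at test time.
```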
Batch Normalization
The idea is to normalise the inputs of each layer in such a way that they have
a mean output activation of zero and standard deviation of one.
Source - https://medium.com/deeper-learning/glossary-of-deep-learning-batch-normalisation-8266dcd2fa82
1. Networks learn faster. Though each iteration will be slower due to the extra computations, the overall convergence is faster
2. Gradient descent steps can be larger compared to a case without batch normalization
3. It makes weights easier to initialize – weight initialization can be difficult in deeper networks (due to the vanishing and exploding gradient problems). Batch normalization helps reduce the sensitivity to the initial starting weights
4. More activation functions become viable. Sigmoid and TanH suffer from the vanishing / exploding gradient problem and ReLU from the dying-neuron problem; with batch normalization these challenges are minimized
5. It also seems to act as a regularization technique
6. Overall, it simplifies creating a deep neural network (see the sketch below)
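A sketch of batch normalization layers inserted between fully connected layers in Keras; the architecture is an illustrative assumption. The layer learns the scale and shift parameters described on the following slides.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, input_shape=(20,)),
    keras.layers.BatchNormalization(),        # normalize, then learn scale (gamma) and shift (beta)
    keras.layers.Activation('relu'),
    keras.layers.Dense(64),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('relu'),
    keras.layers.Dense(10, activation='softmax'),
])
```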
Batch Normalization
1. Due to these normalization "layers" between the fully connected layers, the range of the input distribution of each layer stays the same, no matter the changes in the previous layer
2. Given activations x from the k-th neuron, consider a mini-batch of activations at some layer: each activation is normalized using the mini-batch mean μ_B and standard deviation σ_B, i.e. x̂_k = (x_k − μ_B) / σ_B
Sourced from: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, by Ioffe and Szegedy, 2015
1. For sigmoid and tanh activations, the normalized region (around 0) is more linear than non-linear
2. So some transformation has to be done to move the distribution away from 0
Normalize, then scale and shift: y_k = γ_k · x̂_k + β_k, where γ_k and β_k are learnable parameters
Sourced from: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, by Ioffe and Szegedy, 2015
Sourced from: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, by Ioffe and Szegedy, 2015
Early Stopping
1. Early stopping is not only a regularization technique but also a mechanism for preventing a waste of resources when the training is not going in the right direction (see the callback sketch below)
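A minimal sketch of early stopping as a Keras callback, assuming placeholder training arrays X_train / y_train.

```python
from tensorflow import keras

# stop when val_loss has not improved for 5 epochs and roll back to the best weights
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                            restore_best_weights=True)
# model.fit(X_train, y_train, epochs=200, validation_split=0.2, callbacks=[early_stop])
```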
Softmax Classifier
● The Softmax function is a multinomial logistic classifier, i.e. it can handle multiple classes
● The Softmax function is itself an activation function, so it doesn't need to be combined with another activation function (see the sketch below)
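A NumPy sketch of the softmax function itself, written with the usual max-subtraction trick for numerical stability.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)          # outputs form a probability distribution

print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.659, 0.242, 0.099], sums to 1
```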
Softmax
Annotation: S – input, R – real numbers, d – dimension, p – probability
Hyperparameter tuning
In a real neural network project, hyperparameters can be optimized manually (a difficult task) or through several third-party hyperparameter optimization tools.
If you use Keras, you can use libraries such as Hyperopt, Kopt and Talos for hyperparameter optimization.
If you use TensorFlow, you can use GPflowOpt for Bayesian optimization, and commercial solutions like Google's Cloud Machine Learning Engine, which provide multiple optimization options.
Ref: https://medium.com/@mikkokotila/a-comprehensive-list-of-hyperparameter-optimization-tuning-solutions-88e067f19d9
Grid Search
1. For example, you can perform a grid search for the optimal batch size by automatically training the model for batch sizes between 10 and 100 samples, in steps of 20 (see the sketch below)
2. The model will run 5 times and the batch size selected will be the one which yields the highest accuracy
Pros: Maps out the hyperparameter space to the extent specified and helps locate the most optimal combination without the need to list out all the permutations manually
Cons: Can be slow to run for large numbers of hyperparameter values, and the most optimal combination may not be present in the grid!
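A sketch of the batch-size grid search described above, implemented as a plain loop over the 5 candidate values; the model-building function, dummy data and epoch count are illustrative assumptions, not the course notebook.

```python
import numpy as np
from tensorflow import keras

X_train = np.random.rand(500, 20)            # dummy data for illustration
y_train = np.random.randint(0, 2, (500,))

def build_model():
    model = keras.Sequential([
        keras.layers.Dense(32, activation='relu', input_shape=(20,)),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

results = {}
for batch_size in range(10, 101, 20):        # 10, 30, 50, 70, 90 -> 5 runs
    model = build_model()                    # fresh model for each candidate
    history = model.fit(X_train, y_train, epochs=10, batch_size=batch_size,
                        validation_split=0.2, verbose=0)
    results[batch_size] = max(history.history['val_accuracy'])

best = max(results, key=results.get)         # batch size with the highest validation accuracy
print(best, results)
```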
Random Search
Pros: According to the study, provides higher accuracy with less training cycles, for
problems with high dimensionality
Cons: execution time depends on the number of permutations given the hyperparameter
range of values
Thank You