Neural Network Training Guide
[Mind map: Neural Network Training. How a NN trains: forward propagation, backpropagation. Issues while training: issue with perceptron (linear activation), vanishing gradients, exploding gradients, dead neurons, overfitting. Solutions: optimization (optimizers), hyperparameter tuning (learning rate, epoch), weight initialization (random, Glorot), regularization (dropout, early stopping), batch normalization. Also covered: NN vs traditional methods, autoencoders, non-linear classification with input, hidden, and output layers.]
NEURAL NETWORK Cheat sheet
Neural Network Applications
- Pattern recognition and natural language processing
- Recognition and processing of images

Activation Functions

Activation | Equation | Derivative | Range | Problem | Advantage
Sigmoid | s(z) = 1 / (1 + e^(-z)) | s(z)(1 - s(z)) | (0, 1) | Vanishing gradient | Adds non-linearity
Tanh | tanh(z) | 1 - tanh(z)^2 | (-1, 1) | Vanishing gradient | Adds non-linearity; zero-centered output
ReLU | max(0, z) | 1 if z > 0, else 0 | [0, inf) | Dead neurons | Easy computation
Leaky ReLU | max(0.01z, z) | 1 if z > 0, else 0.01 | (-inf, inf) | None | Solves the dead-neuron issue of ReLU
Softmax | e^(z_i) / sum_j e^(z_j) | None | [0, 1] | None | Multi-class classification

Neural Network Training

Forward propagation:
A[0] = X
Z[l] = W[l] A[l-1] + b[l]
A[l] = g[l](Z[l]), for l = 1, ..., L
A[L] = y_hat

Backpropagation: starting from dA[L], the derivatives dA[L-1], ..., dA[1] are computed layer by layer, giving dW[l] and db[l] for every layer.

Update parameters:
W[l] := W[l] - alpha * dW[l]
b[l] := b[l] - alpha * db[l]
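The forward-prop / backprop / update cycle can be sketched end to end. Everything below (the tiny XOR-style toy data, layer sizes, learning rate, and iteration count) is an illustrative assumption, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0., 1., 1.], [0., 1., 0., 1.]])  # A[0], shape (2, 4)
Y = np.array([[0., 1., 1., 0.]])                    # targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(0, 1, (4, 2)), np.zeros((4, 1))  # random initialization
W2, b2 = rng.normal(0, 1, (1, 4)), np.zeros((1, 1))
alpha, m = 0.5, X.shape[1]
losses = []

for _ in range(5000):
    # Forward propagation: Z[l] = W[l] A[l-1] + b[l], A[l] = g(Z[l])
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                                 # A[L] = y_hat
    losses.append(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))
    # Backpropagation: gradients flow from dA[L] back to dA[1]
    dZ2 = A2 - Y
    dW2, db2 = dZ2 @ A1.T / m, dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)               # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = dZ1 @ X.T / m, dZ1.mean(axis=1, keepdims=True)
    # Update parameters: W[l] := W[l] - alpha * dW[l]
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
```

The cross-entropy loss decreases over the run, showing the three phases working together.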
Neural Network vs Traditional ML

Neural Network | Traditional ML
Works well with unstructured data like images and text | Works well with structured/tabular data
Requires more computation power to train | Requires less training compute

Issue with Perceptron (Linear Activation)
With linear activations, a perceptron with L layers acts the same as a linear network without any hidden layers, which makes the information in the hidden layers redundant. Such a network cannot learn non-linear patterns in the data.

Dead Neurons
The gradient value for a dead neuron is zero irrespective of the input, so its weights never update.

Vanishing/Exploding Gradients
For an L-layered NN, L derivatives are multiplied together during backpropagation, so gradients can grow or shrink exponentially with depth. Vanishing/exploding gradients are caused by the values of the weights: if the multiplied gradients have value > 1, the gradient for the Lth layer becomes unexpectedly large during training (exploding gradients). This can be avoided if the variance of the weights is kept small.

Weight Initialization
- Zero initialization: if all the weights are initialized as 0 (or any constant value), the derivatives remain the same for every weight in the weight matrix, so every neuron learns the same thing.
- Random weight initialization: if all the weights have different values, the gradients will be different for every layer of the NN, which makes every neuron useful.
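The symmetry problem behind zero initialization can be shown directly: with all weights zero, every hidden neuron receives an identical (here, zero) gradient and the neurons can never differentiate, while random initialization breaks the symmetry. The toy shapes and data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8))   # 3 features, 8 samples
Y = rng.normal(size=(1, 8))

def hidden_grads(W1, W2):
    # One forward/backward pass of a tiny tanh network with squared loss;
    # returns dW1, one row of gradients per hidden neuron.
    A1 = np.tanh(W1 @ X)
    dZ2 = W2 @ A1 - Y
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    return dZ1 @ X.T / X.shape[1]

# Zero init: every hidden neuron gets the same gradient (all zero here).
dW1_zero = hidden_grads(np.zeros((4, 3)), np.zeros((1, 4)))
# Random init: each neuron gets a different gradient and can specialize.
dW1_rand = hidden_grads(rng.normal(size=(4, 3)), rng.normal(size=(1, 4)))
```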
Batch GD
- Takes the whole dataset for weight updation: at iteration t, the gradient is averaged over all n samples before a single update.
- Advantages: easy computation; easy to implement.
- Disadvantages: may get trapped at a local minimum; if the dataset is very large, the time for convergence will be very high; requires large memory to calculate the gradient on the whole dataset.
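One Batch GD step, sketched on linear regression (the loss, data, and learning rate are illustrative choices): the gradient is averaged over all n samples before each update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # n = 100 samples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                            # noiseless targets for clarity

w, alpha = np.zeros(3), 0.1
for t in range(200):
    grad = X.T @ (X @ w - y) / len(X)     # average gradient over the whole dataset
    w = w - alpha * grad                  # w_{t+1} = w_t - alpha * grad
```

After enough iterations, w recovers the true weights.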
Stochastic GD
- Updates the weights on a random sample of the dataset.
- Advantages: frequent updates of the model weights, hence it converges in less time; requires less memory.
- Disadvantages: due to the frequent updates, the optimizer steps are noisy, hence high variance in the weight values; may overshoot even after reaching the global minimum; to get the same convergence as Batch GD, it needs to slowly reduce the learning rate every epoch.

Mini-Batch GD
- Updates the weights based on a random batch of the dataset.
- Advantages: frequently updates the weights with less variance; requires a medium amount of memory; an improvement over both Batch GD and SGD.
- Disadvantages: an optimum value of the learning rate has to be chosen; may get trapped at a local minimum.

Batch Normalization
Batch Normalization normalizes the input by performing scaling and shifting in every layer of the NN, which not only prevents internal covariate shift but also helps with faster convergence: normalized data tends to have a circular-shaped loss surface (which helps the optimizer reach the global minimum faster) rather than an elliptical one. It is good practice to apply Batch Normalization before the activation function.
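A minimal sketch of the Batch Normalization transform applied to one layer's pre-activations. The learned scale and shift (gamma, beta) are fixed at their usual initial values of 1 and 0 here; all names and shapes are illustrative:

```python
import numpy as np

def batch_norm(Z, gamma=1.0, beta=0.0, eps=1e-5):
    mu = Z.mean(axis=1, keepdims=True)    # per-feature mean over the batch
    var = Z.var(axis=1, keepdims=True)    # per-feature variance over the batch
    Z_hat = (Z - mu) / np.sqrt(var + eps) # normalize
    return gamma * Z_hat + beta           # scale and shift

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))  # shifted, spread-out pre-activations
out = batch_norm(Z)
```

With gamma = 1 and beta = 0 the output of every feature is standardized to roughly zero mean and unit variance.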
RMSprop
- Updates the weights based on a random batch of the dataset.
- Advantages: the algorithm converges quickly; requires less convergence time than gradient descent; further reduces the noisy steps, the training time, and the variance.
- Disadvantage: the learning rate has to be defined manually and one value may not work for every application.

Bias Correction
Removes the bias in the moving-average estimates so that the optimizer does not take noisy steps when starting.

Impact of Learning Rate on Training
Learning rate has a huge impact on the training time of a NN:
- A small learning rate: slow convergence.
- The optimal learning rate: smooth, fast convergence.
- A too-large learning rate: divergent behaviour.
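The three learning-rate regimes can be demonstrated with gradient descent on the simple function f(w) = w^2 (gradient 2w); the three rates are illustrative choices:

```python
def descend(lr, w=1.0, steps=20):
    # Plain gradient descent on f(w) = w**2, starting from w = 1.
    for _ in range(steps):
        w = w - lr * 2 * w      # w := w - lr * f'(w)
    return abs(w)               # distance from the minimum at 0

small = descend(lr=0.01)   # small rate: barely moved after 20 steps
good = descend(lr=0.3)     # near-optimal rate: essentially converged
large = descend(lr=1.1)    # too large: |1 - 2*lr| > 1, so it diverges
```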
Resolving Overfitting
Dropout causes a single neural network to take on a different network architecture at every training step, which regularizes it. PCA and autoencoders are also used to tackle these issues (see Autoencoders vs PCA below).
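A minimal sketch of inverted dropout, the standard way dropout is implemented: a random mask disables a fraction of neurons each pass, and the survivors are rescaled so the expected activation is unchanged. The keep probability is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(A, keep_prob=0.8):
    # Keep each neuron with probability keep_prob; a fresh mask per call
    # means the effective sub-network differs on every training step.
    mask = rng.random(A.shape) < keep_prob
    return (A * mask) / keep_prob   # rescale so E[output] == input

A = np.ones((4, 1000))              # a layer's activations
A_drop = dropout(A)
```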
Momentum
Momentum helps know the direction of the next step with the knowledge of the previous step. This helps prevent oscillations (and hence slow convergence). Discussed in detail under Optimizers.

L2 Regularization
Due to the hidden layers in a NN, plain L2 regularization cannot be used; to regularize the weight matrices, the Frobenius norm is used.

Early Stopping
Stop training once the validation error stops improving and starts to rise.

Epoch
Controls the number of iterations a neural network is trained for. Too many epochs can lead to overfitting.

Autoencoders vs PCA
- Autoencoders are NNs which are used for compressing the data (encoding) and then reconstructing the actual high-dimensional data.
- The main idea of PCA is to find the best vector along which we should rotate our existing coordinates so as to keep maximum information.
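The momentum update described in this section can be sketched as follows; beta, the learning rate, and the toy loss surface are illustrative choices:

```python
import numpy as np

def momentum_descent(grad, w0, lr=0.1, beta=0.9, steps=100):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)  # remember the previous direction
        w = w - lr * v                       # step along the smoothed gradient
    return w

# Elongated bowl f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2): plain GD tends to
# oscillate across the steep w[1] axis; momentum damps those oscillations.
grad = lambda w: np.array([w[0], 10.0 * w[1]])
w_final = momentum_descent(grad, [5.0, 5.0])
```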
[Figure: autoencoder with an input layer, a hidden (encoding) layer, and an output layer]
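The PCA idea above can be sketched with an SVD: find the direction that retains maximum information, project onto it (the compression), and reconstruct. The toy data and the choice of one component are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D points lying near a line: one direction carries almost all the variance.
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=100)])
Xc = X - X.mean(axis=0)                 # center the data first

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                             # first principal component (unit vector)
X_proj = Xc @ pc1[:, None]              # compress: 2-D -> 1-D (the "encoding")
X_rec = X_proj @ pc1[None, :]           # reconstruct back to 2-D

err = np.mean((Xc - X_rec) ** 2)        # small: little information was lost
```

An autoencoder with one linear hidden unit learns essentially this same projection; with non-linear activations it can learn more general compressions than PCA.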