Neural Network Training Guide

This cheat sheet provides an overview of neural networks, comparing them to traditional machine learning methods, and highlights their applications in pattern recognition, image processing, and automated translation. It discusses various activation functions, training issues, optimization techniques, and the importance of hyperparameters in neural network performance. Additionally, it addresses challenges like overfitting and vanishing gradients, along with solutions such as dropout and batch normalization.

Uploaded by

Abhishek Puvvadi

NEURAL NETWORK Cheat sheet

[Overview diagram: a feed-forward network with an input layer, hidden layers, and an output layer, annotated with weights (w11, w21, ...), biases (b21, b12, ...), and activations (f11, f21, ...). The surrounding topic map covers: NN vs traditional methods; why use NN; the simplest NN; how a NN trains; the perceptron; forward prop and backprop; issues while training (linear activation, exploding gradients, vanishing gradients, dead neurons, overfitting) and their solutions; optimization; hyperparameters and hyperparameter tuning (learning rate, epochs, weight initialization — random and Glorot, regularization, optimizers, batch norm, early stopping, dropout); and autoencoders in an unsupervised setting.]
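The layered forward pass sketched in the diagram (inputs x, weights w, biases b, activations f) can be written out in a few lines. A minimal sketch in NumPy; the layer sizes, sigmoid activation, and random weights are illustrative choices, not taken from the diagram:

```python
import numpy as np

def sigmoid(z):
    # Squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """One forward pass through a 2-layer network: input -> hidden -> output."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1      # hidden pre-activations
    a1 = sigmoid(z1)      # hidden activations (the f11, f12, ... in the diagram)
    z2 = W2 @ a1 + b2     # output pre-activation
    return sigmoid(z2)    # network output

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 2)), np.zeros(4),   # W1, b1: 2 inputs -> 4 hidden
          rng.normal(size=(1, 4)), np.zeros(1))   # W2, b2: 4 hidden -> 1 output
y = forward(np.array([0.5, -0.2]), params)
```

Because the output unit is a sigmoid, the result always lands in (0, 1), which is why this layout suits binary classification.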
Neural Network Applications
- Pattern recognition and natural language processing
- Recognition and processing of images
- Automated translation
- Analysis of sentiment
- System for answering questions
- Classification and detection of objects

Neural Network Training

FORWARD PROPAGATION:
X = A[0]; Z[l] = W[l] A[l-1] + b[l]; A[l] = g(Z[l]) for l = 1, ..., L; A[L] = ŷ

BACKWARD PROPAGATION:
Starting from dA[L], compute dZ[l] = dA[l] · g'(Z[l]), dW[l] = dZ[l] A[l-1]ᵀ, db[l] = dZ[l] for l = L, ..., 1.

UPDATE PARAMETERS:
W[l] := W[l] − α dW[l]; b[l] := b[l] − α db[l]

Activation functions:

| Activation | Equation | Derivative | Range | Problem | Advantage |
|---|---|---|---|---|---|
| Sigmoid | σ(z) = 1 / (1 + e⁻ᶻ) | σ(z)(1 − σ(z)) | (0, 1) | Vanishing gradient | Adds non-linearity |
| Tanh | tanh(z) | 1 − tanh²(z) | (−1, 1) | Vanishing gradient | Adds non-linearity; works with Glorot initialization |
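The sigmoid and tanh rows can be checked numerically. A short sketch using the standard formulas for the two activations and their derivatives; the sample points are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    # d/dz tanh(z) = 1 - tanh(z)^2; peaks at 1 when z = 0
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-10, 10, 1001)
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))   # range (0, 1)
assert np.all((np.tanh(z) > -1) & (np.tanh(z) < 1))  # range (-1, 1)
# Both derivatives shrink toward 0 for large |z| -- the vanishing-gradient problem
assert sigmoid_prime(10.0) < 1e-3 and tanh_prime(10.0) < 1e-3
```

The final assertion is the vanishing-gradient problem in miniature: multiplying many such near-zero derivatives across layers drives the gradient toward zero.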
| Activation | Equation | Derivative | Range | Problem | Advantage |
|---|---|---|---|---|---|
| ReLU | max(0, z) | 0 for z < 0, 1 for z > 0 | [0, ∞) | Can cause a dead-neuron scenario | Reduces the possibility of vanishing gradient; faster convergence; works with He initialization |
| Leaky ReLU | max(αz, z), small α | α for z < 0, 1 for z > 0 | (−∞, ∞) | | Solves the dead-neuron issue of ReLU |

NN vs Traditional Machine Learning

| Neural Network | Traditional ML |
|---|---|
| Neural Networks are a group of Machine Learning algorithms which mimic biological neurons | Traditional models refer to algorithms that statistically analyze and discern patterns from data |
| Can adapt well to unstructured data like images and text | Work well with structured/tabular data |
| Require more computation power to train; most NN libraries can leverage GPUs to speed up the training process | Require less training time and computational resources in general |
| Specialized NN architectures and deep layers implicitly learn features | Manual feature engineering is required |
| Hard to interpret and often act like a black box | Easy to interpret |

PERCEPTRON
If linear or no activation functions are used, a neural network acts as a perceptron.

Issue with the perceptron: a perceptron with L layers acts the same as a linear neural network without any hidden layers, which makes the information in the hidden layers redundant. It cannot learn non-linear patterns in the data. Non-linear activation functions address this by adding non-linearity to the neural network.

Output layer functions for classification:

| Output layer function | Range | Classification |
|---|---|---|
| Sigmoid | (0, 1) | Binary classification |
| Softmax | [0, 1] | Multi-class classification |

Issues while training Neural Networks:

| Training issue | Explanation | Reason | Solution |
|---|---|---|---|
| Exploding gradients | Model weights grow exponentially and become unexpectedly large during training | For an L-layered NN, the L derivatives are multiplied together; if the gradients have values > 1, then the gradient for the Lth layer increases exponentially | Weight initialization: vanishing/exploding gradients are caused by the values of the weights, and can be avoided if the variance of the weights is kept small — Glorot/Xavier initialization (Glorot Normal) |
| Vanishing gradients | When back-propagating to the initial layers, the derivatives of the L hidden layers are multiplied together; if the gradients are small, the gradient value decreases exponentially, which slows the model's learning | It is caused when the gradient values become < 1 | Same as above: keep the variance of the weights small (Glorot/Xavier initialization) |
| Dead neurons | If all the weights of the NN are initialized to 0 or to any constant value, the gradient is the same for every weight in the weight matrix, so the neurons learn the same features in every iteration; zero gradients make the neurons dead, irrespective of the input, which results in no learning | Initializing all weights to zeros causes the derivatives to remain the same for every weight | Random weight initialization: if all the weights have different values, the gradients will be different for every layer of the NN, which makes every neuron useful |
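The initialization fixes named in the table can be sketched as follows; the function names `glorot_normal` and `he_normal` are mine, but the variance formulas are the standard Glorot/Xavier and He ones:

```python
import numpy as np

rng = np.random.default_rng(42)

def glorot_normal(fan_in, fan_out):
    # Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out); pairs well with sigmoid/tanh
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # He: Var(W) = 2 / fan_in; pairs well with ReLU
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

# A 256 -> 128 layer; the sample variance should sit near the target 2 / (256 + 128)
W = glorot_normal(256, 128)
```

Keeping the weight variance small and tied to layer width is exactly the "keep the variance of the weights small" remedy the table prescribes for vanishing/exploding gradients.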
Neural Network Optimizers (Weight Update)

Each optimizer updates the parameters as w := w − α·dw and b := b − α·db, where t is the iteration and n is the number of samples.

| Optimizer | How it works | Advantages | Disadvantages |
|---|---|---|---|
| Batch GD | Takes the whole dataset for each weight update | Easy computation; easy to implement | May get trapped at a local minimum; if the dataset is very large, the time to converge will be very high; requires large memory to calculate the gradient on the whole dataset |
| Stochastic GD | Updates the weights on a random sample of the dataset | Frequent updates of the model weights, hence converges in less time; requires less memory | Due to frequent updates, the optimizer steps are noisy, hence high variance in the weight values; may overshoot even after reaching the global minimum; to get the same convergence as Batch GD, it needs to slowly reduce the learning rate per epoch |
| Mini Batch GD | Updates the weights based on a random batch of the dataset | Frequent weight updates with less variance; requires a medium amount of memory; an improvement over Batch GD and SGD | Choosing an optimum value of the learning rate is hard; may get trapped at a local minimum |
| Momentum | Updates the weights based on a random batch of the dataset; uses previous batch gradients to boost the momentum of GD in the direction of convergence | Reduces the noisy steps by gaining information from previous steps; further reduces the variance of the weights; converges faster than gradient descent | |

Batch Normalization
Batch Normalization normalizes the input by performing scaling and shifting in every layer of the NN, which not only prevents internal covariate shift but also helps with faster convergence, as normalized data tends to have a circular-shaped loss function plot (which helps the optimizer reach the global minimum faster) compared to an elliptical one. It is good practice to use Batch Normalization before the activation function.
Scaling in Batch Normalization: x̂ = (x − μ) / √(σ² + ε).
Shifting in Batch Normalization: y = γx̂ + β.
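The scaling and shifting that Batch Normalization performs can be sketched as a single training-time forward step; `gamma` and `beta` stand for the learned scale and shift, and the batch shape is illustrative:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch per feature, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # scaling: zero mean, unit variance
    return gamma * x_hat + beta              # shifting: learned scale and offset

# A skewed batch (mean 5, std 3) of 64 samples with 8 features
x = np.random.default_rng(1).normal(5.0, 3.0, size=(64, 8))
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# With gamma = 1 and beta = 0 the output is (approximately) standardized
```

At inference, real implementations swap the batch statistics for running averages collected during training; that bookkeeping is omitted here.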

| Optimizer | How it works | Advantages | Disadvantages |
|---|---|---|---|
| RMSprop | Updates the weights based on a random batch of the dataset; further reduces the noisy steps by penalizing the gradients of the weights that cause them | The algorithm converges quickly; requires less convergence time than gradient descent | The learning rate has to be defined manually and may not work for every application |
| Adam | Updates the weights based on a random batch of the dataset; uses both the Momentum and RMSprop techniques to reduce training time and variance; bias correction removes bias so as to have a better moving average at the start | The method is fast and converges rapidly; rectifies the vanishing learning rate and high variance | Computationally costly |

Impact of learning rate on training
Learning rate has a huge impact on the training time of a NN:
- Too low: a small learning rate requires many updates before reaching the minimum.
- Just right: the optimal learning rate swiftly reaches the minimum point.
- Too high: too large a learning rate causes drastic updates, which lead to divergent behaviour.
(Plots: cost vs. weight w for each of the three cases.)

A priori selection of the learning rate is hard, hence Learning Rate Decay is used.
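Learning Rate Decay can be sketched with a simple schedule; the inverse-time (1/t) formula below is one common choice, and the constants are illustrative:

```python
def inverse_time_decay(lr0, decay_rate, epoch):
    # lr_t = lr0 / (1 + decay_rate * epoch): large steps early, fine steps late
    return lr0 / (1.0 + decay_rate * epoch)

# Starting at 0.1 with decay_rate 0.5, the rate shrinks every epoch
schedule = [inverse_time_decay(0.1, 0.5, e) for e in range(5)]
```

This sidesteps the a-priori-selection problem above: a rate that starts a little too high is tamed as the decay brings it down near convergence.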
Overfitting in Neural Networks
Updating weights with a high-dimensional matrix for deep neural networks is what makes the model overfit on data.

Resolving Overfitting:

Dropout: randomly drops neurons/edges during training. The rate at which dropout happens is called the Dropout Rate (r). During test time, no dropout takes place, but the weight values are multiplied by a factor of p = 1 − r. Dropout causes a single neural network to have a different network architecture on each training pass, which improves generalization error, as the NN cannot put a higher weight value on any one of the features.

Frobenius Norm Regularization: due to the hidden layers in a NN, plain L2 regularization cannot be used; hence, to regularize the weights, the Frobenius norm is used.

Early Stopping: after the maximum validation performance is reached, the neural network is trained for a few more epochs, and if there is no improvement in performance, training stops. (Plot: training and validation error vs. number of epochs; stop training where the validation error bottoms out.)

Hyperparameters:
1. Learning Rate: controls how quickly the Neural Network reaches the global minimum. Discussed in detail under "Impact of learning rate on training".
2. Momentum: helps know the direction of the next step with the knowledge of the previous step, which helps prevent oscillations (slow convergence). Discussed in detail under Optimizers.
3. Number of epochs: controls the number of iterations a Neural Network is trained for. Too many epochs can lead to overfitting; fewer epochs may lead to underfitting.
4. Batch size: controls the number of samples passed in one iteration. An increased batch size can hinder the model's ability to generalize.
5. Initialization of weights: use the Glorot Normal or Uniform distribution, or the He Normal or Uniform distribution.
6. Setting the number of hidden layers: with an increasing number of hidden layers and neurons, the number of parameters increases, which not only makes the neural network overfit but also increases the number of computations and the training time.

WHY USE AUTOENCODERS?
The curse of dimensionality causes lots of difficulties while training a model, because it requires training a lot of parameters on a scarce dataset. PCA and autoencoders are used to tackle these issues.

Autoencoders vs PCA:

| Autoencoders | PCA |
|---|---|
| Autoencoders are NNs which are used for compressing the data into a low-dimensional latent space and then trying to reconstruct the original high-dimensional data | The main idea of PCA is to find the best value of the vector u, which is the direction of maximum variance (or maximum information), along which we should rotate our existing coordinates; the eigenvector associated with the largest eigenvalue indicates the direction in which the data has the most variance |
| Autoencoders are usually preferred when there is a need for modeling non-linearities and relatively complex relationships | PCA essentially learns a linear transformation |

Neural Networks in an Unsupervised Setting: Autoencoders
An autoencoder is a neural network that has 3 layers and is trained to reconstruct its inputs. Here, the number of input neurons is equal to the number of output neurons, and the network's target output is the same as its input. It uses dimensionality reduction to restructure the input.

[Diagram: an autoencoder with an input layer, a hidden (encoding) layer, and an output layer.]
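The 3-layer autoencoder in the diagram can be sketched as a single forward pass; the layer sizes, tanh encoder, and linear decoder are my illustrative choices, and training (minimizing the reconstruction error) is omitted:

```python
import numpy as np

rng = np.random.default_rng(7)
n_in, n_hidden = 8, 3          # bottleneck: 8 inputs compressed into 3 latents

W_enc = rng.normal(size=(n_hidden, n_in)) * 0.1
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(size=(n_in, n_hidden)) * 0.1
b_dec = np.zeros(n_in)

def autoencode(x):
    """Encode into the low-dimensional latent space, then reconstruct."""
    z = np.tanh(W_enc @ x + b_enc)   # hidden (encoding) layer
    return W_dec @ z + b_dec         # output layer: same size as the input

x = rng.normal(size=n_in)
x_hat = autoencode(x)
# The target output is the input itself; training would minimize ||x - x_hat||^2
```

Note the defining property from the text: the output has exactly as many neurons as the input, and everything must squeeze through the smaller encoding layer.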
