Neural Network Training Guide

This cheat sheet provides an overview of neural networks, comparing them to traditional machine learning methods, and highlights their applications in pattern recognition, image processing, and automated translation. It discusses various activation functions, training issues, optimization techniques, and the importance of hyperparameters in neural network performance. Additionally, it addresses challenges like overfitting and vanishing gradients, along with solutions such as dropout and batch normalization.

Uploaded by

Abhishek Puvvadi

NEURAL NETWORK Cheat sheet

[Overview diagram: a feed-forward network with an input layer, hidden layers, and an output layer, annotated with weights (w11, w21, ...), biases (b21, b12, ...), and activations (f11, f21, ...). The surrounding topic map covers: NN vs traditional methods; why use NN; the simplest NN; how a NN trains; the perceptron; forward prop and backprop; issues while training (linear activation, exploding gradients, vanishing gradients, dead neurons, overfitting) and their solutions; optimization; hyperparameters and hyperparameter tuning (learning rate, epochs, weight initialization — random and Glorot, regularization, optimizers, batch norm, early stopping, dropout); and autoencoders in an unsupervised setting.]
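The layered forward pass sketched in the diagram (inputs x, weights w, biases b, activations f) can be written out in a few lines. A minimal sketch in NumPy; the layer sizes, sigmoid activation, and random weights are illustrative choices, not taken from the diagram:

```python
import numpy as np

def sigmoid(z):
    # Squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """One forward pass through a 2-layer network: input -> hidden -> output."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1      # hidden pre-activations
    a1 = sigmoid(z1)      # hidden activations (the f11, f12, ... in the diagram)
    z2 = W2 @ a1 + b2     # output pre-activation
    return sigmoid(z2)    # network output

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 2)), np.zeros(4),   # W1, b1: 2 inputs -> 4 hidden
          rng.normal(size=(1, 4)), np.zeros(1))   # W2, b2: 4 hidden -> 1 output
y = forward(np.array([0.5, -0.2]), params)
```

Because the output unit is a sigmoid, the result always lands in (0, 1), which is why this layout suits binary classification.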
Neural Network Applications
- Pattern recognition and natural language processing
- Recognition and processing of images
- Automated translation
- Analysis of sentiment
- System for answering questions
- Classification and detection of objects

Neural Network Training

FORWARD PROPAGATION:
X = A[0]; Z[l] = W[l] A[l-1] + b[l]; A[l] = g(Z[l]) for l = 1, ..., L; A[L] = ŷ

BACKWARD PROPAGATION:
Starting from dA[L], compute dZ[l] = dA[l] · g'(Z[l]), dW[l] = dZ[l] A[l-1]ᵀ, db[l] = dZ[l] for l = L, ..., 1.

UPDATE PARAMETERS:
W[l] := W[l] − α dW[l]; b[l] := b[l] − α db[l]

Activation functions:

| Activation | Equation | Derivative | Range | Problem | Advantage |
|---|---|---|---|---|---|
| Sigmoid | σ(z) = 1 / (1 + e⁻ᶻ) | σ(z)(1 − σ(z)) | (0, 1) | Vanishing gradient | Adds non-linearity |
| Tanh | tanh(z) | 1 − tanh²(z) | (−1, 1) | Vanishing gradient | Adds non-linearity; works with Glorot initialization |
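The sigmoid and tanh rows can be checked numerically. A short sketch using the standard formulas for the two activations and their derivatives; the sample points are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    # d/dz tanh(z) = 1 - tanh(z)^2; peaks at 1 when z = 0
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-10, 10, 1001)
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))   # range (0, 1)
assert np.all((np.tanh(z) > -1) & (np.tanh(z) < 1))  # range (-1, 1)
# Both derivatives shrink toward 0 for large |z| -- the vanishing-gradient problem
assert sigmoid_prime(10.0) < 1e-3 and tanh_prime(10.0) < 1e-3
```

The final assertion is the vanishing-gradient problem in miniature: multiplying many such near-zero derivatives across layers drives the gradient toward zero.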
| Activation | Equation | Derivative | Range | Problem | Advantage |
|---|---|---|---|---|---|
| ReLU | max(0, z) | 0 for z < 0, 1 for z > 0 | [0, ∞) | Can cause a dead-neuron scenario | Reduces the possibility of vanishing gradient; faster convergence; works with He initialization |
| Leaky ReLU | max(αz, z), small α | α for z < 0, 1 for z > 0 | (−∞, ∞) | | Solves the dead-neuron issue of ReLU |

NN vs Traditional Machine Learning

| Neural Network | Traditional ML |
|---|---|
| Neural Networks are a group of Machine Learning algorithms which mimic biological neurons | Traditional models refer to algorithms that statistically analyze and discern patterns from data |
| Can adapt well to unstructured data like images and text | Work well with structured/tabular data |
| Require more computation power to train; most NN libraries can leverage GPUs to speed up the training process | Require less training time and computational resources in general |
| Specialized NN architectures and deep layers implicitly learn features | Manual feature engineering is required |
| Hard to interpret and often act like a black box | Easy to interpret |

PERCEPTRON
If linear or no activation functions are used, a neural network acts as a perceptron.

Issue with the perceptron: a perceptron with L layers acts the same as a linear neural network without any hidden layers, which makes the information in the hidden layers redundant. It cannot learn non-linear patterns in the data. Non-linear activation functions address this by adding non-linearity to the neural network.

Output layer functions for classification:

| Output layer function | Range | Classification |
|---|---|---|
| Sigmoid | (0, 1) | Binary classification |
| Softmax | [0, 1] | Multi-class classification |

Issues while training Neural Networks:

| Training issue | Explanation | Reason | Solution |
|---|---|---|---|
| Exploding gradients | Model weights grow exponentially and become unexpectedly large during training | For an L-layered NN, the L derivatives are multiplied together; if the gradients have values > 1, then the gradient for the Lth layer increases exponentially | Weight initialization: vanishing/exploding gradients are caused by the values of the weights, and can be avoided if the variance of the weights is kept small — Glorot/Xavier initialization (Glorot Normal) |
| Vanishing gradients | When back-propagating to the initial layers, the derivatives of the L hidden layers are multiplied together; if the gradients are small, the gradient value decreases exponentially, which slows the model's learning | It is caused when the gradient values become < 1 | Same as above: keep the variance of the weights small (Glorot/Xavier initialization) |
| Dead neurons | If all the weights of the NN are initialized to 0 or to any constant value, the gradient is the same for every weight in the weight matrix, so the neurons learn the same features in every iteration; zero gradients make the neurons dead, irrespective of the input, which results in no learning | Initializing all weights to zeros causes the derivatives to remain the same for every weight | Random weight initialization: if all the weights have different values, the gradients will be different for every layer of the NN, which makes every neuron useful |
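The initialization fixes named in the table can be sketched as follows; the function names `glorot_normal` and `he_normal` are mine, but the variance formulas are the standard Glorot/Xavier and He ones:

```python
import numpy as np

rng = np.random.default_rng(42)

def glorot_normal(fan_in, fan_out):
    # Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out); pairs well with sigmoid/tanh
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # He: Var(W) = 2 / fan_in; pairs well with ReLU
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

# A 256 -> 128 layer; the sample variance should sit near the target 2 / (256 + 128)
W = glorot_normal(256, 128)
```

Keeping the weight variance small and tied to layer width is exactly the "keep the variance of the weights small" remedy the table prescribes for vanishing/exploding gradients.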
Neural Network Optimizers (Weight Update)

Each optimizer updates the parameters as w := w − α·dw and b := b − α·db, where t is the iteration and n is the number of samples.

| Optimizer | How it works | Advantages | Disadvantages |
|---|---|---|---|
| Batch GD | Takes the whole dataset for each weight update | Easy computation; easy to implement | May get trapped at a local minimum; if the dataset is very large, the time to converge will be very high; requires large memory to calculate the gradient on the whole dataset |
| Stochastic GD | Updates the weights on a random sample of the dataset | Frequent updates of the model weights, hence converges in less time; requires less memory | Due to frequent updates, the optimizer steps are noisy, hence high variance in the weight values; may overshoot even after reaching the global minimum; to get the same convergence as Batch GD, it needs to slowly reduce the learning rate per epoch |
| Mini Batch GD | Updates the weights based on a random batch of the dataset | Frequent weight updates with less variance; requires a medium amount of memory; an improvement over Batch GD and SGD | Choosing an optimum value of the learning rate is hard; may get trapped at a local minimum |
| Momentum | Updates the weights based on a random batch of the dataset; uses previous batch gradients to boost the momentum of GD in the direction of convergence | Reduces the noisy steps by gaining information from previous steps; further reduces the variance of the weights; converges faster than gradient descent | |

Batch Normalization
Batch Normalization normalizes the input by performing scaling and shifting in every layer of the NN, which not only prevents internal covariate shift but also helps with faster convergence, as normalized data tends to have a circular-shaped loss function plot (which helps the optimizer reach the global minimum faster) compared to an elliptical one. It is good practice to use Batch Normalization before the activation function.
Scaling in Batch Normalization: x̂ = (x − μ) / √(σ² + ε).
Shifting in Batch Normalization: y = γx̂ + β.
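The scaling and shifting that Batch Normalization performs can be sketched as a single training-time forward step; `gamma` and `beta` stand for the learned scale and shift, and the batch shape is illustrative:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch per feature, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # scaling: zero mean, unit variance
    return gamma * x_hat + beta              # shifting: learned scale and offset

# A skewed batch (mean 5, std 3) of 64 samples with 8 features
x = np.random.default_rng(1).normal(5.0, 3.0, size=(64, 8))
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# With gamma = 1 and beta = 0 the output is (approximately) standardized
```

At inference, real implementations swap the batch statistics for running averages collected during training; that bookkeeping is omitted here.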

| Optimizer | How it works | Advantages | Disadvantages |
|---|---|---|---|
| RMSprop | Updates the weights based on a random batch of the dataset; further reduces the noisy steps by penalizing the gradients of the weights that cause them | The algorithm converges quickly; requires less convergence time than gradient descent | The learning rate has to be defined manually and may not work for every application |
| Adam | Updates the weights based on a random batch of the dataset; uses both the Momentum and RMSprop techniques to reduce training time and variance; bias correction removes bias so as to have a better moving average at the start | The method is fast and converges rapidly; rectifies the vanishing learning rate and high variance | Computationally costly |

Impact of learning rate on training
Learning rate has a huge impact on the training time of a NN:
- Too low: a small learning rate requires many updates before reaching the minimum.
- Just right: the optimal learning rate swiftly reaches the minimum point.
- Too high: too large a learning rate causes drastic updates, which lead to divergent behaviour.
(Plots: cost vs. weight w for each of the three cases.)

A priori selection of the learning rate is hard, hence Learning Rate Decay is used.
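Learning Rate Decay can be sketched with a simple schedule; the inverse-time (1/t) formula below is one common choice, and the constants are illustrative:

```python
def inverse_time_decay(lr0, decay_rate, epoch):
    # lr_t = lr0 / (1 + decay_rate * epoch): large steps early, fine steps late
    return lr0 / (1.0 + decay_rate * epoch)

# Starting at 0.1 with decay_rate 0.5, the rate shrinks every epoch
schedule = [inverse_time_decay(0.1, 0.5, e) for e in range(5)]
```

This sidesteps the a-priori-selection problem above: a rate that starts a little too high is tamed as the decay brings it down near convergence.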
Overfitting in Neural Networks
Updating weights with a high-dimensional matrix for deep neural networks is what makes the model overfit on data.

Resolving Overfitting:

Dropout: randomly drops neurons/edges during training. The rate at which dropout happens is called the Dropout Rate (r). During test time, no dropout takes place, but the weight values are multiplied by a factor of p = 1 − r. Dropout causes a single neural network to have a different network architecture on each training pass, which improves generalization error, as the NN cannot put a higher weight value on any one of the features.

Frobenius Norm Regularization: due to the hidden layers in a NN, plain L2 regularization cannot be used; hence, to regularize the weights, the Frobenius norm is used.

Early Stopping: after the maximum validation performance is reached, the neural network is trained for a few more epochs, and if there is no improvement in performance, training stops. (Plot: training and validation error vs. number of epochs; stop training where the validation error bottoms out.)

Hyperparameters:
1. Learning Rate: controls how quickly the Neural Network reaches the global minimum. Discussed in detail under "Impact of learning rate on training".
2. Momentum: helps know the direction of the next step with the knowledge of the previous step, which helps prevent oscillations (slow convergence). Discussed in detail under Optimizers.
3. Number of epochs: controls the number of iterations a Neural Network is trained for. Too many epochs can lead to overfitting; fewer epochs may lead to underfitting.
4. Batch size: controls the number of samples passed in one iteration. An increased batch size can hinder the model's ability to generalize.
5. Initialization of weights: use the Glorot Normal or Uniform distribution, or the He Normal or Uniform distribution.
6. Setting the number of hidden layers: with an increasing number of hidden layers and neurons, the number of parameters increases, which not only makes the neural network overfit but also increases the number of computations and the training time.

WHY USE AUTOENCODERS?
The curse of dimensionality causes lots of difficulties while training a model, because it requires training a lot of parameters on a scarce dataset. PCA and autoencoders are used to tackle these issues.

Autoencoders vs PCA:

| Autoencoders | PCA |
|---|---|
| Autoencoders are NNs which are used for compressing the data into a low-dimensional latent space and then trying to reconstruct the original high-dimensional data | The main idea of PCA is to find the best value of the vector u, which is the direction of maximum variance (or maximum information), along which we should rotate our existing coordinates; the eigenvector associated with the largest eigenvalue indicates the direction in which the data has the most variance |
| Autoencoders are usually preferred when there is a need for modeling non-linearities and relatively complex relationships | PCA essentially learns a linear transformation |

Neural Networks in an Unsupervised Setting: Autoencoders
An autoencoder is a neural network that has 3 layers and is trained to reconstruct its inputs. Here, the number of input neurons is equal to the number of output neurons, and the network's target output is the same as its input. It uses dimensionality reduction to restructure the input.

[Diagram: an autoencoder with an input layer, a hidden (encoding) layer, and an output layer.]
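The 3-layer autoencoder in the diagram can be sketched as a single forward pass; the layer sizes, tanh encoder, and linear decoder are my illustrative choices, and training (minimizing the reconstruction error) is omitted:

```python
import numpy as np

rng = np.random.default_rng(7)
n_in, n_hidden = 8, 3          # bottleneck: 8 inputs compressed into 3 latents

W_enc = rng.normal(size=(n_hidden, n_in)) * 0.1
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(size=(n_in, n_hidden)) * 0.1
b_dec = np.zeros(n_in)

def autoencode(x):
    """Encode into the low-dimensional latent space, then reconstruct."""
    z = np.tanh(W_enc @ x + b_enc)   # hidden (encoding) layer
    return W_dec @ z + b_dec         # output layer: same size as the input

x = rng.normal(size=n_in)
x_hat = autoencode(x)
# The target output is the input itself; training would minimize ||x - x_hat||^2
```

Note the defining property from the text: the output has exactly as many neurons as the input, and everything must squeeze through the smaller encoding layer.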
