
Training Neural Networks
Training Schedule
Manually specify the learning rate for the entire training process

• Manually set the learning rate every n epochs
• How?
– Trial and error (the hard way)
– Some experience (only generalizes to some degree)

Consider: #epochs, training set size, network size, etc. (a minimal schedule sketch follows below)
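A minimal sketch of such a step schedule; the helper name step_lr, the drop factor, and the step size are illustrative assumptions, not from the slides:

def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Drop the learning rate by a factor of gamma every step_size epochs."""
    return base_lr * (gamma ** (epoch // step_size))

# Example: base_lr = 0.1, drop by 10x every 30 epochs
for epoch in [0, 29, 30, 59, 60]:
    print(epoch, step_lr(0.1, epoch))   # ~0.1, 0.1, 0.01, 0.01, 0.001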

Basic Recipe for Training

Learning
• Learning means generalization to an unknown dataset
– I.e., train on a known dataset → test with the optimized parameters on an unknown dataset

• Basically, we hope that the parameters optimized on the train set will give similar results on different data (i.e., the test data)
Learning
• Training set (‘train’):
– Use for training your neural network
• Validation set (‘val’) – often via cross-validation:
– Hyperparameter optimization
– Check generalization progress
• Test set (‘test’):
– Only for the very end
– NEVER TOUCH DURING DEVELOPMENT OR TRAINING
Learning
• Typical splits
– Train (60%), Val (20%), Test (20%)
– Train (80%), Val (10%), Test (10%)
– Train (98%), Val (1%), Test (1%)
– Train (80%), Cross-Val, Test (20%)

• During training:
– The train error is the average minibatch error
– Typically evaluate on a subset of the validation set every n iterations
(a minimal split sketch follows below)
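A minimal sketch of such a split, assuming NumPy arrays and a hypothetical split_dataset helper; the default fractions match the 60/20/20 example above:

import numpy as np

def split_dataset(data, labels, val_frac=0.2, test_frac=0.2, seed=0):
    """data, labels: NumPy arrays of equal length. Returns train/val/test splits."""
    n = len(data)
    idx = np.random.RandomState(seed).permutation(n)   # shuffle before splitting
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test_idx = idx[:n_test]                  # test set: only for the very end
    val_idx = idx[n_test:n_test + n_val]     # validation set: hyperparameter tuning
    train_idx = idx[n_test + n_val:]         # the rest is for training
    return (data[train_idx], labels[train_idx],
            data[val_idx], labels[val_idx],
            data[test_idx], labels[test_idx])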
Cross validation

Basic Recipe for Machine Learning

Over- and Underfitting

[Figure: underfitted vs. appropriate vs. overfitted model fits]
Source: Deep Learning by Adam Gibson, Josh Patterson, O’Reilly Media Inc., 2017
Over- and Underfitting

Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html

Learning Curves
• Training graphs: accuracy and loss over the course of training

Learning Curves
[Figure: loss curves for the validation (‘val’) and test sets over training]

Overfitting Curves
[Figure: overfitting – the validation/test loss starts to increase during training]

Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
Other Curves
[Figure: (left) underfitting – loss still decreasing; (right) validation set is easier than the training set]
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
To Summarize
• Underfitting
– Training and validation losses decrease even at the end of
training
• Overfitting
– Training loss decreases and validation loss increases
• Ideal Training
– Small gap between training and validation loss, and both go down at the same rate (stable, without fluctuations)
(a rough diagnostic sketch follows below)
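As a rough illustration, the rules above can be turned into a small check on recorded per-epoch losses; the function name and the tolerance are illustrative assumptions, not from the slides:

def diagnose(train_loss, val_loss, tol=0.05):
    """train_loss, val_loss: per-epoch losses (at least two epochs each)."""
    still_decreasing = train_loss[-1] < train_loss[-2] and val_loss[-1] < val_loss[-2]
    val_rising = val_loss[-1] > min(val_loss) + tol       # val loss turned upward
    gap = val_loss[-1] - train_loss[-1]                   # train/val gap at the end
    if val_rising and train_loss[-1] <= min(train_loss):
        return "overfitting: training loss down, validation loss up"
    if still_decreasing:
        return "underfitting: both losses still decreasing -> train longer"
    if gap < tol:
        return "looks fine: small train/val gap, both curves flat"
    return "inspect the curves manually"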
To Summarize
• Bad Signs
– Training error not going down
– Validation error not going down
– Performance on validation better than on training set
– Tests on train set different than during training
• Bad Practice
– Training set contains test data
– Debug algorithm on test data
(Never touch the test set during development or training)
Hyperparameters
• Network architecture (e.g., number of layers, number of weights)
• Number of iterations
• Learning rate(s) (i.e., solver parameters, decay, etc.)
• Regularization
• Batch size
• …
• Overall:
learning setup + optimization = hyperparameters
Hyperparameter Tuning
• Methods:
– Manual search:
• experience-based
– Grid search (structured, for ‘real’ applications):
• Define ranges for all parameter spaces and select points
• Usually pseudo-uniformly distributed
→ Iterate over all possible configurations
– Random search:
• Like grid search, but one picks points at random in the predefined ranges
– Auto-ML search:
• Bayesian framework; gradient descent on gradient descent; typically complex

[Figures: grid search vs. random search, sampling points over two hyperparameters (First Parameter × Second Parameter)]
(a minimal search sketch follows below)
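As an illustration of the two sampling strategies (not from the slides): train_and_eval is a hypothetical evaluation function, and the value ranges are examples only.

import itertools, random

lr_range = [1e-1, 1e-2, 1e-3, 1e-4]   # learning rates to try
wd_range = [0.0, 1e-5, 1e-4]          # weight decays to try

# Grid search: iterate over all possible configurations
grid_trials = list(itertools.product(lr_range, wd_range))

# Random search: pick points at random within the same predefined ranges
rng = random.Random(0)
random_trials = [(10 ** rng.uniform(-4, -1),                        # log-uniform learning rate
                  rng.choice([0.0, 10 ** rng.uniform(-6, -3)]))     # weight decay (or none)
                 for _ in range(len(grid_trials))]

# results = [(cfg, train_and_eval(*cfg)) for cfg in grid_trials + random_trials]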
How to Start
• ONE
• FEW
• MANY
Find a Good Learning Rate

[Figure: loss vs. training time for different learning rates; annotated with “Karpathy’s constant”]
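A minimal sketch of one way to pick a starting learning rate; make_model and train_short are hypothetical callables supplied by the user, and 3e-4 (often quoted as “Karpathy’s constant” for Adam) is included as a common default:

candidate_lrs = [1e-2, 3e-3, 1e-3, 3e-4, 1e-4]

def pick_lr(candidate_lrs, make_model, train_short):
    """Try each learning rate on a fresh model for a short run; keep the lowest loss."""
    best_lr, best_loss = None, float("inf")
    for lr in candidate_lrs:
        model = make_model()             # fresh model per trial (no warm starts)
        loss = train_short(model, lr)    # e.g. one epoch; returns the final training loss
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr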
Coarse Grid Search
• Choose a few values of learning rate and weight decay around what worked in the previous step
• Train a few models for a few epochs
• Good weight decay values to try: 1e-4, 1e-5, 0
Grid Search
• Not scalable when there are many hyperparameters

[Figure: grid search sampling points over two hyperparameters (First Parameter × Second Parameter)]
Refine Grid
• Pick the best models found with the coarse grid
• Refine the grid search around these models
• Train them for longer (10–20 epochs) without learning rate decay
• Study the loss curves ← the most important debugging tool!
(a refinement sketch follows below)
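A small sketch of building a refined grid around the best coarse-grid point; refine_around and the refinement factor are illustrative assumptions:

def refine_around(best_lr, best_wd, factor=3.0):
    """Return a tighter (lr, weight decay) grid centred on the best coarse-grid point."""
    lrs = [best_lr / factor, best_lr, best_lr * factor]
    wds = [best_wd / factor, best_wd, best_wd * factor] if best_wd > 0 else [0.0, 1e-5]
    return [(lr, wd) for lr in lrs for wd in wds]

# e.g. refine_around(3e-4, 1e-4) -> 9 nearby configs to train for 10-20 epochs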
Network Architecture
• Frequent mistake: “Let’s use this super big network, train for two weeks and we see where we stand.”

• Instead: start with a simple network.

• Get debug cycles down
– Ideally, minutes
Debugging
• Use train/validation/test curves
– Evaluation needs to be consistent
– Numbers need to be comparable

• Only make one change at a time
– “I’ve added 5 more layers and doubled the training set size, and now I also trained 5 days longer. Now it’s better, but why?”
Weight Initialization
Small Random Numbers
• Gaussian with zero mean and standard deviation 0.01
• Let’s see what happens:
– Network with 10 layers of 500 neurons each
– Tanh as the activation function
– Input: unit Gaussian data
(a simulation sketch follows below)
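What happens can be reproduced with a short simulation; this is a sketch assuming NumPy and the setup listed above (10 tanh layers of 500 units, weights drawn with zero mean and standard deviation 0.01, unit-Gaussian inputs):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 500))            # unit Gaussian input data

h = x
for layer in range(10):
    W = 0.01 * rng.standard_normal((500, 500))  # zero mean, standard deviation 0.01
    h = np.tanh(h @ W)                          # tanh activations
    print(f"layer {layer}: mean={h.mean():+.4f}  std={h.std():.4f}")
# The activation standard deviation collapses toward zero with depth,
# so gradients vanish in the deeper layers.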
