Training Neural Networks
Training Schedule
• Manually specify the learning rate for the entire training process
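As an illustration, a minimal sketch of a manually specified step schedule in plain Python; the milestone epochs, base rate, and decay factor are made-up example values, not prescribed by the slides.

def lr_at_epoch(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Step schedule: multiply the base learning rate by `gamma`
    each time a milestone epoch has been passed."""
    factor = sum(epoch >= m for m in milestones)
    return base_lr * (gamma ** factor)

# Example: print the manually specified schedule for a 100-epoch run.
for epoch in (0, 29, 30, 60, 95):
    print(epoch, lr_at_epoch(epoch))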
Basic Recipe for Training
Learning
• Learning means generalization to an unknown dataset
– I.e., train on a known dataset → test with the optimized parameters on an unknown dataset
• Basically, we hope that the parameters optimized on the training set will give similar results on different data (i.e., the test data)
Learning
• Training set (‘train’):
– Use for training your neural network
• Validation set (‘val’) - often cross-validation:
– Hyperparameter optimization
– Check generalization progress
• Test set (‘test’):
– Only for the very end
– NEVER TOUCH DURING DEVELOPMENT OR TRAINING
Learning
• Typical splits
– Train (60%), Val (20%), Test (20%)
– Train (80%), Val (10%), Test (10%)
– Train (98%), Val (1%), Test (1%)
– Train (80%), Cross-Val, Test (20%)
• During training:
– The training error is the average minibatch error
– Typically evaluate on a (subset of the) validation set every n iterations
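As a concrete illustration of such splits, a minimal sketch that partitions dataset indices into train/val/test subsets; the 80/10/10 ratio and the shuffling seed are arbitrary example choices.

import random

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices 0..n-1 and split them into train/val/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]               # held out, never touched during development
    val = idx[n_test:n_test + n_val]  # used for hyperparameter tuning
    train = idx[n_test + n_val:]      # used for optimizing the weights
    return train, val, test

train_idx, val_idx, test_idx = split_indices(10000)
print(len(train_idx), len(val_idx), len(test_idx))  # 8000 1000 1000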
Cross validation
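Since this slide is a figure, here is a minimal sketch of k-fold cross-validation on the training indices (k = 5 is an arbitrary example): each fold serves once as the validation set while the remaining folds are used for training.

def k_fold_splits(indices, k=5):
    """Yield (train, val) index lists; each fold is the validation set exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

for fold, (tr, va) in enumerate(k_fold_splits(list(range(20)), k=5)):
    print(f"fold {fold}: {len(tr)} train / {len(va)} val samples")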
Basic Recipe for Machine Learning
Over- and Underfitting
(Figure: underfitted vs. appropriate vs. overfitted model fits)
Source: Deep Learning by Adam Gibson and Josh Patterson, O'Reilly Media Inc., 2017
Over- and Underfitting
Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html
Learning Curves
• Training graphs: accuracy and loss curves
Learning Curves
(Figure: validation/test learning curves)
Overfitting Curves
(Figure: validation/test curves illustrating overfitting)
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
Other Curves
– Underfitting (loss still decreasing)
– Validation set is easier than the training set
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
To Summarize
• Underfitting
– Training and validation losses decrease even at the end of
training
• Overfitting
– Training loss decreases and validation loss increases
• Ideal Training
– Small gap between training and validation loss, and both decrease at the same rate (stable, without fluctuations)
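These rules can be turned into a small diagnostic helper. This is a rough sketch: the 5% rebound and 10% gap thresholds are arbitrary example values, not a definitive test.

def diagnose(train_loss, val_loss, gap_tol=0.10):
    """Roughly classify a run from its train/val loss histories (lists of floats)."""
    if val_loss[-1] > 1.05 * min(val_loss):
        return "overfitting: validation loss is rising again"
    mid = len(train_loss) // 2
    still_improving = train_loss[-1] < train_loss[mid] and val_loss[-1] < val_loss[mid]
    gap = (val_loss[-1] - train_loss[-1]) / max(abs(train_loss[-1]), 1e-8)
    if still_improving and gap >= gap_tol:
        return "possibly underfitting: both losses still falling, consider training longer"
    if gap < gap_tol:
        return "looks good: small gap between training and validation loss"
    return "inconclusive: inspect the curves"

print(diagnose([1.0, 0.6, 0.4, 0.3, 0.25, 0.22],
               [1.1, 0.7, 0.5, 0.45, 0.5, 0.6]))   # -> overfitting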
To Summarize
• Bad Signs
– Training error not going down
– Validation error not going down
– Performance on the validation set better than on the training set
– Results on the training set differ from those seen during training
• Bad Practice
– Training set contains test data
– Debugging the algorithm on test data
(Never touch the test set during development or training!)
Hyperparameters
• Network architecture (e.g., number of layers, number of weights)
• Number of iterations
• Learning rate(s) (i.e., solver parameters, decay, etc.)
• Regularization
• Batch size
• …
• Overall:
learning setup + optimization = hyperparameters
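For instance, one might collect these in a single configuration object so every run is fully described by its hyperparameters; the concrete values below are arbitrary placeholders.

from dataclasses import dataclass

@dataclass
class HyperParams:
    num_layers: int = 4           # network architecture
    hidden_width: int = 256
    num_iterations: int = 10_000
    learning_rate: float = 1e-3   # plus decay schedule / solver settings
    weight_decay: float = 1e-4    # regularization
    batch_size: int = 64

config = HyperParams()
print(config)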
Hyperparameter Tuning
• Methods:
– Manual search:
• experience-based
– Grid search (structured, for ‘real’ applications):
• Define ranges for all parameter spaces and select points
• Usually pseudo-uniformly distributed
→ Iterate over all possible configurations
– Auto-ML search:
Bayesian framework; gradient descent on gradient descent, typically complex
(Figures: sampled configurations plotted over First Parameter vs. Second Parameter, e.g. a grid-search layout)
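To make "iterate over all possible configurations" concrete, a minimal grid-search enumeration sketch; the parameter names and ranges are made-up examples.

from itertools import product

# Hypothetical search ranges; in practice these come from your own setup.
grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "weight_decay": [0.0, 1e-5, 1e-4],
    "batch_size": [32, 64],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs), "configurations")   # 3 * 3 * 2 = 18
for cfg in configs[:3]:
    print(cfg)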
How to Start
ONE → FEW → MANY → …
Find a Good Learning Rate
(Figure: loss vs. training time for different learning rates; "Karpathy's constant")
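As an illustration of why the loss curve reveals a good learning rate, a toy sketch: gradient descent on a simple quadratic with a few candidate rates. A rate that is too small barely moves, one that is too large diverges, and a good one drops quickly. The objective and candidate rates are made-up examples, not a real network.

def run(lr, steps=20, w0=5.0):
    """Gradient descent on f(w) = w^2; returns the final loss."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w          # gradient of w^2 is 2w
    return w * w

for lr in (1e-3, 1e-2, 1e-1, 1.1):   # candidate learning rates
    print(f"lr={lr:>5}: final loss = {run(lr):.4g}")
# Very small lr: loss barely decreases; lr = 1.1: the step overshoots and diverges.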
Coarse Grid Search
• Choose a few values of learning rate and weight decay around what worked in the previous step
• Train a few models for a few epochs
• Good weight decay values to try: 1e-4, 1e-5, 0
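A sketch of the coarse search itself, assuming a train_and_evaluate(lr, wd) routine that trains for a few epochs and returns a validation score; here it is stubbed with a trivial stand-in so the example runs.

def train_and_evaluate(lr, wd):
    """Stand-in for a real few-epoch training run; returns a fake validation score.
    Replace with your own short training loop."""
    return 1.0 / (1.0 + abs(lr - 1e-3) * 100 + wd * 10)   # dummy score

candidates = []
for lr in (1e-2, 3e-3, 1e-3, 3e-4):       # around what worked before
    for wd in (1e-4, 1e-5, 0.0):          # weight decays suggested above
        score = train_and_evaluate(lr, wd)
        candidates.append((score, lr, wd))

candidates.sort(reverse=True)             # keep the best-scoring configurations
print("best configs:", candidates[:3])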
Grid search
• Not scalable when there are many hyperparameters
(Figure: grid of sampled configurations over First Parameter vs. Second Parameter)
Refine Grid
• Pick best models found with coarse grid.
• Refine grid search around these models.
• Train them for longer (10-20 epochs) without learning rate decay
• Study the loss curves ← the most important debugging tool!
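One possible way to build the refined grid around the best coarse configurations; the spacing factors and the example "winners" are arbitrary.

def refine_around(best_value, factors=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Return a finer set of candidate values centered on a coarse-search winner."""
    return [best_value * f for f in factors]

best_lr, best_wd = 1e-3, 1e-5          # e.g. winners of the coarse search above
print(refine_around(best_lr))
print(refine_around(best_wd))
# Then train each (lr, wd) pair for 10-20 epochs without learning rate decay
# and study the resulting loss curves.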
Network Architecture
• Frequent mistake: “Let’s use this super big network, train it for two weeks, and see where we stand.”