DL CS 7 M4 Live Class Flow
Module 4
Course Owner: Seetha Parameswaran
Lead Instructor: Bharatesh Chakravarthi
Section Faculty: Raja vadhana Prabhakar
The designers/authors of this course deck gratefully acknowledge the original authors who made their course materials freely available online.
Course Content
Module 4
Regularization of Deep models
Agenda
What happens:
• when x = -10?
• when x = 0?
• when x = 10?
Source Credit: X. Glorot and Y. Bengio, “Understanding the Difficulty of Training Deep Feedforward Neural Networks” (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
Challenges:
• Training of Lower Layers
• Cost of Data Readiness
• (Sometimes) Noisy Data
• Speed of Model Training
• Complexity of the Model
Vanishing Gradient: Zero Saturation
What happens:
• when x = -4?
• when x = +4?
Source Credit: X. Glorot and Y. Bengio, “Understanding the Difficulty of Training Deep Feedforward Neural Networks” (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
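To make these questions concrete, here is a minimal numeric sketch (assuming the activation under discussion is the sigmoid, as in the cited material). It evaluates the sigmoid and its local gradient at the sample inputs above; the gradient is essentially zero in the saturated tails, which is what starves the lower layers during backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # local derivative of the sigmoid

# Sample the points asked about on the slides.
for x in [-10.0, -4.0, 0.0, 4.0, 10.0]:
    print(f"x = {x:+5.1f}   sigmoid(x) = {sigmoid(x):.6f}   grad = {sigmoid_grad(x):.2e}")

# The local gradient peaks at 0.25 (x = 0), drops to ~1.8e-02 at x = +/-4,
# and to ~4.5e-05 at x = +/-10: a saturated unit passes almost no gradient back.
```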
Vanishing Gradient
Source Credit: X. Glorot and Y. Bengio, “Understanding the Difficulty of Training Deep Feedforward Neural Networks” (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
Non-Zero Centered Outputs
Batch Normalization
Source Credit: G. Klambauer, T. Unterthiner and A. Mayr, “Self-Normalizing Neural Networks” (2017); S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” (2015)
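As a rough illustration of the batch-norm transform from Ioffe & Szegedy, the sketch below normalizes each feature over a mini-batch and then applies a learnable scale (gamma) and shift (beta); the function name, shapes and epsilon value are illustrative choices, not a fixed API.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features) pre-activations for one mini-batch."""
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 4) * 5.0 + 3.0     # badly scaled activations
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1 per feature
```

At test time, running averages of the mini-batch mean and variance collected during training are used in place of the batch statistics.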
Vanishing/Exploding Gradient Problem
Source Credit: G. Klambauer, T. Unterthiner and A. Mayr, “Self-Normalizing Neural Networks” (2017); S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” (2015)
Initializing neural networks
Consider linear activations for the above NN, where the weights are all matrices of size (2, 2).
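A small sketch of why initialization matters in this setting: with linear activations and the same 2x2 weight matrix repeated across layers, the forward signal shrinks or grows exponentially with depth depending on the scale of the weights (the depth of 20 and the 0.5/1.5 scales below are illustrative).

```python
import numpy as np

def forward_depth(W, x, depth):
    """Apply the same linear layer `depth` times (purely linear activations)."""
    h = x
    for _ in range(depth):
        h = W @ h
    return h

x = np.array([1.0, 1.0])
for scale in [0.5, 1.0, 1.5]:    # small, unit and large weight scales
    W = scale * np.eye(2)        # 2x2 weight matrix, as on the slide
    h = forward_depth(W, x, depth=20)
    print(f"scale = {scale}: |h| after 20 layers = {np.linalg.norm(h):.3e}")

# scale 0.5 -> ~1e-06 (vanishing), scale 1.5 -> ~5e+03 (exploding);
# gradients behave the same way, which is why careful initialization is needed.
```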
Model selection
• Model selection is the process of selecting the final model after evaluating several candidate models.
• With MLPs, compare models with
  • different numbers of hidden layers,
  • different numbers of hidden units,
  • different activation functions applied to each hidden layer.
• We should touch the test data only once, to assess the very best model or to compare a small number of models to each other.
• Use the validation dataset to determine the best among our candidate models.
• In deep learning, with millions of examples available, the split is generally (sketched below):
  • Training = 98-99 % of the original dataset
  • Validation = 1-2 % of the original dataset
  • Testing = 1-2 % of the original dataset
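A minimal sketch of such a split on shuffled example indices; the fractions, function name and dataset size are illustrative.

```python
import numpy as np

def split_indices(n, val_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle n example indices and split roughly 98/1/1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]
    return train, val, test

train, val, test = split_indices(1_000_000)
print(len(train), len(val), len(test))  # 980000 10000 10000
```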
Model complexity
• Why do we care about the bias-variance tradeoff and model complexity?
• Deep neural networks are highly complex models: many parameters, many nonlinearities.
• It is easy for them to overfit and drive training error to 0.
• Hence we need some form of regularization.
Different forms of regularization
• ℓ2 regularization
• Dataset augmentation
• Early stopping
• Ensemble methods
• Dropout
ℓ2 regularization (weight decay)
Regularized cost function: add the norm of the weight vector as a penalty term to the problem of minimizing the loss. This will ensure that the weight vector is small.
Regularized Cost function – Logistic regression
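A minimal sketch of this penalty for logistic regression: the regularized cost adds (lambda/2)·||w||² to the data loss, so the gradient picks up an extra lambda·w term that decays the weights at each update (the loss implementation and lambda value below are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost_and_grad(w, X, y, lam):
    """Logistic-regression loss with an L2 (weight decay) penalty."""
    p = sigmoid(X @ w)
    data_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    penalty = 0.5 * lam * np.sum(w ** 2)     # (lambda / 2) * ||w||^2
    grad = X.T @ (p - y) / len(y) + lam * w  # extra lambda * w "decay" term
    return data_loss + penalty, grad

X = np.random.randn(100, 3)
y = (X[:, 0] > 0).astype(float)
cost, grad = regularized_cost_and_grad(np.zeros(3), X, y, lam=0.1)
print(cost, grad)
```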
Dropout
• We initialize all the parameters (weights) of the network and start training.
• For the first training instance (or mini-batch), we apply dropout, resulting in a thinned network.
• We compute the loss and backpropagate.
• Which parameters will we update? Only those which are active (see the sketch below).
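A minimal sketch of (inverted) dropout on one layer's activations during training; the drop probability and shapes are illustrative. The same mask is reused in the backward pass, so weights feeding the dropped units receive zero gradient for this step.

```python
import numpy as np

def dropout_forward(h, drop_p=0.5, train=True):
    """Inverted dropout: zero units with probability drop_p and rescale survivors."""
    if not train or drop_p == 0.0:
        return h, None                                 # identity at test time
    mask = (np.random.rand(*h.shape) > drop_p) / (1.0 - drop_p)
    return h * mask, mask                              # activations of the thinned network

h = np.random.randn(4, 8)                              # hidden activations for a mini-batch
h_drop, mask = dropout_forward(h, drop_p=0.5)
grad_out = np.ones_like(h)                             # pretend upstream gradient
grad_h = grad_out * mask                               # zero gradient for dropped (inactive) units
print((h_drop == 0).mean())                            # roughly half the units are dropped
```

Note: the cited Hinton et al. paper scales the weights at test time instead; the inverted variant shown here rescales during training so the test-time network is left unchanged.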
Early stopping
• Track the validation error.
• Have a patience parameter p.
• If you are at step k and there was no improvement in validation error in the previous p steps, then stop training and return the model stored at step k − p.
• Basically, stop the training early before it drives the training error to 0 and blows up the validation error.
[Figure: training error and validation error vs. steps; training stops at step k and the model stored at step k − p is returned.]
Source Credit: Mitesh M. Khapra
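A minimal sketch of this patience rule; train_step and validation_error are assumed placeholder callables, and the model is simply deep-copied to remember the best checkpoint.

```python
import copy

def train_with_early_stopping(model, train_step, validation_error,
                              patience=5, max_steps=1000):
    """Stop once the validation error has not improved for `patience` steps."""
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    waited = 0
    for step in range(max_steps):
        train_step(model)                 # one optimization step (or epoch)
        err = validation_error(model)
        if err < best_err:                # improvement: remember this model
            best_err, best_model, waited = err, copy.deepcopy(model), 0
        else:
            waited += 1
            if waited >= patience:        # no improvement in the last p steps
                break                     # i.e. return the model from step k - p
    return best_model
```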
Dataset augmentation
[Figure: an example image with label = 2, together with augmented variants that keep the same label.]
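A minimal sketch of label-preserving augmentation: each slightly shifted copy of an image keeps its original label, which enlarges the effective training set (the shift range, copy count and toy image below are illustrative).

```python
import numpy as np

def augment(image, label, n_copies=4, max_shift=2, seed=0):
    """Return small-shift variants of one H x W image; the label is unchanged."""
    rng = np.random.default_rng(seed)
    out = [(image, label)]                                         # keep the original
    for _ in range(n_copies):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)  # small translation
        out.append((shifted, label))                               # same label, e.g. 2
    return out

digit = np.zeros((28, 28)); digit[4:24, 10:18] = 1.0               # toy stand-in for an image
augmented = augment(digit, label=2)
print(len(augmented))                                              # original + 4 shifted copies
```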
References
• Dropout paper: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
• Reference textbook: Dive into Deep Learning, Sections 5.4, 5.5 and 5.6 (online version)
• IIT M CS7015 (Deep Learning): Lecture 8
Thank you