DL CS 7 M4 Live Class Flow

This document outlines the content and agenda for a Deep Learning course module, focusing on the regularization of deep models. It covers various topics such as model selection, regularization techniques, challenges in training deep networks, and strategies for improving model performance. Key concepts include dropout, L1 and L2 regularization, batch normalization, and the bias-variance tradeoff.


Deep Learning

Module 4
Course Owner: Seetha Parameswaran | Lead Instructor: Bharatesh Chakravarthi | Section Faculty: Raja vadhana Prabhakar
The designers/authors of this course deck gratefully acknowledge the original authors who made their course materials freely available online.

Course Content

● Fundamentals of Neural Networks


● Multilayer Perceptron
● Deep Feedforward Neural Network
● Improve the DNN performance by Optimization and Regularization
● Convolutional Neural Networks
● Sequence Models
● Attention Mechanism
● Representational Learning
● Generative Adversarial Networks

Module 4
Regularization of Deep models

Agenda

• Model Selection, Underfitting, Overfitting
• L1 and L2 Regularization
• Dropout
• Challenge: Vanishing and Exploding Gradients
• Parameter Initialization
• Challenge: Covariate Shift
• Batch Normalization



DNN – General Strategy
Challenges:
• Training of Lower Layers
• Cost of Data Readiness
• (Sometimes) Noisy Data
• Speed of Model Training
• Complexity of the Model

Strategy:
I. Design the architecture of the network
II. Choose the activation function to compute the hidden layer values
III. Choose the cost function
IV. Choose the optimizer algorithm
V. Train the feedforward network
VI. Evaluate the performance of the network
Vanishing Gradient, Non-Zero Centered Outputs, Zero Saturation

What happens:
• when x = -10?
• when x = 0?
• when x = 10?
(A numerical sketch follows below.)

Source Credit: “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” X. Glorot, Y. Bengio (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
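The questions above concern a saturating activation. A minimal numerical sketch, assuming the activation in question is the sigmoid (whose non-zero-centered output and saturation match the issues listed), of its gradient at x = -10, 0, 10:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative of the sigmoid

for x in (-10.0, 0.0, 10.0):
    print(f"x = {x:6.1f}   sigma(x) = {sigmoid(x):.5f}   d sigma/dx = {sigmoid_grad(x):.5f}")

# At x = -10 and x = +10 the gradient is ~4.5e-5 (the unit is saturated);
# only near x = 0 is it appreciable (0.25). Gradients passing through many
# saturated units therefore shrink towards zero in the lower layers.
```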
Vanishing Gradient, Zero Saturation

What happens:
• when x = -4?
• when x = +4?

Source Credit: “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” X. Glorot, Y. Bengio (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
Vanishing Gradient

Source Credit: “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” X. Glorot, Y. Bengio (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
Non-Zero Centered Outputs
Batch Normalization

Source Credit: “Self-Normalizing Neural Networks,” G. Klambauer, T. Unterthiner and A. Mayr (2017); “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” S. Ioffe and C. Szegedy (2015)
Vanishing/Exploding Gradient Problem

Source Credit: “Self-Normalizing Neural Networks,” G. Klambauer, T. Unterthiner and A. Mayr (2017); “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” S. Ioffe and C. Szegedy (2015)
Initializing neural networks

• The choice of initialization is crucial for maintaining numerical stability.
• The choice of initialization can be tied up in interesting ways with the choice of the nonlinear activation function.
• Which function we choose and how we initialize parameters can determine how quickly our optimization algorithm converges.
• Poor choices can cause us to encounter exploding or vanishing gradients while training.

Zero initialization for weights:
• All the neurons learn the same features during training.
• Hidden units will have identical influence on the cost, which will lead to identical gradients.
Initializing neural networks

Consider a linear activation for the above NN, with all weights as matrices of size (2, 2).

A too-large initialization:
• Values of 𝑎[𝑙] increase exponentially with 𝑙.
• Gradients explode due to large activations, leading to the exploding gradient problem.
• The cost oscillates around its minimum.

A too-small initialization:
• Values of 𝑎[𝑙] decrease exponentially with 𝑙.
• Gradients vanish due to small activations, leading to the vanishing gradient problem.
• Training fails to converge.
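A minimal sketch (not from the deck; NumPy, a deep stack of linear layers with i.i.d. Gaussian weights) illustrating how the activation magnitude shrinks or grows exponentially with depth depending on the initialization scale:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256
a0 = rng.standard_normal((width, 1))

for scale, label in [(0.01, "too small"), (1.0 / np.sqrt(width), "1/sqrt(n)"), (0.3, "too large")]:
    a = a0
    for _ in range(depth):
        W = scale * rng.standard_normal((width, width))  # i.i.d. zero-mean Gaussian weights
        a = W @ a                                         # linear activation, as on the slide
    print(f"{label:>10}: mean |a^[L]| after {depth} layers = {np.abs(a).mean():.3e}")

# Too-small weights: activations (and hence gradients) shrink exponentially with depth.
# Too-large weights: activations blow up exponentially.
# A variance-scaled choice (~1/sqrt(n)) keeps the magnitude roughly constant.
```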
Xavier Initialization
• Samples weights from a Gaussian distribution with zero mean and variance σ² = 2 / (fan_in + fan_out)
• n_i = size of the i-th layer; fan_in = n_i, fan_out = n_{i+1}
• When fan_in = fan_out, this reduces to the LeCun initialization: σ² = 1 / fan_in
• Kaiming He initialization uses only half of fan_in: σ² = 2 / fan_in
• Now standard and practically beneficial
(An implementation sketch follows below.)

https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
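A minimal sketch (NumPy; Gaussian variants of the schemes, with fan_in/fan_out as defined above) of Xavier, LeCun, and He initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Glorot & Bengio (2010): zero-mean Gaussian with variance 2 / (fan_in + fan_out)
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

def lecun_init(fan_in, fan_out):
    # Variance 1 / fan_in (coincides with Xavier when fan_in == fan_out)
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

def he_init(fan_in, fan_out):
    # Kaiming He: variance 2 / fan_in, i.e. "only half of fan_in", intended for ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W = he_init(256, 128)
print(W.std())  # ~ sqrt(2/256) ~ 0.088
```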
Fit of the model
• Underfitting: high training loss, high validation loss
• Good fit: low training loss, low validation loss, little gap between both
• Overfitting: low training loss, high validation loss

Slide credit: Andrew Ng


Factors that influence the generalizability of a model

1. The number of tunable parameters.


• When the number of tunable parameters, called the
degrees of freedom, is large, models tend to be more
susceptible to overfitting.
2. The values taken by the parameters.
• When weights can take a wider range of values, models
can be more susceptible to overfitting.
3. The number of training examples.
• It is trivially easy to overfit a dataset containing only one
or two examples even if your model is simple.
• But overfitting a dataset with millions of examples
requires an extremely flexible model.
Bias-Variance

• Simple models trained on different samples of the data do not differ much from each other.
• However, they are very far from the true sinusoidal curve (underfitting).
• On the other hand, complex models trained on different samples of the data are very different from each other (high variance).

Simple model: high bias, low variance
Complex model: low bias, high variance

Slide credit: IITM CS7015


Model complexity

• Simple models and abundant data


• Expect the generalization error to resemble the training error.
• More complex models and fewer examples
• Expect the training error to go down but the generalization gap to grow.
• Model complexity
• A model with more parameters might be considered more complex.
• A model whose parameters can take a wider range of values might be more
complex.
• A neural network model that takes more training iterations is more complex, and
• one subject to early stopping (fewer training iterations) is less complex.
Model complexity



Model complexity

• Let there be n training points and m test (validation) points.
• As the model complexity increases, the training error becomes overly optimistic and gives us a wrong picture of how close f̂ is to f.
• The validation error gives the real picture of how close f̂ is to f.

Source Credit: Mitesh M. Khapra
Model selection

• Model selection is the process of selecting the final model after evaluating several
candidate models.
• With MLPs, compare models with
• different numbers of hidden layers,
• different numbers of hidden units
• different activation functions applied to each hidden layer.
• We should touch the test data only once, to assess the very best model or to compare a small number of models to each other.
• Use the validation dataset to determine the best among our candidate models.
• In deep learning, with millions of examples available, the split is generally:
  • Training = 98-99 % of the original dataset
  • Validation = 1-2 % of the original dataset
  • Testing = 1-2 % of the original dataset
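A minimal sketch (NumPy only; array names are hypothetical) of such a 98/1/1-style split:

```python
import numpy as np

def split_dataset(X, y, val_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle and split into train / validation / test (e.g. 98/1/1 for large datasets)."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

X = np.random.randn(100_000, 20)                 # hypothetical features
y = np.random.randint(0, 2, size=100_000)        # hypothetical labels
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 98000 1000 1000
```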
Model selection
Model complexity

• Why do we care about this bias-variance tradeoff and model complexity?
• Deep neural networks are highly complex models: many parameters, many nonlinearities.
• It is easy for them to overfit and drive training error to 0.
• Hence we need some form of regularization.
Different forms of regularization

• l2 regularization
• Dataset augmentation
• Early stopping
• Ensemble methods
• Dropout
l2 regularization - weight decay

Regularized Cost function: add the norm as a penalty term to the problem of minimizing the loss. This will ensure that the weight vector is small.

  L̃(w) = L(w) + (α/2) ‖w‖²

Regularized Cost function – Logistic regression: gradient descent update with the weight-decay term

  w_{t+1} = w_t − η ∇L(w_t) − η α w_t

w0 is not regularized.
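A minimal sketch (NumPy; function and variable names are hypothetical) of a single SGD update with the weight-decay term, leaving the bias unregularized as noted above:

```python
import numpy as np

def sgd_step_with_weight_decay(w, b, grad_w, grad_b, eta=0.1, alpha=1e-4):
    """One SGD update with l2 weight decay; the bias b is not regularized."""
    w_new = w - eta * grad_w - eta * alpha * w   # w_{t+1} = w_t - eta*grad(L) - eta*alpha*w_t
    b_new = b - eta * grad_b                     # no decay term for the bias
    return w_new, b_new

w, b = np.random.randn(10), 0.0                  # hypothetical parameters
grad_w, grad_b = np.random.randn(10), 0.1        # hypothetical gradients of the loss
w, b = sgd_step_with_weight_decay(w, b, grad_w, grad_b)
```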
Regularized Cost function – Neural network
Drop out
• Dropout refers to dropping out units.
• Temporarily remove a node and all its incoming/outgoing connections, resulting in a thinned network.
• Each node is retained with a fixed probability, typically p = 0.5 for hidden nodes and p = 0.8 for visible (input) nodes.

Source Credit: Mitesh M. Khapra
Drop out

• Suppose a neural network has n nodes.
• Using the dropout idea, each node can be retained or dropped.
• For example, in the above case we drop 5 nodes to get a thinned network.
• Given a total of n nodes, what is the total number of thinned networks that can be formed? 2^n
• We cannot possibly train so many networks.
• Trick: (1) Share the weights across all the networks. (2) Sample a different network for each training instance.
Drop out

• We initialize all the parameters (weights) of the network and start training
• For the first training instance (or mini-batch), we apply dropout resulting in
the thinned network
• We compute the loss and back propagate
• Which parameters will we update? Only those which are active
Drop out

• For the second training instance (or mini-batch), we again apply


dropout resulting in a different thinned network
• We again compute the loss and back propagate to the active weights
• If the weight was active for both the training instances then it would
have received two updates by now
• If the weight was active for only one of the training instances then it would have received only one update by now
• Parameter sharing ensures that no model has untrained or poorly
trained parameters
Drop out

• Prevents hidden units from co-adaptation.
• Dropout gives a smaller neural network, giving the effect of regularization.
• In general (see the sketch below):
  • Vary the keep probability (0.5 to 0.8) for each hidden layer.
  • The input layer has a keep probability of 1.0 or 0.9.
  • The output layer has a keep probability of 1.0.
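A minimal sketch of a dropout forward pass (NumPy; this is the inverted-dropout variant, which rescales by the keep probability at training time so that nothing changes at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(h, keep_prob, train=True):
    """Inverted dropout: zero each unit with probability (1 - keep_prob) and rescale."""
    if not train or keep_prob >= 1.0:
        return h                              # at test time the full network is used as-is
    mask = (rng.random(h.shape) < keep_prob).astype(h.dtype)
    return h * mask / keep_prob               # rescaling keeps the expected activation unchanged

h = rng.standard_normal((4, 8))               # hypothetical hidden-layer activations
h_dropped = dropout_layer(h, keep_prob=0.5)
print((h_dropped == 0).mean())                # roughly half the units are dropped
```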
Early stopping

• Track the validation error.
• Have a patience parameter p.
• If you are at step k and there was no improvement in validation error in the previous p steps, then stop training and return the model stored at step k − p (a sketch follows below).
• Basically, stop the training early before it drives the training error to 0 and blows up the validation error.

[Figure: Error vs. Steps, showing the training error and validation error curves; training stops at step k and the model stored at step k − p is returned.]

Source Credit: Mitesh M. Khapra
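A minimal sketch of early stopping with a patience parameter (the train_one_epoch and evaluate callbacks are hypothetical placeholders):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, patience=5, max_steps=1000):
    """Stop when the validation error has not improved for `patience` consecutive steps."""
    best_val = float("inf")
    best_model = copy.deepcopy(model)
    steps_since_best = 0
    for step in range(max_steps):
        train_one_epoch(model)                 # one pass over the training data
        val_err = evaluate(model)              # validation error at this step
        if val_err < best_val:
            best_val, best_model = val_err, copy.deepcopy(model)
            steps_since_best = 0               # improvement: reset the patience counter
        else:
            steps_since_best += 1
            if steps_since_best >= patience:   # no improvement in the previous p steps
                break
    return best_model                          # the model stored at step k - p
```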
Early stopping
Dataset augmentation

[Figure: given training data, e.g. an image with label = 2, alongside augmented data created using some knowledge of the task]

• We exploit the fact that certain transformations to the image do not change the label of the image (a sketch follows below).
• Typically, more data = better learning.
• Works well for image classification / object recognition tasks. Also shown to work well for speech.
• For some tasks it may not be clear how to generate such data.
Slide credit: IITM CS7015
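A minimal sketch (NumPy; the image array is hypothetical) of label-preserving transformations such as small translations and mild noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly transformed copy whose label is unchanged."""
    dy, dx = rng.integers(-2, 3, size=2)                    # small random translation
    out = np.roll(image, shift=(dy, dx), axis=(0, 1))
    out = out + 0.01 * rng.standard_normal(out.shape)       # mild pixel noise
    return np.clip(out, 0.0, 1.0)

image = rng.random((28, 28))                                # hypothetical grayscale digit
extra = [augment(image) for _ in range(8)]                  # 8 extra examples, same label
```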
Ensemble - Bagging

Each model is trained with a different sample of the data (sampling with replacement).

Ensemble - Bagging
• Typically model averaging (bagging ensemble) always helps.
• Training several large neural networks for making an ensemble is prohibitively expensive.
• Option 1: Train several neural networks having different architectures (obviously expensive).
• Option 2: Train multiple instances of the same network using different training samples (again expensive).
• Even if we manage to train with option 1 or option 2, combining several models at test time is infeasible in real-time applications.

Source Credit: Mitesh M. Khapra
References

• https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
• Textbook: Dive into Deep Learning, Sections 5.4, 5.5, 5.6 (online version)
• IITM CS7015 (Deep Learning): Lecture 8
Thank you
