DL CS 7 M4 Live Class Flow

This document outlines the content and agenda for a Deep Learning course module, focusing on the regularization of deep models. It covers various topics such as model selection, regularization techniques, challenges in training deep networks, and strategies for improving model performance. Key concepts include dropout, L1 and L2 regularization, batch normalization, and the bias-variance tradeoff.


Deep Learning

Module 4
Course Owner: Seetha Parameswaran | Lead Instructor: Bharatesh Chakravarthi | Section Faculty: Raja vadhana Prabhakar
The designers/authors of this course deck gratefully acknowledge the original authors who made their course materials freely available online.

Course Content

● Fundamentals of Neural Networks


● Multilayer Perceptron
● Deep Feedforward Neural Network
● Improve the DNN performance by Optimization and Regularization
● Convolutional Neural Networks
● Sequence Models
● Attention Mechanism
● Representational Learning
● Generative Adversarial Networks

Module 4
Regularization of Deep models

Agenda

• Model Selection, Underfitting, Overfitting
• L1 and L2 Regularization
• Dropout
• Challenge: Vanishing and Exploding Gradients
• Parameter Initialization
• Challenge: Covariate Shift
• Batch Normalization



DNN – General Strategy
Challenges:
• Training of Lower Layers
• Cost of Data Readiness
• (Sometimes) Noisy Data
• Speed of Model Training
• Complexity of the Model

Strategy:
I. Design the architecture of the network
II. Choose the activation function to compute the hidden layer values
III. Choose the cost function
IV. Choose the optimizer algorithm
V. Train the feedforward network
VI. Evaluate the performance of the network
Vanishing Gradient, Non-Zero Centered Outputs, Zero Saturation

What happens:
• when x = -10?
• when x = 0?
• when x = 10?
(A numerical sketch follows below.)

Source Credit: “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” X. Glorot, Y. Bengio (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
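The questions above concern a saturating activation. A minimal numerical sketch, assuming the activation in question is the sigmoid (whose non-zero-centered output and saturation match the issues listed), of its gradient at x = -10, 0, 10:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative of the sigmoid

for x in (-10.0, 0.0, 10.0):
    print(f"x = {x:6.1f}   sigma(x) = {sigmoid(x):.5f}   d sigma/dx = {sigmoid_grad(x):.5f}")

# At x = -10 and x = +10 the gradient is ~4.5e-5 (the unit is saturated);
# only near x = 0 is it appreciable (0.25). Gradients passing through many
# saturated units therefore shrink towards zero in the lower layers.
```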
Vanishing Gradient, Zero Saturation

What happens:
• when x = -4?
• when x = +4?

Source Credit: “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” X. Glorot, Y. Bengio (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
Vanishing Gradient

Source Credit: “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” X. Glorot, Y. Bengio (2010); Fei-Fei Li, Justin Johnson & Serena Yeung; https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
Non-Zero Centered Outputs
Batch Normalization

Source Credit: “Self-Normalizing Neural Networks,” G. Klambauer, T. Unterthiner and A. Mayr (2017); “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” S. Ioffe and C. Szegedy (2015)
Vanishing/Exploding Gradient Problem

Source Credit: “Self-Normalizing Neural Networks,” G. Klambauer, T. Unterthiner and A. Mayr (2017); “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” S. Ioffe and C. Szegedy (2015)
Initializing neural networks

• The choice of initialization is crucial for maintaining numerical stability.
• The choice of initialization can be tied up in interesting ways with the choice of the nonlinear activation function.
• Which function we choose and how we initialize parameters can determine how quickly our optimization algorithm converges.
• Poor choices can cause us to encounter exploding or vanishing gradients while training.

Zero initialization for weights:
• All the neurons learn the same features during training.
• Hidden units will have identical influence on the cost, which will lead to identical gradients.
Initializing neural networks

Consider a linear activation for the above NN, with all weights as matrices of size (2, 2).

A too-large initialization:
• Values of 𝑎[𝑙] increase exponentially with 𝑙.
• Gradients explode due to large activations, leading to the exploding gradient problem.
• The cost oscillates around its minimum.

A too-small initialization:
• Values of 𝑎[𝑙] decrease exponentially with 𝑙.
• Gradients vanish due to small activations, leading to the vanishing gradient problem.
• Training fails to converge.
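A minimal sketch (not from the deck; NumPy, a deep stack of linear layers with i.i.d. Gaussian weights) illustrating how the activation magnitude shrinks or grows exponentially with depth depending on the initialization scale:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256
a0 = rng.standard_normal((width, 1))

for scale, label in [(0.01, "too small"), (1.0 / np.sqrt(width), "1/sqrt(n)"), (0.3, "too large")]:
    a = a0
    for _ in range(depth):
        W = scale * rng.standard_normal((width, width))  # i.i.d. zero-mean Gaussian weights
        a = W @ a                                         # linear activation, as on the slide
    print(f"{label:>10}: mean |a^[L]| after {depth} layers = {np.abs(a).mean():.3e}")

# Too-small weights: activations (and hence gradients) shrink exponentially with depth.
# Too-large weights: activations blow up exponentially.
# A variance-scaled choice (~1/sqrt(n)) keeps the magnitude roughly constant.
```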
Xavier Initialization
• Samples weights from a Gaussian distribution with zero mean and variance σ² = 2 / (fan_in + fan_out)
• n_i = size of the i-th layer; fan_in = n_i, fan_out = n_{i+1}
• When fan_in = fan_out, this reduces to the LeCun initialization: σ² = 1 / fan_in
• Kaiming He initialization uses only half of fan_in: σ² = 2 / fan_in
• Now standard and practically beneficial
(An implementation sketch follows below.)

https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
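A minimal sketch (NumPy; Gaussian variants of the schemes, with fan_in/fan_out as defined above) of Xavier, LeCun, and He initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Glorot & Bengio (2010): zero-mean Gaussian with variance 2 / (fan_in + fan_out)
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

def lecun_init(fan_in, fan_out):
    # Variance 1 / fan_in (coincides with Xavier when fan_in == fan_out)
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

def he_init(fan_in, fan_out):
    # Kaiming He: variance 2 / fan_in, i.e. "only half of fan_in", intended for ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W = he_init(256, 128)
print(W.std())  # ~ sqrt(2/256) ~ 0.088
```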
Fit of the model
• Underfitting: high training loss, high validation loss
• Good fit: low training loss, low validation loss, little gap between both
• Overfitting: low training loss, high validation loss

Slide credit: Andrew Ng


Factors that influence the generalizability of a model

1. The number of tunable parameters.


• When the number of tunable parameters, called the
degrees of freedom, is large, models tend to be more
susceptible to overfitting.
2. The values taken by the parameters.
• When weights can take a wider range of values, models
can be more susceptible to overfitting.
3. The number of training examples.
• It is trivially easy to overfit a dataset containing only one
or two examples even if your model is simple.
• But overfitting a dataset with millions of examples
requires an extremely flexible model.
Bias-Variance

• Simple models trained on different samples of the data do not differ much from each other.
• However, they are very far from the true sinusoidal curve (underfitting).
• On the other hand, complex models trained on different samples of the data are very different from each other (high variance).

Simple model: high bias, low variance
Complex model: low bias, high variance

Slide credit: IITM CS7015


Model complexity

• Simple models and abundant data


• Expect the generalization error to resemble the training error.
• More complex models and fewer examples
• Expect the training error to go down but the generalization gap to grow.
• Model complexity
• A model with more parameters might be considered more complex.
• A model whose parameters can take a wider range of values might be more
complex.
• A neural network model that takes more training iterations is more complex, and
• one subject to early stopping (fewer training iterations) is less complex.
Model complexity



Model complexity

• Let there be n training points and m test (validation) points.
• As the model complexity increases, the training error becomes overly optimistic and gives us a wrong picture of how close f̂ is to f.
• The validation error gives the real picture of how close f̂ is to f.

Source Credit: Mitesh M. Khapra
Model selection

• Model selection is the process of selecting the final model after evaluating several
candidate models.
• With MLPs, compare models with
• different numbers of hidden layers,
• different numbers of hidden units
• different activation functions applied to each hidden layer.
• We should touch the test data only once, to assess the very best model or to compare a small number of models to each other.
• Use the validation dataset to determine the best among our candidate models.
• In deep learning, with millions of examples available, the split is generally:
  • Training = 98-99 % of the original dataset
  • Validation = 1-2 % of the original dataset
  • Testing = 1-2 % of the original dataset
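A minimal sketch (NumPy only; array names are hypothetical) of such a 98/1/1-style split:

```python
import numpy as np

def split_dataset(X, y, val_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle and split into train / validation / test (e.g. 98/1/1 for large datasets)."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

X = np.random.randn(100_000, 20)                 # hypothetical features
y = np.random.randint(0, 2, size=100_000)        # hypothetical labels
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 98000 1000 1000
```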
Model selection
Model complexity

• Why do we care about this bias-variance tradeoff and model complexity?
• Deep neural networks are highly complex models: many parameters, many nonlinearities.
• It is easy for them to overfit and drive training error to 0.
• Hence we need some form of regularization.
Different forms of regularization

• l2 regularization
• Dataset augmentation
• Early stopping
• Ensemble methods
• Dropout
l2 regularization - weight decay

Regularized Cost function: add the norm as a penalty term to the problem of minimizing the loss. This will ensure that the weight vector is small.

  L̃(w) = L(w) + (α/2) ‖w‖²

Regularized Cost function – Logistic regression: gradient descent update with the weight-decay term

  w_{t+1} = w_t − η ∇L(w_t) − η α w_t

w0 is not regularized.
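A minimal sketch (NumPy; function and variable names are hypothetical) of a single SGD update with the weight-decay term, leaving the bias unregularized as noted above:

```python
import numpy as np

def sgd_step_with_weight_decay(w, b, grad_w, grad_b, eta=0.1, alpha=1e-4):
    """One SGD update with l2 weight decay; the bias b is not regularized."""
    w_new = w - eta * grad_w - eta * alpha * w   # w_{t+1} = w_t - eta*grad(L) - eta*alpha*w_t
    b_new = b - eta * grad_b                     # no decay term for the bias
    return w_new, b_new

w, b = np.random.randn(10), 0.0                  # hypothetical parameters
grad_w, grad_b = np.random.randn(10), 0.1        # hypothetical gradients of the loss
w, b = sgd_step_with_weight_decay(w, b, grad_w, grad_b)
```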
Regularized Cost function – Neural network
Drop out
• Dropout refers to dropping out units.
• Temporarily remove a node and all its incoming/outgoing connections, resulting in a thinned network.
• Each node is retained with a fixed probability, typically p = 0.5 for hidden nodes and p = 0.8 for visible (input) nodes.

Source Credit: Mitesh M. Khapra
Drop out

• Suppose a neural network has n nodes.
• Using the dropout idea, each node can be retained or dropped.
• For example, in the above case we drop 5 nodes to get a thinned network.
• Given a total of n nodes, what is the total number of thinned networks that can be formed? 2^n
• We cannot possibly train so many networks.
• Trick: (1) Share the weights across all the networks. (2) Sample a different network for each training instance.
Drop out

• We initialize all the parameters (weights) of the network and start training
• For the first training instance (or mini-batch), we apply dropout resulting in
the thinned network
• We compute the loss and back propagate
• Which parameters will we update? Only those which are active
Drop out

• For the second training instance (or mini-batch), we again apply


dropout resulting in a different thinned network
• We again compute the loss and back propagate to the active weights
• If the weight was active for both the training instances then it would
have received two updates by now
• If the weight was active for only one of the training instances then it would have received only one update by now
• Parameter sharing ensures that no model has untrained or poorly
trained parameters
Drop out

• Prevents hidden units from co-adaptation.
• Dropout gives a smaller neural network, giving the effect of regularization.
• In general (see the sketch below):
  • Vary the keep probability (0.5 to 0.8) for each hidden layer.
  • The input layer has a keep probability of 1.0 or 0.9.
  • The output layer has a keep probability of 1.0.
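A minimal sketch of a dropout forward pass (NumPy; this is the inverted-dropout variant, which rescales by the keep probability at training time so that nothing changes at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(h, keep_prob, train=True):
    """Inverted dropout: zero each unit with probability (1 - keep_prob) and rescale."""
    if not train or keep_prob >= 1.0:
        return h                              # at test time the full network is used as-is
    mask = (rng.random(h.shape) < keep_prob).astype(h.dtype)
    return h * mask / keep_prob               # rescaling keeps the expected activation unchanged

h = rng.standard_normal((4, 8))               # hypothetical hidden-layer activations
h_dropped = dropout_layer(h, keep_prob=0.5)
print((h_dropped == 0).mean())                # roughly half the units are dropped
```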
Early stopping

• Track the validation error.
• Have a patience parameter p.
• If you are at step k and there was no improvement in validation error in the previous p steps, then stop training and return the model stored at step k − p (a sketch follows below).
• Basically, stop the training early before it drives the training error to 0 and blows up the validation error.

[Figure: Error vs. Steps, showing the training error and validation error curves; training stops at step k and the model stored at step k − p is returned.]

Source Credit: Mitesh M. Khapra
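A minimal sketch of early stopping with a patience parameter (the train_one_epoch and evaluate callbacks are hypothetical placeholders):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, patience=5, max_steps=1000):
    """Stop when the validation error has not improved for `patience` consecutive steps."""
    best_val = float("inf")
    best_model = copy.deepcopy(model)
    steps_since_best = 0
    for step in range(max_steps):
        train_one_epoch(model)                 # one pass over the training data
        val_err = evaluate(model)              # validation error at this step
        if val_err < best_val:
            best_val, best_model = val_err, copy.deepcopy(model)
            steps_since_best = 0               # improvement: reset the patience counter
        else:
            steps_since_best += 1
            if steps_since_best >= patience:   # no improvement in the previous p steps
                break
    return best_model                          # the model stored at step k - p
```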
Early stopping
Dataset augmentation

[Figure: given training data, e.g. an image with label = 2, alongside augmented data created using some knowledge of the task]

• We exploit the fact that certain transformations to the image do not change the label of the image (a sketch follows below).
• Typically, more data = better learning.
• Works well for image classification / object recognition tasks. Also shown to work well for speech.
• For some tasks it may not be clear how to generate such data.
Slide credit: IITM CS7015
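A minimal sketch (NumPy; the image array is hypothetical) of label-preserving transformations such as small translations and mild noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly transformed copy whose label is unchanged."""
    dy, dx = rng.integers(-2, 3, size=2)                    # small random translation
    out = np.roll(image, shift=(dy, dx), axis=(0, 1))
    out = out + 0.01 * rng.standard_normal(out.shape)       # mild pixel noise
    return np.clip(out, 0.0, 1.0)

image = rng.random((28, 28))                                # hypothetical grayscale digit
extra = [augment(image) for _ in range(8)]                  # 8 extra examples, same label
```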
Ensemble - Bagging

Each model is trained with a different sample of the data (sampling with replacement).

Ensemble - Bagging
• Typically model averaging (bagging ensemble) always helps.
• Training several large neural networks for making an ensemble is prohibitively expensive.
• Option 1: Train several neural networks having different architectures (obviously expensive).
• Option 2: Train multiple instances of the same network using different training samples (again expensive).
• Even if we manage to train with option 1 or option 2, combining several models at test time is infeasible in real-time applications.

Source Credit: Mitesh M. Khapra
References

• https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
• Textbook: Dive into Deep Learning, Sections 5.4, 5.5, 5.6 (online version)
• IITM CS7015 (Deep Learning): Lecture 8
Thank you
