Deep Learning (MODULE-2)

deep learning mod-2

Uploaded by

Shivanshu Tiwari
BCSE332L

DEEP LEARNING

Module:2
IMPROVING DEEP NEURAL NETWORKS
1. Mini-Batch Gradient Descent
2. Exponentially Weighted Averages
3. Gradient Descent with Momentum
4. RMSProp and Adam Optimization
5. Hyperparameter Tuning
6. Batch Normalization
7. Softmax Regression
8. Softmax Classifier
9. Deep Learning Frameworks
10. Data Augmentation
1.) Mini-Batch Gradient Descent
Why Gradient Descent?
Gradient descent is an algorithm that minimizes a cost function by
iteratively adjusting its parameters.
We start with a random guess and move step by step toward the
best answer.
Need – Parameter Optimization
Formula:
New value = old value - step size
Where,
step size = learning rate x slope.
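The update rule above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical one-parameter cost J(x) = (x - 3)^2 whose slope is 2(x - 3); the function names, learning rate and number of steps are illustrative choices, not taken from the slides.

```python
def slope(x):
    """Slope (derivative) of the assumed cost J(x) = (x - 3)**2."""
    return 2.0 * (x - 3.0)


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeat the rule: new value = old value - learning rate * slope."""
    x = start
    for _ in range(steps):
        step_size = learning_rate * slope(x)
        x = x - step_size
    return x


x_min = gradient_descent(start=0.0)
print(x_min)  # approaches 3, the minimizer of the assumed cost
```

Each step moves the value against the slope, so the guess drifts toward the point where the slope is zero.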
Note:
If the learning rate is too high, you might jump
across the valley and end up on the other side,
possibly even higher up than you were before.
This might make the algorithm diverge, with
larger and larger values, failing to find a good solution.
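The effect of the learning rate can be checked on a toy problem. The sketch below assumes the cost J(x) = x^2 (slope 2x); the two learning rates are hypothetical values chosen only to show one converging run and one diverging run.

```python
def run(learning_rate, start=1.0, steps=10):
    """Gradient descent on the assumed cost J(x) = x**2 (slope 2x)."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x
    return x


x_small_lr = run(learning_rate=0.1)  # each step multiplies x by 0.8
x_large_lr = run(learning_rate=1.1)  # each step multiplies x by -1.2

# The first value shrinks toward 0; the second grows in magnitude.
print(abs(x_small_lr), abs(x_large_lr))
```

With the larger rate every step overshoots the minimum and lands farther away on the other side, which is the divergence described in the note.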

What is Mini-Batch Gradient Descent?
This is a compromise between batch and
stochastic gradient descent, where the algorithm
calculates the gradient of the cost function with
respect to the parameters for a small batch of
training examples at each iteration.
This can provide a good balance between speed
and stability.
We use neither the whole dataset at once nor
a single example at a time.
We use a batch of a fixed number of training
examples, smaller than the full dataset, and call
it a mini-batch.
Doing this helps us achieve the advantages of
both of the former variants.

So, after creating the mini-batches of fixed size,


The main advantage of Mini-batch GD over
Stochastic GD is that you can get a performance boost
from hardware optimization of matrix operations.
This method offers a compromise between
speed and stability, making it a popular choice in
deep learning applications.
Mini-Batch Gradient Descent is like a skilled
juggler, managing the trade-off between
computational efficiency and the fidelity of the error
gradient.
It processes data in smaller, manageable
chunks, allowing quicker and more frequent updates
than batch gradient descent while remaining more stable and
efficient than the stochastic approach.
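A minimal sketch of mini-batch gradient descent, assuming a toy linear regression problem (y = 2x + 1, no noise) and a mean squared error cost. The data, batch size, learning rate and function name are illustrative assumptions rather than anything specified in the slides.

```python
import random

# Assumed toy data: y = 2x + 1 with no noise.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]


def minibatch_gd(xs, ys, batch_size=4, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = w*x + b by mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw new mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            m = len(batch)
            # Gradient of the batch MSE with respect to w and b.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2.0 / m * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = 2.0 / m * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


w, b = minibatch_gd(xs, ys)
print(w, b)  # should be close to the assumed true values 2 and 1
```

Setting batch_size to len(xs) would turn this into batch gradient descent, and setting it to 1 would give the stochastic variant.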
2.) Exponentially Weighted Averages
The Exponentially Weighted Moving
Average (EWMA) is commonly used as a
smoothing technique in time series.
However, due to several computational
advantages (fast, low-memory cost), the EWMA
is behind the scenes of many optimization
algorithms in deep learning, including Gradient
Descent with Momentum, RMSprop, Adam,
etc.
In order to compute the EWMA, you must
define one parameter β.
This parameter decides how important the
current observation is in the calculation of the
EWMA.
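The formula slides did not survive extraction, so the sketch below assumes the commonly used recursion v_t = β * v_(t-1) + (1 - β) * x_t with v_0 = 0. Under this convention a larger β gives more weight to the past and less to the current observation. The function name is illustrative.

```python
def ewma(values, beta):
    """Exponentially weighted moving average, assuming v_0 = 0."""
    v = 0.0
    averages = []
    for x in values:
        # Blend the previous average with the current observation.
        v = beta * v + (1.0 - beta) * x
        averages.append(v)
    return averages


smoothed = ewma([1.0, 1.0, 1.0], beta=0.5)
print(smoothed)  # [0.5, 0.75, 0.875]
```

Only the previous average has to be stored, which is the low-memory property mentioned above.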
Example: [worked-example figures; one step reads "Substitute V98"]
3.) Gradient Descent with Momentum
Stochastic Gradient Descent / Batch Gradient Descent:
the updates move in different directions before finally converging.
[Figure: Gradient Descent]
In Exponentially Weighted Averages: smooth curve.
In SGD / BGD: [figure]

Advantages:
The momentum-based gradient optimizer has several
advantages over the basic gradient descent algorithm, including
faster convergence, improved stability, and a better chance of
escaping shallow local minima.
It is widely used in deep learning applications and is an
important optimization technique for training deep neural networks.
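A minimal sketch of gradient descent with momentum, assuming the exponentially-weighted-average form of the velocity (v = β * v + (1 - β) * gradient) and a toy cost J(w) = w^2. Other texts use v = β * v + gradient instead; the names and default values here are illustrative assumptions.

```python
def momentum_gd(grad, start, learning_rate=0.1, beta=0.9, steps=500):
    """Gradient descent where each step follows a smoothed gradient."""
    v = 0.0  # exponentially weighted average of past gradients
    w = start
    for _ in range(steps):
        g = grad(w)
        v = beta * v + (1.0 - beta) * g
        w = w - learning_rate * v
    return w


# Assumed toy cost J(w) = w**2, whose gradient is 2w.
w_final = momentum_gd(lambda w: 2.0 * w, start=5.0)
print(w_final)  # close to 0, the minimizer of the assumed cost
```

Because the step uses the averaged gradient, components that flip sign from one step to the next tend to cancel, while consistent directions accumulate.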
4.) Optimization
In deep learning, optimization algorithms are crucial
components that help neural networks learn efficiently
and converge to good solutions.
Although optimization provides a way to minimize the loss
function for deep learning, in essence, the goals of
optimization and deep learning are fundamentally
different.
The former is primarily concerned with minimizing
an objective, whereas the latter is concerned with finding
a suitable model, given a finite amount of data.
4.) RMSProp Optimization
RMSProp (Root Mean Squared Propagation) is an
adaptive learning rate optimization algorithm. It is an
extension of the popular Adaptive Gradient Algorithm (AdaGrad) and is
designed to avoid AdaGrad's continually shrinking learning rate.
The algorithm keeps an exponentially decaying average of the
squared gradients for each parameter and divides the learning
rate by the square root of that average.
This reduces the effective step for parameters whose gradients
are consistently large and increases it for parameters whose
gradients are small.
In this way, RMSProp is able to smoothly adjust the
learning rate for each of the parameters in the network,
often performing better than regular Gradient Descent.

One key feature is its use of a moving average of the squared
gradients to scale the learning rate for each parameter.
This helps to stabilize the learning process and prevent
oscillations in the optimization trajectory.
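A minimal single-parameter sketch of the RMSProp update, assuming the usual form in which the gradient is divided by the square root of a moving average of squared gradients. The toy cost J(w) = w^2, the function name and the default values are illustrative assumptions.

```python
import math


def rmsprop(grad, start, learning_rate=0.01, beta=0.9, eps=1e-8, steps=2000):
    """RMSProp on one parameter: scale each step by the recent gradient size."""
    s = 0.0  # moving average of squared gradients
    w = start
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1.0 - beta) * g * g
        # eps avoids division by zero when the gradients vanish.
        w = w - learning_rate * g / (math.sqrt(s) + eps)
    return w


# Assumed toy cost J(w) = w**2, whose gradient is 2w.
w_final = rmsprop(lambda w: 2.0 * w, start=5.0)
print(w_final)  # ends near 0, the minimizer of the assumed cost
```

In a network with many parameters, s would be kept separately for each parameter, which is what gives every parameter its own effective learning rate.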
Advantages:
(a) Fast Convergence:
RMSprop is known for its fast convergence speed,
which means that it can find good solutions to
optimization problems in fewer iterations than some
other algorithms.
(b) Stable Learning:
The use of a moving average of the squared
gradients in RMSprop helps to stabilize the learning
process and prevent oscillations in the optimization
trajectory.
(c) Fewer hyperparameters:
RMSprop has fewer hyperparameters than some
other optimization algorithms, which makes it easier to
tune and use in practice.
(d) Good performance on non-convex problems:
RMSprop tends to perform well on non-convex
optimization problems, which are common in machine learning
and deep learning.
Non-convex optimization problems can have multiple
local minima, and RMSprop's fast convergence speed
and stable learning can help it find good solutions even
in these cases.
4.) Adam Optimization
What is Adam Optimization?
Adam optimization is a gradient descent-based
optimization algorithm introduced by Diederik P. Kingma
and Jimmy Ba in 2014.
Adam stands for Adaptive Moment Estimation,
which describes the optimizer's method to update
weights during training.
The basic idea behind Adam optimization is to adjust
the learning rate adaptively for each parameter in the
model based on the history of gradients calculated for
that parameter.
This helps the optimizer converge faster and more
accurately than fixed learning rate methods like stochastic
gradient descent.
Adam is one of the most widely used optimization
algorithms in deep learning.
At a high level, Adam combines the Momentum
and RMSProp algorithms.
To achieve this, it keeps track of exponential
moving averages of the computed gradients and of
the squared gradients.
Furthermore, it is possible to use bias
correction for moving averages for a more
precise approximation of gradient trend during
the first several iterations.
The experiments show that Adam adapts well
to almost any type of neural network architecture,
taking the advantages of both Momentum and RMSProp.
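A minimal single-parameter sketch of the Adam update, combining a moving average of gradients (Momentum), a moving average of squared gradients (RMSProp) and bias correction. The default values follow common practice, and the toy cost J(w) = w^2 and function name are illustrative assumptions, not taken from the slides.

```python
import math


def adam(grad, start, learning_rate=0.1, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=1000):
    """Adam on one parameter with bias-corrected moment estimates."""
    m = 0.0  # moving average of gradients (first moment)
    v = 0.0  # moving average of squared gradients (second moment)
    w = start
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Both averages start at zero, so early values are rescaled.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Assumed toy cost J(w) = w**2, whose gradient is 2w.
w_final = adam(lambda w: 2.0 * w, start=5.0)
print(w_final)  # ends near 0, the minimizer of the assumed cost
```

At the first step the corrected estimates equal the gradient and its square, so the very first update has a magnitude close to the learning rate regardless of the gradient's scale.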
Advantages of Adam Optimization:
(a) Adaptive Learning Rates:
Unlike fixed learning rate methods like SGD,
Adam optimization provides adaptive learning rates
for each parameter based on the history of gradients.
This allows the optimizer to converge faster and
more accurately, especially in high-dimensional
parameter spaces.
(b) Momentum:
Adam optimization uses momentum to smooth
out fluctuations in the optimization process, which
can help the optimizer avoid local minima and saddle
points.
(c) Bias Correction:
Adam optimization applies bias correction to
the first and second moment estimates, which start at
zero and would otherwise be biased toward zero during
the first iterations.
(d) Robustness:
Adam optimization is relatively robust to
hyperparameter choices and works well across a
wide range of deep learning architectures.
5.) Hyperparameter Tuning
What is hyperparameter tuning?
When you're training machine learning
models, each dataset and model needs a different
set of hyperparameters, which are settings chosen
before training rather than learned from the data.
These are usually determined through
multiple experiments, where you pick a set of
hyperparameters and run them through your
model. This is called hyperparameter tuning.
(a) Grid Search:
• Always finds the best-performing combination in the grid,
but not necessarily the overall best one.
• Can be computationally expensive.
(b) Random Search:
• Can lead to good solutions, but it's not
guaranteed.
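The two search strategies can be sketched as follows. The scoring function here is a hypothetical stand-in for "train a model with these hyperparameters and return its validation score"; the hyperparameter names, ranges and grid values are likewise assumptions made for illustration.

```python
import itertools
import math
import random


def toy_validation_score(learning_rate, reg_strength):
    """Hypothetical stand-in for training a model and scoring it."""
    return -((math.log10(learning_rate) + 2.0) ** 2
             + (reg_strength - 0.3) ** 2)


# Grid search: evaluate every combination of the listed values.
learning_rates = [1e-3, 1e-2, 1e-1]
reg_strengths = [0.0, 0.5, 1.0]
grid = list(itertools.product(learning_rates, reg_strengths))
best_grid = max(grid, key=lambda params: toy_validation_score(*params))

# Random search: sample the same number of points from continuous ranges.
rng = random.Random(0)
samples = [(10 ** rng.uniform(-3, -1), rng.uniform(0.0, 1.0))
           for _ in range(len(grid))]
best_random = max(samples, key=lambda params: toy_validation_score(*params))

print(best_grid, toy_validation_score(*best_grid))
print(best_random, toy_validation_score(*best_random))
```

Grid search can only return a point that is on the grid, while random search may land closer to the true optimum or farther from it depending on the draw.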
[Figure: Grid Search vs Random Search]
Total models built and tested: 10 hyperparameter combinations x 10 cross-validation folds = 100.
5.) Hyperparameter Tuning (Regularization)
Let's explore some more detailed explanations about the role of
Regularization:
1. Complexity Control: Regularization helps control model complexity by
preventing overfitting to training data, resulting in better generalization
to new data.
2. Preventing Overfitting: One way to prevent overfitting is to use
regularization, which penalizes large coefficients and constrains their
magnitudes, thereby preventing a model from becoming overly complex
and memorizing the training data instead of learning its underlying
patterns.
3. Balancing Bias and Variance: Regularization can help balance the
trade-off between model bias (underfitting) and model variance
(overfitting) in machine learning, which leads to improved performance.
4. Feature Selection: Some regularization methods, such as L1
regularization (Lasso), promote sparse solutions that drive
some feature coefficients to zero. This automatically selects
important features while excluding less important ones.
5. Handling Multicollinearity: When features are highly
correlated (multicollinearity), regularization can stabilize the
model by reducing coefficient sensitivity to small data changes.
6. Generalization: Regularized models learn underlying patterns
of data for better generalization to new data, instead of
memorizing specific examples.
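How a penalty "constrains the magnitude" of a coefficient can be seen in the simplest possible case. The sketch below assumes a one-feature model y = w * x with no intercept and an L2 penalty; the data and function name are hypothetical.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Least-squares slope for y = w*x with an L2 penalty `penalty * w**2`.

    Minimizing sum((w*x - y)**2) + penalty * w**2 over w gives
    w = sum(x*y) / (sum(x*x) + penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Assumed toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_plain = fit_slope(xs, ys)                # 60 / 30 = 2.0
w_ridge = fit_slope(xs, ys, penalty=10.0)  # 60 / 40 = 1.5
print(w_plain, w_ridge)
```

The penalty appears in the denominator, so a larger penalty always pulls the coefficient closer to zero.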
What are Overfitting and Underfitting?
Overfitting is a phenomenon that occurs when a machine
learning model fits the training set too closely and is not able to perform
well on unseen data. This happens when the model learns the noise in the training
data as well, memorizing the training data
instead of learning the patterns in it.
Underfitting, on the other hand, is the case when the model is not able to
learn even the basic patterns available in the dataset. An
underfitting model is unable to perform well even on the training data, hence
we cannot expect it to perform well on the validation data. In this case
we should increase the complexity of the model or add more
features to the feature set.
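The difference can be made concrete with three toy models on assumed data (y = 2x plus alternating noise). The models are deliberately extreme illustrations chosen for this sketch: one ignores the input, one memorizes the training set, and one fits a straight line.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed toy data: y = 2x plus alternating noise of +0.5 / -0.5.
train_x = [float(i) for i in range(10)]
train_y = [2.0 * x + (0.5 if i % 2 == 0 else -0.5)
           for i, x in enumerate(train_x)]
# Unseen points between the training points, with noise-free targets.
val_x = [i + 0.5 for i in range(9)]
val_y = [2.0 * x for x in val_x]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def mean_model(x):
    """Underfitting: always predict the mean target, ignoring x."""
    return mean_y


def memorizer(x):
    """Overfitting: return the stored target of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Reasonable fit: a straight line by ordinary least squares.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def linear_model(x):
    return slope * x + intercept


for name, model in [("mean", mean_model), ("memorizer", memorizer),
                    ("linear", linear_model)]:
    print(name, mse(model, train_x, train_y), mse(model, val_x, val_y))
```

The memorizer reaches zero training error because it reproduces the noise, yet it does worse than the line on the unseen points; the mean model is poor on both sets.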
What are Bias and Variance?
Bias refers to the error that occurs when we fit a statistical model
to real-world data that does not perfectly follow the assumed mathematical
model. If we use far too simplistic a model to fit the data, we are
likely to face High Bias, which refers to the case
when the model is unable to learn the patterns in the data at hand and
hence performs poorly.
Variance refers to the error that occurs when we make
predictions on data not previously seen by the model.
High variance occurs when the model learns
the noise that is present in the training data.
Bias Variance tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning.
It refers to the balance between bias and variance, which affect predictive
model performance.
Finding the right tradeoff is crucial for creating models that generalize well
to new data.
The bias-variance tradeoff demonstrates the inverse relationship between
bias and variance:
when one decreases, the other tends to increase, and vice versa.
An overly simple model with high bias won’t capture the underlying
patterns, while an overly complex model with high variance will fit the noise
in the data.
Regularization in Machine Learning
Regularization is a technique used to reduce errors by fitting the function
appropriately on the given training set and avoiding overfitting.
The commonly used regularization techniques are:
Lasso Regularization – L1 Regularization
Ridge Regularization – L2 Regularization
Elastic Net Regularization – L1 and L2 Regularization
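The three penalties differ only in the term added to the cost. The sketch below uses one common parameterization (a strength `lam` and, for Elastic Net, a mixing weight `l1_ratio`); libraries differ in scaling constants, so the exact form and all names here are assumptions.

```python
def l1_penalty(weights, lam):
    """Lasso term: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam, l1_ratio):
    """Elastic Net term: a weighted mix of the L1 and L2 terms."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)


weights = [3.0, -4.0]
print(l1_penalty(weights, lam=1.0))                         # 7.0
print(l2_penalty(weights, lam=1.0))                         # 25.0
print(elastic_net_penalty(weights, lam=1.0, l1_ratio=0.5))  # 16.0
```

Whichever term is chosen is added to the training loss, so larger weights cost more and the fit is pushed toward simpler models.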


6.) Batch Normalization
• Training Deep Neural Networks is a difficult task
that involves several problems to tackle.
• Despite their huge potential, they can be slow
to train and prone to overfitting.
• Thus, methods to solve these
problems are a constant topic of Deep Learning research.
• Batch Normalization, commonly abbreviated
as Batch Norm, is one of these methods.
• Currently, it is a widely used technique in the
field of Deep Learning.
• It improves the learning speed of Neural
Networks and provides regularization, which helps reduce
overfitting.
• But why is it so important? How does it work?
Furthermore, how can it be applied to non-regular
networks such as Convolutional Neural Networks?
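As a partial answer to "how does it work", here is a minimal sketch of the training-time normalization step for a single feature, assuming the standard formulation: subtract the mini-batch mean, divide by the mini-batch standard deviation, then apply a learnable scale (gamma) and shift (beta). The function name and default values are illustrative, and running statistics for inference are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    # eps keeps the division stable when the batch variance is tiny.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [1.0, 2.0, 3.0, 4.0, 5.0]
normalized = batch_norm(activations)

out_mean = sum(normalized) / len(normalized)
out_var = sum((y - out_mean) ** 2 for y in normalized) / len(normalized)
print(out_mean, out_var)  # roughly 0 and 1
```

With gamma = 1 and beta = 0 the output has approximately zero mean and unit variance; during training the network can learn other values of gamma and beta if a different scale suits the next layer.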