Deep learning

The document discusses gradient-based learning in deep learning, emphasizing the differences between training neural networks and linear models, particularly the non-convex nature of neural network loss functions. It covers the importance of choosing appropriate cost functions, output units, and optimizers for effective training, as well as the use of back-propagation for gradient computation. Additionally, it explains concepts like maximum likelihood estimation, negative log-likelihood, and cross-entropy in the context of neural network training.

Uploaded by

vijaymaya8501

CSE4006

DEEP LEARNING

Gradient based Learning


Gradient based Learning - Introduction

• Designing and training a neural network is not much different from training any other
machine learning model with gradient descent.
• The largest difference between linear models and neural networks is the non-linearity
of a neural network.
• Non-linearity causes loss functions to become non-convex.
• Neural networks are usually trained using iterative, gradient-based optimizers that drive the
cost function to a very low value, rather than the linear equation solvers used to train
linear regression models or the convex optimization algorithms with guaranteed global
convergence (e.g., logistic regression or SVMs).
• Convex optimization converges starting from any initial parameters.
• Stochastic gradient descent applied to non-convex loss functions has no such
convergence guarantee, and is sensitive to the values of the initial parameters.
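The sensitivity to initial parameters can be seen on a small non-convex function. The sketch below is only an illustration and is not taken from the slides: the function `f`, the learning rate, the step count, and the two starting points are all assumptions chosen for the demonstration.

```python
def f(w):
    # A hypothetical non-convex "loss" with two minima (chosen for illustration).
    return w**4 - 3 * w**2 + w


def grad_f(w):
    # Analytic derivative of f.
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w0, lr=0.01, steps=1000):
    # Plain gradient descent: repeatedly step against the gradient.
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)
    return w


w_right = gradient_descent(2.0)   # starts to the right of both minima
w_left = gradient_descent(-2.0)   # starts to the left of both minima

print(f"start  2.0 -> w = {w_right:.4f}, f(w) = {f(w_right):.4f}")
print(f"start -2.0 -> w = {w_left:.4f}, f(w) = {f(w_left):.4f}")
```

The two runs use the same update rule and settle in different minima with different loss values, which is the behaviour a convex problem would not show.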
• For feedforward neural networks, initialize all weights to small random values.
• The biases may be initialized to zero or small positive values.
• Training always uses the gradient to descend the cost function.
• Linear regression and SVM models can also be trained with gradient descent, which is common for large training sets.
• Computing the gradient is slightly more complicated for a neural network than linear
models, but can still be done efficiently and exactly.
• Compute the gradient $\frac{\partial J}{\partial w_i}$ using the back-propagation algorithm.
To apply gradient-based learning, choose:
1. The cost function
2. The form of the output units
3. The optimizer
Feed Forward Neural Network - Recap

• Hidden layers were introduced between the input and the output layers.
• Activation functions were used in the hidden layers and the choice of activation functions could
be based on the problem domain.
• Design the architecture of the network by deciding the number of layers, the connectivity of the layers,
and the number of units (neurons) in each layer.

• Learning in deep neural networks requires computing the gradients of complicated functions.
• Back-propagation algorithm can be used to efficiently compute these gradients.
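As a rough sketch of what back-propagation computes, the example below applies the chain rule by hand to a very small network and checks the result against finite differences. The network shape (one tanh hidden unit, one linear output, squared-error cost), the parameter values, and the training example are all hypothetical choices made for illustration.

```python
import math


def forward(params, x, y):
    # Tiny network: one tanh hidden unit, one linear output unit, squared-error cost.
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y_hat = w2 * h + b2
    cost = 0.5 * (y_hat - y) ** 2
    return cost, h, y_hat


def backprop(params, x, y):
    # Apply the chain rule from the cost back to each parameter.
    w1, b1, w2, b2 = params
    _, h, y_hat = forward(params, x, y)
    d_yhat = y_hat - y            # dJ/dy_hat
    d_w2 = d_yhat * h             # dJ/dw2
    d_b2 = d_yhat                 # dJ/db2
    d_h = d_yhat * w2             # dJ/dh
    d_a = d_h * (1.0 - h ** 2)    # dJ/da, using tanh'(a) = 1 - tanh(a)^2
    d_w1 = d_a * x                # dJ/dw1
    d_b1 = d_a                    # dJ/db1
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, y, eps=1e-6):
    # Central finite differences, used only as an independent check.
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((forward(plus, x, y)[0] - forward(minus, x, y)[0]) / (2 * eps))
    return grads


params = [0.5, -0.1, 1.5, 0.2]   # hypothetical values for w1, b1, w2, b2
x, y = 0.8, 1.0                  # one hypothetical training example

analytic = backprop(params, x, y)
numeric = numerical_gradient(params, x, y)
for name, a, n in zip(["w1", "b1", "w2", "b2"], analytic, numeric):
    print(f"dJ/d{name}: backprop = {a:.6f}, finite difference = {n:.6f}")
```

Back-propagation needs one forward and one backward pass for all parameters, whereas the finite-difference check needs two forward passes per parameter, which is why the latter is used only for verification.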
Cost function
• To apply gradient-based learning, a cost function must be chosen, and so must the way the
output of the model is represented.

• The total cost function is used to train a neural network.

• Most modern neural networks are trained using maximum likelihood i.e., the cost function is simply
the negative log-likelihood (cross-entropy between the training data and the model distribution).

• The gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm.

• Functions that saturate (become very flat) undermine this objective because they make the gradient
become very small.

• In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate.

• The negative log-likelihood helps to avoid this problem for many models.
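The effect of saturation on the gradient can be checked numerically. The sketch below assumes a single sigmoid output unit with pre-activation `z` and a positive label, and compares the gradient with respect to `z` under a squared-error cost and under the negative log-likelihood. The choice of squared error as the comparison and the values of `z` are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_squared_error(z, y):
    # d/dz of 0.5 * (sigmoid(z) - y)^2
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1.0 - y_hat)


def grad_nll(z, y):
    # d/dz of -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))],
    # which simplifies to sigmoid(z) - y.
    return sigmoid(z) - y


y = 1.0
for z in (-10.0, -2.0, 0.0):
    print(
        f"z = {z:6.1f}  "
        f"squared-error grad = {grad_squared_error(z, y):.6f}  "
        f"NLL grad = {grad_nll(z, y):.6f}"
    )
```

At `z = -10` the prediction is badly wrong, yet the squared-error gradient is almost zero because the sigmoid is flat there, while the negative log-likelihood gradient stays close to -1.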
Maximum Likelihood Estimation
• The objective is to find $\theta$ that maximizes the likelihood of observing the data.
• $\hat{y}_i = \sigma(f(x_i))$, where $\hat{y}_i$ is the predicted probability of the positive class and $\sigma$ is a non-linear activation function that maps values from $(-\infty, \infty)$ to $[0, 1]$.
• Likelihood: $P(D \mid \theta) = \prod_{i=1}^{m} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{(1 - y_i)}$
• Log-likelihood: $\log P(D \mid \theta) = \sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right)$
• To check the accuracy of the predictions, look at the predicted probabilities assigned to the correct labels.
• log() is the natural (base e) logarithm.
• $y_i$ is the binary class label, either zero or one.
• Hence, for each index i, the sum adds either $\log(\hat{y}_i)$ (when $y_i = 1$) or $\log(1 - \hat{y}_i)$ (when $y_i = 0$).
• $\hat{y}_i$ is the predicted probability of the i-th data point being in the positive class.
• $(1 - \hat{y}_i)$ is the predicted probability of the i-th data point being in the negative class.
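The step from the likelihood to the log-likelihood relies only on the fact that the logarithm turns products into sums and exponents into factors. A short derivation, written with $m$ data points:

```latex
\begin{aligned}
\log P(D \mid \theta)
  &= \log \prod_{i=1}^{m} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{(1 - y_i)} \\
  &= \sum_{i=1}^{m} \left[ \log \hat{y}_i^{\,y_i} + \log (1 - \hat{y}_i)^{(1 - y_i)} \right] \\
  &= \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
\end{aligned}
```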
Procedure to compute log likelihood
• Start with the predicted probabilities for the positive class ($\hat{y}$).
• If raw prediction values are given instead, apply the sigmoid to turn them into probabilities.
• Compute the probabilities for the negative class ($1 - \hat{y}$).
• Compute the log probabilities.
• Sum the log probabilities associated with the true labels.

Example with five data points (the last column is the entry selected by the true label):

 i   y   $\hat{y}$   $1 - \hat{y}$   $\log(\hat{y})$   $\log(1 - \hat{y})$   selected
 1   1   0.64        0.36            −0.45             −1.01                 −0.45
 2   0   0.27        0.73            −1.31             −0.31                 −0.31
 3   0   0.04        0.96            −3.19             −0.04                 −0.04
 4   1   0.02        0.98            −1.1              −0.02                 −1.1
 5   0   0.81        0.19            −0.21             −1.68                 −1.68

• Log-likelihood $= \sum_{i} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right) = -0.45 - 0.31 - 0.04 - 1.1 - 1.68 = -3.58$
• The tabulated values are kept as given; note that $\ln(0.02) \approx -3.91$, so the entry $-1.1$ in row 4 does not match $\hat{y} = 0.02$.
• The final operation of picking out the correct entries in a matrix is also sometimes referred
to as masking.
• The mask is constructed based on the true labels.
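The procedure and the masking step can be sketched in a few lines. The probabilities and labels below are the ones from the five-point example in the slides; the variable names are illustrative. The code recomputes the logarithms from the rounded probabilities, so its total (about −6.37) differs from the −3.58 on the slide, mainly because the slide lists $\log \hat{y} = -1.1$ for $\hat{y} = 0.02$ in the fourth row.

```python
import math

# Predicted positive-class probabilities and true labels from the slide's example.
y_hat = [0.64, 0.27, 0.04, 0.02, 0.81]
y_true = [1, 0, 0, 1, 0]

# Log-probabilities for the positive and the negative class of every data point.
log_pos = [math.log(p) for p in y_hat]
log_neg = [math.log(1.0 - p) for p in y_hat]

# Masking: keep log(y_hat) where the label is 1 and log(1 - y_hat) where it is 0.
selected = [lp if y == 1 else ln for y, lp, ln in zip(y_true, log_pos, log_neg)]

log_likelihood = sum(selected)
print([round(s, 2) for s in selected])
print(round(log_likelihood, 2))
```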
Minimizing the Negative Log-Likelihood
• Maximizing the log-likelihood is equivalent to minimizing its negative, which gives the Negative Log-Likelihood
Loss:
• $J(\theta) = -\sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right)$
Cross-Entropy
• Negative log-likelihood is the same as the cross-entropy between $y$ (true labels) and $\hat{y}$
(predicted probabilities):

$h(y, \hat{y}) = -\sum_{i=1}^{m} y_i \log \hat{y}_i$

• Example of predicted probabilities over three classes: 0.7, 0.2, 0.1.

• Cross Entropy Loss applies a softmax activation followed by a log transformation to its input.

• Negative Log-Likelihood Loss does not; it expects inputs that are already log-probabilities.

Generalizing to Multiclass
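• Based on the “masking principle”, rewrite the log-likelihood:

$\log P(D \mid \theta) = \sum_{i=1}^{m} \log \hat{y}_i^{(y_i)} = \sum_{i=1}^{m} \left( y_i * \log \hat{y}_i \right)$

• Here $\hat{y}_i^{(y_i)}$ is the predicted probability of the true class $y_i$ for data point i; in the last sum, $y_i$ is read as a one-hot vector that masks the vector of log-probabilities $\log \hat{y}_i$.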
When to Use Negative Log-Likelihood Loss, BCE, and CE?
• Loss($\hat{y}$, y), where $\hat{y}$ is the prediction values or some transformed version of the hypothesis,
and y is the label.
• Apply BCE (binary cross-entropy) Loss if $\hat{y}$ is the probability of a data point being positive.
• Apply Negative Log-Likelihood Loss or CE (cross-entropy) Loss when $\hat{y}$ is two-dimensional
and y is one-dimensional, taking values from 0 to C−1 for C classes.
• Apply Negative Log-Likelihood Loss if $\hat{y}$ encodes log-likelihoods (it essentially performs
the masking step followed by mean reduction).
• Apply Cross Entropy Loss if $\hat{y}$ encodes raw prediction values that need to be activated
using the softmax function.
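The three cases above can be sketched in plain Python. This is a minimal illustration, not the API of any particular library: the function names `bce_loss`, `log_softmax`, `nll_loss`, and `cross_entropy_loss` are hypothetical, the example scores and labels are made up, and mean reduction is assumed throughout.

```python
import math


def bce_loss(probs, labels):
    # Binary cross-entropy: probs[i] is P(y = 1) for data point i; mean reduction.
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p)) for p, y in zip(probs, labels)]
    return sum(terms) / len(terms)


def log_softmax(logits):
    # Numerically stable log-softmax of one row of raw scores.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def nll_loss(log_probs, labels):
    # Masking step (pick the true-class entry of each row) followed by mean reduction.
    picked = [row[c] for row, c in zip(log_probs, labels)]
    return -sum(picked) / len(picked)


def cross_entropy_loss(logits, labels):
    # Softmax and log applied to raw scores, then the same masking and reduction.
    return nll_loss([log_softmax(row) for row in logits], labels)


# Hypothetical raw scores for 3 data points and C = 3 classes; labels take values 0..C-1.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.0, 1.0, 1.0]]
labels = [0, 2, 1]

ce = cross_entropy_loss(logits, labels)
nll = nll_loss([log_softmax(row) for row in logits], labels)
bce = bce_loss([0.9, 0.2], [1, 0])
print(f"CE on raw scores: {ce:.4f}")
print(f"NLL on log-probabilities: {nll:.4f}")
print(f"BCE on probabilities: {bce:.4f}")
```

Cross-entropy on raw scores and negative log-likelihood on the corresponding log-probabilities give the same number; the only difference is which form of input each one expects.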
Choosing Output Units – Gradient Based Learning
• The choice of cost function is tightly coupled with the choice of output unit.
• Most of the time, one simply uses the cross-entropy between the data distribution and the model
distribution.
• Determine the form of the cross-entropy function (Binary / Categorical) based on output
representation.
• Any kind of neural network unit that can be used as an output unit can also be used as a hidden unit.
1. Linear Units for Gaussian Output Distributions
• A linear unit is a simple output unit based on an affine transformation with no non-linearity.
• Given features $h$, a layer of linear output units produces a vector $\hat{y} = W^T h + b$.
• Linear output layers are often used to produce the mean of a conditional Gaussian
distribution: $p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$.
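With this output distribution, maximizing the log-likelihood is equivalent to minimizing the squared error. A brief derivation, where $d$ (the dimension of $y$) is notation introduced here for the sketch:

```latex
\begin{aligned}
-\log p(y \mid x)
  &= -\log \mathcal{N}(y; \hat{y}, I) \\
  &= \tfrac{1}{2} \lVert y - \hat{y} \rVert^2 + \tfrac{d}{2} \log(2\pi)
\end{aligned}
```

The second term does not depend on the parameters, so only the squared-error term affects the gradient.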
2. Sigmoid Units for Bernoulli Output Distributions
• Choose sigmoid units for binary classification problems.
• The maximum-likelihood approach defines a Bernoulli distribution over y conditioned on
x.
• A Bernoulli distribution is defined by a single number (a scalar).
• The neural net needs to predict only $P(y = 1 \mid x)$.
• For this number to be a valid probability, it must lie in the interval [0, 1].

$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = w^T h + b$ and $h$ are the features fed to the output unit.
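Pairing the sigmoid output with the negative log-likelihood gives a particularly simple gradient with respect to the pre-activation $z$. The derivation below is a sketch for a single data point with label $y \in \{0, 1\}$, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

```latex
\begin{aligned}
J(z) &= -\left[ y \log \sigma(z) + (1 - y) \log (1 - \sigma(z)) \right] \\
\frac{\partial J}{\partial z}
  &= -\left[ y \, (1 - \sigma(z)) - (1 - y) \, \sigma(z) \right] \\
  &= \sigma(z) - y
\end{aligned}
```

The gradient shrinks only when the prediction $\sigma(z)$ is close to the label, not merely because the sigmoid is saturated.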
3. Softmax Units for Multinoulli Output Distributions
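• Softmax units represent a probability distribution over a discrete variable with n possible classes.
• More rarely, softmax functions can be used inside the model itself, to choose between one
of n different options for some internal variable.
• In the case of binary variables, we wished to produce a single number, $\hat{y} = P(y = 1 \mid x)$; for n classes, the output is a vector $\hat{y}$ with $\hat{y}_i = P(y = i \mid x)$.

$\mathrm{softmax}(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{n} e^{a_j}}, \quad i = 1, \dots, n$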
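The softmax formula can be sketched directly in code. The version below subtracts the largest score before exponentiating, a common numerical-stability step that is not stated in the slides but leaves the result unchanged. The example scores are hypothetical.

```python
import math


def softmax(a):
    # Subtracting max(a) does not change the result but avoids overflow in exp().
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]   # hypothetical raw scores, one per class
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(sum(probs))

# Large scores would overflow a naive exp(); the shifted version still works.
print(softmax([1000.0, 1001.0, 1002.0]))
```

The outputs are positive, sum to one, and preserve the ordering of the raw scores, so they can be read as class probabilities.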
Choosing Optimizers – Gradient Based Learning
• Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent
• Adagrad
• RMSProp
• Adam
• Adadelta
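The first three entries in the list differ only in how many examples are used for each gradient step. The sketch below illustrates this on a made-up one-dimensional regression problem; the data, learning rate, epoch count, and function names are all assumptions, and the adaptive optimizers (Adagrad, RMSProp, Adam, Adadelta) are not shown.

```python
import random


def make_data(n=200, seed=0):
    # Hypothetical 1-D regression data: y = 3x + 2 plus small Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def gradient(w, b, xs, ys):
    # Gradient of the cost 1/(2k) * sum((w*x + b - y)^2) over the k examples given.
    k = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dw = sum(e * x for e, x in zip(errors, xs)) / k
    db = sum(errors) / k
    return dw, db


def train(xs, ys, batch_size, lr=0.1, epochs=200, seed=0):
    # batch_size = len(xs): (batch) gradient descent
    # batch_size = 1:       stochastic gradient descent
    # anything in between:  mini-batch gradient descent
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            dw, db = gradient(w, b, [xs[i] for i in batch], [ys[i] for i in batch])
            w -= lr * dw
            b -= lr * db
    return w, b


xs, ys = make_data()
for name, size in [("batch", len(xs)), ("mini-batch", 32), ("stochastic", 1)]:
    w, b = train(xs, ys, batch_size=size)
    print(f"{name:>10}: w = {w:.3f}, b = {b:.3f}")
```

All three variants should land near the slope and intercept used to generate the data; they differ in how many parameter updates they make per pass and in how noisy each update is.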
https://www.analyticsvidhya.com/blog/2021/03/variants-of-gradient-descent-algorithm/

Issues in GD
https://www.scaler.com/topics/momentum-based-gradient-descent/

Example

https://www.geeksforgeeks.org/intuition-of-adam-optimizer/
https://optimization.cbe.cornell.edu/index.php?title=Adam

https://towardsai.net/p/l/impact-of-optimizers-in-image-classifiers
