Deep learning

The document discusses gradient-based learning in deep learning, emphasizing the differences between training neural networks and linear models, particularly the non-convex nature of neural network loss functions. It covers the importance of choosing appropriate cost functions, output units, and optimizers for effective training, as well as the use of back-propagation for gradient computation. Additionally, it explains concepts like maximum likelihood estimation, negative log-likelihood, and cross-entropy in the context of neural network training.

Uploaded by

vijaymaya8501


CSE4006

DEEP LEARNING

Gradient based Learning

Gradient based Learning - Introduction

• Designing and training a neural network is not much different from training any other
machine learning model with gradient descent.
• The largest difference between linear models and neural networks is the non-linearity
of a neural network.
• Non-linearity causes loss functions to become non-convex.
• Neural networks are usually trained using iterative, gradient-based optimizers that merely
drive the cost function to a very low value, rather than the linear equation solvers used to
train linear regression models or the convex optimization algorithms with global convergence
guarantees used to train logistic regression or SVMs.
• Convex optimization converges starting from any initial parameters.
• Stochastic gradient descent applied to non-convex loss functions has no such
convergence guarantee, and is sensitive to the values of the initial parameters.
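The sensitivity to initial parameters can be illustrated with a minimal sketch. The one-dimensional function below is a hypothetical stand-in for a non-convex loss (it is not taken from the slides), and the learning rate and step count are arbitrary choices.

```python
def grad(w):
    # Derivative of the hypothetical non-convex function f(w) = w**4 - 3*w**2 + w
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w0, lr=0.01, steps=2000):
    # Plain gradient descent starting from the initial parameter w0
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


# Two different initial values end up in two different minima
w_left = gradient_descent(-2.0)
w_right = gradient_descent(2.0)
print(w_left, w_right)
```

Both runs reach a point where the gradient is essentially zero, but they are different points: the run that starts on the left settles in the minimum at negative w, the run that starts on the right settles in the minimum at positive w.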
• For feedforward neural networks, initialize all weights to small random values.
• The biases may be initialized to zero or small positive values.
• The training algorithm is almost always based on using the gradient to descend the cost function.
• Linear regression and SVM models can also be trained with gradient descent, which is common
when the training set is very large.
• Computing the gradient is slightly more complicated for a neural network than linear
models, but can still be done efficiently and exactly.
• Compute the gradient $\frac{\partial J}{\partial w_i}$ using the back-propagation algorithm.
To apply gradient-based learning, choose:
1. The cost function,
2. The form of the output units, and
3. The optimizer.
Feed Forward Neural Network - Recap

• Hidden layers were introduced between the input and the output layers.
• Activation functions were used in the hidden layers and the choice of activation functions could
be based on the problem domain.
• Design the architecture of the network by deciding the number of layers, the connectivity
between layers, and the number of units (neurons) in each layer.

• Learning in deep neural networks requires computing the gradients of complicated functions.
• The back-propagation algorithm can be used to compute these gradients efficiently.
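As a rough illustration of how back-propagation computes gradients, here is a minimal sketch for a tiny network with one input, one tanh hidden layer, and one linear output. The network size, the squared-error cost, and all names are assumptions made for this example. Weights start as small random values and biases at zero, and the result is compared against a finite-difference estimate.

```python
import math
import random

random.seed(0)
H = 3  # hypothetical number of hidden units

# Small random weights, zero biases
w1 = [random.uniform(-0.1, 0.1) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-0.1, 0.1) for _ in range(H)]
b2 = 0.0


def forward(x, w1, b1, w2, b2):
    # Hidden activations and the linear output
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    y_hat = sum(w2[j] * h[j] for j in range(H)) + b2
    return h, y_hat


def loss(x, y, w1, b1, w2, b2):
    # Squared-error cost for a single example
    _, y_hat = forward(x, w1, b1, w2, b2)
    return 0.5 * (y_hat - y) ** 2


def backprop(x, y, w1, b1, w2, b2):
    # Apply the chain rule from the output back to the first layer
    h, y_hat = forward(x, w1, b1, w2, b2)
    d_yhat = y_hat - y  # dJ/dy_hat
    g_w2 = [d_yhat * h[j] for j in range(H)]
    g_b2 = d_yhat
    # Derivative of tanh is 1 - tanh**2
    d_pre = [d_yhat * w2[j] * (1.0 - h[j] ** 2) for j in range(H)]
    g_w1 = [d_pre[j] * x for j in range(H)]
    g_b1 = d_pre
    return g_w1, g_b1, g_w2, g_b2


x, y = 0.5, 1.0
g_w1, g_b1, g_w2, g_b2 = backprop(x, y, w1, b1, w2, b2)

# Finite-difference check of dJ/dw1[0]
eps = 1e-6
w1_plus = w1.copy()
w1_plus[0] += eps
w1_minus = w1.copy()
w1_minus[0] -= eps
numeric = (loss(x, y, w1_plus, b1, w2, b2) - loss(x, y, w1_minus, b1, w2, b2)) / (2 * eps)
print(g_w1[0], numeric)
```

The analytic gradient costs one forward and one backward pass, whereas the finite-difference estimate needs two extra forward passes per parameter.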
Cost function
• To apply gradient-based learning, a cost function must be chosen, along with how the output
of the model is represented.

• The total cost function is used to train a neural network.

• Most modern neural networks are trained using maximum likelihood i.e., the cost function is simply
the negative log-likelihood (cross-entropy between the training data and the model distribution).

• The gradient of the cost function must be large and predictable enough to guide the learning algorithm.

• Functions that saturate (become very flat) undermine this objective because they make the gradient
become very small.

• This often happens because the activation functions used to produce the output of the hidden
units or the output units saturate.

• The negative log-likelihood helps to avoid this problem for many models.
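A small numerical sketch can make the saturation point concrete. It assumes a single sigmoid output unit with pre-activation z and compares the gradient with respect to z under a squared-error cost and under the negative log-likelihood. The specific values of z and y are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_mse(z, y):
    # d/dz of 0.5 * (sigmoid(z) - y)**2
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)


def grad_nll(z, y):
    # d/dz of -[y*log(s) + (1 - y)*log(1 - s)] with s = sigmoid(z)
    return sigmoid(z) - y


# A confidently wrong prediction: true label 1, strongly negative pre-activation
z, y = -10.0, 1.0
g_mse = grad_mse(z, y)
g_nll = grad_nll(z, y)
print(g_mse, g_nll)
```

With the squared error, the factor s(1 - s) from the saturated sigmoid makes the gradient almost vanish even though the prediction is badly wrong. With the negative log-likelihood, the log cancels that factor and the gradient stays close to -1.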
Maximum Likelihood Estimation
• The objective is to find the parameters $\theta$ that maximize the likelihood of observing the data.
• The model predicts $\hat{y}_i = \sigma(f(x_i))$, where $\hat{y}_i$ is the predicted probability of
the positive class and $\sigma$ is a non-linear activation function that maps values from
$(-\infty, \infty)$ to $[0, 1]$.
• Likelihood:
$$P(D \mid \theta) = \prod_{i=1}^{m} \hat{y}_i^{\,y_i} \, (1 - \hat{y}_i)^{(1 - y_i)}$$
• Log-likelihood:
$$\log P(D \mid \theta) = \sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right)$$
• To check how good the predictions are, look at the predicted probabilities assigned to the correct labels.
• log() is the natural (base e) logarithm.
• $y_i$ is a binary class value, either zero or one.
• Hence, for each index i, the sum adds either $\log(\hat{y}_i)$ or $\log(1 - \hat{y}_i)$.
• $\hat{y}_i$ is the predicted probability that the i-th data point belongs to the positive class.
• $(1 - \hat{y}_i)$ is the predicted probability that the i-th data point belongs to the negative class.
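The log-likelihood sum translates almost directly into code. The sketch below is illustrative only; the function name and the three data points are hypothetical.

```python
import math


def log_likelihood(y_true, y_prob):
    # Adds log(p) when the label is 1 and log(1 - p) when the label is 0
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    )


y_true = [1, 0, 1]        # hypothetical binary labels
y_prob = [0.9, 0.2, 0.6]  # hypothetical predicted probabilities of the positive class
ll = log_likelihood(y_true, y_prob)
print(ll)
```

For these values the result equals log(0.9 * 0.8 * 0.6), which is the logarithm of the product form of the likelihood.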
Procedure to compute log likelihood
• Start with the predicted probabilities for the positive class, $\hat{y}$.
• If raw prediction values are given instead, apply the sigmoid to turn them into probabilities.
• Compute the probabilities for the negative class, $1 - \hat{y}$.
• Compute the log probabilities.
• Sum the log probabilities associated with the true labels.

Example with five data points:

  $\hat{y}$   $1 - \hat{y}$   $\log(\hat{y})$   $\log(1 - \hat{y})$
  0.64        0.36            −0.45             −1.01
  0.27        0.73            −1.31             −0.31
  0.04        0.96            −3.19             −0.04
  0.02        0.98            −1.1              −0.02
  0.81        0.19            −0.21             −1.68

Selecting the entry that matches each true label $y$:

  $y$   $\log(1 - \hat{y})$   $\log(\hat{y})$   selected
  1     −1.01                 −0.45             −0.45
  0     −0.31                 −1.31             −0.31
  0     −0.04                 −3.19             −0.04
  1     −0.02                 −1.1              −1.1
  0     −1.68                 −0.21             −1.68

Log-likelihood = $\sum_i \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right)$
= −0.45 − 0.31 − 0.04 − 1.1 − 1.68 = −3.58
• The final operation of picking out the correct entries in a matrix is also sometimes referred
to as masking.
• The mask is constructed based on the true labels.
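The masking step can be sketched in a few lines of Python, using the labels and log-probabilities from the worked example. The variable names are illustrative.

```python
# True labels and log-probabilities as given in the worked example
y_true = [1, 0, 0, 1, 0]
log_p_pos = [-0.45, -1.31, -3.19, -1.1, -0.21]   # log(y_hat)
log_p_neg = [-1.01, -0.31, -0.04, -0.02, -1.68]  # log(1 - y_hat)

# The mask is built from the true labels: take log(y_hat) where the label is 1,
# and log(1 - y_hat) where the label is 0
picked = [
    lp if label == 1 else ln
    for label, lp, ln in zip(y_true, log_p_pos, log_p_neg)
]
log_lik = sum(picked)
print(picked, round(log_lik, 2))
```

The selected entries are -0.45, -0.31, -0.04, -1.1 and -1.68, and their sum is -3.58, matching the worked example.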
Minimizing the Negative Log-Likelihood
• Negating the log-likelihood turns maximization into minimization of a loss, the Negative
Log-Likelihood Loss:
$$J(\theta) = -\sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right)$$
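To show what minimizing this loss looks like in practice, here is a minimal sketch that fits a one-feature logistic model with plain gradient descent. The data set, the learning rate, and the model form sigmoid(w*x + b) are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(w, b, xs, ys):
    # Negative log-likelihood J(theta), summed over the data
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def nll_grad(w, b, xs, ys):
    # For a sigmoid output, dJ/dz for one example is (y_hat - y)
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb


# Hypothetical data, deliberately not perfectly separable so the weights stay finite
xs = [-2.0, -1.0, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1]

w, b = 0.0, 0.0
start = nll(w, b, xs, ys)
for _ in range(200):
    gw, gb = nll_grad(w, b, xs, ys)
    w -= 0.1 * gw
    b -= 0.1 * gb
end = nll(w, b, xs, ys)
print(start, end)
```

With w = b = 0 every prediction is 0.5, so the starting loss is 5 * log(2). The loss after training is lower.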
Cross-Entropy
• Negative log-likelihood is the same as the cross-entropy between $y$ (true labels) and $\hat{y}$
(predicted probabilities of the true labels):
$$h(y, \hat{y}) = -\sum_{i=1}^{m} y_i \log \hat{y}_i$$

• Cross Entropy Loss applies a softmax activation followed by a log transformation to its input
before computing the loss.

• Negative Log-Likelihood Loss does not; it expects its input to already contain log-probabilities.

Generalizing to Multiclass
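• Based on the "masking principle", rewrite the log-likelihood as

$$\log P(D \mid \theta) = \sum_{i=1}^{m} \log \hat{y}_i^{(y_i)} = \sum_{i=1}^{m} \left( y_i \cdot \log \hat{y}_i \right)$$

where $\hat{y}_i^{(y_i)}$ is the predicted probability of the true class of the i-th data point. In the
second sum, $y_i$ is read as a one-hot vector, so the product keeps only the entry of the true class.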
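The two forms of the multiclass log-likelihood can be compared in a short sketch: picking the probability of the true class by index, and multiplying by a one-hot mask. The probabilities and labels below are hypothetical.

```python
import math

# Hypothetical predicted class probabilities (each row sums to 1) and true class indices
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]


def one_hot(k, n):
    # Vector of length n with a 1 at position k
    return [1.0 if j == k else 0.0 for j in range(n)]


# Form 1: index the log-probability of the true class directly
ll_index = sum(math.log(p[k]) for p, k in zip(probs, labels))

# Form 2: multiply the log-probabilities by a one-hot mask and sum
ll_onehot = sum(
    sum(t * math.log(q) for t, q in zip(one_hot(k, len(p)), p))
    for p, k in zip(probs, labels)
)
print(ll_index, ll_onehot)
```

Both forms give the same value, here log(0.7 * 0.8 * 0.4). The index form avoids computing terms that are multiplied by zero.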
When to Use Negative Log-Likelihood Loss, BCE and CE?
• Consider Loss($\hat{y}$, y), where $\hat{y}$ holds the prediction values or some transformed version
of them, and y is the label.
• Apply binary cross-entropy (BCE) Loss if $\hat{y}$ is the probability of a data point being positive.
• Apply Negative Log-Likelihood Loss or Cross Entropy (CE) Loss when $\hat{y}$ is two-dimensional
and y is one-dimensional, taking values from 0 to C-1 for C classes.
• Apply Negative Log-Likelihood Loss if $\hat{y}$ encodes log-likelihoods (it essentially performs
the masking step followed by mean reduction).
• Apply Cross Entropy Loss if $\hat{y}$ encodes raw prediction values that still need to be activated
using the softmax function.
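The relationship between the three losses can be sketched in plain Python. The functions below are illustrative re-implementations under the conventions described in this list (mean reduction, class indices from 0 to C-1). They are not the API of any particular library.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax: subtract the log-sum-exp
    m = max(logits)
    lse = m + math.log(sum(math.exp(a - m) for a in logits))
    return [a - lse for a in logits]


def nll_loss(log_probs, targets):
    # Expects log-probabilities; masking step followed by mean reduction
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)


def cross_entropy_loss(logits, targets):
    # Expects raw prediction values; softmax + log, then the NLL loss
    return nll_loss([log_softmax(row) for row in logits], targets)


def bce_loss(probs, targets):
    # Expects the probability of the positive class for each data point
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(probs, targets)
    ) / len(targets)


logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # hypothetical raw predictions, C = 3
targets = [0, 2]
ce = cross_entropy_loss(logits, targets)

# Two-class check: CE on logits [0, z] matches BCE on sigmoid(z)
z = 1.2
ce_binary = cross_entropy_loss([[0.0, z]], [1])
bce_binary = bce_loss([1.0 / (1.0 + math.exp(-z))], [1])
print(ce, ce_binary, bce_binary)
```

The two-class check works because the softmax of [0, z] assigns probability sigmoid(z) to class 1, so both losses reduce to -log(sigmoid(z)).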
Choosing Output Units – Gradient Based Learning
• The choice of cost function is tightly coupled with the choice of output unit.
• Most of the time, we simply use the cross-entropy between the data distribution and the model
distribution.
• Determine the form of the cross-entropy function (Binary / Categorical) based on output
representation.
• Any kind of neural network unit that can be used as an output can also be used as a hidden unit.
1. Linear Units for Gaussian Output Distributions
• A linear unit is a simple output unit based on an affine transformation with no non-linearity.
• Given features $h$, a layer of linear output units produces a vector $\hat{y} = W^{T} h + b$.
• Linear output layers are often used to produce the mean of a conditional Gaussian
distribution: $p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$.
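For a Gaussian with identity covariance, the negative log-likelihood differs from half the squared error only by a constant, so maximizing the likelihood is equivalent to minimizing the squared error. The sketch below checks this numerically; the target and the two candidate predictions are hypothetical.

```python
import math


def gaussian_nll(y, y_hat):
    # -log N(y; y_hat, I) for a d-dimensional target with identity covariance
    d = len(y)
    sq = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return 0.5 * sq + 0.5 * d * math.log(2 * math.pi)


def half_squared_error(y, y_hat):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, y_hat))


y = [1.0, -2.0]        # hypothetical target
y_hat_a = [0.5, -1.0]  # hypothetical prediction A
y_hat_b = [1.0, -2.5]  # hypothetical prediction B

# The constant term cancels when two predictions are compared
diff_nll = gaussian_nll(y, y_hat_a) - gaussian_nll(y, y_hat_b)
diff_sq = half_squared_error(y, y_hat_a) - half_squared_error(y, y_hat_b)
print(diff_nll, diff_sq)
```

Both differences equal 0.5 for these values, so the two costs rank the predictions identically.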
2. Sigmoid Units for Bernoulli Output Distributions
• Use sigmoid units for binary classification problems.
• The maximum-likelihood approach defines a Bernoulli distribution over y conditioned on
x.
• A Bernoulli distribution is defined by a single number (a scalar).
• The neural net needs to predict only $P(y = 1 \mid x)$.
• For this number to be a valid probability, it must lie in the interval [0, 1].

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^{T} x + b$$
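A direct implementation of the sigmoid formula can overflow for large negative z, because exp(-z) becomes too large for a float. A common workaround, sketched below as one possible implementation, is to branch on the sign of z.

```python
import math


def sigmoid(z):
    # Branch on the sign of z so that exp() never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


print(sigmoid(0.0), sigmoid(-1000.0), sigmoid(1000.0))
```

Both branches compute the same mathematical function. The second form is obtained by multiplying the numerator and denominator of the first by exp(z).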
3. Softmax Units for Multinoulli Output Distributions
• Softmax units represent a probability distribution over a discrete variable with n possible classes.
• More rarely, softmax functions can be used inside the model itself, to choose between one
of n different options for some internal variable.
• In the case of binary variables, the network produced a single number $\hat{y} = P(y = 1 \mid x)$;
for n classes it must produce a vector with one probability per class.

$$\mathrm{softmax}(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{n} e^{a_j}} \qquad \forall i \in 1..n$$
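The softmax formula can be sketched as follows. Subtracting the maximum before exponentiating is a standard stability trick that leaves the result unchanged. The input values are hypothetical.

```python
import math


def softmax(a):
    # Shift by the maximum so the largest exponent is exp(0) = 1
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical pre-activations
print(probs, sum(probs))
```

The outputs are positive, sum to one, and preserve the ordering of the inputs. The shift also keeps very large inputs such as [1000.0, 1000.0] from overflowing.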
Choosing Optimizers – Gradient Based Learning
• Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent
• Adagrad
• RMSProp
• Adam
• Adadelta
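To give a feel for how two of these optimizers differ, here is a minimal sketch of plain gradient descent and Adam on a one-dimensional quadratic. The objective, the learning rates, and the step counts are assumptions made for this example. The Adam hyperparameters beta1 = 0.9, beta2 = 0.999 and eps = 1e-8 are the commonly used defaults.

```python
import math


def grad(w):
    # Gradient of the hypothetical objective f(w) = (w - 3)**2
    return 2.0 * (w - 3.0)


def run_gd(w, lr=0.1, steps=100):
    # Gradient descent: step against the gradient with a fixed learning rate
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    # Adam: running averages of the gradient (m) and squared gradient (v),
    # with bias correction for the zero initialization
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_gd = run_gd(0.0)
w_adam = run_adam(0.0)
print(w_gd, w_adam)
```

Both runs approach the minimizer w = 3. Adam scales each step by the running estimate of the gradient magnitude, so its early steps have a size close to the learning rate regardless of how large the raw gradient is.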
https://www.analyticsvidhya.com/blog/2021/03/variants-of-gradient-descent-algorithm/

Issues in GD
https://www.scaler.com/topics/momentum-based-gradient-descent/

Example

https://www.geeksforgeeks.org/intuition-of-adam-optimizer/
https://optimization.cbe.cornell.edu/index.php?title=Adam
brief

https://towardsai.net/p/l/impact-of-optimizers-in-image-classifiers
