Deep learning
• Designing and training a neural network is not much different from training any other
machine learning model with gradient descent.
• The largest difference between linear models and neural networks is the non-linearity
of a neural network.
• This non-linearity causes most interesting loss functions to become non-convex.
• Neural networks are usually trained using iterative, gradient-based optimizers that drive the
cost function to a very low value, rather than the linear equation solvers used to train
linear regression models or the convex optimization algorithms with guaranteed global
convergence used to train logistic regression or SVMs.
• Convex optimization converges starting from any initial parameters.
• Stochastic gradient descent applied to non-convex loss functions has no such
convergence guarantee, and is sensitive to the values of the initial parameters.
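The sensitivity to initial parameters can be illustrated with a one-dimensional sketch. The function f(w) = w^4 - 3w^2 + w, the learning rate and the starting points are illustrative assumptions (not taken from the slides), and plain rather than stochastic gradient descent is used to keep the example deterministic.

```python
def grad(w):
    # Derivative of the non-convex function f(w) = w**4 - 3*w**2 + w.
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=2000):
    # Plain gradient descent from the starting point w0.
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# The same algorithm with two different initial values.
w_from_left = gradient_descent(-2.0)
w_from_right = gradient_descent(2.0)
print(w_from_left, w_from_right)
```

Both runs stop at a point where the gradient vanishes, but they end in different local minima, which is the behavior a convex loss would rule out.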
Gradient based Learning - Introduction
• For feedforward neural networks, initialize all weights to small random values.
• The biases may be initialized to zero or small positive values.
• The training algorithm is almost always based on using the gradient to descend the cost function.
• Linear regression and SVM models can also be trained with gradient descent, which is common when the training set is large.
• Computing the gradient is slightly more complicated for a neural network than linear
models, but can still be done efficiently and exactly.
• Compute the gradient $\frac{\partial J}{\partial w_i}$ of the cost $J$ with respect to each weight $w_i$ using the back-propagation algorithm.
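A minimal sketch of these points, assuming a hypothetical network with one tanh hidden layer and a sigmoid output trained with the negative log-likelihood cost. The data, layer sizes and names (`params`, `backprop`, `numerical_grad`) are made up for illustration. Weights start as small random values, biases start at zero, and the back-propagated gradient is compared against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 examples, 3 features, binary labels (all values are made up).
X = rng.normal(size=(4, 3))
y = np.array([1.0, 0.0, 1.0, 0.0])

# Small random weights and zero biases.
params = {
    "W1": 0.1 * rng.normal(size=(3, 5)),
    "b1": np.zeros(5),
    "W2": 0.1 * rng.normal(size=(5, 1)),
    "b2": np.zeros(1),
}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(p):
    # Forward pass followed by the summed negative log-likelihood.
    h = np.tanh(X @ p["W1"] + p["b1"])
    y_hat = sigmoid(h @ p["W2"] + p["b2"]).ravel()
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backprop(p):
    # Forward pass, keeping the intermediate values.
    h = np.tanh(X @ p["W1"] + p["b1"])
    y_hat = sigmoid(h @ p["W2"] + p["b2"]).ravel()
    # Backward pass: apply the chain rule layer by layer.
    dz2 = (y_hat - y)[:, None]       # dJ/dz2 for sigmoid output + NLL
    dz1 = (dz2 @ p["W2"].T) * (1 - h**2)  # tanh'(z1) = 1 - tanh(z1)^2
    return {
        "W1": X.T @ dz1,
        "b1": dz1.sum(axis=0),
        "W2": h.T @ dz2,
        "b2": dz2.sum(axis=0),
    }

def numerical_grad(p, name, eps=1e-6):
    # Central finite differences, one parameter entry at a time.
    g = np.zeros_like(p[name])
    for idx in np.ndindex(p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        up = cost(p)
        p[name][idx] = old - eps
        down = cost(p)
        p[name][idx] = old
        g[idx] = (up - down) / (2 * eps)
    return g

grads = backprop(params)
max_err = max(
    np.max(np.abs(grads[k] - numerical_grad(params, k))) for k in params
)
print("largest difference between backprop and finite differences:", max_err)
```

The finite-difference check needs two cost evaluations per parameter, while back-propagation obtains all gradients from one forward and one backward pass.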
To apply gradient-based learning, choose:
1. The cost function,
2. The form of the output units, and
3. The optimizer.
Feed Forward Neural Network - Recap
• Hidden layers were introduced between the input and the output layers.
• Activation functions were used in the hidden layers and the choice of activation functions could
be based on the problem domain.
• Design the architecture of the network by deciding the number of layers, the connectivity between layers,
and the number of units (neurons) in each layer.
• Learning in deep neural networks requires computing the gradients of complicated functions.
• The back-propagation algorithm can be used to compute these gradients efficiently.
Cost function
• To apply gradient-based learning, a cost function must be chosen, together with the way the output of the
model is represented.
• Most modern neural networks are trained using maximum likelihood, i.e., the cost function is simply
the negative log-likelihood (equivalently, the cross-entropy between the training data and the model distribution).
• The gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm.
• Functions that saturate (become very flat) undermine this objective because they make the gradient
become very small.
• This often happens because the activation functions used to produce the output of the hidden units or the output units saturate.
• The negative log-likelihood helps to avoid this problem for many models.
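A small numeric sketch of the saturation problem, assuming a single sigmoid output unit, a true label of 1 and a pre-activation value of -10 (both values chosen only for illustration). With a squared-error cost the gradient with respect to the pre-activation nearly vanishes even though the prediction is badly wrong, whereas with the negative log-likelihood it stays close to -1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0     # true label
z = -10.0   # pre-activation: the unit is confidently wrong and saturated
y_hat = sigmoid(z)

# Squared error J = (y_hat - y)^2: dJ/dz carries the factor sigmoid'(z),
# which is tiny when the unit saturates.
grad_mse = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

# NLL J = -[y log(y_hat) + (1 - y) log(1 - y_hat)]: dJ/dz simplifies to
# y_hat - y, because the log undoes the exponential in the sigmoid.
grad_nll = y_hat - y

print("squared-error gradient:", grad_mse)
print("NLL gradient:", grad_nll)
```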
Maximum Likelihood Estimation
• The objective is to find the parameters $\theta$ that maximize the likelihood of observing the data $D$.
• Model: $\hat{y}_i = \sigma(f(x_i))$, where $\hat{y}_i$ is the predicted probability of the positive class and $\sigma$ is a non-linear
activation function that maps values from $(-\infty, \infty)$ to $[0, 1]$.
• Likelihood: $P(D \mid \theta) = \prod_{i=1}^{m} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{(1 - y_i)}$
• Log-likelihood: $\log P(D \mid \theta) = \sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right)$
• To check how accurate the predictions are, look at the predicted probabilities assigned to the correct labels.
• log() is the natural (base e) logarithm.
• $y_i$ is the binary class label, either zero or one.
• Hence, for each index $i$, the sum adds either $\log(\hat{y}_i)$ (when $y_i = 1$) or $\log(1 - \hat{y}_i)$ (when $y_i = 0$).
• $\hat{y}_i$: the predicted probability of the i-th data point belonging to the positive class.
• $(1 - \hat{y}_i)$: the predicted probability of the i-th data point belonging to the negative class.
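The step from the likelihood to the log-likelihood can be written out as follows. The logarithm turns the product over data points into a sum and brings the exponents down as factors; the index range 1..m is assumed to cover all training examples.

```latex
\begin{aligned}
\log P(D \mid \theta)
  &= \log \prod_{i=1}^{m} \hat{y}_i^{\,y_i} \, (1 - \hat{y}_i)^{(1 - y_i)} \\
  &= \sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right) \\[4pt]
\text{term } i
  &= \begin{cases}
       \log \hat{y}_i       & \text{if } y_i = 1 \\
       \log (1 - \hat{y}_i) & \text{if } y_i = 0
     \end{cases}
\end{aligned}
```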
Procedure to compute log likelihood
• Start with the predicted probabilities for the positive class ($\hat{y}$).
• If raw prediction values are given instead, apply the sigmoid to turn them into probabilities.
• Compute the probabilities for the negative class ($1 - \hat{y}$).
• Compute the log probabilities.
• Sum up the log probabilities associated with the true labels.

Predicted probabilities and their logs:

$\hat{y}$    $1 - \hat{y}$    $\log(\hat{y})$    $\log(1 - \hat{y})$
0.64     0.36     −0.45    −1.01
0.27     0.73     −1.31    −0.31
0.04     0.96     −3.19    −0.04
0.02     0.98     −1.1     −0.02
0.81     0.19     −0.21    −1.68

Log probabilities next to the true labels:

$y$    $\log(1 - \hat{y})$    $\log(\hat{y})$
1    −1.01    −0.45
0    −0.31    −1.31
0    −0.04    −3.19
1    −0.02    −1.1
0    −1.68    −0.21

• Log-likelihood $= \sum_i \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right) = −0.45 − 0.31 − 0.04 − 1.1 − 1.68 = −3.58$
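The procedure can be sketched in NumPy as follows. The raw scores are hypothetical values chosen for illustration (the slides do not give the raw predictions behind their table); the labels follow the worked example.

```python
import numpy as np

# Hypothetical raw prediction values (logits) and true binary labels.
raw = np.array([2.0, -1.0, -3.0, 0.5, -2.0])
y = np.array([1, 0, 0, 1, 0])

# Steps 1-2: apply the sigmoid to obtain positive-class probabilities.
p_pos = 1.0 / (1.0 + np.exp(-raw))

# Step 3: probabilities for the negative class.
p_neg = 1.0 - p_pos

# Step 4: log probabilities.
log_pos = np.log(p_pos)
log_neg = np.log(p_neg)

# Step 5: sum the log probabilities that belong to the true labels.
log_likelihood = np.sum(np.where(y == 1, log_pos, log_neg))
print(log_likelihood)
```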
• The final operation of picking out the correct entries in a matrix is also sometimes referred
to as masking.
• The mask is constructed based on the true labels.
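Masking can be sketched on the log-probability values from the worked example. The array layout is an assumption made for convenience: column 0 holds log(1 − ŷ) and column 1 holds log(ŷ), so the column index equals the class label.

```python
import numpy as np

# Log probabilities copied from the worked example.
# Column 0: log(1 - y_hat), column 1: log(y_hat).
log_probs = np.array([
    [-1.01, -0.45],
    [-0.31, -1.31],
    [-0.04, -3.19],
    [-0.02, -1.10],
    [-1.68, -0.21],
])
labels = np.array([1, 0, 0, 1, 0])

# The mask has a 1 in the column of the true label and 0 elsewhere.
mask = np.eye(2)[labels]
log_likelihood = np.sum(mask * log_probs)

# Equivalent formulation: index the true-label entry of every row directly.
picked = log_probs[np.arange(len(labels)), labels]

print(picked)          # the entries selected by the mask
print(log_likelihood)  # their sum
```

Multiplying by the mask and indexing with the labels select the same entries; the indexed form avoids building the mask explicitly.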
Minimizing the Negative Log-Likelihood
• Maximizing the log-likelihood is equivalent to minimizing its negative, which gives the Negative
Log-Likelihood Loss:
• $J(\theta) = -\sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right)$
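A minimal sketch of minimizing this loss with gradient descent, assuming a logistic-regression model (a single sigmoid unit) on made-up two-dimensional data. The data, learning rate and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 50 points, label 1 when the two features sum to > 0.
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def predict(w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def nll(w, b):
    # J(theta) = -sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
    y_hat = np.clip(predict(w, b), 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w = np.zeros(2)
b = 0.0
lr = 0.01
initial_loss = nll(w, b)

for _ in range(200):
    y_hat = predict(w, b)
    # Gradient of J with respect to w and b for a sigmoid output unit.
    w -= lr * (X.T @ (y_hat - y))
    b -= lr * np.sum(y_hat - y)

final_loss = nll(w, b)
print(initial_loss, final_loss)
```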
Cross-Entropy
• Negative log-likelihood is the same as the cross-entropy between $y$ (true labels) and $\hat{y}$
(predicted probabilities of the true labels).
$h(y, \hat{y}) = -\sum_{i=1}^{m} y_i \log \hat{y}_i$
Generalizing to Multiclass
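• Based on the “masking principle”, rewrite the log-likelihood as
$\log P(D \mid \theta) = \sum_{i=1}^{m} \log \hat{y}_i^{(y_i)} = \sum_{i=1}^{m} \left( y_i \cdot \log \hat{y}_i \right)$
• Here $\hat{y}_i^{(y_i)}$ is the predicted probability of the true class of data point $i$. In the last expression $y_i$ is
treated as a one-hot vector, so the product keeps only that entry.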
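A short sketch of the multiclass form, assuming three classes and hypothetical predicted probabilities. One-hot masking and direct indexing by the true class give the same log-likelihood.

```python
import numpy as np

# Hypothetical predicted class probabilities: 3 data points, 3 classes.
probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.25, 0.25, 0.50],
])
labels = np.array([0, 1, 2])  # true class of each data point

# One-hot encode the labels and use them as a mask on the log probabilities.
one_hot = np.eye(3)[labels]
ll_mask = np.sum(one_hot * np.log(probs))

# Equivalent: pick the probability of the true class for every row.
ll_index = np.sum(np.log(probs[np.arange(len(labels)), labels]))

cross_entropy = -ll_mask
print(ll_mask, ll_index, cross_entropy)
```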
WHEN TO USE Negative Log Likelihood Loss, BCE and CE?
• Loss($\hat{y}$, y), where $\hat{y}$ holds the prediction values or some transformed version of the hypothesis,
and y is the label.
• Apply BCE Loss if $\hat{y}$ is the probability of a data point being positive.
• Apply Negative Log-Likelihood Loss or Cross Entropy Loss when $\hat{y}$ is two-dimensional
and y is one-dimensional, taking values from 0 to C-1 for C classes.
• Apply Negative Log-Likelihood Loss if $\hat{y}$ encodes log-likelihoods, i.e., log-probabilities (it essentially performs
the masking step followed by negation and mean reduction).
• Apply Cross Entropy Loss if $\hat{y}$ encodes raw prediction values that need to be activated
using the softmax function.
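The relationship between the three losses can be sketched in NumPy. The function names below are hypothetical stand-ins written for this illustration, not a particular library's API, and mean reduction is assumed throughout.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def bce_loss(p, y):
    # p: probability of the positive class, shape (m,); y: 0/1 labels.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_loss(log_probs, y):
    # log_probs: shape (m, C); y: class indices in 0..C-1.
    # Masking step (pick the true-class entry), then negated mean.
    return -np.mean(log_probs[np.arange(len(y)), y])

def cross_entropy_loss(logits, y):
    # Raw scores: softmax activation plus log, then NLL.
    return nll_loss(log_softmax(logits), y)

# Hypothetical raw predictions for 3 data points and C = 2 classes.
logits = np.array([[2.0, -1.0], [0.3, 0.8], [-1.5, 0.5]])
y = np.array([0, 1, 1])

ce = cross_entropy_loss(logits, y)
nll = nll_loss(log_softmax(logits), y)

# With two classes, BCE on the positive-class probability gives the same value.
p_pos = np.exp(log_softmax(logits))[:, 1]
bce = bce_loss(p_pos, y)
print(ce, nll, bce)
```

The choice between the three therefore depends only on what the model output already encodes: probabilities, log-probabilities, or raw scores.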
Choosing Output Units – Gradient Based Learning
• The choice of cost function is tightly coupled with the choice of output unit.
• Most of the time, we simply use the cross-entropy between the data distribution and the model
distribution.
• Determine the form of the cross-entropy function (binary or categorical) based on the output
representation.
• Any kind of neural network unit that can be used as an output unit can also be used as a hidden unit.
1. Linear Units for Gaussian Output Distributions
• A linear unit is a simple output unit based on an affine transformation with no non-linearity.
• Given features $x$, a layer of linear output units produces a vector $\hat{y} = W^\top x + b$.
• Linear output layers are often used to produce the mean of a conditional Gaussian
distribution: $p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$.
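Writing out the negative log-likelihood of this Gaussian shows why a linear output unit is usually paired with the mean squared error. The symbol $d$ (the dimension of $y$) is introduced here for illustration.

```latex
\begin{aligned}
p(y \mid x) &= \mathcal{N}(y; \hat{y}, I)
  = (2\pi)^{-d/2} \exp\!\left( -\tfrac{1}{2} \lVert y - \hat{y} \rVert^2 \right) \\
-\log p(y \mid x) &= \tfrac{1}{2} \lVert y - \hat{y} \rVert^2 + \tfrac{d}{2} \log (2\pi)
\end{aligned}
```

The second term does not depend on the parameters, so maximizing the log-likelihood is equivalent to minimizing the squared error between $y$ and $\hat{y}$.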
2. Sigmoid Units for Bernoulli Output Distributions
• Choose sigmoid units for binary classification problems.
• The maximum-likelihood approach defines a Bernoulli distribution over y conditioned on
x.
• A Bernoulli distribution is defined by just a single number (a scalar).
• The neural net needs to predict only $P(y = 1 \mid x)$.
• For this number to be a valid probability, it must lie in the interval [0, 1].
• A sigmoid output unit satisfies this constraint: $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = w^\top x + b$.
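When the sigmoid unit is combined with the negative log-likelihood, the per-example loss can be computed directly from z. The sketch below uses the identity −[y log σ(z) + (1 − y) log(1 − σ(z))] = log(1 + e^z) − y·z, which avoids taking the log of a probability that has rounded to 0. The function names and test values are illustrative assumptions.

```python
import numpy as np

def nll_from_logit(z, y):
    # log(1 + exp(z)) - y * z, with log(1 + exp(z)) computed stably
    # as logaddexp(0, z).
    return np.logaddexp(0.0, z) - y * z

def nll_naive(z, y):
    # Direct formula: apply the sigmoid first, then take logs.
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# For moderate z the two formulations agree.
z = np.linspace(-3.0, 3.0, 7)
y = np.array([1, 0, 1, 0, 1, 0, 1], dtype=float)
stable = nll_from_logit(z, y)
naive = nll_naive(z, y)

# For an extreme z the stable form remains finite.
extreme = nll_from_logit(-1000.0, 1.0)
print(stable, extreme)
```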
3. Softmax Units for Multinoulli Output Distributions
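• Softmax units are used to represent a probability distribution over a discrete variable with n possible classes.
• More rarely, softmax functions can be used inside the model itself, to choose between one
of n different options for some internal variable.
• In the case of binary variables, we wished to produce a single number, $\hat{y} = P(y = 1 \mid x)$; for n classes,
we need to produce a vector $\hat{y}$ with $\hat{y}_i = P(y = i \mid x)$.
$\mathrm{softmax}(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{n} e^{a_j}}, \quad \forall i \in 1..n$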
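A minimal NumPy sketch of the softmax function for a single vector of scores. Subtracting the maximum score before exponentiating is a common stability trick that leaves the result unchanged; the input values are made up.

```python
import numpy as np

def softmax(a):
    # Shifting by the maximum prevents overflow in exp and does not
    # change the result, because the shift cancels in the ratio.
    shifted = a - np.max(a)
    e = np.exp(shifted)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())

# Large scores would overflow a naive exp(a) / sum(exp(a)).
big = softmax(np.array([1000.0, 1000.0]))
print(big)
```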
Choosing Optimizers – Gradient Based Learning
• Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent
• Adagrad
• RMSProp
• Adam
• Adadelta
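Two of the listed optimizers can be sketched on a toy problem. The objective f(w) = (w − 3)^2, the step counts and the function names are assumptions chosen for illustration; the Adam update follows the standard formulation with bias-corrected first and second moment estimates and its commonly used default coefficients.

```python
import numpy as np

def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def run_gd(w, lr=0.1, steps=500):
    # Gradient descent: step against the gradient with a fixed rate.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)  # bias correction for the zero init
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_gd = run_gd(0.0)
w_adam = run_adam(0.0)
print(w_gd, w_adam)
```

Stochastic and mini-batch gradient descent use the same update as `run_gd`, but estimate the gradient from one example or a small batch instead of the full training set.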
https://www.analyticsvidhya.com/blog/2021/03/variants-of-gradient-descent-algorithm/
Issues in GD
https://www.scaler.com/topics/momentum-based-gradient-descent/
Example
https://www.geeksforgeeks.org/intuition-of-adam-optimizer/
https://optimization.cbe.cornell.edu/index.php?title=Adam
Brief
https://towardsai.net/p/l/impact-of-optimizers-in-image-classifiers