Lecture 18. Backpropagation

DA322M: Deep Learning by Dr. P. W. Patil, MFSDSAI


McCulloch-Pitts Neuron and Perceptron

McCulloch-Pitts neuron:   $y = 1$ if $\sum_{i=0}^{n} x_i \ge 0$,   $y = 0$ if $\sum_{i=0}^{n} x_i < 0$

Perceptron:   $y = 1$ if $\sum_{i=0}^{n} w_i x_i \ge 0$,   $y = 0$ if $\sum_{i=0}^{n} w_i x_i < 0$

➢ A perceptron separates the input space into two halves

➢ In other words, a single perceptron can only be used to implement linearly separable functions

➢ The weights (including threshold) can be learned and the inputs can be real valued
Credit: Mitesh Khapra, IITM
Perceptron Learning

➢ Consider two vectors w and x

$$w = [w_0, w_1, w_2, \ldots, w_n] \qquad x = [1, x_1, x_2, \ldots, x_n]$$

➢ We can then rewrite the perceptron rule as: $y = 1$ if $w^T x \ge 0$, and $y = 0$ if $w^T x < 0$

Perceptron Learning

➢ Consider some points (vectors) which lie in the positive half space of this line (i.e., $w^T x > 0$)

➢ What will be the angle between any such vector and $w$?

▪ Obviously, less than 90°

➢ Consider some points (vectors) which lie in the negative half space of this line (i.e., $w^T x < 0$)

➢ What will be the angle between any such vector and $w$?

▪ Obviously, greater than 90°

➢ The algorithm has converged
Proof of Perceptron Learning

➢ So far we made corrections only when $w^T p_i < 0$

➢ $\cos\beta$ thus grows proportional to $\sqrt{k}$, the number of corrections

➢ But $\cos\beta$ can never exceed 1; thus, there can only be a finite number of corrections $k$ to $w$, and the algorithm will converge!
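As a concrete illustration of the learning rule whose convergence is argued above, here is a minimal NumPy sketch of the perceptron learning algorithm; the toy data, zero initialisation, and epoch cap are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def train_perceptron(P, N, max_epochs=100):
    """Perceptron learning: P holds positive examples, N negative examples.
    Each row already includes the bias input x0 = 1 as its first component."""
    w = np.zeros(P.shape[1])                  # weights, including the threshold w0
    for _ in range(max_epochs):
        converged = True
        for x in P:                           # positive points should give w.x >= 0
            if np.dot(w, x) < 0:
                w = w + x                     # correction: move w towards x
                converged = False
        for x in N:                           # negative points should give w.x < 0
            if np.dot(w, x) >= 0:
                w = w - x                     # correction: move w away from x
                converged = False
        if converged:                         # a full pass with no corrections => done
            break
    return w

# Toy linearly separable data (first column is the constant bias input 1)
P = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0]])      # positive class
N = np.array([[1.0, -1.0, -2.0], [1.0, -2.0, -1.5]])  # negative class
print("learned w:", train_perceptron(P, N))
```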
Multilayered Network of Perceptrons
Sigmoid Neuron
➢ Perceptron (step function): not smooth, not continuous (at the threshold), not differentiable

➢ Sigmoid: smooth, continuous, differentiable

➢ Typical supervised machine learning setup:

▪ Data: $\{x_i, y_i\}_{i=1}^{n}$

▪ Model: approximation of the relation between $x$ and $y$

$$\hat{y} = \frac{1}{1 + e^{-w^T x}}$$

▪ Parameters: In all the above cases, w is a parameter which needs to be learned from the data

▪ Learning algorithm: An algorithm for learning the parameters w of the model (for example,
perceptron learning algorithm, gradient descent, etc.)

▪ Objective/Loss/Error function: To guide the learning algorithm

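A minimal NumPy sketch of the sigmoid neuron model above, $\hat{y} = 1/(1 + e^{-w^T x})$; the particular weights and input values are made up for illustration.

```python
import numpy as np

def sigmoid_neuron(w, x):
    """Sigmoid neuron: y_hat = 1 / (1 + exp(-w.x)).
    w and x are assumed to already include the bias term (w0, x0 = 1)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# Illustrative parameters and input (not from the slides)
w = np.array([-1.0, 0.5, 2.0])   # [w0 (bias), w1, w2]
x = np.array([1.0, 3.0, 0.25])   # [1 (bias input), x1, x2]
print(sigmoid_neuron(w, x))       # a value squashed into (0, 1)
```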
Typical Supervised Learning: Guess Work

➢ Training the network: finding w* and b* manually

▪ With some guess work, we are able to find the optimal values for w and b
Error surface for Guess work

➢ Geometric interpretation of our “guess work” algorithm in terms of this error surface

Gradient Descent: A More Efficient and Principled Way

➢ Thus, $\mathcal{L}(\theta + \eta u) - \mathcal{L}(\theta) = \eta\, u^T \nabla_\theta \mathcal{L}(\theta) = k \cos\beta$ (for some constant $k > 0$) is most negative when $\cos\beta = -1$, i.e., $\beta = 180°$

➢ The direction u that we intend to move in should be at 180° w.r.t. the gradient

➢ In other words, move in a direction opposite to the gradient
Gradient Descent

➢ Algorithm: iteratively update the parameters in the direction opposite to the gradient (a code sketch follows below)

➢ For two points, the gradient of the total loss is the sum of the per-point gradients

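Since the algorithm itself appears on the slide only as a figure, here is a hedged NumPy sketch of gradient descent for a single sigmoid neuron with squared error, accumulating the gradient over two data points; the data values, initialisation, and learning rate are illustrative assumptions.

```python
import numpy as np

def f(w, b, x):
    """Sigmoid neuron prediction for scalar input x."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def grad_w(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x   # d/dw of 0.5 * (f(x) - y)^2

def grad_b(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)       # d/db of 0.5 * (f(x) - y)^2

# Two training points and starting values (illustrative)
X, Y = [0.5, 2.5], [0.2, 0.9]
w, b, eta, max_epochs = -2.0, -2.0, 1.0, 1000

for _ in range(max_epochs):
    dw, db = 0.0, 0.0
    for x, y in zip(X, Y):                # accumulate gradients over the two points
        dw += grad_w(w, b, x, y)
        db += grad_b(w, b, x, y)
    w -= eta * dw                         # move opposite to the gradient
    b -= eta * db

print("learned w, b:", w, b)
```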


Feed-forward Neural Network


➢ The input layer can be called the 0th layer and the output layer can be called the Lth layer

➢ $W_i \in \mathbb{R}^{n \times n}$ and $b_i \in \mathbb{R}^{n}$ are the weights and biases between layers $i-1$ and $i$ $(0 < i < L)$

➢ $W_L \in \mathbb{R}^{k \times n}$ and $b_L \in \mathbb{R}^{k}$ are the weights and bias between the last hidden layer and the output layer ($L = 3$ in this case)



➢ The pre-activation at layer $i$ is given by

$$a_i = b_i + W_i h_{i-1}$$

➢ The activation at layer $i$ is given by

$$h_i = g(a_i)$$

▪ where $g$ is called the activation function (for example, logistic, tanh, linear, etc.)

➢ The activation at the output layer is given by

$$f(x) = h_L = O(a_L)$$

▪ where $O$ is the output activation function (for example, softmax, linear, etc.)
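A minimal NumPy sketch of this forward pass; the layer sizes, random parameters, and the choices of tanh for $g$ and softmax for $O$ are illustrative assumptions.

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh):
    """Forward pass of an L-layer feed-forward network:
       a_i = b_i + W_i h_{i-1},  h_i = g(a_i),  f(x) = O(a_L) (softmax here)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers 1 .. L-1
        a = b + W @ h                              # pre-activation
        h = g(a)                                   # activation
    a_L = biases[-1] + weights[-1] @ h             # output pre-activation
    e = np.exp(a_L - a_L.max())                    # softmax as the output function O
    return e / e.sum()

# Illustrative shapes: n = 4 hidden units, k = 3 output classes, L = 3 layers
rng = np.random.default_rng(0)
n, k = 4, 3
weights = [rng.normal(size=(n, n)), rng.normal(size=(n, n)), rng.normal(size=(k, n))]
biases  = [rng.normal(size=n), rng.normal(size=n), rng.normal(size=k)]
x = rng.normal(size=n)
print(forward(x, weights, biases))                 # entries sum to 1
```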


➢ Now $\nabla_\theta \mathcal{L}(\theta)$ looks much more nasty

➢ $\nabla_\theta$ is thus composed of

$$\nabla_{W_1}, \nabla_{W_2}, \ldots, \nabla_{W_{L-1}} \in \mathbb{R}^{n \times n}, \quad \nabla_{W_L} \in \mathbb{R}^{k \times n}$$
$$\nabla_{b_1}, \nabla_{b_2}, \ldots, \nabla_{b_{L-1}} \in \mathbb{R}^{n}, \quad \nabla_{b_L} \in \mathbb{R}^{k}$$
➢ We need to answer two questions:

▪ How to choose the loss function $\mathcal{L}(\theta)$?

▪ How to compute $\nabla_\theta$, which is composed of

$$\nabla_{W_1}, \nabla_{W_2}, \ldots, \nabla_{W_{L-1}} \in \mathbb{R}^{n \times n}, \quad \nabla_{W_L} \in \mathbb{R}^{k \times n}$$
$$\nabla_{b_1}, \nabla_{b_2}, \ldots, \nabla_{b_{L-1}} \in \mathbb{R}^{n}, \quad \nabla_{b_L} \in \mathbb{R}^{k}$$


Output Functions and Loss Functions



Loss function ℒ(θ)?

➢ The choice of loss function depends on the problem at hand

➢ We will illustrate this with the help of two examples


Loss function ℒ(θ)?

➢ Consider our movie example again, but this time we are interested in predicting ratings

➢ Here, $y_j \in \mathbb{R}^{3}$

➢ The loss function should capture how much $\hat{y}_j$ deviates from $y_j$

➢ If $y_j \in \mathbb{R}^{k}$, then the squared error loss can capture this deviation:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} (\hat{y}_{ij} - y_{ij})^2$$
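A short NumPy sketch of this squared error loss; the rating values below are made up for illustration.

```python
import numpy as np

def squared_error_loss(Y_hat, Y):
    """L(theta) = (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2.
    Y_hat, Y: arrays of shape (N, k) with predicted and true ratings."""
    return np.sum((Y_hat - Y) ** 2) / Y.shape[0]

# Illustrative ratings for N = 2 examples with k = 3 outputs
Y     = np.array([[4.0, 3.0, 5.0], [2.0, 4.5, 1.0]])
Y_hat = np.array([[3.5, 3.0, 4.0], [2.5, 4.0, 1.5]])
print(squared_error_loss(Y_hat, Y))   # (0.25 + 0 + 1 + 0.25 + 0.25 + 0.25) / 2 = 1.0
```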


Loss function ℒ(θ)?

➢ A related question: what should the output function $O$ be if $y_j \in \mathbb{R}$?

➢ More specifically, can it be the logistic function?

▪ No, because it restricts $\hat{y}_j$ to a value between 0 and 1, but we want $y_j \in \mathbb{R}$

➢ So, in such cases it makes sense to have $O$ as a linear function

➢ $\hat{y}_j$ is then no longer bounded between 0 and 1


Entropy

➢ Expectation: $\mathrm{Exp} = \sum_{i} P_i\, V(i)$

➢ Entropy: $\mathrm{Ent} = -\sum_{i} P_i \log(P_i)$

➢ Cross entropy: $\mathrm{CrossE} = -\sum_{i} P_i \log(q_i)$
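A small NumPy sketch of these definitions; the example distributions are made up, and natural logarithms are used (the base is only a convention).

```python
import numpy as np

def entropy(p):
    """Ent = -sum_i p_i log p_i."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """CrossE = -sum_i p_i log q_i, measured between distributions p and q."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

p = [0.5, 0.25, 0.25]        # a "true" distribution (illustrative)
q = [0.4, 0.4, 0.2]          # a model's distribution (illustrative)
print(entropy(p))            # entropy of p
print(cross_entropy(p, q))   # >= entropy(p), with equality only when q == p
```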


Loss function ℒ(θ)?

➢ Now let us consider another problem for which a different loss function would be appropriate

➢ Suppose we want to classify an image into 1 of k classes

➢ Here, again, we could use the squared error loss to capture the deviation

➢ But can you think of a better function?


Loss function ℒ(θ)?

➢ Notice that $y$ is a probability distribution

➢ Therefore, we should also ensure that $\hat{y}$ is a probability distribution

➢ What choice of the output activation $O$ will ensure this?

$$a_L = W_L h_{L-1} + b_L$$

$$\hat{y}_j = O(a_L)_j = \frac{e^{a_{L,j}}}{\sum_{i=1}^{k} e^{a_{L,i}}}$$

➢ $O(a_L)_j$ is the $j$th element of $\hat{y}$ and $a_{L,j}$ is the $j$th element of the vector $a_L$

➢ This function is called the softmax function
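A minimal NumPy sketch of the softmax output function; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def softmax(a_L):
    """O(a_L)_j = exp(a_L[j]) / sum_i exp(a_L[i]); shifting by max() avoids overflow."""
    e = np.exp(a_L - np.max(a_L))
    return e / np.sum(e)

a_L = np.array([2.0, 1.0, 0.1])     # illustrative output pre-activations
y_hat = softmax(a_L)
print(y_hat, y_hat.sum())            # non-negative entries that sum to 1
```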
Loss function ℒ(θ)?

➢ Now that we have ensured that both $y$ and $\hat{y}$ are probability distributions, can you think of a function which captures the difference between them?

➢ Cross-entropy:

$$\mathcal{L}(\theta) = -\sum_{c=1}^{k} y_c \log \hat{y}_c$$

➢ So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function:

$$\min_{\theta} \mathcal{L}(\theta) = -y_c \log \hat{y}_c = -\log \hat{y}_l$$

$$\max_{\theta} -\mathcal{L}(\theta) = y_c \log \hat{y}_c = \log \hat{y}_l$$

▪ where $l$ is the true class: since $y$ is a one-hot vector, only the true-class term of the sum survives

➢ $\log \hat{y}_l$ is called the log-likelihood of the data
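A small NumPy sketch of this loss for a one-hot target, showing how the sum collapses to the negative log-likelihood of the true class; the predicted distribution and class index are illustrative.

```python
import numpy as np

def cross_entropy_loss(y_hat, l):
    """L(theta) = -log y_hat[l], where l is the index of the true class
    (the full sum -sum_c y_c log y_hat_c collapses to this when y is one-hot)."""
    return -np.log(y_hat[l])

y_hat = np.array([0.1, 0.7, 0.2])   # softmax output for one example (illustrative)
true_class = 1
print(cross_entropy_loss(y_hat, true_class))   # -log(0.7) ≈ 0.357
```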
Loss function ℒ(θ)?

➢ Is $\hat{y}_l$ a function of $\theta = [W_1, W_2, \ldots, W_L, b_1, b_2, \ldots, b_L]$?

➢ Yes, it is indeed a function of $\theta$:

$$\hat{y}_l = [O(W_3\, g(W_2\, g(W_1 x + b_1) + b_2) + b_3)]_l$$

➢ What does $\hat{y}_l$ encode?

➢ It is the probability that $x$ belongs to the $l$th class (we want to bring it as close to 1 as possible)


Loss function ℒ(θ)?

                    Real Values      Probabilities
Output Activation   Linear           Softmax
Loss Function       Squared Error    Cross Entropy

➢ Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often

➢ For the rest of this lecture, we will focus on the case where the output activation is a softmax function and the loss function is cross entropy


Any questions….
