CS412 Semester I, 2019, USP, Fiji

Lecturer: Anuraganand Sharma

Proof of cost and derivative of softmax function

The softmax function is used as an activation function for multi-class classification problems. This proof applies to the use of the softmax function in linear models such as logistic regression and in neural networks [1, 2].

We take softmax as our hypothesis $h(\cdot)$ for the target classification function, interpreted as a probability. Here a data instance belongs to one and only one of $T$ classes.

$$h(s_j) = \frac{e^{s_j}}{\sum_{k=1}^{T} e^{s_k}}$$

where $s_j = W_j^T x_n$ in the context of logistic regression, $W_j$ is the weight vector of class $j$ and $x_n$ is the $n$-th data instance.
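As a quick illustration (not part of the original notes), a minimal NumPy sketch of this hypothesis; the function name softmax and the toy scores are chosen here purely for illustration:

import numpy as np

def softmax(s):
    # Softmax hypothesis h(s_j) = e^{s_j} / sum_k e^{s_k} for a score vector s.
    exp_s = np.exp(s)
    return exp_s / exp_s.sum()

# Toy scores s_j = W_j^T x_n for T = 3 classes.
s = np.array([2.0, 1.0, 0.1])
h = softmax(s)
print(h)        # approximately [0.659 0.242 0.099]
print(h.sum())  # 1.0 -- a valid probability distribution over the T classes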

If $h(x)$ gives the probability that a training example $x$ belongs to a given class $t$, then:

$$P(y|x) = \begin{cases} h(x), & y \in \text{class } t \\ 1 - h(x), & y \notin \text{class } t \end{cases}$$

$y$ can be written as a binary vector of size $T$ whose $i$-th entry is 1 if the example belongs to the $i$-th class and 0 otherwise; for example, with $T = 4$ classes and $t = 2$, $y = [0\ 1\ 0\ 0]^T$.

If a training example $x$ belongs to class $t$ then:

$$P(y|x) = [y_1 \cdots y_t \cdots y_T]\begin{bmatrix} \dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}} \end{bmatrix}, \qquad y_t = 1 \wedge \{\forall y_i = 0 \mid i \neq t\}$$

Substitute $h(x)$ with the following term:

$$h(x, y) = [y_1 \cdots y_t \cdots y_T]\begin{bmatrix} \dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}} \end{bmatrix} = \frac{y_t\, e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} = \frac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}}$$
If $x$ does not belong to class $t$ then it belongs to one of the other $T - 1$ classes.

$$1 - h(x, y') = 1 - [y'_1 \cdots y'_t \cdots y'_T]\begin{bmatrix} \dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}} \end{bmatrix} = 1 - \sum_{i=1}^{T} \frac{y'_i\, e^{W_i^T x}}{\sum_{k=1}^{T} e^{W_k^T x}}$$

where $y'_t = 0$ and $y'_i = 1$ for all $i \neq t$.

For logistic regression we have $s = W^T x$, where $x$ is a data instance of dimension $d \times 1$ and $W$ is a weight matrix of dimension $d \times T$.

$$\therefore\; 1 - h(x, y') = 1 - \sum_{\substack{i=1 \\ i \neq t}}^{T} \frac{e^{W_i^T x}}{\sum_{k=1}^{T} e^{W_k^T x}} = \frac{e^{W_t^T x}}{\sum_{k=1}^{T} e^{W_k^T x}} = h(x, y)$$

$$P(y|x) = \begin{cases} h(x, y), & y \in \text{class } t \\ 1 - h(x, y'), & y' \notin \text{class } t \end{cases}$$

$$P(y|x) = h(x, y) = \frac{y\, e^{W^T x}}{\sum_{k=1}^{T} e^{W_k^T x}} = \theta(y_n, W^T x_n), \quad \text{where } y_n(t) = 1 \wedge \{\forall y_n(i) = 0 \mid i \neq t\}$$
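As a small numerical check (a sketch added here, not from the notes; the random $W$, $x_n$, class $t$ and the helper softmax are illustrative assumptions), the dot product with the one-hot $y_n$ picks out the probability of the true class $t$, and $1 - h(x, y')$ over the remaining classes gives the same value:

import numpy as np

def softmax(s):
    exp_s = np.exp(s)
    return exp_s / exp_s.sum()

rng = np.random.default_rng(0)
d, T, t = 4, 3, 1                   # d features, T classes, true class t (0-indexed in code)
W = rng.normal(size=(d, T))         # weight matrix, d x T
x_n = rng.normal(size=d)            # one data instance, d x 1

p = softmax(W.T @ x_n)              # softmax over the scores s_j = W_j^T x_n
y = np.zeros(T); y[t] = 1.0         # one-hot label, y_t = 1
y_prime = 1.0 - y                   # complement indicator, y'_i = 1 for all i != t

theta = y @ p                       # h(x, y): picks out the t-th probability
print(np.isclose(theta, p[t]))              # True
print(np.isclose(1.0 - y_prime @ p, p[t]))  # True: 1 - h(x, y') = h(x, y)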

To maximize the likelihood:

$$\prod_{n=1}^{N} P(y_n|x_n) = \prod_{n=1}^{N} \theta(y_n, W^T x_n)$$

Or minimize the following error function:

$$E = -\prod_{n=1}^{N} \theta(y_n, W^T x_n), \quad \text{or}$$

$$E = -\frac{1}{N}\ln\left(\prod_{n=1}^{N} \theta(y_n, W^T x_n)\right) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{1}{\theta(y_n, W^T x_n)}\right)$$

$$E = \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{1}{\theta(y_n, W^T x_n)}\right) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{\sum_{i=1}^{T} e^{W_i^T x_n}}{y_n\, e^{W^T x_n}}\right)$$
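A hedged NumPy sketch of this error function over a small batch (the helper name cross_entropy_error and the toy data are assumptions made for illustration, not part of the notes):

import numpy as np

def cross_entropy_error(W, X, Y):
    # E = (1/N) * sum_n ln(1 / theta(y_n, W^T x_n)), with one-hot rows in Y.
    S = X @ W                                              # N x T scores, row n holds W_k^T x_n
    P = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # row-wise softmax
    theta = (Y * P).sum(axis=1)                            # probability assigned to the true class
    return np.mean(np.log(1.0 / theta))                    # equivalently -mean(log(theta))

rng = np.random.default_rng(1)
N, d, T = 5, 4, 3
X = rng.normal(size=(N, d))                 # N data instances as rows
W = rng.normal(size=(d, T))                 # d x T weight matrix
Y = np.eye(T)[rng.integers(0, T, size=N)]   # one-hot targets, N x T
print(cross_entropy_error(W, X, Y))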

$$\frac{\partial E_i}{\partial W_j} = \frac{1}{N}\sum_{n=1}^{N} \left(\frac{1 \times e^{W_i^T x_n}}{\sum_{k=1}^{T} e^{W_k^T x_n}}\right) \frac{\partial}{\partial W_j}\left(\frac{\sum_{k=1}^{T} e^{W_k^T x_n}}{e^{W_i^T x_n}}\right), \quad \text{where } y_n(i) = 1 \wedge \{\forall y_n(j) = 0 \mid j \neq i\}$$

Next we use the quotient rule for derivatives, $f(x) = \frac{u(x)}{v(x)} \Rightarrow f'(x) = \frac{u'(x)v(x) - v'(x)u(x)}{v(x)^2}$, with $u = \sum_{k=1}^{T} e^{W_k^T x_n}$ and $v = e^{W_i^T x_n}$.

Here $u'(x) = x_n e^{W_j^T x_n}$

$v'(x) = x_n e^{W_j^T x_n}$, or $0$ if $i \neq j$

For $i = j$:

$$\frac{1}{N}\sum_{n=1}^{N} \left(\frac{e^{W_i^T x_n}}{\sum_{k=1}^{T} e^{W_k^T x_n}}\right)\left(\frac{x_n e^{W_j^T x_n} \cdot e^{W_i^T x_n} - x_n e^{W_j^T x_n} \cdot \sum_{k=1}^{T} e^{W_k^T x_n}}{e^{W_i^T x_n} \cdot e^{W_i^T x_n}}\right)$$

$$\frac{\partial E_i}{\partial W_j} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{x_n e^{W_j^T x_n}}{\sum_{i=1}^{T} e^{W_i^T x_n}} - x_n\right) = \frac{1}{N}\sum_{n=1}^{N} x_n\left(\frac{e^{W_j^T x_n}}{\sum_{i=1}^{T} e^{W_i^T x_n}} - 1\right) = \frac{1}{N}\sum_{n=1}^{N} x_n\left(\theta(W_j^T x_n) - 1\right)$$

For stochastic gradient descent (a single example $x_n$), when $i \neq j$:

$$\frac{\partial E_i}{\partial W_j} = x_n \cdot \theta(W_j^T x_n)$$

Combining both cases into a matrix, with the diagonal values corresponding to $i = j$:

$$\frac{\partial E_i}{\partial W_j} = \begin{bmatrix} x_n\left(\theta(W_1^T x_n) - 1\right) & \cdots & x_n \cdot \theta(W_1^T x_n) \\ \vdots & \ddots & \vdots \\ x_n \cdot \theta(W_T^T x_n) & \cdots & x_n\left(\theta(W_T^T x_n) - 1\right) \end{bmatrix}$$

$$\frac{\partial E}{\partial W} = x_n\left(\begin{bmatrix} \theta(W_1^T x_n) \\ \vdots \\ \theta(W_T^T x_n) \end{bmatrix} - y_n\right)$$

$$\frac{dE}{dW} = x_n\left(\theta(W^T x_n) - y_n\right) = x_n\left(\frac{e^{W^T x_n + D}}{\sum_{k=1}^{T} e^{W_k^T x_n + D}} - y_n\right)$$

$$E = \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{\sum_{i=1}^{T} e^{W_i^T x_n + D}}{e^{W^T x_n + D}}\right)$$

An additional constant $D$ is introduced to avoid numerical overflow for large input values, where $D = -\max(W_1^T x_n, \ldots, W_T^T x_n)$ [2].
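The result can be put together in a short sketch, assuming NumPy and toy data: the per-example gradient $x_n(\theta(W^T x_n) - y_n)$ computed with the shift $D = -\max_k(W_k^T x_n)$, a finite-difference check of one gradient entry, and one stochastic gradient descent update with an assumed learning rate. Names and values here are illustrative, not from the notes.

import numpy as np

def stable_softmax(s):
    # Softmax with the shift D = -max(s), as in [2], to avoid overflow.
    exp_s = np.exp(s - np.max(s))
    return exp_s / exp_s.sum()

def grad_E(W, x_n, y_n):
    # Per-example gradient dE/dW = x_n (theta(W^T x_n) - y_n), a d x T matrix.
    theta = stable_softmax(W.T @ x_n)
    return np.outer(x_n, theta - y_n)

def E(W, x_n, y_n):
    # Per-example error ln(1 / theta_t) for the true class t.
    theta = stable_softmax(W.T @ x_n)
    return -np.log(theta @ y_n)

rng = np.random.default_rng(2)
d, T, t = 4, 3, 2
W = rng.normal(size=(d, T))
x_n = rng.normal(size=d)
y_n = np.eye(T)[t]                    # one-hot label for class t

# Finite-difference check of one gradient entry against the formula above.
eps = 1e-6
G = grad_E(W, x_n, y_n)
W_plus = W.copy();  W_plus[1, t] += eps
W_minus = W.copy(); W_minus[1, t] -= eps
numeric = (E(W_plus, x_n, y_n) - E(W_minus, x_n, y_n)) / (2 * eps)
print(np.isclose(G[1, t], numeric))   # True, up to floating-point error

# One stochastic gradient descent update with an assumed learning rate.
lr = 0.1
W = W - lr * grad_E(W, x_n, y_n)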

References:

[1] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. S.l.: AMLBook, 2012.
[2] E. Bendersky, "The Softmax function and its derivative," 2016. [Online]. Available: http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/. [Accessed: 17-Jun-2017].
