Proof of Softmax
The softmax function is used as an activation function for multi-class classification problems. This proof applies to the use of the softmax function in linear models such as logistic regression and in neural networks [1, 2].
We take softmax as our hypothesis $h(\cdot)$ for the target classification function, interpreted as a probability function. Here each data instance belongs to one and only one of $T$ classes.
$$
h(s_j) = \frac{e^{s_j}}{\sum_{k=1}^{T} e^{s_k}}, \qquad s_j = W_j^T x_n
$$
in the context of logistic regression, where $W_j$ is the weight vector of class $j$ and $x_n$ is the $n^{th}$ data instance.
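As a quick sanity check, the hypothesis above can be sketched in a few lines of NumPy (the names `softmax` and `scores` are illustrative, not part of the text):

```python
import numpy as np

def softmax(s):
    # h(s_j) = e^{s_j} / sum_k e^{s_k} for a score vector s of length T
    e = np.exp(s)
    return e / e.sum()

# Assumed scores s_j = W_j^T x_n for a single instance with T = 3 classes
scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # e.g. [0.659 0.242 0.099]
print(softmax(scores).sum())  # ~1.0, as required of a probability function
```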
$$
P(y \mid x) =
\begin{cases}
h(x), & y \in \text{class } t \\
1 - h(x), & y \notin \text{class } t
\end{cases}
$$
$y$ can be written as a binary vector of size $T$ in which a value of 1 at the $i^{th}$ position indicates that the instance belongs to the $i^{th}$ class, and 0 otherwise.
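For instance, with $T = 4$ classes and an instance of the third class, the label vector is $y = [0, 0, 1, 0]$; a one-line NumPy construction (purely illustrative, using 0-based indexing) could be:

```python
import numpy as np

T = 4
t = 2             # 0-based index of the class (third class)
y = np.eye(T)[t]  # one-hot label vector
print(y)          # [0. 0. 1. 0.]
```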
$$
P(y \mid x) =
\begin{bmatrix} y_1 & \cdots & y_t & \cdots & y_T \end{bmatrix}
\begin{bmatrix}
\dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}}
\end{bmatrix},
\qquad y_t = 1 \;\wedge\; \{\forall y_i = 0 \mid i \neq t\}
$$
$$
h(x, y) =
\begin{bmatrix} y_1 & \cdots & y_t & \cdots & y_T \end{bmatrix}
\begin{bmatrix}
\dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}}
\end{bmatrix}
= \frac{y_t\, e^{s_t}}{\sum_{k=1}^{T} e^{s_k}}
= \frac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}}
$$
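A small numerical sketch of this step (hypothetical scores, NumPy): dotting the one-hot $y$ with the softmax vector simply selects the probability of the labelled class $t$.

```python
import numpy as np

def softmax(s):
    e = np.exp(s)
    return e / e.sum()

s = np.array([2.0, 1.0, 0.1])  # assumed scores s_j = W_j^T x_n, T = 3
y = np.array([0.0, 1.0, 0.0])  # instance belongs to the second class
h_xy = y @ softmax(s)          # h(x, y) = y . softmax(s)
print(np.isclose(h_xy, softmax(s)[1]))  # True: equals e^{s_t} / sum_k e^{s_k}
```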
If $x$ does not belong to class $t$, then it belongs to one of the other $T - 1$ classes.
Let $y'$ be the indicator vector of the remaining classes, i.e. $y'_t = 0$ and $y'_i = 1$ for $i \neq t$. Then
$$
1 - h(x, y') = 1 -
\begin{bmatrix} y'_1 & \cdots & y'_t & \cdots & y'_T \end{bmatrix}
\begin{bmatrix}
\dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}}
\end{bmatrix}
= 1 - \sum_{i=1}^{T} \frac{y'_i\, e^{W_i^T x}}{\sum_{k=1}^{T} e^{W_k^T x}}
$$
For logistic regression we have $s = W^T x$, where $x$ is a data instance of dimension $d \times 1$ and $W$ is a weight matrix of dimension $d \times T$.
$$
\therefore\; 1 - h(x, y')
= 1 - \sum_{i=1}^{T} \frac{y'_i\, e^{W_i^T x}}{\sum_{k=1}^{T} e^{W_k^T x}}
= 1 - \sum_{i \neq t} \frac{e^{W_i^T x}}{\sum_{k=1}^{T} e^{W_k^T x}}
= \frac{e^{W_t^T x}}{\sum_{k=1}^{T} e^{W_k^T x}}
= h(x, y)
$$
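The identity can also be checked numerically (a small sketch with assumed scores; the `softmax` helper is not from the text):

```python
import numpy as np

def softmax(s):
    e = np.exp(s)
    return e / e.sum()

s = np.array([2.0, 1.0, 0.1])           # assumed scores for T = 3 classes
theta = softmax(s)
t = 0                                   # labelled class (0-based here)
one_minus_rest = 1.0 - theta[np.arange(len(s)) != t].sum()
print(np.isclose(one_minus_rest, theta[t]))  # True: 1 - h(x, y') = h(x, y)
```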
$$
P(y \mid x) =
\begin{cases}
h(x, y), & y \in \text{class } t \\
1 - h(x, y'), & y' \notin \text{class } t
\end{cases}
$$
$$
P(y_n \mid x_n) = h(x_n, y_n) = \frac{y_n\, e^{W^T x_n}}{\sum_{k=1}^{T} e^{W_k^T x_n}} = \theta(y_n, W^T x_n),
\qquad y_n(t) = 1 \;\wedge\; \{\forall y_n(i) = 0 \mid i \neq t\}
$$
$$
\prod_{n=1}^{N} P(y_n \mid x_n) = \prod_{n=1}^{N} \theta(y_n, W^T x_n)
$$
$$
E = -\prod_{n=1}^{N} \theta(y_n, W^T x_n)
\quad \text{or} \quad
E = -\frac{1}{N} \ln\!\left( \prod_{n=1}^{N} \theta(y_n, W^T x_n) \right)
= \frac{1}{N} \sum_{n=1}^{N} \ln\!\left( \frac{1}{\theta(y_n, W^T x_n)} \right)
$$
$$
E = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left( \frac{1}{\theta(y_n, W^T x_n)} \right)
= \frac{1}{N} \sum_{n=1}^{N} \ln\!\left( \frac{\sum_{i=1}^{T} e^{W_i^T x_n}}{y_n\, e^{W^T x_n}} \right)
$$
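A hedged NumPy sketch of this error function follows; the shapes, random data, and variable names are assumptions for illustration, not the text's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, N = 5, 3, 10
W = rng.normal(size=(d, T))          # weight matrix, d x T
X = rng.normal(size=(N, d))          # N data instances x_n as rows
labels = rng.integers(0, T, size=N)  # class index t for each instance
Y = np.eye(T)[labels]                # one-hot label vectors y_n, N x T

S = X @ W                            # scores W^T x_n for every instance, N x T
# E = (1/N) sum_n ln( sum_i e^{W_i^T x_n} / e^{W_t^T x_n} ),
# i.e. the average of ln(1 / theta(y_n, W^T x_n))
E = np.mean(np.log(np.exp(S).sum(axis=1)) - (Y * S).sum(axis=1))
print(E)
```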
$$
\frac{\partial E_i}{\partial W_j}
= \frac{1}{N} \sum_{n=1}^{N}
\left( \frac{e^{W_i^T x_n}}{\sum_{k=1}^{T} e^{W_k^T x_n}} \right)
\frac{\partial}{\partial W_j}
\left( \frac{\sum_{k=1}^{T} e^{W_k^T x_n}}{e^{W_i^T x_n}} \right),
\qquad y_n(i) = 1 \;\wedge\; \{\forall y_n(j) = 0 \mid j \neq i\}
$$
Applying the quotient rule $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$ with $u = \sum_{k=1}^{T} e^{W_k^T x_n}$ and $v = e^{W_i^T x_n}$, we have
$$
u'(x) = x_n e^{W_j^T x_n},
\qquad
v'(x) = \begin{cases} x_n e^{W_j^T x_n}, & i = j \\ 0, & i \neq j \end{cases}
$$
For $i = j$:
$$
\frac{1}{N} \sum_{n=1}^{N}
\left( \frac{e^{W_i^T x_n}}{\sum_{k=1}^{T} e^{W_k^T x_n}} \right)
\left( \frac{x_n e^{W_j^T x_n} \cdot e^{W_i^T x_n} - x_n e^{W_j^T x_n} \cdot \sum_{k=1}^{T} e^{W_k^T x_n}}{e^{W_i^T x_n} \cdot e^{W_i^T x_n}} \right)
$$
$$
\frac{\partial E_i}{\partial W_j}
= \frac{1}{N} \sum_{n=1}^{N} \left( \frac{x_n e^{W_j^T x_n}}{\sum_{i=1}^{T} e^{W_i^T x_n}} - x_n \right)
= \frac{1}{N} \sum_{n=1}^{N} x_n \left( \frac{e^{W_j^T x_n}}{\sum_{i=1}^{T} e^{W_i^T x_n}} - 1 \right)
= \frac{1}{N} \sum_{n=1}^{N} x_n \left( \theta(W_j^T x_n) - 1 \right)
$$
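To gain confidence in this result, the $i = j$ case can be verified against a finite-difference gradient. The sketch below uses assumed shapes and random data and labels every instance with class $j$, so that $i = j$ holds for every term.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, N, j = 4, 3, 8, 1
W = rng.normal(size=(d, T))
X = rng.normal(size=(N, d))            # every instance labelled with class j

def E(W):
    # (1/N) sum_n [ ln(sum_k e^{W_k^T x_n}) - W_j^T x_n ]
    S = X @ W
    return np.mean(np.log(np.exp(S).sum(axis=1)) - S[:, j])

theta_j = np.exp(X @ W)[:, j] / np.exp(X @ W).sum(axis=1)
analytic = (X * (theta_j - 1.0)[:, None]).mean(axis=0)  # (1/N) sum_n x_n (theta_j - 1)

numeric = np.zeros(d)
eps = 1e-6
for p in range(d):
    Wp, Wm = W.copy(), W.copy()
    Wp[p, j] += eps
    Wm[p, j] -= eps
    numeric[p] = (E(Wp) - E(Wm)) / (2 * eps)  # central difference w.r.t. W_j

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```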
Collecting these results over all classes, the diagonal entries ($i = j$) carry the extra $-1$ term, while for $i \neq j$ the $v'$ term vanishes and only $x_n\, \theta(W_j^T x_n)$ remains:
$$
\frac{\partial E_i}{\partial W_j} =
\begin{bmatrix}
x_n \left( \theta(W_1^T x_n) - 1 \right) & \cdots & x_n\, \theta(W_1^T x_n) \\
\vdots & \ddots & \vdots \\
x_n\, \theta(W_T^T x_n) & \cdots & x_n \left( \theta(W_T^T x_n) - 1 \right)
\end{bmatrix}
$$
$$
\frac{\partial E}{\partial W} = x_n \left(
\begin{bmatrix}
\theta(W_1^T x_n) \\ \vdots \\ \theta(W_T^T x_n)
\end{bmatrix}
- y_n \right)
$$
$$
\frac{dE}{dW} = x_n \left( \theta(W^T x_n) - y_n \right)
= x_n \left( \frac{e^{W^T x_n + D}}{\sum_{k=1}^{T} e^{W_k^T x_n + D}} - y_n \right)
$$
Here $D$ is a constant added to every score; since it cancels in the ratio it leaves the softmax unchanged, and it is commonly chosen as $D = -\max_k W_k^T x_n$ for numerical stability [2]. The error function can likewise be written as
$$
E = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left( \frac{\sum_{i=1}^{T} e^{W_i^T x_n + D}}{y_n\, e^{W^T x_n + D}} \right)
$$
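A short sketch of the effect of the shift $D$ (assumed values; the `softmax` helper is not from the text): adding the same constant to every score leaves the softmax output, and hence the gradient, unchanged, while preventing overflow for large scores.

```python
import numpy as np

def softmax(s):
    e = np.exp(s)
    return e / e.sum()

s = np.array([2.0, 1.0, 0.1])
print(np.allclose(softmax(s), softmax(s - s.max())))  # True: shifting by D changes nothing

big = np.array([1000.0, 1001.0, 1002.0])  # naive softmax(big) would overflow
print(softmax(big + (-big.max())))        # stable: [0.09003057 0.24472847 0.66524096]
```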
References:
[1] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. S.l.: AMLBook, 2012.
[2] E. Bendersky, “The Softmax function and its derivative,” 2016. [Online]. Available: http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/. [Accessed: 17-Jun-2017].