CS412 Semester I, 2019, USP, Fiji

Lecturer: Anuraganand Sharma

Proof of cost and derivative of softmax function

The softmax function is used as an activation function for multi-class classification problems. This proof applies to the use of the softmax function in linear models such as logistic regression and in neural networks [1, 2].

We take softmax as our hypothesis $h(\cdot)$ for the target classification function, interpreted as a probability. Here a data instance belongs to one and only one of $T$ classes.

$$h(s_j) = \frac{e^{s_j}}{\sum_{k=1}^{T} e^{s_k}}$$

where $s_j = W_j^T x_n$ in the context of logistic regression, $W_j$ is the weight vector of class $j$ and $x_n$ is the $n$-th data instance.
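As a quick illustration (not part of the original notes), a minimal NumPy sketch of this hypothesis; the function name softmax and the toy scores are chosen here purely for illustration:

import numpy as np

def softmax(s):
    # Softmax hypothesis h(s_j) = e^{s_j} / sum_k e^{s_k} for a score vector s.
    exp_s = np.exp(s)
    return exp_s / exp_s.sum()

# Toy scores s_j = W_j^T x_n for T = 3 classes.
s = np.array([2.0, 1.0, 0.1])
h = softmax(s)
print(h)        # approximately [0.659 0.242 0.099]
print(h.sum())  # 1.0 -- a valid probability distribution over the T classes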

If $h(x)$ gives the probability that a training example $x$ belongs to a given class $t$, then:

$$P(y|x) = \begin{cases} h(x), & y \in \text{class } t \\ 1 - h(x), & y \notin \text{class } t \end{cases}$$

$y$ can be written as a binary vector of size $T$ whose $i$-th entry is 1 if the example belongs to the $i$-th class and 0 otherwise; for example, with $T = 4$ classes and $t = 2$, $y = [0\ 1\ 0\ 0]^T$.

If a training example $x$ belongs to class $t$ then:

$$P(y|x) = [y_1 \cdots y_t \cdots y_T]\begin{bmatrix} \dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}} \end{bmatrix}, \qquad y_t = 1 \wedge \{\forall y_i = 0 \mid i \neq t\}$$

Substitute $h(x)$ with the following term:

$$h(x, y) = [y_1 \cdots y_t \cdots y_T]\begin{bmatrix} \dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}} \end{bmatrix} = \frac{y_t\, e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} = \frac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}}$$
If $x$ does not belong to class $t$ then it belongs to one of the other $T - 1$ classes.

$$1 - h(x, y') = 1 - [y'_1 \cdots y'_t \cdots y'_T]\begin{bmatrix} \dfrac{e^{s_1}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_t}}{\sum_{k=1}^{T} e^{s_k}} \\ \vdots \\ \dfrac{e^{s_T}}{\sum_{k=1}^{T} e^{s_k}} \end{bmatrix} = 1 - \sum_{i=1}^{T} \frac{y'_i\, e^{W_i^T x}}{\sum_{k=1}^{T} e^{W_k^T x}}$$

where $y'_t = 0$ and $y'_i = 1$ for all $i \neq t$.

For logistic regression we have $s = W^T x$, where $x$ is a data instance of dimension $d \times 1$ and $W$ is a weight matrix of dimension $d \times T$.

$$\therefore\; 1 - h(x, y') = 1 - \sum_{\substack{i=1 \\ i \neq t}}^{T} \frac{e^{W_i^T x}}{\sum_{k=1}^{T} e^{W_k^T x}} = \frac{e^{W_t^T x}}{\sum_{k=1}^{T} e^{W_k^T x}} = h(x, y)$$

$$P(y|x) = \begin{cases} h(x, y), & y \in \text{class } t \\ 1 - h(x, y'), & y' \notin \text{class } t \end{cases}$$

$$P(y|x) = h(x, y) = \frac{y\, e^{W^T x}}{\sum_{k=1}^{T} e^{W_k^T x}} = \theta(y_n, W^T x_n), \quad \text{where } y_n(t) = 1 \wedge \{\forall y_n(i) = 0 \mid i \neq t\}$$
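As a small numerical check (a sketch added here, not from the notes; the random $W$, $x_n$, class $t$ and the helper softmax are illustrative assumptions), the dot product with the one-hot $y_n$ picks out the probability of the true class $t$, and $1 - h(x, y')$ over the remaining classes gives the same value:

import numpy as np

def softmax(s):
    exp_s = np.exp(s)
    return exp_s / exp_s.sum()

rng = np.random.default_rng(0)
d, T, t = 4, 3, 1                   # d features, T classes, true class t (0-indexed in code)
W = rng.normal(size=(d, T))         # weight matrix, d x T
x_n = rng.normal(size=d)            # one data instance, d x 1

p = softmax(W.T @ x_n)              # softmax over the scores s_j = W_j^T x_n
y = np.zeros(T); y[t] = 1.0         # one-hot label, y_t = 1
y_prime = 1.0 - y                   # complement indicator, y'_i = 1 for all i != t

theta = y @ p                       # h(x, y): picks out the t-th probability
print(np.isclose(theta, p[t]))              # True
print(np.isclose(1.0 - y_prime @ p, p[t]))  # True: 1 - h(x, y') = h(x, y)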

To maximize the likelihood:

$$\prod_{n=1}^{N} P(y_n|x_n) = \prod_{n=1}^{N} \theta(y_n, W^T x_n)$$

Or minimize the following error function:

$$E = -\prod_{n=1}^{N} \theta(y_n, W^T x_n), \quad \text{or}$$

$$E = -\frac{1}{N}\ln\left(\prod_{n=1}^{N} \theta(y_n, W^T x_n)\right) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{1}{\theta(y_n, W^T x_n)}\right)$$

$$E = \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{1}{\theta(y_n, W^T x_n)}\right) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{\sum_{i=1}^{T} e^{W_i^T x_n}}{y_n\, e^{W^T x_n}}\right)$$
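A hedged NumPy sketch of this error function over a small batch (the helper name cross_entropy_error and the toy data are assumptions made for illustration, not part of the notes):

import numpy as np

def cross_entropy_error(W, X, Y):
    # E = (1/N) * sum_n ln(1 / theta(y_n, W^T x_n)), with one-hot rows in Y.
    S = X @ W                                              # N x T scores, row n holds W_k^T x_n
    P = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # row-wise softmax
    theta = (Y * P).sum(axis=1)                            # probability assigned to the true class
    return np.mean(np.log(1.0 / theta))                    # equivalently -mean(log(theta))

rng = np.random.default_rng(1)
N, d, T = 5, 4, 3
X = rng.normal(size=(N, d))                 # N data instances as rows
W = rng.normal(size=(d, T))                 # d x T weight matrix
Y = np.eye(T)[rng.integers(0, T, size=N)]   # one-hot targets, N x T
print(cross_entropy_error(W, X, Y))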

$$\frac{\partial E_i}{\partial W_j} = \frac{1}{N}\sum_{n=1}^{N} \left(\frac{1 \times e^{W_i^T x_n}}{\sum_{k=1}^{T} e^{W_k^T x_n}}\right) \frac{\partial}{\partial W_j}\left(\frac{\sum_{k=1}^{T} e^{W_k^T x_n}}{e^{W_i^T x_n}}\right), \quad \text{where } y_n(i) = 1 \wedge \{\forall y_n(j) = 0 \mid j \neq i\}$$

Next we use the quotient rule for derivatives, $f(x) = \frac{u(x)}{v(x)} \Rightarrow f'(x) = \frac{u'(x)v(x) - v'(x)u(x)}{v(x)^2}$, with $u = \sum_{k=1}^{T} e^{W_k^T x_n}$ and $v = e^{W_i^T x_n}$.

Here $u'(x) = x_n e^{W_j^T x_n}$

$v'(x) = x_n e^{W_j^T x_n}$, or $0$ if $i \neq j$

For $i = j$:

$$\frac{1}{N}\sum_{n=1}^{N} \left(\frac{e^{W_i^T x_n}}{\sum_{k=1}^{T} e^{W_k^T x_n}}\right)\left(\frac{x_n e^{W_j^T x_n} \cdot e^{W_i^T x_n} - x_n e^{W_j^T x_n} \cdot \sum_{k=1}^{T} e^{W_k^T x_n}}{e^{W_i^T x_n} \cdot e^{W_i^T x_n}}\right)$$

$$\frac{\partial E_i}{\partial W_j} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{x_n e^{W_j^T x_n}}{\sum_{i=1}^{T} e^{W_i^T x_n}} - x_n\right) = \frac{1}{N}\sum_{n=1}^{N} x_n\left(\frac{e^{W_j^T x_n}}{\sum_{i=1}^{T} e^{W_i^T x_n}} - 1\right) = \frac{1}{N}\sum_{n=1}^{N} x_n\left(\theta(W_j^T x_n) - 1\right)$$

For stochastic gradient descent (a single example $x_n$), when $i \neq j$:

$$\frac{\partial E_i}{\partial W_j} = x_n \cdot \theta(W_j^T x_n)$$

Combining both cases into a matrix, with the diagonal values corresponding to $i = j$:

$$\frac{\partial E_i}{\partial W_j} = \begin{bmatrix} x_n\left(\theta(W_1^T x_n) - 1\right) & \cdots & x_n \cdot \theta(W_1^T x_n) \\ \vdots & \ddots & \vdots \\ x_n \cdot \theta(W_T^T x_n) & \cdots & x_n\left(\theta(W_T^T x_n) - 1\right) \end{bmatrix}$$

$$\frac{\partial E}{\partial W} = x_n\left(\begin{bmatrix} \theta(W_1^T x_n) \\ \vdots \\ \theta(W_T^T x_n) \end{bmatrix} - y_n\right)$$

$$\frac{dE}{dW} = x_n\left(\theta(W^T x_n) - y_n\right) = x_n\left(\frac{e^{W^T x_n + D}}{\sum_{k=1}^{T} e^{W_k^T x_n + D}} - y_n\right)$$

$$E = \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{\sum_{i=1}^{T} e^{W_i^T x_n + D}}{e^{W^T x_n + D}}\right)$$

An additional constant $D$ is introduced to avoid numerical overflow for large input values, where $D = -\max(W_1^T x_n, \ldots, W_T^T x_n)$ [2].
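The result can be put together in a short sketch, assuming NumPy and toy data: the per-example gradient $x_n(\theta(W^T x_n) - y_n)$ computed with the shift $D = -\max_k(W_k^T x_n)$, a finite-difference check of one gradient entry, and one stochastic gradient descent update with an assumed learning rate. Names and values here are illustrative, not from the notes.

import numpy as np

def stable_softmax(s):
    # Softmax with the shift D = -max(s), as in [2], to avoid overflow.
    exp_s = np.exp(s - np.max(s))
    return exp_s / exp_s.sum()

def grad_E(W, x_n, y_n):
    # Per-example gradient dE/dW = x_n (theta(W^T x_n) - y_n), a d x T matrix.
    theta = stable_softmax(W.T @ x_n)
    return np.outer(x_n, theta - y_n)

def E(W, x_n, y_n):
    # Per-example error ln(1 / theta_t) for the true class t.
    theta = stable_softmax(W.T @ x_n)
    return -np.log(theta @ y_n)

rng = np.random.default_rng(2)
d, T, t = 4, 3, 2
W = rng.normal(size=(d, T))
x_n = rng.normal(size=d)
y_n = np.eye(T)[t]                    # one-hot label for class t

# Finite-difference check of one gradient entry against the formula above.
eps = 1e-6
G = grad_E(W, x_n, y_n)
W_plus = W.copy();  W_plus[1, t] += eps
W_minus = W.copy(); W_minus[1, t] -= eps
numeric = (E(W_plus, x_n, y_n) - E(W_minus, x_n, y_n)) / (2 * eps)
print(np.isclose(G[1, t], numeric))   # True, up to floating-point error

# One stochastic gradient descent update with an assumed learning rate.
lr = 0.1
W = W - lr * grad_E(W, x_n, y_n)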

References:

[1] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. S.l.: AMLBook, 2012.
[2] E. Bendersky, "The Softmax function and its derivative," 2016. [Online]. Available: http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/. [Accessed: 17-Jun-2017].
