Logistic Regression
Benjamin Rosman
[Figure: example data plotted on axes 𝑥₁ and 𝑥₂]
Generative vs discriminative
• In Naïve Bayes, we used a generative approach
• Class conditional modeling
• 𝑝(𝑦|𝒙) ∝ 𝑝(𝒙|𝑦)𝑝(𝑦)
• Discriminative: model 𝑝(𝑦|𝒙) directly
• Can’t generate data
• Often better
• Fewer variables
The logistic (sigmoid) function: 𝜎(𝑧) = 1/(1 + 𝑒^(−𝑧))
• Notes:
• 𝜎(0) = 0.5: “decision boundary”
• 𝜎′(𝑧) = 𝜎(𝑧)(1 − 𝜎(𝑧))
• −ve values of 𝑧 → class 0
• +ve values of 𝑧 → class 1
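A minimal sketch of these properties, assuming NumPy; the helper names sigmoid and sigmoid_grad are my own, not from the slides:

import numpy as np

def sigmoid(z):
    # logistic function: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative identity: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))         # 0.5 -> the decision boundary
print(sigmoid(3.0) > 0.5)   # True: positive z maps towards class 1
print(sigmoid(-3.0) < 0.5)  # True: negative z maps towards class 0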
Linear weights
• Now we need a way of incorporating features 𝑥 and
parameters/weights 𝜃
• Use the same idea of a linear weighting scheme from linear
regression
• 𝑝(𝑦 = 1|𝒙) = 𝜎(𝜃ᵀ𝜙(𝒙))
• 𝜃 is a vector of parameters
• 𝜙(𝒙) is the vector of features
• Decision boundary: 𝜎(𝑧) = 0.5 when 𝑧 = 0
• So the decision boundary is 𝜃ᵀ𝜙(𝒙) = 0
• For an M-dimensional problem, the boundary is an (M−1)-dimensional hyperplane
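A minimal sketch of this prediction rule, assuming NumPy; phi here is just a bias-plus-raw-features map, and the parameter values are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x):
    # simple feature map: prepend a bias term (phi_0 = 1)
    return np.concatenate(([1.0], x))

theta = np.array([-1.0, 2.0, 0.5])   # example parameter vector
x = np.array([0.3, 1.0])

z = theta @ phi(x)                   # theta^T phi(x)
p_y1 = sigmoid(z)                    # p(y = 1 | x)
print(p_y1, "class 1" if z >= 0 else "class 0")  # boundary at z = 0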
Linear decision boundary
In linear regression, 𝜃ᵀ𝜙(𝒙) defined the function going through our data. Here, 𝜃ᵀ𝜙(𝒙) = 0 defines the decision boundary: the function separating our classes.
Cost function
• So:
• 𝑝(𝑦 = 1|𝒙; 𝜃) = 𝜎(𝜃ᵀ𝜙(𝒙)) = ℎ_𝜃(𝒙)
• 𝑝(𝑦 = 0|𝒙; 𝜃) = 1 − ℎ_𝜃(𝒙)
• Write this more compactly as:
• 𝑝(𝑦|𝒙; 𝜃) = ℎ_𝜃(𝒙)^𝑦 (1 − ℎ_𝜃(𝒙))^(1−𝑦)
(Why? What happens when 𝑦 = 0? And 𝑦 = 1?)
• Negative log-likelihood cost with regularisation (sum over the training examples 𝑖):
• 𝐽(𝜃) = −Σᵢ [𝑦⁽ⁱ⁾ log ℎ_𝜃(𝒙⁽ⁱ⁾) + (1 − 𝑦⁽ⁱ⁾) log(1 − ℎ_𝜃(𝒙⁽ⁱ⁾))] + 𝜆 Σⱼ₌₁ᵈ 𝜃ⱼ²
• 𝜆 = strength of regularisation
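As a sketch (assuming NumPy; cost and the toy data are illustrative, and the bias 𝜃₀ is left out of the penalty, matching the sum from 𝑗 = 1), the regularised negative log-likelihood could be computed as:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, Phi, y, lam):
    # negative log-likelihood of p(y|x;theta) = h^y (1-h)^(1-y),
    # plus an L2 penalty lam * sum_j theta_j^2 (bias theta_0 not penalised)
    h = sigmoid(Phi @ theta)
    eps = 1e-12                              # avoid log(0)
    nll = -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    return nll + lam * np.sum(theta[1:] ** 2)

Phi = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])  # rows are phi(x^(i))
y = np.array([1, 0, 1])
print(cost(np.zeros(2), Phi, y, lam=0.1))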
Regularisation
• Note: the higher the weights, the “steeper” the S – so
regularisation stops the model becoming over-confident
𝑦 = 1/(1 + 𝑒^(−𝑘𝑥))   (𝑘 controls the steepness of the S)
Gradient descent (again)
• Initialise 𝜃
• Repeat until convergence:
• 𝜃ⱼ ← 𝜃ⱼ − 𝛼 ∂𝐽(𝜃)/∂𝜃ⱼ
• 0 < 𝛼 ≤ 1 is the learning rate, usually set quite small
• Simultaneous update for 𝑗 = 0, …, 𝑑:
• 𝜃₀ ← 𝜃₀ − 𝛼 Σᵢ (ℎ_𝜃(𝒙⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
• 𝜃ⱼ ← 𝜃ⱼ − 𝛼 [Σᵢ (ℎ_𝜃(𝒙⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥ⱼ⁽ⁱ⁾ + 𝜆𝜃ⱼ]
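A minimal sketch of this batch gradient-descent loop, assuming NumPy; fit_logistic and the toy data are illustrative, and the bias 𝜃₀ is left unregularised to match the update above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(Phi, y, alpha=0.1, lam=0.01, tol=1e-6, max_iters=10000):
    # batch gradient descent on the regularised cost
    theta = np.zeros(Phi.shape[1])
    for _ in range(max_iters):
        err = sigmoid(Phi @ theta) - y          # h_theta(x^(i)) - y^(i)
        grad = Phi.T @ err                      # sum_i err_i * phi_j(x^(i))
        grad[1:] += lam * theta[1:]             # + lambda * theta_j for j >= 1
        new_theta = theta - alpha * grad        # simultaneous update of all theta_j
        if np.max(np.abs(new_theta - theta)) < tol:  # threshold on the change in theta
            return new_theta
        theta = new_theta
    return theta

# toy data: column of ones (bias feature) plus one input feature
Phi = np.column_stack([np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])])
y = np.array([0, 0, 1, 1])
print(fit_logistic(Phi, y))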
Update:
𝜃0 ← 𝜃0 + 𝛼𝛿𝜃0
𝜃1 ← 𝜃1 + 𝛼𝛿𝜃1
𝜃2 ← 𝜃2 + 𝛼𝛿𝜃2
Example
• Run until convergence (threshold on the size of change of 𝜃)
Digression: the perceptron
• The logistic function gives a probabilistic output
• What if we wanted to instead force it to be {0, 1}?
• Instead of the logistic function, what about a step function?
• 𝑔(𝑧) = 1 if 𝑧 ≥ 0, and 0 if 𝑧 < 0
• Use this as before:
• 𝑝(𝑦 = 1|𝒙) = 𝑔(𝜃ᵀ𝜙(𝒙)) = ℎ_𝜃(𝒙)
• Very controversial:
• It was basically claimed that the perceptron was expected to
• “be able to walk, talk, see, write, reproduce itself and be conscious of its existence”
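A minimal sketch of the hard-threshold prediction described above, assuming NumPy; the parameter and feature values are illustrative:

import numpy as np

def step(z):
    # hard threshold instead of the logistic function
    return np.where(z >= 0, 1, 0)

theta = np.array([-0.5, 1.0, 1.0])
phi_x = np.array([1.0, 0.2, 0.7])     # feature vector with leading bias term
print(step(theta @ phi_x))            # outputs 0 or 1, not a probability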
Linear separability and XOR
• “Perceptrons” by Minsky and Papert (1969)
• Limitation of a perceptron: it cannot implement functions such as the XOR function
• Led to decreased research in neural networks, and
increased research in symbolic AI
Basis functions (again)
• Use basis functions (again) to get around the linear separability problem
• The data still needs to be separable in some (transformed) space
Add polynomial
basis functions
Basis functions (again)
• Two Gaussian basis functions: centered at (-1, -1) and (0, 0)
• Data is separable under this transformation
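A sketch of such a Gaussian basis feature map, assuming NumPy; the width parameter s is an assumption, not given on the slide:

import numpy as np

def gaussian_basis(x, centres, s=1.0):
    # phi_k(x) = exp(-||x - mu_k||^2 / (2 s^2)) for each centre mu_k
    return np.array([np.exp(-np.sum((x - mu) ** 2) / (2 * s ** 2)) for mu in centres])

centres = [np.array([-1.0, -1.0]), np.array([0.0, 0.0])]  # the two centres from the slide
x = np.array([0.5, -0.5])
print(gaussian_basis(x, centres))   # 2-D transformed feature vector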
Basis functions (again)
• The logistic function can be rewritten as: 𝜎(𝜃ᵀ𝒙) = exp(𝜃ᵀ𝒙) / (1 + exp(𝜃ᵀ𝒙))
• Given C classes:
• 𝑝(𝑦 = 𝑐ₖ|𝒙; 𝜃) = exp(𝜃ₖᵀ𝒙) / Σⱼ₌₁ᶜ exp(𝜃ⱼᵀ𝒙)
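A minimal sketch of this softmax computation, assuming NumPy; softmax_probs and the example parameters are illustrative, and subtracting the maximum logit is a standard numerical-stability trick rather than something from the slides:

import numpy as np

def softmax_probs(Theta, x):
    # p(y = c_k | x) = exp(theta_k^T x) / sum_j exp(theta_j^T x)
    logits = Theta @ x                   # one row theta_k per class
    logits -= logits.max()               # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

Theta = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])   # C = 3 classes
x = np.array([0.6, 1.5])
p = softmax_probs(Theta, x)
print(p, p.sum())   # probabilities over the 3 classes, summing to 1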