
Machine Learning – COMS3007

Logistic Regression
Benjamin Rosman

Based heavily on course notes by Chris Williams and Victor Lavrenko, Amos Storkey, Eric Eaton, and Clint van Alten
Classification
• Data 𝑋 = {𝑥⁽⁰⁾, …, 𝑥⁽ⁿ⁾}, where 𝑥⁽ⁱ⁾ ∈ ℝᵈ
• Labels 𝐲 = {𝑦⁽⁰⁾, …, 𝑦⁽ⁿ⁾}, where 𝑦⁽ⁱ⁾ ∈ {0, 1}
• Want to learn a function 𝑦 = 𝑓(𝑥, 𝜃) to predict 𝑦 for a new 𝑥

[Figure: scatter plots of the data in the (𝑥₁, 𝑥₂) plane; 𝑦 is the class (red/blue)]
Generative vs discriminative
• In Naïve Bayes, we used a generative approach
  • Class-conditional modelling
  • 𝑝(𝑦 | 𝒙) ∝ 𝑝(𝒙 | 𝑦) 𝑝(𝑦)

• Now model 𝑝(𝑦 | 𝒙) directly: the discriminative approach
  • As was the case in decision trees
  • Don't model 𝑝(𝒙)

• Discriminative:
  • Can't generate data
  • Often performs better
  • Fewer variables

• Both approaches are valid


Two class discrimination
• Consider two classes: 𝑦 ∈ {0, 1}
• We could use linear regression
  • Doesn't perform well
  • Values < 0 or > 1 don't make sense as probabilities
• We want a model of the form:
  • 𝑃(𝑦 = 1 | 𝑥) = 𝑓(𝑥; 𝜃)
  • It is a probability, so 0 ≤ 𝑓 ≤ 1
  • Also, probabilities sum to 1, so
  • 𝑃(𝑦 = 0 | 𝑥) = 1 − 𝑓(𝑥; 𝜃)
• What form should we use for 𝑓?
The logistic function
• We need a function that outputs probabilities: 0 ≤ 𝑓 ≤ 1
• Logistic function:
  • 𝑓(𝑧) = 𝜎(𝑧) = 1 / (1 + exp(−𝑧))
  • "Sigmoid function": S-shaped
  • "Squashing function": as 𝑧 goes from −∞ to ∞, 𝑓 goes from 0 to 1

[Figure: the sigmoid 𝜎(𝑧) plotted against 𝑧]

• Notes:
  • 𝜎(0) = 0.5: the "decision boundary"
  • 𝜎′(𝑧) = 𝜎(𝑧)(1 − 𝜎(𝑧))
  • −ve values of 𝑧 → class 0; +ve values of 𝑧 → class 1
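A minimal numerical sketch of the sigmoid in Python with numpy (the function names here are illustrative, not from the course code):

    import numpy as np

    def sigmoid(z):
        # Squashes any real-valued input into the interval (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        # Uses the identity sigma'(z) = sigma(z) * (1 - sigma(z)) from the slide
        s = sigmoid(z)
        return s * (1.0 - s)

    print(sigmoid(0.0))                  # 0.5 -- the decision boundary
    print(sigmoid(-5.0), sigmoid(5.0))   # ~0.0067 (class 0 side), ~0.9933 (class 1 side)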
Linear weights
• Now we need a way of incorporating the features 𝑥 and the parameters/weights 𝜃
• Use the same idea of a linear weighting scheme from linear regression:
  • 𝑝(𝑦 = 1 | 𝑥) = 𝜎(𝜃ᵀ𝜙(𝑥))
  • 𝜃 is a vector of parameters
  • 𝜙(𝑥) is the vector of features
• Decision boundary: 𝜎(𝑧) = 0.5 when 𝑧 = 0
  • So the decision boundary is 𝜃ᵀ𝜙(𝑥) = 0
  • For an M-dimensional problem, the boundary is an (M−1)-dimensional hyperplane
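For example, in two dimensions with 𝜙(𝑥) = (1, 𝑥₁, 𝑥₂) and 𝜃 = (𝜃₀, 𝜃₁, 𝜃₂), the boundary 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ = 0 is a straight line in the (𝑥₁, 𝑥₂) plane: points with 𝜃ᵀ𝜙(𝑥) > 0 get 𝑝(𝑦 = 1 | 𝑥) > 0.5, and points with 𝜃ᵀ𝜙(𝑥) < 0 get 𝑝(𝑦 = 1 | 𝑥) < 0.5.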
Linear decision boundary

[Figure: two-class data separated by the line 𝜃ᵀ𝜙(𝑥) = 0]

In linear regression, 𝜃ᵀ𝜙(𝑥) defined the function going through our data. Here, the decision boundary 𝜃ᵀ𝜙(𝑥) = 0 is the function separating our classes.
Cost function
• So:
  • 𝑝(𝑦 = 1 | 𝑥; 𝜃) = 𝜎(𝜃ᵀ𝜙(𝑥)) = ℎ𝜃(𝑥)
  • 𝑝(𝑦 = 0 | 𝑥; 𝜃) = 1 − ℎ𝜃(𝑥)
• Write this more compactly as:
  • 𝑝(𝑦 | 𝑥; 𝜃) = ℎ𝜃(𝑥)^𝑦 (1 − ℎ𝜃(𝑥))^(1−𝑦)
  • (Why does this work? Check what happens when 𝑦 = 0, and when 𝑦 = 1.)

• Likelihood of m data points:
  • 𝐿(𝜃) = ∏ᵢ₌₁ᵐ 𝑝(𝑦⁽ⁱ⁾ | 𝑥⁽ⁱ⁾; 𝜃) = ∏ᵢ₌₁ᵐ ℎ𝜃(𝑥⁽ⁱ⁾)^(𝑦⁽ⁱ⁾) (1 − ℎ𝜃(𝑥⁽ⁱ⁾))^(1−𝑦⁽ⁱ⁾)
Cost function
• Likelihood of m data points:
  • 𝐿(𝜃) = ∏ᵢ₌₁ᵐ ℎ𝜃(𝑥⁽ⁱ⁾)^(𝑦⁽ⁱ⁾) (1 − ℎ𝜃(𝑥⁽ⁱ⁾))^(1−𝑦⁽ⁱ⁾)

• Take the log of the likelihood:
  • 𝑙(𝜃) = log 𝐿(𝜃) = ∑ᵢ₌₁ᵐ 𝑦⁽ⁱ⁾ log(ℎ𝜃(𝑥⁽ⁱ⁾)) + (1 − 𝑦⁽ⁱ⁾) log(1 − ℎ𝜃(𝑥⁽ⁱ⁾))

• We need to maximise the log likelihood
• Equivalent to minimising 𝐸(𝜃) = −𝑙(𝜃)
• There is no closed-form solution, so we optimise numerically (gradient descent)
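A minimal sketch of this cost (the negative log likelihood) in Python with numpy; the names Phi, y and theta are illustrative, not from the course code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def negative_log_likelihood(theta, Phi, y):
        # Phi: (m, d+1) matrix whose rows are the feature vectors phi(x^(i))
        # y:   (m,) vector of 0/1 labels
        h = sigmoid(Phi @ theta)              # h_theta(x^(i)) for every point
        h = np.clip(h, 1e-12, 1 - 1e-12)      # avoid log(0) for very confident predictions
        return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))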


Regularisation
• Just as in linear regression, regularisation is useful here
• Penalise the weights for growing too large
• Note: the higher the weights, the "steeper" the S, so this stops the model becoming over-confident

• min𝜃 𝐸(𝜃), where
  • 𝐸(𝜃) = −∑ᵢ₌₁ᵐ [𝑦⁽ⁱ⁾ log(ℎ𝜃(𝑥⁽ⁱ⁾)) + (1 − 𝑦⁽ⁱ⁾) log(1 − ℎ𝜃(𝑥⁽ⁱ⁾))] + 𝜆 ∑ⱼ₌₁ᵈ 𝜃ⱼ²
  • 𝜆 = strength of the regularisation
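Adding the penalty term to the earlier sketch (illustrative Python; the bias 𝜃₀ is left out of the penalty, matching the update rule on the later slides):

    import numpy as np

    def regularised_cost(theta, Phi, y, lam):
        h = 1.0 / (1.0 + np.exp(-(Phi @ theta)))
        h = np.clip(h, 1e-12, 1 - 1e-12)
        nll = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
        return nll + lam * np.sum(theta[1:] ** 2)   # L2 penalty on theta_1..theta_d only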
Regularisation
• Note: the higher the weights, the "steeper" the S, so regularisation stops the model becoming over-confident

[Figure: logistic curves 𝑦 = 1 / (1 + 𝑒^(−𝑘𝑥)) for increasing 𝑘; larger weights 𝑘 give a steeper S]
Gradient descent (again)
• Initialise 𝜃
• Repeat until convergence:
  • 𝜃ⱼ ← 𝜃ⱼ − 𝛼 ∂𝐸(𝜃)/∂𝜃ⱼ
  • Simultaneous update for 𝑗 = 0, …, 𝑑

• 0 < 𝛼 ≤ 1 is the learning rate, usually set quite small
• Take a step of size 𝛼 in the "downhill" direction (the negative gradient)
GD with regularisation
• Initialise 𝜃
• Repeat until convergence:
  • 𝜃₀ ← 𝜃₀ − 𝛼 (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)    (no regularisation on 𝜃₀)
  • 𝜃ⱼ ← 𝜃ⱼ − 𝛼 [(ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥ⱼ⁽ⁱ⁾ + 𝜆𝜃ⱼ]
  • Simultaneous update for 𝑗 = 0, …, 𝑑

• This update is identical to linear regression!
• But the model is completely different:
  • ℎ𝜃(𝑥) = 1 / (1 + 𝑒^(−𝜃ᵀ𝑥))
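A sketch of these updates in Python with numpy, written in batch form over all m points rather than the per-point form above (the names and the averaging by m are assumptions, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(Phi, y, alpha=0.1, lam=0.01, n_iters=1000):
        # Phi: (m, d+1) feature matrix whose first column is all ones (so theta[0] is the bias)
        m, n = Phi.shape
        theta = np.zeros(n)
        for _ in range(n_iters):
            err = sigmoid(Phi @ theta) - y     # (h_theta(x^(i)) - y^(i)) for every point
            grad = Phi.T @ err / m             # average gradient of the negative log likelihood
            grad[1:] += lam * theta[1:]        # regularise every weight except theta_0
            theta = theta - alpha * grad       # simultaneous update of all theta_j
        return theta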
The effect of 𝛼
Example
• Generate two random classes of data, from Gaussians
centered at (1, -1) and (-1, 1)
Example
• Weights randomly initialised: 𝜃 = (0.3, −0.01, −0.3)
Example
• Cycle through each data point i:
  • Compute:
    𝛿𝜃₀ = 𝑦⁽ⁱ⁾ − ℎ𝜃(𝑥⁽ⁱ⁾)
    𝛿𝜃₁ = (𝑦⁽ⁱ⁾ − ℎ𝜃(𝑥⁽ⁱ⁾)) 𝑥₁⁽ⁱ⁾
    𝛿𝜃₂ = (𝑦⁽ⁱ⁾ − ℎ𝜃(𝑥⁽ⁱ⁾)) 𝑥₂⁽ⁱ⁾
  • Update:
    𝜃₀ ← 𝜃₀ + 𝛼𝛿𝜃₀
    𝜃₁ ← 𝜃₁ + 𝛼𝛿𝜃₁
    𝜃₂ ← 𝜃₂ + 𝛼𝛿𝜃₂
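A runnable sketch of the whole example in Python with numpy; the Gaussian spread, learning rate, number of passes and random seed are assumptions, since the slides do not give them:

    import numpy as np

    rng = np.random.default_rng(0)

    # Two random classes, drawn from Gaussians centred at (1, -1) and (-1, 1)
    X0 = rng.normal(loc=[1.0, -1.0], scale=0.7, size=(50, 2))    # class 0
    X1 = rng.normal(loc=[-1.0, 1.0], scale=0.7, size=(50, 2))    # class 1
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(50), np.ones(50)])

    theta = np.array([0.3, -0.01, -0.3])   # (theta_0, theta_1, theta_2), as on the slide
    alpha = 0.1                            # assumed learning rate

    def h(theta, x1, x2):
        return 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * x1 + theta[2] * x2)))

    for _ in range(100):                   # fixed number of passes instead of a convergence test
        for (x1, x2), yi in zip(X, y):     # cycle through each data point i
            err = yi - h(theta, x1, x2)
            theta += alpha * np.array([err, err * x1, err * x2])

    print(theta)                           # learned weights; h(theta, x1, x2) gives p(y = 1 | x)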
Example
[Figures: the decision boundary after successive gradient-descent updates]
Example
• Run until convergence (threshold on the size of change of 𝜃)

[Figure: final decision boundary, with probabilities of class 1 shown for selected points: 0.9999, 0.4386, 0.7759, 0.0007]
Digression: the perceptron
• The logistic function gives a probabilistic output
• What if we wanted to instead force the output to be in {0, 1}?
• Instead of the logistic function, what about a step function?
  • 𝑔(𝑧) = 1 if 𝑧 ≥ 0, and 𝑔(𝑧) = 0 if 𝑧 < 0
• Use this as before:
  • 𝑝(𝑦 = 1 | 𝑥) = 𝑔(𝜃ᵀ𝜙(𝑥)) = ℎ𝜃(𝑥)

• Perceptron learning rule:
  • 𝜃ⱼ ← 𝜃ⱼ + 𝛼 (𝑦⁽ⁱ⁾ − ℎ𝜃(𝑥⁽ⁱ⁾)) 𝑥ⱼ⁽ⁱ⁾
• Exactly as before (just with a different function)!
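A minimal sketch of one pass of the perceptron rule in Python (illustrative names):

    import numpy as np

    def step(z):
        # Hard 0/1 threshold in place of the sigmoid
        return 1.0 if z >= 0 else 0.0

    def perceptron_epoch(theta, Phi, y, alpha=0.1):
        # Phi: (m, d+1) features with a leading column of ones; y: 0/1 labels
        for phi_i, y_i in zip(Phi, y):
            pred = step(phi_i @ theta)
            theta = theta + alpha * (y_i - pred) * phi_i   # same form as the logistic update
        return theta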
The perceptron
• Historical model
• Invented by Frank Rosenblatt (1957)
• Thought to (crudely) model neurons in the brain
• Originally a machine!

• Very controversial:
  • It was basically claimed that the perceptron was expected to "be able to walk, talk, see, write, reproduce itself and be conscious of its existence"
Linear separability and XOR
• “Perceptrons” by Minsky and Papert (1969)
• Limitation of a perceptron: it cannot implement functions such as the XOR function (see the example below)
• Led to decreased research in neural networks, and
increased research in symbolic AI
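For example, XOR on two binary inputs maps (0,0) → 0, (0,1) → 1, (1,0) → 1 and (1,1) → 0; no single straight line in the (𝑥₁, 𝑥₂) plane puts the two 1-outputs on one side and the two 0-outputs on the other.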
Basis functions (again)
• Use basis functions (again) to get around data that is not linearly separable
• The data still needs to be separable in some (transformed) space

[Figure: non-separable data becoming separable after adding polynomial basis functions]
Basis functions (again)
• Two Gaussian basis functions: centered at (-1, -1) and (0, 0)
• Data is separable under this transformation
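A sketch of such a Gaussian (RBF) feature transform in Python (the bandwidth s is an assumption; the slides do not give one):

    import numpy as np

    def gaussian_basis(X, centres, s=1.0):
        # Map each point x to (1, exp(-||x - c||^2 / (2 s^2)) for each centre c)
        feats = [np.ones(len(X))]                                   # bias feature
        for c in centres:
            feats.append(np.exp(-np.sum((X - c) ** 2, axis=1) / (2 * s ** 2)))
        return np.column_stack(feats)

    centres = np.array([[-1.0, -1.0], [0.0, 0.0]])   # the two centres from the slide
    # Phi = gaussian_basis(X, centres); logistic regression is then run on Phi instead of X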
Basis functions (again)

[Figure: decision boundaries from linear logistic regression, polynomial basis functions, and Gaussian basis functions (RBFs)]
Multiclass classification
• Instead of classifying between two classes, we may have more than two classes
Multiclass logistic regression
• For two classes:
  • 𝑝(𝑦 = 1 | 𝑥; 𝜃) = ℎ𝜃(𝑥) = 1 / (1 + exp(−𝜃ᵀ𝑥)) = exp(𝜃ᵀ𝑥) / (1 + exp(𝜃ᵀ𝑥))

• Given C classes:
  • 𝑝(𝑦 = 𝑐ₖ | 𝑥; 𝜃) = exp(𝜃ₖᵀ𝑥) / ∑ⱼ₌₁ᶜ exp(𝜃ⱼᵀ𝑥)

• This is the softmax function
• Note that 0 ≤ 𝑝(𝑐ₖ | 𝑥; 𝜃) ≤ 1, and ∑ₖ₌₁ᶜ 𝑝(𝑐ₖ | 𝑥; 𝜃) = 1
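A small sketch of the softmax in Python with numpy; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something from the slide:

    import numpy as np

    def softmax(scores):
        # scores: vector containing theta_k^T x for each of the C classes
        scores = scores - np.max(scores)    # stabilise before exponentiating
        e = np.exp(scores)
        return e / np.sum(e)

    p = softmax(np.array([2.0, 1.0, -1.0]))
    print(p, p.sum())   # every entry lies in [0, 1] and the entries sum to 1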


Multiclass classification
• Split into one-vs-rest classifiers, one for each of the C classes
• Use gradient descent: update all parameters for all models simultaneously
• Pick the most probable class
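A sketch of the one-vs-rest prediction step in Python, assuming a matrix Theta whose k-th row holds the parameters of the classifier for class k (an illustrative layout, not specified on the slide):

    import numpy as np

    def predict_one_vs_rest(Theta, phi_x):
        # Theta: (C, d+1) stacked parameter vectors; phi_x: (d+1,) features of one point
        scores = 1.0 / (1.0 + np.exp(-(Theta @ phi_x)))   # p(y = k | x) under each binary model
        return int(np.argmax(scores))                     # pick the most probable class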
Recap
• Discriminative vs generative
• Model (logistic function)
• Decision boundaries
• Cost function
• Regularisation
• Gradient descent
• The perceptron
• Basis functions
• Multiclass classification
