23-LogisticRegression
Chris Piech
CS109, Stanford University
Classification
Classification Task
Heart Ancestry
Netflix
Training
Diagram: Real World Problem → Data → Learning Algorithm → g𝜃(X)
Testing
Diagram: Data → g𝜃(X) → Classification → Accuracy
Healthy Heart Classifier
Training data: one row per heart, with binary ROI features and an output label.
          ROI 1   ROI 2   ...   ROI m   Output
Heart 1     0       1             1       0
Heart 2     1       1             1       0
…
Heart n     0       0             0       1
A classifier g𝜃(X) is learned from this data.
Healthy Heart Classifier
A new heart with ROI features 1, 0, 1, 1 arrives; its Output is predicted by g𝜃(X).
Naïve Bayes Classification
Given a new input x = [0, 1, 1, 0], the classifier g𝜃(x) outputs the most likely label:
\hat{y} = \arg\max_{y \in \{0,1\}} P(y \mid x)
For example, if P(y = 0 \mid x) = 0.62 and P(y = 1 \mid x) = 0.38, then \hat{y} = 0.
Brute Force Bayes
Simply choose the class label that is most likely given the data:
\hat{y} = \arg\max_{y \in \{0,1\}} P(y \mid x)
        = \arg\max_{y \in \{0,1\}} \frac{P(x \mid y)\, P(y)}{P(x)}
        = \arg\max_{y \in \{0,1\}} P(x \mid y)\, P(y)
Naïve Bayes
Simply choose the class label that is most likely given the data, and make the Naïve Bayes assumption:
\hat{y} = g_\theta(x)
        = \arg\max_{y \in \{0,1\}} P(y \mid x)
        = \arg\max_{y \in \{0,1\}} \frac{P(x \mid y)\, P(y)}{P(x)}
        = \arg\max_{y \in \{0,1\}} P(x \mid y)\, P(y)
        = \arg\max_{y \in \{0,1\}} \prod_i P(x_i \mid y)\, P(y)      (by the Naïve Bayes assumption)
        = \arg\max_{y \in \{0,1\}} \log P(y) + \sum_i \log P(x_i \mid y)      (argmax of the log)
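As a rough illustration of this final argmax-of-logs rule, here is a minimal Python sketch (not from the lecture); the prior and per-feature conditional estimates below are made-up values for a four-feature example:

```python
import math

def naive_bayes_predict(x, prior, cond):
    """Return y_hat = argmax_y [log P(y) + sum_i log P(x_i | y)].

    prior[y]   : estimated P(Y = y) for y in {0, 1}
    cond[y][i] : estimated P(X_i = 1 | Y = y) for each binary feature i
    """
    best_y, best_score = None, -math.inf
    for y in [0, 1]:
        score = math.log(prior[y])
        for i, x_i in enumerate(x):
            p_one = cond[y][i]                  # P(X_i = 1 | Y = y)
            p = p_one if x_i == 1 else 1 - p_one
            score += math.log(p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# Hypothetical estimates for the 4-feature example x = [0, 1, 1, 0]
prior = {0: 0.6, 1: 0.4}
cond = {0: [0.2, 0.7, 0.6, 0.1], 1: [0.5, 0.4, 0.9, 0.3]}
print(naive_bayes_predict([0, 1, 1, 0], prior, cond))   # -> 0 with these made-up numbers
```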
Computing Probabilities from Data
• Various probabilities you will need to compute for the Naive Bayesian classifier (using MLE here):
\hat{P}(Y = 0) = \frac{\#\text{ instances in class } 0}{\text{total } \#\text{ instances}}
\hat{P}(X_i = 0 \mid Y = 0) = \frac{\hat{P}(X_i = 0, Y = 0)}{\hat{P}(Y = 0)}
\hat{P}(X_i = 0 \mid Y = 1) = \frac{\hat{P}(X_i = 0, Y = 1)}{\hat{P}(Y = 1)}
\hat{P}(X_i = 1 \mid Y = 0) = 1 - \hat{P}(X_i = 0 \mid Y = 0)
Training Naïve Bayes is estimating parameters for a multinomial.
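A minimal sketch of these MLE estimates (not from the slides), assuming NumPy; the tiny binary matrix X and labels y below are hypothetical:

```python
import numpy as np

# Hypothetical training data: rows are hearts, columns are binary ROI features.
X = np.array([[0, 1, 1],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([0, 0, 1])

n = len(y)
p_y0 = np.sum(y == 0) / n          # P_hat(Y = 0) = (# in class 0) / (total #)
p_y1 = 1 - p_y0                    # P_hat(Y = 1)

# P_hat(X_i = 0 | Y = 0) = P_hat(X_i = 0, Y = 0) / P_hat(Y = 0),
# computed here as the fraction of Y = 0 rows that have X_i = 0.
p_xi0_y0 = np.mean(X[y == 0] == 0, axis=0)

# P_hat(X_i = 1 | Y = 0) = 1 - P_hat(X_i = 0 | Y = 0)
p_xi1_y0 = 1 - p_xi0_y0

print(p_y0, p_y1, p_xi0_y0, p_xi1_y0)
```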
Let’s get Bayesian on your spam:
• This is spam:
Content analysis details: (49.5 hits, 7.0 required)
0.9 RCVD_IN_PBL RBL: Received via a relay in Spamhaus PBL
[93.40.189.29 listed in zen.spamhaus.org]
1.5 URIBL_WS_SURBL Contains an URL listed in the WS SURBL blocklist
[URIs: recragas.cn]
5.0 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist
[URIs: recragas.cn]
5.0 URIBL_OB_SURBL Contains an URL listed in the OB SURBL blocklist
[URIs: recragas.cn]
5.0 URIBL_SC_SURBL Contains an URL listed in the SC SURBL blocklist
[URIs: recragas.cn]
2.0 URIBL_BLACK Contains an URL listed in the URIBL blacklist
[URIs: recragas.cn]
8.0 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
[score: 1.0000]
Parameter Estimation
Knowledge Dependency
Logistic Regression
Key Notation
\sigma(z) = \frac{1}{1 + e^{-z}}      (sigmoid function)
\theta^T x = \sum_{i=1}^{n} \theta_i x_i = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n      (weighted sum, aka dot product)
\sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}      (sigmoid function of weighted sum)
Chapter 1: Big Picture
From Naïve Bayes to Logistic Regression
Diagram: an input x = [1, 1, 1, 0] and parameters 𝜃 = [-2, 4, 1, 2] are mapped to P(Y = 1|x) = 0.81.
Logistic Regression Assumption
• Could we model P(Y | X) directly?
  § Welcome our friend: logistic regression!
Diagram: the same x and 𝜃 are first combined into a weighted sum z = 1.1, which is then mapped to P(Y = 1|x).
Logistic Regression Assumption
• Model the conditional likelihood P(Y | X) directly
  § Model this probability with the logistic function:
P(Y = 1 \mid X) = \sigma(z) \quad \text{where} \quad z = \theta_0 + \sum_{i=1}^{m} \theta_i x_i
\sigma(z) = \frac{1}{1 + e^{-z}}
Note: inflection point at z = 0, where \sigma(0) = 0.5.
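A small sketch of this assumption in code (not from the lecture), assuming NumPy; the parameter vector theta and input x below are made-up values, with a constant 1 prepended to x so that theta[0] plays the role of the intercept θ0:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob(theta, x):
    """P(Y = 1 | x) = sigmoid(theta_0 + sum_i theta_i * x_i).

    A constant 1 is prepended to x so theta[0] acts as the intercept.
    """
    x = np.concatenate(([1.0], x))
    return sigmoid(np.dot(theta, x))

# Hypothetical parameters and input (values for illustration only)
theta = np.array([0.5, -2.0, 4.0, 1.0, 2.0])
x = np.array([1, 1, 1, 0])
print(predict_prob(theta, x))   # a probability strictly between 0 and 1
```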
The Sigmoid Function
Plot: the sigmoid function \sigma(z) versus z.
Logistic Regression Example
Plot: an example plotted against z.
What is in a Name
Logistic Regression: awesome classifier, terrible name.
The logistic regression LL function is convex.
Plot: LL(𝜃) as a surface over 𝜃1 and 𝜃2.
Gradient ascent is your bread and butter algorithm for optimization (e.g. argmax).
Logistic Regression Training
Initialize: \theta_j = 0 for all 0 \le j \le m
Calculate the gradient for every \theta_j by summing over the training examples:
gradient[j] += x_j \left( y - \frac{1}{1 + e^{-\theta^T x}} \right)
then take a gradient ascent step in each \theta_j.
(Using a constant feature x_0 = 1 allows for \theta_0 to be an intercept.)
§ Classify an instance as:
\hat{y} = \begin{cases} 1 & p > 0.5 \\ 0 & \text{otherwise} \end{cases}
where p = \sigma(\theta^T x).
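A minimal batch gradient ascent sketch of this training loop (an illustration, not the official starter code), assuming NumPy; the learning rate eta and the number of steps are arbitrary choices, and X is assumed to already contain a leading column of 1s for the intercept:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.01, num_steps=1000):
    """Gradient ascent on the log likelihood (batch version).

    X : n x (m+1) matrix whose first column is all 1s (intercept feature).
    y : length-n vector of 0/1 labels.
    """
    n, m_plus_1 = X.shape
    theta = np.zeros(m_plus_1)                # initialize theta_j = 0
    for _ in range(num_steps):
        gradient = np.zeros(m_plus_1)
        for i in range(n):
            p = sigmoid(np.dot(theta, X[i]))
            gradient += X[i] * (y[i] - p)     # x_j * (y - sigma(theta^T x))
        theta += eta * gradient               # ascend the log likelihood
    return theta

# Hypothetical usage on a tiny dataset (first column is the intercept)
X = np.array([[1., 0., 1.], [1., 1., 1.], [1., 1., 0.], [1., 0., 0.]])
y = np.array([0, 1, 1, 0])
print(train_logistic_regression(X, y))
```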
Recall the Bernoulli distribution:
P(Y = y) = p^y (1 - p)^{1 - y}
For example, with p = 0.2:
P(Y = y) = 0.2^y \, (0.8)^{1 - y}
Plot: bar chart with P(Y = 1) = p and P(Y = 0) = 1 - p.
Log Probability of Data
P(Y = 1 \mid X = x) = \sigma(\theta^T x)
P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)
P(Y = y \mid X = x) = \sigma(\theta^T x)^y \cdot \left[ 1 - \sigma(\theta^T x) \right]^{(1 - y)}

L(\theta) = \prod_{i=1}^{n} P(Y = y^{(i)} \mid X = x^{(i)})
          = \prod_{i=1}^{n} \sigma(\theta^T x^{(i)})^{y^{(i)}} \left[ 1 - \sigma(\theta^T x^{(i)}) \right]^{(1 - y^{(i)})}

LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log\left[ 1 - \sigma(\theta^T x^{(i)}) \right]
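For reference, a short sketch of computing LL(\theta) over a dataset with NumPy; the function name log_likelihood and the toy values are mine, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """LL(theta) = sum_i [ y_i log sigma(theta^T x_i)
                         + (1 - y_i) log(1 - sigma(theta^T x_i)) ]."""
    p = sigmoid(X.dot(theta))                 # P(Y = 1 | x_i) for every row
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data (first column is the intercept feature)
X = np.array([[1., 0., 1.], [1., 1., 1.], [1., 1., 0.]])
y = np.array([0., 1., 1.])
print(log_likelihood(np.zeros(3), X, y))      # = 3 * log(0.5) when theta = 0
```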
How did we get that gradient?

Sigmoid has a Beautiful Slope
\frac{\partial}{\partial \theta_j} \sigma(\theta^T x)
  = \frac{\partial \sigma(z)}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}      (chain rule, with z = \theta^T x)
  = \sigma(\theta^T x)\left[1 - \sigma(\theta^T x)\right] x_j

\frac{\partial LL(\theta)}{\partial \theta_j}
  = \frac{\partial}{\partial \theta_j}\, y \log \sigma(\theta^T x) + \frac{\partial}{\partial \theta_j}\, (1 - y) \log\left[1 - \sigma(\theta^T x)\right]
  = \left[ \frac{y}{\sigma(\theta^T x)} - \frac{1 - y}{1 - \sigma(\theta^T x)} \right] \frac{\partial}{\partial \theta_j} \sigma(\theta^T x)
  = \left[ \frac{y}{\sigma(\theta^T x)} - \frac{1 - y}{1 - \sigma(\theta^T x)} \right] \sigma(\theta^T x)\left[1 - \sigma(\theta^T x)\right] x_j
  = \frac{y - \sigma(\theta^T x)}{\sigma(\theta^T x)\left[1 - \sigma(\theta^T x)\right]}\, \sigma(\theta^T x)\left[1 - \sigma(\theta^T x)\right] x_j
  = \left[ y - \sigma(\theta^T x) \right] x_j

For many data points:
\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left[ y^{(i)} - \sigma(\theta^T x^{(i)}) \right] x_j^{(i)}
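One way to sanity-check the derived gradient is to compare it against a finite-difference approximation of LL(\theta). This sketch (with made-up toy data) is an illustration, not part of the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    p = sigmoid(X.dot(theta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(theta, X, y):
    # dLL/dtheta_j = sum_i (y_i - sigma(theta^T x_i)) * x_ij
    return X.T.dot(y - sigmoid(X.dot(theta)))

# Hypothetical toy data (intercept column of 1s included)
X = np.array([[1., 0., 1.], [1., 1., 1.], [1., 1., 0.]])
y = np.array([0., 1., 1.])
theta = np.array([0.1, -0.2, 0.3])

# Finite-difference check of one coordinate of the gradient
eps, j = 1e-6, 1
bump = np.zeros_like(theta)
bump[j] = eps
numeric = (log_likelihood(theta + bump, X, y) -
           log_likelihood(theta - bump, X, y)) / (2 * eps)
print(analytic_gradient(theta, X, y)[j], numeric)   # should nearly agree
```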
Logistic Regression
The decision boundary is where \theta^T x = 0, i.e.
\theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_m x_m = 0
Plot: data points and the linear decision boundary (X1 on the horizontal axis).
§ Not possible to draw a line that successfully separates
all the y = 1 points (blue) from the y = 0 points (red)
§ Despite this fact, logistic regression and Naive Bayes
still often work well in practice
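As a hypothetical illustration of the boundary equation above: for a model with two features, solving \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 for x_2 traces the line on which P(Y = 1|x) is exactly 0.5. The parameter values below are made up:

```python
import numpy as np

# Hypothetical 2-feature model: theta = [theta_0, theta_1, theta_2]
theta = np.array([-1.0, 2.0, 3.0])

# On the decision boundary theta_0 + theta_1*x1 + theta_2*x2 = 0,
# so x2 = -(theta_0 + theta_1*x1) / theta_2; P(Y = 1|x) = 0.5 there.
x1 = np.linspace(0, 1, 5)
x2 = -(theta[0] + theta[1] * x1) / theta[2]
print(list(zip(x1, x2)))
```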
Logistic Regression vs Naïve Bayes
• Compare Naive Bayes and Logistic Regression
§ Recall that Naive Bayes models P(X, Y) = P(X | Y) P(Y)
§ Logistic Regression directly models P(Y | X)
§ We call Naive Bayes a “generative model”
o Tries to model joint distribution of how data is “generated”
o I.e., could use P(X, Y) to generate new data points if we wanted
o But lots of effort to model something that may not be needed
• Neural network
Diagram: a network with inputs x1, x2, x3, x4.
Biological Basis for Neural Networks
• A neuron
Diagram: inputs x1, x2, x3, x4 with weights θ1, θ2, θ3, θ4 feeding into an output y.
• Your brain
Diagram: a brain, with inputs x1, x2, x3, x4.
Actually, it’s probably someone else’s brain
Next up: Deep Learning!