
Logistic Regression

Chris Piech
CS109, Stanford University
Classification

Classification Task
[Figure: example classification tasks — heart health, ancestry, Netflix.]
Training and Testing

[Figure: the machine learning pipeline — a real world problem is modeled as a formal model with parameters 𝜃. Training: training data plus a learning algorithm produce 𝜃. Testing: testing data are run through the classifier g𝜃(X) to measure classification accuracy.]
Healthy Heart Classifier

Training data (binary features ROI 1 … ROI m and a binary output):

            ROI 1   ROI 2   ...   ROI m   Output
  Heart 1     0       1     ...     1       0
  Heart 2     1       1     ...     1       0
  ...
  Heart n     0       0     ...     0       1

g𝜃(X) is the classifier learned from this data.

A new instance to classify:

            ROI 1   ROI 2   ...   ROI m   Output
  New Heart   1       0     ...     1       ?

g𝜃(X) predicts the output for the new heart.
Naïve Bayes Classification

Given an input feature vector, e.g. x = [0, 1, 1, 0], the classifier g_\theta(x) predicts the most likely label:

\hat{y} = \arg\max_{y \in \{0,1\}} P(y \mid x)

For example, if P(y = 0 \mid x) = 0.62 and P(y = 1 \mid x) = 0.38, the prediction is \hat{y} = 0.
Brute Force Bayes
Simply choose the class label that is most likely given the data:

\hat{y} = \arg\max_{y \in \{0,1\}} P(y \mid x)
        = \arg\max_{y \in \{0,1\}} \frac{P(x \mid y) P(y)}{P(x)}
        = \arg\max_{y \in \{0,1\}} P(x \mid y) P(y)

The hard part is the likelihood P(x_1, x_2, x_3, \ldots, x_{100} \mid y), which requires estimating a probability for every combination of feature values.


Naïve Bayes
Simply choose the class label that is most likely given the data, and make the Naïve Bayes assumption:

\hat{y} = g_\theta(x)
        = \arg\max_{y \in \{0,1\}} P(y \mid x)
        = \arg\max_{y \in \{0,1\}} \frac{P(x \mid y) P(y)}{P(x)}
        = \arg\max_{y \in \{0,1\}} P(x \mid y) P(y)
        = \arg\max_{y \in \{0,1\}} \prod_i P(x_i \mid y) \, P(y)              (by the Naïve Bayes assumption)
        = \arg\max_{y \in \{0,1\}} \log P(y) + \sum_i \log P(x_i \mid y)      (argmax of the log)
Computing Probabilities from Data
• Various probabilities you will need to compute for the Naive Bayesian classifier (using MLE here):

\hat{P}(Y = 0) = \frac{\# \text{ instances in class } 0}{\text{total } \# \text{ instances}}

\hat{P}(X_i = 0, Y = 0) = \frac{\# \text{ instances where } X_i = 0 \text{ and class} = 0}{\text{total } \# \text{ instances}}

\hat{P}(X_i = 0 \mid Y = 0) = \frac{\hat{P}(X_i = 0, Y = 0)}{\hat{P}(Y = 0)}
\qquad
\hat{P}(X_i = 0 \mid Y = 1) = \frac{\hat{P}(X_i = 0, Y = 1)}{\hat{P}(Y = 1)}

\hat{P}(X_i = 1 \mid Y = 0) = 1 - \hat{P}(X_i = 0 \mid Y = 0)
Training Naïve Bayes is estimating parameters for a multinomial.

Thus training is just counting.
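
Since training is just counting, here is a minimal sketch of those MLE counts in Python, assuming binary features stored in a NumPy array X of shape (n, m) and 0/1 labels y of shape (n,); the function and variable names are illustrative, not from the slides.

import numpy as np

def train_naive_bayes(X, y):
    """MLE estimates for a binary-feature Naive Bayes classifier.

    X: (n, m) array of 0/1 features; y: (n,) array of 0/1 labels.
    Returns P(Y = 1) and a (2, m) array of P(X_i = 1 | Y = c).
    """
    p_y1 = np.mean(y)                       # P(Y = 1): fraction of class-1 instances
    p_x_given_y = np.zeros((2, X.shape[1]))
    for c in (0, 1):
        X_c = X[y == c]                     # instances belonging to class c
        p_x_given_y[c] = X_c.mean(axis=0)   # P(X_i = 1 | Y = c), computed by counting
    return p_y1, p_x_given_y

In practice one would add Laplace smoothing so that no estimated probability is exactly 0 or 1.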



What is Bayes Doing in my Mail Server

• This is spam:
Let’s get Bayesian on your spam:
Content analysis details: (49.5 hits, 7.0 required)
0.9 RCVD_IN_PBL RBL: Received via a relay in Spamhaus PBL
[93.40.189.29 listed in zen.spamhaus.org]
1.5 URIBL_WS_SURBL Contains an URL listed in the WS SURBL blocklist
[URIs: recragas.cn]
5.0 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist
[URIs: recragas.cn]
5.0 URIBL_OB_SURBL Contains an URL listed in the OB SURBL blocklist
[URIs: recragas.cn]
5.0 URIBL_SC_SURBL Contains an URL listed in the SC SURBL blocklist
[URIs: recragas.cn]
2.0 URIBL_BLACK Contains an URL listed in the URIBL blacklist
[URIs: recragas.cn]
8.0 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
[score: 1.0000]

Who was crazy enough to think of that?


Spam, Spam… Go Away!
• The constant battle with spam
Email Classification

• Want to predict if an email is spam or not


§ Start with the input data
o Consider a lexicon of m words (Note: in English m ≈ 100,000)
o Define m indicator variables X = <X1, X2, …, Xm>
o Each variable Xi denotes if word i appeared in a document or not
o Note: m is huge, so make “Naive Bayes” assumption

§ Define output classes Y to be: {spam, non-spam}


§ Given training set of N previous emails
o For each email message, we have a training instance:
X = <X1, X2, …, Xm> noting for each word, if it appeared in email
o Each email message is also marked as spam or not (value of Y)
How Does This Do?
• After training, can test with another set of data
§ “Testing” set also has known values for Y, so we can
see how often we were right/wrong in predictions for Y
§ Spam data
o Email data set: 1789 emails (1578 spam, 211 non-spam)
o First, 1538 email messages (by time) used for training
o Next 251 messages used to test learned classifier
§ Criteria:
o Precision = # correctly predicted class Y / # predicted class Y
o Recall = # correctly predicted class Y / # real class Y messages
                           Spam                      Non-spam
                           Precision   Recall        Precision   Recall
  Words only               97.1%       94.3%         87.7%       93.4%
  Words + add'l features   100%        98.3%         96.2%       100%
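
To make the criteria above concrete, here is a small sketch of computing precision and recall for one class from parallel lists of true and predicted labels (a hypothetical helper, not part of the slides):

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

# Example: 2 of 3 predicted positives are correct, and 2 of 3 real positives are found.
print(precision_recall([1, 0, 1, 1], [1, 1, 1, 0]))   # (0.666..., 0.666...)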
Our Path

[Figure: knowledge dependency — Logistic Regression builds on Parameter Estimation.]
Key Notation

Sigmoid function:

\sigma(z) = \frac{1}{1 + e^{-z}}

Weighted sum (aka dot product):

\theta^T x = \sum_{i=1}^{n} \theta_i x_i = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

Sigmoid function of the weighted sum:

\sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
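
A minimal sketch of this notation in Python with NumPy (the names are illustrative):

import numpy as np

def sigmoid(z):
    """Sigmoid function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def weighted_sum(theta, x):
    """theta^T x = sum_i theta_i * x_i (a dot product)."""
    return np.dot(theta, x)

# Sigmoid of a weighted sum, e.g. theta = [1, -2, 0.5], x = [1, 1, 2]:
theta = np.array([1.0, -2.0, 0.5])
x = np.array([1.0, 1.0, 2.0])
print(sigmoid(weighted_sum(theta, x)))   # sigma(1 - 2 + 1) = sigma(0) = 0.5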
Chapter 1: Big Picture
From Naïve Bayes to Logistic Regression

• For classification we care about P(Y | X)

• Recall the Naive Bayes classifier
§ Predict \hat{Y} = \arg\max_y P(X, Y) = \arg\max_y P(X \mid Y) P(Y)
§ Use the assumption that P(X \mid Y) = P(X_1, X_2, \ldots, X_m \mid Y) = \prod_{i=1}^{m} P(X_i \mid Y)
§ That is a pretty big assumption…

• Could we model P(Y | X) directly?
§ Welcome our friend: logistic regression!
Logistic Regression Assumption
• Could we model P(Y | X) directly?
§ Welcome our friend: logistic regression!

Given an input x (e.g. [1, 1, 1, 0]) and parameters θ (e.g. [-2, 4, 1, 2]), logistic regression computes a weighted sum z (the slide shows z = 1.1) and outputs a probability P(Y = 1 | x) (the slide shows 0.81).
Logistic Regression Assumption
• Model the conditional likelihood P(Y | X) directly
§ Model this probability with the logistic function:

P(Y = 1 \mid X) = \sigma(z) \quad \text{where} \quad z = \theta_0 + \sum_{i=1}^{m} \theta_i x_i

§ For simplicity define x_0 = 1, so z = \theta^T x
§ Since P(Y = 0 | X) + P(Y = 1 | X) = 1:

P(Y = 1 \mid X = x) = \sigma(\theta^T x)
P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)

Recall the sigmoid function: \sigma(z) = \frac{1}{1 + e^{-z}}
The Sigmoid Function

f(z) = \frac{1}{1 + e^{-z}}

[Figure: the sigmoid curve plotted against z. We want to distinguish y = 1 (blue) points from y = 0 (red) points.]
Note: inflection point at z = 0, and f(0) = 0.5.
The Sigmoid Function
[Figure: the sigmoid plotted alongside the standard normal CDF.]

Logistic Regression Example
[Figure: an example logistic regression output plotted against z.]
What is in a Name

  Regression Algorithms      Classification Algorithms
  Linear Regression          Naïve Bayes
                             Logistic Regression

Logistic regression: awesome classifier, terrible name.

If Chris could rename it he would call it: Sigmoidal Classification


Math for Logistic Regression

1. Make the logistic regression assumption:

P(Y = 1 \mid X = x) = \sigma(\theta^T x)
P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)

2. Calculate the log likelihood for all the data:

LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log[1 - \sigma(\theta^T x^{(i)})]
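
A sketch of this log likelihood in code (assuming NumPy, a design matrix X with a leading column of 1s, and 0/1 labels y; the small epsilon guard against log(0) is my addition):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """LL(theta) = sum_i [ y_i * log sigma(theta^T x_i) + (1 - y_i) * log(1 - sigma(theta^T x_i)) ]."""
    p = sigmoid(X @ theta)        # P(Y = 1 | x) for every training example
    eps = 1e-12                   # keep the logs finite when p is 0 or 1
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))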
Gradient Ascent

The logistic regression LL function is concave (equivalently, the negative log likelihood is convex).

Walk uphill and you will find the maximum (if your step size is small enough).
Gradient Ascent Step

\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} [ y^{(i)} - \sigma(\theta^T x^{(i)}) ] x_j^{(i)}

Do this for all thetas:

\theta_j^{new} = \theta_j^{old} + \eta \cdot \frac{\partial LL(\theta^{old})}{\partial \theta_j^{old}}
              = \theta_j^{old} + \eta \cdot \sum_{i=1}^{n} [ y^{(i)} - \sigma(\theta^T x^{(i)}) ] x_j^{(i)}

[Figure: the surface LL(𝜃) plotted over 𝜃1 and 𝜃2.]

Gradient ascent is your bread and butter algorithm for optimization (e.g., argmax).



Math for Logistic Regression

1. Make the logistic regression assumption:

P(Y = 1 \mid X = x) = \sigma(\theta^T x)
P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)

2. Calculate the log likelihood for all the data:

LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log[1 - \sigma(\theta^T x^{(i)})]

3. Get the derivative of the log likelihood with respect to the thetas:

\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} [ y^{(i)} - \sigma(\theta^T x^{(i)}) ] x_j^{(i)}
Logistic Regression Training

Initialize: θj = 0 for all 0 ≤ j ≤ m

Repeat many times:

    gradient[j] = 0 for all 0 ≤ j ≤ m

    For each training example (x, y):
        For each parameter j:
            gradient[j] += xj * (y - 1 / (1 + e^(-θᵀx)))

    θj += η * gradient[j] for all 0 ≤ j ≤ m


Don’t forget: xj is the j-th input variable and x0 = 1. This allows θ0 to be an intercept.
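
Putting the pseudocode together, here is a runnable sketch of batch gradient ascent in Python with NumPy; the learning rate, iteration count, and names are illustrative choices, not values from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.01, num_iterations=1000):
    """Batch gradient ascent on the logistic regression log likelihood.

    X: (n, m) matrix of features; y: (n,) array of 0/1 labels.
    A column of 1s is prepended so theta[0] serves as the intercept.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1 for every example
    theta = np.zeros(X.shape[1])                   # initialize theta_j = 0
    for _ in range(num_iterations):
        predictions = sigmoid(X @ theta)           # sigma(theta^T x) for each example
        gradient = X.T @ (y - predictions)         # dLL/dtheta_j summed over the data
        theta += eta * gradient                    # walk uphill
    return theta

The step size η and the number of iterations need tuning; too large a step can overshoot the maximum.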



Classification with Logistic Regression

• Training: determine parameters 𝜃j (for all 0 ≤ j ≤ m)
§ After the parameters 𝜃j have been learned, test the classifier

• To test the classifier, for each new (test) instance X:
§ Compute: p = P(Y = 1 \mid X) = \frac{1}{1 + e^{-z}}, where z = \theta^T x = \sum_{j=0}^{m} \theta_j x_j
§ Classify the instance as: \hat{y} = 1 if p > 0.5, and \hat{y} = 0 otherwise
§ Note about the evaluation set-up: the parameters 𝜃j are not updated during the “testing” phase
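
A matching prediction sketch, assuming the train_logistic_regression helper above and the same column-of-ones convention:

import numpy as np

def predict(theta, X, threshold=0.5):
    """Label each row of X as 1 if P(Y = 1 | x) > threshold, else 0."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # same x_0 = 1 convention as training
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))         # p = P(Y = 1 | x)
    return (p > threshold).astype(int)

# Hypothetical usage, with held-out test data that never updates theta:
# theta = train_logistic_regression(X_train, y_train)
# y_hat = predict(theta, X_test)
# accuracy = np.mean(y_hat == y_test)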
Chapter 2: How Come?
Logistic Regression

1. Make the logistic regression assumption:

P(Y = 1 \mid X = x) = \sigma(\theta^T x)
P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)

2. Calculate the log probability of all the data:

LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log[1 - \sigma(\theta^T x^{(i)})]

3. Get the derivative of the log probability with respect to the thetas:

\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} [ y^{(i)} - \sigma(\theta^T x^{(i)}) ] x_j^{(i)}
How did we get that LL function?
Recall: PMF of Bernoulli
§ Y ~ Ber(p)
§ Probability mass function: P(Y = y) = p^y (1 - p)^{1 - y}
§ Example (p = 0.2): P(Y = y) = 0.2^y \cdot 0.8^{1 - y}

[Figure: bar charts of the Bernoulli PMF, with mass 1 - p at y = 0 and p at y = 1.]
Log Probability of Data

P(Y = 1 \mid X = x) = \sigma(\theta^T x)
P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)

Written as a single Bernoulli-style expression:

P(Y = y \mid X = x) = \sigma(\theta^T x)^y \cdot [1 - \sigma(\theta^T x)]^{(1 - y)}

L(\theta) = \prod_{i=1}^{n} P(Y = y^{(i)} \mid X = x^{(i)})
          = \prod_{i=1}^{n} \sigma(\theta^T x^{(i)})^{y^{(i)}} \cdot [1 - \sigma(\theta^T x^{(i)})]^{(1 - y^{(i)})}

LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log[1 - \sigma(\theta^T x^{(i)})]
How did we get that gradient?

Sigmoid has a Beautiful Slope

We want \frac{\partial}{\partial \theta_j} \sigma(\theta^T x).

True fact about sigmoid functions:

\frac{\partial}{\partial z} \sigma(z) = \sigma(z)[1 - \sigma(z)]

Chain rule:

\frac{\partial}{\partial \theta_j} \sigma(\theta^T x) = \frac{\partial \sigma(z)}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}

Plug and chug:

\frac{\partial}{\partial \theta_j} \sigma(\theta^T x) = \sigma(\theta^T x)[1 - \sigma(\theta^T x)] x_j

Sigmoid, you should be a ski hill
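
A quick numerical sanity check of the "true fact" above, comparing a central finite difference against sigma(z)(1 - sigma(z)) at a few points (the tolerance is an illustrative choice):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # finite-difference slope
    analytic = sigmoid(z) * (1 - sigmoid(z))                # claimed derivative
    assert abs(numeric - analytic) < 1e-8, (z, numeric, analytic)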


Gradient Update

LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log[1 - \sigma(\theta^T x^{(i)})]

For a single data point (x, y):

\frac{\partial LL(\theta)}{\partial \theta_j}
  = \frac{\partial}{\partial \theta_j} y \log \sigma(\theta^T x) + \frac{\partial}{\partial \theta_j} (1 - y) \log[1 - \sigma(\theta^T x)]
  = \left[ \frac{y}{\sigma(\theta^T x)} - \frac{1 - y}{1 - \sigma(\theta^T x)} \right] \frac{\partial}{\partial \theta_j} \sigma(\theta^T x)
  = \left[ \frac{y}{\sigma(\theta^T x)} - \frac{1 - y}{1 - \sigma(\theta^T x)} \right] \sigma(\theta^T x)[1 - \sigma(\theta^T x)] x_j
  = \left[ \frac{y - \sigma(\theta^T x)}{\sigma(\theta^T x)[1 - \sigma(\theta^T x)]} \right] \sigma(\theta^T x)[1 - \sigma(\theta^T x)] x_j
  = [ y - \sigma(\theta^T x) ] x_j

For many data points:

\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} [ y^{(i)} - \sigma(\theta^T x^{(i)}) ] x_j^{(i)}
Logistic Regression

1. Make the logistic regression assumption:

P(Y = 1 \mid X = x) = \sigma(\theta^T x)
P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)

2. Calculate the log probability of all the data:

LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log[1 - \sigma(\theta^T x^{(i)})]

3. Get the derivative of the log probability with respect to the thetas:

\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} [ y^{(i)} - \sigma(\theta^T x^{(i)}) ] x_j^{(i)}
Chapter 3: Philosophy
Discrimination Intuition
§ Logistic regression is trying to fit a line that separates data instances where y = 1 from those where y = 0:

\theta^T x = 0
\theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_m x_m = 0

§ We call such data (or the functions generating the data) “linearly separable”
§ Naïve Bayes is linear too, as there is no interaction between different features.
Some Data Not Linearly Separable
• Some data sets/functions are not separable
§ Consider the function: y = x1 XOR x2
§ Note: y = 1 iff exactly one of x1 or x2 equals 1

[Figure: the four XOR points on the (X1, X2) plane, with y = 1 at (0, 1) and (1, 0), and y = 0 at (0, 0) and (1, 1).]

§ It is not possible to draw a line that successfully separates all the y = 1 points (blue) from the y = 0 points (red)
§ Despite this fact, logistic regression and Naive Bayes still often work well in practice
Logistic Regression vs Naïve Bayes
• Compare Naive Bayes and Logistic Regression
§ Recall that Naive Bayes models P(X, Y) = P(X | Y) P(Y)
§ Logistic Regression directly models P(Y | X)
§ We call Naive Bayes a “generative model”
o Tries to model joint distribution of how data is “generated”
o I.e., could use P(X, Y) to generate new data points if we wanted
o But lots of effort to model something that may not be needed

§ We call Logistic Regression a “discriminative model”
o Just tries to model a way to discriminate y = 0 vs. y = 1 cases
o Cannot use the model to generate new data points (no P(X, Y))
o Note: Logistic Regression can be generalized to more than two output values for y (have multiple sets of parameters θj)
Choosing an Algorithm?
• Many trade-offs in choosing a learning algorithm
§ Continuous input variables
o Logistic Regression easily deals with continuous inputs
o Naive Bayes needs to use some parametric form for continuous inputs (e.g., Gaussian) or “discretize” continuous values into ranges (e.g., temperature in range: <50, 50-60, 60-70, >70)

§ Discrete input variables
o Naive Bayes naturally handles multi-valued discrete data by using a multinomial distribution for P(Xi | Y)
o Logistic Regression requires some representation of multi-valued discrete data (e.g., a one-hot vector; see the sketch below)
o Say Xi ∈ {A, B, C}. It is not necessarily a good idea to encode Xi as taking on input values 1, 2, or 3 corresponding to A, B, or C, since that imposes an artificial ordering.
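
A minimal sketch of the one-hot encoding mentioned above (the category list and helper name are hypothetical):

def one_hot(value, categories=("A", "B", "C")):
    """Encode a categorical value as a one-hot vector, e.g. "B" -> [0, 1, 0]."""
    return [1 if value == c else 0 for c in categories]

# Three binary inputs replace one three-valued input, so logistic regression
# does not impose an artificial ordering such as A < B < C.
print(one_hot("B"))   # [0, 1, 0]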
Logistic Regression and Neural Networks

• Consider logistic regression drawn as a network:
[Figure: inputs x1, x2, x3, x4 with weights θ1, θ2, θ3, θ4 feeding a single output node y. Logistic regression is the same as a one-node neural network.]

• Neural network
[Figure: the same inputs x1, x2, x3, x4 feeding a network with many nodes.]

Biological Basis for Neural Networks
• A neuron
[Figure: a single unit with inputs x1…x4, weights θ1…θ4, and output y, shown as an abstraction of a neuron.]
• Your brain
[Figure: a many-node network, shown as an abstraction of a brain.]
Actually, it’s probably someone else’s brain
Next up: Deep Learning!
