Logistic regression is a linear model used for classification tasks. Unlike Naïve Bayes, it does not assume conditional independence between features. Logistic regression models the probability that the class variable is true as the logistic function of a linear combination of the feature values. The weights of the linear combination are learned by maximum likelihood estimation, using the gradient of the log-likelihood with respect to the weights. Logistic regression extends to multi-class classification with one weight vector per class and a softmax over the resulting scores.


Logistic Regression

Some slides adapted from Dan Jurafsky and Brendan O’Connor


Naïve Bayes Recap
• Bag of words (order independent)

• Features are assumed independent given class

P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdots P(x_n \mid c)

Q: Is this really true?


The problem with assuming conditional independence
• Correlated features -> double counting evidence
– Parameters are estimated independently

• This can hurt classifier accuracy and calibration
Logistic Regression

• (Log) Linear Model – similar to Naïve Bayes

• Doesn’t assume features are independent

• Correlated features don’t “double count”


What are “Features”?
• A feature function, f
– Input: Document, D (a string)
– Output: Feature Vector, X

What are “Features”?

f(d) = \begin{pmatrix} \text{count(“boring”)} \\ \text{count(“not boring”)} \\ \text{length of document} \\ \text{author of document} \\ \vdots \end{pmatrix}

What are “Features”?
• Doesn’t have to be just “bag of words”
Feature Templates
• Typically “feature templates” are used to generate many features at once

• For each word w, generate features such as (see the sketch below):
– ${w}_count
– ${w}_lowercase
– ${w}_with_NOT_before_count
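Below is a minimal sketch (my own, not from the slides) of how such templates might be instantiated; the function name extract_features and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter

def extract_features(document):
    """Map a document (string) to a sparse feature vector: feature name -> count."""
    tokens = document.split()                       # naive whitespace tokenization
    features = Counter()
    for i, w in enumerate(tokens):
        # ${w}_count: how often the word appears
        features[f"{w.lower()}_count"] += 1
        # ${w}_lowercase: how often it appears fully lowercased
        if w.islower():
            features[f"{w}_lowercase"] += 1
        # ${w}_with_NOT_before_count: occurrences preceded by "not"
        if i > 0 and tokens[i - 1].lower() == "not":
            features[f"{w.lower()}_with_NOT_before_count"] += 1
    return features

print(extract_features("This movie was not boring"))
```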
Logistic Regression: Example
• Compute Features:

f(d_i) = x_i = \begin{pmatrix} \text{count(“nigerian”)} \\ \text{count(“prince”)} \\ \text{count(“nigerian prince”)} \end{pmatrix}

• Assume we are given some weights:

w = \begin{pmatrix} -1.0 \\ -1.0 \\ 4.0 \end{pmatrix}
Logistic Regression: Example
• Compute Features
• We are given some weights
• Compute the dot product:

z = \sum_{i=0}^{|X|} w_i x_i
Logistic Regression: Example
• Compute the dot product:
z = \sum_{i=0}^{|X|} w_i x_i

• Compute the logistic function:

P(\text{spam} \mid x) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}
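As a quick numeric sketch of these two steps (my own illustration, assuming each of the three example features fires exactly once):

```python
import math

# Assumed feature counts for the example above: "nigerian", "prince", "nigerian prince"
x = [1.0, 1.0, 1.0]
w = [-1.0, -1.0, 4.0]

# Dot product: z = sum_i w_i * x_i
z = sum(wi * xi for wi, xi in zip(w, x))        # 2.0

# Logistic (sigmoid) function: P(spam | x) = 1 / (1 + e^{-z})
p_spam = 1.0 / (1.0 + math.exp(-z))             # ~0.88

print(z, p_spam)
```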
The Logistic function

P(\text{spam} \mid x) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}
The Dot Product

z = \sum_{i=0}^{|X|} w_i x_i

• Intuition: weighted sum of features


• All Linear models have this form
Naïve Bayes as a log-linear model

• Q: what are the features?

• Q: what are the weights?


Naïve Bayes as a Log-Linear Model

P(\text{spam} \mid D) \propto P(\text{spam}) \prod_{w \in D} P(w \mid \text{spam})

P(\text{spam} \mid D) \propto P(\text{spam}) \prod_{w \in \text{Vocab}} P(w \mid \text{spam})^{x_w}

\log P(\text{spam} \mid D) \propto \log P(\text{spam}) + \sum_{w \in \text{Vocab}} x_w \cdot \log P(w \mid \text{spam})
Naïve Bayes as a Log-Linear Model

\log P(\text{spam} \mid D) \propto \log P(\text{spam}) + \sum_{w \in \text{Vocab}} x_w \cdot \log P(w \mid \text{spam})

In both Naïve Bayes and Logistic Regression we compute the dot product!
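To make that concrete, here is a small sketch (my own, with made-up probabilities) showing that the Naïve Bayes log-score is literally a dot product between word counts x_w and weights log P(w | spam), plus a bias log P(spam):

```python
import math

# Hypothetical Naïve Bayes parameters (made-up numbers, purely illustrative)
log_prior_spam = math.log(0.4)
log_p_word_given_spam = {"nigerian": math.log(0.02),
                         "prince":   math.log(0.01),
                         "hello":    math.log(0.05)}

# Word counts x_w for one document
counts = {"nigerian": 2, "prince": 1, "hello": 0}

# log P(spam | D) ∝ log P(spam) + Σ_w x_w · log P(w | spam)
score = log_prior_spam + sum(counts[w] * log_p for w, log_p in log_p_word_given_spam.items())
print(score)
```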
NB vs. LR
• Both compute the dot product

• NB: sum of log probabilities

• LR: logistic function


NB vs. LR: Parameter Learning
• Naïve Bayes:
– Learn conditional probabilities independently by counting

• Logistic Regression:
– Learn weights jointly
LR: Learning Weights

• Given: a set of feature vectors and labels

• Goal: learn the weights


Learning Weights

Feature vectors (one row per document) and labels:

\begin{pmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\ \vdots & & & \ddots & \vdots \\ x_{d1} & x_{d2} & x_{d3} & \cdots & x_{dn} \end{pmatrix} \qquad \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{pmatrix}

Q: what parameters should we choose?
• What is the right value for the weights?

• Maximum Likelihood Principle:


– Pick the parameters that maximize the probability of the data
Maximum Likelihood Estimation
w_{MLE} = \arg\max_w \log P(y_1, \ldots, y_d \mid x_1, \ldots, x_d; w)

= \arg\max_w \sum_i \log P(y_i \mid x_i; w)

= \arg\max_w \sum_i \log \begin{cases} p_i, & \text{if } y_i = 1 \\ 1 - p_i, & \text{if } y_i = 0 \end{cases}

= \arg\max_w \sum_i \log \; p_i^{\,I(y_i = 1)} (1 - p_i)^{\,I(y_i = 0)}
Maximum Likelihood Estimation
= \arg\max_w \sum_i \log \; p_i^{\,I(y_i = 1)} (1 - p_i)^{\,I(y_i = 0)}

= \arg\max_w \sum_i y_i \log p_i + (1 - y_i) \log(1 - p_i)
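A small sketch (my own, with made-up numbers) of this final objective, the Bernoulli log-likelihood of the labels under the model's probabilities p_i:

```python
import math

def log_likelihood(ps, ys):
    """Sum_i y_i * log(p_i) + (1 - y_i) * log(1 - p_i), for labels y_i in {0, 1}."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(ps, ys))

# Two documents: predicted P(spam) = 0.9 and 0.2, gold labels 1 and 0
print(log_likelihood([0.9, 0.2], [1, 0]))   # ≈ -0.33
```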
Maximum Likelihood Estimation
• Unfortunately there is no closed-form solution
– (as there was with Naïve Bayes)
• Solution:
– Iteratively climb the log-likelihood surface using the derivative for each weight
• Luckily, the derivatives turn out to be nice
Gradient ascent
Loop while not converged:
  For all features j, compute and add derivatives:

w_j^{new} = w_j^{old} + \eta \, \frac{\partial L}{\partial w_j}(w)

L(w): training-set log-likelihood
\left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n} \right): gradient vector
Gradient ascent

[Figure: gradient ascent steps climbing the log-likelihood surface, shown over two weights w1 and w2]
LR Gradient

\frac{\partial L}{\partial w_j} = \sum_i (y_i - p_i) \, x_{ij}
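Putting the update rule and this gradient together gives a compact batch gradient-ascent trainer. This is a sketch of my own (learning rate, iteration count, and toy data are made-up assumptions), not code from the slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, eta=0.1, iterations=1000):
    """Batch gradient ascent on the log-likelihood.

    X: list of feature vectors, y: list of 0/1 labels.
    Gradient for weight j: dL/dw_j = sum_i (y_i - p_i) * x_ij
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(iterations):
        ps = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
        for j in range(n_features):
            grad_j = sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, ps))
            w[j] += eta * grad_j
    return w

# Toy data: features = (bias, count of "prince"); label 1 = spam
X = [[1.0, 3.0], [1.0, 0.0], [1.0, 2.0], [1.0, 0.0]]
y = [1, 0, 1, 0]
print(train_logistic_regression(X, y))
```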
Logistic Regression: Pros and Cons
• Doesn’t assume conditional independence of features
– Better calibrated probabilities
– Can handle highly correlated overlapping features

• NB is faster to train, less likely to overfit


NB & LR
• Both are linear models: z = \sum_{i=0}^{|X|} w_i x_i

• Training is different:
– NB: weights are trained independently
– LR: weights trained jointly
Perceptron Algorithm
• Very similar to logistic regression
• Not exactly computing gradients
Perceptron Algorithm
• Very similar to logistic regression
• Not exactly computing gradients

Initialize weight vector w = 0

Loop for K iterations
  For all training examples x_i
    if sign(w · x_i) != y_i
      w += (y_i - sign(w · x_i)) · x_i
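A runnable version of that pseudocode (my own sketch, assuming labels in {-1, +1} and made-up toy data). Note it uses the standard update w += y_i * x_i, which differs from the slide's (y_i - sign(w · x_i)) * x_i only by a factor of 2 on mistakes:

```python
def sign(v):
    return 1 if v >= 0 else -1

def train_perceptron(X, y, K=10):
    """Binary perceptron; X: list of feature vectors, y: labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    for _ in range(K):
        for xi, yi in zip(X, y):
            if sign(sum(wj * xj for wj, xj in zip(w, xi))) != yi:
                # Mistake-driven update: move w toward the misclassified example
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
    return w

# Toy data: features = (bias, count of "prince"); +1 = spam, -1 = not spam
X = [[1.0, 3.0], [1.0, 0.0], [1.0, 2.0], [1.0, 0.0]]
y = [1, -1, 1, -1]
print(train_perceptron(X, y))
```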
Differences between LR and Perceptron
• Online learning vs. Batch

• Perceptron doesn’t always make updates


MultiClass Classification
• Q: what if we have more than 2 categories?
– Sentiment: Positive, Negative, Neutral
– Document topics: Sports, Politics, Business, Entertainment, …
• Could train a separate logistic regression model for each category...
• Pretty clear what to do with Naïve Bayes.
Log-Linear Models

P(y \mid x) \propto e^{w \cdot f(d, y)}

P(y \mid x) = \frac{1}{Z(w)} e^{w \cdot f(d, y)}
MultiClass Logistic Regression

P(y \mid x) \propto e^{w \cdot f(d, y)}

P(y \mid x) = \frac{1}{Z(w)} e^{w \cdot f(d, y)}

P(y \mid x) = \frac{e^{w \cdot f(d, y)}}{\sum_{y' \in Y} e^{w \cdot f(d, y')}}
MultiClass Logistic Regression
• Binary logistic regression:
– We have one feature vector that matches the size of the vocabulary
• Multiclass in practice:
– One weight vector for each category: w_pos, w_neg, w_neut
– Can represent this with one giant weight vector and repeated features for each category


Q: How to compute posterior class probabilities for multiclass?

P(y = j \mid x_i) = \frac{e^{w_j \cdot x_i}}{\sum_k e^{w_k \cdot x_i}}
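A small sketch (my own) of that softmax computation over per-class weight vectors; class names and numbers are made up:

```python
import math

def softmax_posteriors(weights, x):
    """P(y = j | x) = exp(w_j · x) / sum_k exp(w_k · x).

    weights: dict mapping class name -> weight vector; x: feature vector.
    """
    scores = {y: sum(wj * xj for wj, xj in zip(wy, x)) for y, wy in weights.items()}
    m = max(scores.values())                       # subtract max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(exps.values())
    return {y: e / Z for y, e in exps.items()}

weights = {"pos": [1.0, -0.5], "neg": [-1.0, 0.5], "neut": [0.0, 0.0]}
print(softmax_posteriors(weights, [2.0, 1.0]))
```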
Maximum Likelihood Estimation
w_{MLE} = \arg\max_w \log P(y_1, \ldots, y_n \mid x_1, \ldots, x_n; w)

= \arg\max_w \sum_i \log P(y_i \mid x_i; w)

= \arg\max_w \sum_i \log \frac{e^{w \cdot f(x_i, y_i)}}{\sum_{y' \in Y} e^{w \cdot f(x_i, y')}}
Multiclass LR Gradient

\frac{\partial L}{\partial w_j} = \sum_{i=1}^{D} f_j(y_i, d_i) \;-\; \sum_{i=1}^{D} \sum_{y \in Y} f_j(y, d_i) \, P(y \mid d_i)
MAP-based learning (perceptron)

\frac{\partial L}{\partial w_j} = \sum_{i=1}^{D} f_j(y_i, d_i) \;-\; \sum_{i=1}^{D} f_j\!\left(\arg\max_{y \in Y} P(y \mid d_i), \, d_i\right)
Online Learning (perceptron)

• Rather than making a full pass through the data, compute the gradient and update parameters after each training example.
MultiClass Perceptron Algorithm
Initialize weight vector w_y = 0 for each class y

Loop for K iterations
  For all training examples x_i (with gold label y_i)
    y_pred = argmax_y w_y · x_i
    if y_pred != y_i
      w_{y_i} += x_i
      w_{y_pred} -= x_i
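A runnable sketch of that loop (my own illustration; the sparse feature dicts, class names, and toy data are assumptions):

```python
from collections import defaultdict

def dot(w, x):
    return sum(w[f] * v for f, v in x.items())

def train_multiclass_perceptron(data, classes, K=10):
    """data: list of (feature dict, gold label); keeps one weight dict per class."""
    w = {y: defaultdict(float) for y in classes}
    for _ in range(K):
        for x, y_gold in data:
            y_pred = max(classes, key=lambda y: dot(w[y], x))
            if y_pred != y_gold:
                for f, v in x.items():
                    w[y_gold][f] += v       # promote the gold class
                    w[y_pred][f] -= v       # demote the predicted class
    return w

data = [({"great": 1}, "pos"), ({"boring": 2}, "neg")]
print(train_multiclass_perceptron(data, ["pos", "neg", "neut"]))
```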
Q: what if there are only 2 categories?

P(y = j \mid x_i) = \frac{e^{w_j \cdot x_i}}{\sum_k e^{w_k \cdot x_i}}

Q: what if there are only 2 categories?

P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x + w_1 \cdot x - w_1 \cdot x} + e^{w_1 \cdot x}}

Q: what if there are only 2 categories?

P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x - w_1 \cdot x} \, e^{w_1 \cdot x} + e^{w_1 \cdot x}}

Q: what if there are only 2 categories?

P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_1 \cdot x} \left( e^{w_0 \cdot x - w_1 \cdot x} + 1 \right)}

Q: what if there are only 2 categories?

P(y = 1 \mid x) = \frac{1}{e^{w_0 \cdot x - w_1 \cdot x} + 1}
Q: what if there are only 2 categories?

P(y = 1 \mid x) = \frac{1}{e^{-w \cdot x} + 1}, \quad \text{where } w = w_1 - w_0
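A quick numeric check of this reduction (my own sketch with made-up weights): the two-class softmax gives exactly the sigmoid of (w_1 - w_0) · x.

```python
import math

x  = [1.0, 2.0]
w0 = [0.5, -1.0]    # weights for class 0
w1 = [1.0,  0.5]    # weights for class 1

dot = lambda w, v: sum(a * b for a, b in zip(w, v))

# Two-class softmax
p_softmax = math.exp(dot(w1, x)) / (math.exp(dot(w0, x)) + math.exp(dot(w1, x)))

# Sigmoid of the weight difference w = w1 - w0
w_diff = [a - b for a, b in zip(w1, w0)]
p_sigmoid = 1.0 / (1.0 + math.exp(-dot(w_diff, x)))

print(p_softmax, p_sigmoid)   # identical up to floating point
```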
Regularization
• Combating overfitting

• Intuition: don’t let the weights get very large

w_{MLE} = \arg\max_w \log P(y_1, \ldots, y_d \mid x_1, \ldots, x_d; w)

\arg\max_w \log P(y_1, \ldots, y_d \mid x_1, \ldots, x_d; w) \;-\; \lambda \sum_{i=1}^{V} w_i^2
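With this penalty, the gradient for each weight simply gains an extra -2λw_j term; a minimal sketch of one regularized update step (my own, with made-up λ, learning rate, and data):

```python
def regularized_gradient_step(w, X, y, ps, eta=0.1, lam=0.01):
    """One gradient-ascent step on the L2-regularized log-likelihood.

    dL/dw_j = sum_i (y_i - p_i) * x_ij  -  2 * lam * w_j
    ps holds the model's current probabilities p_i for each example.
    """
    return [wj + eta * (sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, ps))
                        - 2.0 * lam * wj)
            for j, wj in enumerate(w)]

# Toy usage with made-up data and current predictions
X, y, ps = [[1.0, 3.0], [1.0, 0.0]], [1, 0], [0.6, 0.5]
print(regularized_gradient_step([0.0, 0.0], X, y, ps))
```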
Regularization in the Perceptron Algorithm
• Can’t directly include regularization in the gradient

• Limit the # of iterations (early stopping)

• Parameter averaging
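Parameter averaging (the averaged perceptron) keeps a running sum of the weight vector after every example and returns its average instead of the final weights; a sketch of my own, building on the binary perceptron above:

```python
def train_averaged_perceptron(X, y, K=10):
    """Binary averaged perceptron; labels y in {-1, +1}.

    Returns the average of the weight vector over all steps, which behaves
    like a regularized version of the plain perceptron.
    """
    n = len(X[0])
    w, w_sum, steps = [0.0] * n, [0.0] * n, 0
    for _ in range(K):
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:     # mistake
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
            w_sum = [s + wj for s, wj in zip(w_sum, w)]
            steps += 1
    return [s / steps for s in w_sum]

X = [[1.0, 3.0], [1.0, 0.0], [1.0, 2.0], [1.0, 0.0]]
y = [1, -1, 1, -1]
print(train_averaged_perceptron(X, y))
```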
