Ch03: Logistic Regression
Logistic Regression
Important analytic tool in the natural and social sciences
Baseline supervised machine learning tool for classification
It is also the foundation of neural networks
Generative and Discriminative Classifiers
Naive Bayes is a generative classifier
By contrast, logistic regression is a discriminative classifier.
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image: how cat-y is this image?
Naive Bayes (generative) picks the class using the likelihood and the prior: ĉ = argmax_c P(d|c) P(c)
Logistic Regression (discriminative) models the posterior P(c|d) directly: ĉ = argmax_c P(c|d)
Components of a probabilistic machine learning classifier
Given m input/output pairs (x(i), y(i)):
◦ A feature representation of the input x
◦ A classification function that computes ŷ, the estimated class (here, the sigmoid)
◦ An objective function for learning (here, cross-entropy loss)
◦ An algorithm for optimizing the objective function (here, stochastic gradient descent)
Classification in Logistic Regression
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution (Hamilton or Madison?)
[Portrait of Alexander Hamilton]
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2, …, cJ}
Output:
◦ a predicted class ŷ ∈ C
Binary Classification in Logistic Regression
Features in logistic regression
• For feature xi, weight wi tells us how important xi is
• xi = "review contains 'awesome'": wi = +10
• xj = "review contains 'abysmal'": wj = -10
• xk = "review contains 'mediocre'": wk = -2
Logistic Regression for one observation x
How to do classification
For each feature xi, weight wi tells us the importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias:
z = Σi wi xi + b = w·x + b
If this sum is high, we say y = 1; if low, then y = 0
The problem: z isn't a probability, it's just a number!
The very useful sigmoid or logistic function
σ(z) = 1 / (1 + e^(−z)), which maps any real-valued z into the range (0, 1)
Idea of logistic regression
◦ Compute z = w·x + b
◦ Pass it through the sigmoid function: σ(w·x + b)
◦ Treat the result as a probability
Making probabilities with sigmoids
P(y = 1) = σ(w·x + b) = 1 / (1 + exp(−(w·x + b)))
P(y = 0) = 1 − P(y = 1) = 1 − σ(w·x + b)
Turning a probability into a classifier
ŷ = 1 if P(y = 1|x) > 0.5, otherwise ŷ = 0
The probabilistic classifier
P(y = 1) = σ(w·x + b)
Turning a probability into a classifier
ŷ = 1 if w·x + b > 0
ŷ = 0 if w·x + b ≤ 0
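As a quick illustration of this decision rule, here is a minimal Python/NumPy sketch; the helper names (sigmoid, predict_proba, classify) are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """P(y = 1 | x) = sigmoid(w.x + b)."""
    return sigmoid(np.dot(w, x) + b)

def classify(w, b, x):
    """Decision rule: y-hat = 1 if w.x + b > 0, else 0
    (equivalently, if P(y = 1 | x) > 0.5)."""
    return 1 if np.dot(w, x) + b > 0 else 0
```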
Logistic Regression: a text example on sentiment classification
Sentiment example: does y = 1 or y = 0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
Classifying sentiment for input x
Suppose w = […] and b = 0.1
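Since the slide's actual weight vector is not reproduced above, the usage sketch below plugs purely hypothetical weights and feature counts into the classifier; only b = 0.1 comes from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -3.0, -1.0])   # hypothetical weights (not the slide's values)
x = np.array([3.0, 2.0, 1.0])     # hypothetical feature counts (not the slide's values)
b = 0.1                           # bias from the slide

z = np.dot(w, x) + b              # 6.0 - 6.0 - 1.0 + 0.1 = -0.9
p_pos = sigmoid(z)                # P(y = 1 | x) ~= 0.29
y_hat = 1 if z > 0 else 0         # 0: negative sentiment under these made-up weights
```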
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment, − sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1 = count("awesome")
◦ x2 = log(number of words in review)
◦ a vector w of weights [w1, w2, …, wn]
◦ wi for each feature fi
Learning: Cross-Entropy Loss
Wait, where did the W's come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ.
We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function or a cost function
• We need an optimization algorithm to update w and b to minimize the loss.
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y
We'll call this difference the loss: L(ŷ, y) = how much ŷ differs from the true y
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Since there are only 2 discrete outcomes (0 or 1), we can express the probability p(y|x) from our classifier (the thing we want to maximize) as
p(y|x) = ŷ^y (1 − ŷ)^(1−y)
noting:
if y = 1, this simplifies to ŷ
if y = 0, this simplifies to 1 − ŷ
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Maximize: p(y|x) = ŷ^y (1 − ŷ)^(1−y)
Now take the log of both sides (mathematically handy)
Maximize: log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Maximize: log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)
Flip the sign to turn this into something to minimize: the cross-entropy loss
Minimize: L_CE(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]
Let's see if this works for our sentiment example:
It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you .
Now suppose the true value instead was y = 0.
The loss when the model was right (if the true y = 1) is lower than the loss when the model was wrong (if the true y = 0):
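The slides' actual classifier output for this review is not shown above, so the check below uses a made-up model output P(y = 1|x) = 0.7 just to illustrate the pattern: the cross-entropy loss is small when the true label agrees with the model's leaning and large when it does not.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """L_CE(y_hat, y) = -[ y * log(y_hat) + (1 - y) * log(1 - y_hat) ]"""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = 0.7                              # hypothetical P(y = 1 | x) for the review
print(cross_entropy_loss(y_hat, y=1))    # ~0.36  (true label positive: low loss)
print(cross_entropy_loss(y_hat, y=0))    # ~1.20  (true label negative: high loss)
```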
Stochastic Gradient Descent
Our goal: minimize the loss
Let's make explicit that the loss function is parameterized by weights θ = (w, b)
We want the weights that minimize the loss, averaged over all examples:
θ̂ = argmin_θ (1/m) Σi L_CE(f(x(i); θ), y(i))
Our goal: minimize the loss
For logistic regression, the loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is guaranteed to find the minimum
• (The loss for neural networks is non-convex)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
Gradients
The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in the function.
How much do we move in that direction?
◦ The value of the gradient, weighted by a learning rate η:
w(t+1) = w(t) − η d/dw L(f(x; w), y)
Now let's consider N dimensions
We want to know where in the N-dimensional space (of the N parameters that make up θ) we should move.
The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of the N dimensions.
Imagine 2 dimensions, w and b
Visualizing the gradient vector at the red point: it has two dimensions, shown in the x-y plane.
The gradient
We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:
θ(t+1) = θ(t) − η ∇θ L(f(x; θ), y)
What are these partial derivatives for logistic regression?
∂L_CE(ŷ, y)/∂wj = (σ(w·x + b) − y) xj
∂L_CE(ŷ, y)/∂b = σ(w·x + b) − y
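A minimal sketch of these two derivatives in NumPy (the function name is illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss for one example:
    dL/dw_j = (sigmoid(w.x + b) - y) * x_j,   dL/db = sigmoid(w.x + b) - y."""
    error = sigmoid(np.dot(w, x) + b) - y
    return error * x, error    # (gradient w.r.t. w, gradient w.r.t. b)
```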
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
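Putting the loss, the gradient, and the learning-rate hyperparameter together, here is a minimal sketch of an SGD training loop for binary logistic regression; the function name, the default values, and the random visiting order are illustrative assumptions, not details from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sgd(X, y, eta=0.1, epochs=10):
    """Train binary logistic regression with stochastic gradient descent.

    X: (m, n) feature matrix; y: (m,) array of 0/1 labels.
    eta (learning rate) and epochs are hyperparameters chosen by the designer.
    """
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(m):       # visit examples in random order
            error = sigmoid(np.dot(w, X[i]) + b) - y[i]
            w -= eta * error * X[i]              # dL/dw_j = error * x_j
            b -= eta * error                     # dL/db   = error
    return w, b
```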
Stochastic Gradient Descent: An example and more details
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume the 3 parameters (2 weights and 1 bias) in θ(0) are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent
w1 = w2 = b = 0;  x1 = 3;  x2 = 2;  y = 1;  η = 0.1
Update step: θ(t+1) = θ(t) − η ∇θ L(f(x; θ), y)
The gradient vector has 3 dimensions, one for each of w1, w2, b:
∇ = [ (σ(w·x + b) − y) x1,  (σ(w·x + b) − y) x2,  σ(w·x + b) − y ]
With w·x + b = 0 and σ(0) = 0.5, this is [ (0.5 − 1)·3, (0.5 − 1)·2, 0.5 − 1 ] = [ −1.5, −1.0, −0.5 ]
So θ(1) = θ(0) − η ∇ = [ 0.1·1.5, 0.1·1.0, 0.1·0.5 ] = [ 0.15, 0.1, 0.05 ]
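As a sanity check, this snippet reproduces the single update step above with the numbers from the example (x1 = 3, x2 = 2, true y = 1, η = 0.1, zero-initialized parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.0, 0.0]), 0.0   # w1 = w2 = b = 0
x = np.array([3.0, 2.0])           # x1 = 3, x2 = 2
y, eta = 1.0, 0.1                  # true label and learning rate from the example

y_hat = sigmoid(np.dot(w, x) + b)  # sigma(0) = 0.5
error = y_hat - y                  # -0.5
w = w - eta * error * x            # [0.15, 0.10]
b = b - eta * error                # 0.05
print(w, b)                        # -> [0.15 0.1 ] 0.05
```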
Regularization
Overfitting
A model that perfectly matches the training data has a problem.
It will also overfit to the data, modeling noise
◦ A random word that perfectly predicts y (it happens to only occur in one class) will get a very high weight.
◦ It will then fail to generalize to a test set without this word.
A good model should be able to generalize
Overfitting
+ "This movie drew me in, and it'll do the same to you."
− "I can't tell you how much I hated this movie. It sucked."
Useful or harmless features:
X1 = "this"
X2 = "movie"
X3 = "hated"
X4 = "drew me in"
4-gram features that just "memorize" the training set and might cause problems:
X5 = "the same to you"
X7 = "tell you how much"
Overfitting
A 4-gram model on tiny data will just memorize the data
◦ 100% accuracy on the training set
But it will be surprised by the novel 4-grams in the test data
◦ Low accuracy on the test set
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the model doesn't generalize well to the test set
How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
Regularization
A solution for overfitting
Add a regularization term R(θ) to the loss function (for now written as maximizing log probability rather than minimizing loss):
θ̂ = argmax_θ Σi log P(y(i)|x(i)) − αR(θ)
where α is a hyperparameter controlling how much we penalize large weights.
L1 Regularization (= lasso regression)
The sum of the (absolute values of the) weights
Named after the L1 norm ||W||1 = sum of the absolute values of the weights = Manhattan distance
R(θ) = ||θ||1 = Σj |θj|
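As a sketch of what adding R(θ) to the objective looks like in code, here is a hypothetical L1-regularized negative log likelihood for binary logistic regression; the function name, the feature-matrix shape, and the regularization strength alpha are illustrative assumptions:

```python
import numpy as np

def l1_regularized_nll(w, b, X, y, alpha):
    """Negative log likelihood plus an L1 penalty on the weights.

    X: (m, n) feature matrix; y: (m,) array of 0/1 labels.
    alpha is the regularization strength, a hyperparameter chosen by the designer.
    """
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    nll = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return nll + alpha * np.sum(np.abs(w))   # R(theta) = L1 norm of the weights
```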