Ch03 LogisticRegression

Background: Generative and Discriminative Classifiers
Logistic Regression
• Important analytic tool in the natural and social sciences
• Baseline supervised machine learning tool for classification
• Also the foundation of neural networks
Generative and Discriminative Classifiers
• Naive Bayes is a generative classifier
• By contrast, logistic regression is a discriminative classifier
Generative and Discriminative Classifiers
Suppose we're distinguishing cat images from dog images.
[Figure: example cat and dog images from ImageNet]
Generative Classifier:
• Build a model of what's in a cat image
  ◦ Knows about whiskers, ears, eyes
  ◦ Assigns a probability to any image: how cat-y is this image?
• Also build a model for dog images
• Now, given a new image: run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats:
• Oh look, dogs have collars!
• Let's ignore everything else
Generative vs Discriminative Classifiers
Finding the correct class c for a document d:

Naive Bayes (generative): ĉ = argmax_c P(d|c) P(c)  (likelihood × prior)
Logistic regression (discriminative): ĉ = argmax_c P(c|d)  (the posterior, directly)
Components of a probabilistic machine learning classifier
Given m input/output pairs (x(i), y(i)):

1. A feature representation of the input. For each input observation x(i), a vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic gradient descent.
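Not from the slides: as a rough sketch, the four components might look like this in Python. The feature function and its cue words are hypothetical placeholders, and the SGD step uses the gradient derived later in the deck:

```python
import numpy as np

# 1. Feature representation: map an input to a feature vector
#    (hypothetical: counts of two cue words)
def features(doc):
    words = doc.lower().split()
    return np.array([words.count("awesome"), words.count("abysmal")])

# 2. Classification function: sigmoid of a weighted sum
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_prob(w, b, x):
    return sigmoid(w @ x + b)          # p(y=1 | x)

# 3. Objective: cross-entropy loss for one example
def cross_entropy(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# 4. Optimizer: one stochastic-gradient-descent step
def sgd_step(w, b, x, y, eta=0.1):
    err = predict_prob(w, b, x) - y    # the scalar (ŷ - y)
    return w - eta * err * x, b - eta * err
```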
The two phases of logistic regression
• Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss.
• Test: given a test example x, we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has higher probability.
Classification in Logistic Regression
Classification Reminder
• Positive/negative sentiment
• Spam/not spam
• Authorship attribution (Hamilton or Madison?)
[Portrait: Alexander Hamilton]
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2, ..., cJ}
Output: a predicted class ŷ ∈ C
Binary Classification in Logistic Regression
Given a series of input/output pairs (x(i), y(i)):
For each observation x(i):
◦ We represent x(i) by a feature vector [x1, x2, ..., xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0, 1}
Features in logistic regression
• For feature xi, weight wi tells us how important xi is
  ◦ xi = "review contains 'awesome'":  wi = +10
  ◦ xj = "review contains 'abysmal'":  wj = -10
  ◦ xk = "review contains 'mediocre'": wk = -2
Logistic Regression for one observation x
• Input observation: vector x = [x1, x2, ..., xn]
• Weights: one per feature: W = [w1, w2, ..., wn]
  ◦ Sometimes we call the weights θ = [θ1, θ2, ..., θn]
• Output: a predicted class ŷ ∈ {0, 1}
  (multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})
How to do classification
For each feature xi, weight wi tells us the importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias:

z = (Σ_i wi xi) + b = w∙x + b

If this sum is high, we say y = 1; if low, then y = 0.
But we want a probabilistic classifier
We need to formalize "sum is high".
We'd like a principled classifier that gives us a probability.
We want a model that can tell us:
  p(y=1|x; θ)
  p(y=0|x; θ)
The problem: z isn't a probability, it's just a number!
z = w∙x + b can be any real number, from -∞ to ∞.

Solution: use a function of z that goes from 0 to 1.
The very useful sigmoid or logistic function

y = σ(z) = 1 / (1 + e^(-z))

[Figure: the S-shaped sigmoid curve, near 0 for large negative z, 0.5 at z = 0, and near 1 for large positive z]
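A quick numerical check of the sigmoid, as a minimal Python sketch (NumPy assumed):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(-5), sigmoid(0), sigmoid(5))
# ~0.0067, 0.5, ~0.9933 — monotonic, symmetric around z = 0
```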
Idea of logistic regression
• We'll compute w∙x + b
• And then we'll pass it through the sigmoid function: σ(w∙x + b)
• And we'll just treat it as a probability
Making probabilities with sigmoids

P(y=1) = σ(w∙x + b)
P(y=0) = 1 - σ(w∙x + b) = σ(-(w∙x + b))
Turning a probability into a classifier

ŷ = 1 if P(y=1|x) > 0.5, else ŷ = 0

0.5 here is called the decision boundary.
The probabilistic classifier
[Figure: P(y=1) = σ(w∙x + b) plotted against w∙x + b]
Turning a probability into a classifier

ŷ = 1 if w∙x + b > 0
ŷ = 0 if w∙x + b ≤ 0

(σ(z) > 0.5 exactly when z > 0, so we can threshold the raw score directly.)
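Putting the pieces together, a hedged sketch of the resulting binary classifier (the function names are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def classify(w, b, x):
    """Return (predicted label, P(y=1|x)) for one feature vector x."""
    p = sigmoid(w @ x + b)
    return (1 if p > 0.5 else 0), p   # 0.5 is the decision boundary
```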
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?

"It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you ."
Classifying sentiment for input x

Suppose we have a weight vector w, one weight per feature, and b = 0.1.
Classifying sentiment for input x (continued)

p(+|x) = P(y=1|x) = σ(w∙x + b)
p(-|x) = P(y=0|x) = 1 - σ(w∙x + b)
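A worked version of this computation as a Python sketch. The six feature values and weights below are assumptions for illustration, chosen to match the running example in Jurafsky & Martin's chapter (x1 = positive-lexicon words, x2 = negative-lexicon words, x3 = 1 if "no" appears, x4 = 1st/2nd-person pronouns, x5 = 1 if "!" appears, x6 = log of word count):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical values following the textbook's running example
x = np.array([3, 2, 1, 3, 0, 4.19])              # features of the review
w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])   # one weight per feature
b = 0.1

z = w @ x + b          # 0.833
p_pos = sigmoid(z)     # p(+|x) ≈ 0.70
p_neg = 1 - p_pos      # p(-|x) ≈ 0.30
print(p_pos, p_neg)
```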
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment, - sentiment)
◦ a vector x of features [x1, x2, ..., xn]
  ◦ x1 = count("awesome")
  ◦ x2 = log(number of words in review)
◦ a vector w of weights [w1, w2, ..., wn]
  ◦ wi for each feature fi
Learning: Cross-Entropy Loss
Wait, where did the w's come from?

Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ.
We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function or a cost function
• We need an optimization algorithm to update w and b to minimize the loss.
Learning components

• A loss function:
  ◦ cross-entropy loss

• An optimization algorithm:
  ◦ stochastic gradient descent
The distance between ŷ and y

We want to know how far the classifier output

ŷ = σ(w∙x + b)

is from the true output

y [= either 0 or 1]

We'll call this difference:

L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood loss = cross-entropy loss

A case of conditional maximum likelihood estimation:
We choose the parameters w, b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x
Goal: maximize the probability of the correct label, p(y|x)
Since there are only 2 discrete outcomes (0 or 1), we can express the probability p(y|x) from our classifier (the thing we want to maximize) as

p(y|x) = ŷ^y (1 - ŷ)^(1-y)

noting:
• if y=1, this simplifies to ŷ
• if y=0, this simplifies to 1 - ŷ
Deriving cross-entropy loss for a single observation x
Goal: maximize the probability of the correct label, p(y|x)

Maximize: p(y|x) = ŷ^y (1 - ŷ)^(1-y)

Now take the log of both sides (mathematically handy):

Maximize: log p(y|x) = log [ŷ^y (1 - ŷ)^(1-y)] = y log ŷ + (1-y) log(1 - ŷ)

Whatever values maximize log p(y|x) will also maximize p(y|x).
Deriving cross-entropy loss for a single observation x
Goal: maximize the probability of the correct label, p(y|x)

Maximize: log p(y|x) = y log ŷ + (1-y) log(1 - ŷ)

Now flip the sign to turn this into a loss: something to minimize.
Cross-entropy loss (because this is the formula for the cross-entropy between y and ŷ):

Minimize: L_CE(ŷ, y) = -log p(y|x) = -[y log ŷ + (1-y) log(1 - ŷ)]

Or, plugging in the definition of ŷ:

L_CE(ŷ, y) = -[y log σ(w∙x + b) + (1-y) log(1 - σ(w∙x + b))]
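The loss is easy to express directly in code; a minimal sketch (clipping ŷ away from exact 0 and 1 is my own numerical precaution, not from the slides):

```python
import numpy as np

def cross_entropy_loss(y_hat, y, eps=1e-12):
    """L_CE(ŷ, y) = -[y log ŷ + (1-y) log(1-ŷ)] for y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```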
Let's see if this works for our sentiment example
We want the loss to be:
• smaller if the model estimate is close to correct
• bigger if the model is confused
Let's first suppose the true label of this review is y=1 (positive):

"It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you ."
Let's see if this works for our sentiment example
True value is y=1. How well is our model doing?
ŷ = P(y=1|x) = σ(w∙x + b) came out high for this review. Pretty well!
What's the loss? Since y=1, the loss reduces to -log ŷ, which is small when ŷ is near 1.
Let's see if this works for our sentiment example
Suppose the true value instead was y=0. What's the loss?
Since y=0, the loss reduces to -log(1 - ŷ), which is large when ŷ is near 1.
Let's see if this works for our sentiment example
The loss when the model was right (if true y=1) is lower than the loss when the model was wrong (if true y=0).
Sure enough, the loss was bigger when the model was wrong!
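To check numerically: assuming the model estimate from the earlier sketch, ŷ ≈ 0.70, the two losses work out as follows:

```python
import numpy as np

y_hat = 0.70                      # assumed model estimate P(y=1|x)
loss_if_pos = -np.log(y_hat)      # y=1: ≈ 0.36 (model was right, small loss)
loss_if_neg = -np.log(1 - y_hat)  # y=0: ≈ 1.20 (model was wrong, big loss)
print(loss_if_pos, loss_if_neg)
```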


Stochastic Gradient Descent
Our goal: minimize the loss
Let's make explicit that the loss function is parameterized by weights θ = (w, b).
• And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious.
We want the weights that minimize the loss, averaged over all examples:

θ̂ = argmin_θ (1/m) Σ_{i=1..m} L_CE(f(x(i); θ), y(i))
Our goal: minimize the loss
For logistic regression, the loss function is convex:
• A convex function has just one minimum
• Gradient descent starting from any point is guaranteed to find the minimum
• (The loss for neural networks is non-convex)
Let's first visualize for a single scalar w
Q: Given the current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function.
[Figure: a one-dimensional loss curve over w; at the marked point the slope is negative, so we'll move w in the positive direction]
Gradients
The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in the function.

Gradient Descent: find the gradient of the loss function at the current point and move in the opposite direction.
How much do we move in that direction?
• The value of the gradient (the slope, in our one-dimensional example), weighted by a learning rate η:

w(t+1) = w(t) - η d/dw L(f(x; w), y)

• A higher learning rate means we move w faster.
Now let's consider N dimensions
We want to know where in the N-dimensional space (of the N parameters that make up θ) we should move.
The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of the N dimensions.
Imagine 2 dimensions, w and b
[Figure: visualizing the gradient vector at the red point; it has two components, shown in the w-b plane]
The gradient
We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:

∇L(f(x; θ), y) = [∂L/∂w1, ∂L/∂w2, ..., ∂L/∂wn, ∂L/∂b]

The final equation for updating θ based on the gradient is thus:

θ(t+1) = θ(t) - η ∇L(f(x; θ), y)
What are these partial derivatives for logistic regression?

The loss function:

L_CE(ŷ, y) = -[y log σ(w∙x + b) + (1-y) log(1 - σ(w∙x + b))]

The elegant derivative of this function:

∂L_CE(ŷ, y)/∂wj = (σ(w∙x + b) - y) xj = (ŷ - y) xj
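Those two formulas translate directly into a gradient routine; a minimal sketch under the same notation (NumPy assumed):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient(w, b, x, y):
    """Gradient of cross-entropy loss for one example.

    Uses dL/dwj = (ŷ - y) * xj and dL/db = (ŷ - y).
    Returns (dL/dw, dL/db).
    """
    err = sigmoid(w @ x + b) - y   # the scalar (ŷ - y)
    return err * x, err
```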
Hyperparameters
The learning rate η is a hyperparameter:
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
Stochastic Gradient Descent: an example and more details
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
  x1 = 3 (count of positive lexicon words)
  x2 = 2 (count of negative lexicon words)
Assume the 3 parameters (2 weights and 1 bias) in θ(0) are zero:
  w1 = w2 = b = 0
  η = 0.1
Example of gradient descent
w1 = w2 = b = 0;  x1 = 3;  x2 = 2;  y = 1

The update step for θ is:

θ(t+1) = θ(t) - η ∇L(f(x; θ), y),  where ∂L_CE/∂wj = (σ(w∙x + b) - y) xj

The gradient vector has 3 dimensions, for w1, w2, and b. Since w∙x + b = 0 and σ(0) = 0.5:

∇ = [(σ(w∙x+b) - y)·x1, (σ(w∙x+b) - y)·x2, σ(w∙x+b) - y]
  = [(0.5 - 1)·3, (0.5 - 1)·2, 0.5 - 1]
  = [-1.5, -1.0, -0.5]
Example of gradient descent

Now that we have a gradient, we compute the new parameter vector θ(1) by moving θ(0) in the opposite direction from the gradient (η = 0.1):

θ(1) = θ(0) - η ∇
     = [0, 0, 0] - 0.1·[-1.5, -1.0, -0.5]
     = [0.15, 0.10, 0.05]

So after one step of gradient descent: w1 = 0.15, w2 = 0.10, b = 0.05.
Note that enough negative examples would eventually make w2 negative.
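The same single step, checked in code; a sketch reusing the gradient routine from above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([3.0, 2.0])   # positive / negative lexicon counts
y = 1                      # true label
w = np.zeros(2); b = 0.0   # θ(0) = all zeros
eta = 0.1

err = sigmoid(w @ x + b) - y          # σ(0) - 1 = -0.5
w, b = w - eta * err * x, b - eta * err
print(w, b)                           # [0.15 0.1] 0.05
```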


Mini-batch training
Stochastic gradient descent chooses a single random example at a time.
That can result in choppy movements.
It is more common to compute the gradient over batches of training instances:
• Batch training: the entire dataset
• Mini-batch training: m examples
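A hedged sketch of mini-batch training for this model; the batch size, epoch count, and shuffling scheme are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, eta=0.1, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD for binary logistic regression.

    X: (num_examples, num_features) array; y: array of 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        order = rng.permutation(len(y))             # shuffle each epoch
        for i in range(0, len(y), batch_size):
            idx = order[i:i + batch_size]
            err = sigmoid(X[idx] @ w + b) - y[idx]  # (ŷ - y) per example
            w -= eta * (X[idx].T @ err) / len(idx)  # average gradient
            b -= eta * err.mean()
    return w, b
```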
Regularization
Overfitting
A model that perfectly matches the training data has a problem.
It will also overfit to the data, modeling noise:
◦ A random word that perfectly predicts y (it happens to occur in only one class) will get a very high weight.
◦ It will fail to generalize to a test set without this word.
A good model should be able to generalize.
Overfitting: useful or harmless features vs. memorization

+ "This movie drew me in, and it'll do the same to you."
- "I can't tell you how much I hated this movie. It sucked."

Useful or harmless features:
  x1 = "this"
  x2 = "movie"
  x3 = "hated"
  x4 = "drew me in"

4-gram features that just "memorize" the training set and might cause problems:
  x5 = "the same to you"
  x7 = "tell you how much"
Overfitting
A 4-gram model on tiny data will just memorize the data:
◦ 100% accuracy on the training set
But it will be surprised by the novel 4-grams in the test data:
◦ Low accuracy on the test set
Models that are too powerful can overfit the data:
◦ Fitting the details of the training data so exactly that the model doesn't generalize well to the test set
How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
Regularization
A solution for overfitting:
Add a regularization term R(θ) to the objective (for now written as maximizing log probability rather than minimizing loss):

θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] - α R(θ)

Idea: choose an R(θ) that penalizes large weights:
◦ fitting the data well with lots of big weights is not as good as fitting the data a little less well with small weights
L2 Regularization (= ridge regression)
The sum of the squares of the weights:

R(θ) = ||θ||²₂ = Σ_{j=1..n} θj²

The name comes from the (square of the) L2 norm ||θ||₂, the Euclidean distance of θ from the origin.

L2-regularized objective function:

θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] - α Σ_{j=1..n} θj²
L1 Regularization (= lasso regression)
The sum of the absolute values of the weights:

R(θ) = ||θ||₁ = Σ_{j=1..n} |θj|

Named after the L1 norm ||W||₁: the sum of the absolute values of the weights, i.e., the Manhattan distance.

L1-regularized objective function:

θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] - α Σ_{j=1..n} |θj|
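In gradient-based training, each penalty simply adds a term to every weight's gradient. A minimal sketch of both update terms; the α value and the sign convention (minimizing loss plus penalty, equivalent to the maximization above) are my own illustration:

```python
import numpy as np

def regularized_grad(w, grad_loss, alpha=0.01, kind="l2"):
    """Add the regularizer's gradient to the loss gradient.

    L2: d/dw alpha * sum(w**2) = 2 * alpha * w    (shrinks weights smoothly)
    L1: d/dw alpha * sum(|w|)  = alpha * sign(w)  (constant push toward 0)
    """
    if kind == "l2":
        return grad_loss + 2 * alpha * w
    return grad_loss + alpha * np.sign(w)
```

L2 prefers many small weights, while L1's constant-magnitude push drives some weights to exactly zero, yielding sparse models.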
