04-LogisticRegression
Logistic Regression
Overview
Binary linear classification
• classification: given a D-dimensional input x ∈ R^D, predict a discrete-valued target
• binary: predict a binary target t ∈ {0, 1}
• Training examples with t = 1 are called positive examples, and training examples with t = 0 are called negative examples. (Sorry.)
• Using t ∈ {0, 1} or t ∈ {−1, +1} is a matter of computational convenience.
• linear: model prediction y is a linear function of x, followed by a threshold r:
    z = w^T x + b
    y = 1 if z ≥ r, and y = 0 if z < r
Some Simplifications
• Eliminate the bias by adding a dummy feature x0 that always takes the value 1; the bias then becomes the weight w0.
• Eliminate the threshold by absorbing r into the bias, so we can take r = 0.
• The prediction rule becomes: output 1 if w^T x ≥ 0, and 0 otherwise.
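To make the simplified prediction rule concrete, here is a minimal NumPy sketch (the helper names add_dummy_feature and predict are illustrative, not from the lecture):

```python
import numpy as np

# A minimal sketch of the simplified binary linear classifier:
# prepend a dummy feature x0 = 1 and threshold the linear output at 0.

def add_dummy_feature(X):
    """Prepend a column of ones so the bias becomes the weight w0."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w, X):
    """Predict 1 where w^T x >= 0, and 0 otherwise."""
    return (X @ w >= 0).astype(int)

X = add_dummy_feature(np.array([[2.0, -1.0], [-3.0, 0.5]]))  # two toy inputs (made up)
w = np.array([0.1, 1.0, 2.0])                                # (w0, w1, w2), also made up
print(predict(w, X))                                         # [1 0]
```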
Examples
NOT
  x0  x1  t
   1   0  1
   1   1  0
• Suppose this is our training set, with the dummy feature x0 included.
AND
  x0  x1  x2  t
   1   0   0  0
   1   0   1  0
   1   1   0  0
   1   1   1  1
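As a quick check that both training sets are linearly separable (the weight vectors below are hand-picked for illustration, not taken from the slides), we can classify them with the rule from the simplifications above:

```python
import numpy as np

# Sanity check: predict 1 iff w^T x >= 0, with hand-picked (illustrative) weights.

def predict(w, X):
    return (X @ w >= 0).astype(int)

# NOT: inputs (x0, x1), targets t
X_not = np.array([[1, 0], [1, 1]])
t_not = np.array([1, 0])
w_not = np.array([0.5, -1.0])          # one feasible choice: w0 >= 0 and w0 + w1 < 0
print(predict(w_not, X_not))           # [1 0], matching t_not

# AND: inputs (x0, x1, x2), targets t
X_and = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
t_and = np.array([0, 0, 0, 1])
w_and = np.array([-1.5, 1.0, 1.0])     # fires only when both x1 and x2 are 1
print(predict(w_and, X_and))           # [0 0 0 1], matching t_and
```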
The Geometric Picture
Data space:
• Training examples are points.
• Weights (hypotheses) w can be represented by half-spaces
  H+ = {x : w^T x ≥ 0}, H− = {x : w^T x < 0}
  ▪ The boundaries of these half-spaces pass through the origin.
• The boundary is the decision boundary: {x : w^T x = 0}
  ▪ In 2-D it is a line, but in higher dimensions it is a hyperplane.
• If the training examples can be perfectly separated by a linear decision rule, we say the data is linearly separable.
Weight space:
• Weights (hypotheses) w are points.
• Each training example x specifies a half-space that w must lie in to be correctly classified: w^T x ≥ 0 if t = 1, and w^T x < 0 if t = 0.
• For the NOT example:
  x0 = 1, x1 = 0, t = 1 ⟹ (w0, w1) ∈ {w : w0 ≥ 0}
  x0 = 1, x1 = 1, t = 0 ⟹ (w0, w1) ∈ {w : w0 + w1 < 0}
• The region satisfying all the constraints is the feasible region; if this region is nonempty, the problem is feasible, otherwise it is infeasible.
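A tiny sketch of this weight-space view for the NOT example: each training case contributes one constraint, and a candidate w is feasible exactly when it satisfies all of them (the test points below are made up for illustration):

```python
# Feasibility check for the NOT constraints on (w0, w1).

def feasible_for_not(w0, w1):
    # from (x0, x1) = (1, 0), t = 1  and  (x0, x1) = (1, 1), t = 0
    return (w0 >= 0) and (w0 + w1 < 0)

print(feasible_for_not(0.5, -1.0))   # True  -- inside the feasible region
print(feasible_for_not(1.0, 0.0))    # False -- violates w0 + w1 < 0
```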
Problems with 0-1 loss
• The gradient of the 0-1 loss is zero almost everywhere, and where it is not zero the loss has a discontinuity, so gradient descent cannot be used to improve the weights.
Second Try: Squared Loss for Linear Regression
• Treat the binary target as a real value and fit a linear regression model with squared loss, which is much easier to optimize.
Problems with Squared Loss
• The squared loss gives a large penalty to a correct prediction made with high confidence (e.g., z ≫ 0 when t = 1).
Logistic Regression
• Squash the linear model's output through the logistic (sigmoid) function σ(z) = 1 / (1 + e^(−z)), so the prediction y = σ(w^T x + b) lies in [0, 1].
• Train by minimizing the cross-entropy loss L_CE(y, t) = −t log y − (1 − t) log(1 − y), averaged over the training set.
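A minimal sketch, assuming the standard sigmoid and cross-entropy formulas above (the function names are illustrative, not course code):

```python
import numpy as np

# Logistic regression building blocks: sigmoid activation and cross-entropy loss.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, t):
    """L_CE(y, t) = -t log y - (1 - t) log(1 - y), elementwise."""
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

z = np.array([-3.0, 0.0, 3.0])          # toy linear outputs, chosen for illustration
y = sigmoid(z)                          # predictions in (0, 1)
print(y)                                # approx [0.047, 0.5, 0.953]
print(cross_entropy(y, t=1.0))          # small loss only when y is close to 1
```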
Comparing Loss Functions for t = 1
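The following rough sketch evaluates, for a fixed target t = 1, the losses discussed so far as functions of z (the particular z values are chosen only for illustration): the 0-1 loss, squared error on the linear output, squared error on the logistic output, and cross-entropy.

```python
import numpy as np

# Compare loss values for t = 1 across a few values of z.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

zs = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
y = sigmoid(zs)

zero_one = (zs < 0).astype(float)            # 0-1 loss: wrong whenever z < 0
squared = 0.5 * (zs - 1.0) ** 2              # squared error of the linear output vs t = 1
logistic_squared = 0.5 * (y - 1.0) ** 2      # squared error of the logistic output vs t = 1
cross_entropy = -np.log(y)                   # cross-entropy with t = 1

for z, l01, lsq, lls, lce in zip(zs, zero_one, squared, logistic_squared, cross_entropy):
    print(f"z={z:5.1f}  0-1={l01:.1f}  squared={lsq:5.2f}  "
          f"logistic+squared={lls:5.3f}  cross-entropy={lce:5.2f}")
```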
Gradient Descent for Logistic Regression
How do we minimize the cost J for logistic regression? Unfortunately, there is no direct solution, so we use gradient descent.
A standard initialization is w = 0.
Gradient of Logistic Loss
For z = w^T x + b, y = σ(z), and the cross-entropy loss L_CE(y, t) = −t log y − (1 − t) log(1 − y), the chain rule gives ∂L_CE/∂z = y − t, and hence ∂L_CE/∂w_j = (y − t) x_j.
Therefore, for the averaged cost J,
    ∂J/∂w_j = (1/N) Σ_{i=1}^{N} (y^(i) − t^(i)) x_j^(i)
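A quick numerical check of this gradient on made-up toy data (the helper names and the finite-difference test are illustrative, not from the lecture):

```python
import numpy as np

# Check the analytic gradient (1/N) * X^T (y - t) against a finite-difference estimate.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, t):
    y = sigmoid(X @ w)
    return np.mean(-t * np.log(y) - (1 - t) * np.log(1 - y))

def grad(w, X, t):
    y = sigmoid(X @ w)
    return X.T @ (y - t) / len(t)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])   # dummy feature x0 = 1
t = np.array([0.0, 1.0, 1.0, 0.0, 1.0])                     # arbitrary toy targets
w = rng.normal(size=3)

eps = 1e-6
numeric = np.array([(cost(w + eps * np.eye(3)[j], X, t) -
                     cost(w - eps * np.eye(3)[j], X, t)) / (2 * eps) for j in range(3)])
print(np.allclose(grad(w, X, t), numeric, atol=1e-6))       # True
```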
Gradient Descent for Logistic Regression
Logistic regression update:
    w_j ← w_j − (α/N) Σ_{i=1}^{N} (y^(i) − t^(i)) x_j^(i)
This has exactly the same form as the gradient descent update for linear regression; only the definition of y differs. Not a coincidence! These are both examples of generalized linear models. But we won't go into further detail.
Notice the 1/N in front of the sums, which comes from averaging the losses. This is why you need a smaller learning rate when the cost is a sum of the losses rather than an average.
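Putting the pieces together, here is a compact sketch of the gradient descent loop (the training data reuses the AND example from earlier; the learning rate and iteration count are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

# Gradient descent for logistic regression with the averaged cross-entropy cost.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, t, alpha=0.5, num_steps=1000):
    N, D = X.shape
    w = np.zeros(D)                      # standard initialization w = 0
    for _ in range(num_steps):
        y = sigmoid(X @ w)               # predictions in (0, 1)
        w -= alpha / N * X.T @ (y - t)   # w_j <- w_j - (alpha/N) sum_i (y_i - t_i) x_ij
    return w

# Train on the AND example (dummy feature x0 included).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0.0, 0.0, 0.0, 1.0])
w = fit_logistic(X, t)
print((sigmoid(X @ w) >= 0.5).astype(int))  # [0 0 0 1]
```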
Main Takeaways on Logistic Regression 1/2
Why did we try 0-1 loss first? What’s the problem with it?
Natural choice for classification.
Gradient zero almost everywhere; where it is not zero, there is a discontinuity.
Why did we try squared loss next? What’s the problem with it?
Easier to optimize.
Large penalty for a correct prediction with high confidence.
Main Takeaways on Logistic Regression 2/2
Why did we try logistic activation function next? What’s the problem with it?
Prediction ∈ [0, 1].
An extreme mis-classification case has a vanishing gradient, so it (wrongly) appears optimal to gradient descent.
Main Takeaways on Basic Concepts