Week 3 - Lecture Slides - Logistic Regression

CS6140: Machine Learning

Week 3 – Logistic Regression

Dr. Ryan Rad

Summer 2024
Today’s Agenda

• Group Quiz
• Generalized Linear Models
• Case Study: Sentiment Analysis
• Logistic Regression
• Decision Boundary
• Loss Function
• Evaluation Metrics

Sentiment Classifier

In our example, we want to classify a restaurant review as positive or negative.

Input x: sentence from review → Classifier Model → Output y: predicted class
Converting Text to Numbers (Vectorizing):
Bag of Words

Idea: One feature per word!


Example: “Sushi was great, the food was awesome, but the service was terrible”

sushi | was | great | the | food | awesome | but | service | terrible
  1   |  3  |   1   |  2  |   1  |    1    |  1  |    1    |    1

This seems too simple, right?

Stay tuned for issues that arise and how to address them :)

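The bag-of-words idea above can be sketched in a few lines of Python. This is a minimal illustration (a real vectorizer such as scikit-learn's CountVectorizer handles tokenization, casing, and unseen words far more carefully):

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocab = ["sushi", "was", "great", "the", "food", "awesome", "but", "service", "terrible"]
x = bag_of_words("Sushi was great, the food was awesome, but the service was terrible", vocab)
print(x)  # [1, 3, 1, 2, 1, 1, 1, 1, 1]
```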
Pre-Processing: Sample Dataset

Review Sentiment
“Sushi was great, the food was awesome, but the service +1
was terrible”
… …
“Terrible food; the sushi was rancid.” -1

Vectorizer

sushi | was | great | the | food | awesome | but | service | terrible | rancid | Sentiment
  1   |  3  |   1   |  2  |   1  |    1    |  1  |    1    |    1     |    0   |    +1
  …   |  …  |   …   |  …  |   …  |    …    |  …  |    …    |    …     |    …   |    …
  1   |  1  |   0   |  1  |   1  |    0    |  0  |    0    |    1     |    1   |    −1
Attempt 1: Simple Threshold Classifier

Idea: Use a list of good words and bad words, classify review by the most frequent type of word

Simple Threshold Classifier
  Input x: sentence from review
  Count the number of positive and negative words in x
  If num_positive > num_negative:
      ŷ = +1
  Else:
      ŷ = −1

Word       Good?
sushi      None
was        None
great      Good
the        None
food       None
but        None
awesome    Good
service    None
terrible   Bad
rancid     Bad

Example: “Sushi was great, the food was awesome, but the service was terrible”
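The threshold classifier above can be sketched directly; the good/bad word sets here are just the toy lexicon from the table, not a real sentiment word list:

```python
def threshold_classify(tokens, good_words, bad_words):
    """Predict +1 if positive words outnumber negative words, else -1."""
    num_positive = sum(1 for t in tokens if t in good_words)
    num_negative = sum(1 for t in tokens if t in bad_words)
    return +1 if num_positive > num_negative else -1

good = {"great", "awesome"}
bad = {"terrible", "rancid"}
tokens = "sushi was great the food was awesome but the service was terrible".split()
print(threshold_classify(tokens, good, bad))  # 2 good words vs 1 bad word -> +1
```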
Limitations of Attempt 1 (Simple Threshold Classifier)

• Words have different degrees of sentiment.


Awesome > Great
How can we weigh them differently?

• Single words are not enough sometimes…


“Good” → Positive
“Not Good” → Negative

• How do we get list of positive/negative words?


Single Words Are Sometimes Not Enough!

What if instead of making each feature one word, we made it two?


• Unigram: a sequence of one word
• Bigram: a sequence of two words
• N-gram: a sequence of n words
“Sushi was good, the food was good, the service was not good”

Unigrams:
sushi | was | good | the | food | service | not
  1   |  3  |  3   |  2  |  1   |    1    |  1

Bigrams:
sushi was | was good | good the | the food | food was | the service | service was | was not | not good
    1     |    2     |    2     |    1     |    1     |      1      |      1      |    1    |    1
Longer sequences of words result in more context and more features, but also a greater chance of overfitting.
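N-gram extraction is short enough to sketch; this reproduces the bigram counts from the example above:

```python
def ngrams(tokens, n):
    """Return all n-grams (joined as strings) from a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "sushi was good the food was good the service was not good".split()
print(ngrams(tokens, 2).count("was good"))  # the bigram "was good" appears 2 times
```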
Words Have Different Degrees of Sentiments

What if we generalize good/bad to a numeric weighting per word?


Word       Good?   Weight
sushi      None      0
was        None      0
great      Good      1
the        None      0
food       None      0
but        None      0
awesome    Good      2
service    None      0
terrible   Bad      −1
rancid     Bad      −2
How do we get the word weights?

What if we learn them from the data?


y = w₀ + w₁φ₁(x) + w₂φ₂(x) + … + w_D φ_D(x)

φ₁(x)  φ₂(x)  φ₃(x)  φ₄(x)  φ₅(x)  φ₆(x)   φ₇(x)  φ₈(x)    φ₉(x)
sushi   was   great   the   food  awesome   but  service  terrible
  1      3      1      2      1      1       1      1        1

Word       Weight
sushi       w₁
was         w₂
great       w₃
the         w₄
food        w₅
awesome     w₆
but         w₇
service     w₈
terrible    w₉

In linear regression we learned the weights for each feature.
Can we do something similar here?
Attempt 2: Linear Regression
y = w₀ + w₁φ₁(x) + w₂φ₂(x) + … + w_D φ_D(x)

Idea: Use the regression model we learned! The output will be the sentiment!

Predicted Sentiment = ŷ = Σ_{j=0}^{D} w_j φ_j(x)

Word       Weight
sushi        0
was          0
great        1
the          0
food         0
awesome      2
but          0
service      0
terrible    −1

“Sushi was great, the food was awesome, but the service was terrible”

Predicted Sentiment = ŷ = (1·1) + (2·1) + (−1·1) = 2
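The weighted sum above can be sketched as follows (the counts and weights are the toy values from this slide's tables):

```python
def score(features, weights, w0=0.0):
    """Linear score: w0 + sum over features of weight * count."""
    return w0 + sum(w * f for w, f in zip(weights, features))

# "Sushi was great, the food was awesome, but the service was terrible"
counts  = [1, 3, 1, 2, 1, 1, 1, 1, 1]    # sushi, was, great, the, food, awesome, but, service, terrible
weights = [0, 0, 1, 0, 0, 2, 0, 0, -1]   # toy learned weights from the table
print(score(counts, weights))  # (1*1) + (2*1) + (-1*1) = 2
```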
Attempt 2: Linear Regression

Score(x⁽ⁱ⁾) = ŝ
            = w₀ + w₁φ₁(x⁽ⁱ⁾) + w₂φ₂(x⁽ⁱ⁾) + … + w_D φ_D(x⁽ⁱ⁾)
            = Σ_{j=0}^{D} w_j φ_j(x⁽ⁱ⁾)
            = wᵀ φ(x⁽ⁱ⁾)

This score will always be numerical!


Attempt 3: Linear Classifier
Idea: Only predict the sign of the output!

Predicted Sentiment = ŷ = sign(Score(x))

Linear Classifier
  Input x: sentence from review
  Compute Score(x)
  If Score(x) > 0:
      ŷ = +1
  Else:
      ŷ = −1
Decision Boundary

Consider if only two words had non-zero coefficients:

Word          Coefficient   Weight
(constant)        w₀          0.0
awesome           w₁          1.0
awful             w₂         −1.5

ŝ = 1.0 · #awesome − 1.5 · #awful

[Plot: reviews plotted in the (#awesome, #awful) plane.]
On the same plot, the decision boundary is the line where the score is zero:

1.0 · #awesome − 1.5 · #awful = 0

[Plot: this line separates the (#awesome, #awful) plane; points with ŝ > 0 are predicted +1 and points with ŝ < 0 are predicted −1.]
Issue: How do we train this?

Say we were to use the MSE…

MSE = (1/n) Σ_{i=1}^{n} (y⁽ⁱ⁾ − sign(Score(x⁽ⁱ⁾)))²

The derivative of the sign function is 0!

Hence, Gradient Descent will no longer work :(


Mathematical Definition

Can we use MSE for a classification task?

One idea is to just model the process of finding ŵ based on what we discussed in linear regression using MSE:

ŵ = argmin_w (1/2n) Σ_{i=1}^{n} ???
Filling in the blank: count the misclassifications with an indicator loss.

ŵ = argmin_w (1/2n) Σ_{i=1}^{n} 𝕀[y⁽ⁱ⁾ ≠ ŷ⁽ⁱ⁾]

Great! This makes sense conceptually!
Will this work?
Will this work?

Assume h₁(x) = #awesome, so w₁ is its coefficient and w₀ is fixed.

[Plot: the 0/1 loss as a function of w₁ is piecewise constant (a flat staircase), so its gradient is 0 almost everywhere and gradient descent makes no progress.]
Convexity

Taken from Prof. Matt Gormley, CMU



Quality Metric for Classification

The MSE loss function doesn’t work here for several reasons:

• The outputs are discrete values with no ordered nature, so we need a different way to frame how close a prediction is to a certain correct category.
• The MSE loss function for a classification task is not continuous, differentiable, or convex, so we can’t use an optimization algorithm like Gradient Descent to find an optimal set of weights.

Note: Convexity is an important concept in Machine Learning. By minimizing error, we want to find the global minimum, and that’s exactly what a convex function guarantees we can find.

Let’s frame this problem in terms of probabilities instead.

Probabilities

Assume that there is some randomness in the world; instead, we will try to model the probability of a positive/negative label.

Examples:
“The sushi & everything else were awesome!”
• Definite positive (+1)
• P(y = +1 | x = “The sushi & everything else were awesome!”) = 0.99

“The sushi was alright, the service was OK”
• Not as sure
• P(y = −1 | x = “The sushi was alright, the service was OK”) = 0.5

Use probability as the measurement of certainty.

P(y | x)

Idea: Estimate probabilities P̂(y | x) and use those for prediction.

Probability Classifier
  Input x: sentence from review
  Estimate class probability P̂(y = +1 | x)
  If P̂(y = +1 | x) > 0.5:
      ŷ = +1
  Else:
      ŷ = −1

Notes:
Estimating the probability improves interpretability.
- It is unclear how much better a score of 5 is than a score of 3, but it is clear how much better a probability of 0.75 is than a probability of 0.5.
Connecting Score & Probability

Idea: Let’s try to relate the value of Score(x) to P̂(y = +1 | x).

What if Score(x) is positive?
What if Score(x) is negative?
What if Score(x) is 0?

[Plot: the decision boundary 1.0 · #awesome − 1.5 · #awful = 0 in the (#awesome, #awful) plane.]
Connecting Score & Probability

Score(x⁽ⁱ⁾) = wᵀφ(x⁽ⁱ⁾) ranges over (−∞, +∞):

Score → −∞ : very sure ŷ⁽ⁱ⁾ = −1,           P̂(y⁽ⁱ⁾ = +1 | x⁽ⁱ⁾) = 0
Score = 0  : not sure if ŷ⁽ⁱ⁾ = −1 or +1,   P̂(y⁽ⁱ⁾ = +1 | x⁽ⁱ⁾) = 0.5
Score → +∞ : very sure ŷ⁽ⁱ⁾ = +1,           P̂(y⁽ⁱ⁾ = +1 | x⁽ⁱ⁾) = 1
Logistic Function

Want: a function that takes numbers arbitrarily large/small and maps them between 0 and 1.

sigmoid(Score(x)) = 1 / (1 + e^(−Score(x)))

Score(x)    sigmoid(Score(x))
 −∞         1 / (1 + e^(−(−∞)))  = 0
 −2         1 / (1 + e^(2))      ≈ 0.12
  0         1 / (1 + e^(0))      = 0.5
  2         1 / (1 + e^(−2))     ≈ 0.88
 +∞         1 / (1 + e^(−∞))     = 1
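The sigmoid is one line of Python; a quick numerical check of the mapping described above:

```python
import math

def sigmoid(score):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

print(sigmoid(0))   # 0.5
print(sigmoid(2))   # ~0.88, and sigmoid(-2) ~ 0.12
print(sigmoid(20))  # very close to 1
```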
Logistic Function
P(y⁽ⁱ⁾ = +1 | x⁽ⁱ⁾, w) = sigmoid(Score(x⁽ⁱ⁾)) = 1 / (1 + e^(−wᵀφ(x⁽ⁱ⁾)))

Logistic Regression Classifier
  Input x: sentence from review
  Estimate class probability P̂(y = +1 | x, ŵ) = sigmoid(ŵᵀ h(x))
  If P̂(y = +1 | x, ŵ) > 0.5:
      ŷ = +1
  Else:
      ŷ = −1
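Putting the pieces together, the logistic regression classifier in the box above can be sketched as follows (the counts and weights are the toy values from the running example):

```python
import math

def predict_proba(features, weights, w0=0.0):
    """P(y = +1 | x, w) = sigmoid(w^T h(x))."""
    s = w0 + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-s))

def predict(features, weights, w0=0.0):
    """Predict +1 when the estimated probability exceeds 0.5, else -1."""
    return +1 if predict_proba(features, weights, w0) > 0.5 else -1

counts  = [1, 3, 1, 2, 1, 1, 1, 1, 1]    # bag-of-words counts for the example review
weights = [0, 0, 1, 0, 0, 2, 0, 0, -1]   # toy weights
print(predict_proba(counts, weights))    # sigmoid(2) ~ 0.88
print(predict(counts, weights))          # +1
```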
Quality Metric = Likelihood

Want to compute the probability of seeing our dataset for every possible setting
for 𝑤. Find 𝑤 that makes data most likely!

Data point     φ₁(x)   φ₂(x)    y     Choose w to maximize
x⁽¹⁾, y⁽¹⁾       2       1     +1     P(y⁽¹⁾ = +1 | x⁽¹⁾, w)
x⁽²⁾, y⁽²⁾       0       2     −1     P(y⁽²⁾ = −1 | x⁽²⁾, w)
x⁽³⁾, y⁽³⁾       3       3     −1     P(y⁽³⁾ = −1 | x⁽³⁾, w)
x⁽⁴⁾, y⁽⁴⁾       4       1     +1     P(y⁽⁴⁾ = +1 | x⁽⁴⁾, w)
Learn ŵ

Now that we have our new model, we will talk about how to choose ŵ to be the “best fit”.
The choice of w affects how likely seeing our dataset is:

ℓ(w) = ∏_{i=1}^{n} P(y⁽ⁱ⁾ | x⁽ⁱ⁾, w)

P(y⁽ⁱ⁾ = +1 | x⁽ⁱ⁾, w) = 1 / (1 + e^(−wᵀφ(x⁽ⁱ⁾)))

P(y⁽ⁱ⁾ = −1 | x⁽ⁱ⁾, w) = e^(−wᵀφ(x⁽ⁱ⁾)) / (1 + e^(−wᵀφ(x⁽ⁱ⁾)))

[Plot: the four data points in the (#awesome, #awful) plane.]
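The likelihood of the toy dataset above can be computed directly for any candidate w (the weight vector below is an arbitrary illustrative choice, not a fitted one):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(data, w):
    """Product over examples of P(y_i | x_i, w), where P(+1 | x) = sigmoid(w . phi(x))."""
    total = 1.0
    for phi, y in data:
        p_pos = sigmoid(sum(wj * fj for wj, fj in zip(w, phi)))
        total *= p_pos if y == +1 else (1.0 - p_pos)
    return total

# The four data points from the slide: (phi1(x), phi2(x)) -> y
data = [([2, 1], +1), ([0, 2], -1), ([3, 3], -1), ([4, 1], +1)]
print(likelihood(data, [1.0, -1.5]))  # probability of seeing these labels under this w
```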
Loss Function

Find the w that maximizes the likelihood:

ŵ = argmax_w ℓ(w) = argmax_w ∏_{i=1}^{n} P(y⁽ⁱ⁾ | x⁽ⁱ⁾, w)

Generally, we maximize the log-likelihood, which looks like

ŵ = argmax_w log(ℓ(w)) = argmax_w Σ_{i=1}^{n} log P(y⁽ⁱ⁾ | x⁽ⁱ⁾, w)

It is also commonly written by separating out positive and negative terms (with labels y ∈ {0, 1}):

Cost(h_w(x), y) = −log(h_w(x))       if y = 1   (positive terms)
                = −log(1 − h_w(x))   if y = 0   (negative terms)
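In practice we work with the log-likelihood, which turns the product into a sum; a sketch using the same toy dataset and the same illustrative weight vector:

```python
import math

def log_likelihood(data, w):
    """Sum over examples of log P(y_i | x_i, w) -- the quantity maximized during training."""
    total = 0.0
    for phi, y in data:
        p_pos = 1.0 / (1.0 + math.exp(-sum(wj * fj for wj, fj in zip(w, phi))))
        total += math.log(p_pos if y == +1 else 1.0 - p_pos)
    return total

data = [([2, 1], +1), ([0, 2], -1), ([3, 3], -1), ([4, 1], +1)]
print(log_likelihood(data, [1.0, -1.5]))  # always <= 0; closer to 0 means a better fit
```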

Decision Boundary

The decision boundary is the set of x such that

1 / (1 + e^(−wᵀx)) = 0.5

A little bit of algebra shows that this is equivalent to

1 = e^(−wᵀx)

and, taking the natural log of both sides,

0 = −Σ_{j=0}^{D} w_j x_j

So, our decision boundary is linear!


Complex Decision Boundaries?

What if we want to use a more complex decision boundary?


• Need more complex model/features! (More on this later)
The logistic function becomes “sharper” with larger coefficients:

w₀ = 0,  w#awesome = +1,  w#awful = −1
w₀ = 0,  w#awesome = +2,  w#awful = −2
w₀ = 0,  w#awesome = +6,  w#awful = −6

[Plot: sigmoid(wᵀh(x)) against #awesome − #awful for the three settings; larger coefficients make the curve steeper around 0.]

What does this mean for our predictions?


Because the 𝑆𝑐𝑜𝑟𝑒(𝑥) is getting larger in magnitude, the
probabilities are closer to 0 or 1!
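This steepening is easy to verify numerically; in the sketch below, scale stands in for the coefficient magnitude in the three weight settings above:

```python
import math

def p_positive(diff, scale):
    """P(y = +1) when w_awesome = scale, w_awful = -scale, diff = #awesome - #awful."""
    return 1.0 / (1.0 + math.exp(-scale * diff))

for scale in (1, 2, 6):
    print(scale, p_positive(1, scale))  # larger coefficients push the probability toward 1
```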
Binary classification vs. Multiclass classification

[Plot: two scatter plots in the (x₁, x₂) plane — two classes on the left, multiple classes on the right.]
How do we extend Logistic Regression to Multiclass classification?

• Approach 1: one-versus-one
• Computationally very expensive

• Approach 2: one-versus-rest
• Approach 3: discriminant functions
One-vs-all (one-vs-rest)

[Plot: a three-class dataset in the (x₁, x₂) plane split into three binary problems, with one classifier h⁽ⁱ⁾(x) per class separating class i from the rest.]

h⁽ⁱ⁾(x) = P(y = i | x; θ)    (i = 1, 2, 3)

Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier h⁽ⁱ⁾(x) for each class i to predict the probability that y = i
• Given a new input x, pick the class i that maximizes h⁽ⁱ⁾(x)
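A one-vs-rest prediction can be sketched with per-class weight vectors (the classifiers dict and all its weights below are hypothetical toy values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_rest_predict(x, classifiers):
    """Pick the class whose binary logistic classifier gives the highest probability.
    `classifiers` maps class label -> weight vector (toy setup)."""
    return max(classifiers,
               key=lambda c: sigmoid(sum(w * f for w, f in zip(classifiers[c], x))))

clf = {1: [2.0, -1.0], 2: [-1.0, 2.0], 3: [-1.0, -1.0]}  # hypothetical weights, two features
print(one_vs_rest_predict([3.0, 0.5], clf))  # class 1 scores highest for this input
```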
SoftMax

Read more on the difference between Softmax and Sigmoid (6 min):
https://dataaspirant.com/difference-between-softmax-function-and-sigmoid-function/
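For reference, softmax generalizes the sigmoid to multiple classes by normalizing a vector of scores into probabilities; a minimal sketch:

```python
import math

def softmax(scores):
    """Turn a vector of real-valued scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest score gets the largest probability
print(sum(probs))  # 1.0 (up to floating-point rounding)
```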
Evaluation Metrics
True Positive (TP):

• Predicted True and True in reality.

True Negative (TN):

• Predicted False and False in reality.

False Positive (FP):

• Predicted True and False in reality.

False Negative (FN):

• Predicted False and True in reality.


Confusion Matrix
Evaluation Metrics

Accuracy
• is defined as the percentage of correct predictions for the test data. It can be calculated easily by dividing the number of
correct predictions by the number of total predictions.
Precision
• is defined as the fraction of relevant examples (true positives) among all of the examples which were predicted to
belong in a certain class.
Recall
• is defined as the fraction of examples predicted to belong to a class among all of the examples that truly belong to the class.
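The three definitions above can be sketched from paired label lists (binary labels, with 1 as the positive class):

```python
def evaluation_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall from paired true/predicted label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(evaluation_metrics(y_true, y_pred))  # accuracy 4/6, precision 2/3, recall 2/3
```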
Evaluation Metrics
Evaluate the performance of a COVID-19 antigen test kit

Let’s take a hypothetical kit that tests 20 individuals for potential cases of COVID-19.

In this sample population we will state:

• 16 people do NOT have COVID-19 (so 4 people do)

The Reality vs. What the Model Predicts:

[Confusion-matrix table comparing the kit’s predictions against actual COVID-19 status.]
Arithmetic Average vs Harmonic Average
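Why a harmonic rather than arithmetic average? The F1 score (the harmonic mean of precision and recall) punishes imbalance between the two, which the arithmetic mean does not; a small sketch:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(1.0, 0.1))      # ~0.18: the harmonic mean is dragged down by the low recall
print((1.0 + 0.1) / 2)   # 0.55: the arithmetic mean hides the problem
```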
Jaccard Index and Dice Score
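Both overlap measures are simple set computations; a sketch on two small sets:

```python
def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice score: 2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

a, b = {1, 2, 3, 4}, {3, 4, 5}
print(jaccard(a, b))  # 2/5 = 0.4
print(dice(a, b))     # 4/7 ~ 0.571
```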
Other Applications
Semantic Segmentation Automatic Speech Recognition
Micro/Macro Average

Our model is 50% accurate?

Last Experiment:

Class      Frequency   Metric
Class A        5        100%
Class B       95          0%
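The "Last Experiment" numbers above show why the averaging choice matters; a sketch computing both averages from per-class counts:

```python
def macro_micro(per_class_correct, per_class_total):
    """Macro: mean of per-class accuracies. Micro: pooled accuracy over all examples."""
    macro = sum(c / t for c, t in zip(per_class_correct, per_class_total)) / len(per_class_total)
    micro = sum(per_class_correct) / sum(per_class_total)
    return macro, micro

# Class A: 5/5 correct (100%); Class B: 0/95 correct (0%)
macro, micro = macro_micro([5, 0], [5, 95])
print(macro)  # 0.5  -> the misleading "50% accurate"
print(micro)  # 0.05 -> only 5% of examples are actually classified correctly
```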
Coming up Next…

• Week 4 – Clustering
  • K-Means, Gaussian Mixture Models, & EM
• Homework #2 due Friday, May 31 (@ 7pm Pacific Time)

HW2 – Q5 Walkthrough
https://colab.research.google.com/drive/1iJc23kLBuCEeIygTysHxJc83BCfbHUVK
Questions?
