Introduction to
Machine Learning
Dr. Muhammad Amjad Iqbal
Associate Professor
University of Central Punjab, Lahore.
[Link]@[Link]
[Link]
Based on slides by Prof. Dr. Andrew Ng (Stanford) and Dr. Humayoun
Logistic Regression
A Classification Algorithm
One of the most popular and most widely
used learning algorithms today
Classification
Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign ?
y ∈ {0, 1}
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class” (e.g., malignant tumor)

Multi-class: y ∈ {0, 1, 2, 3}
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class 1” (e.g., type 1 tumor)
2: “Positive Class 2” (e.g., type 2 tumor)
…
Threshold the classifier output h_θ(x) at 0.5:
If h_θ(x) ≥ 0.5, predict “y = 1”
If h_θ(x) < 0.5, predict “y = 0”
Using linear regression this way is a bad thing to do for classification; in the earlier example we just got lucky!
Classification: y = 0 or 1
With linear regression, h_θ(x) can be > 1 or < 0
Logistic Regression: 0 ≤ h_θ(x) ≤ 1
Despite its name, logistic regression is used for the classification task.
Hypothesis Representation
Logistic Regression Model
Want 0 ≤ h_θ(x) ≤ 1
h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^{−z})
g is called the sigmoid function (or logistic function): g(0) = 0.5, g(z) → 1 as z → +∞, g(z) → 0 as z → −∞
Need to select the parameters θ so that this hypothesis fits the data.
We will do this with an algorithm later.
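As a small illustration (not from the slides), here is a minimal NumPy sketch of the sigmoid and the resulting hypothesis; the function names are mine:

import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); x includes the intercept term x_0 = 1."""
    return sigmoid(np.dot(theta, x))

# g(0) = 0.5, large positive z -> close to 1, large negative z -> close to 0
print(sigmoid(0.0), sigmoid(10.0), sigmoid(-10.0))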
Interpretation of Hypothesis Output
h_θ(x) = estimated probability that y = 1 on a new input x
Example: if h_θ(x) = 0.7 for a patient’s tumor features, tell the patient there is a 70% chance of the tumor being malignant.
h_θ(x) = P(y = 1 | x; θ)  —  “probability that y = 1, given x, parameterized by θ”
Since y = 0 or 1:  P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)
Decision boundary
Logistic regression estimates P(y = 1 | x; θ) (and hence P(y = 0 | x; θ)):
h_θ(x) = g(θ^T x),  g(z) = 1 / (1 + e^{−z})
If h_θ(x) ≥ 0.5 then predict y = 1
  — if z is +ve then g(z) ≥ 0.5, i.e. θ^T x ≥ 0
If h_θ(x) < 0.5 then predict y = 0
  — if z is −ve then g(z) < 0.5, i.e. θ^T x < 0
Decision Boundary example
h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2), with θ_0 = −3, θ_1 = 1, θ_2 = 1
Predict “y = 1” if −3 + x_1 + x_2 ≥ 0, i.e. if x_1 + x_2 ≥ 3
The decision boundary is the line x_1 + x_2 = 3, on which h_θ(x) = 0.5
Any example with features x_1, x_2 satisfying x_1 + x_2 ≥ 3 is predicted as y = 1
[Figure: the line x_1 + x_2 = 3 in the (x_1, x_2) plane, separating the two classes]
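A minimal sketch of this example in NumPy, assuming the parameter values θ = (−3, 1, 1) implied by the boundary x_1 + x_2 = 3; the helper names are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])   # theta_0, theta_1, theta_2

def predict(x1, x2):
    """Predict y = 1 when h_theta(x) >= 0.5, i.e. when -3 + x1 + x2 >= 0."""
    h = sigmoid(theta @ np.array([1.0, x1, x2]))
    return 1 if h >= 0.5 else 0

print(predict(1.0, 1.0))   # x1 + x2 = 2 < 3  -> predicts 0
print(predict(2.5, 2.5))   # x1 + x2 = 5 >= 3 -> predicts 1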
5
11/14/2023
Non-linear decision boundaries
h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2²), e.g. with θ = [−1; 0; 0; 1; 1]
Predict “y = 1” if −1 + x_1² + x_2² ≥ 0, i.e. if x_1² + x_2² ≥ 1
The decision boundary is the unit circle x_1² + x_2² = 1; points outside the circle are predicted as y = 1
[Figure: the unit circle in the (x_1, x_2) plane; with higher-order polynomial features, even more complex decision boundaries are possible]
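Likewise, a sketch of the circular boundary, assuming the illustrative parameter vector θ = (−1, 0, 0, 1, 1) applied to the features (1, x_1, x_2, x_1², x_2²):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # weights for (1, x1, x2, x1^2, x2^2)

def predict(x1, x2):
    """Predict y = 1 when -1 + x1^2 + x2^2 >= 0, i.e. outside the unit circle."""
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return 1 if sigmoid(theta @ features) >= 0.5 else 0

print(predict(0.2, 0.2))   # inside the unit circle  -> 0
print(predict(1.5, 0.0))   # outside the unit circle -> 1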
Cost function
To fit the parameters θ:
Training set: {(x^{(1)}, y^{(1)}), …, (x^{(m)}, y^{(m)})} — m examples, y ∈ {0, 1}
Hypothesis: h_θ(x) = 1 / (1 + e^{−θ^T x})
How do we choose the parameters θ?
Cost function
Linear regression: J(θ) = (1/m) Σ_{i=1}^{m} (1/2) (h_θ(x^{(i)}) − y^{(i)})²
For logistic regression, h_θ(x) = 1 / (1 + e^{−θ^T x}), so this squared-error cost is a “non-convex function” of θ; we want a cost that is a “convex function”.
Logistic regression cost function
Cost(h_θ(x), y) = −log(h_θ(x))      if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x))  if y = 0

If y = 1:
Cost = 0 if h_θ(x) = 1, but as h_θ(x) → 0, Cost → ∞
This captures the intuition that if h_θ(x) = 0 (i.e. we predict P(y = 0 | x; θ) = 1) but in fact y = 1, we penalize the learning algorithm by a very large cost.
[Figure: −log(h_θ(x)) plotted for h_θ(x) between 0 and 1]

If y = 0:
Cost = 0 if y = 0 and h_θ(x) = 0, but as h_θ(x) → 1, Cost → ∞
Similarly, predicting h_θ(x) = 1 when in fact y = 0 is penalized by a very large cost.
[Figure: −log(1 − h_θ(x)) plotted for h_θ(x) between 0 and 1]
Simplified cost function and gradient descent
Logistic regression cost function
Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))
If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))
If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))
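A vectorized sketch of this cost function; the names and shapes (X an m×(n+1) design matrix with a leading column of ones, y a 0/1 vector) are my assumptions, not from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = (1/m) * sum( -y*log(h) - (1-y)*log(1-h) )."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

# Tiny example: two training points
X = np.array([[1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 1.0])
print(cost(np.zeros(2), X, y))   # equals log(2) when theta = 0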
Logistic regression cost function
Why do we choose this function when other cost functions exist?
• This cost function can be derived from statistics using the principle of maximum likelihood estimation
  – An efficient method for finding the parameters of different models from data
  – It is a convex function
Logistic regression cost function
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log(h_θ(x^{(i)})) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ]
To fit the parameters θ:  min_θ J(θ)
To make a prediction given a new x:
Output h_θ(x) = 1 / (1 + e^{−θ^T x}) — the hypothesis, estimating the probability that y = 1
Gradient Descent
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log(h_θ(x^{(i)})) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ]
Want min_θ J(θ):
Repeat {
  θ_j := θ_j − α (∂/∂θ_j) J(θ)
}  (simultaneously update all θ_j)

(∂/∂θ_j) J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
Gradient Descent
Want min_θ J(θ):
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
}  (simultaneously update all θ_j)
The algorithm looks identical to linear regression!
Difference: the hypothesis h_θ(x) is now the sigmoid of θ^T x rather than θ^T x itself, so the two algorithms are actually very different from each other.
Hypothesis: h_θ(x) = 1 / (1 + e^{−θ^T x})

Cost function: J(θ) = (1/m) Σ_{i=1}^{m} ( −y^{(i)} log(h_θ(x^{(i)})) − (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) )

Gradient Descent: θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
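Putting the pieces together, a minimal batch-gradient-descent sketch for logistic regression; the toy data, learning rate, and iteration count are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Minimize J(theta) with theta_j := theta_j - alpha * (1/m) * sum((h - y) * x_j)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        error = sigmoid(X @ theta) - y          # (h_theta(x^(i)) - y^(i)) for all i
        theta -= alpha * (X.T @ error) / m      # simultaneous update of all theta_j
    return theta

# Toy 1-D example: label is 1 when the feature is large
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(theta, sigmoid(X @ theta))   # predicted probabilities approach 0, 0, 1, 1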
Advanced optimization
Optimization algorithm
Cost function J(θ). Want min_θ J(θ).
Given θ, we have code that can compute
  - J(θ)
  - (∂/∂θ_j) J(θ)   (for j = 0, 1, …, n)
Gradient descent:
Repeat {  θ_j := θ_j − α (∂/∂θ_j) J(θ)  }
Optimization algorithm
Given θ, we have code that can compute
  - J(θ)
  - (∂/∂θ_j) J(θ)   (for j = 0, 1, …, n)
Optimization algorithms:
  - Gradient descent
  - Newton-Raphson’s method
  - Conjugate gradient
  - BFGS (Broyden-Fletcher-Goldfarb-Shanno)
  - L-BFGS (Limited-memory BFGS)
Optimization algorithms: Conjugate gradient, BFGS, etc.
+ No need to manually pick α (the learning rate)
+ Have a clever inner loop (a line search algorithm) which tries a bunch of α values and picks a good one
+ Often faster than gradient descent
+ Can be used successfully without understanding their complexity
− Very complicated
− Could make debugging more difficult
− Should not be implemented yourself (implement them only if you are an expert in numerical computing)
− Different libraries may use different implementations, which may affect performance
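In practice you call a library routine rather than coding these methods yourself. For example, a sketch using SciPy's general-purpose minimizer with the BFGS method, where we only supply code for J(θ) and its gradient (data and helper names are illustrative):

import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

def grad(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])

# BFGS picks its own step sizes; no learning rate needs to be chosen by hand.
result = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y), jac=grad, method="BFGS")
print(result.x)   # optimized theta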
Prediction
Once you have optimized 𝜃, compute:
h_θ(x) = Sigmoid(θ_0 x_0 + θ_1 x_1 + ⋯ + θ_n x_n)
If h_θ(x) ≥ 0.5 then predict y = 1, else predict y = 0
Multi-class classification
One-vs-all algorithm
Multiclass classification
Email foldering/tagging: Work, Friends, Family, Hobby
Medical diagnosis: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow
Binary classification vs. multi-class classification:
[Figure: two scatter plots in the (x_1, x_2) plane — two classes on the left, three classes on the right]
One-vs-all (one-vs-rest):
Turn the problem into separate binary problems:
  Class 1: train h_θ^{(1)}(x) to separate class 1 from the rest
  Class 2: train h_θ^{(2)}(x) to separate class 2 from the rest
  Class 3: train h_θ^{(3)}(x) to separate class 3 from the rest
[Figure: the three resulting binary decision boundaries in the (x_1, x_2) plane]
One-vs-all
Train a logistic regression classifier h_θ^{(i)}(x) for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes h_θ^{(i)}(x).
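A sketch of one-vs-all built on a plain gradient-descent trainer; class labels are assumed to be 0 … K−1 and the helper names are mine:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iters=5000):
    """Plain logistic regression by gradient descent (y must be 0/1)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * (X.T @ (sigmoid(X @ theta) - y)) / len(y)
    return theta

def train_one_vs_all(X, y, num_classes):
    """Train one classifier per class i, relabeling y == i as 1 and everything else as 0."""
    return np.array([train_binary(X, (y == i).astype(float)) for i in range(num_classes)])

def predict_one_vs_all(thetas, X):
    """Pick, for each example, the class whose classifier gives the highest h_theta(x)."""
    return np.argmax(sigmoid(X @ thetas.T), axis=1)

# Toy 1-D data with three classes
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 4.0], [1.0, 5.0], [1.0, 9.0], [1.0, 10.0]])
y = np.array([0, 0, 1, 1, 2, 2])
thetas = train_one_vs_all(X, y, 3)
print(predict_one_vs_all(thetas, X))   # predicted class for each training example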
Regularization
The problem of overfitting
• So far we have seen a few learning algorithms
• They work well for many applications, but can suffer from the problem of overfitting
Overfitting with linear regression
Example: Linear regression (housing prices)
[Figure: three fits of Price vs. Size — from underfitting, to a good fit, to overfitting]
Overfitting: If we have too many features, the learned hypothesis may fit the training set very well (cost J(θ) ≈ 0), but fail to generalize to new examples (e.g. predicting prices of houses it has not seen).
The hypothesis is simply too flexible, and we don’t have enough data to constrain it to give us a good hypothesis.
Example: Logistic regression
[Figure: three decision boundaries in the (x_1, x_2) plane — from underfitting, to a good fit, to overfitting]
(g = sigmoid function)
Addressing overfitting:
Housing example with many features: size of house, no. of bedrooms, no. of floors, age of house, average income in neighborhood, kitchen size, …
[Figure: Price vs. Size with an overly flexible fit]
• Plotting the hypothesis is one way to decide whether overfitting occurs or not
• But with lots of features and little data we cannot visualize the hypothesis, and therefore it is:
  • Hard to select the degree of polynomial
  • Hard to decide which features to keep and which to drop
Addressing overfitting:
Options:
1. Reduce the number of features (but this means losing information).
   ― Manually select which features to keep.
   ― Model selection algorithm (later in the course).
2. Regularization.
   ― Keep all the features, but reduce the magnitude/values of the parameters θ_j.
   ― Works well when we have a lot of features, each of which contributes a bit to predicting y.
Cost function
Intuition
[Figure: Price vs. Size of house — a quadratic fit next to a wiggly higher-order polynomial fit]
Suppose we penalize the higher-order parameters (e.g. θ_3 and θ_4) and make them really small: the higher-order terms effectively disappear and the fit becomes close to quadratic.
Regularization.
Small values for the parameters θ_1, …, θ_n
― “Simpler” hypothesis
― Less prone to overfitting
Housing example:
― Features: x_1, x_2, …, x_n
― Parameters: θ_0, θ_1, …, θ_n
Unlike the polynomial example, we don’t know which of these are the high-order terms, so how do we pick the ones that need to be shrunk?
With regularization, we take the cost function and modify it to shrink all the parameters.
By convention we don’t penalize θ_0; the minimization is over θ_1 onwards.
Regularization.
• Using the regularized objective (i.e. the cost function with the regularization term)
• We get a much smoother curve which fits the data and gives a much better hypothesis
[Figure: Price vs. Size of house — the regularized fit is smoother]
λ is the regularization parameter.
It controls a trade-off between our two goals:
1) Fit the training set well
2) Keep the parameters small
In regularized linear regression, we choose θ to minimize
J(θ) = (1/2m) [ Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})² + λ Σ_{j=1}^{n} θ_j² ]
What if λ is set to an extremely large value (perhaps too large for our problem)?
- Algorithm works fine; setting λ to be very large can’t hurt it
- Algorithm fails to eliminate overfitting
- Algorithm results in underfitting (fails to fit even the training data well)
- Gradient descent will fail to converge
If λ is extremely large, all of θ_1, …, θ_n are pushed close to zero, leaving h_θ(x) ≈ θ_0 — a flat line that underfits the data.
[Figure: Price vs. Size of house — a nearly horizontal fit when λ is too large]
Regularized linear regression
Regularized linear regression
Gradient descent
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_0^{(i)}                      (same as before)
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) θ_j ]      (regularized, j = 1, …, n)
}
Equivalently: θ_j := θ_j (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
Interesting term: (1 − α λ/m) < 1, e.g. 0.99, so each update first shrinks θ_j (θ_j × 0.99) and then applies the same gradient step as before.
Usually the learning rate is small and m is large, so (1 − α λ/m) is only slightly less than 1.
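A sketch of a single regularized update step for linear regression, written so the (1 − αλ/m) shrinkage is explicit; variable names and data are illustrative:

import numpy as np

def regularized_update(theta, X, y, alpha, lam):
    """One gradient-descent step for regularized linear regression.

    theta_0 is updated as before; theta_1..theta_n are first shrunk by
    (1 - alpha*lam/m) and then receive the usual gradient step.
    """
    m = len(y)
    error = X @ theta - y                 # h_theta(x^(i)) - y^(i) for the linear hypothesis
    grad = X.T @ error / m
    new_theta = theta.copy()
    new_theta[0] -= alpha * grad[0]                                   # unregularized intercept
    new_theta[1:] = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
    return new_theta

theta = np.array([1.0, 2.0, -3.0])
X = np.array([[1.0, 0.5, 1.2], [1.0, 1.5, -0.3], [1.0, 2.0, 0.8]])
y = np.array([1.0, 2.0, 3.0])
print(regularized_update(theta, X, y, alpha=0.01, lam=10.0))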
Regularized logistic regression
Regularized logistic regression.
[Figure: an overfit decision boundary in the (x_1, x_2) plane that regularization smooths out]
Cost function:
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log(h_θ(x^{(i)})) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
Gradient descent
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_0^{(i)}
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) θ_j ]   (regularized, j = 1, …, n)
}
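A sketch of the regularized cost and gradient for logistic regression; these could be passed to a gradient-descent loop or to an optimizer such as the one shown earlier (names and toy data are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """J(theta) plus the (lambda / 2m) * sum(theta_j^2) penalty (theta_0 excluded)."""
    m = len(y)
    h = sigmoid(X @ theta)
    unreg = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    return unreg + lam / (2 * m) * np.sum(theta[1:] ** 2)

def grad_reg(theta, X, y, lam):
    """Gradient with the (lambda/m) * theta_j term added for j >= 1."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]
    return grad

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.zeros(2)
print(cost_reg(theta, X, y, lam=1.0), grad_reg(theta, X, y, lam=1.0))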
End