Introduction To Machine Learning: Dr. Muhammad Amjad Iqbal

This document provides an introduction to logistic regression, a popular machine learning classification algorithm. Logistic regression models the probability of an output belonging to one of two classes using a logistic function. The logistic regression cost function is minimized during training using gradient descent, which iteratively updates the model parameters to reduce the cost. Logistic regression is widely used for tasks such as spam detection, fraud detection, and medical diagnosis where the output is binary.


11/14/2023

Introduction to
Machine Learning
Dr. Muhammad Amjad Iqbal
Associate Professor
University of Central Punjab, Lahore.
[Link]@[Link]

[Link]
Slides adapted from Prof. Dr. Andrew Ng, Stanford, and Dr. Humayoun

Logistic Regression
A Classification Algorithm

One of the most popular and most widely used learning algorithms today.

Classification

Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?

y ∈ {0, 1}
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class” (e.g., malignant tumor)

For multi-class problems, y ∈ {0, 1, 2, 3}:
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class 1” (e.g., type 1 tumor)
2: “Positive Class 2” (e.g., type 2 tumor)
3: “Positive Class 3” (e.g., type 3 tumor)


Threshold the classifier output h_θ(x) at 0.5:

If h_θ(x) ≥ 0.5, predict “y = 1”
If h_θ(x) < 0.5, predict “y = 0”

Thresholding a linear-regression fit is a bad way to do classification;
in the earlier example we just got lucky!

Classification: y = 0 or 1, but a linear-regression hypothesis h_θ(x) can be > 1 or < 0.

Logistic Regression: a model built for the classification task, with 0 ≤ h_θ(x) ≤ 1.

Hypothesis Representation

Logistic Regression Model
Want 0 ≤ h_θ(x) ≤ 1:

h_θ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z))

g is called the sigmoid function (or logistic function): g(0) = 0.5, g(z) → 1 as
z → +∞, and g(z) → 0 as z → −∞.
[Figure: sigmoid curve, crossing 0.5 at z = 0]
We need to select the parameters θ so that this model fits the data; we do this
with an algorithm later.
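The sigmoid hypothesis above can be sketched in a few lines of Python with NumPy (the function names here are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); theta and x are 1-D arrays of equal length."""
    return sigmoid(np.dot(theta, x))
```

As the slide notes, g(0) = 0.5 and the output is always squashed into (0, 1), so it can be read as a probability.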


Interpretation of Hypothesis Output

h_θ(x) = estimated probability that y = 1 on a new input x.

Example: If h_θ(x) = 0.7, tell the patient there is a 70% chance of the tumor
being malignant.

h_θ(x) = P(y = 1 | x; θ): “the probability that y = 1, given x, parameterized by θ.”
Since y = 0 or 1, P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ).

Decision boundary

Logistic regression estimates P(y = 1 | x; θ) (and hence P(y = 0 | x; θ)),
with h_θ(x) = g(θᵀx).

If h_θ(x) ≥ 0.5, predict y = 1.
  This happens exactly when z = θᵀx is positive (or zero): g(z) ≥ 0.5 iff θᵀx ≥ 0.
If h_θ(x) < 0.5, predict y = 0.
  This happens exactly when z = θᵀx is negative: g(z) < 0.5 iff θᵀx < 0.


Decision Boundary

Example: the line x₁ + x₂ = 3 is exactly where h_θ(x) = 0.5.
[Figure: the line x₁ + x₂ = 3 in the (x₁, x₂) plane, crossing both axes at 3]
For any example with features x₁, x₂ that satisfy x₁ + x₂ ≥ 3, predict “y = 1”.
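One parameter choice consistent with this boundary is θ = (−3, 1, 1), since then θᵀx = −3 + x₁ + x₂ ≥ 0 exactly when x₁ + x₂ ≥ 3. (The slide does not state θ explicitly; this value is an assumption matching the figure.) A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed parameters: theta = (-3, 1, 1) gives the boundary x1 + x2 = 3.
theta = np.array([-3.0, 1.0, 1.0])

def predict(x1, x2):
    """Predict y = 1 iff h_theta(x) >= 0.5, i.e. iff theta^T x >= 0."""
    h = sigmoid(theta @ np.array([1.0, x1, x2]))  # x0 = 1 is the intercept term
    return 1 if h >= 0.5 else 0
```

For instance, a point with x₁ + x₂ = 4 lands on the y = 1 side, and one with x₁ + x₂ = 2 on the y = 0 side.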


Non-linear decision boundaries

With polynomial features the boundary need not be a line. For example, with
h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²), a suitable θ gives a circular
boundary: predict “y = 1” if x₁² + x₂² ≥ 1.
[Figure: the unit circle x₁² + x₂² = 1 as a decision boundary; a second panel
sketches an even more complex boundary from higher-order features]
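The circular boundary above corresponds, for instance, to θ = (−1, 0, 0, 1, 1) over the features (1, x₁, x₂, x₁², x₂²); this θ is an assumption matching the figure, not stated on the slide. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed parameters: theta = (-1, 0, 0, 1, 1) over (1, x1, x2, x1^2, x2^2)
# gives the circular boundary x1^2 + x2^2 = 1.
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def predict(x1, x2):
    """y = 1 outside (or on) the unit circle, y = 0 inside it."""
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return 1 if sigmoid(theta @ features) >= 0.5 else 0
```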

Cost function

To fit the parameters θ:

Training set: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)} — m examples, y ∈ {0, 1}.

How do we choose the parameters θ?


Cost function

Linear regression uses the squared-error cost, which is convex. For logistic
regression, h_θ(x) = 1 / (1 + e^(−θᵀx)), and plugging this hypothesis into the
squared-error cost yields a non-convex function with many local optima — so we
need a different, convex cost function.

Logistic regression cost function

If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))

Cost = 0 if y = 1 and h_θ(x) = 1; but as h_θ(x) → 0, Cost → ∞.
[Figure: −log(h_θ(x)) plotted over h_θ(x) ∈ (0, 1]]


Logistic regression cost function

If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))

Cost = 0 if y = 0 and h_θ(x) = 0; but as h_θ(x) → 1, Cost → ∞.
[Figure: −log(1 − h_θ(x)) plotted over h_θ(x) ∈ [0, 1)]
This captures the intuition that if the model predicts h_θ(x) = 1
(i.e., P(y = 1 | x; θ) = 1) but in fact y = 0, we penalize the learning
algorithm with a very large cost.

Simplified cost function and gradient descent

Logistic regression cost function

Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))

If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))
If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))
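The combined cost, averaged over the m training examples, can be sketched directly (a vectorized version using NumPy; the function name is mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = (1/m) * sum( -y*log(h) - (1-y)*log(1-h) ).

    X is an (m, n) design matrix whose first column is all ones;
    y is an (m,) vector of 0/1 labels.
    """
    h = sigmoid(X @ theta)
    m = len(y)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
```

For example, at θ = 0 every prediction is h = 0.5, so J(θ) = −log(0.5) = log 2 regardless of the labels.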

Logistic regression cost function

Why do we choose this function when other cost functions exist?
• This cost function can be derived from statistics using the principle of
  maximum likelihood estimation
  – An efficient method for finding the parameters of different models from data
  – It is a convex function

Logistic regression cost function

To fit the parameters θ: min_θ J(θ)

To make a prediction given a new x, output h_θ(x) = 1 / (1 + e^(−θᵀx)),
the hypothesis’ estimate of the probability that y = 1.


Gradient Descent

Want min_θ J(θ):
Repeat {
    θⱼ := θⱼ − α (∂/∂θⱼ) J(θ)
} (simultaneously update all θⱼ)

(∂/∂θⱼ) J(θ) = (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

Gradient Descent

Repeat {
    θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾
} (simultaneously update all θⱼ)

The algorithm looks identical to linear regression! The difference is the
hypothesis: for linear regression h_θ(x) = θᵀx, while here h_θ(x) is the sigmoid
of θᵀx — so the two algorithms are actually very different.

Hypothesis: h_θ(x) = 1 / (1 + e^(−θᵀx))

Cost function:
J(θ) = (1/m) Σᵢ₌₁ᵐ [ −y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) − (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]

Gradient descent:
θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾
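The three pieces above — hypothesis, cost, and update rule — fit together in a short training loop. A minimal sketch in Python with NumPy (the toy data is mine, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent for logistic regression.

    Each iteration applies theta_j := theta_j - alpha*(1/m)*sum((h - y) * x_j)
    simultaneously for all j, as one vectorized update.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta -= alpha * (X.T @ (h - y)) / m
    return theta

# Toy data (assumed): y = 1 roughly when the feature exceeds 2.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
```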


Advanced optimization

Optimization algorithm
Cost function J(θ). Want min_θ J(θ).
Given θ, we have code that can compute:
- J(θ)
- (∂/∂θⱼ) J(θ)   (for j = 0, 1, …, n)

Gradient descent:
Repeat {
    θⱼ := θⱼ − α (∂/∂θⱼ) J(θ)
}

Optimization algorithm
Given θ, we have code that can compute:
- J(θ)
- (∂/∂θⱼ) J(θ)   (for j = 0, 1, …, n)

Optimization algorithms:
- Gradient descent
- Newton-Raphson method
- Conjugate gradient
- BFGS (Broyden-Fletcher-Goldfarb-Shanno)
- L-BFGS (Limited-memory BFGS)


Optimization algorithms: Conjugate gradient, BFGS, etc.

+ No need to manually pick α (the learning rate)
+ Have a clever inner loop (a line-search algorithm) that tries a bunch of
  α values and picks a good one
+ Often faster than gradient descent
+ Can be used successfully without understanding their complexity
‾ Very complicated
‾ Can make debugging more difficult
‾ You should not implement them yourself (only do so if you are an expert
  in numerical computing)
‾ Different libraries use different implementations, which may affect
  performance
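In practice this means handing J(θ) and its gradient to a library routine rather than writing the optimizer yourself. A hedged sketch using SciPy's `minimize` with BFGS (the slides do not name a specific library, and the toy data is mine):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient together, as jac=True expects."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Toy, non-separable data (assumed for illustration).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0],
              [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# The optimizer picks its own step sizes; no alpha to tune.
res = minimize(cost_and_grad, np.zeros(2), args=(X, y), jac=True, method="BFGS")
theta = res.x
```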

Prediction

Once you have optimized θ, compute:

h_θ(x) = Sigmoid(θ₀x₀ + θ₁x₁ + ⋯ + θₙxₙ)

If h_θ(x) ≥ 0.5 then y = 1, else y = 0.

Multi-class classification
One-vs-all algorithm


Multiclass classification

Email foldering/tagging: Work, Friends, Family, Hobby
Medical diagnosis: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow

[Figure: binary classification (two classes) vs. multi-class classification
(three classes) in the (x₁, x₂) plane]

One-vs-all (one-vs-rest):

Turn the multi-class problem into binary problems, training one classifier
per class:
Class 1 vs. rest: h_θ⁽¹⁾(x)
Class 2 vs. rest: h_θ⁽²⁾(x)
Class 3 vs. rest: h_θ⁽³⁾(x)
[Figure: three binary decision boundaries, one per class, in the (x₁, x₂) plane]


One-vs-all

Train a logistic regression classifier h_θ⁽ⁱ⁾(x) for each class i to predict
the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes h_θ⁽ⁱ⁾(x).
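The prediction step can be sketched as follows; the parameter values in `Theta` are illustrative placeholders (one already-trained parameter vector per class), not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One row of parameters per class: Theta[i] parameterizes h^(i), which was
# trained to estimate P(y = i | x).  Values are illustrative assumptions.
Theta = np.array([
    [ 2.0, -1.0,  0.0],  # class 0 vs. rest
    [-1.0,  1.0, -1.0],  # class 1 vs. rest
    [-2.0,  0.0,  1.5],  # class 2 vs. rest
])

def predict(x):
    """Pick the class i whose classifier output h^(i)(x) is largest."""
    probs = sigmoid(Theta @ x)  # one probability per class
    return int(np.argmax(probs))
```

Note that the three probabilities need not sum to 1; only their argmax matters.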

Regularization


The problem of overfitting

• So far we’ve seen a few learning algorithms
• They work well for many applications, but can suffer from the problem of
  overfitting


Overfitting with linear regression

Example: Linear regression (housing prices)
[Figure: three fits of price vs. size — underfit (straight line), good fit
(quadratic), overfit (high-order polynomial)]

Overfitting: If we have too many features, the learned hypothesis may fit the
training set very well (J(θ) ≈ 0), but fail to generalize to new examples
(predict prices on new examples). The hypothesis is just too large and too
variable, and we don’t have enough data to constrain it to give us a good
hypothesis.

Example: Logistic regression
[Figure: three decision boundaries in the (x₁, x₂) plane — underfit (a line),
good fit, and overfit (a highly contorted boundary)] (g = sigmoid function)

Addressing overfitting:

A house may have many features: size of house, no. of bedrooms, no. of floors,
age of house, average income in neighborhood, kitchen size, …
• Plotting the hypothesis is one way to decide whether overfitting occurs
• But with lots of features and little data we cannot visualize, and therefore:
  • It is hard to select the degree of polynomial
  • It is hard to decide which features to keep and which to drop


Addressing overfitting:

Options:
1. Reduce the number of features (but this means losing information).
   ― Manually select which features to keep.
   ― Model selection algorithm (later in course).
2. Regularization.
   ― Keep all the features, but reduce the magnitude/values of the
     parameters θⱼ.
   ― Works well when we have a lot of features, each of which contributes
     a bit to predicting y.

Cost function


Intuition
[Figure: price vs. size of house — a quadratic fit next to an overfit
higher-order polynomial]

Suppose we penalize the high-order parameters (say θ₃ and θ₄) and make them
really small: the high-order fit then behaves almost like the quadratic.


Regularization

Small values for the parameters θ₀, θ₁, …, θₙ give:
― A “simpler” hypothesis
― Less prone to overfitting

Housing example:
― Features: x₁, …, xₙ;  Parameters: θ₀, θ₁, …, θₙ
Unlike the polynomial example, we don’t know which of these are the high-order
terms — so how do we pick the ones that need to be shrunk? With regularization,
we take the cost function and modify it to shrink all the parameters.

By convention we don’t penalize θ₀; the regularization sum runs from θ₁ onwards.

Regularization

• Using the regularized objective (i.e., the cost function with a
  regularization term), we get a much smoother curve that fits the data
  and gives a much better hypothesis.
[Figure: price vs. size of house — the regularized fit is smooth]

λ is the regularization parameter. It controls a trade-off between our two goals:
1) Fit the training set well.
2) Keep the parameters small.


In regularized linear regression, we choose θ to minimize:
J(θ) = (1/2m) [ Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ Σⱼ₌₁ⁿ θⱼ² ]

What if λ is set to an extremely large value (perhaps too large for our problem)?
- Algorithm works fine; setting λ to be very large can’t hurt it
- Algorithm fails to eliminate overfitting
- Algorithm results in underfitting (fails to fit even the training data well)
- Gradient descent will fail to converge

If λ is extremely large, all of θ₁, …, θₙ are penalized toward ≈ 0, leaving
h_θ(x) ≈ θ₀: a flat line that underfits the training data.
[Figure: price vs. size of house with a horizontal line h_θ(x) = θ₀]

Regularized linear regression



Regularized linear regression

Gradient descent:
Repeat {
    θ₀ := θ₀ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) x₀⁽ⁱ⁾
    θⱼ := θⱼ − α [ (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ + (λ/m) θⱼ ]   (j = 1, …, n)
}

The θⱼ update can be rewritten as:
    θⱼ := θⱼ (1 − α λ/m) − α (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

The second term is the same as before. The interesting term is (1 − α λ/m) < 1,
e.g. 0.99: every update first shrinks θⱼ slightly (θⱼ × 0.99) before the usual
gradient step. Usually the learning rate is small and m is large, so the shrink
factor is just below 1.
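One step of this update can be sketched directly; the key detail is that θ₀ is left out of the penalty (the function name is mine):

```python
import numpy as np

def regularized_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent step for linear regression.

    theta_0 is not penalized; for j >= 1 the update is equivalent to
    theta_j := theta_j * (1 - alpha*lam/m) - alpha*(1/m)*sum((h - y) * x_j).
    """
    m = len(y)
    h = X @ theta                 # linear hypothesis h_theta(x) = theta^T x
    grad = X.T @ (h - y) / m      # unregularized gradient
    reg = (lam / m) * theta       # regularization term (lambda/m) * theta_j
    reg[0] = 0.0                  # by convention, don't penalize theta_0
    return theta - alpha * (grad + reg)
```

When the data is fit exactly (gradient zero), the step reduces to the pure shrink θⱼ × (1 − αλ/m), which the test below checks.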

Regularized logistic regression



Regularized logistic regression

[Figure: an overfit decision boundary in the (x₁, x₂) plane, smoothed by
regularization]

Cost function:
J(θ) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
       + (λ/2m) Σⱼ₌₁ⁿ θⱼ²

Gradient descent:
Repeat {
    θ₀ := θ₀ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) x₀⁽ⁱ⁾
    θⱼ := θⱼ − α [ (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ + (λ/m) θⱼ ]   (j = 1, …, n)
}
This looks identical to regularized linear regression, but h_θ(x) is now the
sigmoid hypothesis.
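A minimal sketch of the regularized cost and one update step, assuming the same notation as above (function names are mine; the toy data in the test is assumed for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_cost(theta, X, y, lam):
    """J(theta) with the (lambda/2m) * sum_{j>=1} theta_j^2 penalty."""
    m = len(y)
    h = sigmoid(X @ theta)
    unreg = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    return unreg + (lam / (2 * m)) * np.sum(theta[1:] ** 2)

def reg_step(theta, X, y, alpha, lam):
    """One gradient-descent step; identical in form to regularized linear
    regression, but h_theta is now the sigmoid hypothesis."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                  # theta_0 is not penalized
    return theta - alpha * (grad + reg)
```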

End

