Classification

This document discusses the concepts of classification, focusing on logistic regression as a preferred method over linear regression for binary outcomes due to its ability to produce probabilities between 0 and 1. It includes examples, such as credit card default prediction, and explains the logistic regression model's parameters and their estimation through maximum likelihood. Additionally, it touches on the application of logistic regression for multiclass problems and introduces discriminant analysis as an alternative when classes are well-separated.


Classification – Logistic Regression

Agenda
In this session, we will discuss:
• Introduction to classification
• Linear regression vs. logistic regression
• A brief look at prediction
• Introduction to logistic regression

Classification
Supervised learning…class labels!

Classification
• Qualitative variables (nominal) take values in an unordered set of classes (C), such as:

Eye color ∈ {brown, blue, green}


Email ∈ {spam, ham}

• Given a feature vector X and a qualitative response Y taking values in the set C, the
classification task is to build a function C(X) that takes as input the feature vector X and
predicts its value for Y; i.e., C(X) ∈ C.
• Classifiers can be non-probabilistic or probabilistic.
• Often, we are more interested in estimating the probabilities that X belongs to every
category in C.

Example: Credit Card Default
Group-level dispersion vs. income and balance

[Figure: scatterplot of Income vs. Balance colored by default status, with boxplots of Balance and Income grouped by Default (No/Yes).]

Balance seems to be of more importance than income for predicting default.

Default ~ failing to make credit card payments for a given number of days (~180)
Can we use Linear Regression?
Suppose for the Default classification task that we code

Y = 0 if No, 1 if Yes.

Can we simply perform a linear regression of Y on X and classify as Yes if ŷ > 0.5?

• In this case of a binary outcome, linear regression does a good job as a classifier and is equivalent to linear discriminant analysis, which we will discuss later.
• Since in the population E(Y | X = x) = Pr(Y = 1 | X = x), we might think that regression is perfect for this task.
• However, linear regression might produce probabilities less than zero or greater than one. Logistic regression is more appropriate (it "squeezes" the range of the response).

Linear vs. Logistic Regression
[Figure: Probability of Default vs. Balance, fit by linear regression (left panel) and logistic regression (right panel). Orange tick marks show the observed responses Y (0 or 1).]

● Left: Linear regression does not reflect Pr(Y = 1 | X) correctly (among other problems).
● Right: Logistic regression seems to reflect the actual change in probabilities across values of X.
Linear vs. Logistic Regression

Sigmoid curve

[Figure: the same Probability of Default vs. Balance plots; the logistic regression fit follows an S-shaped (sigmoid) curve bounded between 0 and 1.]

Logistic regression ensures that our estimate for p(X) lies between 0 and 1.

Logistic Regression
Logistic regression uses the form

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

(Remember our lecture on linear models?)

It is easy to see that no matter what values β0, β1, or X take, p(X) will have values between 0 and 1.

A bit of rearrangement gives

log( p(X) / (1 − p(X)) ) = β0 + β1X

This monotone transformation is called the log odds or logit transformation of p(X). (Here, log = natural log.)
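As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of the logistic form and its logit inverse, with made-up coefficients:

import numpy as np

def logistic(x, beta0, beta1):
    # p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); always strictly between 0 and 1
    z = beta0 + beta1 * x
    return np.exp(z) / (1.0 + np.exp(z))

def logit(p):
    # log odds: log(p / (1 - p))
    return np.log(p / (1.0 - p))

x = np.linspace(-10, 10, 5)
p = logistic(x, beta0=0.5, beta1=1.2)   # hypothetical coefficients
print(p)          # every value lies in (0, 1)
print(logit(p))   # recovers beta0 + beta1 * x (up to floating point error)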
Maximum Likelihood
We use maximum likelihood to estimate the parameters. The likelihood is

ℓ(β0, β1) = ∏_{i: yi = 1} p(xi) × ∏_{i′: yi′ = 0} (1 − p(xi′))

This likelihood gives the probability of the observed zeros and ones in the data. We pick β0 and β1 to maximize the likelihood of the observed data.

Maximum likelihood estimates: differentiate the log-likelihood (the "cost function") with respect to the parameters, set the derivatives equal to zero, and solve. (Another approach: gradient descent.)

Most statistical packages can fit linear logistic regression models by maximum likelihood. For the Default data:

Coefficient Std. error Z-statistic P-value
Intercept -10.7 0.3612 -29.5 < 0.0001
balance 0.006 0.0002 24.9 < 0.0001
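A minimal sketch (not from the slides) of fitting such a model by maximum likelihood in Python with statsmodels, using synthetic data generated from the rounded coefficients above:

import numpy as np
import statsmodels.api as sm

# synthetic stand-in for the Default data: 0/1 outcome driven by 'balance'
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=1000)
p = 1 / (1 + np.exp(-(-10.7 + 0.006 * balance)))   # true probabilities (rounded slide coefficients)
default = rng.binomial(1, p)

X = sm.add_constant(balance)                        # add the intercept term
result = sm.Logit(default, X).fit()                 # maximum likelihood fit
print(result.summary())                             # coefficients, std. errors, z-statistics, p-values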
Making Predictions
What is our estimated probability of default for someone with a balance of $1000?

With a balance of $2000?



Coefficient Std. error Z-statistic P-value


Intercept -10.7 0.3612 -29.5 < 0.0001
balance 0.006 0.0002 24.9 < 0.0001
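A worked check with the rounded coefficients above (a sketch, not from the slides). Note the $2,000 figure is sensitive to rounding: the textbook's unrounded estimates are roughly 0.6% and 59%.

import math

def p_hat(balance, b0=-10.7, b1=0.006):
    # estimated default probability from the fitted logistic model (rounded coefficients)
    z = b0 + b1 * balance
    return math.exp(z) / (1.0 + math.exp(z))

print(p_hat(1000))   # ~0.009, i.e. under 1%
print(p_hat(2000))   # ~0.79 with these rounded coefficients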

Making Predictions
Let's do it again, using student as the predictor.

Coefficient Std. error Z-statistic P-value


Intercept -3.5 0.0707 -49.6 < 0.0001
student[Yes] 0.405 0.115 3.52 4E-04
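The same arithmetic applied to these coefficients (a sketch, not from the slides; the textbook reports about 4.3% for students and 2.9% for non-students):

import math

p_student = math.exp(-3.5 + 0.405) / (1 + math.exp(-3.5 + 0.405))   # ~0.043
p_non_student = math.exp(-3.5) / (1 + math.exp(-3.5))               # ~0.029
print(p_student, p_non_student)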

Logistic Regression with Several Variables
…analogous to multiple linear regression models:

log( p(X) / (1 − p(X)) ) = β0 + β1X1 + ⋯ + βpXp

Coefficient Std. error Z-statistic P-value
Intercept -10.9 0.4923 -22.1 < 0.0001
balance 0.006 0.0002 24.74 < 0.0001
income 0.003 0.0082 0.37 0.712
student[Yes] -0.65 0.2362 -2.74 0.006

The coefficient for student is now negative (it was positive before).

Logistic Regressions: Confounding

[Figure: left — default rate vs. credit card balance, shown separately for students and non-students; right — boxplots of credit card balance by student status.]

• Students tend to have higher balances than non-students, so their marginal default rate
is higher than for non-students.
• But for each level of balance, students default less than non-students.
• Multiple logistic regression can tease this out.

Logistic Regression with more than Two Classes
Logistic regressions are easily generalized to more than two classes (multinomial regression). Two
major options:

• “Classic”: Using a baseline class (K−1 functions). Parameter estimates are given with respect to the Kth class. The choice of baseline is not important.


• Softmax: No baseline selection. A linear function for each class (implemented in glmnet).
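A minimal sketch (not from the slides) of a multiclass fit in Python with scikit-learn, which uses the softmax (multinomial) formulation with its default solver; the iris data is just a stand-in example:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # 3 classes, 4 features
clf = LogisticRegression(max_iter=1000)    # multinomial (softmax) logistic regression
clf.fit(X, y)
print(clf.predict_proba(X[:3]))            # per-class probabilities; each row sums to 1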

Summary
Here is a quick recap:
• We discussed and understood classification using the example of credit card default.
• We discussed the reason for preferring logistic regression over linear regression: linear regression can produce probabilities less than zero or greater than one, so logistic regression is more appropriate for this kind of modeling.
• We walked through making predictions, first with balance and then with student status as the predictor.
• We discussed how logistic regression is easily generalized to more than two classes (multinomial regression) and ensures that our estimate for p(X) lies between 0 and 1.

Classification – Discriminant Analysis

Agenda
In this session, we will cover:
• Introduction to discriminant analysis
• Conditional Probabilities
• Introduction to Bayes' theorem
• Linear discriminant analysis for different values of p
• Types of errors
• ROC curve
• Limitations of LDA

Why Discriminant Analysis?
• When the classes are well-separated, parameter estimates for the logistic regression model are
unreliable. Linear discriminant analysis does not suffer from this problem.

• If n is small and the distribution of the predictors X is approximately normal in each of the classes,
the linear discriminant model is more stable than the logistic regression model.

• Linear discriminant analysis is popular when we have >2 classes because it also provides low-
dimensional views of the data.

Discriminant Analysis
Goal:
Model the distribution of X in each of the classes separately.
Then, use Bayes' theorem to obtain Pr(Y | X).

When we use normal (Gaussian) distributions for each class, this leads to linear or
quadratic discriminant analysis (other distributions can be used as well).

Conditional Probabilities
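For reference (the formula itself is presumably shown only in the slide graphic), conditional probability is defined as Pr(F | H) = Pr(F ∩ H) / Pr(H): the probability of F given that H has occurred.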

[Figure: Venn diagram of two events, F and H, illustrating conditional probability.]

Bayes’ Rule

[Figure: Venn diagram of two events, A and B.]

Bayes’ Rule
Concepts:

• Likelihood
- How much does a certain hypothesis explain the data?

• Prior
- What do you believe before seeing any data?

• Posterior
- What do we believe after seeing the data?

Bayes’ Theorem for Classification
Bayes' theorem:

Pr(Y = k | X = x) = Pr(X = x | Y = k) · Pr(Y = k) / Pr(X = x)

In the context of discriminant analysis, this becomes

Pr(Y = k | X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x)

where
• fk(x) = Pr(X = x | Y = k) is the density for X in class k. We will use normal densities (one for each class).
• πk = Pr(Y = k) is the prior probability for class k.

Classify to the Highest Density
[Figure: two one-dimensional class densities shown in two panels; left with priors π1 = 0.5, π2 = 0.5, right with π1 = 0.3, π2 = 0.7. The Bayes' boundaries give the smallest possible error.]

We classify a new point according to which density is the highest.


When the priors are different, we take them into account as well and
compare πkfk(x). On the right, we favor the pink class — the decision
boundary has shifted to the left.

Linear Discriminant Analysis when p = 1
Again, we assume a normal (Gaussian) distribution for X within each class:

fk(x) = (1 / (√(2π) σk)) exp( −(x − µk)² / (2σk²) )

• Here µk is the mean, and σk² the variance, in class k. We will assume that all the σk = σ are the same.

• Plugging this into Bayes' formula, we get the following expression for pk(x) = Pr(Y = k | X = x):

pk(x) = πk (1/(√(2π) σ)) exp( −(x − µk)² / (2σ²) ) / Σ_{l=1}^{K} πl (1/(√(2π) σ)) exp( −(x − µl)² / (2σ²) )

Here πk is the prior, the normal density describes the kth group, and the denominator is the marginal probability of X. Some simplifications later…
Discriminant Functions (Bayes classifier)
To classify at the value X = x, we need to see which of the pk(x) is the largest. Taking logs, and discarding terms that do not depend on k, this is equivalent to assigning x to the class with the largest discriminant score:

δk(x) = x · µk/σ² − µk²/(2σ²) + log(πk)

Note that δk(x) is a linear function of x.

If there are K = 2 classes and π1 = π2 = 0.5, then one can see that the decision boundary is at

x = (µ1 + µ2) / 2.
Discriminant Functions (Bayes’ classifier)

[Figure: left — the two class densities with the Bayes decision boundary marked; right — histograms of samples from the two classes with both the Bayes boundary and the estimated LDA decision boundary marked.]
Example with µ1 = −1.5, µ2 = 1.5, π1 = π2 = 0.5, and σ2 = 1.


Typically, we don't know these parameters; we just have the training data. In that case, we simply estimate the parameters (e.g., via LDA) and plug them into the rule.
Estimating the Parameters (LDA)

π̂k = nk / n   (an idea for a prior: the fraction of training observations in class k)

µ̂k = (1/nk) Σ_{i: yi = k} xi   (an expectation: the average of all the training observations from the kth class)

σ̂² = Σ_{k=1}^{K} (nk − 1) σ̂k² / (n − K)   (a weighted average of the sample variances for each class), where σ̂k² is the usual formula for the estimated variance in the kth class.

LDA assumes:
- Label-specific means
- A shared variance
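A minimal sketch (not from the slides) of these estimates in Python for the one-dimensional case, using synthetic data; it estimates the priors, class means, and pooled variance, then evaluates the discriminant scores δk(x):

import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D training data from two normal classes with a shared variance
x = np.concatenate([rng.normal(-1.5, 1.0, 50), rng.normal(1.5, 1.0, 50)])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

classes = np.unique(y)
n, K = len(x), len(classes)
pi_hat = np.array([np.mean(y == k) for k in classes])        # priors: n_k / n
mu_hat = np.array([x[y == k].mean() for k in classes])       # class means
var_hat = sum((np.sum(y == k) - 1) * x[y == k].var(ddof=1)   # pooled (shared) variance
              for k in classes) / (n - K)

def delta(x0):
    # discriminant scores: delta_k(x0) = x0*mu_k/var - mu_k^2/(2*var) + log(pi_k)
    return x0 * mu_hat / var_hat - mu_hat**2 / (2 * var_hat) + np.log(pi_hat)

print(delta(0.0))                        # scores near the boundary when priors are equal
print(classes[np.argmax(delta(2.0))])    # predicted class for x = 2.0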

Linear Discriminant Analysis when p > 1
[Figure: two bivariate Gaussian densities; one with Cor(x1, x2) = 0 and Var(x1) = Var(x2), the other with Cor(x1, x2) ≠ 0 and Var(x1) ≠ Var(x2).]

Density (with vector of mean values µk and a covariance matrix Σ shared across labels):

fk(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp( −(1/2) (x − µk)ᵀ Σ⁻¹ (x − µk) )

Discriminant function:

δk(x) = xᵀ Σ⁻¹ µk − (1/2) µkᵀ Σ⁻¹ µk + log πk

Despite its complex form, δk(x) is still a linear function of x.
Illustration: p = 2 and K = 3 classes
[Figure: left panel — three classes shown as ellipses containing 95% of the observations, with the Bayes' decision boundaries; right panel — 20 simulated observations per class with both the Bayes' and the LDA decision boundaries.]

Here π1 = π2 = π3 = 1/3.
The dashed lines are known as the Bayes’ decision boundaries. In this
case, they yield the fewest misclassification errors, among all possible
classifiers.
From δk(x) to Probabilities
Once we have estimates δ̂k(x), we can turn these into estimates for the class probabilities:

p̂k(x) = e^( δ̂k(x) ) / Σ_{l=1}^{K} e^( δ̂l(x) )

So, classifying to the largest δ̂k(x) amounts to classifying to the class for which p̂k(x) is largest.

When K = 2, we classify to class 2 if p̂2(x) ≥ 0.5, and to class 1 otherwise.

LDA on Credit Data


(23 + 252)/10000 errors — a 2.75% misclassification rate! Some caveats:
• This is a training error, and we may be overfitting.
• Of the true No's, we make 23/9667 = 0.2% errors; of the true Yes's, we make 252/333 = 75.7% errors!
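For reference, the confusion matrix implied by the counts quoted above (reconstructed from those figures, so only as exact as the quoted error rates):

                 True No    True Yes   Total
Predicted No       9644        252      9896
Predicted Yes        23         81       104
Total              9667        333     10000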

A Summary of the Types of Errors (Confusion Matrix)
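(The slide itself is an image; reproduced below is the standard confusion-matrix layout, consistent with the definitions on the next slide.)

                   True class: −          True class: +
Predicted: −       True negative (TN)     False negative (FN)
Predicted: +       False positive (FP)    True positive (TP)

False positive rate = FP / (FP + TN); false negative rate = FN / (FN + TP); true positive rate (sensitivity) = TP / (TP + FN).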


Types of Errors
False positive rate: the fraction of true negative examples that are classified as positive — 0.2% in our example.
False negative rate: the fraction of true positives that are classified as negative — 75.7% in our example.

We produced this table by classifying to class Yes if

P̂r(Default = Yes | Balance, Student) ≥ 0.5.

We can change the two error rates by changing the threshold from 0.5 to some other value in [0, 1]:

P̂r(Default = Yes | Balance, Student) ≥ threshold, and vary the threshold.

Varying the threshold
[Figure: error rate vs. threshold. Three curves are shown: the overall error rate, the false positive rate (the fraction of true negatives classified as positive), and the false negative rate (the fraction of true positives classified as negative), for thresholds between 0.0 and 0.5.]

In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less.
ROC Curve

[Figure: ROC curve plotting the true positive rate against the false positive rate; the diagonal line corresponds to no association between the predictor(s) and the response.]

The ROC plot displays both simultaneously.


Sometimes we use the AUC, or area under the curve, to summarize the overall performance. Higher AUC is good.
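A minimal sketch (not from the slides) of computing an ROC curve and its AUC in Python with scikit-learn; the labels and predicted probabilities below are synthetic stand-ins for a fitted classifier's output:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.1, size=1000)                                  # synthetic 0/1 labels
scores = np.clip(0.3 * y_true + rng.normal(0.2, 0.15, size=1000), 0, 1)  # synthetic predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)   # false/true positive rates as the threshold varies
print(roc_auc_score(y_true, scores))               # area under the ROC curve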
Limitations of LDA
If the distributions are significantly non-Gaussian, the LDA projections may not preserve complex
structure in the data needed for classification.


LDA will also fail if discriminatory information is not in the mean but in the variance of the data.

Summary
Here is a quick recap:
• We discussed that we use discriminant analysis when the classes are well separated and parameter estimates for the logistic regression model are unreliable.
• We learned about conditional probabilities using an example.
• We discussed Bayes' theorem mathematically and understood how it works.
• We talked about linear discriminant analysis for different values of p and K, and saw that the Bayes' decision boundaries yield the fewest misclassification errors among all possible classifiers.
• We talked about various types of errors, such as the false positive rate and the true positive rate.
• We discussed varying the classification threshold to reduce the false negative rate, and understood the ROC curve relating the true and false positive rates.
• We talked about the limitations of LDA, such as its failure when the discriminatory information is not in the mean but in the variance of the data.

Extensions of the LDA

Agenda
In this session, we will discuss:
• Other forms of Discriminant Analysis


Other forms of Discriminant Analysis

Pr(Y = k | X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x)

LDA: when the fk(x) are Gaussian densities with the same covariance matrix Σ in each class.
By altering the forms for fk(x), we get different classifiers.

• With Gaussians but a different Σk in each class, we get quadratic discriminant analysis.

• By assuming independence between the features, we get a naïve Bayes classifier.

Quadratic Discriminant Analysis
[Figure: two simulated two-class examples; in one panel both classes have correlation r = 0.7, in the other the correlations are r = 0.7 and r = −0.7. The Bayes', LDA, and QDA decision boundaries are shown.]

Because the Σk are different, the quadratic terms matter.
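For reference (not spelled out in the extracted text), the QDA discriminant score keeps the class-specific covariance and so is quadratic in x: δk(x) = −(1/2)(x − µk)ᵀ Σk⁻¹ (x − µk) − (1/2) log|Σk| + log πk.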


Number of parameters: QDA = Kp(p+1)/2 vs. LDA = Kp.
Naïve Bayes
Assumes the features are independent within each class.
Useful when p is large, so multivariate methods like QDA and even LDA break down (curse of dimensionality!).
• Gaussian naïve Bayes assumes each Σk is diagonal:

fk(x) = ∏_{j=1}^{p} fkj(xj), with each fkj a one-dimensional Gaussian density.

• Can be used for mixed feature vectors (qualitative and quantitative). If Xj is qualitative, replace fkj(xj) with a probability mass function (histogram) over the discrete categories.

Despite strong assumptions, naïve Bayes often produces good classification results.
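A minimal sketch (not from the slides) of Gaussian naïve Bayes in Python with scikit-learn; the iris data is just a stand-in example:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB()                   # one Gaussian per feature per class (diagonal covariance)
nb.fit(X, y)
print(nb.predict_proba(X[:3]))      # posterior class probabilities
print(nb.score(X, y))               # training accuracy (optimistic; use a held-out set in practice)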
Summary
Here is a quick recap:
• We discussed other forms of discriminant analysis, such as quadratic discriminant analysis and the naïve Bayes classifier, and understood both forms graphically and mathematically.


KNN

Agenda
In this session, we will discuss:
• Introduction to K-nearest neighbors
• Distance metrics
• Advantages and limitations of KNN


KNN
• K-nearest neighbors (KNN)
• Very popular because it is very simple and has excellent empirical performance
• Handles both binary and multi-class data
• Makes no assumptions about the parametric form of the decision boundary:
  • A non-parametric method

KNN
• Does not have a training phase: just store the training data and do the computation when it is time to classify.

• Find the K "training points" that are closest to xnew.
• Select the majority class amongst these K neighbors (or, for regression, their average).
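A minimal sketch (not from the slides) of this procedure in Python with scikit-learn; the iris data is just a stand-in example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5, Euclidean distance by default
knn.fit(X_train, y_train)                   # "fitting" just stores the training data
print(knn.predict(X_test[:5]))              # majority class among the 5 nearest training points
print(knn.score(X_test, y_test))            # test-set accuracy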

K-nearest Neighbor
What value of k should we use?
• Using only the closest example (1NN) to determine the class is subject to errors due to:
• A single atypical example
• Noise

• Pick k too large and you end up looking at neighbors that are not that close.

• Value of k is typically odd to avoid ties; 3 and 5 are most common.

Similarity Metrics
Nearest neighbor methods depend on a similarity (or distance) metric. Ideas?

Euclidean distance: d(x, x′) = √( Σ_{j=1}^{p} (xj − x′j)² )

For a binary instance space, the natural choice is Hamming distance (the number of feature values that differ).

For text, cosine similarity of tf.idf-weighted vectors is typically most effective.
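A minimal sketch (not from the slides) of these three metrics in Python, using NumPy only:

import numpy as np

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def hamming(a, b):
    # number of positions at which two binary/categorical vectors differ
    return int(np.sum(a != b))

def cosine_similarity(a, b):
    # cosine of the angle between two (e.g., tf.idf) vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
print(euclidean(a, b), hamming(a, b), cosine_similarity(a, b))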

Advantages and limitations of KNN
• Good
o No training is necessary.
o No feature selection necessary.
o Scales well with large number of classes.
▪ Don’t need to train n classifiers for n classes.
• Bad
o Classes can influence each other.
▪ Small changes to one class can have ripple effect.
o Scores can be hard to convert to probabilities.
o Can be more expensive at test time.
o “Model” is all of your training examples which can be large.

Example: K-nearest neighbors in two dimensions


For different values of K
[Figure: KNN decision boundaries for K = 1 and K = 100.]


[Figure: KNN decision boundary for K = 10.]


Summary
Here is a quick recap:
• We discussed K-nearest neighbors, a simple method with excellent empirical performance that handles both binary and multi-class data.
• We talked briefly about various distance metrics, such as Euclidean and Hamming distance.
• We discussed the advantages and limitations of KNN: no training is required, but prediction can be more expensive at test time.
• We talked about KNN with different values of k and understood the effect on the decision boundary graphically.

