
Data Mining [CSEN 911]

GUC - Winter 2024 – Lecture 7


Model Building – kNN Classifier, Classification
Model Evaluation

Dr. Ayman Al-Serafi

TAs:
Tameem Alghazaly* (lead)
Nada Bakeer
Sarah Samir
Mariam Moustafa
Outline
1. Midterm Feedback
2. Classification using Lazy Learning: kNN
3. Classification Model Evaluation
4. Conclusion

Q&A breaks between sections; urgent Qs only in between!
Outline
1. Midterm Feedback
2. Classification using Lazy Learning: kNN
3. Classification Model Evaluation
4. Conclusion

Q&A
Midterm Review – Data Mining Conceptual Classification

Data Mining Strategy                 | Data Mining Technique          | Data Mining Algorithm
Supervised Learning - Classification | Decision Tree Learning         | Information Gain (C4.5)
Supervised Learning - Estimation     | Linear Regression              | Gradient Descent / Least-squares fitting
Unsupervised Learning - Clustering   | Partitioning-based Clustering  | K-Means
Market Basket Analysis               | Association Rule Mining        | Apriori


Outline
1. Midterm Feedback
2. Classification using Lazy Learning: kNN
3. Classification Model Evaluation
4. Conclusion

Q&A
CLASSIFYING THE CLASSIFIERS!



LAZY VS. EAGER LEARNING
 Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple (it doesn't create a model!)
 Eager learning (the previously discussed methods): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify

Lazy: less time in training but more time in predicting.

Accuracy
 A lazy method effectively uses a richer hypothesis space, since it uses many local functions to form an implicit global approximation to the target function
 An eager method must commit to a single hypothesis (model) that covers the entire instance space
LAZY LEARNER: INSTANCE-BASED METHODS

Instance-based learning:
 Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
 Must have a database of previous examples to be able to classify future examples

Typical approach: the k-nearest neighbour (kNN) approach (see the sketch below)
1. Instances are represented as points in a Euclidean space  can only work with numerical input data!
2. Compute the distance between the new point and all other points in the database
3. Find the k closest instances in the database (e.g., by Euclidean distance)
4. Assign the most-voted class of the dependent variable as the class for the instance with the unknown classification
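A minimal sketch of the four steps above in plain Python (the names `train`, `labels`, and `query` are hypothetical, and a real implementation would normalize the inputs first):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Distance between two purely numerical points (step 1)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, labels, query, k=3):
    # Step 2: compute the distance from the query to every stored tuple
    order = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))
    # Step 3: keep the k closest instances
    top_k = order[:k]
    # Step 4: majority vote over their class labels
    return Counter(labels[i] for i in top_k).most_common(1)[0][0]

# Hypothetical usage
train = [(25, 40), (35, 60), (45, 80), (33, 150)]
labels = ["No", "No", "No", "Yes"]
print(knn_classify(train, labels, query=(48, 142), k=3))
```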


LAZY LEARNERS
K-NEAREST NEIGHBOR (KNN) CLASSIFIERS
Delay classification until new test data is available
 Store the training data in the meantime

Use a similarity measure to compute the distance between the test data tuple and each of the training data tuples (Euclidean, Manhattan, …)
 Remember to normalize if ranges vary between attributes

k stands for the number of "closest" neighbours of a test data tuple according to the measured distance
 Majority voting over their class labels is used to determine the class of the test tuple


THE K-NEAREST NEIGHBOUR ALGORITHM
All instances correspond to points in the n-D space.
The nearest neighbours are defined in terms of Euclidean distance, dist(X1, X2).
The dependent variable could be discrete (categorical) or numerical.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
Each of the k nearest neighbours votes for a class for xq.

[Figure: a query point xq plotted among "+" and "−" training instances; its class is decided by majority vote among its k nearest neighbours.]
DISCUSSION ON THE K-NN ALGORITHM
k-NN can also be used for numerical prediction for a given unknown tuple
 Returns the mean value of the k nearest neighbors

Robust to noisy data, by averaging over the k nearest neighbors.

Curse of dimensionality: the distance between neighbours can be dominated by irrelevant attributes
 To overcome it, apply normalization and data reduction: reduce the number of input attributes by eliminating the least relevant ones
LAZY LEARNERS
K-NEAREST NEIGHBOR – EUCLIDEAN DISTANCE + SIMILARITY

Euclidean distance between two instances x and y, over the input (independent) variables 1 to z:

$D = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_z - y_z)^2}$

Similarity is the opposite of distance:

$S = \frac{1}{1 + D}$, or $S = 1 - D$ if the distance is normalized into [0, 1] as a % distance (divide by max(D))
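The two formulas, directly in Python (a sketch; `x` and `y` stand for two instances with z numerical attributes):

```python
import math

def distance(x, y):
    # D = sqrt of the sum of squared differences over the z input variables
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def similarity(d, max_d=None):
    # S = 1/(1 + D), or S = 1 - D/max(D) when normalizing distances to [0, 1]
    return 1 - d / max_d if max_d else 1 / (1 + d)
```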


LAZY LEARNERS - K-NEAREST NEIGHBOR (KNN) EXAMPLE

Query tuple: age = 48, Loan = $142,000, Default = ? (max distance over the training set = 124,000)

RID | Age | Loan ($) | Default | Euclidean distance                        | Normalized (÷ 124,000) | Similarity = 1 − normalized
1   | 25  | 40,000   | No      | √((25−48)² + (40,000−142,000)²) ≈ 102,000 | 0.823 | 0.18 (18%)
2   | 35  | 60,000   | No      | √((35−48)² + (60,000−142,000)²) ≈ 82,000  | 0.661 | 0.34 (34%)
3   | 45  | 80,000   | No      | √((45−48)² + (80,000−142,000)²) ≈ 62,000  | 0.500 | 0.50 (50%)
4   | 20  | 20,000   | No      | √((20−48)² + (20,000−142,000)²) ≈ 122,000 | 0.984 | 0.02 (2%)
5   | 35  | 120,000  | No      | √((35−48)² + (120,000−142,000)²) ≈ 22,000 | 0.177 | 0.82 (82%)
6   | 52  | 18,000   | No      | √((52−48)² + (18,000−142,000)²) ≈ 124,000 | 1.000 | 0.00 (0%)
7   | 23  | 95,000   | Yes     | √((23−48)² + (95,000−142,000)²) ≈ 47,000  | 0.379 | 0.62 (62%)
8   | 40  | 62,000   | Yes     | √((40−48)² + (62,000−142,000)²) ≈ 80,000  | 0.645 | 0.36 (36%)
9   | 60  | 100,000  | Yes     | √((60−48)² + (100,000−142,000)²) ≈ 42,000 | 0.339 | 0.66 (66%)
10  | 48  | 220,000  | Yes     | √((48−48)² + (220,000−142,000)²) ≈ 78,000 | 0.629 | 0.37 (37%)
11  | 33  | 150,000  | Yes     | √((33−48)² + (150,000−142,000)²) ≈ 8,000  | 0.065 | 0.94 (94%)


LAZY LEARNERS
K-NEAREST NEIGHBOR (KNN) EXAMPLE

$D = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$

RID | Age | Loan ($) | Default | Distance
1   | 25  | 40,000   | No      | 102,000
2   | 35  | 60,000   | No      | 82,000
3   | 45  | 80,000   | No      | 62,000
4   | 20  | 20,000   | No      | 122,000
5   | 35  | 120,000  | No      | 22,000
6   | 52  | 18,000   | No      | 124,000
7   | 23  | 95,000   | Yes     | 47,000
8   | 40  | 62,000   | Yes     | 80,000
9   | 60  | 100,000  | Yes     | 42,000
10  | 48  | 220,000  | Yes     | 78,000
11  | 33  | 150,000  | Yes     | 8,000
Query | 48 | 142,000 | → Yes   |

If k = 1  the nearest neighbour is RID 11 • Default = YES
If k = 3  the nearest neighbours are RIDs 11, 5, 9 (two Yes, one No) • Default = YES
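The same example reproduced in Python (a sketch, using the raw unnormalized attributes exactly as on the slide):

```python
import math
from collections import Counter

rows = [  # (RID, age, loan, default)
    (1, 25, 40000, "No"),   (2, 35, 60000, "No"),  (3, 45, 80000, "No"),
    (4, 20, 20000, "No"),   (5, 35, 120000, "No"), (6, 52, 18000, "No"),
    (7, 23, 95000, "Yes"),  (8, 40, 62000, "Yes"), (9, 60, 100000, "Yes"),
    (10, 48, 220000, "Yes"), (11, 33, 150000, "Yes"),
]
query = (48, 142000)

# Sort all training tuples by Euclidean distance to the query
dists = sorted(
    (math.hypot(age - query[0], loan - query[1]), rid, default)
    for rid, age, loan, default in rows
)
for k in (1, 3):
    vote = Counter(label for _, _, label in dists[:k]).most_common(1)[0][0]
    print(f"k={k}: neighbours {[rid for _, rid, _ in dists[:k]]} -> {vote}")
# k=1: RID 11 -> Yes;  k=3: RIDs 11, 5, 9 -> Yes (two Yes vs one No)
```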
KNN EXAMPLE
VOTER PARTY REGISTRATION
Assume we have a training data set of voters, each tagged with three attributes: voter party registration, voter wealth, and a quantitative measure of voter religiousness.
We want to predict party registration using wealth and religiousness as predictors.


KNN EXAMPLE
VOTER PARTY REGISTRATION
Using kNN with k = 1, we can predict party registration for each voter in the training data  highly overfitted!

[Figure: decision regions for K = 1]
KNN EXAMPLE
VOTER PARTY REGISTRATION

[Figure: decision regions for K = 3 and K = 10; lighter colors indicate less certainty about the predictions]
 Reasonable fit to the data
KNN EXAMPLE
VOTER PARTY REGISTRATION

[Figure: decision regions for K = 20 and K = 80; lighter colors indicate less certainty about the predictions]
 Highly underfitted
Precautions and tips for kNN
• Choosing a reasonable number for k: usually an odd number like 3, 5, or 7
• How to handle missing values?
• Normalise the input independent variables (see the sketch below)
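Because Loan ($) dwarfs age in the earlier example, the distances are dominated by the loan amount. One common option is min-max normalization, which rescales every attribute to the same [0, 1] range (a sketch; z-score standardization is an alternative):

```python
def min_max_normalize(column):
    # Rescale each value to [0, 1]: (v - min) / (max - min)
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

ages = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33]
print(min_max_normalize(ages))  # all values now lie in [0, 1]
```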


Outline
1. Midterm Feedback
2. Classification using Lazy Learning: kNN
3. Classification Model Evaluation
4. Conclusion

Q&A
EVALUATING SUPERVISED LEARNER MODELS
Performance evaluation is probably the most critical of all steps in the data mining process.
Supervised learner models are used to classify, estimate, and/or predict future outcomes.
For some applications the desire is to build a model showing consistently high predictive accuracy.

Three example applications focus on classification correctness:
 Develop a model to accept or reject credit card applicants
 Develop a model to accept or reject home mortgage applicants
 Develop a model to decide whether or not to drill for oil

Classification correctness is best calculated by presenting previously unseen data to the model and summarizing the results in a table known as a confusion matrix.
TWO-CLASS ERROR ANALYSIS
Many of the applications listed previously represent two-class problems.
 Yes / No, High / Low, etc.

For example, the cells with True Accept and True Reject represent correctly classified instances.
A cell with False Accept denotes accepted applicants that should have been rejected.
A cell with False Reject denotes rejected applicants that should have been accepted.
TWO-CLASS ERROR ANALYSIS EXPLAINED

Table 2.6 • A Simple Confusion Matrix

        | Computed Accept | Computed Reject
Accept  | True Accept     | False Reject
Reject  | False Accept    | True Reject

Equivalently, with the true class in the columns and the predicted class in the rows:

                    | True Positive             | True Negative
Predicted Positive  | True Positive Count (TP)  | False Positive Count (FP)
Predicted Negative  | False Negative Count (FN) | True Negative Count (TN)
MODEL EVALUATION METRICS

Measure                                              | Formula
accuracy, recognition rate                           | (TP + TN) / (P + N)
error rate, misclassification rate (also 1−Accuracy) | (FP + FN) / (P + N)
recall, true positive rate, sensitivity              | TP / P
specificity, true negative rate                      | TN / N
precision                                            | TP / (TP + FP)

Confusion matrix:

           | Predicted Yes | Predicted No | Total
Actual Yes | TP            | FN           | P
Actual No  | FP            | TN           | N
Total      | P′            | N′           | P + N

Positives  tuples representing the class of interest
Negatives  tuples representing the other class(es)
True Positives  positive tuples correctly labeled
False Positives  negative tuples incorrectly labeled
True Negatives  negative tuples correctly labeled
False Negatives  positive tuples incorrectly labeled
CLASSIFIER EVALUATION METRICS: ACCURACY, ERROR RATE, SENSITIVITY AND SPECIFICITY

A \ P | C  | ¬C |
C     | TP | FN | P
¬C    | FP | TN | N
      | P′ | N′ | All

Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified
 Accuracy = (TP + TN) / All
Error rate: 1 − accuracy, or
 Error rate = (FP + FN) / All

Class Imbalance Problem:
 One class may be rare, e.g. fraud, or COVID-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: true positive recognition rate, Sensitivity = TP / P
 Specificity: true negative recognition rate, Specificity = TN / N
MODEL EVALUATION
METRICS FOR EVALUATING CLASSIFIER PERFORMANCE

Balanced Classes

           | Predicted Yes | Predicted No | Total  | Accuracy (%)
Actual Yes | 6,954         | 46           | 7,000  | 99.34
Actual No  | 412           | 2,588        | 3,000  | 86.27
Total      | 7,366         | 2,634        | 10,000 | 95.42

Example Buys_Computer confusion matrix
 Use accuracy and error rate
MODEL EVALUATION
METRICS FOR EVALUATING CLASSIFIER PERFORMANCE

As the data has too few cancer (positive) patients, we can't depend on accuracy alone to evaluate the model!

Imbalanced Classes

           | Predicted Yes | Predicted No | Total  | Accuracy (%)
Actual Yes | 90            | 210          | 300    | 30 (90/300 = 30%)  low sensitivity
Actual No  | 140           | 9,560        | 9,700  | 98.56 (9560/9700)  high specificity
Total      | 230           | 9,770        | 10,000 | 96.5 ((90+9560)/10000)

Example cancer confusion matrix
 Use sensitivity (true positive rate, i.e. recall) and specificity
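The cancer matrix above, evaluated in Python (a sketch; the counts are taken from the slide):

```python
TP, FN = 90, 210    # actual Yes (cancer): 300 tuples
FP, TN = 140, 9560  # actual No: 9,700 tuples
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)  # 0.965 -- looks deceptively good
error_rate  = (FP + FN) / (P + N)  # 0.035
sensitivity = TP / P               # 0.30  -- the model misses 70% of cancers
specificity = TN / N               # ~0.9856
print(accuracy, error_rate, sensitivity, specificity)
```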


CONFUSION MATRIX FOR MULTI-CLASS EXPLAINED

Table 2.5 • A Three-Class Confusion Matrix

     | Computed C1 | Computed C2 | Computed C3
C1   | C11         | C12         | C13
C2   | C21         | C22         | C23
C3   | C31         | C32         | C33

The diagonal cells (C11, C22, C33) hold the TRUE / CORRECT classifications.
CONFUSION MATRIX
• A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• Entries other than those on the main diagonal are classification errors.
• Rule 1: the value C11 represents the total number of C1 instances correctly classified by the model. The same logic applies to C22 and C33.
• Rule 2: values in row Ci represent instances that actually belong to class Ci. For example, with i = 2, the instances associated with cells C21, C22, and C23 are all actually members of C2. To find the total number of C2 instances misclassified as members of other classes, we compute the sum C21 + C23.
• Rule 3: values found in column Ci indicate instances that have been classified as members of Ci. With i = 2, the instances associated with cells C12, C22, and C32 were all classified as C2. To find the total number of instances incorrectly classified as C2, we compute the sum C12 + C32. (A sketch of all three rules follows.)
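The three rules in Python, for a hypothetical 3×3 matrix `m` where `m[i][j]` counts class-i instances computed as class j:

```python
m = [[50, 3, 2],   # actual C1
     [4, 40, 6],   # actual C2
     [1, 5, 60]]   # actual C3

i = 1  # class C2 (0-based index)
correct         = m[i][i]                             # Rule 1: C22
missed_as_other = sum(m[i]) - m[i][i]                 # Rule 2: C21 + C23
wrongly_labeled = sum(row[i] for row in m) - m[i][i]  # Rule 3: C12 + C32
print(correct, missed_as_other, wrongly_labeled)      # 40 10 8
```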
Is the Error Rate sufficient to judge?

For both models, Error Rate = (25 + 75) / 1000 = 10%

Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate

Model A | Computed Accept | Computed Reject      Model B | Computed Accept | Computed Reject
Accept  | 600             | 25                   Accept  | 600             | 75
Reject  | 75              | 300                  Reject  | 25              | 300

RecallA = 600 / 625 = 96%          RecallB = 600 / 675 ≈ 88.9%
PrecisionA = 600 / 675 ≈ 88.9%     PrecisionB = 600 / 625 = 96%

F measure (F1 or F-score): the harmonic mean of precision and recall,
F1 = 2 · precision · recall / (precision + recall)
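Recall, precision, and F1 for the two matrices in Python (a sketch; Accept is treated as the positive class):

```python
def prf(tp, fn, fp):
    # Recall = TP/(TP+FN), precision = TP/(TP+FP), F1 = their harmonic mean
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

print(prf(tp=600, fn=25, fp=75))  # Model A: recall 0.96, precision ~0.889
print(prf(tp=600, fn=75, fp=25))  # Model B: recall ~0.889, precision 0.96
# Both models end up with the same F1 (~0.923): the harmonic mean is symmetric.
```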
Comparing Models by Measuring Lift for Binary Classification
 The hope is to select samples that will show higher response rates than the rates seen within the general population.

 Supervised learner models designed for extracting biased samples from a general population are often evaluated by a measure that comes directly from marketing, known as LIFT.

 A lift value of 3+ is considered very good: 3 times better than random selection of the positive class from the population!
 1+ means the model is better than random selection from the population
 0 to 1 means random selection from the population is better than the model
COMPUTING LIFT

$\text{Lift} = \frac{P(C_i \mid \text{Sample})}{P(C_i \mid \text{Population})}$

Lift measures the change in percent concentration of a desired class, Ci, taken from a biased sample, relative to the concentration of Ci within the entire population.
Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

Model X | Computed Accept | Computed Reject      Model Y | Computed Accept | Computed Reject
Accept  | 540             | 460                  Accept  | 450             | 550
Reject  | 23,460          | 75,540               Reject  | 19,550          | 79,450

Lift (model X) = (540/24,000) / (1,000/100,000) = 2.25
Lift (model Y) = (450/20,000) / (1,000/100,000) = 2.25

Generically, for a matrix
A B
C D
Lift = (A / (A + C)) / ((A + B) / All)
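Lift computed from the generic A B / C D layout above (a sketch, with the slide's counts):

```python
def lift(a, b, c, d):
    # P(Ci | sample) / P(Ci | population)
    total = a + b + c + d
    return (a / (a + c)) / ((a + b) / total)

print(lift(540, 460, 23_460, 75_540))  # Model X: 2.25
print(lift(450, 550, 19_550, 79_450))  # Model Y: 2.25
```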
Table 2.9 (revisited) • The same two models, with lift 2.25 each, compared on accuracy

Model X | Computed Accept | Computed Reject      Model Y | Computed Accept | Computed Reject
Accept  | 540             | 460                  Accept  | 450             | 550
Reject  | 23,460          | 75,540               Reject  | 19,550          | 79,450

AccuracyX = (540 + 75,540) / 100,000 = 76.08%
AccuracyY = (450 + 79,450) / 100,000 = 79.9%
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model

No Model | Computed Accept | Computed Reject      Ideal Model | Computed Accept | Computed Reject
Accept   | 1,000           | 0                    Accept      | 1,000           | 0
Reject   | 99,000          | 0                    Reject      | 0               | 99,000

Lift (no model) = (1,000/100,000) / (1,000/100,000) = 1
Lift (ideal model) = (1,000/1,000) / (1,000/100,000) = 100
MODEL SELECTION: ROC CURVES FOR BINARY CLASSIFIERS
ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
 The true positive rate (TPR, also called sensitivity) is calculated as TP / (TP + FN) = recall
 The false positive rate (FPR) is calculated as FP / (FP + TN)

Originated from signal detection theory.

Shows the trade-off between the true positive rate and the false positive rate:
 The vertical axis represents the true positive rate
 The horizontal axis represents the false positive rate
 The plot also shows a diagonal line (random guessing)

The area under the ROC curve (AUC) is a measure of the accuracy of the model:
 Rank the test tuples in decreasing order of estimated probability of belonging to the positive class: the tuple most likely to be positive appears at the top of the list
 These scores come from the classification output, like the class percentages we had at the leaf nodes of decision trees
 A model with perfect accuracy will have an AUC of 1.0
 The closer the curve is to the diagonal line (i.e., the closer the AUC is to 0.5), the less accurate the model
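A minimal ROC/AUC sketch using scikit-learn (assuming it is installed; `y_true` and `y_score` are hypothetical labels and ranking scores, e.g. leaf-node class percentages from a decision tree):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual classes
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.85, 0.2]   # estimated P(positive)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)  # 1.0 = perfect, 0.5 = diagonal (random)
print(auc)  # 0.9375 here: one negative outranks one positive
```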
Outline
1. Midterm Feedback
2. Classification using Lazy Learning: kNN
3. Classification Model Evaluation
4. Conclusion

Q&A
SUMMARY
kNN Classifier
• The algorithm  lazy learning
• The distance metric
• The parameter k

Model Evaluation
• Metrics for Evaluating Classifier Performance
Mini-Project 2
 Due on Friday, 15th November, 23:59

 Implement a supervised classification project in Python on a new dataset

 You are expected to implement the full CRISP-DM data mining process as a data scientist
THANK YOU FOR YOUR ATTENTION
NEXT LECTURE: Unsupervised Learning: k-means clustering, evaluation

NEXT TUTORIAL: Decision Trees Lab + Assignment / Mini-Project 2