lec-7
TAs:
Tameem Alghazaly* (lead)
Nada Bakeer
Sarah Samir
Mariam Moustafa
Outline
1. Midterm Feedback
2. Classification using Lazy Learning: kNN
3. Classification Model Evaluation
4. Conclusion
Q&A breaks between sections; urgent questions only in between!
Data Mining - GUC - Winter 2024
Midterm Review – Data Mining Conceptual Classification
Q&A
CLASSIFYING THE CLASSIFIERS!
Instance-based learning:
Store the training examples and delay processing ("lazy evaluation") until a new instance must be classified.
A database of previous examples is required to classify future examples.
Typical approach: the k-nearest neighbour (kNN) approach
1. Instances are represented as points in a Euclidean space; kNN can only work with numerical input data!
2. Compute the distance between the new point and all other points in the database.
3. Find the k closest instances in the database (e.g., by Euclidean distance).
4. Assign the majority class among those k neighbours to the instance with unknown classification.
A similarity measure (Euclidean, Manhattan, …) is used to compute the distance between the test data tuple and each of the training data tuples.
k stands for the number of "closest" neighbours of a test data tuple according to the measured distance.
Majority voting over their class labels determines the class of the test tuple.
D = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xz − yz)^2)
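The steps above can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are ours, not from the lecture):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training tuples under Euclidean distance.
    `train` is a list of (feature_vector, class_label) pairs."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    # Steps 2-3: compute all distances, keep the k closest tuples
    neighbours = sorted(train, key=lambda t: dist(t[0], query))[:k]
    # Step 4: majority vote over the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Note that raw Euclidean distance is dominated by large-scale attributes (e.g., income versus age), which is why attributes are normally normalized before applying kNN in practice.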
For the query tuple (Age = 48, Income = 142000), each training tuple's distance is d = sqrt((Age − 48)^2 + (Income − 142000)^2) (dominated by the income term), then divided by 124000 and converted to a similarity:

ID  Age  Income   Class   d (approx.)   d / 124000   Similarity = 1 − d/124000
 2   35   60000   No        82000         0.661        0.34  (34%)
 3   45   80000   No        62000         0.500        0.50  (50%)
 4   20   20000   No       122000         0.984        0.02   (2%)
 5   35  120000   No        22000         0.177        0.82  (82%)
 7   23   95000   Yes       47000         0.379        0.62  (62%)
 8   40   62000   Yes       80000         0.645        0.36  (36%)
 9   60  100000   Yes       42000         0.339        0.66  (66%)
10   48  220000   Yes       78000         0.629        0.37  (37%)
11   33  150000   Yes        8000         0.065        0.94  (94%)

Query:      48  142000  ?
Prediction: 48  142000  Yes   (with k = 3, the most similar tuples are rows 11, 5, and 9; majority class: Yes)
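The per-row arithmetic can be reproduced directly (a sketch; 124000 is the normalising constant used on the slide, and the helper name is ours):

```python
import math

# Training tuples from the example: (id, age, income, class)
rows = [
    (2, 35, 60000, "No"),   (3, 45, 80000, "No"),
    (4, 20, 20000, "No"),   (5, 35, 120000, "No"),
    (7, 23, 95000, "Yes"),  (8, 40, 62000, "Yes"),
    (9, 60, 100000, "Yes"), (10, 48, 220000, "Yes"),
    (11, 33, 150000, "Yes"),
]
QUERY_AGE, QUERY_INCOME = 48, 142000
NORM = 124000  # normalising constant used on the slide

def similarity_pct(age, income):
    # Distance is dominated by the income term, hence the slide's approximations
    d = math.sqrt((age - QUERY_AGE) ** 2 + (income - QUERY_INCOME) ** 2)
    return round(100 * (1 - d / NORM))

scores = {rid: similarity_pct(a, inc) for rid, a, inc, _ in rows}
```

Sorting `scores` in descending order gives rows 11 (94%), 5 (82%), and 9 (66%) as the three nearest neighbours; two of the three are labelled Yes, matching the predicted class.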
KNN EXAMPLE
VOTER PARTY REGISTRATION
Assume we have a training data set of voters, each tagged with three attributes: party registration, wealth, and a quantitative measure of religiousness.
We want to predict party registration using wealth and religiousness as predictors.
K = 1
Q&A
EVALUATING SUPERVISED LEARNER MODELS
Performance evaluation is probably the most critical of all steps in the data mining process.
Supervised learner models are used to classify, estimate, and/or predict future outcomes.
For some applications, the goal is to build a model showing consistently high predictive accuracy.
TWO-CLASS ERROR ANALYSIS
Many of the applications listed previously represent two-class problems: Yes / No, High / Low, etc.
The cells with True Accept and True Reject represent correctly classified instances.
A cell with False Accept denotes accepted applicants that should have been rejected.
A cell with False Reject denotes rejected applicants that should have been accepted.
TWO-CLASS ERROR ANALYSIS EXPLAINED
Table 2.6 • A Simple Confusion Matrix

                                True Class
Predicted Class        Positive                  Negative
Computed Accept        True Accept               False Accept
                       = True Positive (TP)      = False Positive (FP)
Computed Reject        False Reject              True Reject
                       = False Negative (FN)     = True Negative (TN)
MODEL EVALUATION METRICS
Accuracy (recognition rate) = (TP + TN) / (P + N)
Specificity = TN / N
MODEL EVALUATION
METRICS FOR EVALUATING CLASSIFIER PERFORMANCE
Balanced Classes
Predicted
Yes No Total Accuracy (%)
Yes 6954 46 7000 99.34
Actual No 412 2588 3000 86.27
Total 7366 2634 10000 95.42
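The per-row accuracies in the table follow directly from the counts (a quick check; the variable names are ours):

```python
TP, FN = 6954, 46    # actual Yes row: predicted Yes / predicted No
FP, TN = 412, 2588   # actual No row:  predicted Yes / predicted No
P, N = TP + FN, FP + TN

sensitivity = 100 * TP / P            # accuracy on the Yes class
specificity = 100 * TN / N            # accuracy on the No class
accuracy = 100 * (TP + TN) / (P + N)  # overall recognition rate
print(round(sensitivity, 2), round(specificity, 2), round(accuracy, 2))
# 99.34 86.27 95.42
```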
               Computed Decision
                 C1     C2     C3
True C1         C11    C12    C13
True C2         C21    C22    C23
True C3         C31    C32    C33

The diagonal cells C11, C22, and C33 hold the TRUE / CORRECT classifications.
CONFUSION MATRIX
• A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• All entries other than those on the main diagonal are classification errors.
• Rule 1: the value C11 represents the total number of C1 instances correctly classified by the model. The same logic applies to C22 and C33.
• Rule 2: values in row Ci represent instances that actually belong to class Ci. For example, with i = 2, the instances associated with cells C21, C22, and C23 are all actually members of C2. To find the total number of C2 instances misclassified as members of other classes, we compute the sum C21 + C23.
• Rule 3: values in column Ci indicate instances that have been classified as members of Ci. With i = 2, the instances associated with cells C12, C22, and C32 have all been classified as C2. To find the total number of instances incorrectly classified as C2, we compute the sum C12 + C32.
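The three rules can be checked on a small hypothetical 3-class matrix (the counts below are ours, for illustration only):

```python
# Rows = true class, columns = computed class (hypothetical counts)
C = [
    [50, 3, 2],   # true C1
    [4, 40, 6],   # true C2
    [1, 5, 44],   # true C3
]
i = 1  # index of class C2 (0-based)

correct = C[i][i]                              # Rule 1: C22
missed = sum(C[i]) - C[i][i]                   # Rule 2: C21 + C23
false_alarms = sum(r[i] for r in C) - C[i][i]  # Rule 3: C12 + C32
```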
Is the Error Rate sufficient to judge?
Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate
Lift = P(Ci | Sample) / P(Ci | Population)

Lift measures the increase in concentration of the desired class, Ci, in a biased sample relative to the population.
Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

With cell counts laid out as
  A  B
  C  D
Lift = (A / (A + C)) / ((A + B) / Total)
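Plugging counts into the lift formula (the numbers below are ours, chosen purely for illustration so that the lift comes out to the slide's value of 2.25):

```python
# Hypothetical counts: rows = true class, columns = computed decision
A, B = 225, 75    # true Ci:     accepted / rejected
C, D = 775, 1925  # true not-Ci: accepted / rejected
total = A + B + C + D  # 3000

p_sample = A / (A + C)          # P(Ci | sample): Ci rate among the accepted
p_population = (A + B) / total  # P(Ci | population): overall Ci rate
lift = p_sample / p_population
```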
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model
MODEL SELECTION: ROC CURVES FOR BINARY CLASSIFIERS
ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models.
The true positive rate (TPR, also called sensitivity) is calculated as TP / (TP + FN) = Recall.
The false positive rate (FPR) is calculated as FP / (FP + TN).
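For a single classifier (one point on the ROC curve), TPR and FPR follow directly from the confusion-matrix counts; reusing the numbers from the balanced-classes table earlier:

```python
TP, FN, FP, TN = 6954, 46, 412, 2588

tpr = TP / (TP + FN)  # true positive rate = sensitivity = recall
fpr = FP / (FP + TN)  # false positive rate = 1 - specificity
# An ROC curve plots (fpr, tpr) as the decision threshold varies;
# a curve closer to the top-left corner indicates a better model.
```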
Q&A
SUMMARY
kNN Classifier
Model Evaluation
Mini-Project 2
Due on Friday 15th November 23:59
THANK YOU FOR YOUR
ATTENTION
NEXT LECTURE: Unsupervised Learning: k-means clustering, evaluation