Machine Learning

EE514 – CS535

Analysis and Evaluation of Classifier Performance
and Multi-class Classification

Zubair Khalid

School of Science and Engineering


Lahore University of Management Sciences

https://www.zubairkhalid.org/ee514_2023.html
Outline

- Classification Accuracy (0/1 Loss)


- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging
Evaluation of Classification Performance
Classification Accuracy, Misclassification Rate (0/1 Loss):

- For each test point, the loss is either 0 or 1, depending on whether the prediction is correct or incorrect.
- Averaged over n data points, this loss is the 'misclassification rate'.

Interpretation:
- Misclassification rate: an estimate of the probability that a point is incorrectly classified.
- Accuracy = 1 - Misclassification Rate

Issue:
- Not meaningful when the classes are imbalanced or skewed.
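As a small illustration of the 0/1 loss, here is a minimal Python sketch (with hypothetical labels) that averages the per-point loss into a misclassification rate and an accuracy.

```python
# Minimal sketch: 0/1 loss averaged over test points (hypothetical labels).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]   # actual class labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]   # classifier predictions

# 0/1 loss per point: 1 if the prediction is wrong, 0 if it is correct.
losses = [int(t != p) for t, p in zip(y_true, y_pred)]

misclassification_rate = sum(losses) / len(losses)  # estimate of P(wrong prediction)
accuracy = 1 - misclassification_rate

print(misclassification_rate, accuracy)  # 0.25 0.75
```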
Evaluation of Classification Performance
Classification Accuracy (0/1 Loss):
Example:
- Predict whether a bowler will not bowl a no-ball.
- Assuming 15 no-balls in an innings of 315 balls, a model that always says 'Yes' (no no-ball) will have about 95% accuracy.
- Using accuracy as the performance metric, we would call this model very accurate, but in fact it is not useful or valuable.

Why?
- Total points: 315 (assuming all other balls are legal ☺)
- No-ball label: Class 0 (4.76% of the data)
- Not a no-ball label: Class 1 (95.24% of the data)
- The classes are imbalanced.
Evaluation of Classification Performance
TP, TN, FP and FN:
- Consider a binary classification problem.
- True Positive (TP): predicted positive and actually positive.
- True Negative (TN): predicted negative and actually negative.
- False Positive (FP): predicted positive but actually negative.
- False Negative (FN): predicted negative but actually positive.
Evaluation of Classification Performance
TP, TN, FP and FN:
Example:
- Predict whether a bowler will not bowl a no-ball.
- 15 no-balls in an innings (total balls: 315).
- Bowl a no-ball (Class 0), bowl a regular ball (Class 1).
- The model(*) predicted 10 no-balls (8 correct predictions, 2 incorrect).

* Assume you have a model that has been observing the bowlers for the last 15 years
and used these observations for learning.
Evaluation of Classification Performance
Confusion Matrix (Contingency Table):
- (TP, TN, FP, FN) are usefully summarized in a table, referred to as the confusion matrix:
  - the rows correspond to the predicted class (ŷ)
  - the columns correspond to the true class (y)

| Predicted \ Actual | 1 (Positive)                         | 0 (Negative)                         | Total                     |
|--------------------|--------------------------------------|--------------------------------------|---------------------------|
| 1 (Positive)       | TP                                   | FP                                   | Predicted Total Positives |
| 0 (Negative)       | FN                                   | TN                                   | Predicted Total Negatives |
| Total              | P = TP + FN (Actual Total Positives) | N = FP + TN (Actual Total Negatives) |                           |
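A minimal sketch (hypothetical labels, with 1 taken as the positive class) of how the four counts are obtained and arranged in the layout of the table above.

```python
# Count TP, FP, FN, TN for a binary problem (1 = positive class).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

# Rows: predicted class, columns: actual class (same layout as the table above).
confusion = [[TP, FP],
             [FN, TN]]
print(confusion)  # [[3, 1], [1, 3]]
```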
Evaluation of Classification Performance
Confusion Matrix:
Example: Disease Detection
- Given pathology reports and scans, predict heart disease (Yes: 1, No: 0).

| Predicted \ Actual | 1 (Positive) | 0 (Negative) | Total |
|--------------------|--------------|--------------|-------|
| 1 (Positive)       | TP = 100     | FP = 10      | 110   |
| 0 (Negative)       | FN = 5       | TN = 50      | 55    |
| Total              | P = 105      | N = 60       | 165   |

Interpretation:
Out of 165 cases,
- Predicted: "Yes" 110 times and "No" 55 times.
- In reality: "Yes" 105 times and "No" 60 times.

Evaluation of Classification Performance
Confusion Matrix:
Example: Predict whether a bowler will not bowl a no-ball
- Positive class (1): regular ball; negative class (0): no-ball.

| Predicted \ Actual | 1 (Positive) | 0 (Negative) | Total |
|--------------------|--------------|--------------|-------|
| 1 (Positive)       | TP = 298     | FP = 7       | 305   |
| 0 (Negative)       | FN = 2       | TN = 8       | 10    |
| Total              | P = 300      | N = 15       | 315   |

Interpretation:
- Out of 315 balls, we had 15 no-balls.
- The model predicted 305 regular balls and 10 no-balls (8 correct no-ball predictions, 2 incorrect).
Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix:

- Accuracy: Overall, how frequently is the classifier correct?
  Accuracy = (TP + TN) / (TP + TN + FP + FN)

- Misclassification or Error Rate: Overall, how frequently is it wrong?
  Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy

- Sensitivity or Recall or True Positive Rate (TPR): How often does it predict Positive when it is actually Positive?
  TPR = TP / (TP + FN) = TP / P
Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix:

- False Positive Rate (FPR): When it is actually Negative, how often does it predict Positive?
  FPR = FP / (FP + TN) = FP / N

- Specificity or True Negative Rate (TNR): When it is actually Negative, how often does it predict Negative?
  TNR = TN / (FP + TN) = TN / N = 1 - FPR

- Precision: When it predicts Positive, how often is it correct?
  Precision = TP / (TP + FP)

Evaluation of Classification Performance
Confusion Matrix Metrics:

- Negative Predictive Value (NPV): When it predicts Negative, how often is it correct?
  NPV = TN / (TN + FN)

Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix (Example: Disease Prediction):

- Accuracy: How often is the disease/healthy prediction correct?

  = (100 + 50)/165 = 0.91

- Misclassification or Error Rate: How often is the prediction wrong?

  = (10 + 5)/165 = 0.09

- Sensitivity or Recall or True Positive Rate (TPR): When a patient actually has the disease, how often does the model detect it?

  = 100/105 = 0.95
Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix (Example: Disease Prediction):

- False Positive Rate: When the patient is actually healthy, how often does it predict disease?

  = 10/60 = 0.17

- Specificity or True Negative Rate (TNR): When the patient is actually healthy, how often does it predict healthy?

  = 50/60 = 0.83

- Precision: When it predicts disease, how often is it correct?

  = 100/110 = 0.91
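The following sketch reproduces the numbers above from the four counts of the disease-detection confusion matrix (TP = 100, FP = 10, FN = 5, TN = 50); the variable names are ours.

```python
# Confusion-matrix metrics for the disease-detection example.
TP, FP, FN, TN = 100, 10, 5, 50
P, N = TP + FN, FP + TN                        # actual positives and negatives

accuracy   = (TP + TN) / (TP + TN + FP + FN)   # 0.909...
error_rate = 1 - accuracy                      # 0.0909...
tpr        = TP / P                            # sensitivity / recall = 0.952...
fpr        = FP / N                            # 0.166...
tnr        = TN / N                            # specificity = 0.833...
precision  = TP / (TP + FP)                    # 0.909...
npv        = TN / (TN + FN)                    # negative predictive value = 0.909...

print(accuracy, error_rate, tpr, fpr, tnr, precision, npv)
```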
Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix:
When to use which?

- Disease Detection: We do not want FN (a missed disease is costly), so high sensitivity/recall is the priority.

- Fraud Detection: We do not want FP (falsely flagging legitimate transactions is costly), so high precision is the priority.


Outline

- Classification Accuracy (0/1 Loss)


- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging
Evaluation of Classification Performance
Confusion Matrix:
Precision and Sensitivity (Recall) Trade-off:
Sensitivity or Recall: Se = TP / (TP + FN);  Precision = TP / (TP + FP)
- Disease Detection:

  - Recall or Sensitivity (Se): how good we are at detecting diseased people.

  - Precision: of the people diagnosed as unhealthy, how many are actually unhealthy.

  - If we diagnose everyone as unhealthy, Se = 1 (we detect all unhealthy people), but precision may be low, because TN = 0 and FP becomes large.

  - We want high precision and high Se (ideally both equal to 1).

- We should combine precision and sensitivity to evaluate the performance of the classifier.
  - F1-Score
Evaluation of Classification Performance
Confusion Matrix:
Sensitivity and Specificity Trade-off:
Sensitivity or Recall: Se = TP / (TP + FN);  Specificity: Sp = TN / (TN + FP)
- Disease Detection:

  - Sp and Se measure how good we are at detecting healthy and diseased people, respectively.

  - If we diagnose everyone as healthy, Sp = 1 (all healthy people diagnosed correctly) but Se = 0 (all unhealthy people diagnosed incorrectly).

  - Ideally, we want Sp = Se = 1 (perfect sensitivity and specificity), but this is unrealistic.
Evaluation of Classification Performance
Confusion Matrix:
Sensitivity and Specificity Trade-off:
How good is a given pair of (sensitivity, specificity) values?
- Is Sp = 0.8, Se = 0.7 better than Sp = 0.7, Se = 0.8?

[Figure: decision-threshold diagram; sweeping the threshold trades off Se and Sp.]

- The answer depends on the application.

- In disease diagnosis, we are happy to reduce Sp in order to increase Se.

- In other applications, we may have different requirements.

- The trade-off is better explained by the ROC curve and AUC.

Evaluation of Classification Performance
Confusion Matrix:
ROC (Receiver Operating Characteristic) Curve:
- Plot of TPR (Sensitivity) against FPR (1 - Specificity) for different values of the decision threshold.

- Also referred to as the Sensitivity vs (1 - Specificity) plot.

- At a threshold of 0.0, every case is diagnosed as positive:

  - Se = TPR = 1
  - FPR = 1
  - Sp = 0

- At a threshold of 1.0, every case is diagnosed as negative:

  - Se = TPR = 0
  - FPR = 0
  - Sp = 1
Evaluation of Classification Performance
Confusion Matrix:
ROC Curve and AUC:

- TPR (Sensitivity): how many correct positive results occur among all positive samples.

- FPR (1 - Specificity): how many incorrect positive results occur among all negative samples.

- The best possible prediction method has Se = Sp = 1 (upper-left corner of the ROC space).

- A random guess gives a point along the diagonal line (the so-called line of no discrimination): no discriminative power!

- The Area Under the ROC Curve (AUC) quantifies the power of the classifier.
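Below is a minimal sketch of how an ROC curve and its AUC can be computed by sweeping the decision threshold over classifier scores; the scores and labels are hypothetical and the AUC is obtained by trapezoidal integration over the resulting (FPR, TPR) points.

```python
# Sketch: ROC curve by sweeping the decision threshold over hypothetical scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]   # classifier scores, e.g. P(y=1|x)
labels = [1,   1,   0,   1,   0,    1,   0,   0  ]   # true classes

P = sum(labels)
N = len(labels) - P

# One (FPR, TPR) point per threshold; include the extremes 0 and 1.
thresholds = sorted(set(scores) | {0.0, 1.0}, reverse=True)
roc = []
for th in thresholds:
    preds = [1 if s >= th else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    roc.append((fp / N, tp / P))          # (FPR, TPR)

roc.sort()                                # order by FPR for integration
# AUC via the trapezoidal rule over the ROC points.
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
print(roc)
print("AUC =", auc)                       # 0.8125 for these hypothetical scores
```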
Outline

- Classification Accuracy (0/1 Loss)


- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging
Evaluation of Classification Performance
F1-Score:
- We observed a trade-off between recall and precision.

- Higher levels of recall may be obtained at the price of lower values of precision.

- We need a single measure that combines recall and precision (or other metrics) to evaluate the performance of a classifier.

- Some combined measures:

  - F1 Score
  - Matthews Correlation Coefficient
  - 11-point average precision
  - The breakeven point
Evaluation of Classification Performance
F1 Score:

- One measure that assesses the recall and precision trade-off is the weighted harmonic mean (HM) of recall and precision. With equal weights, this is the F1 score:

  F1 = 2 · (Precision · Recall) / (Precision + Recall)
Evaluation of Classification Performance
F1 Score:
Why the harmonic mean?
- We could also use the arithmetic mean (AM) or the geometric mean (GM).

- The HM is preferred as it penalizes the model the most; it is a conservative average. For two positive real numbers, we have

  HM ≤ GM ≤ AM

- Because the HM is the smallest of the three, a high HM guarantees that the GM and AM are at least as high.

[Figure: different means, minimum and maximum plotted against precision, with recall fixed at 70%.]
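A small sketch comparing the arithmetic, geometric and harmonic means of precision and recall, with recall fixed at 0.7 as in the figure described above; the HM column is exactly the F1 score.

```python
import math

# Compare AM, GM, HM of precision and recall with recall fixed at 0.7.
recall = 0.7
for precision in [0.1, 0.3, 0.5, 0.7, 0.9]:
    am = (precision + recall) / 2
    gm = math.sqrt(precision * recall)
    hm = 2 * precision * recall / (precision + recall)   # this is the F1 score
    print(f"precision={precision:.1f}  AM={am:.3f}  GM={gm:.3f}  HM=F1={hm:.3f}")
# For every row, HM <= GM <= AM: the harmonic mean is the most conservative.
```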
Evaluation of Classification Performance
Matthews Correlation Coefficient (MCC):
- Precision, recall and F1-score are asymmetric: you get a different result if the classes are switched.

- The Matthews correlation coefficient measures the correlation between the true class and the predicted class. The higher the correlation between true and predicted values, the better the prediction.

- It is defined as

  MCC = (TP·TN - FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

- MCC = 1 when FP = FN = 0 (perfect classification).
- MCC = -1 when TP = TN = 0 (perfect misclassification).
- MCC = 0: the classifier performs no better than a random classifier (coin flip).
- MCC is symmetric by design.
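A minimal sketch of the MCC formula, applied to the earlier disease-detection counts (TP = 100, FP = 10, FN = 5, TN = 50); swapping the two classes leaves the value unchanged, illustrating the symmetry.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den != 0 else 0.0   # convention: MCC = 0 when a factor is 0

TP, FP, FN, TN = 100, 10, 5, 50
print(mcc(TP, FP, FN, TN))          # correlation between true and predicted classes
print(mcc(TN, FN, FP, TP))          # classes switched: same value (MCC is symmetric)
```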
Evaluation of Classification Performance
11-point Average Precision:
- Adjust the threshold of the classifier such that the recall takes the following 11 values: 0.0, 0.1, ..., 0.9, 1.0.

- For each value of recall, determine the precision, and then take the average of these precision values; this is referred to as the average precision (AP).

- This is just uniformly spaced sampling of the precision-recall curve followed by averaging.

The Breakeven Point:

- Compute precision as a function of recall for different values of the threshold.

- When Precision = Recall, we have the breakeven point.
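A sketch (hypothetical scores and labels) of the 11-point average precision and the breakeven point; at each recall level it takes the best precision achievable at that recall or higher, a common interpolation convention that is assumed here.

```python
# Sketch: 11-point average precision from hypothetical scores and labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,   0,   1,   1,   0,   0,   1,   0,   0  ]
P = sum(labels)

# (recall, precision) after each successive threshold (top-k predictions positive).
points = []
tp = 0
for k, y in enumerate(labels, start=1):     # scores are already sorted descending
    tp += y
    points.append((tp / P, tp / k))         # (recall, precision)

# 11-point AP: at each recall level use the best precision achievable at
# recall >= that level (a common interpolation convention, assumed here).
levels = [i / 10 for i in range(11)]
ap = sum(max(prec for rec, prec in points if rec >= r) for r in levels) / 11
print("11-point AP =", round(ap, 3))

# Breakeven point: the operating point where precision and recall are (nearly) equal.
breakeven = min(points, key=lambda rp: abs(rp[0] - rp[1]))
print("breakeven (recall, precision) =", breakeven)
```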


Outline

- Classification Accuracy (0/1 Loss)


- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging
Multi-Class Classification
Formulation:

Examples:
- Emotion Detection.

- Vehicle type, make, model and color of the vehicle from the images streamed by a safe-city camera.

- Speaker Identification from Speech Signal.

- State (rest, ramp-up, normal, ramp-down) of the process machine in the plant.

- Sentiment Analysis (Categories: Positive, Negative, Neutral), Text Analysis.

- Take an image of the sky and determine the pollution level (healthy, moderate, hazardous).

- Record Home WiFi signals and identify the type of appliance being operated.
Multi-Class Classification
Implementation (possible options using binary classifiers):
Option 1: Build a one-vs-all (OvA), also called one-vs-rest (OvR), classifier: train one binary classifier per class to separate that class from the rest, and predict the class whose classifier gives the highest score (a code sketch follows below).

Option 2: Build an all-vs-all (AvA) classifier: train one binary classifier for every pair of classes and predict by majority vote.

There can be other options…
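Below is a minimal sketch of Option 1 (one-vs-rest) built from binary classifiers. It assumes scikit-learn is available and uses LogisticRegression on synthetic data; the explicit loop is only to make the idea visible (scikit-learn's OneVsRestClassifier wraps the same pattern).

```python
# Sketch of Option 1 (one-vs-rest) built from binary classifiers.
# Assumes scikit-learn is installed; the data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

classes = np.unique(y)
binary_models = {}
for c in classes:
    # One binary problem per class: class c vs the rest.
    model = LogisticRegression(max_iter=1000)
    model.fit(X, (y == c).astype(int))
    binary_models[c] = model

# Predict the class whose one-vs-rest classifier gives the highest score.
scores = np.column_stack([binary_models[c].predict_proba(X)[:, 1] for c in classes])
y_pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", np.mean(y_pred == y))
```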


Evaluation of Classification Performance
Multiclass Classification:

- How do we define measures to evaluate the performance of a multi-class classifier?

- Macro-averaging: compute the performance measure for each class separately and then average over classes.

- Micro-averaging: pool the decisions (TP, FP, FN, TN) over all classes into a single confusion matrix and then evaluate.
Evaluation of Classification Performance
Multiclass Classification:
Confusion Matrix:
- Predict whether a bowler will bowl a no-ball, a wide ball, or a regular ball.
- 15 no-balls and 20 wide-balls in an innings (total balls: 335).
- Model predictions (rows: classifier output, columns: actual class):

| Classifier Output \ Actual | No-ball | Wide-ball | Regular ball |
|----------------------------|---------|-----------|--------------|
| No-ball                    | 8       | 5         | 20           |
| Wide-ball                  | 2       | 10        | 10           |
| Regular ball               | 5       | 5         | 270          |

- Precision for each class is computed along its row; recall for each class is computed along its column.
Evaluation of Classification Performance
Multiclass Classification:
Confusion Matrix – Recall and Precision:

Let C denote the confusion matrix above, where C_ij is the number of data points of actual class j that are predicted as class i.

Recall
- For the i-th class, recall is the fraction of data points of class i that are classified correctly:
  Recall_i = C_ii / Σ_j C_ji

Precision
- For the i-th class, precision is the fraction of data points predicted to be in class i that are actually in class i:
  Precision_i = C_ii / Σ_j C_ij

Accuracy
- Fraction of data points classified correctly:
  Accuracy = (Σ_i C_ii) / (Σ_i Σ_j C_ij)
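A short sketch that computes the per-class recall and precision, and the overall accuracy, directly from the 3x3 confusion matrix above (rows: classifier output, columns: actual class).

```python
# Per-class recall and precision from the 3x3 confusion matrix above.
# Rows: classifier output, columns: actual class (no-ball, wide-ball, regular ball).
C = [[8,  5,  20],
     [2, 10,  10],
     [5,  5, 270]]
classes = ["no-ball", "wide-ball", "regular ball"]
n = len(C)
total = sum(sum(row) for row in C)

for i in range(n):
    col_sum = sum(C[j][i] for j in range(n))   # actual count of class i
    row_sum = sum(C[i])                        # predicted count of class i
    recall = C[i][i] / col_sum
    precision = C[i][i] / row_sum
    print(f"{classes[i]:>12s}: recall={recall:.3f}  precision={precision:.3f}")

accuracy = sum(C[i][i] for i in range(n)) / total
print("accuracy =", round(accuracy, 3))        # (8 + 10 + 270) / 335
```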
Evaluation of Classification Performance
Multiclass Classification:
Confusion Matrix – Macro-Averaging:
- We compute the performance for each class and then average.

Confusion Matrix – one-vs-rest view for each class:

No-ball vs rest:
| Classifier Output \ Actual | No-ball | Not a no-ball |
|----------------------------|---------|---------------|
| No-ball                    | 8       | 25            |
| Not a no-ball              | 7       | 295           |

Wide vs rest:
| Classifier Output \ Actual | Wide | Not wide |
|----------------------------|------|----------|
| Wide                       | 10   | 12       |
| Not wide                   | 10   | 303      |

Regular vs rest:
| Classifier Output \ Actual | Regular | Not regular |
|----------------------------|---------|-------------|
| Regular                    | 270     | 10          |
| Not regular                | 30      | 25          |

Per-class recall: No-ball = 8/15 ≈ 0.533, Wide = 10/20 = 0.500, Regular = 270/300 = 0.900.

Macro-average Recall = (0.533 + 0.500 + 0.900) / 3 ≈ 0.644
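A short sketch of macro-averaging for the same matrix: compute recall per class, then take the unweighted mean over the three classes.

```python
# Macro-average recall: unweighted mean of the per-class recalls.
C = [[8,  5,  20],
     [2, 10,  10],
     [5,  5, 270]]                              # rows: predicted, columns: actual
n = len(C)

recalls = [C[i][i] / sum(C[j][i] for j in range(n)) for i in range(n)]
macro_recall = sum(recalls) / n
print(recalls)                                  # [0.533..., 0.5, 0.9]
print("macro-average recall =", round(macro_recall, 3))   # 0.644
```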
Evaluation of Classification Performance
Multiclass Classification:
Confusion Matrix – Micro-Averaging:
- Pool the decisions (TP, FP, FN, TN) over all classes into a single 2x2 confusion matrix and then evaluate.

Pooled (micro-average) confusion matrix, obtained by summing the three one-vs-rest matrices from the previous slide:

| Classifier Output \ Actual | True | False |
|----------------------------|------|-------|
| True                       | 288  | 47    |
| False                      | 47   | 623   |

Micro-average Recall = 288 / (288 + 47) = 288/335 ≈ 0.860
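A short sketch of micro-averaging for the same matrix: pool TP, FP and FN over the one-vs-rest views of all classes and compute a single recall (and precision).

```python
# Micro-average recall: pool TP, FP, FN over all classes, then compute recall once.
C = [[8,  5,  20],
     [2, 10,  10],
     [5,  5, 270]]                              # rows: predicted, columns: actual
n = len(C)

TP = sum(C[i][i] for i in range(n))                              # 288
FP = sum(C[i][j] for i in range(n) for j in range(n) if i != j)  # 47 (off-diagonal, row view)
FN = sum(C[j][i] for i in range(n) for j in range(n) if i != j)  # 47 (same total, column view)

micro_recall = TP / (TP + FN)
micro_precision = TP / (TP + FP)
print("micro recall =", round(micro_recall, 3))        # 0.860, equals accuracy here
print("micro precision =", round(micro_precision, 3))  # identical for single-label problems
```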
Evaluation of Classification Performance
Multiclass Classification:
Micro-Averaging vs Macro-Averaging:
- Note that, for a single-label multi-class problem, micro-average recall = micro-average precision = micro-average F1 score = accuracy (all computed from the pooled confusion matrix).
- The micro-average is a global metric: every decision counts equally, so it is dominated by the large classes.
- Consequently, it is not a good measure when the classes are imbalanced.

- The macro-average is relatively better in this respect, as we see a per-class (zoomed-in) picture before averaging.

- Note, however, that macro-averaging does not take class imbalance into account: every class receives equal weight regardless of its size.

- Weighted averaging is similar to macro-averaging but takes a weighted mean instead, where the weight of each class is the number of data points belonging to that class.

Weighted-average Recall = (15 · 0.533 + 20 · 0.500 + 300 · 0.900) / 335 = 288/335 ≈ 0.860
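A short sketch of weighted averaging for the same matrix: the per-class recalls are weighted by the number of actual data points (support) of each class.

```python
# Weighted-average recall: per-class recalls weighted by class support.
C = [[8,  5,  20],
     [2, 10,  10],
     [5,  5, 270]]                              # rows: predicted, columns: actual
n = len(C)

supports = [sum(C[j][i] for j in range(n)) for i in range(n)]   # [15, 20, 300]
recalls = [C[i][i] / supports[i] for i in range(n)]             # per-class recall

weighted_recall = sum(w * r for w, r in zip(supports, recalls)) / sum(supports)
print("weighted-average recall =", round(weighted_recall, 3))   # 288/335, about 0.860
```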
Evaluation of Classification Performance
References:

- KM 5.7.2
