🎯 Why Are These Metrics Important?
When we build a classification model, we need a way to measure how well it performs — especially
when the data is imbalanced (e.g., 90% healthy, 10% sick).
💡 Confusion Matrix (The Base of Everything)
A confusion matrix is a summary table for classification results:
```
                Predicted
                 0      1
              ----------------
Actual   0  |   TN   |   FP
         1  |   FN   |   TP
```
Where:
TP (True Positive): Correctly predicted positive class
TN (True Negative): Correctly predicted negative class
FP (False Positive): Incorrectly predicted positive (Type I error)
FN (False Negative): Incorrectly predicted negative (Type II error)
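As a quick sketch of how these four counts map onto scikit-learn's output (using the same toy labels as the examples below), `confusion_matrix(...).ravel()` hands them back directly:
```python
from sklearn.metrics import confusion_matrix

# Same toy labels used in the examples below
y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1]

# scikit-learn lays the 2x2 matrix out as [[TN, FP], [FN, TP]],
# so ravel() returns the counts in that order
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
print("TP:", TP, "TN:", TN, "FP:", FP, "FN:", FN)  # TP: 3 TN: 2 FP: 1 FN: 1
```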
✅ Accuracy
Definition:
Percentage of total predictions that are correct.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Use-case:
Good for balanced datasets.
Weakness:
Can be misleading for imbalanced data.
Example:
```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
```
Output:
```
Accuracy: 0.714 (5 out of 7 predictions are correct)
```
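To see that weakness concretely, here is a small sketch with made-up imbalanced labels: a model that always predicts the majority class still reaches 90% accuracy while catching none of the positives.
```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced data: 9 negatives, 1 positive
y_true_imb = [0] * 9 + [1]
# A "lazy" model that always predicts the majority class
y_pred_imb = [0] * 10

print("Accuracy:", accuracy_score(y_true_imb, y_pred_imb))  # 0.9 -- looks impressive
print("Recall:", recall_score(y_true_imb, y_pred_imb))      # 0.0 -- misses the only positive
```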
🎯 Precision
Definition:
Of all predicted positives, how many were actually positive?
Precision = TP / (TP + FP)
Use-case:
Important when false positives are costly (e.g., spam filtering, where a false positive sends a legitimate email to the spam folder).
Example:
```python
from sklearn.metrics import precision_score

print("Precision:", precision_score(y_true, y_pred))
```
Output:
```
Precision: 0.75 (3 correct positives out of 4 predicted positives)
```
🎯 Recall (Sensitivity or True Positive Rate)
Definition:
Of all actual positives, how many were correctly predicted?
Recall = TP / (TP + FN)
Use-case:
Important when missing positives is dangerous (e.g., detecting fraud or cancer).
Example:
```python
from sklearn.metrics import recall_score

print("Recall:", recall_score(y_true, y_pred))
```
Output:
```
Recall: 0.75 (3 out of 4 actual positives were caught)
```
🎯 F1-Score
Definition:
Harmonic mean of Precision and Recall — a balanced metric.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use-case:
When you want to balance precision and recall, especially on imbalanced datasets.
Example:
```python
from sklearn.metrics import f1_score

print("F1-Score:", f1_score(y_true, y_pred))
```
Output:
```
F1-Score: 0.75
```
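Why the harmonic mean instead of a simple average? A quick worked example (numbers chosen only for illustration): with Precision = 0.9 and Recall = 0.1, the arithmetic mean would be 0.5, but F1 = 2 × (0.9 × 0.1) / (0.9 + 0.1) = 0.18. The harmonic mean punishes large gaps between the two, so a model cannot hide a terrible recall behind an excellent precision.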
🧮 Full Breakdown with Confusion Matrix
Let's compute the metrics manually to understand them better:
```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)
```
Output:
```
[[2 1]
 [1 3]]
```
From this:
TP = 3, TN = 2, FP = 1, FN = 1
Now manually:
```python
TP = 3
TN = 2
FP = 1
FN = 1

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)

print("Manual Accuracy:", accuracy)
print("Manual Precision:", precision)
print("Manual Recall:", recall)
print("Manual F1-Score:", f1)
```
📊 Visualize Metrics (Bar Plot)
```python
import matplotlib.pyplot as plt

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [accuracy, precision, recall, f1]

plt.figure(figsize=(8, 5))
plt.bar(metrics, values, color='skyblue')
plt.ylim(0, 1)
plt.title('Performance Metrics')
plt.ylabel('Score')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
```
✅ Summary Table
| Metric    | Best When...                             | Worst When...                        |
|-----------|------------------------------------------|--------------------------------------|
| Accuracy  | Data is balanced                         | Data is imbalanced                   |
| Precision | False positives are costly               | You need to catch all positives      |
| Recall    | False negatives are costly               | False positives are also costly      |
| F1-Score  | Need balance between precision & recall  | You care only about one (P or R)     |
📌 Bonus: Classification Report
Scikit-learn gives all metrics per class:
```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1']))
```
Output:
```
              precision    recall  f1-score   support

     Class 0       0.67      0.67      0.67         3
     Class 1       0.75      0.75      0.75         4

    accuracy                           0.71         7
   macro avg       0.71      0.71      0.71         7
weighted avg       0.71      0.71      0.71         7
```
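If you need these numbers programmatically rather than as printed text, classification_report also accepts output_dict=True, which returns a nested dictionary (a small sketch; the per-class keys follow the target_names passed in):
```python
from sklearn.metrics import classification_report

# output_dict=True returns a nested dict instead of a formatted string
report = classification_report(y_true, y_pred,
                               target_names=['Class 0', 'Class 1'],
                               output_dict=True)

# e.g., pull out the recall for the positive class
print("Class 1 recall:", report['Class 1']['recall'])
```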
Would you like the same deep explanation for macro/micro/weighted averaging, ROC AUC, or how to
use these metrics with multiclass or multilabel classification?