U02Lecture07 Classification
Unit 02 Lecture 07
Classification
A sliding cutoff can then be used to convert the score to a
decision.
The general approach is as follows:
1. Establish a cutoff probability for the class of interest, above
which we consider a record as belonging to that class.
2. Estimate (with any model) the probability that a record
belongs to the class of interest.
3. If that probability is above the cutoff probability, assign the
new record to the class of interest.
The higher the cutoff, the fewer the records predicted as 1—
that is, as belonging to the class of interest.
The lower the cutoff, the more the records predicted as 1.
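As a quick illustration (a minimal sketch, not taken from the lecture notebook), the cutoff step can be applied to the predicted probabilities of any fitted model:

import numpy as np

# Hypothetical predicted probabilities of class 1 for five records
proba = np.array([0.91, 0.42, 0.67, 0.08, 0.55])

cutoff = 0.5                        # raise it to predict fewer 1s, lower it to predict more
predictions = (proba >= cutoff).astype(int)
print(predictions)                  # [1 0 1 0 1]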
Naive Bayes Classifier
Naive Bayes is one of the most straightforward and fast classification algorithms, and it is suitable for large volumes of data.
The Naive Bayes classifier is successfully used in various applications such as spam filtering, text classification, sentiment analysis, and recommender systems.
It uses Bayes' theorem of probability to predict the class of unknown records.
Whenever you perform classification, the first step is to understand the problem and identify potential features and the label.
Features are those characteristics or attributes that affect the value of the label.
For example, in the case of loan distribution, bank managers identify the customer's occupation, income, age, location, previous loan history, transaction history, and credit score. These characteristics are known as features that help the model classify customers.
Naive Bayes Classifier
Classification has two phases: a learning phase and an evaluation phase.
In the learning phase, the classifier trains its model on a given dataset, and in the evaluation phase, it tests the classifier's performance.
Performance is evaluated on the basis of various parameters such as accuracy, error, precision, and recall.
Naive Bayes Classifier
The Naive Bayes classifier is a fast, accurate, and reliable algorithm.
Naive Bayes classifiers have high accuracy and speed on large datasets.
The Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of the other features.
For example, whether a loan applicant is desirable or not depends on his/her income, previous loan and transaction history, age, and location. Even if these features are interdependent, they are still considered independently.
This assumption simplifies computation, and that is why it is considered naive. This assumption is called class-conditional independence.
Naive Bayes Classifier
Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to determine the probability of a hypothesis given prior knowledge, and it depends on conditional probability.
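In its standard form, for a hypothesis H and observed evidence E, the theorem reads:

P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}

where P(H) is the prior probability of the hypothesis, P(E | H) is the likelihood of the evidence given the hypothesis, and P(H | E) is the posterior probability used for prediction.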
Naive Bayes Classifier
There are two likelihood tables: Likelihood Table 1 shows the prior probabilities of the labels, and Likelihood Table 2 shows the posterior probabilities.
Naive Bayes Classifier
Probability of playing:
P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P(Overcast) ..... (1)
Calculate the prior probabilities:
P(Overcast) = 4/14 = 0.29
P(Yes) = 9/14 = 0.64
Calculate the likelihood:
P(Overcast | Yes) = 4/9 = 0.44
Put these probabilities into equation (1):
P(Yes | Overcast) = 0.44 * 0.64 / 0.29 = 0.97 (higher)
Similarly, calculate the probability of not playing:
P(No | Overcast) = P(Overcast | No) P(No) / P(Overcast) ..... (2)
Calculate the prior probabilities:
P(Overcast) = 4/14 = 0.29
P(No) = 5/14 = 0.36
Calculate the likelihood:
P(Overcast | No) = 0/5 = 0
Put these probabilities into equation (2):
P(No | Overcast) = 0 * 0.36 / 0.29 = 0
The probability of the 'Yes' class is higher, so you can conclude that if the weather is overcast, the players will play.
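The same numbers can be checked quickly in Python, using only the counts quoted above (the exact fractions give 1.0; the rounded slide values give roughly 0.97):

p_overcast = 4 / 14            # prior probability of Overcast
p_yes = 9 / 14                 # prior probability of Yes
p_overcast_given_yes = 4 / 9   # likelihood of Overcast given Yes

p_yes_given_overcast = p_overcast_given_yes * p_yes / p_overcast
print(p_yes_given_overcast)    # 1.0 with exact fractions; about 0.97 with the rounded values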
Naive Bayes Classifier with multiple features
Now suppose you want to calculate the probability of playing when the weather is overcast and the temperature is mild.
Probability of playing:
Bayes' theorem, combined with the naive independence assumption, says:
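Written out in its standard form (the denominator P(Overcast, Mild) is the same for both classes, so it can be dropped when only comparing the classes):

P(\mathrm{Yes} \mid \mathrm{Overcast}, \mathrm{Mild}) \propto P(\mathrm{Overcast} \mid \mathrm{Yes})\; P(\mathrm{Mild} \mid \mathrm{Yes})\; P(\mathrm{Yes})

P(\mathrm{No} \mid \mathrm{Overcast}, \mathrm{Mild}) \propto P(\mathrm{Overcast} \mid \mathrm{No})\; P(\mathrm{Mild} \mid \mathrm{No})\; P(\mathrm{No})

The class with the larger value is the predicted class.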
Naive Bayes Classifier with multiple features
Frequency table totals: Play = Yes: 9, Play = No: 5, Total: 14 records.
Types of Naive Bayes Classifier
There are three types of Naive Bayes model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.
Multinomial: The Multinomial Naive Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as sports, politics, education, etc. The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
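All three variants are available in scikit-learn; a minimal sketch of the corresponding classes (assuming scikit-learn is installed):

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb = GaussianNB()      # continuous features assumed to be normally distributed
mnb = MultinomialNB()   # count features, e.g. word frequencies in documents
bnb = BernoulliNB()     # binary features, e.g. word present / absent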
Naive Bayes Classifier (colab code)
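A minimal sketch of the two phases (learning and evaluation) with scikit-learn's GaussianNB, using the built-in Iris data as a stand-in dataset; the actual Colab notebook may differ:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a small illustrative dataset (continuous features, so Gaussian NB is appropriate)
X, y = load_iris(return_X_y=True)

# Learning phase: fit the classifier on the training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = GaussianNB()
model.fit(X_train, y_train)

# Evaluation phase: assess performance on the held-out test split
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))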
Discriminant Analysis
Discriminant analysis is the earliest statistical classifier.
It was introduced by R. A. Fisher in 1936.
While discriminant analysis encompasses several techniques, the most commonly used is linear discriminant analysis, or LDA.
It has many other applications and, like principal component analysis (PCA), can also be used for dimensionality reduction.
Discriminant Analysis
Linear Discriminant Analysis (LDA) is a supervised learning algorithm used
for classification tasks in machine learning. It is a technique used to find
a linear combination of features that best separates the classes in a
dataset.
LDA works by projecting the data onto a lower-dimensional space that
maximizes the separation between the classes. It does this by finding
a set of linear discriminants that maximize the ratio of between-class
variance to within-class variance. In other words, it finds the
directions in the feature space that best separate the different classes
of data.
LDA assumes that the data has a Gaussian distribution. It also assumes
that the data is linearly separable, meaning that a linear decision
boundary can accurately classify the different classes.
To understand discriminant analysis, it is first necessary to introduce the
concept of covariance between two or more variables.
Covariance Matrix
The covariance measures the relationship between two variables x and z.
Denote the mean of each variable by x̄ and z̄.
The covariance s_x,z between x and z is given by:
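This is the standard sample covariance (written here with the usual n - 1 divisor, matching the sample variance convention), for n records with paired values (x_i, z_i):

s_{x,z} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})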
Covariance Matrix
As with the correlation coefficient, positive values indicate a positive relationship and negative values indicate a negative relationship.
Correlation, however, is constrained to be between -1 and 1, whereas the scale of the covariance depends on the scale of the variables x and z.
The covariance matrix Σ for x and z consists of the individual variable variances, s_x^2 and s_z^2, on the diagonal (where row and column are the same variable) and the covariances between variable pairs on the off-diagonals:
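Written out for the two variables x and z:

\hat{\Sigma} = \begin{bmatrix} s_x^2 & s_{x,z} \\ s_{x,z} & s_z^2 \end{bmatrix}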
Fisher’s Linear Discriminant
Fisher's linear discriminant distinguishes variation between groups, on the one hand, from variation within groups on the other.
Dividing the records into two groups, linear discriminant analysis (LDA) focuses on maximizing the "between" sum of squares SS_between (measuring the variation between the two groups) relative to the "within" sum of squares SS_within (measuring the within-group variation).
LDA projects data from a D-dimensional feature space down to a D'-dimensional space (D > D') in a way that maximizes the variability between the classes while reducing the variability within the classes.
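Stated as the standard multivariate criterion, LDA seeks the weight vector w that maximizes the ratio of between-class to within-class scatter:

w^{*} = \underset{w}{\arg\max}\; \frac{w^{T} S_B\, w}{w^{T} S_W\, w}

where S_B and S_W are the between-class and within-class scatter matrices.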
Fisher’s Linear Discriminant
For implementation and example see colab notebook
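A minimal sketch along the same lines, using scikit-learn's LinearDiscriminantAnalysis on the built-in Wine data as a stand-in; the actual notebook may differ:

from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a small multi-class dataset as a stand-in (3 classes, 13 features)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit LDA as a classifier; n_components controls the reduced dimension D'
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_proj = lda.fit_transform(X_train, y_train)   # data projected to 2 dimensions
print("Test accuracy:", accuracy_score(y_test, lda.predict(X_test)))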
Logistic Regression
Approximately 70% of problems in data science are classification problems.
Logistic regression is a common and useful regression method for solving binary classification problems.
Logistic regression can be used for various classification problems such as:
spam detection,
diabetes prediction,
whether a given customer will purchase a particular product or churn to another competitor,
whether a user will click on a given advertisement link or not.
Logistic regression describes and estimates the relationship between one dependent binary variable and the independent variables.
Logistic Regression
Logistic regression is a statistical method for predicting binary classes.
The outcome or target variable is dichotomous in nature.
Dichotomous means there are only two possible classes.
For example, it can be used for cancer detection problems. It computes the probability of an event occurring.
It is a special case of the generalized linear model in which the target variable is categorical in nature.
Logistic regression uses the log of odds as the dependent variable. It predicts the probability of occurrence of a binary event using a logit function.
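Concretely, for predictors x_1, ..., x_k the model can be written in its standard form as:

\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\qquad\text{equivalently}\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}

where p is the probability that the binary event occurs.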
Logistic Regression
Logistic regression assumptions:
The dependent variable is binary or dichotomous,
i.e. It fits into one of two clear-cut categories.
There should be no, or very little, multicollinearity between the
predictor variables
The independent variables should be linearly related to the log
odds.
Logistic regression requires fairly large sample sizes
Logistic Regression
Log-odds: In very simplistic terms, log odds are an alternate
way of expressing probabilities.
In order to understand log odds, it’s important to understand a
key difference between odds and probabilities:
odds are the ratio of something happening to something
not happening,
while probability is the ratio of something happening to
everything that could possibly happen.
Example: if you and your friend play 10 games of tennis and you win 4 out of the 10 games,
the odds of you winning are 4 to 6 (or, as a fraction, 4/6),
while the probability of you winning is 4 out of 10 (or, as a fraction, 4/10).
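In symbols, with p = 0.4 for the tennis example:

\text{odds} = \frac{p}{1-p} = \frac{0.4}{0.6} \approx 0.67,
\qquad
\log(\text{odds}) = \ln\!\left(\frac{0.4}{0.6}\right) \approx -0.41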
Logistic Regression
The sigmoid function, also called the logistic function, gives an 'S'-shaped curve that can take any real-valued number and map it into a value between 0 and 1.
If the curve goes to positive infinity, y predicted will
become 1, and
if the curve goes to negative infinity, y predicted
will become 0.
If the output of the sigmoid function is more than
0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or
NO.
The output can be interpreted as a probability. For example, if the output is 0.75, we can say that there is a 75 percent chance that the patient will suffer from cancer.
Types of Logistic Regression
Types of Logistic Regression:
Binary Logistic Regression: The target variable has only two
possible outcomes such as Spam or Not Spam, Cancer or No
Cancer.
Multinomial Logistic Regression: The target variable has three
or more nominal categories such as predicting the type of
Wine.
Ordinal Logistic Regression: the target variable has three or
more ordinal categories such as restaurant or product rating
from 1 to 5.
Logistic Regression code
Let's build the diabetes prediction model.
First, load the required Pima Indians Diabetes dataset using the pandas read_csv function. You can download the data from the following link:
https://www.kaggle.com/uciml/pima-indians-diabetes-database
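A minimal sketch of the model-building steps, assuming the Kaggle file has been saved locally as diabetes.csv with its usual 'Outcome' label column; the actual notebook may differ:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the Pima Indians Diabetes data (assumes the Kaggle CSV is saved locally)
df = pd.read_csv("diabetes.csv")

# 'Outcome' is the binary label (1 = diabetic); all other columns are features
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=16)

# Fit the logistic regression model and evaluate it on the test split
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))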
Evaluating Classification Models
It is common in predictive modeling to train a number of different
models, apply each to a holdout sample, and assess their
performance.
Model validation refers to the process in which a trained model is evaluated with a testing data set.
Fundamentally, the assessment process attempts to learn which
model produces the most accurate and useful predictions.
A simple way to measure classification performance is to count
the proportion of predictions that are correct, i.e., measure the
accuracy.
Accuracy is simply the proportion of records classified correctly:
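Expressed in terms of true/false positives and negatives:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}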
Evaluating Classification Models
In most classification algorithms, each case is assigned an
“estimated probability of being a 1.”
The default decision point, or cutoff, is typically 0.50 or 50%.
If the probability is above 0.5, the classification is “1”;
otherwise it is “0”.
Evaluating Classification Models
Confusion Matrix: The confusion matrix is a table showing the
number of correct and incorrect predictions categorized by type
of response.
It is often used to measure the performance of classification
models.
It tells you what your machine learning algorithm did right and what it did wrong.
The matrix displays the number of true positives (TP), true
negatives (TN), false positives (FP), and false negatives (FN)
produced by the model on the test data.
Each row of the matrix represents the instances in an actual class
while each column represents the instances in a predicted class.
The name "confusion" comes from the fact that the matrix makes it easy to see whether the system is confusing two classes (i.e., commonly mislabeling one as the other).
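For a binary problem the layout is therefore:

                     Predicted: 1            Predicted: 0
Actual: 1            True Positive (TP)      False Negative (FN)
Actual: 0            False Positive (FP)     True Negative (TN)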
Evaluating Classification Models (ML Performance Metrics)
Precision is a metric that measures how often a machine learning model correctly predicts the positive class. In other words, it measures how well you guess the label in question; the goal is to minimize mistakes when guessing positive labels.
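In terms of the confusion-matrix counts, the standard definitions of precision and the closely related recall are:

\text{Precision} = \frac{TP}{TP + FP}
\qquad
\text{Recall} = \frac{TP}{TP + FN}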
Evaluating Classification Models
https://www.geeksforgeeks.org/confusion-matrix-machine-learning/
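A minimal sketch of computing the confusion matrix and the related metrics with scikit-learn, using hypothetical label vectors:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Hypothetical actual and predicted labels for ten records
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Note: scikit-learn orders the labels ascending, so row/column 0 correspond to class 0
print(confusion_matrix(y_true, y_pred))        # rows = actual, columns = predicted
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))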
Evaluating Classification Models (AUC-ROC Curve)
The ROC curve is a graphical representation of the effectiveness of a binary classification model.
It plots the true positive rate (TPR) vs. the false positive rate (FPR) at different classification thresholds.
AUC stands for Area Under the Curve; the AUC represents the area under the ROC curve.
TPR and FPR both range between 0 and 1, so the area will always lie between 0 and 1, and a greater AUC value denotes better model performance.
The goal is to maximize this area in order to have the highest TPR and lowest FPR at the given threshold.
The AUC measures the probability that the model will assign a randomly chosen positive instance a higher predicted probability than a randomly chosen negative instance.
Evaluating Classification Models (AUC-ROC Curve)
Basically, TPR (also called recall or sensitivity) is the proportion of positive examples that are correctly identified.
It represents the ability of the model to correctly identify positive instances and is calculated as follows:
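TPR = \frac{TP}{TP + FN}

The false positive rate plotted on the other axis of the ROC curve is defined analogously:

FPR = \frac{FP}{FP + TN}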
Evaluating Classification Models (AUC-ROC Curve)
As said earlier, the ROC is simply the plot of TPR against FPR across all possible thresholds, and the AUC is the entire area beneath this ROC curve.
Let us look at AUC-ROC from a probabilistic point of view: AUC measures how well a model is able to distinguish between the classes.
In the plot, the black dots are the TPR and FPR values at different probability thresholds.
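A minimal sketch of computing and plotting an ROC curve with scikit-learn, using hypothetical labels and scores:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Hypothetical true labels and predicted probabilities of class 1
y_true  = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # TPR and FPR at each threshold
print("AUC:", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr, marker="o")      # the markers are the individual threshold points
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()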
Strategies for imbalanced data
Balanced Dataset: In a Balanced dataset, there is approximately
equal distribution of classes in the target column.
Imbalanced Dataset: In an Imbalanced dataset, there is a highly
unequal distribution of classes in the target column.
Example : Suppose there is a Binary Classification problem with
the following training data:
Total Observations : 1000
Target variable class is either ‘Yes’ or ‘No’.
Case 1:
If there are 900 'Yes' and 100 'No', then this represents an imbalanced dataset, as there is a highly unequal distribution of the two classes.
Case 2:
If there are 550 'Yes' and 450 'No', then this represents a balanced dataset, as there is an approximately equal distribution of the two classes.
Strategies for imbalanced data
An imbalanced data distribution generally happens when observations in one of the classes are much more or less frequent than in the other classes.
This problem is prevalent in examples such as fraud detection, anomaly detection, facial recognition, etc.
Standard ML techniques such as decision trees and logistic regression have a bias towards the majority class, and they tend to ignore the minority class. They tend to predict only the majority class and hence badly misclassify the minority class compared with the majority class.
In more technical terms, if we have an imbalanced data distribution in our dataset, then our model becomes prone to a situation in which the minority class has negligible or very low recall.
Strategies for imbalanced data
Hence, there is a significant amount of difference between the
sample sizes of the two classes in an Imbalanced Dataset.
Problems with an imbalanced dataset:
Algorithms may get biased towards the majority class and thus tend to predict the majority class as output.
An imbalanced dataset gives a misleading accuracy score.
Two main approaches to balancing data:
Up/down sampling
SMOTE (Synthetic Minority Oversampling Technique)
Strategies for imbalanced data (Over/Up-Sample Minority Class)
Using RandomOverSampler:
This can be done with the help of the RandomOverSampler method present in imblearn.
This function randomly replicates data points belonging to the minority class, sampling with replacement (by default).
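A minimal sketch, assuming the imbalanced-learn (imblearn) package is installed and using a synthetic imbalanced dataset:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Hypothetical imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# Randomly replicate minority-class records until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print("After :", Counter(y_res))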
Strategies for imbalanced data (Down/Under-sample Majority Class)
Using RandomUnderSampler
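A minimal under-sampling sketch with the same synthetic dataset and the imblearn RandomUnderSampler:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Same hypothetical imbalanced dataset as above
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# Randomly drop majority-class records until the classes are balanced
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("After :", Counter(y_res))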
Summary