U02Lecture07 Classification

The document discusses various classification algorithms including naive Bayes classification, discriminant analysis, and logistic regression. Naive Bayes classification uses Bayes' theorem and makes the assumption of conditional independence between features. Discriminant analysis uses covariance matrices and Fisher's linear discriminant to classify data. Logistic regression uses a logistic response function to predict class probabilities and can be used for binary and multiclass classification problems. The document also covers topics like evaluating classification models, dealing with imbalanced data, and types of naive Bayes classifiers.


Classification

Unit 02 Lecture 07

Dr. Mohammad Asif Khan


Contents
 Naive Bayes
 Why Exact Bayesian Classification Is Impractical
 The Naive Solution
 Discriminant Analysis
 Covariance Matrix
 Fisher’s Linear Discriminant
 A Simple Example
 Logistic Regression
 Logistic Response Function and Logit
 Logistic Regression and the GLM
 Generalized Linear Models
 Predicted Values from Logistic Regression
 Evaluating Classification Models
 Confusion Matrix
 The Rare Class Problem
 Precision, Recall, and Specificity
 ROC Curve
 AUC
 Lift
 Strategies for Imbalanced Data
 Undersampling
 Oversampling and Up/Down Weighting
 Data Generation
 Cost-Based Classification

 Exploring the Predictions


Classification
 Classification is perhaps the most important form of prediction:
the goal is to predict whether a record is a 1 or a 0
(phishing/not phishing, click/don't click, churn/don't churn),
which is known as binary classification.
 In a multiclass problem there are more than two categories, for
example predicting whether a tumor is normal, benign, or malignant.
Most algorithms can return a probability score of belonging to the
class of interest.
 Most classification methods provide two prediction methods:
 predict, which returns the predicted class, and
 a second method that returns the probability of each class.
Classification
 A sliding cutoff can then be used to convert the score to a
decision.
 The general approach is as follows:
1. Establish a cutoff probability for the class of interest, above
which we consider a record as belonging to that class.
2. Estimate (with any model) the probability that a record
belongs to the class of interest.
3. If that probability is above the cutoff probability, assign the
new record to the class of interest.
 The higher the cutoff, the fewer the records predicted as 1—
that is, as belonging to the class of interest.
 The lower the cutoff, the more the records predicted as 1.
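
The cutoff approach above can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the probability values are hypothetical, and in practice the probabilities would come from a fitted model.

```python
def classify(probabilities, cutoff=0.5):
    """Assign 1 when the estimated probability is above the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probabilities]


# Hypothetical estimated probabilities of belonging to the class of interest.
scores = [0.10, 0.35, 0.55, 0.80, 0.95]

low = classify(scores, cutoff=0.3)   # low cutoff: more records predicted as 1
high = classify(scores, cutoff=0.7)  # high cutoff: fewer records predicted as 1

print("cutoff 0.3:", low, "number of 1s:", sum(low))
print("cutoff 0.7:", high, "number of 1s:", sum(high))
```

Raising the cutoff from 0.3 to 0.7 reduces the number of records assigned to the class of interest, which is the trade-off described above.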
Naive Bayes Classifier
 Naive Bayes is a simple, fast classification algorithm that is
suitable for large datasets.
 The Naive Bayes classifier is used successfully in applications such
as spam filtering, text classification, sentiment analysis, and
recommender systems.
 It uses Bayes' theorem to predict the class of an unknown record.
 Whenever you perform classification,
 the first step is to understand the problem and identify potential
features and the label.
 Features are the characteristics or attributes that affect the value
of the label.
 For example, in the case of loan distribution, bank managers identify the
customer's occupation, income, age, location, previous loan history,
transaction history, and credit score. These characteristics are known as
features, and they help the model classify customers.
Naive Bayes Classifier
 Classification has two phases: a learning phase and an evaluation phase.
 In the learning phase, the classifier trains its model on a given dataset.
 In the evaluation phase, the classifier's performance is tested.
Performance is evaluated with metrics such as
accuracy, error, precision, and recall.
Naive Bayes Classifier
 The Naive Bayes classifier is a fast algorithm that is often
accurate and reliable in practice.
 Naive Bayes classifiers combine speed with good accuracy on
large datasets.
 The Naive Bayes classifier assumes that, within a class, the effect of a
particular feature is independent of the other features.
 For example, whether a loan applicant is desirable depends on his/her
income, previous loan and transaction history, age, and location. Even if
these features are interdependent, they are still considered
independently.
 This assumption simplifies computation, which is why the method is
called naive. The assumption is known as class conditional
independence.
Naive Bayes Classifier
 Bayes' theorem, also known as Bayes' rule or Bayes' law, is used
to determine the probability of a hypothesis given prior knowledge. It
depends on conditional probability:

P(A|B) = P(B|A) P(A) / P(B)

 P(A|B) is the posterior probability: the probability of hypothesis A given the observed
event B.
 P(B|A) is the likelihood: the probability of the evidence B given that
hypothesis A is true.
 P(A) is the prior probability: the probability of the hypothesis before observing the
evidence.
 P(B) is the marginal probability: the probability of the evidence.
Naive Bayes Classifier
 The Naive Bayes classifier calculates the probability of an event in the
following steps:
 Step 1: Calculate the prior probability for the given class labels.
 Step 2: Find the likelihood of each attribute for each class.
 Step 3: Put these values in Bayes' formula and calculate the posterior
probability.
 Step 4: See which class has the higher posterior probability, and assign
the input to that class.
 To simplify the prior and posterior probability calculations, you can use two tables:
a frequency table and a likelihood table.
 Both tables help you calculate the prior and posterior probabilities. The
frequency table contains the occurrence of labels for all features.
Naive Bayes Classifier
 There are two likelihood tables. Likelihood Table 1 shows the prior
probabilities of the labels, and Likelihood Table 2 shows the likelihood
of each feature value given the label.
Naive Bayes Classifier
 Probability of playing:
 P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P(Overcast) ..... (1)
 Calculate the prior and marginal probabilities:
 P(Overcast) = 4/14 = 0.29
 P(Yes) = 9/14 = 0.64
 Calculate the likelihood:
 P(Overcast | Yes) = 4/9 = 0.44
 Put the prior, marginal, and likelihood values in equation (1):
 P(Yes | Overcast) = 0.44 * 0.64 / 0.29 = 0.97 (higher); with the unrounded fractions the result is exactly 1
 Similarly, calculate the probability of not playing:
 Probability of not playing:
 P(No | Overcast) = P(Overcast | No) P(No) / P(Overcast) ..... (2)
 Calculate the prior and marginal probabilities:
 P(Overcast) = 4/14 = 0.29
 P(No) = 5/14 = 0.36
 Calculate the likelihood:
 P(Overcast | No) = 0/5 = 0
 Put the prior, marginal, and likelihood values in equation (2):
 P(No | Overcast) = 0 * 0.36 / 0.29 = 0
 The probability of the 'Yes' class is higher, so if the weather is overcast, the prediction is that the players will play.
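
The calculation above can be checked with a short Python sketch. It uses exact fractions so that no rounding enters the result. The counts come from the example (14 records, 9 'Yes', 5 'No', 4 overcast days that are all 'Yes'); the variable and function names are illustrative only.

```python
from fractions import Fraction

TOTAL = 14
# Counts from the weather example: class totals and overcast days per class.
class_totals = {"Yes": 9, "No": 5}
overcast_counts = {"Yes": 4, "No": 0}


def posterior_given_overcast(label):
    """P(label | Overcast) = P(Overcast | label) * P(label) / P(Overcast)."""
    prior = Fraction(class_totals[label], TOTAL)
    likelihood = Fraction(overcast_counts[label], class_totals[label])
    evidence = Fraction(sum(overcast_counts.values()), TOTAL)
    return likelihood * prior / evidence


p_yes = posterior_given_overcast("Yes")
p_no = posterior_given_overcast("No")
print("P(Yes | Overcast) =", p_yes)
print("P(No  | Overcast) =", p_no)
```

With exact fractions the two posteriors sum to 1, which is a useful sanity check on a hand calculation done with rounded decimals.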
Naive Bayes Classifier with multiple features
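 Now suppose you want to calculate the probability of playing when the weather is
overcast and the temperature is mild.
 Probability of playing:
 Under the naive independence assumption, Bayes' theorem gives:
P(Yes | Overcast, Mild) = P(Overcast | Yes) P(Mild | Yes) P(Yes) / (P(Overcast) P(Mild))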
Naive Bayes Classifier with multiple features

 Calculate the prior probability: P(Yes) = 9/14 = 0.64
 Calculate the likelihoods: P(Overcast | Yes) = 4/9 = 0.44; P(Mild | Yes) = 4/9 = 0.44
 Calculate the marginal probabilities: P(Overcast) = 4/14 = 0.29; P(Mild) = 6/14 = 0.4285
 Now calculate the posterior probability.

Temperature Type   Yes   No   P(Temp type)
Hot                2     2    4/14 = 0.2857
Mild               4     2    6/14 = 0.4285
Cool               3     1    4/14 = 0.2857
Total              9     5    14
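
A minimal sketch of the two-feature calculation, assuming the counts given in this example (4 of 9 'Yes' days are overcast, none of the 5 'No' days are, and the Mild row of the temperature table). Instead of dividing by the marginal probabilities, this sketch normalizes the per-class scores so they sum to 1, which is a common alternative; the names used are illustrative.

```python
from fractions import Fraction

TOTAL = 14
class_totals = {"Yes": 9, "No": 5}
# Per-class counts for each feature value, taken from the example.
counts = {
    "Overcast": {"Yes": 4, "No": 0},
    "Mild": {"Yes": 4, "No": 2},
}


def score(label, features):
    """Prior times the product of per-feature likelihoods (unnormalized)."""
    result = Fraction(class_totals[label], TOTAL)
    for feature in features:
        result *= Fraction(counts[feature][label], class_totals[label])
    return result


features = ["Overcast", "Mild"]
scores = {label: score(label, features) for label in class_totals}
total = sum(scores.values())
# Normalize so the class posteriors sum to 1.
posteriors = {label: s / total for label, s in scores.items()}
print(scores)
print(posteriors)
```

Because no 'No' day is overcast, the 'No' score is zero and the whole posterior mass goes to 'Yes'.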

Types of Naive Bayes Classifier
 There are three types of Naive Bayes model:
 Gaussian: The Gaussian model assumes that features follow a
normal distribution. If predictors take continuous values
instead of discrete ones, the model assumes that these values are
sampled from a Gaussian distribution.
 Multinomial: The Multinomial Naïve Bayes classifier is used when
the data follows a multinomial distribution. It is primarily used for document
classification problems, that is, deciding which category (such as
sports, politics, or education) a particular document belongs to.
 The classifier uses the frequency of words as the predictors.
 Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean
variables, such as whether a particular word is present in a
document. This model is also popular for document classification
tasks.
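
The difference between the predictors used by the Multinomial and Bernoulli models can be illustrated on a single document. The vocabulary and the document below are hypothetical; the sketch only builds the feature vectors and does not train a classifier.

```python
from collections import Counter

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["goal", "match", "vote", "election"]
document = "goal goal match vote"

word_counts = Counter(document.split())

# Multinomial predictors: how many times each vocabulary word occurs.
multinomial_features = [word_counts[word] for word in vocabulary]

# Bernoulli predictors: whether each vocabulary word is present (1) or not (0).
bernoulli_features = [int(word_counts[word] > 0) for word in vocabulary]

print("multinomial:", multinomial_features)
print("bernoulli:  ", bernoulli_features)
```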
Naive Bayes Classifier (colab code)

Discriminant Analysis
 Discriminant analysis is the earliest statistical classifier.
 It was introduced by R. A. Fisher in 1936.
 While discriminant analysis encompasses several techniques, the
most commonly used is linear discriminant analysis, or
LDA.
 It has links to other widely used methods, such as principal component
analysis (PCA).
Discriminant Analysis
 Linear Discriminant Analysis (LDA) is a supervised learning algorithm used
for classification tasks in machine learning. It is a technique used to find
a linear combination of features that best separates the classes in a
dataset.
 LDA works by projecting the data onto a lower-dimensional space that
maximizes the separation between the classes. It does this by finding
a set of linear discriminants that maximize the ratio of between-class
variance to within-class variance. In other words, it finds the
directions in the feature space that best separate the different classes
of data.
 LDA assumes that the data in each class has a Gaussian distribution. It
also assumes that the classes can be separated reasonably well by a
linear decision boundary.
 To understand discriminant analysis, it is first necessary to introduce the
concept of covariance between two or more variables.
Covariance Matrix
 The covariance measures the relationship between two
variables x and z.
 Denote the mean of each variable by $\bar{x}$ and $\bar{z}$.
 The covariance $s_{x,z}$ between x and z is given by:

$$s_{x,z} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n - 1}$$

 where n is the number of records (note that we divide by n - 1
instead of n).
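
A minimal sketch of the sample covariance described above, dividing by n - 1. The data values are made up for illustration, and the function name `covariance` is not taken from any particular library.

```python
def covariance(x, z):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    products = ((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    return sum(products) / (n - 1)


# Hypothetical data: z is exactly twice x, so the relationship is positive.
x = [1.0, 2.0, 3.0, 4.0]
z = [2.0, 4.0, 6.0, 8.0]

print("cov(x, z) =", covariance(x, z))
print("var(x)    =", covariance(x, x))  # covariance of x with itself is its variance
```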

Covariance Matrix
 As with the correlation coefficient, positive values indicate a
positive relationship and negative values indicate a negative
relationship.
 Correlation, however, is constrained to be between -1
and 1, whereas the scale of the covariance depends on the scale of
the variables x and z.
 The covariance matrix Σ for x and z consists of the
individual variable variances, $s_x^2$ and $s_z^2$, on the
diagonal (where row and column are the same variable) and the
covariances between variable pairs on the off-diagonals:

$$\Sigma = \begin{bmatrix} s_x^2 & s_{x,z} \\ s_{x,z} & s_z^2 \end{bmatrix}$$
Fisher’s Linear Discriminant
 Fisher’s linear discriminant distinguishes variation between
groups, on the one hand, from variation within groups on
the other.
 When dividing the records into two groups, linear discriminant analysis
(LDA) focuses on
 maximizing the “between” sum of squares SSbetween
(measuring the variation between the two groups) relative to
the “within” sum of squares SSwithin (measuring the within-group
variation).
 LDA projects data from a D-dimensional feature space
down to a D'-dimensional space (D > D') in a way that
maximizes the variability between the classes and reduces the
variability within the classes.
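
One standard way to obtain such a projection for two classes is to take the direction w proportional to the inverse of the within-class scatter matrix times the difference of the class means. The sketch below does this for two-dimensional points in plain Python. The data points and function names are hypothetical, and this is an illustration under those assumptions rather than the implementation used in the course notebook.

```python
def mean(points):
    """Mean vector of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of deviations from the mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class0, class1):
    """Direction w = S_within^{-1} (mean1 - mean0) for two 2-D classes."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Hypothetical two-class data in a 2-D feature space.
class0 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class1 = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]

w = fisher_direction(class0, class1)
# Project each point onto w: 2-D data becomes 1-D.
proj0 = [w[0] * x + w[1] * y for x, y in class0]
proj1 = [w[0] * x + w[1] * y for x, y in class1]
print("direction:", w)
print("class 0 projections:", proj0)
print("class 1 projections:", proj1)
```

For this toy data the one-dimensional projections of the two classes do not overlap, so a single cutoff on the projected value separates them.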
Fisher’s Linear Discriminant
 For implementation and example see colab notebook

Logistic Regression
 By some estimates, approximately 70% of problems in data science are
classification problems.
 Logistic regression is a common and useful regression method
for solving binary classification problems.
 Logistic regression can be used for various classification problems,
such as:
 spam detection,
 diabetes prediction,
 whether a given customer will purchase a particular product or churn
to a competitor,
 whether a user will click on a given advertisement link.
 Logistic regression describes and estimates the relationship between
one binary dependent variable and the independent variables.
Logistic Regression
 Logistic regression is a statistical method for predicting binary
classes.
 The outcome or target variable is dichotomous in nature.
Dichotomous means there are only two possible classes.
 For example, it can be used for cancer detection problems. It
computes the probability of an event occurring.
 It can be viewed as an extension of linear regression (a generalized
linear model) in which the target variable is categorical in nature.
 Logistic regression uses the log of odds as the dependent
variable. It predicts the probability of occurrence of a binary
event using a logit function.
Logistic Regression
 Logistic regression assumptions:
 The dependent variable is binary or dichotomous,
 i.e., it fits into one of two clear-cut categories.
 There should be no, or very little, multicollinearity between the
predictor variables.
 The independent variables should be linearly related to the log
odds.
 Logistic regression requires fairly large sample sizes.
Logistic Regression
 Log-odds: In very simple terms, log odds are an alternate
way of expressing probabilities.
 To understand log odds, it is important to understand a
key difference between odds and probabilities:
 odds are the ratio of something happening to something
not happening,
 while probability is the ratio of something happening to
everything that could possibly happen.
 Example: if you and your friend play 10 games of tennis, and
you win 4 out of 10 games,
 the odds of you winning are 4 to 6 (or, as a fraction, 4/6).
 The probability of you winning is 4 to 10 (or, as a fraction, 4/10),
as there were 10 games played in total.
 The log odds are the natural logarithm of the odds: here, log(4/6), which is about -0.41.
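
The tennis example can be reproduced with a short Python sketch that converts between probability, odds, and log odds. The function names are illustrative, and the natural logarithm is assumed.

```python
import math


def odds(p):
    """Odds = probability of happening / probability of not happening."""
    return p / (1 - p)


def log_odds(p):
    """Natural logarithm of the odds (the logit of p)."""
    return math.log(odds(p))


def probability_from_log_odds(value):
    """Inverse of log_odds: maps any real number back to a probability."""
    return 1 / (1 + math.exp(-value))


p_win = 4 / 10  # 4 wins out of 10 games
print("odds     =", odds(p_win))      # the same ratio as 4/6
print("log odds =", log_odds(p_win))
print("back to probability =", probability_from_log_odds(log_odds(p_win)))
```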


Logistic Regression
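 Logistic regression predicts the probability of occurrence of a binary
event using a logit function.
 Linear regression equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

 where y is the dependent variable and $x_1, x_2, \dots, x_n$ are the
explanatory variables.
 Sigmoid function:

$$p = \frac{1}{1 + e^{-y}}$$

 Apply the sigmoid function to the linear regression equation:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$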
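
A minimal sketch of applying the sigmoid function to a linear combination of the explanatory variables. The intercept, coefficients, and input values are hypothetical; in a real model they would be estimated from data.

```python
import math


def sigmoid(y):
    """Map a real-valued number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))


def predict_probability(x, intercept, coefficients):
    """Sigmoid of the linear combination intercept + sum(coefficient * x)."""
    y = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return sigmoid(y)


# Hypothetical intercept and coefficients, for illustration only.
intercept = -1.0
coefficients = [0.8, -0.5]

p = predict_probability([2.0, 1.0], intercept, coefficients)
label = 1 if p > 0.5 else 0  # classify with a 0.5 cutoff
print("probability:", p, "predicted class:", label)
```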

Logistic Regression
 The sigmoid function, also called the logistic
function, gives an ‘S’-shaped curve that can take
any real-valued number and map it to a value
between 0 and 1.
 As the input goes to positive infinity, the predicted y
approaches 1, and
 as the input goes to negative infinity, the predicted y
approaches 0.
 If the output of the sigmoid function is more than
0.5, we can classify the outcome as 1 or YES,
 and if it is less than 0.5, we can classify it as 0 or
NO.
 For example: if the output is
0.75, we can say in terms of probability that there
is a 75 percent chance that the patient will suffer
from cancer.
Types of Logistic Regression
 Types of Logistic Regression:
 Binary Logistic Regression: The target variable has only two
possible outcomes such as Spam or Not Spam, Cancer or No
Cancer.
 Multinomial Logistic Regression: The target variable has three
or more nominal categories, such as predicting the type of
wine.
 Ordinal Logistic Regression: The target variable has three or
more ordinal categories, such as a restaurant or product rating
from 1 to 5.
Logistic Regression code
 Let's build the diabetes prediction model.
 Here, you are going to predict diabetes using the Logistic
Regression Classifier.
 Let's first load the required Pima Indian Diabetes dataset using
the pandas read_csv function. You can download the data from the
following link:
https://www.kaggle.com/uciml/pima-indians-diabetes-database
Evaluating Classification Models
 It is common in predictive modeling to train a number of different
models, apply each to a holdout sample, and assess their
performance.
 Model validation is referred to as the process where a trained
model is evaluated with a testing data set.
 Fundamentally, the assessment process attempts to learn which
model produces the most accurate and useful predictions.
 A simple way to measure classification performance is to count
the proportion of predictions that are correct, i.e., measure the
accuracy.
 Accuracy is simply a measure of total error:
accuracy = (number of correct predictions) / (total number of predictions)
Evaluating Classification Models
 In most classification algorithms, each case is assigned an
“estimated probability of being a 1.”
 The default decision point, or cutoff, is typically 0.50 or 50%.
 If the probability is above 0.5, the classification is “1”;
otherwise it is “0”.

Evaluating Classification Models
 Confusion Matrix: The confusion matrix is a table showing the
number of correct and incorrect predictions categorized by type
of response.
 It is often used to measure the performance of classification
models.
 It tells you what your machine learning algorithm did right and what it
did wrong.
 The matrix displays the number of true positives (TP), true
negatives (TN), false positives (FP), and false negatives (FN)
produced by the model on the test data.
 Each row of the matrix represents the instances in an actual class
while each column represents the instances in a predicted class.
 The name “confusion” comes from the fact that it makes it easy to see
whether the system is confusing two classes (i.e., commonly
mislabeling one as another).


Evaluating Classification Models
 For binary classification, the matrix is a 2×2 table.
 For multi-class classification, the matrix shape matches
the number of classes, i.e., for n classes it is n×n.
 A 2×2 confusion matrix for an image recognition task that
labels each image as Dog or Not Dog has the following cells:
 True Positive (TP): the number of cases where both the
predicted and actual values are Dog.
 True Negative (TN): the number of cases where both the
predicted and actual values are Not Dog.
 False Positive (FP): the number of cases where the prediction is
Dog while the actual value is Not Dog.
 False Negative (FN): the number of cases where the prediction
is Not Dog while the actual value is Dog.
Evaluating Classification Models (ML Performance Metrics)

 Accuracy is a metric that measures how often a machine
learning model correctly predicts the outcome.
 You can calculate accuracy by dividing the number of
correct predictions by the total number of predictions.
 It treats all classes as equally important and looks at all
correct predictions.
 However, many real-world applications have a high
imbalance of classes. These are the cases when one
category occurs significantly more frequently than
the other.
Evaluating Classification Models (ML Performance Metrics)
 Precision measures how often a machine learning model is correct
when it predicts the positive class, that is, how well it guesses
the label in question. The goal is to minimize mistakes in predicted
positive labels:
Precision = TP / (TP + FP)
 The recall, also called sensitivity or true positive rate, measures how
often a machine learning model correctly identifies positive
instances (true positives) from all the actual positive samples in
the dataset:
Recall = TP / (TP + FN)
 Another metric is specificity, which measures a model’s
ability to predict a negative outcome:
Specificity = TN / (TN + FP)
Evaluating Classification Models (ML Performance Metrics)

 F1-Score: The F1-score is used to evaluate the overall
performance of a classification model. It is the harmonic
mean of precision and recall:
F1 = 2 * Precision * Recall / (Precision + Recall)
 The F1-score is useful when you want to seek a balance
between precision and recall.
Evaluating Classification Models

 Actual Dog Counts = 6
 Actual Not Dog Counts = 4
 True Positive Counts = 5
 False Positive Counts = 1
 True Negative Counts = 3
 False Negative Counts = 1
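
Using the counts above, the performance metrics can be computed directly. This is a small sketch with the standard definitions of accuracy, precision, recall, specificity, and F1; the variable names are illustrative.

```python
# Counts from the Dog / Not Dog example above.
tp, fp, tn, fn = 5, 1, 3, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct among predicted Dog
recall = tp / (tp + fn)                      # found among actual Dog
specificity = tn / (tn + fp)                 # correct among actual Not Dog
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity), ("f1", f1)]:
    print(f"{name}: {value:.3f}")
```

Note that tp + fn equals the 6 actual Dog images and tn + fp equals the 4 actual Not Dog images, which is a quick consistency check on the counts.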

 https://www.geeksforgeeks.org/confusion-matrix-machine-learning/
Evaluating Classification Models (AUC-ROC Curve)
 The ROC curve is a graphical representation of the effectiveness of
a binary classification model.
 It plots the true positive rate (TPR) against the false positive rate (FPR)
at different classification thresholds.
 AUC stands for Area Under the Curve.
 The AUC is the area under the ROC curve.
 TPR and FPR both range from 0 to 1, so the area always lies
between 0 and 1, and a greater value of AUC denotes better
model performance.
 The goal is to maximize this area in order to have the highest TPR
and lowest FPR at a given threshold.
 The AUC measures the probability that the model will assign a
randomly chosen positive instance a higher predicted probability
than a randomly chosen negative instance.
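
The probabilistic reading of the AUC can be sketched by comparing every positive instance with every negative instance and counting how often the positive one receives the higher score (ties count as half). The labels and scores are hypothetical, and the function name `auc_by_pairs` is illustrative.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = positive) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

auc = auc_by_pairs(labels, scores)
print("AUC:", auc)
```

Here 8 of the 9 positive/negative pairs are ranked correctly; the one error is the positive with score 0.4 ranked below the negative with score 0.6.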


Evaluating Classification Models (AUC-ROC Curve)
 The ROC curve is a graph that shows the
performance of a classification model at all possible
thresholds (a threshold is a particular value beyond which you
say a point belongs to a particular class).
 The curve is plotted between two parameters:
 TPR – True Positive Rate (Recall)
 FPR – False Positive Rate
Evaluating Classification Models (AUC-ROC Curve)
 TPR (recall, or sensitivity) is the proportion of positive
examples that are correctly identified.
 It represents the ability of the model to correctly identify
positive instances and is calculated as follows:
TPR = TP / (TP + FN)
 FPR is the proportion of negative examples that are incorrectly
classified as positive:
FPR = FP / (FP + TN)
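
A minimal sketch of how points on the ROC curve are obtained: for each threshold, count TP, FN, FP, and TN and compute the two rates. The labels, scores, and thresholds are hypothetical, and a record is treated as positive when its score is above the threshold.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores above the threshold are predicted positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s > threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s <= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s > threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s <= threshold)
    tpr = tp / (tp + fn)  # share of actual positives correctly identified
    fpr = fp / (fp + tn)  # share of actual negatives incorrectly flagged
    return fpr, tpr


# Hypothetical labels (1 = positive) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

points = [roc_point(labels, scores, t) for t in (0.0, 0.5, 1.0)]
for threshold, (fpr, tpr) in zip((0.0, 0.5, 1.0), points):
    print(f"threshold {threshold}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

The lowest threshold flags everything as positive (FPR = TPR = 1) and the highest flags nothing (FPR = TPR = 0), so the curve always runs between those two corners.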

Evaluating Classification Models (AUC-ROC Curve)
 As noted earlier, the ROC curve plots TPR against
FPR across all possible thresholds, and the AUC is the entire area
beneath this ROC curve.
 Let us look at AUC-ROC from a probabilistic point of view.
 AUC measures how well a model is able to distinguish between classes.
 [Figure: ROC curves of two models, red and blue. The black dots are the TPR and FPR at different probability thresholds. Which model performed better?]
Strategies for imbalanced data
 Balanced Dataset: In a Balanced dataset, there is approximately
equal distribution of classes in the target column.
 Imbalanced Dataset: In an Imbalanced dataset, there is a highly
unequal distribution of classes in the target column.
 Example : Suppose there is a Binary Classification problem with
the following training data:
 Total Observations : 1000
 Target variable class is either ‘Yes’ or ‘No’.
 Case 1:
 If there are 900 ‘Yes’ and 100 ‘No’, then it represents an imbalanced dataset, as
there is a highly unequal distribution of the two classes.
 Case 2:
 If there are 550 ‘Yes’ and 450 ‘No’, then it represents a balanced dataset, as
there is an approximately equal distribution of the two classes.
Strategies for imbalanced data
 An imbalanced data distribution generally arises when the number of
observations in one of the classes is much higher or lower than
in the other classes.
 This problem is prevalent in applications such as fraud detection,
anomaly detection, and facial recognition.
 Standard ML techniques such as decision trees and logistic
regression have a bias towards the majority class, and they tend
to ignore the minority class. They tend to predict only the
majority class, and so misclassify the minority class far more
often than the majority class.
 In more technical terms, if our dataset has an imbalanced data
distribution, our model becomes prone to negligible or very low
recall for the minority class.
Strategies for imbalanced data
 Hence, there is a significant difference between the
sample sizes of the two classes in an imbalanced dataset.
 Problems with an imbalanced dataset:
 Algorithms may become biased towards the majority class and thus
tend to predict the majority class as output.
 An imbalanced dataset can give a misleading accuracy score.
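
The misleading accuracy score can be illustrated with the 900 'Yes' / 100 'No' case described earlier. The sketch assumes a hypothetical model that always predicts the majority class.

```python
# 900 'Yes' and 100 'No' records, as in the imbalanced example.
actual = ["Yes"] * 900 + ["No"] * 100
# Hypothetical model that ignores the input and always predicts 'Yes'.
predicted = ["Yes"] * 1000

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

no_total = sum(1 for a in actual if a == "No")
no_found = sum(1 for a, p in zip(actual, predicted) if a == "No" and p == "No")
recall_no = no_found / no_total  # recall for the minority class

print("accuracy:", accuracy)
print("recall for 'No':", recall_no)
```

The accuracy looks high even though the model never identifies a single minority-class record.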
 Two main types of balancing data:
 Up/down sampling
 SMOTE (Synthetic Minority Oversampling Technique)

Strategies for imbalanced data (Over/Up-Sample Minority Class)

 In up-sampling, samples from minority
classes are randomly duplicated so as to
achieve equivalence with the majority class.
Strategies for imbalanced data (Over/Up-Sample Minority Class)

 Using RandomOverSampler:
 This can be done with the
RandomOverSampler class in imblearn.
 It picks existing minority-class samples at random, with
replacement (by default), and adds the copies to the dataset.
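
The idea behind random oversampling can be sketched in plain Python. This is not the imblearn implementation; the function `random_oversample` and the toy data are hypothetical and only show the duplication step.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == label]
        for _ in range(target - count):
            new_rows.append(rng.choice(pool))  # sampling with replacement
            new_labels.append(label)
    return new_rows, new_labels


# Hypothetical toy data: 8 majority rows and 2 minority rows.
rows = [[i] for i in range(10)]
labels = ["Yes"] * 8 + ["No"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Every added row is an exact copy of an existing minority row, so no new information is created; the classes are only rebalanced.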

Strategies for imbalanced data (Over/Up-Sample Minority Class)

 Synthetic Minority Oversampling Technique (SMOTE): It is used
to generate artificial/synthetic samples for the minority class.
 This technique works by randomly choosing a sample from the
minority class and determining the K nearest neighbors of this
sample; an artificial sample is then added between the picked
sample and one of its neighbors. An implementation is available in the imblearn
module.
 The minority class is given as the input vector.
 Determine its K nearest neighbours.
 Pick one of these neighbors and place an artificial sample point anywhere
between the neighbor and the sample point under consideration.
 Repeat until the dataset is balanced.
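
The steps above can be sketched in plain Python. This is a simplified illustration and not the imblearn implementation: the function name `smote_sketch`, the value of k, and the minority points are all hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        sample = rng.choice(minority)
        # Sort the other minority points by squared distance to the sample.
        others = [p for p in minority if p is not sample]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, sample)))
        neighbor = rng.choice(others[:k])  # one of the k nearest neighbors
        gap = rng.random()                 # position between sample and neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(sample, neighbor)])
    return synthetic


# Hypothetical minority-class points in a 2-D feature space.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]

new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(point)
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbors, so it differs from a plain duplicate.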

Strategies for imbalanced data (Down/Under-sample Majority Class)

 Down/under-sampling is the process of randomly
selecting samples of the majority class and removing them
in order to prevent them from dominating the
minority class in the dataset.
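
A plain-Python sketch of random undersampling, assuming the goal is to shrink every class to the size of the smallest one. The function `random_undersample` and the toy data are hypothetical and this is not the imblearn implementation.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())  # size of the minority class
    kept = []
    for label in counts:
        indices = [i for i, y in enumerate(labels) if y == label]
        kept.extend(rng.sample(indices, target))  # sampling without replacement
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 8 majority rows and 2 minority rows.
rows = [[i] for i in range(10)]
labels = ["Yes"] * 8 + ["No"] * 2

small_rows, small_labels = random_undersample(rows, labels)
print(Counter(small_labels))
```

All minority rows are kept, while most majority rows are discarded, which balances the classes at the cost of losing data.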

Strategies for imbalanced data (Down/Under-sample Majority Class)

 Using RandomUnderSampler

Summary

 We learned different classification methods of machine learning
 We have learned performance metrics for evaluating models
 We have learned about data imbalances and methods to balance datasets