Classification Techniques

Chapter 4 (MLITAD)
Syllabus
• k Nearest Neighbor, Support Vector Machine, Decision Tree (CART),
Issues in Decision Tree, Ensemble Techniques - Bagging, Boosting,
Evaluation Metrics, Use cases
Decision Tree - Classification
• Decision tree builds classification models in the form of a tree
structure.
• It breaks down a dataset into smaller and smaller subsets while, at the
same time, an associated decision tree is incrementally developed.
• The final result is a tree with decision nodes and leaf nodes.
• A decision node has two or more branches.
• A leaf node represents a classification or decision.
• The topmost decision node in a tree, which corresponds to the best
predictor, is called the root node.
• Decision trees can handle both categorical and numerical data.
Classification Model
What is node impurity/purity in decision trees?
• The decision tree is a greedy algorithm that performs a recursive binary partitioning of the
feature space.
• The tree predicts the same label for each bottommost (leaf) partition.
• Each partition is chosen greedily by selecting the best split from a set of possible splits.

Consider, as an example, the set of atoms in a metallic ball:
• If all of the ball's atoms were gold - you would say that the ball is purely gold, and that its
purity level is highest (and its impurity level is lowest).
• Similarly, if all of the examples in the set were of the same class, then the set's purity
would be highest.
• If 1/3 of the atoms were gold, 1/3 silver, and 1/3 iron - you would say that for a ball made
of 3 kinds of atoms, its purity is lowest.
• Similarly, if the examples are split evenly between all of the classes, then the set's purity is
lowest.
• So the purity of a set of examples is the homogeneity of its examples - with regard to their
classes.
• The Gini index is one of the most popular measures of impurity.
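
A minimal sketch (in Python/NumPy, not from the original slides) of how the Gini impurity of a node can be computed from its class labels; the metal-ball labels are only illustrative.

import numpy as np

def gini_impurity(labels):
    # Gini impurity of a set of labels: 1 - sum over classes of p_k^2
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; an evenly split node has the highest impurity.
print(gini_impurity(["gold"] * 9))                    # 0.0  (pure)
print(gini_impurity(["gold", "silver", "iron"] * 3))  # ~0.667 (maximally impure for 3 classes)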
CART Algorithm
• CART is an abbreviation of Classification And Regression Trees.
• Rather than general trees that could have multiple branches, CART builds
a binary tree, which has only two branches from each node.
• CART uses Gini impurity as the criterion to split a node, not Information Gain.
• CART supports numerical target variables, which enables it to become a
Regression Tree that predicts continuous values.
• Unlike the ID3 and C4.5 algorithms, which rely on Information Gain as the
criterion to split nodes, the CART algorithm makes use of another criterion,
called Gini, to split the nodes.
CART Algorithm
• The CART algorithm intuitively uses the Gini coefficient for a similar
purpose: the larger the Gini coefficient, the larger the impurity of the node.
• Similar to ID3 and C4.5 using Information Gain to select the node with
more uncertainty, the Gini coefficient guides the CART algorithm to find
the node with larger uncertainty (i.e. impurity) and then split it.
• The Gini index is a metric that measures how often a randomly chosen element
would be incorrectly identified.
• This means an attribute with a lower Gini index should be preferred.
• Sklearn supports the "gini" criterion for the Gini index, and it is the default
value of the criterion parameter.
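
A minimal sketch (not from the slides) of fitting a CART-style tree with scikit-learn's DecisionTreeClassifier and the "gini" criterion; the iris dataset is used purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" is the default; it is written out explicitly here
cart = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
cart.fit(X_train, y_train)
print("test accuracy:", cart.score(X_test, y_test))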
Example 1
The data below records whether a person plays a ground game, based on age and
gender: 'what kind of person would play a ground game?'. We will split the data
in a binary fashion: Gender = F or M, and Age <= 25 or Age > 25.

Age   Gender   Sportive
22    F        yes
24    M        yes
30    F        yes
31    F        no
27    F        no
32    M        no
25    F        yes
30    M        no
24    F        yes
21    F        yes
29    M        yes
26    M        no
21    M        no
Solution Step 1

Gender    Sportive-yes   Sportive-no   Total
Female    5              2             7
Male      2              4             6

Age       Sportive-yes   Sportive-no   Total
<= 25     5              1             6
> 25      2              5             7
Step 2: Find Gini index of attributes
(Target attribute)
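The worked calculation is not reproduced in this export; below is a minimal Python sketch of the weighted Gini computation implied by this step, using the counts from the Step 1 tables.

def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

# Gender split: Female (5 yes, 2 no) vs Male (2 yes, 4 no)
gini_gender = (7 / 13) * gini(5, 2) + (6 / 13) * gini(2, 4)   # ~0.425

# Age split: <= 25 (5 yes, 1 no) vs > 25 (2 yes, 5 no)
gini_age = (6 / 13) * gini(5, 1) + (7 / 13) * gini(2, 5)      # ~0.348

# Age has the lower weighted Gini index, so Age gives the better (root) split.
print(gini_gender, gini_age)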

Example 2: Design a decision tree using the CART algorithm for the given dataset.
Find the Gini index of all attributes:
• Outlook is the root node, since it has the lowest Gini index among all attributes.
• On the Sunny branch of Outlook, Humidity has the lowest Gini index, so it appears below the Outlook branch with Sunny as the value.
• On the Rain branch of Outlook, Wind has the lowest Gini index, so it appears below the Outlook branch with Rain as the value.
K-Nearest Neighbor(KNN)
Algorithm
Introduction
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms,
based on the Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the
available cases, and puts the new case into the category that is most similar
to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears, it can be
easily classified into a well-suited category using the K-NN algorithm.
• The K-NN algorithm can be used for Regression as well as Classification, but
it is mostly used for Classification problems.
Introduction
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead it stores the dataset and, at the time of
classification, performs an action on the dataset.
• At the training phase the KNN algorithm just stores the dataset; when it gets new
data, it classifies that data into the category that is most similar to the new
data.
• Example:
Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. A KNN model will find the features
of the new image that are similar to the cat and dog images, and based on the most
similar features it will put the image in either the cat or the dog category.

Why do we need a K-NN Algorithm?

• Suppose there are two categories, Category A and Category B, and we have a new data
point x1; in which of these categories will this data point lie?
• To solve this type of problem, we need the K-NN algorithm.
• With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?

• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new point to the data points.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.

As we can see, the 3 nearest neighbors are from Category A, hence this new data
point must belong to Category A.
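
A minimal sketch (not from the slides) of these steps using scikit-learn's KNeighborsClassifier; the Category A/B points and the new point are hypothetical.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points: 0 = Category A, 1 = Category B
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

x_new = np.array([[3, 4]])
print(knn.predict(x_new))     # majority class among the 3 nearest neighbours
print(knn.kneighbors(x_new))  # their Euclidean distances and indices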
How to select the value of K in the K-NN Algorithm?
• There is no particular way to determine the best value for "K", so we
need to try several values to find the best among them. The most
commonly preferred value for K is 5.
• A very low value of K, such as K=1 or K=2, can be noisy and make the model
sensitive to outliers.
• Larger values of K smooth out noise, but a value that is too large can blur
the class boundaries and pull in neighbors from other classes.
Advantages/ Disadvantages
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
• The value of K always needs to be determined, which can be complex at times.
• The computation cost is high, because the distance between the new data point
and all the training samples must be calculated.
KNN Example 1
• Consider the data with attributes (acid durability and strength) used to
classify whether a special paper tissue is good or not. For a new sample
with X1=3 and X2=7, predict the classification value using the KNN
algorithm with K=3.

X1 - Acid durability   X2 - Strength   y - Classification
7                      7               Bad
7                      4               Bad
3                      4               Good
1                      4               Good
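
The worked solution is not reproduced in this export; a minimal Python sketch of the computation follows. Under Euclidean distance the three nearest neighbours of (3, 7) are (3, 4) Good, (1, 4) Good and (7, 7) Bad, so the predicted class is Good.

import numpy as np

X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])   # (acid durability, strength)
y = np.array(["Bad", "Bad", "Good", "Good"])

x_new = np.array([3, 7])
dist = np.linalg.norm(X - x_new, axis=1)   # 4.00, 5.00, 3.00, 3.61
nearest = np.argsort(dist)[:3]             # indices of the 3 nearest neighbours
print(y[nearest])                          # ['Good' 'Good' 'Bad'] -> majority vote: Good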
KNN Example 2
Find the class label for given instance using KNN with K=5
Step 1: Find the distances
Step 2: Find the ranks
Step 3: Find the nearest neighbours to assign the class
Support Vector Machine (SVM) algorithm
What is the Support Vector Machine?
• “Support Vector Machine” (SVM) is a supervised machine
learning algorithm that can be used for both classification and
regression challenges.
• However, it is mostly used in classification problems.
• In the SVM algorithm, we plot each data item as a point in n-
dimensional space with the value of each feature being the
value of a particular coordinate.
• Then, we perform classification by finding the hyper-plane that
differentiates the two classes very well.
Goal of Support Vector Machine Algorithm
• The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so
that we can easily put the new data point in the correct category in
the future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors,
and hence the algorithm is termed Support Vector Machine.
Hyperplane and Support Vectors
Hyperplane in the SVM algorithm:
• Hyperplane: There can be multiple lines/decision boundaries to
segregate the classes in n-dimensional space, but we need to find out
the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.
• The dimensions of the hyperplane depend on the number of features present in
the dataset: if there are 2 features, the hyperplane is a straight line, and
if there are 3 features, the hyperplane is a 2-dimensional plane.
• We always create the hyperplane that has a maximum margin, which
means the maximum distance between the hyperplane and the nearest data
points of either class.
Support Vectors in the SVM algorithm:
• Support Vectors:
The data points or vectors that are closest to the hyperplane
and which affect the position of the hyperplane are termed
Support Vectors. Since these vectors support the hyperplane,
they are called support vectors.
How does SVM work?
• The working of the SVM algorithm can be understood by using an
example. Suppose we have a dataset that has two tags (green and
blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the new pair(x1, x2) of coordinates in either
green or blue
How does SVM work? Contd..

As it is a 2-D space, by just using a straight line we can easily separate these
two classes. But there can be multiple lines that can separate these classes.

The SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane.
How does SVM work? contd..

• Hence, the SVM algorithm helps to find the best line or decision
boundary; this best boundary or region is called a hyperplane.
• The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors.
• The distance between the vectors and the hyperplane is called
the margin, and the goal of SVM is to maximize this margin.
• The hyperplane with maximum margin is called the optimal
hyperplane.
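
A minimal sketch (not from the slides) of fitting a linear SVM with scikit-learn and inspecting its support vectors; the two-class points are hypothetical.

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable data with two tags (0 = blue, 1 = green)
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)   # the points closest to the maximum-margin hyperplane
print(clf.predict([[4, 4]]))  # classify a new (x1, x2) pair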
How does SVM work? contd..
Case 1:
Here the two classes cannot be segregated using a straight line, as one of the stars
lies in the territory of the other (circle) class as an outlier. The SVM algorithm has a
feature to ignore outliers and find the hyper-plane that has the maximum
margin. Hence, we can say that SVM classification is robust to outliers.
Case 2: How does SVM work?

In the scenario below, we can’t have a linear hyper-plane between the two classes, so how does SVM
classify these two classes? SVM can solve this problem by introducing an additional feature. Here, we
add a new feature z = x^2 + y^2. Now, let’s plot the data points on the x and z axes.

In the resulting plot, the points to consider are:
• All values of z will always be positive, because z is the squared sum of both x and y.
• In the original plot, the red circles appear close to the origin of the x and y axes, leading to a lower
value of z, while the stars lie relatively far from the origin, resulting in a higher value of z.
SVM Kernel
• The SVM kernel is a function that takes a low-dimensional input space
and transforms it into a higher-dimensional space, i.e. it converts a non-
separable problem into a separable problem.
• It is mostly useful in non-linear separation problems. Simply put, the kernel
does some extremely complex data transformations, then finds out
the process to separate the data based on the labels or outputs
you’ve defined.
Support Vector Machines (Kernels)
• A Kernel Function is a method used to take data as input and transform it into the
required form for processing.
• The term “Kernel” is used because the set of mathematical functions used in the Support Vector
Machine provides the window to manipulate the data.
• So, the Kernel Function generally transforms the training set of data so that a non-linear
decision surface can be transformed into a linear equation in a higher-dimensional
space.
• Basically, it returns the inner product between two points in a suitable feature
space.

Types of SVM kernels:
• Polynomial Kernel
• Sigmoid Kernel
• Gaussian Radial Basis Function (RBF) Kernel
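
A minimal sketch (not from the slides) of the kernel idea with scikit-learn: the RBF kernel implicitly maps the data to a higher-dimensional space, playing the same role as the hand-crafted feature z = x^2 + y^2 above. The concentric-circles dataset is used only for illustration.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as concentric circles: not linearly separable in (x, y)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0)
rbf_svm.fit(X, y)
print("training accuracy:", rbf_svm.score(X, y))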
Ensemble Methods
Ensemble Methods: Increasing the Accuracy

• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of
classifiers. Eg. Random Forest
• Boosting: weighted vote with a collection of classifiers.
Eg. Ada Boost
Ensemble Methods
• Ensemble learning refers to algorithms that combine the predictions from two or
more models.
• Ensemble learning is a general meta approach to machine learning that seeks
better predictive performance by combining the predictions from multiple
models.
• The three main classes of ensemble learning methods are bagging, stacking,
and boosting
• Bagging involves fitting many decision trees on different samples of the same
dataset and averaging the predictions.
• Stacking involves fitting many different model types on the same data and using
another model to learn how best to combine the predictions.
• Boosting involves adding ensemble members sequentially that correct the
predictions made by prior models and outputs a weighted average of the
predictions.
Bagging
• The idea behind bagging is combining the results of multiple models (for
instance, all decision trees) to get a generalized result.
• Here’s a question: if you create all the models on the same set of data and
combine them, will it be useful? There is a high chance that these models will
give the same result, since they are getting the same input. So how can we
solve this problem? One of the techniques is bootstrapping.
• Bootstrapping is a sampling technique in which we create subsets of
observations from the original dataset, with replacement. The size of the
subsets is the same as the size of the original set.
• Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to
get a fair idea of the distribution (complete set). The size of subsets created
for bagging may be less than the original set.
Bootstrapping
Bagging
1.Multiple subsets are created from the original dataset, selecting observations with replacement.
2.A base model (weak model) is created on each of these subsets.
3.The models run in parallel and are independent of each other.
4.The final predictions are determined by combining the predictions from all the models
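
A minimal sketch (not from the slides) of these four steps using scikit-learn's BaggingClassifier, whose default base model is a decision tree; the synthetic dataset is only for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each base model is fit on a bootstrap sample (steps 1-3);
# predictions are combined by voting/averaging (step 4).
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())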
Boosting

• If a data point is incorrectly predicted by the first model, and then by the next
(and probably by all models), will combining the predictions provide better results?
Such situations are taken care of by boosting.
• Boosting is a sequential process, where each subsequent model attempts to
correct the errors of the previous model. The succeeding models are
dependent on the previous model.
Let’s understand the way boosting works in the below steps.
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset
Boosting
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted are given higher weights.
7. Another model is created and predictions are made on the dataset.
(This model tries to correct the errors from the previous model.)
8. Similarly, multiple models are created, each correcting the errors of the
previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak
learners).

Thus, the boosting algorithm combines a number of weak learners to form a
strong learner. The individual models would not perform well on the entire
dataset, but they work well for some part of the dataset. Thus, each model
actually boosts the performance of the ensemble.
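
A minimal sketch (not from the slides) of this sequential idea using scikit-learn's AdaBoostClassifier, one of the boosting algorithms listed on the next slide; the synthetic dataset is only for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each new weak learner focuses on the examples the previous ones misclassified;
# the final prediction is a weighted combination of all the weak learners.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())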
Algorithms based on Bagging and Boosting

Bagging and Boosting are two of the most commonly used techniques in
machine learning. Following are the algorithms we will be focusing on:

Bagging algorithms:
• Bagging meta-estimator
• Random forest
Boosting algorithms:
• AdaBoost
• GBM
• XGBM
• Light GBM
• CatBoost
Case study 1: Bagging
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, more robust
• Proven to give improved accuracy in prediction
Case study 1:Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How boosting works?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Adaboost algorithm
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, otherwise it is
decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi error
rate is the sum of the weights of the misclassified tuples:
error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)

• The weight of classifier Mi's vote is

\log \frac{1 - error(M_i)}{error(M_i)}
Random Forest algorithm
• Random Forest:
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
• During classification, each tree votes and the most popular class is
returned
• Two Methods to construct Random Forest:
• Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
• Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each
split, and faster than bagging or boosting
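
A minimal sketch (not from the slides) using scikit-learn's RandomForestClassifier; max_features limits how many attributes are considered at each split, in the spirit of Forest-RI. The synthetic dataset is only for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each tree sees a bootstrap sample and a random subset of attributes at every split;
# during classification the trees vote and the most popular class wins.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())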

Confusion matrix

The confusion matrix has the following cases:
• True Negative: The model has predicted No, and the real or actual value was also No.
• True Positive: The model has predicted Yes, and the actual value was also Yes.
• False Negative: The model has predicted No, but the actual value was Yes; this is also called a Type-II error.
• False Positive: The model has predicted Yes, but the actual value was No; this is also called a Type-I error.
Calculations using Confusion Matrix:
• Classification Accuracy: It defines how often the model predicts the correct
output. It is calculated as the ratio of the number of correct predictions made by the
classifier to the total number of predictions made by the classifier.
• Misclassification rate: Also termed the Error rate, it defines how often the
model gives wrong predictions. It is calculated as the ratio of the number of
incorrect predictions to the total number of predictions made by the classifier.
• Precision: Out of all the instances that the model predicted as positive, how many
were actually positive.
• Recall (Sensitivity): Out of all the actual positive instances, how many the model
predicted correctly. The recall should be as high as possible.
• F-measure: If two models have low precision and high recall or vice versa, it is
difficult to compare them. For this purpose we can use the F-score, which
evaluates recall and precision at the same time. The F-score is maximum when
recall equals precision.
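
The formula figures are not reproduced in this export; the sketch below collects the standard definitions in Python, with each formula as a comment. The counts used in the call are the ones implied by the worked example on the next slide.

def classification_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    accuracy  = (tp + tn) / total               # correct predictions / all predictions
    error     = (fp + fn) / total               # misclassification (error) rate
    precision = tp / (tp + fp)                  # predicted positives that are truly positive
    recall    = tp / (tp + fn)                  # actual positives that were found (sensitivity)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, error, precision, recall, f1

print(classification_metrics(tp=90, tn=9560, fp=140, fn=210))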
Example: Confusion Matrix

Precision = 90/230 = 39.13%        Recall (Sensitivity) = 90/300 = 30.00%

Specificity = 9560/9700 = 98.56%   Accuracy = (90+9560)/10000 = 96.5%
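
The matrix itself is not reproduced here; the counts consistent with these figures are TP = 90, FN = 210, FP = 140 and TN = 9560, out of 10,000 predictions in total.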


Other important terms used in the Confusion Matrix:
• Null Error rate: It defines how often our model would be incorrect if it always
predicted the majority class. As per the accuracy paradox, the best classifier for a
particular application can sometimes have a higher error rate than the null error rate.
• ROC Curve: The ROC curve is a graph displaying a classifier's performance for all
possible thresholds. The graph plots the true positive rate, i.e. sensitivity, on the
Y-axis against the false positive rate, i.e. 1 - specificity, on the X-axis.
A ROC curve
• A ROC curve is a diagnostic plot for summarizing the behavior of a model by
calculating the false positive rate and true positive rate for a set of predictions
by the model under different thresholds.
The true positive rate is the recall or sensitivity.
• TruePositiveRate = TruePositive / (TruePositive + FalseNegative)
The false positive rate is the complement of specificity (1 - specificity).
• FalsePositiveRate = FalsePositive / (FalsePositive + TrueNegative)
• Each threshold is a point on the plot and the points are connected to form a
curve. A classifier that has no skill (e.g. predicts the majority class under all
thresholds) will be represented by a diagonal line from the bottom left to the
top right.
• Any points below this line have worse than no skill. A perfect model will be a
point in the top left of the plot.
A ROC curve

The ROC Curve is a helpful diagnostic for one model. The area under the ROC curve
can be calculated and provides a single score that summarizes the plot and can be
used to compare models.

A no-skill classifier will have a score of 0.5, whereas a perfect classifier will
have a score of 1.0.
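
A minimal sketch (not from the slides) of computing the ROC curve and the area under it with scikit-learn; the logistic regression model and synthetic data are only for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))      # 0.5 = no skill, 1.0 = perfect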
Review questions
❑Justify the statement “KNN is called a lazy learner algorithm”.

❑Define the k-Nearest Neighbours (KNN) algorithm and explain its fundamental principle.
❑Define the CART algorithm and explain its primary purpose in machine learning
❑Define Bagging as ensemble methods in machine learning
❑State the issues in decision tree learning
❑Define a Support Vector Machine (SVM) and explain its primary purpose in machine learning.
❑Define Boosting as ensemble methods in machine learning
❑Explain the core idea behind Bagging and how it differs from Boosting in terms of how base models
are combined.
❑Describe the concept of a margin in SVMs. How does the margin relate to the separation of data
points in SVM classification?
❑Compare and contrast linear and non-linear SVMs. Discuss when you would choose one over the
other

❑Discuss the impact of choosing a larger value of "k" in KNN. How does it affect the model's bias and
variance, and what considerations should be made when selecting an appropriate "k"?
❑ Compare and contrast the strengths and weaknesses of classification algorithms like KNN,
decision trees, and support vector machines. Under what circumstances would you prefer
to use these alternatives?
❑ Design a scenario in which KNN can be applied to solve a practical problem. Specify the
dataset characteristics, explain how you would preprocess the data, and outline the steps
to choose the most suitable "k" value for the problem.
❑ Design a scenario where an SVM can be applied to solve a practical problem. Outline the
dataset requirements, preprocessing steps, kernel choice, and hyperparameter tuning
process
❑ Suppose you have trained an SVM model for a classification task. Discuss the
performance evaluation metrics you would use to assess the model's quality and provide
guidelines on interpreting the results.
❑ Discuss the theoretical foundations and practical applications of Support Vector Machines
(SVM) in machine learning. Illustrate your answer with appropriate examples and
diagrams where necessary.
❑ Compare and contrast Bagging and Boosting in terms of their approach to handling model
diversity. How does each method encourage diversity among base models, and how does
this impact ensemble performance?
❑ Compare and contrast the Gini impurity and entropy as criteria for splitting nodes in a
decision tree.
• Consider the data with attributes (acid durability and strength) to classify whether a special paper
tissue is good or not. For new sample with X1=3 and x2=7, predict the classification value using
KNN algorithm with K=3

• For the given data of sports, on which age and gender are taking a part in a decision on ‘what kind
of person would play ground-game? Find the root node of a decision tree using CART. Note:
consider age =< 25 and age > 25.
