CH 04 Classification Techniques
Chapter 4 (MLITAD)
Syllabus
• k-Nearest Neighbor, Support Vector Machine, Decision Tree (CART),
Issues in Decision Trees, Ensemble Techniques - Bagging, Boosting,
Evaluation Metrics, Use cases
Decision Tree - Classification
• Decision tree builds classification models in the form of a tree structure.
• It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
• The final result is a tree with decision nodes and leaf nodes.
• A decision node has two or more branches.
• A leaf node represents a classification or decision.
• The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
• Decision trees can handle both categorical and numerical data.
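Below is a minimal, illustrative sketch (not from the slides) of how such a classification tree can be built and queried with scikit-learn; the feature values and labels are made up for illustration.

from sklearn.tree import DecisionTreeClassifier

# Toy dataset: each row is [feature_1, feature_2]; the labels are the two classes 0 and 1.
X = [[2.0, 3.0], [1.0, 1.0], [3.0, 2.5], [6.0, 7.5], [7.0, 8.0], [8.0, 6.5]]
y = [0, 0, 0, 1, 1, 1]

# criterion="gini" makes the tree use the Gini index (as in CART) to pick each split.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Decision nodes test a feature; leaf nodes output the predicted class.
print(tree.predict([[2.5, 2.0], [7.5, 7.0]]))  # expected output: [0 1]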
Classification Model
What is node impurity/purity in decision trees?
• Node impurity measures how mixed the class labels are within a node: a pure node holds tuples of a single class, while an impure node holds a mixture of classes (see the helper sketched after this list).
• The decision tree is a greedy algorithm that performs a recursive binary partitioning of the feature space.
• The tree predicts the same label for each bottommost (leaf) partition.
• Each partition is chosen greedily by selecting the best split from a set of possible splits.
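As a small illustrative helper (not slide material), the Gini impurity used by CART can be computed directly from the class labels at a node:

from collections import Counter

def gini_impurity(labels):
    # Gini impurity = 1 - sum(p_k^2) over the classes present at the node.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["Yes", "Yes", "Yes"]))       # 0.0 -> pure node
print(gini_impurity(["Yes", "Yes", "No", "No"]))  # 0.5 -> maximally impure for two classes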
Example 2: Design a decision tree using the CART algorithm for the given dataset.
Find the Gini index of all attributes (a computation sketch follows below).
Outlook is the root node since it has the lowest Gini index among all attributes.
The Gini index of Humidity is the lowest under the Sunny branch, so it appears below the Outlook branch with Sunny as the value.
The Gini index of Wind is the lowest under the Rain branch, so it appears below the Outlook branch with Rain as the value.
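A hedged sketch of the computation referred to above: the slide's actual dataset is not reproduced here, so the Outlook/Play values below are only stand-ins. The Gini index of a categorical attribute is the impurity of each branch weighted by the fraction of rows it receives; the attribute with the lowest value is chosen as the split.

from collections import Counter, defaultdict

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_index(attribute_column, label_column):
    # Group the class labels by the attribute value they fall under.
    groups = defaultdict(list)
    for value, label in zip(attribute_column, label_column):
        groups[value].append(label)
    n = len(label_column)
    # Weighted sum of the impurity of each partition created by the attribute.
    return sum(len(g) / n * gini_impurity(g) for g in groups.values())

# Hypothetical Outlook / Play columns standing in for the slide's dataset.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"]
play    = ["No",    "No",    "Yes",      "Yes",  "No",   "Yes"]
print(gini_index(outlook, play))  # the attribute with the lowest Gini index becomes the root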
K-Nearest Neighbor (KNN)
Algorithm
Introduction
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
• The K-NN algorithm can be used for Regression as well as Classification, but it is mostly used for Classification problems.
Introduction
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
• At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
• Example:
Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. The KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
• Suppose there are two categories, Category A and Category B, and we have a new data point x1; which of these categories will the data point lie in?
• To solve this type of problem, we need the K-NN algorithm.
• With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?
X1 (Acid Durability) | X2 (Strength) | Y (Classification)
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good
KNN Example 2
Find the class label for the given instance using KNN with K=5.
Step 1: Find the distances.
Step 2: Find the ranks. (A from-scratch sketch of these two steps follows below.)
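The sketch below applies these two steps to the four acid-durability/strength rows from the "How does K-NN work?" table above, using K=3 (the value used in the chapter-end exercise); Example 2's own data and K=5 can be substituted in the same way.

import math
from collections import Counter

# Training rows: ((X1 acid durability, X2 strength), class)
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)   # the new sample from the chapter-end exercise
K = 3

# Step 1: find the Euclidean distance from the query to every training point.
distances = [(math.dist(query, features), label) for features, label in train]

# Step 2: rank by distance and keep the K nearest neighbours.
nearest = sorted(distances)[:K]

# Majority vote among the K neighbours decides the class label.
votes = Counter(label for _, label in nearest)
print(nearest)                     # the 3 closest rows with their distances
print(votes.most_common(1)[0][0])  # predicted class: "Good"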
• Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
• The SVM algorithm finds the closest points to this boundary from both classes. These points are called support vectors.
• The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
• The hyperplane with the maximum margin is called the optimal hyperplane.
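A minimal usage sketch (toy data, not from the slides) showing how a fitted linear SVM exposes its support vectors and the maximum-margin hyperplane:

from sklearn.svm import SVC

# Two linearly separable toy classes (hypothetical values).
X = [[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)        # the points lying closest to the decision boundary
print(clf.coef_, clf.intercept_)   # w and b of the maximum-margin hyperplane w.x + b = 0
print(clf.predict([[3, 3], [7, 5]]))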
How does SVM work? (contd.)
Case 1: How does SVM work? (contd.)
It is unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier. The SVM algorithm has a feature to ignore outliers and find the hyperplane that has the maximum margin. Hence, we can say that SVM classification is robust to outliers.
Case 2: How does SVM work?
In the scenario below, we cannot have a linear hyperplane between the two classes, so how does SVM classify them? SVM can solve this problem by introducing an additional feature. Here, we will add a new feature z = x^2 + y^2. Now, let's plot the data points on the x and z axes:
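The sketch below illustrates this idea on synthetic circular data (an assumption, since the slide's plot is not reproduced): after adding the explicit feature z = x^2 + y^2, a plain linear SVM separates the two classes.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Inner disc = class 0, outer ring = class 1 (synthetic data, for illustration only).
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.concatenate([rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 50 + [1] * 50)

# Add the explicit feature z = x^2 + y^2; a linear SVM can now separate the classes.
Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
clf = SVC(kernel="linear").fit(Z, y)
print(clf.score(Z, y))  # expected to be 1.0 on this separable toy data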
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of classifiers, e.g. Random Forest
• Boosting: weighted vote with a collection of classifiers, e.g. AdaBoost
Ensemble Methods
• Ensemble learning refers to algorithms that combine the predictions from two or
more models.
• Ensemble learning is a general meta approach to machine learning that seeks
better predictive performance by combining the predictions from multiple
models.
• The three main classes of ensemble learning methods are bagging, stacking,
and boosting
• Bagging involves fitting many decision trees on different samples of the same
dataset and averaging the predictions.
• Stacking involves fitting many different model types on the same data and using another model to learn how to best combine the predictions.
• Boosting involves adding ensemble members sequentially that correct the
predictions made by prior models and outputs a weighted average of the
predictions.
Bagging
• The idea behind bagging is combining the results of multiple models (for
instance, all decision trees) to get a generalized result.
• Here's a question: if you create all the models on the same set of data and combine them, will it be useful? There is a high chance that these models will give the same result, since they are getting the same input. So how can we solve this problem? One of the techniques is bootstrapping.
• Bootstrapping is a sampling technique in which we create subsets of
observations from the original dataset, with replacement. The size of the
subsets is the same as the size of the original set.
• Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to
get a fair idea of the distribution (complete set). The size of subsets created
for bagging may be less than the original set.
Bootstrapping
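A tiny sketch of bootstrap sampling (illustrative values only): each bag is drawn with replacement and has the same size as the original set, so some rows repeat while others are left out.

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)   # stand-in for the row indices of the original dataset

for b in range(3):
    # Sampling with replacement: some rows repeat, others are left out entirely.
    bag = rng.choice(data, size=len(data), replace=True)
    print(f"bag {b}: {bag}")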
Bagging
1. Multiple subsets are created from the original dataset, selecting observations with replacement.
2. A base model (weak model) is created on each of these subsets.
3. The models run in parallel and are independent of each other.
4. The final predictions are determined by combining the predictions from all the models (see the sketch after this list).
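A hedged sketch of steps 1-4 using scikit-learn's BaggingClassifier on synthetic data; by default its base (weak) model is a decision tree, and the individual predictions are combined by voting.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset standing in for any classification problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 25 base models, each fit on its own bootstrap sample; predictions are combined by voting.
bagger = BaggingClassifier(n_estimators=25, random_state=0)
bagger.fit(X_tr, y_tr)
print(bagger.score(X_te, y_te))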
Boosting
• If a data point is incorrectly predicted by the first model, and then by the next (probably by all models), will combining the predictions provide better results? Such situations are taken care of by boosting.
• Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model.
Let’s understand the way boosting works in the below steps.
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
Boosting
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted are given higher weights.
Bagging and Boosting are two of the most commonly used techniques in
machine learning. Following are the algorithms we will be focusing on:
Bagging algorithms:
• Bagging meta-estimator
• Random forest
Boosting algorithms:
• AdaBoost
• GBM
• XGBM
• Light GBM
• CatBoost
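As a minimal usage sketch of the first boosting algorithm in the list (synthetic data, default settings assumed), scikit-learn's AdaBoostClassifier adds shallow trees sequentially, each one paying more attention to the rows its predecessors misclassified.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Models are added sequentially; by default each base model is a one-level decision tree (stump).
booster = AdaBoostClassifier(n_estimators=50, random_state=1)
booster.fit(X_tr, y_tr)
print(booster.score(X_te, y_te))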
Case study 1: Bagging
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, more robust
• Proved improved accuracy in prediction
Case study 2: Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Adaboost algorithm
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, otherwise it is
decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. The error rate of classifier Mi is the weighted sum of the misclassified tuples:
  error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)
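A small numeric sketch of this error rate and of the weight update described above; the predictions of classifier Mi here are hypothetical.

import numpy as np

weights = np.full(5, 1 / 5)              # initially every tuple gets the same weight 1/d
y_true  = np.array([1, 1, 0, 1, 0])
y_pred  = np.array([1, 0, 0, 1, 1])      # hypothetical predictions of classifier Mi

misclassified = (y_true != y_pred).astype(float)   # err(Xj): 1 if tuple j is wrong, else 0
error = np.sum(weights * misclassified)            # error(Mi) = sum_j w_j * err(Xj)

# Correctly classified tuples are scaled down by error/(1 - error) and all weights are
# renormalised, so misclassified tuples carry relatively more weight in the next round.
weights[misclassified == 0] *= error / (1 - error)
weights /= weights.sum()
print(error, weights)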
Confusion matrix
• Misclassification rate: also termed the error rate, it defines how often the model gives wrong predictions. It is the number of incorrect predictions divided by the total number of predictions made by the classifier:
  Error rate = (FP + FN) / (TP + TN + FP + FN)
• Precision: out of all the instances the model predicted as positive, how many were actually positive. It can be calculated using the formula:
  Precision = TP / (TP + FP)
• Recall (Sensitivity): out of all the actual positive instances, how many the model predicted correctly. The recall should be as high as possible:
  Recall = TP / (TP + FN)
• F-measure: if one model has low precision and high recall, or vice versa, it is difficult to compare the models. For this purpose we use the F-score, which evaluates recall and precision at the same time; it is maximum when recall equals precision:
  F-measure = (2 × Precision × Recall) / (Precision + Recall)
(A short code check of these formulas follows below.)
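The formulas above can be verified with a few lines of code; the confusion-matrix counts (TP, FP, FN, TN) below are hypothetical.

# Hypothetical confusion-matrix counts.
TP, FP, FN, TN = 45, 5, 10, 40

accuracy   = (TP + TN) / (TP + TN + FP + FN)
error_rate = (FP + FN) / (TP + TN + FP + FN)   # misclassification rate = 1 - accuracy
precision  = TP / (TP + FP)                    # of all predicted positives, how many are truly positive
recall     = TP / (TP + FN)                    # of all actual positives, how many were found (sensitivity)
f_measure  = 2 * precision * recall / (precision + recall)

print(accuracy, error_rate, precision, recall, f_measure)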
Example: Confusion Matrix
❑Define the k-Nearest Neighbours (KNN) algorithm and explain its fundamental principle.
❑Define the CART algorithm and explain its primary purpose in machine learning
❑Define Bagging as an ensemble method in machine learning
❑State the issues in decision tree learning
❑Define a Support Vector Machine (SVM) and explain its primary purpose in machine learning.
❑Define Boosting as an ensemble method in machine learning
❑Explain the core idea behind Bagging and how it differs from Boosting in terms of how base models
are combined.
❑Describe the concept of a margin in SVMs. How does the margin relate to the separation of data
points in SVM classification?
❑Compare and contrast linear and non-linear SVMs. Discuss when you would choose one over the
other
❑Discuss the impact of choosing a larger value of "k" in KNN. How does it affect the model's bias and
variance, and what considerations should be made when selecting an appropriate "k"?
❑ Compare and contrast the strengths and weaknesses of classification algorithms like KNN,
decision trees, and support vector machines. Under what circumstances would you prefer
to use these alternatives?
❑ Design a scenario in which KNN can be applied to solve a practical problem. Specify the
dataset characteristics, explain how you would preprocess the data, and outline the steps
to choose the most suitable "k" value for the problem.
❑ Design a scenario where an SVM can be applied to solve a practical problem. Outline the
dataset requirements, preprocessing steps, kernel choice, and hyperparameter tuning
process
❑ Suppose you have trained an SVM model for a classification task. Discuss the
performance evaluation metrics you would use to assess the model's quality and provide
guidelines on interpreting the results.
❑ Discuss the theoretical foundations and practical applications of Support Vector Machines
(SVM) in machine learning. Illustrate your answer with appropriate examples and
diagrams where necessary.
❑ Compare and contrast Bagging and Boosting in terms of their approach to handling model
diversity. How does each method encourage diversity among base models, and how does
this impact ensemble performance?
❑ Compare and contrast the Gini impurity and entropy as criteria for splitting nodes in a
decision tree.
• Consider the data with attributes (acid durability and strength) to classify whether a special paper tissue is good or not. For a new sample with X1=3 and X2=7, predict the classification value using the KNN algorithm with K=3.
• For the given data of sports, in which age and gender take part in the decision on "what kind of person would play a ground game?", find the root node of a decision tree using CART. Note: consider age <= 25 and age > 25.