Decision Tree and Ensemble

Agenda

Decision Tree
· Intro
· Concept
· Entropy
· Information Gain
· Forward Pruning
· Backward Pruning
· Implementation of C5.0
Boosting & Bagging
· Intro
· What is Bagging and Boosting
· Comparing the results of Boosting and single model
· Parameters in Boosting

Bias & Variance


Workflow – Data to Deployment

Collect raw (input) data → Collect (output) ground truth → Engineer "predictive" features → Choose model type & complexity → "Train" an ML "model" → Deploy the model, make predictions on unlabeled data → Evaluate, iterate, improve the model
Workflow From Data to Decisions…
Dating the Data!
Data Nuances
 FEATURE NOISE – values may not be accurate
   Sensor malfunctioning, sensor bias, sensor resolution, …
   Call center notes, transcription errors, data entry errors
   Comments, tweets, blogs, news, scientific papers

 MISSING FEATURES – some feature values might be missing
   Sensor went down, communication/storage failure, human error

 NON-NORMAL FEATURE DISTRIBUTIONS
   Exponential and log-normal distributions are more common than normal
   Taking the log of such features helps (see the sketch after the figures below)

 HETEROGENEOUS FEATURES – differing ranges, scales, distributions
   E.g. age, income, temperature, RBC counts, blood pressure, …

 MULTI-MODALITY FEATURES
   Mix of numeric, symbolic, series, text, and image PER data point!
Not normal distributions!
[Figure: histogram of a skewed feature vs. histogram of log(feature)]

Hidden Treasures!
[Figure: the same feature on a log scale reveals structure hidden at the raw scale]
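A minimal base-R sketch of the point above, assuming a synthetic log-normally distributed feature: the raw histogram is heavily skewed, while the log-transformed one looks roughly normal.

```r
set.seed(42)
feature <- rlnorm(10000, meanlog = 0, sdlog = 1)   # synthetic log-normal feature

par(mfrow = c(1, 2))
hist(feature,      main = "feature",      xlab = "feature")       # heavily right-skewed
hist(log(feature), main = "log(feature)", xlab = "log(feature)")  # roughly normal
```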
Feature Engineering

[Workflow recap: Collect raw (input) data → Collect (output) ground truth → Engineer "predictive" features → Choose model type & complexity → "Train" an ML "model" → Deploy the model, make predictions on unlabeled data → Evaluate, iterate, improve the model]
Two Mindsets to Modeling

Model-Centric
• Throw all features in!
• Have enough data
• Build complex models
(Simple features → complex model)

Feature-Centric
• Carefully craft features
• Use domain knowledge
• Build simpler models
(Complex features → simple model)

Distribute Complexity Fairly
Simple features → complex model; complex features → simple model.
Engineer Features that make sense!

Raw Input:
 Time of current trans.
 Place of current trans.
 Time of prev. trans.
 Place of prev. trans.

Derived Feature 1:
 Distance(Prev → Current)
 TimeLag(Prev → Current)

Derived Feature 2:
 Velocity(Prev → Current)

Velocity(Prev → Current) = Distance(Prev → Current) / TimeLag(Prev → Current)
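A minimal sketch of these derived features in R. The column names (prev_time, cur_time, prev_lat, prev_lon, cur_lat, cur_lon) and the haversine distance helper are hypothetical; none of them come from the slides.

```r
# Great-circle distance in km between two (lat, lon) points (haversine formula)
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))
}

# trans: hypothetical data frame of transactions with previous/current time and place
trans$distance_km  <- haversine_km(trans$prev_lat, trans$prev_lon,
                                   trans$cur_lat,  trans$cur_lon)
trans$time_lag_hr  <- as.numeric(difftime(trans$cur_time, trans$prev_time, units = "hours"))
trans$velocity_kmh <- trans$distance_km / trans$time_lag_hr   # Derived Feature 2
```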
Feature Engineering
[Workflow recap, as above.]
What is Classification?

PARTITIONING the (FEATURE) SPACE into PURE REGIONS assigned to each CLASS.
Purity of a Region! (1 – Entropy)

Class 1 count  Class 2 count  p1    p2    Purity   1 – Entropy
10             10             0.50  0.50  LOW      0.00
100            100            0.50  0.50  LOW      0.00
100            50             0.67  0.33  MEDIUM   0.09
50             100            0.33  0.67  MEDIUM   0.09
100            10             0.91  0.09  HIGH     0.56
100            0              1.00  0.00  PERFECT  1.00

Purity of a Region! (Accuracy)

Class 1 count  Class 2 count  p1    p2    Purity   Accuracy
10             10             0.50  0.50  LOW      50%
100            100            0.50  0.50  LOW      50%
100            50             0.67  0.33  MEDIUM   67%
50             100            0.33  0.67  MEDIUM   67%
100            10             0.91  0.09  HIGH     91%
100            0              1.00  0.00  PERFECT  100%
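A short base-R check of the two tables above (the raw-count column names are my reading of the slide): purity as 1 − entropy and as the accuracy of the majority class.

```r
purity <- function(n1, n2) {
  p <- c(n1, n2) / (n1 + n2)
  p <- p[p > 0]                          # drop zero counts (0 * log2(0) treated as 0)
  entropy <- -sum(p * log2(p))
  c(one_minus_entropy = 1 - entropy,
    accuracy = max(p))                   # majority-class accuracy
}

purity(10, 10)    # LOW:     1 - entropy = 0.00, accuracy = 50%
purity(100, 50)   # MEDIUM:  1 - entropy ~ 0.08 (the slide rounds to 0.09), accuracy ~ 67%
purity(100, 10)   # HIGH:    1 - entropy ~ 0.56, accuracy ~ 91%
purity(100, 0)    # PERFECT: 1 - entropy = 1.00, accuracy = 100%
```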


Transformation / Partition / Purity
Decision boundary for 3-Class
Decision Boundaries for multi-class
SIMPLE Decision Boundary?
MEDIUM Decision Boundary!
COMPLEX Decision Boundary!
Model SIGNAL not NOISE

Model is too simple → UNDER-LEARN
Model is too complex → MEMORIZE
Model is just right → GENERALIZE
Generalization vs. Memorization
Generalize, don't Memorize!

[Figure: Model Accuracy vs. Model Complexity]
Classification: Steps

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Task: find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Diagram: the learning algorithm induces a model from the Training Set (Induction → Learn Model); the model is then applied to the Test Set to deduce the unknown class labels (Apply Model → Deduction).]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• …..
• …..
Decision Tree
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO

Another Example of a Decision Tree (same training data):
MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

[Same workflow slide as before, now with a decision tree as the model: the Tree Induction algorithm learns a decision tree from the Training Set (Tid 1–10), and the tree is then applied to the Test Set (Tid 11–15) to deduce the unknown class labels.]
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching the record at each node:
1. Refund = No → take the "No" branch to the MarSt node.
2. MarSt = Married → take the "Married" branch, which leads to the leaf labelled NO.
3. Assign Cheat = "No" to the test record.

(The deck steps through this traversal node by node over several slides; the TaxInc node is only reached for Single or Divorced records.)
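A hand-coded sketch of the tree above as nested rules in R (income in thousands, matching the 80K threshold). This is only to make the traversal concrete, not how C5.0 actually represents its trees.

```r
predict_cheat <- function(refund, marital_status, taxable_income_k) {
  if (refund == "Yes") {
    "No"
  } else if (marital_status == "Married") {
    "No"
  } else if (taxable_income_k < 80) {      # Single / Divorced branch
    "No"
  } else {
    "Yes"
  }
}

predict_cheat("No", "Married", 80)   # "No" -- the test record above
```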
Decision Tree Classification Task

[Recap of the earlier slide: the Tree Induction algorithm learns a decision tree from the Training Set, which is then applied to the Test Set to deduce the class labels.]
C5.0

A job offer to be considered begins at the root node, where it is passed through decision nodes that require choices to be made based on the attributes of the job. These choices split the data across branches that indicate potential outcomes of a decision, depicted here as yes or no outcomes, though in some cases there may be more than two possibilities. Where a final decision can be made, the tree is terminated by leaf nodes (also known as terminal nodes) that denote the action to be taken as the result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree.
Applications

• Credit scoring models in which the criteria that cause an applicant to be rejected
need to be clearly documented and free from bias
• Marketing studies of customer behaviour such as satisfaction or churn, which will
be shared with management or advertising agencies
• Diagnosis of medical conditions based on laboratory measurements, symptoms, or
the rate of disease progression
Divide and conquer

Decision trees are built using a heuristic called recursive partitioning. This approach is
also commonly known as divide and conquer because it splits the data into subsets,
which are then split repeatedly into even smaller subsets, until the algorithm determines
that the data within the subsets are sufficiently homogeneous, or another stopping
criterion has been met.
How Decision Tree works

To see how splitting a dataset can create a decision tree, imagine a bare root node that will
grow into a mature tree. At first, the root node represents the entire dataset, since no
splitting has transpired. Next, the decision tree algorithm must choose a feature to split
upon; ideally, it chooses the feature most predictive of the target class. The examples are
then partitioned into groups according to the distinct values of this feature, and the first
set of tree branches is formed.
C5.0 is one of the best implementations of the decision tree building methodology.

The first question is: which feature should be selected first?

There are various measures to identify the best splitting candidate. C5.0 uses entropy, a concept borrowed from information theory that quantifies the randomness, or disorder, within a set of class values.

Data sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality.

The decision tree hopes to find splits that reduce entropy, ultimately increasing homogeneity within the groups.
How do you identify good features?

SL No  Ball Size  Ball Color  Price  Useful for Play
1      10         Red         5      Y
2      1          Red         1      Y
3      50         Red         5      Y
4      100        Red         5      N
5      1000       Red         10     N

My aim is to determine whether a given ball is useful for playing.

Now, which of the above columns are most helpful to carry out this objective?

Can we quantify the usefulness of columns / features?
Entropy of a Distribution

Consider a UNIVERSE of possible events (dice = {1, 2, …, 6}):
 Probability of an event: Prob(2) = 1/6
 Number of bits to transmit that event: I(2) = −log_2(1/6) = log_2(6) ≈ 2.58 bits
 Expected number of bits for all events?

The Entropy is the Expected Information Content of all events:

H(p = (p_1 … p_n)) = Σ_i p_i log_2(1/p_i) = −Σ_i p_i log_2(p_i)
In General…

How do we measure information?

When something unexpected happens, we say it's big news. Conversely, when something is easily predictable, it's not really interesting.

So to quantify this "interesting-ness", the function should satisfy:

 if the probability of the event is 1 (predictable), then the function gives 0
 if the probability of the event is close to 0, then the function should give a high number
 if an event with probability 0.5 happens, it gives one bit of information

One natural measure that satisfies these constraints is I(X) = −log_2(p), where p is the probability of the event X. The unit is the bit, the same bit a computer uses: 0 or 1.
Information Theory 101

What is the "Information Content" of the following events?

 The sun rose in the east today: Prob(e) = 1 ⇒ I(e) = 0 bits
 The weather in London is cloudy: Prob(e) = 0.95 ⇒ I(e) = 0.074 bits
 A baby is born in my community: Prob(e) = 1/365 ⇒ I(e) = 8.52 bits
 A major earthquake hit LA: Prob(e) = 1/(15 × 365) ⇒ I(e) = 12.4 bits

The Information Content is proportional to RARITY.

I(event) = log_2(1 / Prob(event)) = −log_2(Prob(event))

Claude Elwood Shannon (1916–2001)
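A quick base-R check of the information-content values quoted above.

```r
info_bits <- function(p) -log2(p)

info_bits(1)              # 0 bits:      the sun rose in the east
info_bits(0.95)           # 0.074 bits:  London is cloudy
info_bits(1 / 365)        # ~8.5 bits:   a baby is born in my community
info_bits(1 / (15 * 365)) # ~12.4 bits:  a major earthquake hit LA
```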


Entropy

• Typically, entropy is measured in bits.
• If there are only two possible classes, entropy values can range from 0 to 1.
• For n classes, entropy ranges from 0 to log_2(n).
• The minimum value indicates that the sample is completely homogeneous, while the maximum value indicates that the data are as diverse as possible and no group has even a small plurality.

Entropy can be computed as:

Entropy(S) = −Σ_{i=1..c} p_i log_2(p_i)

For a given segment of data (S), the term c refers to the number of class levels and p_i refers to the proportion of values falling into class level i.
Entropy

Entropy for all possible two-class arrangements:

[Figure: two-class entropy as a function of the proportion x of one class]

As illustrated by the peak in entropy at x = 0.50, a 50–50 split results in maximum entropy. As one class increasingly dominates the other, the entropy reduces to zero.
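A one-line base-R sketch reproducing the two-class entropy curve described above.

```r
# Two-class entropy as a function of the proportion x of the first class
curve(-x * log2(x) - (1 - x) * log2(1 - x),
      from = 0.001, to = 0.999, xlab = "x", ylab = "Entropy")
```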
Information gain

To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that would result from a split on each possible feature, a measure known as information gain.

The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting from the split (S2):

InfoGain(F) = Entropy(S1) − Entropy(S2)

After splitting on the feature, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. The total entropy resulting from a split is the sum of the entropy of each of the n partitions, weighted by the proportion of examples falling in that partition (w_i):

Entropy(S2) = Σ_{i=1..n} w_i × Entropy(P_i)
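A minimal sketch of these two formulas in base R, applied to the toy ball data from the earlier slide (using the Ball Color and Price columns).

```r
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p), na.rm = TRUE)          # 0 * log2(0) treated as 0
}

info_gain <- function(feature, y) {
  w <- table(feature) / length(feature)     # proportion of examples in each partition
  entropy(y) - sum(w * tapply(y, feature, entropy))
}

play  <- c("Y", "Y", "Y", "N", "N")         # Useful for Play (toy ball data)
color <- c("Red", "Red", "Red", "Red", "Red")
price <- c(5, 1, 5, 5, 10)

info_gain(color, play)                      # 0: all balls are red, no information
info_gain(price, play)                      # > 0: price partially separates Y from N
```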
Example
Predict whether a potential movie would fall into one of three categories: Critical
Success, Mainstream Hit, or Box Office Bust.

Relationship between the film's estimated shooting budget, the number


of A-list celebrities lined up for starring roles, and the level of success.
Using the divide and conquer strategy, we can build a simple decision tree from this data.
First, to create the tree's root node, we split on the feature indicating the number of
celebrities, partitioning the movies into groups with and without a significant number of
A-list stars.
Next, among the group of movies with a larger number of celebrities, we can make
another split between movies with and without a high budget.
The group at the top-left corner of the diagram is composed entirely of critically
acclaimed films. This group is distinguished by a high number of celebrities and a
relatively low budget. At the top-right corner, the majority of movies are mainstream hits
with high budgets and a large number of celebrities. The final group, which has
little star power but budgets ranging from small to large, contains the flops.
Pruning the decision tree

A decision tree can continue to grow indefinitely, choosing splitting features and dividing
the data into smaller and smaller partitions until each example is perfectly classified or
the algorithm runs out of features to split on. However, if the tree grows overly large,
many of the decisions it makes will be overly specific and the model will be overfitted to
the training data. The process of pruning a decision tree involves reducing its size such
that it generalizes better to unseen data.
Implementation of Decision Tree
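A minimal sketch of fitting a single C5.0 tree with the C50 package in R; train and test are hypothetical data frames with a factor column named class.

```r
library(C50)

# Fit a C5.0 decision tree: x = predictor columns, y = factor class labels
model <- C5.0(x = train[, setdiff(names(train), "class")], y = train$class)

summary(model)                          # tree structure, attribute usage, error rate

pred <- predict(model, newdata = test)  # predicted class labels for unseen records
table(predicted = pred, actual = test$class)
```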
Intro to Ensemble Methods
Intro: Ensemble methods

The accuracy and reliability of a predictive model can be improved in two ways: either by embracing feature engineering or by applying boosting algorithms straight away.

While working with boosting algorithms, you'll soon come across two frequently occurring buzzwords: Bagging and Boosting.

Bagging: an approach where you take random samples of the data, build the same learning algorithm on each sample, and take simple means of the predicted probabilities.

Boosting: boosting is similar; however, the selection of samples is made more intelligently: we subsequently give more and more weight to hard-to-classify observations.
Understanding ensembles

Suppose you were a contestant on a television trivia show that allowed you to choose a panel of five friends
to assist you with answering the final question for the million-dollar prize. Most people would try to stack the
panel with a diverse set of subject matter experts. A panel containing professors of literature, science,
history, and art, along with a current pop-culture expert, would be a safely well-rounded group. Given their
breadth of knowledge, it would be unlikely to find a question that stumps the group.

The meta-learning approach that utilizes a similar principle of creating a varied


team of experts is known as an ensemble. All the ensemble methods are based on the idea that by
combining multiple weaker learners, a stronger learner is created.
• First, input training data is used to build a number of models.
• The allocation function dictates how much of the training data each model receives. Do they each receive the
full training dataset or merely a sample? Do they each receive every feature or a subset?
• After the models are constructed, they can be used to generate a set of predictions, which must be managed
in some way.
• The combination function governs how disagreements among the predictions are reconciled. For example, the ensemble might use a majority vote to determine the final prediction, or it could use a more complex strategy, such as weighting each model's votes based on its prior performance.
Bagging
One of the first ensemble methods to gain widespread acceptance used a technique
called bootstrap aggregating or bagging for short

Bagging generates a number of training datasets by bootstrap sampling the original training data. These
datasets are then used to generate a set of models using a single learning algorithm. The models'
predictions are combined using voting (for classification) or averaging (for numeric prediction).

Although bagging is a relatively simple ensemble, it can perform quite well as long as it is used with relatively
unstable learners, that is, those generating models that tend to change substantially when the input data
changes only slightly. Unstable models are essential in order to ensure the ensemble's diversity in spite of
only minor variations between the bootstrap training datasets. For this reason, bagging is often used with
decision trees, which have the tendency to vary dramatically given minor changes in the input data.
Bootstrap Method

The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest
to understand if the quantity is a descriptive statistic such as a mean or a standard deviation

Let’s assume we have a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample.
We can calculate the mean directly from the sample as:

mean(x) = 1/100 * sum(x)

We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the
bootstrap procedure:

 Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value
multiple times).
 Calculate the mean of each sub-sample.
 Calculate the average of all of our collected means and use that as our estimated mean for the data.
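A minimal base-R sketch of the bootstrap procedure just described, using a synthetic sample of 100 values.

```r
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)       # a synthetic sample of 100 values

boot_means <- replicate(1000, {
  resample <- sample(x, size = length(x), replace = TRUE)  # sub-sample with replacement
  mean(resample)
})

mean(x)           # plain sample mean
mean(boot_means)  # bootstrap estimate of the mean
sd(boot_means)    # standard error of that estimate
```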
Let’s assume we have a sample dataset of 1000 instances (x) and we are using the C5.0 algorithm.
Bagging of the C5.0 algorithm would work as follows.

1. Create many (e.g. 100) random sub-samples of our dataset with replacement.
2. Train a C5.0 model on each sample.
3. Given a new dataset, calculate the average prediction from each model.

For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue.

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.

The only parameter when bagging decision trees is the number of samples, and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until the accuracy stops showing improvement (e.g. on a cross-validation test harness). Very large numbers of models may take a long time to prepare, but will not overfit the training data.

Practical R session on Bagged Tree
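A minimal sketch of the bagging procedure described above, using the C50 package and majority voting; train and test are hypothetical data frames with a factor column named class.

```r
library(C50)

bag_c50 <- function(train, n_trees = 100) {
  x <- train[, setdiff(names(train), "class")]
  lapply(seq_len(n_trees), function(i) {
    idx <- sample(nrow(train), replace = TRUE)          # bootstrap sample (with replacement)
    C5.0(x = x[idx, , drop = FALSE], y = train$class[idx])
  })
}

predict_bag <- function(models, newdata) {
  votes <- sapply(models, function(m) as.character(predict(m, newdata)))
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote per row
}

models <- bag_c50(train, n_trees = 25)
pred   <- predict_bag(models, test)
```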


Boosting
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak
classifiers

This is done by building a model from the training data, then creating a second model that attempts to correct the
errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of
models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification, later extended to multiclass problems. It is the best starting point for understanding boosting.

AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners: models that achieve accuracy just above random chance on a classification problem.
Algorithm

Samples that are difficult to classify receive increasingly larger weights until the algorithm identifies a model that correctly classifies these samples.

At each iteration, a stage weight, ln((1 − err) / err), is computed based on the error rate at that iteration. (At the initial stage, equal weights of 1/N are assigned to all observations.)

The overall sequence of weighted classifiers is then combined into an ensemble that has a strong potential to classify better than any of the individual classifiers.

Samples that are incorrectly classified in the kth iteration receive more weight in the (k+1)st iteration, while samples that are correctly classified receive less weight in the subsequent iteration.
Learning an AdaBoost Model from Data

In summary:

1) Each sample has the same starting weight (1/n) initially.
2) Fit a weak classifier using the weighted samples and compute the kth model's misclassification error (errk).
3) Compute the kth stage value as ln((1 − errk) / errk).
4) Update the sample weights, giving more weight to incorrectly predicted samples and less weight to correctly predicted samples.

error = (N − correct) / N

where error is the misclassification rate, correct is the number of training instances predicted correctly by the model, and N is the total number of training instances. For example, if the model predicted 78 of 100 training instances correctly, the error (misclassification rate) would be (100 − 78) / 100 = 0.22.

The above formula is modified to use the weights of the training instances:

error = Σ_i (w_i × perror_i) / Σ_i w_i

which is the weighted sum of the misclassification rate, where w_i is the weight for training instance i and perror_i is the prediction error for training instance i, which is 1 if misclassified and 0 if correctly classified.

For example, suppose we had 3 training instances with the weights 0.01, 0.5 and 0.2. The predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, so the perror values would be 0, 1, and 0. The misclassification rate would be calculated as:

error = (0.01 × 0 + 0.5 × 1 + 0.2 × 0) / (0.01 + 0.5 + 0.2) ≈ 0.704
A stage value is calculated for the trained model, which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:

stage = ln((1 − error) / error)

where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight or contribution to the final prediction.

The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances. For example, the weight of one training instance (w) is updated using:

w = w × e^(stage × perror)

where w is the weight for a specific training instance, e is Euler's number raised to a power, stage is the stage value for the weak classifier, and perror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:

perror = 0 if y = p, otherwise 1

where y is the output variable for the training instance and p is the prediction from the weak learner. This has the effect of not changing the weight if the training instance was classified correctly, and making the weight slightly larger if the weak learner misclassified the instance.
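A minimal base-R sketch of one round of this bookkeeping, using the three-instance example above; note that with these toy weights the weighted error exceeds 0.5, so the stage value comes out negative.

```r
w      <- c(0.01, 0.5, 0.2)            # instance weights
y      <- c(-1,  1, -1)                # actual outputs
p      <- c(-1, -1, -1)                # weak learner's predictions
perror <- ifelse(y == p, 0, 1)         # 0 if correct, 1 if misclassified

error <- sum(w * perror) / sum(w)      # weighted misclassification rate (~0.704)
stage <- log((1 - error) / error)      # stage value ln((1 - error) / error)

w_new <- w * exp(stage * perror)       # re-weight only the misclassified instances
```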
Adaboost Ensemble

Weak models are added sequentially, trained using the weighted training data. The process
continues until a pre-set number of weak learners have been created (a user parameter) or no
further improvement can be made on the training dataset. Once completed, you are left with a
pool of weak learners each with a stage value.
Stacking
• Stacking: building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.
We can combine the predictions of multiple caret models using the caretEnsemble package.

Given a list of caret models, the caretStack() function can be used to specify a higher-order model to
learn how to best combine the predictions of sub-models together.

Let’s first look at creating 5 sub-models for the ionosphere dataset, specifically:

•Linear Discriminant Analysis (LDA)


•Classification and Regression Trees (CART)
•Logistic Regression (via Generalized Linear Model or GLM)
•k-Nearest Neighbors (kNN)
•Support Vector Machine with a Radial Basis Kernel Function (SVM)
When we combine the predictions of different models using stacking, it is desirable that the predictions made by the sub-models have low correlation. This would suggest that the models are skilful but in different ways, allowing a new classifier to figure out how to get the best from each model for an improved score.

If the predictions of the sub-models were highly correlated (> 0.75), they would be making the same or very similar predictions most of the time, reducing the benefit of combining the predictions.
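A minimal sketch with caret and caretEnsemble, assuming the Ionosphere data from the mlbench package (outcome column Class); dropping the constant V2 column and recoding V1 as numeric are common preprocessing steps for this dataset, not something stated on the slides. The five sub-models mirror the list above, and a GLM meta-learner stacks them.

```r
library(mlbench)
library(caret)
library(caretEnsemble)

data(Ionosphere)
dataset    <- Ionosphere[, -2]                       # V2 is constant, drop it
dataset$V1 <- as.numeric(as.character(dataset$V1))   # recode factor V1 as numeric

control <- trainControl(method = "cv", number = 5,
                        savePredictions = "final", classProbs = TRUE)

# Five sub-models: LDA, CART, logistic regression (GLM), kNN, SVM (radial kernel)
models <- caretList(Class ~ ., data = dataset, trControl = control,
                    methodList = c("lda", "rpart", "glm", "knn", "svmRadial"))

# Correlation between sub-model results; low correlation is desirable for stacking
modelCor(resamples(models))

# Stack the sub-models with a GLM meta-learner
stack <- caretStack(models, method = "glm", metric = "Accuracy",
                    trControl = trainControl(method = "cv", number = 5))
print(stack)
```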
Practical Issues of Classification

• Underfitting and Overfitting


• Class Imbalance
Underfitting and Overfitting (Example)

500 circular and 500


triangular data points.
Underfitting and Overfitting

[Figure labelled "Overfitting"]

Underfitting: when the model is too simple and has insufficient features, both training and test errors are large.
Overfitting due to Noise

The decision boundary is distorted by a noise point.

Overfitting

A highly complex model, and the lack of data points in the lower half of the diagram, make it difficult to correctly predict the class labels of that region.
Imbalanced data sets

What is Imbalanced Classification?

Imbalanced classification is a supervised learning problem where one class outnumbers the other class by a large proportion. This problem is faced more frequently in binary classification problems than in multi-level classification problems.

Below are the reasons which lead to reduced accuracy of ML algorithms on imbalanced data sets:
1. ML algorithms struggle with accuracy because of the unequal distribution of the dependent variable.
2. This causes the performance of existing classifiers to be biased towards the majority class.
3. The algorithms are accuracy-driven, i.e. they aim to minimize the overall error, to which the minority class contributes very little.
4. ML algorithms assume that the data set has balanced class distributions.
5. They also assume that errors obtained from different classes have the same cost.
Below are the methods used to treat imbalanced datasets:
1. Under sampling
2. Oversampling
3. Synthetic Data Generation

Under sampling
This method works with the majority class. It reduces the number of observations from the majority class to make the data set balanced. This method is best used when the data set is huge and reducing the number of training samples helps to improve run time and storage requirements.

Oversampling
This method works with the minority class. It replicates the observations from the minority class to balance the data. It is also known as upsampling.

Synthetic Data Generation
In simple words, instead of replicating and adding the observations from the minority class, it overcomes the imbalance by generating artificial data. It is also a type of oversampling technique.
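A minimal base-R sketch of random under-sampling and over-sampling, assuming a hypothetical data frame df with a two-level factor column class whose rare level is "Yes".

```r
set.seed(7)
minority <- df[df$class == "Yes", ]          # assume "Yes" is the rare class
majority <- df[df$class == "No",  ]

# Under-sampling: shrink the majority class to the size of the minority class
under <- rbind(minority,
               majority[sample(nrow(majority), nrow(minority)), ])

# Over-sampling: replicate minority rows (with replacement) up to the majority size
over  <- rbind(majority,
               minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

table(under$class)
table(over$class)
```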

In regards to synthetic data generation, the synthetic minority oversampling technique (SMOTE) is a powerful and widely used method. The SMOTE algorithm creates artificial data based on feature-space (rather than data-space) similarities between minority samples. We can also say it generates a random set of minority class observations to shift the classifier's learning bias towards the minority class.

To generate artificial data, it uses bootstrapping and k-nearest neighbours. Precisely, it works this way:

1. Take the difference between the feature vector (sample) under consideration and its nearest neighbour.
2. Multiply this difference by a random number between 0 and 1.
3. Add it to the feature vector under consideration.
4. This causes the selection of a random point along the line segment between the two samples in feature space.
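A minimal base-R sketch of steps 1–4 for a single synthetic point, assuming a numeric matrix X_min holding only the minority-class samples (hypothetical name). Real SMOTE implementations (e.g. the smotefamily package) pick among the k nearest neighbours rather than only the single nearest one.

```r
set.seed(3)
i <- sample(nrow(X_min), 1)                 # sample under consideration
x <- X_min[i, ]

d    <- as.matrix(dist(X_min))[i, ]         # distances to all minority samples
d[i] <- Inf                                 # exclude the sample itself
nn   <- X_min[which.min(d), ]               # its nearest minority neighbour

gap <- runif(1)                             # random number between 0 and 1
synthetic <- x + gap * (nn - x)             # new point on the segment between x and nn
```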
