Decision Tree and Ensemble
Decision Tree
· Intro
· Concept
· Entropy
· Information Gain
· Forward Pruning
· Backward Pruning
· Implementation of C5.0
Boosting & Bagging
· Intro
· What is Bagging and Boosting
· Comparing the results of Boosting and single model
· Parameters in Boosting
Engineer “Predictive” Features
MULTI-MODALITY FEATURES: a mix of numeric, symbolic, series, text, and image data PER data point!
Raw features are often not normally distributed; transforming them (e.g. plotting log(feature) instead of feature) can reveal hidden treasures!
Feature Engineering
[Workflow diagram: Collect Raw (Input) Data and Collect (Output) Ground Truth → Engineer “Predictive” Features → Deploy Model, Make Predictions on Unlabeled Data → Evaluate, Iterate, Improve Model]
Two Mindsets to Modeling
· Model-centric: throw all features in!
· Feature-centric: carefully craft features
What is Classification?
PARTITIONING the (FEATURE) SPACE into PURE REGIONS assigned to each CLASS.
The purity of a region can be measured as (1 – Entropy).
[Plot: model accuracy vs. model complexity]
Classification: Steps
[Figure: a Training Set with columns (Tid, Attrib1, Attrib2, Attrib3, Class) is used to learn a model, which is then applied to a Test Set whose Class values are unknown (?).]
Examples of Classification Tasks
• Predicting tumor cells as benign or malignant
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch matching the record at each node:

Refund?
  Yes → NO
  No  → Marital Status?
          Single, Divorced → Taxable Income?
                               < 80K → NO
                               > 80K → YES
          Married → NO

The test record takes the Refund = No branch, then the Married branch, and reaches the leaf NO, so assign Cheat to “No”.
Decision Tree Classification Task
[Figure: a tree-induction algorithm learns a decision tree model from the Training Set, e.g. rows (Tid 1: Yes, Large, 125K, No), (Tid 2: No, Medium, 100K, No), (Tid 3: No, Small, 70K, No), (Tid 6: No, Medium, 60K, No); the model is then applied to the Test Set, e.g. rows (Tid 11: No, Small, 55K, ?) and (Tid 15: No, Large, 67K, ?), to predict the unknown Class values.]
C5.0
A job offer to be considered begins at the root node, where it is passed through decision nodes that require choices to be made based on the attributes of the job. These choices split the data across branches that indicate potential outcomes of a decision, depicted here as yes or no outcomes, though in some cases there may be more than two possibilities. When a final decision can be made, the tree is terminated by leaf nodes (also known as terminal nodes) that denote the action to be taken as a result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree.
Applications
• Credit scoring models in which the criteria that cause an applicant to be rejected need to be clearly documented and free from bias
• Marketing studies of customer behaviour, such as satisfaction or churn, which will be shared with management or advertising agencies
• Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression
Divide and conquer
Decision trees are built using a heuristic called recursive partitioning. This approach is also commonly known as divide and conquer because it splits the data into subsets, which are then repeatedly split into even smaller subsets, until the algorithm determines that the data within the subsets are sufficiently homogeneous or another stopping criterion has been met.
How Decision Trees work
To see how splitting a dataset can create a decision tree, imagine a bare root node that will grow into a mature tree. At first, the root node represents the entire dataset, since no splitting has transpired. Next, the decision tree algorithm must choose a feature to split upon; ideally, it chooses the feature most predictive of the target class. The examples are then partitioned into groups according to the distinct values of this feature, and the first set of tree branches is formed.
C5.0 is one of the best implementations of the decision tree algorithm.
C5.0 uses entropy, a concept borrowed from information theory that quantifies the
randomness, or disorder, within a set of class values.
Data sets with high entropy are very diverse and provide little information about other
items that may also belong in the set, as there is no apparent commonality.
The decision tree hopes to find splits that reduce entropy, ultimately increasing
homogeneity within the groups.
How do you identify good features?
Sl. No | Ball Size | Ball Color | Price | Useful for Play
1      | 10        | Red        | 5     | Y
2      | 1         | Red        | 1     | Y
3      | 50        | Red        | 5     | Y
4      | 100       | Red        | 5     | N
5      | 1000      | Red        | 10    | N
Which of the above columns are most helpful for carrying out the above objective?
Can we quantify the usefulness of columns / features?
Entropy of a Distribution
Consider a UNIVERSE of possible events (a die = {1, 2, …, 6}).
Probability of an event: Prob(2) = 1/6
Number of bits to transmit that event: I(2) = -log2(1/6) = log2(6) ≈ 2.58 bits
Intuitively, something easily predictable is not really interesting: if the probability of an event is close to 0, the function should give a high number.
Here p is the probability of the event X, and the unit is the bit, the same bit a computer uses (0 or 1).
Information Theory 101
What is the “information content” of the following event?
The sun rose in the east today: Prob(e) = 1 ⇒ I(e) = 0 bits.
I(event) = log2(1 / Prob(event)) = -log2 Prob(event)
Entropy
For a given segment of data S:
Entropy(S) = - sum_{i=1}^{c} p_i * log2(p_i)
where the term c refers to the number of class levels and p_i refers to the proportion of values falling into class level i.
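As a quick illustration, here is a minimal base-R sketch of this formula (the class proportions passed in the example calls are made up):

# Entropy of a segment S, given the proportion of each class level.
entropy <- function(p) {
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))         # maximally impure two-class segment -> 1 bit
entropy(c(0.9, 0.1))         # mostly one class -> ~0.469 bits
entropy(1)                   # perfectly pure segment -> 0 bits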
To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that would result from a split on each possible feature, a measure known as information gain.
The information gain for a feature F is calculated as the difference between the entropy of the segment before the split (S1) and the entropy of the partitions resulting from the split (S2):
InfoGain(F) = Entropy(S1) - Entropy(S2)
The total entropy resulting from a split, Entropy(S2), is the sum of the entropy of each of the n partitions weighted by the proportion of examples falling in that partition (w_i):
Entropy(S2) = sum_{i=1}^{n} w_i * Entropy(P_i)
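A minimal, self-contained base-R sketch of that calculation; the "ball size > 50" split reuses the toy ball table above purely as an illustration:

# Information gain of splitting segment S (class labels y) by feature f.
info_gain <- function(y, f) {
  entropy_of <- function(labels) {
    p <- prop.table(table(labels))
    p <- p[p > 0]
    -sum(p * log2(p))
  }
  before <- entropy_of(y)                          # Entropy(S1)
  parts  <- split(y, f)                            # partitions induced by the split
  w      <- sapply(parts, length) / length(y)      # proportion w_i in each partition
  after  <- sum(w * sapply(parts, entropy_of))     # Entropy(S2) = sum_i w_i * Entropy(P_i)
  before - after
}

# Does "Ball size > 50" separate useful from non-useful balls?
useful <- c("Y", "Y", "Y", "N", "N")
size   <- c(10, 1, 50, 100, 1000)
info_gain(useful, size > 50)                       # ~0.971 bits: a perfect split here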
Example
Predict whether a potential movie would fall into one of three categories: Critical
Success, Mainstream Hit, or Box Office Bust.
A decision tree can continue to grow indefinitely, choosing splitting features and dividing the data into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will be overfitted to the training data. The process of pruning a decision tree involves reducing its size so that it generalizes better to unseen data; this can be done either by stopping growth early (forward/pre-pruning) or by growing the full tree and then trimming it back (backward/post-pruning).
Implementation of Decision Tree
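A minimal sketch of training and applying a decision tree in R, assuming the C50 package; the built-in iris data and the 75/25 split are illustrative choices:

library(C50)

set.seed(123)
idx   <- sample(nrow(iris), floor(0.75 * nrow(iris)))   # 75/25 train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

# Grow a C5.0 decision tree on the training set.
model <- C5.0(x = train[, -5], y = train$Species)
summary(model)                                      # tree structure and training error

# Apply the model to the held-out test set.
pred <- predict(model, newdata = test)
table(predicted = pred, actual = test$Species)      # confusion matrix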
Intro to Ensemble Methods
The accuracy and reliability of a predictive model can be improved in two ways: either by embracing feature engineering or by applying boosting algorithms straight away.
While working with boosting algorithms, you will soon come across two frequently occurring buzzwords: Bagging and Boosting.
Bagging: an approach where you take random samples of the data, build the same learning algorithm on each sample, and combine the predictions by simple averaging or voting.
Boosting: similar, but the samples are selected more intelligently; we progressively give more and more weight to observations that are hard to classify.
Understanding ensembles
Suppose you were a contestant on a television trivia show that allowed you to choose a panel of five friends
to assist you with answering the final question for the million-dollar prize. Most people would try to stack the
panel with a diverse set of subject matter experts. A panel containing professors of literature, science,
history, and art, along with a current pop-culture expert, would be a safely well-rounded group. Given their
breadth of knowledge, it would be unlikely to find a question that stumps the group.
Bagging generates a number of training datasets by bootstrap sampling the original training data. These
datasets are then used to generate a set of models using a single learning algorithm. The models'
predictions are combined using voting (for classification) or averaging (for numeric prediction).
Although bagging is a relatively simple ensemble, it can perform quite well as long as it is used with relatively
unstable learners, that is, those generating models that tend to change substantially when the input data
changes only slightly. Unstable models are essential in order to ensure the ensemble's diversity in spite of
only minor variations between the bootstrap training datasets. For this reason, bagging is often used with
decision trees, which have the tendency to vary dramatically given minor changes in the input data.
Bootstrap Method
The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.
Let's assume we have a sample of 100 values (x) and we'd like to get an estimate of the mean of the sample. We can calculate the mean directly from the sample as:
mean(x) = (1/100) * sum(x_i)
We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure:
1. Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).
2. Calculate the mean of each sub-sample.
3. Calculate the average of all of our collected means and use that as our estimated mean for the data.
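In base R the three steps above look roughly like this (the sample of 100 values is simulated here):

set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)     # a small sample of 100 values

# 1. Create many (e.g. 1000) random sub-samples with replacement.
# 2. Calculate the mean of each sub-sample.
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# 3. Average the collected means to get the bootstrap estimate of the mean.
mean(boot_means)
mean(x)          # compare with the plain sample mean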
Let's assume we have a sample dataset of 1000 instances (x) and we are using the C5.0 algorithm. Bagging of the C5.0 algorithm would work as follows:
1. Create many (e.g. 100) random sub-samples of our dataset with replacement.
2. Train a C5.0 model on each sample.
3. Given a new dataset, calculate the average prediction from each model.
For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue.
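A rough sketch of bagging C5.0 trees in R (assuming the C50 package; the iris data, the 25 trees, and the single new observation are illustrative, and a real workflow would hold out a separate test set):

library(C50)

set.seed(7)
n_trees <- 25
bag <- lapply(1:n_trees, function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]   # bootstrap sample of the training data
  C5.0(x = boot[, -5], y = boot$Species)               # one C5.0 tree per bootstrap sample
})

# Majority vote across the bagged trees for a new observation.
new_obs <- iris[1, -5]
votes   <- sapply(bag, function(m) as.character(predict(m, new_obs)))
names(which.max(table(votes)))                         # the most frequent class wins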
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias; these are important characteristics of sub-models when combining predictions using bagging.
The only parameter when bagging decision trees is the number of samples, and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until the accuracy stops showing improvement (e.g. on a cross-validation test harness). Very large numbers of models may take a long time to prepare, but will not overfit the training data.
Boosting works by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models has been added.
AdaBoost was the first really successful boosting algorithm developed for binary classification, later extended to multiclass problems. It is the best starting point for understanding boosting.
AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners: models that achieve accuracy just above random chance on a classification problem.
Algorithm
· At the initial stage, equal weights (1/N) are assigned to all N observations.
· Samples that are difficult to classify receive increasingly larger weights until the algorithm finds a model that classifies them correctly.
· At each iteration, a stage weight, ln((1 - err) / err), is computed based on the error rate at that iteration.
· Samples that are incorrectly classified in the k-th iteration receive more weight in the (k+1)-st iteration, while samples that are correctly classified receive less weight in subsequent iterations.
· The overall sequence of weighted classifiers is then combined into an ensemble, which has a strong potential to classify better than any of the individual classifiers.
Learning An AdaBoost Model From Data
To learn an AdaBoost model, the misclassification rate is computed using the weights of the training instances:
error = sum(w_i * perror_i) / sum(w_i)
This is the weighted sum of the misclassification errors, where w_i is the weight for training instance i and perror_i is the prediction error for training instance i, which is 1 if misclassified and 0 if correctly classified.
For example, suppose we had 3 training instances with the weights 0.01, 0.5 and 0.2. The predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, so the perror values would be 0, 1 and 0. The misclassification rate would be calculated as:
error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2) = 0.5 / 0.71 ≈ 0.704
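Reproducing that arithmetic in R (the weights, predictions, and actual values are the ones from the example above):

w      <- c(0.01, 0.5, 0.2)            # instance weights
pred   <- c(-1, -1, -1)                # model predictions
actual <- c(-1,  1, -1)                # true labels
perror <- ifelse(pred == actual, 0, 1) # 0 if correct, 1 if misclassified

error <- sum(w * perror) / sum(w)      # weighted misclassification rate
error                                  # 0.5 / 0.71 ≈ 0.704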
A stage value is calculated for the trained model, which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:
stage = ln((1 - error) / error)
where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight or contribution to the final prediction.
The training weights are updated, giving more weight to incorrectly predicted instances and less weight to correctly predicted instances. For example, the weight of one training instance (w) is updated using:
w = w * exp(stage * perror)
where w is the weight for a specific training instance, exp() is Euler's number raised to a power, stage is the stage value computed above for the weak classifier, and perror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:
perror = 0 if y = p, and 1 otherwise
where y is the output variable for the training instance and p is the prediction from the weak learner. This has the effect of not changing the weight if the training instance was classified correctly and making the weight larger if the weak learner misclassified the instance.
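Continuing the same toy numbers in R for one boosting round (the error = 0.3 learner added at the end is a made-up, better-than-chance case to show the misclassified weight growing):

w      <- c(0.01, 0.5, 0.2)            # instance weights from the example above
perror <- c(0, 1, 0)                   # prediction errors from the example above
error  <- sum(w * perror) / sum(w)     # ≈ 0.704

# This toy model is worse than chance (error > 0.5), so its stage value is
# negative and the ensemble would down-weight its predictions.
stage <- log((1 - error) / error)      # ln() is log() in R; ≈ -0.87

# With a better-than-chance learner (say error = 0.3) the stage is positive ...
stage_good <- log((1 - 0.3) / 0.3)     # ≈ 0.85

# ... and the update w <- w * exp(stage * perror) enlarges the weights of
# misclassified instances while leaving correctly classified ones unchanged.
w_new <- w * exp(stage_good * perror)
w_new                                  # ≈ c(0.01, 1.17, 0.2): only the misclassified instance grew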
AdaBoost Ensemble
Weak models are added sequentially, trained using the weighted training data. The process
continues until a pre-set number of weak learners have been created (a user parameter) or no
further improvement can be made on the training dataset. Once completed, you are left with a
pool of weak learners each with a stage value.
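In practice this loop is rarely coded by hand; for instance, the C50 package exposes boosting through its trials argument and combines the boosted trials internally with a weighted vote. A sketch on an illustrative iris split:

library(C50)

set.seed(123)
idx   <- sample(nrow(iris), floor(0.75 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# trials sets the (maximum) number of boosting iterations; trials = 1 is a single tree.
boosted <- C5.0(x = train[, -5], y = train$Species, trials = 10)

pred_boost <- predict(boosted, newdata = test)
mean(pred_boost == test$Species)        # test-set accuracy of the boosted ensemble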
Stacking
• Stacking: building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.
We can combine the predictions of multiple caret models using the caretEnsemble package.
Given a list of caret models, the caretStack() function can be used to specify a higher-order model to
learn how to best combine the predictions of sub-models together.
Let's first look at creating five sub-models for the ionosphere dataset and then stacking them (see the sketch below).
If the predictions of the sub-models were highly correlated (> 0.75), they would be making the same or very similar predictions most of the time, reducing the benefit of combining the predictions.
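A hedged sketch with caret and caretEnsemble (the five sub-model methods, the GLM meta-learner, and the preprocessing of the mlbench Ionosphere data are illustrative choices, not prescribed by the text above):

library(caret)
library(caretEnsemble)
library(mlbench)

data(Ionosphere)
dataset    <- Ionosphere[, -2]                       # drop the constant V2 column
dataset$V1 <- as.numeric(as.character(dataset$V1))   # treat V1 as numeric

control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = "final", classProbs = TRUE)

# Train several sub-models on the same resampling folds.
set.seed(7)
models <- caretList(Class ~ ., data = dataset, trControl = control,
                    methodList = c("lda", "rpart", "glm", "knn", "svmRadial"))

# Check how correlated the sub-models' resampling results are (< 0.75 is desirable).
modelCor(resamples(models))

# Stack: a higher-order GLM learns how to best combine the sub-models' predictions.
set.seed(7)
stack <- caretStack(models, method = "glm",
                    trControl = trainControl(method = "repeatedcv",
                                             number = 10, repeats = 3))
print(stack)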
Practical Issues of Classification
Underfitting: when the model is too simple or has insufficient features, both the training and test errors are large.
Overfitting due to Noise
An overly complex model, combined with a lack of data points in the lower half of the diagram, makes it difficult to predict the class labels of that region correctly.
Imbalanced data sets
Imbalanced classification is a supervised learning problem where one class outnumbers the other class by a large proportion. This problem is faced more frequently in binary classification problems than in multi-class classification problems.
Below are the reasons that lead to reduced accuracy of ML algorithms on imbalanced data sets:
1. ML algorithms struggle with accuracy because of the unequal distribution of the dependent variable.
2. This causes the performance of existing classifiers to become biased towards the majority class.
3. The algorithms are accuracy-driven, i.e. they aim to minimize the overall error, to which the minority class contributes very little.
4. ML algorithms assume that the data set has balanced class distributions.
5. They also assume that errors obtained from different classes have the same cost.
Below are the methods used to treat imbalanced datasets:
1. Under sampling
2. Oversampling
3. Synthetic Data Generation
Under sampling
This method works with the majority class. It reduces the number of observations from the majority class to make the data set balanced. It is best used when the data set is huge and reducing the number of training samples helps to improve run time and reduce storage requirements.
Oversampling
This method works with the minority class. It replicates the observations from the minority class to balance the data. It is also known as upsampling.
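Both approaches can be sketched with base R's sample() alone (the imbalanced toy data frame is made up; in practice only the training split would be resampled):

set.seed(1)
# A made-up imbalanced dataset: 900 "majority" rows and 100 "minority" rows.
df <- data.frame(x = rnorm(1000),
                 class = factor(c(rep("majority", 900), rep("minority", 100))))
majority_df <- df[df$class == "majority", ]
minority_df <- df[df$class == "minority", ]

# Under sampling: shrink the majority class down to the minority size.
under <- rbind(majority_df[sample(nrow(majority_df), nrow(minority_df)), ], minority_df)
table(under$class)

# Oversampling (upsampling): replicate minority rows, with replacement, up to the majority size.
over <- rbind(majority_df,
              minority_df[sample(nrow(minority_df), nrow(majority_df), replace = TRUE), ])
table(over$class)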
Synthetic Data Generation
In simple words, instead of replicating and adding observations from the minority class, this approach overcomes the imbalance by generating artificial data. It is also a type of oversampling technique.
For synthetic data generation, the synthetic minority oversampling technique (SMOTE) is a powerful and widely used method. The SMOTE algorithm creates artificial data based on feature-space (rather than data-space) similarities between minority samples. We can also say that it generates a random set of minority class observations to shift the classifier's learning bias towards the minority class.
To generate artificial data, it uses bootstrapping and k-nearest neighbours. Precisely, it works this way (see the sketch after this list):
1. Take the difference between the feature vector (sample) under consideration and its nearest neighbour.
2. Multiply this difference by a random number between 0 and 1.
3. Add it to the feature vector under consideration.
4. This selects a random point along the line segment between the two feature vectors.
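The core interpolation (steps 1-4) can be sketched in base R for a single synthetic point (the 2-D minority samples are made up, and a brute-force nearest-neighbour search stands in for k-NN; packages such as smotefamily implement the full algorithm):

set.seed(3)
# Made-up minority-class feature vectors (rows) in a 2-D feature space.
minority <- matrix(rnorm(10 * 2), ncol = 2)

i  <- 1                                        # sample under consideration
d  <- as.matrix(dist(minority))[i, ]           # distances to all minority samples
nn <- order(d)[2]                              # its nearest neighbour (order(d)[1] is itself)

delta     <- minority[nn, ] - minority[i, ]    # 1. difference to the nearest neighbour
gap       <- runif(1)                          # 2. random number between 0 and 1
synthetic <- minority[i, ] + gap * delta       # 3.+4. a random point on the segment between them
synthetic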