Decision Tree & Random Forest
Continuous variable decision tree: A tree whose target (response) variable is continuous is known as a continuous variable decision tree.
The accuracy of a tree is heavily affected by the split point chosen at each decision node. Decision trees use
different criteria to decide how to split a decision node into two or more sub-nodes. The resulting
sub-nodes must increase the homogeneity of the data points, also known as the purity of the nodes,
with respect to the target variable. Candidate splits are tested on all available variables, and
the split that yields the purest sub-nodes is selected.
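As a rough sketch of this search (in Python), the snippet below tries every threshold of a single numeric feature and keeps the split whose sub-nodes are purest; the function names, the toy data, and the use of Gini impurity (introduced in the next section) are illustrative assumptions rather than part of the original text.

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels (defined formally in the next section).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    # Try every observed value of one numeric feature as a threshold and keep
    # the threshold whose two sub-nodes have the lowest weighted impurity.
    best_t, best_impurity = None, float("inf")
    for t in np.unique(feature):
        left, right = labels[feature <= t], labels[feature > t]
        if len(left) == 0 or len(right) == 0:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])   # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])                  # toy class labels
print(best_split(x, y))   # threshold 3.0 gives perfectly pure sub-nodes (impurity 0.0)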
Measures of Impurity: Decision trees recursively split on features so as to maximize the purity of the
target variable in the resulting nodes. The algorithm is designed to optimize each split so that purity is maximized.
Impurity can be measured in several ways, such as Gini impurity, entropy, and information
gain.
Gini Impurity - The Gini index is a measure of how often a randomly chosen element from the
set would be incorrectly labelled. Mathematically, the impurity of a set can be expressed as:
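In the form commonly used for classification trees: Gini impurity = 1 - Σ pᵢ², where pᵢ is the proportion of samples in the set that belong to class i. A perfectly pure node has a Gini impurity of 0, while an even 50/50 split between two classes has a Gini impurity of 0.5.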
Entropy - Entropy is the uncertainty in our dataset, or a measure of disorder. For a node it is calculated as E = -Σ pᵢ log₂(pᵢ), where pᵢ is again the proportion of samples belonging to class i.
In a decision tree, the output is mostly “yes” or “no”.
Suppose a feature has 8 “yes” and 4 “no” instances initially; after the first split the left node gets
5 “yes” and 2 “no” whereas the right node gets 3 “yes” and 2 “no”.
We see here that the split is not pure. Why? Because we can still see some negative
classes in both nodes. In order to build a decision tree, we need to calculate the
impurity of each split, and when the purity is 100% we make that node a leaf node.
To check the impurity of the left and right nodes, we take the help of the entropy
formula and plug in the class proportions of each node.
We can clearly see that the left node has lower entropy, i.e. more purity, than the right node,
since a larger share of its samples are “yes” and it is easier to decide
here.
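As a quick check on these numbers, here is a small Python sketch that computes the entropy of the two child nodes from the example above; the function name and layout are illustrative only.

import math

def entropy(counts):
    # Entropy of a node given its class counts, e.g. [number of "yes", number of "no"].
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([5, 2]))   # left node (5 "yes", 2 "no"): about 0.86
print(entropy([3, 2]))   # right node (3 "yes", 2 "no"): about 0.97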
Always remember that the higher the Entropy, the lower will be the purity and the
higher will be the impurity.
To decide which feature to split on, we bring in a new metric called “information gain”, which tells us how much the
parent entropy has decreased after splitting on some feature.
Information Gain
Information gain measures the reduction in uncertainty given some feature, and it is also the
deciding factor for which attribute should be selected as a decision node or root node.
It is simply the entropy of the full dataset minus the (weighted) entropy of the dataset given some feature.
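Written out for a split that produces several child nodes: Information Gain = E(Parent) - Σ (n_child / n_parent) × E(child), where n_child is the number of samples reaching a child node and n_parent is the number of samples in the parent. The weighted sum over the children is the entropy of the dataset given the feature, written E(Parent|feature) below.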
Suppose our entire population has a total of 30 instances, and the task is to predict whether
a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.
We have two features with which to predict whether a person will go to the gym.
Let’s see how our decision tree will be made using these 2 features. We’ll use information
gain to decide which feature should be the root node and which feature should be placed
after the split.
Once we have the values of E(Parent) and E(Parent|Energy), computed from the counts in each branch, the information gain is their difference, E(Parent) - E(Parent|Energy).
Our parent entropy was about 0.99, and from the resulting information gain we can
say that the entropy of the dataset will decrease by 0.37 if we make “Energy” our root
node.
Similarly, we will do this with the other feature “Motivation” and calculate its information
gain.
With E(Parent) and E(Parent|Motivation), the information gain is computed in the same way, as E(Parent) - E(Parent|Motivation).
We now see that the “Energy” feature gives a larger reduction in entropy (0.37) than the
“Motivation” feature. Hence we select the feature with the highest information
gain and split the node based on that feature.
In this example, “Energy” will be our root node, and we do the same for the sub-nodes. Here we
can see that when the energy is “high” the entropy is low, so we can say a person will
almost certainly go to the gym if they have high energy. But what if the energy is low? We then
split that node again, this time on the other feature, “Motivation”.
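To make the calculation concrete, here is a small Python sketch of the entropy and information-gain computations. The branch counts used for the “Energy” split are assumptions (the figure with the actual counts is not reproduced here), chosen only so that the gain comes out close to the 0.37 quoted above.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # Information gain = E(parent) - weighted average entropy of the child nodes.
    n = sum(parent_counts)
    weighted_children = sum(
        (sum(child) / n) * entropy(child) for child in child_counts_list
    )
    return entropy(parent_counts) - weighted_children

print(entropy([16, 14]))   # parent node: about 0.997, i.e. the ~0.99 mentioned above

# Assumed branch counts for the "Energy" split: [yes, no] when energy is high / low.
print(information_gain([16, 14], [[12, 1], [4, 13]]))   # about 0.38, close to 0.37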
There are many ways to tackle the problem of overfitting through hyperparameter tuning. We can set the
maximum depth of our decision tree using the max_depth parameter. The larger the value
of max_depth, the more complex the tree will be. The training error will of course decrease
as we increase max_depth, but when the test data comes into the picture we will get
very bad accuracy. Hence you need a value that neither overfits nor underfits the data,
and for this you can use GridSearchCV.
Another way is to set the minimum number of samples for each split. It is denoted
by min_samples_split. Here we specify the minimum number of samples required to perform a split.
For example, we can require a minimum of 10 samples to reach a decision. That means that if a node
has fewer than 10 samples, then using this parameter we stop the further splitting of this
node and make it a leaf node.
max_features – it helps us decide what number of features to consider when looking for the
best split.
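As a sketch of how these three parameters might be tuned together with GridSearchCV, assuming scikit-learn and a generic built-in classification dataset (the grid values below are arbitrary examples, not recommendations):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # any classification dataset works here

param_grid = {
    "max_depth": [3, 5, 7, 10],           # deeper trees are more complex
    "min_samples_split": [2, 10, 20],     # minimum samples required to split a node
    "max_features": [None, "sqrt", 0.5],  # how many features to consider per split
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)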
Pruning
It is another method that can help us avoid overfitting. It helps in improving the performance
of the tree by cutting the nodes or sub-nodes which are not significant. It removes the
branches which have very low importance.
(i) Pre-pruning – we can stop growing the tree earlier, which means we can
prune/remove/cut a node if it has low importance while growing the tree.
(ii) Post-pruning – once our tree is built to its depth, we can start pruning the nodes based on
their significance.
A large, complex tree generalizes poorly to new sample data, whereas a very small tree fails
to capture the information in the training data.
Pruning may be defined as shortening the branches of the tree: the size of the tree is reduced
by turning some branch nodes into leaf nodes and removing the leaf nodes under the
original branch.
Pruning is very useful for decision trees because a tree may fit the training data very well but
perform poorly on test or new data. By removing branches we reduce the complexity of the tree,
which helps reduce the overfitting of the tree.
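As one concrete way to post-prune, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter of DecisionTreeClassifier. The sketch below is an assumed example of picking an alpha on held-out data; the dataset and the split are placeholders.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Candidate alpha values for cost-complexity pruning, derived from the training data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# A larger ccp_alpha prunes more aggressively; keep the alpha that scores best on held-out data.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    .fit(X_train, y_train)
    .score(X_valid, y_valid),
)
print("chosen ccp_alpha:", best_alpha)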
RANDOM FOREST
Random forest is a Supervised Machine Learning Algorithm that is used
widely in Classification and Regression problems. It builds decision trees on
different samples and takes their majority vote for classification and average
in case of regression.
One of the most important features of the random forest algorithm is that
it can handle data sets containing continuous variables, as in the case of
regression, as well as categorical variables, as in the case of classification.
Before understanding the working of the random forest, we must look into the
ensemble technique. Ensemble simply means combining multiple models; a
collection of models is used to make predictions rather than an individual model.
Random forest works on the bagging principle, so let's dive into bagging.
Bagging
Bagging, also known as bootstrap aggregation, is the ensemble technique
used by random forest. Bagging chooses random samples from the data
set: each model is trained on a sample (a bootstrap sample) drawn from the
original data with replacement, which is known as row sampling.
This step of row sampling with replacement is called bootstrapping. Each
model is then trained independently and generates its own result. The final output is
based on majority voting after combining the results of all the models. This step,
which involves combining all the results and generating an output based on
majority voting, is known as aggregation.
Now let's look at an example by breaking it down with the help of the
following figure. Here the bootstrap samples (Bootstrap sample 01, Bootstrap sample 02,
and Bootstrap sample 03) are taken from the actual data with replacement, which means
each sample is very likely to contain repeated rows rather than only unique ones.
The models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples
are trained independently, and each model generates a result as shown. Since the happy
emoji has the majority when compared to the sad emoji, the final output obtained by
majority voting is the happy emoji.
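To make the bootstrap-and-vote idea concrete, here is a minimal from-scratch Python sketch. The dataset, the number of models, and the choice of decision trees as base learners are assumptions for illustration, and this is plain bagging rather than a full random forest (there is no column sampling).

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

# Bootstrap: draw row samples with replacement and train one model per sample.
models = []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))   # sampled row indices may repeat
    models.append(DecisionTreeClassifier().fit(X[rows], y[rows]))

# Aggregation: combine the individual predictions by majority vote.
votes = np.stack([m.predict(X) for m in models])        # shape: (n_models, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)      # works for 0/1 class labels
print("training accuracy of the bagged ensemble:", (majority == y).mean())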
In a random forest, n random records are taken (with replacement) from a data set having
k records, an individual decision tree is built for each sample, each tree produces its
own output, and the final output is obtained by majority voting for classification or by
averaging for regression.
Train-test split - In a random forest we don't have to segregate separate data
for train and test, as there will always be roughly one-third of the rows (the
out-of-bag samples) that are not seen by any given decision tree.
Thus, random forests are much more successful than single decision trees only if
the individual trees are diverse and individually reasonably accurate.
Important Hyperparameters
Hyperparameters are used in random forests to either enhance the
performance and predictive power of models or to make the model faster.
n_jobs – it tells the engine how many processors it is allowed to use. If the
value is 1, it can use only one processor, but if the value is -1 there is no
limit.
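Putting this together, here is a minimal scikit-learn random forest sketch; the parameter values and the dataset are assumptions for illustration. Setting oob_score=True uses the out-of-bag rows discussed above as a built-in validation check.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features="sqrt",  # features considered at each split (column sampling)
    n_jobs=-1,            # use all available processors
    oob_score=True,       # evaluate on the out-of-bag samples
    random_state=0,
)
forest.fit(X, y)
print("out-of-bag accuracy:", forest.oob_score_)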
Advantages
1. Each decision tree is built independently of the others, so the algorithm
parallelizes well.
2. It maintains diversity, since not all attributes are considered while making
each decision tree (though this is not true in all cases).
3. We don't have to segregate data into train and test, as roughly one-third of
the rows (the out-of-bag samples) are never seen by the decision tree built from
a given bootstrap sample.
Disadvantages
The main trade-offs are speed and interpretability: a forest is slower to train
and to use for prediction than a single decision tree, and an ensemble of many
trees is harder to interpret than one tree.
Summary
Now we can conclude that random forest is one of the best-performing techniques,
widely used in various industries for its efficiency. It can handle binary,
continuous, and categorical data.
Random forest is a great choice if you want to build a model quickly and
efficiently, as one of the best things about the random forest is that it can
handle missing values.
Overall, random forest is a fast, simple, flexible, and robust model with some
limitations.