Decision Tree & Random Forest

A decision tree is a tool that uses a tree-like model to predict outcomes based on input data. It starts with a root node and uses a series of splits based on feature values to classify or regress outcomes at the leaf nodes. Decision trees select splits that maximize the purity of outcomes in the descendant nodes. They use measures like information gain and Gini impurity to evaluate split quality. Decision trees can overfit data, so techniques like pruning branches and limiting tree depth are used to improve generalization. Random forests are an ensemble method that combines many decision trees to make more robust predictions.


Decision Tree

What is a Decision Tree?


A decision tree is a tool with applications spanning several different areas. Decision
trees can be used for classification as well as regression problems. As the name
suggests, it uses a flowchart-like tree structure to show the predictions that result
from a series of feature-based splits. It starts with a root node and ends with
decisions made at the leaf nodes.

It is a graphical representation of all the possible solutions to a decision, based on
certain conditions. In this algorithm, the training sample points are split into two or
more sets based on a split condition over the input variables. A simple example: a
person decides whether to go to sleep or to a restaurant based on conditions such as
whether he is hungry or whether he has $25 in his pocket.

Types of Decision Trees –


Categorical variable decision tree: The type of decision tree is classified based on the
response/target variable. A tree with a qualitative or categorical response variable is known as
a categorical variable decision tree.

Continuous variable decision tree: A tree with a continuous response variable is known as a
continuous variable decision tree.
The tree's accuracy is heavily affected by the split point chosen at each decision node. Decision
trees use different criteria to decide how to split a decision node into two or more sub-nodes. The
resulting sub-nodes must increase the homogeneity of the data points, also known as the purity of
the nodes, with respect to the target variable. Candidate splits are evaluated on all available
variables, and the split that produces the purest sub-nodes is selected.
Measures of Impurity: Decision trees recursively split on features so as to maximize the purity of
the resulting nodes with respect to the target variable. The algorithm chooses each split so that
purity is maximized. Impurity can be measured in several ways, such as Gini impurity, entropy, and
information gain.
Gini Impurity - The Gini index measures how often a randomly chosen element from the set would be
incorrectly labelled if it were labelled randomly according to the distribution of classes in the
set. Mathematically, the impurity of a set S whose elements belong to classes i with proportions
p_i can be expressed as:

Gini(S) = 1 - Σ_i p_i²

Entropy - Entropy is the uncertainty in our dataset, i.e., a measure of disorder.
In a (binary) decision tree, the output at a node is mostly "yes" or "no".

The formula for entropy in this binary case is:

E(S) = -p+ log2(p+) - p- log2(p-)

where p+ is the probability of the positive class,

p- is the probability of the negative class,

and S is the subset of training examples at the node.

Entropy basically measures the impurity of a node. Impurity is the degree of
randomness; it tells how random our data is. A pure sub-split means that every point
in the node belongs to a single class, i.e., you get only "yes" or only "no".

Suppose a feature has 8 "yes" and 4 "no" initially. After the first split the left node gets
5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".

We can see that the split is not pure. Why? Because we can still see some negative
classes in both nodes. To build a decision tree, we need to calculate the
impurity of each split, and when the purity is 100% we make the node a leaf node.

To check the impurity of feature 2 and feature 3 we will take the help of the entropy
formula.
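To make the arithmetic concrete, here is a minimal Python sketch (the helper name entropy is just an illustrative choice) that applies the entropy formula to the split described above: a parent node with 8 "yes" / 4 "no", a left child with 5 / 2 and a right child with 3 / 2.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a node with `pos` positive and `neg` negative samples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:                      # 0 * log(0) is treated as 0
            p = count / total
            e -= p * math.log2(p)
    return e

# Parent node: 8 "yes" and 4 "no"
parent = entropy(8, 4)                 # ~0.918

# Children after the split described above
left = entropy(5, 2)                   # ~0.863
right = entropy(3, 2)                  # ~0.971

# Weighted average entropy of the children (7 and 5 samples respectively)
weighted = (7 / 12) * left + (5 / 12) * right   # ~0.908

# Information gain of this split = parent entropy - weighted child entropy
gain = parent - weighted               # ~0.01, i.e. a weak split
print(parent, left, right, weighted, gain)
```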


For feature 3, we can clearly see from the tree itself that the left node has lower
entropy (more purity) than the right node, since the left node has a greater proportion
of "yes" and it is easier to decide there.

Always remember: the higher the entropy, the lower the purity and the higher the
impurity.

As mentioned earlier, the goal is to decrease the uncertainty or impurity in the
dataset. Entropy gives us the impurity of a particular node, but by itself it does not
tell us how much the impurity has decreased compared to the parent node.

For this, we introduce a new metric called "information gain", which tells us how much
the parent entropy has decreased after splitting on some feature.

Information Gain

Information gain measures the reduction in uncertainty given some feature, and it is also the
deciding factor for which attribute should be selected as a decision node or the root node.

It is simply the entropy of the full dataset minus the entropy of the dataset given some feature.
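Written out, with E(·) denoting entropy as above, A the attribute used for the split, and S_v the subset of S on which A takes the value v, this is the standard definition:

IG(S, A) = E(S) - Σ_v (|S_v| / |S|) · E(S_v)

The second term is the weighted average entropy of the child nodes, which is exactly the quantity computed in the examples below.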

To understand this better let’s consider an example:

Suppose our entire population has a total of 30 instances. The dataset is used to predict whether
a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.

Now we have two features to predict whether he/she will go to the gym or not.

Feature 1 is "Energy", which takes two values: "high" and "low".

Feature 2 is "Motivation", which takes three values: "No motivation", "Neutral" and "Highly
motivated".

Let’s see how our decision tree will be made using these 2 features. We’ll use information
gain to decide which feature should be the root node and which feature should be placed
after the split.

Let's calculate the parent entropy first:

E(Parent) = -(16/30) log2(16/30) - (14/30) log2(14/30) ≈ 0.99

To get E(Parent|Energy), we take the weighted average of the entropy of each child node of the
"Energy" split, weighting each child by the fraction of samples that falls into it.

The information gain is then the difference between E(Parent) and E(Parent|Energy).

Our parent entropy was near 0.99, and given the resulting information gain of about 0.37, we can
say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root
node.
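As a quick sanity check, here is a minimal, self-contained Python sketch of the parent entropy for the 16/14 split described above (the 0.37 information gain is the figure quoted in the text, not computed here):

```python
import math

def entropy(pos, neg):
    # Binary entropy, as defined above
    probs = [c / (pos + neg) for c in (pos, neg) if c]
    return -sum(p * math.log2(p) for p in probs)

parent = entropy(16, 14)          # ~0.997, i.e. the "near 0.99" quoted above

# The reported information gain for the "Energy" split is ~0.37, which
# implies a weighted child entropy E(Parent|Energy) of roughly:
weighted_energy = parent - 0.37   # ~0.63
print(round(parent, 3), round(weighted_energy, 3))
```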

Similarly, we will do this with the other feature “Motivation” and calculate its information
gain.


The entropy of each child node of the "Motivation" split is calculated in the same way; the
weighted average of these entropies gives E(Parent|Motivation), and the information gain is again
the difference between E(Parent) and E(Parent|Motivation).

We now see that the "Energy" feature gives a larger reduction in entropy (0.37) than the
"Motivation" feature. Hence we select the feature with the highest information gain and split the
node based on that feature.

In this example, "Energy" will be our root node, and we do the same for the sub-nodes. Here we can
see that when the energy is "high" the entropy is low, so we can be fairly confident a person will
go to the gym if he has high energy. But what if the energy is low? We then split that node again
using the next feature, which is "Motivation".

When to stop splitting?


Usually, real-world datasets have a large number of features, which results in a large number of
splits, which in turn gives a huge tree. Such trees take time to build and can lead to
overfitting: the tree will give very good accuracy on the training dataset but poor accuracy
on test data.

There are many ways to tackle this problem through hyperparameter tuning. We can set the
maximum depth of our decision tree using the max_depth parameter. The larger the value
of max_depth, the more complex the tree will be. The training error will of course decrease
as we increase max_depth, but on test data we may get very poor accuracy. Hence we need a value
that neither overfits nor underfits the data, and for this we can use GridSearchCV.

Another way is to set the minimum number of samples for each split, denoted
by min_samples_split. Here we specify the minimum number of samples required to make a split.
For example, we can require a minimum of 10 samples to split a node. That means if a node
has fewer than 10 samples, this parameter stops the further splitting of that node and makes it a
leaf node.

There are more hyperparameters, such as:

min_samples_leaf – the minimum number of samples required to be in a leaf node. Increasing this
value makes the tree more conservative, which reduces overfitting (though too large a value can
cause underfitting).

max_features – the number of features to consider when looking for the best split.

A short tuning sketch using these parameters follows.
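Putting the hyperparameters above together, here is a hedged sketch of tuning them with scikit-learn's GridSearchCV; the parameter grid values and the load_breast_cancer dataset are placeholder choices for illustration, not prescriptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate values for the hyperparameters discussed above
param_grid = {
    "max_depth": [2, 3, 5, 10, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
    "max_features": [None, "sqrt"],
}

# Exhaustive search with 5-fold cross-validation on the training set
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```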

Pruning

It is another method that can help us avoid overfitting. It improves the performance of the tree
by cutting off the nodes or sub-nodes which are not significant, i.e., it removes branches that
have very low importance.

There are mainly two ways of pruning:

(i) Pre-pruning – we stop growing the tree early, which means we prune/remove/cut a node of low
importance while the tree is being grown.

(ii) Post-pruning – once the tree is built to its full depth, we start pruning nodes based on
their significance.

A complex, large tree generalizes poorly to new sample data, whereas a very small tree fails to
capture the information in the training sample data.
Pruning may be defined as shortening the branches of the tree: the process of reducing the size
of the tree by turning some branch nodes into leaf nodes and removing the leaf nodes under the
original branch.
Pruning is very useful in decision trees because a tree may fit the training data very well but
perform very poorly on testing or new data. By removing branches we reduce the complexity of the
tree, which helps reduce overfitting.

In short, pruning is nothing but cutting down some nodes to stop overfitting.
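For post-pruning in particular, scikit-learn exposes cost-complexity pruning via the ccp_alpha parameter. The following is a minimal sketch; the alpha-selection loop is deliberately simplified (it scores on the held-out split as a proxy), and the dataset is again a placeholder:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the effective alphas for cost-complexity (post-)pruning
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    # Larger ccp_alpha prunes more aggressively, giving a smaller tree
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_test, y_test)   # held-out data as a simple proxy
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best ccp_alpha:", best_alpha, "test accuracy:", best_score)
```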


Decision Trees
Pros
1. Normalization or scaling of data is not needed.
2. Handling missing values: missing values have no considerable impact.
3. Easy to explain to non-technical team members.
4. Easy to visualize.
5. Automatic feature selection: irrelevant features won't affect decision trees much.
Cons
1. Prone to overfitting.
2. Sensitive to the data: if the data changes slightly, the outcomes can change to a very
large extent.
3. Relatively higher time required to train.
Applications:
Identifying buyers for products, predicting the likelihood of default, deciding which
strategy can maximize profit, finding a strategy for cost minimization, determining which
features are most important to attract and retain customers (is it the frequency
of shopping, is it the frequent discounts, is it the product mix, etc.), fault
diagnosis in machines (keep measuring pressure, vibrations and other measures
and predict before a fault occurs), and so on.

RANDOM FOREST
Random forest is a supervised machine learning algorithm that is widely used
for classification and regression problems. It builds decision trees on
different samples and takes their majority vote for classification and their average
in the case of regression.

One of the most important features of the random forest algorithm is that
it can handle data sets containing both continuous variables, as in the case of
regression, and categorical variables, as in the case of classification.

It tends to give particularly good results on classification problems.

Working of Random Forest Algorithm

Before understanding the working of the random forest, we must look at the
ensemble technique. Ensemble simply means combining multiple models: a
collection of models is used to make predictions rather than an individual model.

Ensemble methods are of two main types:

1. Bagging – It creates different training subsets from the sample training data
with replacement, and the final output is based on majority voting. For
example, Random Forest.

2. Boosting – It combines weak learners into strong learners by creating
sequential models such that the final model has the highest accuracy. For
example, AdaBoost, XGBoost.

As mentioned earlier, Random forest works on the Bagging principle. Now let's dive in and
understand bagging in detail.

Bagging
Bagging, also known as Bootstrap Aggregation, is the ensemble technique
used by random forest. Bagging chooses random samples from the data
set: each model is trained on a sample (a bootstrap sample) drawn from the
original data with replacement, which is known as row sampling.
This step of row sampling with replacement is called bootstrapping. Each
model is then trained independently and generates its own result. The final output is
based on majority voting after combining the results of all models. This step,
which involves combining all the results and generating the output based on
majority voting, is known as aggregation.
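To make the two steps (bootstrapping and aggregation) concrete, here is a minimal, self-contained sketch in plain NumPy; train_model is a hypothetical stand-in for fitting a real decision tree on each bootstrap sample:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    """Row sampling with replacement: some rows repeat, others are left out."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def train_model(X, y):
    """Hypothetical stand-in for training a tree on one bootstrap sample.
    Here it simply 'predicts' the majority class of its own sample."""
    majority = Counter(y.tolist()).most_common(1)[0][0]
    return lambda x: majority

# Toy data: 10 rows, binary labels
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 1, 0, 1, 1, 1, 0, 1, 0])

# Bootstrapping: build several models on independent bootstrap samples
models = [train_model(*bootstrap_sample(X, y)) for _ in range(5)]

# Aggregation: majority vote over the individual predictions
votes = [m(X[0]) for m in models]
prediction = Counter(votes).most_common(1)[0][0]
print("votes:", votes, "-> final prediction:", prediction)
```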
 

Now let's look at an example by breaking it down with the help of the
following figure. Here bootstrap samples are taken from the actual data
(Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) with
replacement, which means each sample is likely to contain repeated rows rather
than only unique ones. The models (Model 01, Model 02, and Model 03)
obtained from these bootstrap samples are trained independently. Each
model generates a result as shown. The happy emoji has a majority
when compared to the sad emoji, so based on majority voting the final output
is the happy emoji.

 
 

Steps involved in the random forest algorithm:

Step 1: n random records are taken from a data set having k records.

Step 2: An individual decision tree is constructed for each sample.

Step 3: Each decision tree generates an output.

Step 4: The final output is based on majority voting for classification or averaging for
regression.
For example, consider the fruit basket data shown in the figure
below. n samples are taken from the fruit basket and an
individual decision tree is constructed for each sample. Each decision tree
generates an output as shown in the figure. The final output is
based on majority voting: in the figure, the majority of decision trees output
"apple" rather than "banana", so the final output is taken as "apple".
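In scikit-learn, all of these steps are wrapped up in RandomForestClassifier. A minimal sketch (the iris dataset and the parameter values are placeholder choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 100 trees, each grown on a bootstrap sample with a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# Prediction is the majority vote across the individual trees
print("Test accuracy:", forest.score(X_test, y_test))
```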

Important Features of Random Forest


1. Diversity – Not all attributes/variables/features are considered while
making an individual tree; each tree is different.

2. Immune to the curse of dimensionality – Since each tree does not
consider all the features, the feature space is reduced.

3. Parallelization – Each tree is created independently out of different data
and attributes. This means we can make full use of the CPU to build
random forests.

4. Train-test split – In a random forest we don't have to segregate the data
into train and test, as roughly one-third of the data (the out-of-bag samples) is
never seen by a given tree.

5. Stability – Stability arises because the result is based on majority voting/
averaging.

Difference Between Decision Tree & Random Forest


Random forest is a collection of decision trees; still, there are a lot of
differences in their behavior.

Decision trees:

1. Decision trees normally suffer from the problem of overfitting if they are allowed to grow
without any control.

2. A single decision tree is faster in computation.

3. When a data set with features is taken as input by a decision tree, it will formulate some set
of rules to do prediction.

Random Forest:

1. Random forests are created from subsets of data and the final output is based on averaging or
majority ranking, and hence the problem of overfitting is taken care of.

2. It is comparatively slower.

3. Random forest randomly selects observations, builds a decision tree for each sample, and the
average result is taken. It doesn't use any single set of rules.

Thus random forests are much more successful than single decision trees, but only if
the individual trees are diverse and reasonably accurate.
Important Hyperparameters
Hyperparameters are used in random forests to either enhance the
performance and predictive power of models or to make the model faster.

The following hyperparameters increase the predictive power:

1. n_estimators – the number of trees the algorithm builds before averaging the
predictions.

2. max_features – the maximum number of features the random forest considers when
splitting a node.

3. min_samples_leaf – the minimum number of samples required to be at a leaf node.

The following hyperparameters increase the speed:

1. n_jobs – tells the engine how many processors it is allowed to use. If the
value is 1, it can use only one processor; if the value is -1, there is no
limit.

2. random_state – controls the randomness of the sampling. The model will always
produce the same results for a fixed random_state value, given the same
hyperparameters and the same training data.

3. oob_score – OOB means "out of bag". It is a built-in, cross-validation-like
evaluation method for random forests: roughly one-third of the samples are not used to
train a given tree and are instead used to evaluate its performance. These samples are
called out-of-bag samples.
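A brief sketch of how these three parameters can be combined in scikit-learn (the values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,   # more trees before averaging
    n_jobs=-1,          # use all available processors
    random_state=42,    # reproducible sampling
    oob_score=True,     # evaluate on the out-of-bag samples
)
forest.fit(X, y)

# Out-of-bag estimate of generalization accuracy, without a separate test set
print("OOB score:", forest.oob_score_)
```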
Advantages and Disadvantages of Random Forest
Algorithm

Advantages 

1. It can be used for both classification and regression problems.

2. It reduces the problem of overfitting, as the output is based on majority voting
or averaging.

3. It performs reasonably well even if the data contains some null/missing values.

4. Each decision tree created is independent of the others, so it has the
property of parallelization.

5. It is highly stable, as the answers given by a large number of trees
are averaged.

6. It maintains diversity, as not all attributes are considered while making
each decision tree (though this is not true in all cases).

7. It is relatively immune to the curse of dimensionality: since each tree does not
consider all the attributes, the feature space is reduced.

8. We don't have to segregate data into train and test, as roughly one-third
of the data is never seen by a given tree built on a bootstrap sample.

Disadvantages

1. A random forest is highly complex compared to a single decision tree, where
decisions can be made by following the path of the tree.

2. Training time is longer than for simpler models due to its complexity, and
whenever it has to make a prediction, each decision tree has to generate an
output for the given input data.

Summary

We can conclude that random forest is a high-performing technique that is
widely used in various industries for its efficiency. It can handle binary,
continuous, and categorical data.

Random forest is a great choice when you want to build a model quickly and
efficiently, and one of its strengths is that it copes well with imperfect data,
including missing values.

Overall, random forest is a fast, simple, flexible, and robust model, with some
limitations.
