Decision Tree and Random Forest

Decision Tree Technique

Decision Tree
• "A decision tree in machine learning is a flowchart structure in which each node represents a 'test' on an attribute and each branch represents the outcome of the test."
• The end node (called a leaf node) represents a class label.
• A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.

• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• To make a prediction using the decision tree, we start at the root
node and follow the branches based on the feature values of the new
instance.

• Eventually, we reach a leaf node and output the corresponding prediction.
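As a concrete illustration of this root-to-leaf prediction process, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the toy feature encoding and values are made up for illustration and are not from these slides.

```python
# Minimal sketch: fitting a decision tree and predicting for a new instance.
# The toy data below is illustrative only.
from sklearn.tree import DecisionTreeClassifier

# Features: [outlook (0=sunny, 1=overcast, 2=rain), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 1], [1, 1], [2, 1], [2, 0], [1, 0], [0, 0]]
y = ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"]  # class labels

tree = DecisionTreeClassifier(criterion="entropy")  # split using information gain
tree.fit(X, y)

# Prediction starts at the root and follows branches based on the feature
# values of the new instance until a leaf node is reached.
new_instance = [[2, 1]]  # rain, high humidity
print(tree.predict(new_instance))
```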
Attribute Selection Measures

• While implementing a decision tree, the main issue is how to select the best attribute for the root node and for sub-nodes. To solve this problem, there is a technique called the Attribute Selection Measure (ASM).
• By this measurement, we can easily select the best attribute for the
nodes of the tree. There are two popular techniques for ASM, which
are:
• Information Gain
• Gini Index
• The decision tree algorithm starts with the entire dataset and selects
the best feature to split the data based on a criterion, such as
information gain or Gini impurity.

• The selected feature becomes the root node of the tree, and the data
is split into different branches based on the possible attribute values.
1. Information gain
• Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build
the decision tree.
• A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest information
gain is split first.
• It can be calculated using the below formula:
Information Gain = Entropy(S) − Σv [ (|Sv| / |S|) × Entropy(Sv) ]
where Sv is the subset of S for which the chosen attribute takes value v, and |Sv| / |S| is its weight (the "weighted average" term).
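A minimal from-scratch sketch of this formula; the helper names entropy and information_gain and the example values are illustrative, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the class proportions p."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy(S) minus the weighted average entropy of each subset
    produced by splitting on the given feature."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum((len(s) / total) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Illustrative example: 9 "Yes" / 5 "No" labels split by a three-valued feature.
labels = ["Yes"] * 9 + ["No"] * 5
feature = ["A"] * 5 + ["B"] * 4 + ["C"] * 5
print(information_gain(labels, feature))
```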
2. The Gini Index
• The Gini Index is calculated by subtracting the sum of the squared
probabilities of each class from one.
• Gini index is a measure of impurity or purity used while creating a
decision tree in the CART(Classification and Regression Tree)
algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini index. For a perfectly classified node, the Gini index is zero.
• The Gini index can be calculated using the formula below:
Gini Index = 1 − Σj (pj)²
where pj is the probability (proportion) of class j at the node.
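A minimal from-scratch sketch of this calculation; the helper name gini_index is illustrative.

```python
# Gini index: 1 minus the sum of squared class probabilities.
from collections import Counter

def gini_index(labels):
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_index(["Yes"] * 9 + ["No"] * 5))   # mixed node, Gini > 0 (about 0.459)
print(gini_index(["Yes"] * 10))               # perfectly classified node, Gini = 0
```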
Decision Tree Terminology

• Entropy E(X): The "average amount of information" contained in a random variable (X) is called entropy. It is denoted by E or H.
• In other words, entropy is the "measure of randomness of information" of a variable. Entropy (E) is the measure of impurity or uncertainty associated with a random variable (X).
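In formula form, consistent with the information-gain expression above:
Entropy(S) = − Σ p(x) · log₂ p(x)
where p(x) is the proportion of examples in S belonging to class x and the sum runs over all classes.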
ID3 Algorithm

ID3 Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows don’t belong to the same class, split the dataset S into subsets
using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information gain.
4. If all rows belong to the same class, make the current node as a leaf node with the class as
its label.
5. Repeat for the remaining features until we run out of all features, or the decision tree has
all leaf nodes.
For example, consider a dataset (Table 1) containing nine "Yes" and five "No" examples, i.e. 9 positive (+ve) and 5 negative (−ve) examples.
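As a worked check using the entropy formula above, the entropy of this 9 (+ve) / 5 (−ve) split is:
Entropy(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) ≈ 0.410 + 0.530 ≈ 0.940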
Advantages of the Decision Tree

• It is simple to understand, as it follows the same process that a human follows while making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement for data cleaning compared to other algorithms.
Disadvantages of the Decision Tree

• The decision tree contains lots of layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
• For more class labels, the computational complexity of the decision tree may increase.
Overfitting issues in Decision Tree
• Overfitting refers to the condition where the model fits the training data completely but fails to generalize to unseen test data.

• Overfit condition arises when the model memorizes the noise of the
training data and fails to capture important patterns.

• A perfectly fit decision tree performs well for training data but
performs poorly for unseen test data.
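A minimal sketch of this effect, assuming scikit-learn and one of its built-in toy datasets: a fully grown tree tends to score much higher on training data than on held-out test data, while limiting the depth (or using a Random Forest, as discussed next) usually narrows that gap.

```python
# Illustrative sketch: comparing a fully grown tree with a depth-limited one.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The fully grown tree typically fits the training data (near-)perfectly
# but scores lower on the unseen test data; the depth-limited tree
# usually generalizes better.
print("full   train/test:", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("pruned train/test:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))
```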
Random Forest Technique
• Random Forest is a popular machine learning algorithm that belongs
to the supervised learning technique.

• It can be used for both Classification and Regression problems in ML.

• It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
• A large number of relatively uncorrelated models (trees) operating as a
committee will outperform any of the individual constituent models.

• The fundamental concept behind random forest is a simple but powerful one: the wisdom of crowds.
• "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
• The name "random forest" comes from the fact that the algorithm
creates an ensemble of decision trees and introduces randomness
during the training process.

• Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, outputs the final result.

• A greater number of trees in the forest generally leads to higher accuracy and helps prevent the problem of overfitting.
• Random Forest employs a technique called bootstrapping, where
multiple random subsets (with replacement) are created from the
original training dataset.

• Each subset, known as a bootstrap sample, is of the same size as the original dataset but contains some repeated instances and may exclude others.

• For each bootstrap sample, a decision tree is constructed using a subset of features randomly chosen from the available features.
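A minimal sketch of the bootstrapping step, assuming NumPy; the array names and sizes are illustrative.

```python
# Illustrative sketch of bootstrapping and random feature selection.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 8           # sizes are made up for illustration
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)

# Bootstrap sample: same size as the original data, drawn with replacement,
# so some rows repeat and others are left out.
row_idx = rng.choice(n_samples, size=n_samples, replace=True)
X_boot, y_boot = X[row_idx], y[row_idx]

# Random subset of features for this tree (e.g. about sqrt(n_features) of them).
feat_idx = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)
X_boot_subset = X_boot[:, feat_idx]
print(X_boot_subset.shape)   # (100, 2) here, since sqrt(8) rounds down to 2
```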
Pictorial understanding of Random Forest Algorithm
Working of Random Forest algorithm
• Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make a prediction with each tree created in the first phase.
• The working process can be explained as follows:
• Phase-1
• 1. Select random K data points from the training set.
• 2. Build the decision trees associated with the selected data points (subsets).
• 3. Choose the number N of decision trees that you want to build.
• 4. Repeat Steps 1 & 2 until N trees have been built.
• Phase-2
For new data points, find the prediction of each decision tree, and assign the new data points to the category that wins the majority vote.
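A minimal sketch of these two phases, assuming scikit-learn's DecisionTreeClassifier for the individual trees; the data, labels, and parameter values are illustrative.

```python
# Illustrative sketch of the two phases described above.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # toy training data
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy labels

# Phase 1: build N trees, each on a random subset (bootstrap sample) of the data.
N = 10
forest = []
for _ in range(N):
    idx = rng.choice(len(X), size=len(X), replace=True)    # step 1: random data points
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])                                # step 2: build the tree
    forest.append(tree)                                     # step 4: repeat until N trees exist

# Phase 2: each tree predicts for the new point; the majority vote wins.
new_point = np.array([[0.5, -0.2, 0.1, 0.0, 1.0]])
votes = [tree.predict(new_point)[0] for tree in forest]
print(Counter(votes).most_common(1)[0][0])
```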
Random Forests offer several advantages:
• They are effective in handling large and high-dimensional datasets.

• They provide robustness against overfitting, as the randomness introduced during the training process helps reduce variance.

• They can handle both numerical and categorical features without requiring extensive data preprocessing.

• They offer feature importance estimation, indicating the relevance of each feature in the prediction process.
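A minimal sketch of the feature-importance point, assuming scikit-learn's RandomForestClassifier and its built-in Iris toy dataset.

```python
# Illustrative sketch: a Random Forest reports how much each feature
# contributed to its splits via feature_importances_.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# feature_importances_ sums to 1; higher values mean the feature was more
# useful across the forest's splits.
for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```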
