Decision Tree & Random Forest

Clustering
• Clustering is the process of grouping a set of objects into classes of similar objects.
• It groups data points that are close (or similar) to each other.
• Clustering is unsupervised machine learning: there are no predefined classes.
• A good clustering method will produce clusters with
  o High intra-class similarity
  o Low inter-class similarity
Types of Clustering
Centroid based clustering
o The centroid of a cluster is the arithmetic mean of all the points in the cluster.
o Centroid-based clustering organizes the data into non-hierarchical clusters.
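To make this concrete, here is a minimal sketch of k-means, a classic centroid-based method, using scikit-learn; the synthetic blob data and the choice of three clusters are illustrative assumptions, not part of the original slides.

```python
# Minimal k-means sketch (centroid-based clustering) using scikit-learn.
# The synthetic data and n_clusters=3 are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Each centroid is the arithmetic mean of the points assigned to it.
print(kmeans.cluster_centers_)
```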
Density based clustering
o Density-based clustering connects contiguous areas of high example density into clusters.
o This allows for the discovery of any number of clusters of any shape.
o Outliers are not assigned to clusters.
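A minimal density-based sketch using scikit-learn's DBSCAN; the two-moons data and the eps/min_samples values are illustrative assumptions.

```python
# Minimal density-based clustering sketch with DBSCAN (scikit-learn).
# eps and min_samples below are illustrative; they control what counts
# as a "contiguous area of high density".
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# DBSCAN discovers arbitrarily shaped clusters; outliers get label -1.
print(np.unique(db.labels_))
```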
Distribution based clustering
o This approach assumes the data is composed of probabilistic distributions, such as Gaussian distributions.
o For example, a distribution-based algorithm might cluster the data into three Gaussian distributions.
o As distance from the distribution's center increases, the probability that a point belongs to the distribution decreases.
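A minimal distribution-based sketch using scikit-learn's GaussianMixture; the data and the choice of three components (echoing the three-Gaussian example above) are assumptions.

```python
# Minimal distribution-based clustering sketch with a Gaussian mixture
# (scikit-learn). Three components mirror the three-Gaussian example above.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

# predict_proba gives each point's membership probability per distribution;
# it shrinks as the point moves away from a component's center.
probs = gmm.predict_proba(X)
print(probs[:3].round(3))
```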
Hierarchical clustering
o Hierarchical clustering creates a tree of clusters.
o It is well suited to hierarchical data.
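A minimal hierarchical (agglomerative, bottom-up) sketch with scikit-learn; cutting the cluster tree at three clusters is an illustrative assumption.

```python
# Minimal hierarchical clustering sketch (scikit-learn, agglomerative /
# bottom-up). The number of clusters to cut the tree at is an assumption.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Merging the closest clusters step by step builds a tree of clusters
# (a dendrogram); cutting it at 3 clusters yields flat labels.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])
```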
Decision Tree
• A decision tree is a non-parametric supervised learning algorithm used for both classification and regression modeling.
• It is mostly preferred for solving classification problems.
• It is a tree-structured classifier:
  o Internal nodes represent the features of a dataset.
  o Branches represent the decision rules.
  o Each leaf node represents the final decision/outcome.
Decision Tree
• A decision tree starts with a root node, which has no incoming branches.
• The outgoing branches from the root node feed into the internal nodes (decision nodes).
• Decision nodes are used to make a decision and have multiple branches.
• Leaf nodes are the outputs of those decisions and have no further branches.
• A decision tree is a graphical representation of all the possible solutions to a problem/decision based on given conditions.
• To build a tree, the CART (Classification and Regression Tree) algorithm can be used, as sketched below.
• A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees.
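As one concrete route, scikit-learn's DecisionTreeClassifier implements an optimised version of CART; the Iris dataset and the depth limit below are illustrative choices, not part of the original slides.

```python
# Minimal CART sketch: scikit-learn's DecisionTreeClassifier implements an
# optimised version of CART. The Iris dataset here is just for illustration.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" is the CART default; max_depth limits tree complexity.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X, y)

# Each printed split is a question; Yes/No answers route samples to subtrees.
print(export_text(tree, feature_names=load_iris().feature_names))
```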
Decision Tree Example
(Figure: worked decision tree examples, not reproduced here.)
Decision Tree Terminologies
• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which is further divided into two or more homogeneous sets.
• Decision (or internal) node: Decision nodes are used to make decisions and have multiple branches.
• Leaf (external or terminal) node: Leaf nodes are the final output nodes; the tree cannot be split further after reaching a leaf node.
• Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-tree: A subtree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: A node that splits into sub-nodes is called a parent node, and its sub-nodes are called child nodes.
Decision Tree – Solving Example 1
(Figure: worked example, not reproduced here.)
Decision Tree – Solving Example 2
Suppose a candidate has a job offer and wants to decide whether to accept it or not.
(Figure: worked example, not reproduced here.)
Decision Tree - Algorithm
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM), such as (see the sketch after this algorithm):
  • Information Gain
  • Gini Index
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate a decision tree node that contains the best attribute.
• Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.
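To make Steps 1-5 and the two attribute selection measures concrete, here is a compact from-scratch sketch; it assumes small categorical data and uses information gain to pick splits (the Gini index is shown alongside), and is an illustrative outline rather than a production CART implementation.

```python
# From-scratch sketch of the algorithm above (illustrative, not production
# code). Assumes categorical features; shows both ASMs: Gini and info gain.
import numpy as np

def gini(y):
    # Gini index: 1 - sum(p_i^2) over class proportions p_i.
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    # Entropy: -sum(p_i * log2(p_i)), used by information gain.
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

def information_gain(X_col, y):
    # Parent impurity minus the weighted impurity of the child subsets.
    child = sum(
        (np.sum(X_col == v) / len(y)) * entropy(y[X_col == v])
        for v in np.unique(X_col)
    )
    return entropy(y) - child

def build_tree(X, y, features):
    # Step-1: the node starts with the (sub)dataset S = (X, y).
    if len(np.unique(y)) == 1 or not features:      # Step-5 stopping rule
        return np.bincount(y).argmax()              # leaf: majority class
    # Step-2: pick the best attribute by the ASM (information gain here).
    best = max(features, key=lambda f: information_gain(X[:, f], y))
    node = {"feature": best, "children": {}}
    rest = [f for f in features if f != best]
    # Step-3/4: split S on each value of the best attribute...
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        # Step-5: ...and recurse on each subset.
        node["children"][v] = build_tree(X[mask], y[mask], rest)
    return node

# Tiny categorical toy data (assumed): columns = [outlook, windy].
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
y = np.array([1, 0, 1, 1, 0, 0])
print(build_tree(X, y, features=[0, 1]))
```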
Decision Tree
Advantages
• It is simple to understand, as it follows the same process a human follows when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps in thinking through all the possible outcomes of a problem.
• It requires less data cleaning than many other algorithms.
Disadvantages
• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be addressed using the Random Forest algorithm.
• With more class labels, the computational complexity of the decision tree may increase.
Random Forest
• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
• It can be used for both Classification and Regression problems in ML.
• It is based on the concept of ensemble learning: combining multiple classifiers to solve a complex problem and to improve the performance of the model.
• Random Forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and averages their predictions to improve the predictive accuracy on that dataset.
• A greater number of trees in the forest leads to higher accuracy and helps prevent overfitting.
Random Forest - Algorithm
• Step-1: Select K random data points from the training set.
• Step-2: Build a decision tree on the selected data points (subset).
• Step-3: Choose the number N of decision trees you want to build.
• Step-4: Repeat Steps 1 & 2 until N trees are built.
• Step-5: For a new data point, obtain the prediction of each decision tree and assign the new data point to the category that wins the majority vote.
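A minimal sketch of these steps with scikit-learn's RandomForestClassifier, which handles the bootstrap subsets and vote aggregation internally; the Iris data and n_estimators=100 are illustrative choices.

```python
# Minimal random forest sketch (scikit-learn). n_estimators is the N from
# Step-3; each tree is trained on a bootstrap subset (Steps 1-2, 4), and
# predict aggregates the per-tree votes (Step-5).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("accuracy:", forest.score(X_test, y_test))
```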
Random Forest
Advantages
• Random Forest is capable of performing both Classification and Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and mitigates the overfitting issue.
Disadvantages
• Although random forest can be used for both classification and regression tasks, it is less well suited to regression tasks.
Decision Tree vs Random Forest
(Comparison figure, not reproduced here.)
