
Decision Trees

Dr. Ashutosh K. Dubey

Shiv Nadar University Chennai

September 2, 2024
Decision Trees
■ What is it?
▶ Decision trees are a type of supervised machine learning
algorithm.
▶ A Decision Tree is a flowchart-like structure used for
decision-making.
▶ It consists of nodes representing decisions or outcomes,
branches for possible choices, and leaves for final decisions or
classifications.
■ Why is it important?
▶ Decision Trees are intuitive and interpretable, making them
useful for both predictive modeling and understanding decision
processes.
▶ They can handle both categorical and numerical data.

■ Benefits
▶ Easy to visualize and understand, even for non-technical
audiences.
▶ Can be used for both classification and regression tasks.
Definition & Purpose

■ Definition:
▶ A Decision Tree is a model that predicts the value of a target
variable based on input variables.
▶ It is structured as a flowchart with nodes, branches, and leaves.

■ Purpose:
▶ To make decisions that lead to favorable outcomes.
▶ Helps in classification and regression tasks.
■ Example:
▶ Predicting whether a customer will purchase a product based
on attributes like age, income, and browsing history.
Classification Trees

■ What is a Classification Tree?


▶ A type of Decision Tree where the target variable is categorical.
▶ Used to classify data into predefined classes.
■ Example:
▶ Predicting whether a patient has a disease based on age,
symptoms, and test results.
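As an illustration, a minimal classification-tree sketch using scikit-learn; the library choice and the tiny hand-made patient dataset are assumptions for demonstration, not part of the slides.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient records: [age, symptom_score, test_result]
X = [[25, 2, 0], [60, 7, 1], [45, 5, 1], [33, 1, 0], [70, 8, 1], [29, 3, 0]]
y = [0, 1, 1, 0, 1, 0]  # 1 = has the disease, 0 = healthy

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

print(clf.predict([[50, 6, 1]]))  # predicted class for a new patient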
Regression Trees

■ What is a Regression Tree?


▶ A type of Decision Tree where the target variable is continuous.
▶ Used to predict real-valued outcomes.
■ Example:
▶ Predicting house prices based on features like square footage,
number of bedrooms, and location.
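Similarly, a minimal regression-tree sketch in scikit-learn, with hypothetical house data invented purely for illustration.

from sklearn.tree import DecisionTreeRegressor

# Hypothetical houses: [square_footage, bedrooms, distance_to_city_km]
X = [[1200, 2, 10], [2500, 4, 5], [1800, 3, 8], [900, 1, 15], [3000, 5, 3]]
y = [150_000, 420_000, 260_000, 110_000, 550_000]  # sale prices

reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)

print(reg.predict([[2000, 3, 6]]))  # predicted price for a new house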
Basic Terminology: Nodes

■ What are Nodes?


▶ Points where the data is split based on an attribute.
▶ Each node represents a decision or a test.
• Root Node: The top node of the tree, where the first split of the
data is made; in other words, the first decision point.
• Internal Nodes: Points where the data is split further.
• Leaf Nodes: The final decision or classification.
■ Example:
▶ In predicting whether a customer will buy a product, a node
might test if the customer’s income is above a certain
threshold.
Basic Terminology: Branches

■ What are Branches?


▶ Represent the outcome of a decision at a node.
▶ Connect one node to another.
■ Example:
▶ If a node tests whether a customer’s income is above $50,000,
branches might represent "Yes" (income > $50,000) or "No"
(income ≤ $50,000).
Basic Terminology: Decision Rules

■ What are Decision Rules?


▶ Conditions derived from the paths of the tree from the root to
the leaves.
▶ Represent the logic used to make a decision.

■ Example:
▶ "If income > $50,000 and credit score > 700, then approve
the loan."
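One way to see such rules in practice is to print the root-to-leaf paths of a fitted tree. The sketch below assumes scikit-learn and a tiny made-up loan dataset.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical loan applications: [income, credit_score]
X = [[40_000, 650], [80_000, 720], [55_000, 710], [30_000, 600], [90_000, 760]]
y = [0, 1, 1, 0, 1]  # 1 = approve, 0 = reject

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Prints the learned root-to-leaf decision rules as nested conditions
print(export_text(clf, feature_names=["income", "credit_score"]))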
Types of Decision Trees

■ Classification Trees
▶ Used for predicting the class or category of an object based on
input features.
▶ Example: Predicting whether an email is spam or not spam
based on keywords.
■ Regression Trees
▶ Used for predicting a continuous value.
▶ Example: Predicting sales of a product based on advertising
spend.
Building a Decision Tree

■ Algorithms
▶ CART: Classification and Regression Tree.
▶ ID3: Iterative Dichotomiser 3.
▶ CHAID: Chi-squared Automatic Interaction Detector.
Splitting Criteria: What is it?

■ Splitting criteria are the rules or measures used to decide how
to split the data at each node in a Decision Tree.
■ Commonly used splitting criteria include:
▶ Gini Index: Measures impurity in a dataset. Lower values
indicate better splits.
▶ Information Gain: Based on entropy, it measures the
reduction in uncertainty when the data is split.
▶ Entropy: Measures randomness or disorder, used to calculate
Information Gain.
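In scikit-learn's tree implementation, the splitting criterion is a constructor argument, so Gini- and entropy-based trees can be compared directly. A small sketch on the bundled iris dataset; the dataset choice is just for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The same tree learner with two different splitting criteria
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Compare the depth of the trees each criterion produces
print(gini_tree.get_depth(), entropy_tree.get_depth())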
Splitting Criteria: Importance and Benefits

■ Why is it important?
▶ The choice of splitting criterion impacts:
• Accuracy: Ensures that the tree makes accurate predictions
by creating clear branches.
• Complexity: Affects whether the tree is too complex
(overfitting) or too simple (underfitting).
■ Benefits
▶ A well-chosen splitting criterion can lead to:
• Improved Performance: Ensures the tree generalizes well and
performs better on unseen data.
• Meaningful Splits: Helps create splits that are intuitive and
easy to interpret.
Information Gain and Entropy

■ Information gain measures the reduction in entropy (uncertainty)
achieved by splitting the data on an attribute.
■ Entropy is a measure of impurity or uncertainty in a dataset.
■ The goal is to choose the attribute that maximizes information gain,
i.e., the most informative split.
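A from-scratch sketch of these two quantities, where entropy is H(S) = -Σ p_i log2(p_i) and information gain is the parent's entropy minus the weighted entropy of the children. The "stay"/"leave" toy labels are invented for illustration.

import numpy as np

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i), over the class proportions in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, children):
    # IG = H(parent) - sum_k (|child_k| / |parent|) * H(child_k)
    n = len(parent)
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted_child_entropy

parent = ["stay", "leave", "leave", "stay", "stay", "leave"]
split = [["stay", "stay", "stay"], ["leave", "leave", "leave"]]  # a perfect split
print(information_gain(parent, split))  # 1.0 bit: all uncertainty removed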
Gini Impurity

■ Gini impurity measures the probability that a randomly chosen
element would be incorrectly labeled if it were labeled according
to the class distribution in the node.
■ It is an alternative impurity measure used in decision trees (e.g., in CART).
■ The goal is to choose the attribute that minimizes the Gini impurity
of the resulting splits.
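A matching from-scratch sketch for Gini impurity, Gini(S) = 1 - Σ p_i², with invented toy labels.

import numpy as np

def gini(labels):
    # Gini(S) = 1 - sum_i p_i^2; 0 for a pure node, larger for mixed nodes
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini(["spam", "spam", "spam"]))        # 0.0  (pure node)
print(gini(["spam", "ham", "spam", "ham"]))  # 0.5  (maximally mixed, two classes)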
Tree Depth and Pruning
■ What is it?
▶ Tree Depth: The maximum number of splits from the root to
a leaf.
▶ Pruning: The process of reducing tree complexity by removing
branches that add little predictive power.
■ Why is it important?
▶ Controlling tree depth and pruning helps prevent overfitting,
where the model becomes too complex and captures noise
rather than signal.
■ Benefits
▶ Simplifies the decision tree by removing branches with little
power to classify instances.
▶ Pruning can be performed during tree building (pre-pruning) or after
the full tree is built (post-pruning); both variants are sketched in the
code after this slide.
▶ Pruning can improve model generalization and interpretability.
■ Case Study: Credit Card Fraud Detection
▶ Pruning is used to simplify a Decision Tree model for detecting
fraudulent transactions, making it more robust to new, unseen
data.
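A hedged sketch of both pruning styles from the slide above, using scikit-learn's max_depth (pre-pruning) and cost-complexity pruning via ccp_alpha (post-pruning). The breast-cancer dataset stands in for real fraud data purely for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: cap the depth while the tree is being grown
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Post-pruning: grow the full tree, then cut it back with cost-complexity pruning
full = DecisionTreeClassifier(random_state=0)
path = full.cost_complexity_pruning_path(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
pruned.fit(X_train, y_train)

# Compare test accuracy of the two pruned variants
print(shallow.score(X_test, y_test), pruned.score(X_test, y_test))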
Case Study: Predicting Employee Turnover

■ Scenario: A company wants to predict employee turnover based on
factors like job satisfaction, salary, and tenure.
■ Application of Information Gain:
▶ The company calculates Information Gain for each factor to
find the most informative one.
▶ Job satisfaction might have the highest Information Gain,
leading to the first split in the tree.
■ Outcome: The Decision Tree helps the company identify
high-risk employees and take proactive measures to reduce
turnover.
Advanced Topics in Decision Trees

■ Advanced topics enhance the performance, robustness, and
interpretability of Decision Trees.
■ These include:
▶ Ensemble Methods
▶ Feature Importance
▶ Handling Imbalanced Data
Ensemble Methods

■ What are Ensemble Methods?


▶ Techniques that combine multiple Decision Trees to create a
more powerful model.
■ Key Types:
▶ Random Forests: Builds multiple trees using different subsets
of data and features, then averages their predictions.
▶ Gradient Boosting: Builds trees sequentially, with each tree
correcting errors of the previous one.
■ Examples:
▶ Random Forests in Credit Scoring: Improves accuracy by
averaging predictions from multiple trees, reducing overfitting.
▶ Gradient Boosting in Customer Churn Prediction:
Captures complex patterns in customer behavior, improving
predictive accuracy.
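A minimal sketch comparing the two ensemble types with scikit-learn; the dataset and hyperparameters are illustrative assumptions, not recommendations.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Random Forest: many trees on bootstrapped rows and random feature subsets
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Gradient Boosting: trees built sequentially, each correcting the previous one's errors
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0)

print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(gb, X, y, cv=5).mean())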
Feature Importance

■ What is Feature Importance?


▶ A technique to identify which features are most influential in
making predictions in a Decision Tree model.
■ Key Concepts:
▶ Ranking Features: Features are ranked based on their
importance scores.
▶ Impact of Features: Measures how much a feature reduces
impurity across all trees in the ensemble.
■ Examples:
▶ Ranking Features in Marketing: Identifies the most
important factors influencing customer purchases, like
frequency of visits.
▶ Impact of Features in Healthcare: Highlights key predictors
in patient outcomes, such as age and medical history.
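A short sketch of impurity-based feature importances from a fitted Random Forest in scikit-learn; the dataset is chosen only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Impurity-based importances: how much each feature reduces impurity across all trees
ranking = sorted(zip(data.feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking[:5]:
    print(f"{name}: {score:.3f}")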
Handling Imbalanced Data

■ What is Imbalanced Data?


▶ Refers to datasets where one class is significantly
underrepresented compared to another.
■ Techniques for Handling Imbalanced Data:
▶ SMOTE: Creates synthetic examples of the minority class to
balance the dataset.
▶ Under-sampling: Reduces the number of instances in the
majority class to balance the class distribution.
▶ Class Weights: Assigns higher weights to the minority class
during model training.
■ Examples:
▶ SMOTE in Fraud Detection: Generates synthetic fraudulent
transactions, improving the model’s ability to detect fraud.
▶ Under-sampling in Disease Prediction: Reduces healthy
patient records to balance the dataset, improving rare disease
detection.
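A sketch of two of these techniques: class weights are built into scikit-learn's tree learner, while SMOTE comes from the separate imbalanced-learn package. The synthetic dataset is an assumption made for illustration.

from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 5% "fraud" (class 1), 95% "legitimate" (class 0)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))

# Option 1: class weights - penalize mistakes on the minority class more heavily
weighted_tree = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)

# Option 2: SMOTE - requires the separate imbalanced-learn package
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced with synthetic minority samples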
Thank You!!!
Comments & Suggestions
akdubey@snuchennai.edu.in
