UNIT II 2.1 ML Decision Tree Learning
(20BT60501)
COURSE DESCRIPTION:
Concept learning, General to specific ordering, Decision tree
learning, Support vector machine, Artificial neural networks,
Multilayer neural networks, Bayesian learning, Instance based
learning, reinforcement learning.
Subject: MACHINE LEARNING (20BT60501)
Prepared By:
Dr.J.Avanija
Professor
Dept. of CSE
Sree Vidyanikethan Engineering College
Tirupati.
Unit II – DECISION TREE LEARNING AND KERNEL MACHINES
Why Decision Tree?
Types of Decision Trees
Decision Tree Representation
• Terminologies
• Root Node: Root node is from where the decision tree starts.
It represents the entire dataset, which further gets divided into two or
more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be
split further once a leaf node is reached.
• Splitting: Splitting is the process of dividing a decision node/root
node into sub-nodes according to the given conditions.
• Branch/Sub-Tree: A subtree formed by splitting a node of the tree.
• Pruning: Pruning is the process of removing unwanted branches
from the tree.
• Parent/Child Node: A node that is split into sub-nodes is called the
parent node of those sub-nodes, and the sub-nodes are its child nodes.
MODEL
[Figure: Training Data → Learning Algorithm → Predictive Model (evaluated for accuracy); Unseen Data → Predictive Model → Prediction Results]
[Figure: example decision tree with root node Age and branches <30, 30-60, and >60, leading to Income / Budget Spender leaf nodes; leaf counts 9 Yes, 5 No]
Decision Tree Representation
[Figure: decision tree for the PlayTennis concept, referenced by the text below]
Decision Tree Representation
Decision trees classify instances by sorting them down the tree from the
root to some leaf node, which provides the classification of the instance.
The decision tree in the figure above classifies a particular morning according to
whether it is suitable for playing tennis, returning the classification
associated with the leaf that is reached (in this case Yes or No). The example
instance shown is classified as a negative instance.
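To make the sorting process concrete, the sketch below encodes the PlayTennis tree as a nested Python dict and classifies one instance; the dict layout, the function name classify, and the example instance are illustrative assumptions, not taken from the slides.

```python
# Hypothetical nested-dict encoding of the PlayTennis tree (illustrative only):
# an internal node maps an attribute name to {attribute value -> subtree};
# a leaf is simply the classification "Yes" or "No".
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))   # attribute tested at this node
        value = instance[attribute]    # the instance's value for that attribute
        tree = tree[attribute][value]  # follow the matching branch
    return tree                        # a leaf holds the classification

# (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong) is sorted to "No".
print(classify(play_tennis_tree,
               {"Outlook": "Sunny", "Temperature": "Hot",
                "Humidity": "High", "Wind": "Strong"}))
```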
Appropriate Problems for Decision Tree Learning
Examples:
Equipment or medical diagnosis
Credit risk analysis
Basic Decision Tree Learning Algorithm
The choice of where to make splits heavily affects a tree's
accuracy.
The decision criteria are different for classification and regression
trees.
Decision trees use multiple algorithms to decide whether to split a node into
two or more sub-nodes.
The algorithm selection also depends on the type of target variable. Some
algorithms used in decision trees:
ID3 → basic algorithm
C4.5 → successor of ID3
CART → Classification And Regression Trees
CHAID → Chi-square Automatic Interaction Detection; performs multi-level
splits when computing classification trees
MARS → Multivariate Adaptive Regression Splines
ID3 Algorithm
ID3 (Iterative Dichotomiser 3) is named as such because the
algorithm iteratively (repeatedly) dichotomizes (divides)
the features into two or more groups at each step.
Invented by Ross Quinlan.
The ID3 algorithm builds decision trees using a top-down greedy
search through the space of possible branches, with no
backtracking.
It is a classification algorithm that follows a greedy approach,
selecting at each step the attribute that yields the maximum
Information Gain (IG), i.e., the minimum Entropy (H).
Typically used in the Machine Learning and Natural Language
Processing domains.
ID3 Algorithm
Steps in ID3 algorithm:
STEP 1: Begin with the original set S as the root node.
STEP 2: Iterate through the unused attributes of the set S and
calculate the Entropy (H) and Information Gain (IG) of each attribute.
STEP 3: Select the attribute which has the smallest entropy or,
equivalently, the largest information gain.
STEP 4: The set S is then split by the selected attribute to produce subsets
of the data.
STEP 5: The algorithm continues to recurse on each subset, considering only
attributes never selected before.
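A minimal from-scratch sketch of these steps in Python is given below. It relies on the entropy and information-gain measures defined on the following slides; all function names, variable names, and the toy dataset are illustrative assumptions, not part of the ID3 specification.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Expected reduction in entropy from splitting on attribute attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    """Grow a tree recursively, choosing the attribute with the largest gain."""
    if len(set(labels)) == 1:                 # pure subset -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes,                    # STEP 2/3: attribute with max gain
               key=lambda a: information_gain(rows, labels, a))
    node = {best: {}}
    for value in set(row[best] for row in rows):   # STEP 4: split S on `best`
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node[best][value] = id3(list(sub_rows), list(sub_labels),
                                [a for a in attributes if a != best])
    return node

# Toy illustrative dataset (not from the slides):
rows = [{"Outlook": o, "Wind": w} for o, w in
        [("Sunny", "Weak"), ("Sunny", "Strong"), ("Overcast", "Weak"),
         ("Rain", "Weak"), ("Rain", "Strong")]]
labels = ["No", "No", "Yes", "Yes", "No"]
print(id3(rows, labels, ["Outlook", "Wind"]))
```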
Entropy
Entropy is a metric that measures the impurity of a collection of examples. It
quantifies the randomness in the data; equivalently, entropy measures the
homogeneity of the examples.
Given a collection S containing positive and negative examples
of some target concept, the entropy of S relative to this boolean
classification is:
Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)
where p(+) and p(-) are the proportions of positive and negative examples in S.
The entropy relative to a boolean classification varies from 0 to 1:
it is 0 when all members of S belong to the same class, and 1 when S
contains equal numbers of positive and negative examples.
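A quick numeric check of the formula in Python (the [9+, 5-] counts echo the example figure earlier in the unit; treating them as the intended example is an assumption):

```python
import math

def entropy(p_pos, p_neg):
    """Entropy of a boolean classification, given the two class proportions."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)  # 0*log2(0) := 0

print(round(entropy(9 / 14, 5 / 14), 3))  # 0.94 (collection with 9+ and 5- examples)
print(entropy(1.0, 0.0))                  # 0.0  (all examples in one class)
print(entropy(0.5, 0.5))                  # 1.0  (equal numbers of + and - examples)
```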
Information Gain
A measure of the effectiveness of an attribute in classifying the training
data.
Information gain is the expected reduction in entropy caused by partitioning
the examples according to the attribute.
The information gain, Gain(S, A), of an attribute A relative to a collection of
examples S is defined as:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where Values(A) is the set of possible values of attribute A and S_v is the
subset of S for which A has value v.
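As a worked illustration of the formula (the Wind attribute and the [9+, 5-] / [6+, 2-] / [3+, 3-] counts follow the standard PlayTennis example and are assumptions, not values read from the slides):

```python
import math

def H(p, n):
    """Entropy of a collection containing p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x)

# S = [9+, 5-]; splitting on Wind gives Weak -> [6+, 2-] and Strong -> [3+, 3-].
gain_wind = H(9, 5) - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(gain_wind, 3))   # ~0.048
```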
A Typical Learned Decision Tree
derived from the ID3 algorithm
Hypothesis Space Search in Decision Tree Learning
ID3 can be characterized as searching a hypothesis space for one
that fits the training examples.
The hypothesis space searched by ID3 is the set of possible
decision trees.
ID3 performs a simple-to-complex, hill-climbing search through
this hypothesis space.
It begins with the empty tree, then considers progressively more
elaborate hypotheses in search of a decision tree that correctly
classifies the training data.
Information gain is the heuristic measure that guides this
hill-climbing search.
Inductive Bias
Inductive Bias in ID3
• ID3 search strategy
– selects in favor of shorter trees over longer ones,
– selects trees that place the attributes with the highest
information gain closest to the root.
– Because ID3 uses the information gain heuristic and a hill-climbing
strategy, it does not always find the shortest consistent tree, and
it is biased to favor trees that place attributes with high
information gain closest to the root.
Inductive Bias of ID3:
– Shorter trees are preferred over longer trees.
– Trees that place high information gain attributes close to
the root are preferred over those that do not.
Inductive Bias in ID3
Occam’s Razor
A classical example of an inductive bias.
The simplest consistent hypothesis about the target function
is assumed to be the best.
Select the solution with the fewest assumptions.
Hyperparameters
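The slide content for this topic did not survive extraction. As an illustrative sketch only, assuming the usual decision-tree hyperparameters were meant, scikit-learn exposes them as constructor arguments:

```python
from sklearn.tree import DecisionTreeClassifier

# Common decision-tree hyperparameters (values here are purely illustrative):
clf = DecisionTreeClassifier(
    criterion="entropy",    # impurity measure used for splits ("gini" or "entropy")
    max_depth=4,            # maximum depth of the tree
    min_samples_split=10,   # minimum number of examples required to split a node
    min_samples_leaf=5,     # minimum number of examples required at a leaf
    max_features=None,      # number of features considered when searching a split
    ccp_alpha=0.0,          # cost-complexity pruning strength (0 = no pruning)
)
```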
Overfitting
Given a hypothesis space H, a hypothesis h ∈ H is said to OVERFIT
the training data if there exists some alternative hypothesis h' ∈ H,
such that h has smaller error than h' over the training examples, but
h' has a smaller error than h over the entire distribution of instances.
• As ID3 adds new nodes to grow the decision tree, the accuracy of the tree
measured over the training examples increases monotonically.
• However, when measured over a set of test examples independent of
the training examples, accuracy first increases, then decreases.
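A small sketch of this behaviour on synthetic data (the dataset, depths, and random seeds are arbitrary choices; exact accuracies will vary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# Deeper trees keep improving training accuracy, while test accuracy
# typically peaks and then degrades as the tree overfits.
for depth in (1, 2, 4, 8, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, round(tree.score(X_train, y_train), 2),
          round(tree.score(X_test, y_test), 2))
```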
Avoid Overfitting - Reduced-Error Pruning
• Each decision node is considered for pruning: the subtree rooted at it is
replaced by a leaf labelled with the most common classification of the
associated training examples, and the change is kept only if the pruned
tree performs no worse on a validation set.
• The accuracy over the test set increases as nodes are pruned from the
tree.
• The validation set used for pruning is distinct from both the training
and test sets.
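scikit-learn does not implement reduced-error pruning directly; the sketch below instead uses its cost-complexity pruning path and a separate validation set to choose how far to prune, which is a related but not identical technique (all names and numbers are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=1)
# Three disjoint sets: training, validation (guides pruning), and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5,
                                                    random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=1)

# Candidate pruning strengths (in increasing order) from the pruning path.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train)

best_tree, best_val = None, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1)
    tree.fit(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    if val_acc >= best_val:            # prefer the smaller (more pruned) tree on ties
        best_tree, best_val = tree, val_acc

print("validation accuracy:", round(best_val, 2))
print("test accuracy:", round(best_tree.score(X_test, y_test), 2))
```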
Rule Post-Pruning
• Rule post-pruning is another successful method for finding
high-accuracy hypotheses.
• It is used by the C4.5 learning algorithm (an extension of ID3).
Incorporating Continuous-Valued Attributes
Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
Candidate thresholds are placed midway between adjacent values where the
classification changes (here (48+60)/2 = 54 and (80+90)/2 = 85), defining new
boolean attributes such as Temperature > 54.
Use these new boolean attributes in the same way as the other discrete-valued
attributes.
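A small sketch of this threshold-selection idea on the Temperature example above (the helper names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Candidate thresholds: midpoints between adjacent values where the class changes.
candidates = [(temperature[i] + temperature[i + 1]) / 2
              for i in range(len(temperature) - 1)
              if play_tennis[i] != play_tennis[i + 1]]
print(candidates)   # [54.0, 85.0]

# Information gain of the boolean attribute "Temperature > t" for each candidate.
for t in candidates:
    above = [c for temp, c in zip(temperature, play_tennis) if temp > t]
    below = [c for temp, c in zip(temperature, play_tennis) if temp <= t]
    gain = (entropy(play_tennis)
            - (len(above) / len(play_tennis)) * entropy(above)
            - (len(below) / len(play_tennis)) * entropy(below))
    print(t, round(gain, 3))
```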
Alternative Selection Measures
• Ex. The Date attribute has many values and may separate the training examples
into very small subsets (even singleton sets, i.e., perfect partitions).
– Information gain will therefore be very high for the Date attribute, even
though it is a poor predictor over unseen instances.
– The gain ratio measure penalizes such attributes by dividing the gain by the
split information, the entropy of S with respect to the values of the attribute.
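A sketch of the split-information and gain-ratio measures that penalize many-valued attributes such as Date (function and variable names are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain(rows, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    return entropy(labels) - sum(
        (sum(1 for r in rows if r[attr] == v) / total)
        * entropy([l for r, l in zip(rows, labels) if r[attr] == v])
        for v in set(r[attr] for r in rows))

def gain_ratio(rows, labels, attr):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A).

    SplitInformation(S, A) is the entropy of S with respect to the values of A;
    it is large for attributes like Date that split S into many small subsets,
    so dividing by it penalizes such attributes.
    """
    return gain(rows, labels, attr) / entropy([r[attr] for r in rows])
```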
Weaknesses:
Decision trees are less appropriate for estimation tasks where the
goal is to predict the value of a continuous attribute.
Decision trees are prone to errors in classification problems with
many classes and a relatively small number of training examples.
Decision trees can be computationally expensive to train: the
process of growing a decision tree is computationally expensive.
Train-Test Split Example
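The example figure on this slide did not survive extraction; the sketch below is an illustrative train/test split with scikit-learn (the dataset and parameter values are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the examples as an unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)              # learn the tree from the training split
predictions = tree.predict(X_test)      # predict on the unseen test split
print("test accuracy:", accuracy_score(y_test, predictions))
```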