MACHINE LEARNING

(20BT60501)

COURSE DESCRIPTION:
Concept learning, General-to-specific ordering, Decision tree learning, Support vector machines, Artificial neural networks, Multilayer neural networks, Bayesian learning, Instance-based learning, Reinforcement learning.
Subject: MACHINE LEARNING (20BT60501)

Topic: Unit II – DECISION TREE LEARNING AND KERNEL MACHINES

Prepared By:
Dr.J.Avanija
Professor
Dept. of CSE
Sree Vidyanikethan Engineering College
Tirupati.
Unit II – DECISION TREE LEARNING AND KERNEL MACHINES

Decision Tree Learning:


 Decision tree representation
 Problems for decision tree learning
 Decision tree learning algorithm
 Hypothesis space search
 Inductive bias in decision tree learning
 Issues in decision tree learning
Kernel Machines:
 Support vector machines
 SVMs for regression
 SVMs for classification
 Choosing C
 A probabilistic interpretation of SVMs
Introduction

 Decision trees are widely used in data science and are a key, proven tool for making decisions in complex scenarios.
 Decision trees are a type of supervised learning algorithm in which the data is repeatedly divided into different categories according to certain parameters.
Why Decision Tree?

 Tree-based algorithms are a popular family of related non-parametric, supervised methods for both classification and regression.
Types of Decision Trees

Categorical Variable Decision Trees
 The predicted outcome is a class (e.g., classifying a tumour as benign or malignant).
Continuous Variable Decision Trees
 The predicted outcome is a real number (e.g., the price of a house, or a patient's length of stay in a hospital).
Decision Tree Representation

 Decision trees are among the most powerful and popular tools for classification and prediction.
 A decision tree is represented as a tree structure, where:
 Each internal node denotes a test on an attribute,
 Each branch represents an outcome of the test,
 Each leaf node (terminal node) holds a class label.
 In other words: internal nodes test attributes, edges carry the attribute values, and external (leaf) nodes give the outcome of the classification (see the sketch below).
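To make this structure concrete, the sketch below encodes a small PlayTennis-style tree (the one whose rules appear later in this unit) as nested Python dictionaries, together with the root-to-leaf classification procedure. The dictionary layout and the helper name are choices of this sketch, not code from the slides.

```python
# A decision tree as nested dictionaries: each internal node is
# {"attribute": <name>, "branches": {<value>: subtree_or_label, ...}};
# a leaf is simply a class-label string.
play_tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Weak": "Yes"},
        },
    },
}

def classify(tree, example):
    """Sort an example down the tree from the root to a leaf."""
    while isinstance(tree, dict):            # internal node: test an attribute
        value = example[tree["attribute"]]   # follow the branch for its value
        tree = tree["branches"][value]
    return tree                              # leaf: the class label

print(classify(play_tennis_tree,
               {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # -> "No"
```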

Decision Tree Representation
• Terminologies
• Root Node: The node from which the decision tree starts. It represents the entire dataset, which is then divided into two or more homogeneous sets.
• Leaf Node: The final output nodes; the tree cannot be split further beyond a leaf node.
• Splitting: The process of dividing a decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-Tree: A subtree formed by splitting a node.
• Pruning: The process of removing unwanted branches from the tree.
• Parent/Child Node: A node that is split into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.

MODEL

• Categorical class label → Classification algorithm → Classifier
• Numerical class label → Prediction algorithm → Predictor
Classification & Prediction
Classification & Prediction
• Classification is the process of finding a model that describes and distinguishes data classes, so that the model can be used to predict the class of objects whose class label is unknown.

(Figure: classification workflow)
• Step 1 (Training): Training Data → Learning Algorithm → Predictive Model
• Step 2 (Testing): Test Data → Predictive Model → Accuracy
• Step 3 (Using the Model for Prediction): Unseen Data → Predictive Model → Prediction Results
Classification Steps 1/2: Model Construction
(Figure: an example decision tree built during model construction. The root node tests Age (<30, 30-60, >60), two of its branches test Income (<30K, >30K), and the leaf nodes label customers as Budget Spender or Big-Spender.)
Classification Steps 2/2: Model Usage
Classification vs Prediction
Issues of Classification and Prediction
Decision Tree Representation
(Figure: a typical learned decision tree for the PlayTennis data; the training set contains 9 Yes and 5 No examples.)
Decision Tree Representation

 Decision trees classify instances by sorting them down the tree from the root to some leaf node.
 The decision tree in the figure above classifies a particular morning according to whether it is suitable for playing tennis, returning the classification associated with the leaf that is reached (in this case Yes or No); an instance reaching a No leaf is classified as a negative instance.
 A decision tree represents a disjunction of conjunctions of constraints on the attribute values of instances.
Appropriate Problems for Decision Tree Learning

 Instances are describable by attribute-value pairs
 Target function is discrete valued
 Disjunctive hypotheses may be required
 Training data may be noisy

Examples:
 Equipment or medical diagnosis
 Credit risk analysis
Basic Decision Tree Learning Algorithm
 The decision of making strategic splits heavily affects a tree’s
accuracy.
 The decision criteria are different for classification and regression
trees.
 Decision trees use multiple algorithms to decide to split a node into two
or more sub-nodes.
 The algorithm selection is also based on the type of target variables. Let
us look at some algorithms used in Decision Trees:
 ID3 → Basic algorithm
 C4.5 → successor of ID3
 CART → Classification And Regression Tree
 CHAID → Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees
 MARS → Multivariate Adaptive Regression Splines
ID3 Algorithm
 ID3 (Iterative Dichotomiser 3 ) is named such because the
algorithm iteratively (repeatedly) dichotomizes(divides)
features into two or more groups at each step.
 Invented by Ross Quinlan
 ID3 algorithm builds decision trees using a top-down greedy
search approach through the space of possible branches with no
backtracking.
 It is a classification algorithm that follows a greedy approach, at each step selecting the attribute that yields the maximum Information Gain (IG), or equivalently the minimum Entropy (H).
 Typically used in Machine Learning and Natural Language Processing domains.
ID3 Algorithm
Steps in ID3 algorithm:
STEP 1: Begin with the original set S as the root node.
STEP 2: Iterate through the unused attributes of the set S and calculate the Entropy (H) and Information Gain (IG) of each attribute.
STEP 3: Select the attribute with the smallest entropy, i.e., the largest information gain.
STEP 4: The set S is then split by the selected attribute to produce subsets of the data.
STEP 5: The algorithm continues to recurse on each subset, considering only attributes never selected before (a minimal sketch of this recursion is given below).
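A minimal, self-contained Python sketch of these steps is shown below. The nested-dict tree layout and the helper names are choices of this sketch, not code from the slides; the entropy and information-gain measures it uses are defined on the following slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels (0 * log2(0) is taken as 0)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from splitting the examples on `attribute`."""
    base = entropy([ex[target] for ex in examples])
    total = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

def id3(examples, target, attributes):
    """Recursive ID3: returns a nested-dict tree, or a class label for a leaf."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:              # all examples have the same class
        return labels[0]
    if not attributes:                     # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # STEPS 2-3: choose the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    # STEPS 4-5: split on it and recurse, never reusing the chosen attribute.
    tree = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        tree["branches"][value] = id3(subset, target, remaining)
    return tree
```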

Information Gain is used as the attribute selection measure; entropy and information gain are defined on the following slides.
Entropy
 Entropy is a metric that measures the impurity of a given collection of examples: it quantifies the randomness in the data, i.e., how far the examples are from being homogeneous.
 Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is:

   Entropy(S) = −p⊕ log₂(p⊕) − p⊖ log₂(p⊖)

 where p⊕ and p⊖ are the proportions of positive and negative examples in S.
Entropy
 Entropy relative to a boolean classification varies from 0 to 1.
 More generally, if the target attribute can take on c different values:

   Entropy(S) = Σ from i = 1 to c of −p_i log₂(p_i)

 where p_i is the proportion of S belonging to class i.

 Entropy is 0 if we have only positive or only negative examples.
 Entropy is 1 when there are equal numbers of positive and negative examples.
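As a quick numeric check of these properties, the sketch below evaluates the formula for a collection with 9 positive and 5 negative examples (the proportions shown in the earlier PlayTennis figure) and for the two extreme cases. The function name is a choice of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels (0 * log2(0) is taken as 0)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

print(entropy(["Yes"] * 9 + ["No"] * 5))   # ~0.940: the mixed 9-Yes / 5-No collection
print(entropy(["Yes"] * 14))               # 0: a pure collection (only one class)
print(entropy(["Yes"] * 7 + ["No"] * 7))   # 1: an evenly split collection
```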
Information Gain
 A measure of the effectiveness of an attribute in classifying the training data.
 Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute.
 The information gain Gain(S, A) of an attribute A relative to a collection of examples S is defined as:

   Gain(S, A) = Entropy(S) − Σ over v ∈ Values(A) of (|S_v| / |S|) · Entropy(S_v)

 Equivalently: Information Gain = Entropy(parent) − [weighted average Entropy(children)]
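The snippet below applies this definition to the Wind attribute, reusing the entropy helper from the previous sketch. The per-branch counts (8 Weak examples with 6 Yes / 2 No, and 6 Strong examples with 3 Yes / 3 No, out of 9 Yes / 5 No overall) follow the standard 14-example PlayTennis table; that table is an assumption here, since it is not reproduced on these slides.

```python
# Gain(S, Wind) = Entropy(S) - sum over values v of (|S_v| / |S|) * Entropy(S_v)
S      = ["Yes"] * 9 + ["No"] * 5          # all 14 examples
weak   = ["Yes"] * 6 + ["No"] * 2          # examples with Wind = Weak
strong = ["Yes"] * 3 + ["No"] * 3          # examples with Wind = Strong

gain_wind = entropy(S) - (len(weak) / len(S)) * entropy(weak) \
                       - (len(strong) / len(S)) * entropy(strong)
print(round(gain_wind, 3))                 # ~0.048
```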
A Typical Learned Decision Tree derived from the ID3 algorithm
Hypothesis Space Search in Decision Tree Learning
 ID3 can be characterized as searching a hypothesis space for one
that fits the training examples.
 The hypothesis space searched by ID3 is the set of possible
decision trees.
 ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space.
 Begins with the empty tree, then considers progressively more
elaborate hypotheses in search of a decision tree that correctly
classifies the training data.
 The information gain is the heuristic measure that guides the
hill-climbing search.

Inductive Bias

• Need to make assumptions

  – Experience alone doesn't allow us to draw conclusions about unseen data instances

• Two types of bias:

  – Restriction (language bias): limit the hypothesis space (e.g., consider only rules)
  – Preference (search bias): impose an ordering on the hypothesis space (e.g., prefer more general hypotheses consistent with the data)
Inductive Bias in ID3
• ID3 search strategy
– selects in favor of shorter trees over longer ones,
– selects trees that place the attributes with highest
information gain closest to the root.
– because ID3 uses the information gain heuristic and a hill
climbing strategy, it does not always find the shortest
consistent tree, and it is biased to favor trees that place
attributes with high information gain closest to the root.
Inductive Bias of ID3:
– Shorter trees are preferred over longer trees.
– Trees that place high information gain attributes close to
the root are preferred over those that do not.
Inductive Bias in ID3
Occam's Razor
• A classical example of inductive bias.
• The simplest hypothesis consistent with the data is assumed to be the best description of the target function.
• Select the solution with the fewest assumptions.

3
7
Hyperparameters

 A hyperparameter is a parameter that is set before the learning process begins. These parameters are tunable and can directly affect how well a model trains.
 Examples of Hyperparameters
 Learning Rate
 Number of Epochs, Momentum
 Regularization constant
 Number of branches in a decision tree
 Number of clusters in a clustering algorithm (like k-means)
 Optimizing Hyperparameters
 Grid Search
 Random Search
 Bayesian Optimization
 Gradient-based Optimization
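For decision trees specifically, commonly tuned hyperparameters include the maximum depth, the splitting criterion, and the minimum number of samples required to split a node. The sketch below shows a grid search over such parameters with scikit-learn; the dataset and the particular grid values are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try (illustrative choices).
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
}

# 5-fold cross-validated grid search over all combinations.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # best hyperparameter combination found
print(search.best_score_)    # its mean cross-validation accuracy
```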
Issues in Decision Tree learning

 Overfitting the data
 Incorporating continuous-valued attributes
 Handling training examples with missing attribute values
 Handling attributes with different costs
 Alternative measures for selecting attributes
Overfitting
Given a hypothesis space H, a hypothesis h ∈ H is said to OVERFIT the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.

Reasons for overfitting:
– Errors and noise in the training examples
– Coincidental regularities (especially when only a small number of examples are associated with leaf nodes)
Overfitting
 Stop growing the tree early, before it reaches the point where it perfectly classifies the training data.
 Allow the tree to overfit the data and then post-prune it.
 Split the available data into two parts (training and validation) and use the validation set to assess the utility of post-pruning.
 Reduced-error pruning
 Rule post-pruning
Overfitting

• As ID3 adds new nodes to grow the decision tree, the accuracy of the tree
measured over the training examples increases monotonically.
• However, when measured over a set of test examples independent of
the training examples, accuracy first increases, then decreases.
Avoid Overfitting - Reduced-Error Pruning

• Split the data into a training set and a validation set.

• Do until further pruning is harmful:

  – Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  – Greedily remove the node whose removal most improves validation set accuracy

• Pruning of nodes continues until further pruning is harmful (i.e., decreases the accuracy of the tree over the validation set).
• Using a separate set of data to guide pruning is an effective approach provided a large amount of data is available.
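A compact sketch of this procedure, using the nested-dict tree representation from the earlier example, is given below. It is an illustrative simplification (for instance, it assumes every attribute value seen in the validation data has a branch in the tree); none of the helper names come from the slides.

```python
import copy

def classify(tree, example):
    """Classify with the nested-dict tree representation used earlier."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attribute"]]]
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def majority_label(examples, target):
    labels = [ex[target] for ex in examples]
    return max(set(labels), key=labels.count)

def internal_node_paths(tree, path=()):
    """Yield the (attribute, value) path leading to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, subtree in tree["branches"].items():
            yield from internal_node_paths(subtree, path + ((tree["attribute"], value),))

def prune_at(tree, path, label):
    """Return a copy of the tree with the node at `path` replaced by a leaf `label`."""
    if not path:
        return label                       # pruning the root collapses the whole tree
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for _attribute, value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1][1]] = label
    return new_tree

def reduced_error_pruning(tree, train, validation, target):
    """Greedily replace the node whose removal most improves validation accuracy."""
    best_acc = accuracy(tree, validation, target)
    while isinstance(tree, dict):
        candidates = []
        for path in internal_node_paths(tree):
            # Label the candidate leaf with the majority class of the training
            # examples that reach this node (fall back to all of them if none do).
            reaching = [ex for ex in train if all(ex[a] == v for a, v in path)] or train
            pruned = prune_at(tree, path, majority_label(reaching, target))
            candidates.append((accuracy(pruned, validation, target), pruned))
        acc, pruned = max(candidates, key=lambda c: c[0])
        if acc < best_acc:                 # further pruning is harmful: stop
            break
        tree, best_acc = pruned, acc
    return tree
```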
Effect of Reduced-Error Pruning

• The accuracy over the test set increases as nodes are pruned from the tree.
• The validation set used for pruning is distinct from both the training and test sets.
Rule Post-Pruning
• Rule post-pruning is another successful method for finding high-accuracy hypotheses.
• It is used by the C4.5 learning algorithm (an extension of ID3).

• Steps of rule post-pruning:

  – Infer the decision tree from the training set.
  – Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
    – Each path corresponds to a rule.
    – Each node along a path corresponds to a precondition.
    – Each leaf classification is the postcondition.
  – Prune each rule by removing preconditions, as described on the following slides.


Converting a Decision Tree to Rules

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
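These rules can be generated mechanically by walking every root-to-leaf path. The sketch below does this for the nested-dict play_tennis_tree from the earlier representation example; the rule-string formatting is an illustrative choice of this sketch.

```python
def tree_to_rules(tree, preconditions=()):
    """Yield one (preconditions, classification) rule per root-to-leaf path."""
    if not isinstance(tree, dict):                      # leaf: emit the finished rule
        yield list(preconditions), tree
        return
    for value, subtree in tree["branches"].items():     # extend the path per branch
        condition = f"({tree['attribute']}={value})"
        yield from tree_to_rules(subtree, preconditions + (condition,))

# Using the play_tennis_tree dictionary from the representation sketch
# prints rules equivalent to R1-R5 above.
for i, (conds, label) in enumerate(tree_to_rules(play_tennis_tree), start=1):
    body = " ∧ ".join(conds) if conds else "True"
    print(f"R{i}: If {body} Then PlayTennis={label}")
```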

Pruning Rules
• Each rule is pruned by removing any antecedent (precondition).

  – Ex. Prune R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
    by removing either (Outlook=Sunny) or (Humidity=High):
    – If (Humidity=High) Then PlayTennis=No  (first antecedent removed)
    – If (Outlook=Sunny) Then PlayTennis=No  (second antecedent removed)

• Select whichever of the pruning steps produces the greatest improvement in estimated rule accuracy.
  – Then continue with the other preconditions.
  – No pruning step is performed if it reduces the estimated rule accuracy.

• In order to estimate rule accuracy:
  – use a validation set of examples disjoint from the training set.
Why Convert The Decision Tree To Rules Before Pruning?

• Converting to rules improves readability.
  – Rules are often easier for people to understand.

• It allows distinguishing among the different contexts in which a node is used.
  – A separate pruning decision can be made for each path.

• It removes the distinction between attribute tests near the root and those near the leaves.
  – No bookkeeping is needed on how to reorganize the tree if the root node is pruned.
Continuous-Valued Attributes

• ID3 is restricted to attributes that take on a discrete set of values.
• Solution: define new discrete-valued attributes that partition the continuous attribute's values into a discrete set of intervals.
• For a continuous-valued attribute A, create a new boolean attribute A_c that is true if A < c and false otherwise.
  – Select the threshold c using information gain:
  – Sort the examples according to the continuous attribute A,
  – then identify adjacent examples that differ in their target classification,
  – and generate candidate thresholds midway between the corresponding values of A.
  – The value of c that maximizes information gain must always lie at such a boundary.
  – These candidate thresholds can then be evaluated by computing the information gain associated with each.
Continuous-Valued Attributes - Example

Temperature: 40 48 60 72 80 90
PlayTennis :No No Yes Yes Yes No

Two candidate thresholds: (48+60)/2=54 (80+90)/2=85

Check the information gain for the new boolean attributes:
Temperature>54 and Temperature>85

Use these new boolean attributes in the same way as other discrete-valued attributes.
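The sketch below reproduces this threshold selection for the six Temperature values above: it finds the class-change boundaries, forms the midpoint candidates (54 and 85), and picks the one with the higher information gain. It reuses the entropy(labels) helper defined after the Entropy slides; the helper names are choices of this sketch.

```python
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]   # PlayTennis

def gain_for_threshold(c):
    """Information gain of the boolean attribute Temperature < c."""
    below = [l for t, l in zip(temps, labels) if t < c]
    above = [l for t, l in zip(temps, labels) if t >= c]
    return entropy(labels) - (len(below) / len(labels)) * entropy(below) \
                           - (len(above) / len(labels)) * entropy(above)

# Candidate thresholds: midpoints between adjacent examples whose labels differ.
candidates = [(temps[i] + temps[i + 1]) / 2
              for i in range(len(temps) - 1) if labels[i] != labels[i + 1]]
print(candidates)                                # [54.0, 85.0]
print(max(candidates, key=gain_for_threshold))   # 54.0 has the larger information gain
```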
Alternative Selection Measures

• The information gain measure favors attributes with many values
  – it separates the data into many small subsets
  – high gain, but poor prediction

• Ex. A Date attribute has many values and may separate the training examples into very small subsets (even singleton sets, i.e., perfect partitions)
  – Information gain will be very high for the Date attribute.
  – A perfect partition gives maximum gain: Gain(S, Date) = Entropy(S) − 0 = Entropy(S), because log₂ 1 = 0.
  – It has high information gain, but is a very poor predictor for unseen data.

• There are alternative selection measures, such as the GainRatio measure based on SplitInformation.

Split Information
• SplitInformation(S, A) = −Σ from i = 1 to c of (|S_i| / |S|) log₂(|S_i| / |S|), where S_1 … S_c are the subsets of S produced by partitioning S on the c values of attribute A.
• GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A).
• SplitInformation penalizes attributes (such as Date) that split the data into many small subsets.
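A small numeric illustration of these two formulas is sketched below, assuming the 14-example PlayTennis counts used earlier (that dataset is an assumption, since the slides do not reproduce the table). Splitting on Wind (8 Weak, 6 Strong examples) gives a modest SplitInformation, whereas a Date-like attribute with 14 distinct values is penalized with SplitInformation log₂(14).

```python
import math

def split_information(subset_sizes):
    """-sum over subsets of (|Si|/|S|) * log2(|Si|/|S|)."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

# Wind splits the 14 examples into subsets of size 8 (Weak) and 6 (Strong);
# its information gain, computed earlier, is roughly 0.048.
print(split_information([8, 6]))          # ~0.985
print(gain_ratio(0.048, [8, 6]))          # ~0.049

# A Date-like attribute puts each example into its own singleton subset:
print(split_information([1] * 14))        # log2(14) ~ 3.807, a heavy penalty
print(gain_ratio(0.940, [1] * 14))        # the maximum gain 0.940 shrinks to ~0.247
```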
Strengths and Weaknesses of Decision Trees
Strengths
 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much
computation.
 Decision trees are able to handle both continuous and categorical
variables.
 Decision trees provide a clear indication of which fields are most
important for prediction or classification.

Weaknesses:
 Decision trees are less appropriate for estimation tasks where the
goal is to predict the value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
 Decision trees can be computationally expensive to train; the process of growing a decision tree is itself computationally expensive.
Train-Test Split Example

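The original slide's worked example is not reproduced here; the sketch below shows a typical train/test split for a decision tree classifier with scikit-learn. The Iris dataset, the 70/30 split ratio, and the tree settings are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: split the data into a training set and a held-out test set (70% / 30%).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 2: learn a decision tree model from the training data only.
model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Step 3: use the model to predict the unseen test examples and measure accuracy.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```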