Decision Trees: Classifier
[Figure: a classifier takes an input X1 = x1, ..., XM = xM and outputs a class prediction Y = y; the classifier is learned from training data.]
Three variables: Hair (B or D), Height (T or S), and the output class (P or G).
Training data: (B,T,P), (B,T,P), (B,S,G), (D,S,G), (D,T,G), (B,S,G)
[Figure: the decision tree built from this data. Root node: P:2 G:4. Hair = D? → P:0 G:2 (pure: G is the output for this node). Hair = B? → P:2 G:2, which is split again: Height = T? → P:2 G:0; Height = S? → P:0 G:2.]
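As an illustration, the class counts in this figure can be reproduced with a few lines of Python (the tuple representation of the data is our own):

```python
from collections import Counter

# (Hair, Height, class) triples from the training data above
data = [("B", "T", "P"), ("B", "T", "P"), ("B", "S", "G"),
        ("D", "S", "G"), ("D", "T", "G"), ("B", "S", "G")]

# class counts after splitting on Hair
print(Counter(c for h, _, c in data if h == "D"))   # Counter({'G': 2})          -> pure node
print(Counter(c for h, _, c in data if h == "B"))   # Counter({'P': 2, 'G': 2})  -> mixed node

# splitting the mixed node further on Height
print(Counter(c for h, ht, c in data if h == "B" and ht == "T"))  # Counter({'P': 2})
print(Counter(c for h, ht, c in data if h == "B" and ht == "S"))  # Counter({'G': 2})
```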
[Figure: the training data is a table of R examples (Data 1, ..., Data R), each with attribute values x1, ..., xM and a class label Y. A new input datapoint x1, ..., xM with unknown class is passed down the tree: internal nodes test discrete attributes ("X1 = nth possible value for X1?", "Xj = ith possible value for Xj?") and leaves output a class (Y = y1, ..., Y = yc).]
[Figure: a two-class dataset in the (X1, X2) plane with decision boundaries at X1 = 0.5 and X2 = 0.5, and the corresponding tree: the root tests "X1 < 0.5?", a child tests "X2 < 0.5?", and each node is annotated with its class counts.]
[Figure: the same structure for continuous attributes. The training data is again R examples with values x1, ..., xM and class Y; internal nodes now test thresholds ("Xj < tj?") and leaves output a class (Y = y1, ..., Y = yc).]
Basic Questions
How to choose the attribute/value to split on at each level of the tree?
When to stop splitting? When should a node be declared a leaf?
If a leaf node is impure, how should the class label be assigned?
If the tree is too large, how can it be pruned?
[Figure: example splits labeled Good and Bad. A good split yields a node that is pure (only one class left, so no ambiguity in the class label) or almost pure (little ambiguity in the class label); a bad split leaves the classes mixed.]
Suppose that we are dealing with data which can come from four possible values (A, B, C, D). Each class may appear with some probability.
Suppose P(A) = P(B) = P(C) = P(D) = 1/4.
What is the average number of bits necessary to encode each class?
With the code A = 00, B = 01, C = 10, D = 11:
average = 2×P(A) + 2×P(B) + 2×P(C) + 2×P(D) = 2 bits
The distribution is not very informative (impure).
[Figure: histogram of frequency of occurrence per class, uniform over A, B, C, D.]
Information Content
Suppose now P(A) = 1/2, P(B) = 1/4, P(C) = 1/8, P(D) = 1/8.
What is the average number of bits necessary to encode each class?
With the code A = 0, B = 10, C = 110, D = 111, the classes can be encoded using 1.75 bits on average:
average = 1×P(A) + 2×P(B) + 3×P(C) + 3×P(D) = 1.75
The distribution is more informative (higher purity).
[Figure: histogram of frequency of occurrence per class, skewed toward A.]
Entropy
In general, the average number of bits necessary to encode n values is the entropy:
H = -Σ_{i=1..n} Pi log2 Pi
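A minimal sketch of this formula in Python (assuming NumPy is available; the helper name `entropy` is our own), which also reproduces the 2-bit and 1.75-bit examples above:

```python
import numpy as np

def entropy(probs):
    """Entropy H = -sum_i p_i * log2(p_i), treating 0 * log2(0) as 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))     # 2.0 bits  (uniform distribution)
print(entropy([0.5, 0.25, 0.125, 0.125]))    # 1.75 bits (skewed distribution)
```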
Entropy
The entropy captures the degree of purity of the distribution.
[Figure: two histograms of frequency of occurrence per class: a flat distribution has high entropy; a peaked distribution has low entropy.]
Node (1): NA = 1, NB = 6, where NA and NB are the frequencies of occurrence of classes A and B in node (1).
pA = NA/(NA+NB) = 1/7, pB = NB/(NA+NB) = 6/7
Entropy of node (1): H1 = -pA log2 pA - pB log2 pB = 0.59

Node (2): NA = 3, NB = 2
pA = NA/(NA+NB) = 3/5, pB = NB/(NA+NB) = 2/5
Entropy of node (2): H2 = -pA log2 pA - pB log2 pB = 0.97
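A quick check of these numbers, reusing the `entropy` sketch from above:

```python
print(round(entropy([1/7, 6/7]), 2))   # 0.59 -> node (1) is fairly pure
print(round(entropy([3/5, 2/5]), 2))   # 0.97 -> node (2) is close to maximally impure
```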
Conditional Entropy
Entropy before splitting: H
After splitting, a fraction PL of the data goes to the left child, which has entropy HL, and a fraction PR goes to the right child, which has entropy HR.
Entropy after splitting (conditional entropy): HL × PL + HR × PR
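A one-line helper matching this definition (a sketch; the function name is our own):

```python
def conditional_entropy(p_left, h_left, p_right, h_right):
    """Entropy after a binary split: the weighted average of the child entropies."""
    return p_left * h_left + p_right * h_right
```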
Information Gain
[Figure: a split sending a fraction PL of the datapoints to a left child with entropy HL and a fraction PR to a right child with entropy HR.]
Notations
Entropy: H(Y) = entropy of the distribution of classes at a node
Conditional entropy:
Discrete: H(Y|Xj) = entropy after splitting with respect to variable j
Continuous: H(Y|Xj,t) = entropy after splitting with respect to variable j with threshold t
Information gain:
Discrete: IG(Y|Xj) = H(Y) - H(Y|Xj) = reduction in entropy after splitting with respect to variable j
Continuous: IG(Y|Xj,t) = H(Y) - H(Y|Xj,t) = reduction in entropy after splitting with respect to variable j with threshold t
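A minimal sketch of IG(Y|Xj,t) for a continuous split, building on the `entropy` and `conditional_entropy` helpers above (the column index j, threshold t, and the empirical class-count logic are our own illustrative choices):

```python
import numpy as np

def class_probs(y):
    """Empirical class probabilities at a node."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def information_gain(X, y, j, t):
    """IG(Y|Xj,t) = H(Y) - [PL * H(YL) + PR * H(YR)] for the split Xj < t."""
    left = X[:, j] < t
    if left.all() or not left.any():
        return 0.0                                # degenerate split: nothing changes
    p_left = left.mean()
    h_after = conditional_entropy(p_left, entropy(class_probs(y[left])),
                                  1 - p_left, entropy(class_probs(y[~left])))
    return entropy(class_probs(y)) - h_after
```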
Information Gain
[Figure: two candidate splits of the same node, which contains 11 datapoints and has entropy H = 0.99. Each child is annotated with its fraction of the data (PL or PR) and its entropy (HL or HR).]
First split (4 datapoints left, 7 right): IG = H - (HL × 4/11 + HR × 7/11); here HL = 0 because the left child is pure.
Second split (5 datapoints left, 6 right): IG = H - (HL × 5/11 + HR × 6/11)
[Figure: the same two candidate splits with the resulting numbers. First split: H = 0.99, HL = 0, IG = 0.62. Second split: H = 0.99, IG = 0.052. The first split yields the larger information gain.]
[Figure: IG as a function of the X1 split value.]
Best split value (max Information Gain) for the X1 attribute: 0.24, with IG = 0.138
[Figure: IG as a function of the X2 split value, and the data in the (X1, X2) plane.]
Best split value (max Information Gain) for the X2 attribute: 0.234, with IG = 0.202
[Figure: IG as a function of the X1 split value for the current node.]
Best split value (max Information Gain) for the X1 attribute: 0.22, with IG ≈ 0.182
[Figure: IG as a function of the X2 split value, and the data in the (X1, X2) plane.]
Best split value (max Information Gain) for the X2 attribute: 0.75, with IG ≈ 0.353
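The curves above come from sweeping the candidate threshold and evaluating IG at each value. A sketch of that sweep for a single attribute, reusing `information_gain` (and NumPy) from above; the helper name `best_threshold` and the choice of candidate thresholds are our own:

```python
def best_threshold(X, y, j):
    """Sweep candidate thresholds for attribute j and keep the one with maximum IG."""
    candidates = np.unique(X[:, j])               # thresholds taken from the observed values
    gains = [information_gain(X, y, j, t) for t in candidates]
    k = int(np.argmax(gains))
    return candidates[k], gains[k]
```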
Best X1 split: 0.22, IG = 0.182; best X2 split: 0.75, IG = 0.353.
There is no point in splitting this node further since it contains only data from a single class: return it as a leaf node with output A.
[Figure: the (X1, X2) plane with the regions produced by the splits; the pure regions are labeled A.]
Basic Questions
How to choose the attribute/value to split on at each level of the tree?
When to stop splitting? When should a node be declared a leaf?
If a leaf node is impure, how should the class label be assigned?
If the tree is too large, how can it be pruned?
Try all the possible attributes Xj and thresholds t and choose the pair (j*, t*) for which IG(Y|Xj,t) is maximum.
XL, YL = the set of datapoints for which xj* < t*, and their corresponding classes
XH, YH = the set of datapoints for which xj* >= t*, and their corresponding classes
Left Child = LearnTree(XL, YL)
Right Child = LearnTree(XH, YH)
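A compact recursive sketch of this procedure in Python, reusing `best_threshold` and the other helpers above. The node representation, the function name `learn_tree`, and the stopping rule (stop when the node is pure or no split reduces the entropy) are our own simplifications:

```python
import numpy as np

def learn_tree(X, y):
    """Recursively grow the tree; a node becomes a leaf when it is pure or no split helps."""
    if len(np.unique(y)) == 1:                        # single class left: leaf node
        return {"leaf": True, "output": y[0]}
    # try every attribute and keep (j*, t*) with maximum information gain
    splits = [(j, *best_threshold(X, y, j)) for j in range(X.shape[1])]
    j_star, t_star, ig = max(splits, key=lambda s: s[2])
    if ig <= 0:                                       # no split reduces the entropy: leaf
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "output": vals[np.argmax(counts)]}
    lo = X[:, j_star] < t_star                        # XL, YL / XH, YH
    return {"leaf": False, "attribute": j_star, "threshold": t_star,
            "left": learn_tree(X[lo], y[lo]),
            "right": learn_tree(X[~lo], y[~lo])}
```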