Decision Trees
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
[Figure: 2-D histogram of Gender against Marital Status, with bars broken out by age group (20s–50s) and wealth (Poor/Rich) for Male and Female]
• Easier to appreciate graphically
• Easier to see “interesting” things if we stretch out the histogram bars
40 Records

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe
Look at all the information gains…
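As a concrete sketch of computing those gains, here is a minimal Python version (my own code, not from the slides; it assumes an attribute column and the output column are given as parallel lists, and the names `entropy` and `info_gain` are mine):

import math
from collections import Counter

def entropy(ys):
    """H(Y) for a list of categorical labels."""
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def info_gain(xs, ys):
    """IG(Y|X) = H(Y) - sum over v of P(X=v) * H(Y | X=v), X categorical."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return entropy(ys) - sum(len(g) / n * entropy(g) for g in groups.values())

# e.g. how informative is maker about mpg, on four made-up records?
print(info_gain(['america', 'asia', 'asia', 'europe'],
                ['bad',     'good', 'good', 'bad']))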
Take the Original Dataset… and partition it according to the value of the attribute we split on:
• Records in which cylinders = 4 → build tree from these records…
• Records in which cylinders = 5 → build tree from these records…
• Records in which cylinders = 6 → build tree from these records…
• Records in which cylinders = 8 → build tree from these records…
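In code, that partitioning step might look like the following sketch (records as Python dicts is my assumption, and `partition` is a hypothetical helper):

from collections import defaultdict

def partition(rows, attr):
    """Group records by their value of attr, one group per child node, e.g.
    partition(records, 'cylinders') -> {4: [...], 5: [...], 6: [...], 8: [...]}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row)
    return groups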
• Don’t split a node if all matching records have the same output value
• Don’t split a node if none of the attributes can create multiple non-empty children
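A minimal check for these two stopping rules, under the same record-dict assumption (the function name is mine):

def should_stop(rows, attrs, target='y'):
    """True if a node must become a leaf: either every matching record has the
    same output value, or no attribute can create multiple non-empty children."""
    if len({r[target] for r in rows}) <= 1:                    # rule 1
        return True
    return all(len({r[a] for r in rows}) <= 1 for a in attrs)  # rule 2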
a b c d e y
0 0 0 0 0 0
0 0 0 0 1 0
0 0 0 1 0 0
0 0 0 1 1 1
0 0 1 0 0 1
: : : : : :
1 1 1 1 1 1
(32 records)

[Split on e: one child for e = 0, one for e = 1]
These nodes will be unexpandable
(e = 0 child) In about 12 of the 16 records in this node the output will be 0.
(e = 1 child) In about 12 of the 16 records in this node the output will be 1.
[Pruning parameter axis: Decreasing MaxPchance ← → Increasing MaxPchance]
How?
Sort records according to increasing values of X. Then create a 2 × nY contingency table corresponding to computation of IG(Y|X : xmin). Then iterate through the records, testing for each threshold between adjacent values of X, incrementally updating the contingency table as you go. For a minor additional speedup, only test between values of Y that differ.
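A sketch of that scan in Python (my own code, not from the slides): sort once, then move one record at a time from the "above threshold" side to the "below threshold" side, so each candidate threshold costs an incremental update rather than a full recount.

import math
from collections import Counter

def entropy_of(counts, total):
    """Entropy from a table of class counts."""
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def best_threshold(xs, ys):
    """Return (best IG, best threshold t) over all thresholds between
    adjacent sorted values of X, updating class counts incrementally."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    hi, lo = Counter(ys), Counter()   # start with every record on the X > t side
    h_y = entropy_of(hi, n)           # H(Y); IG = H(Y) - H(Y | X : t)
    best_ig, best_t = 0.0, None
    for k in range(n - 1):
        i, j = order[k], order[k + 1]
        lo[ys[i]] += 1                # record i crosses to the X <= t side
        hi[ys[i]] -= 1
        if xs[i] == xs[j]:
            continue                  # no threshold fits between equal X values
        if ys[i] == ys[j]:
            continue                  # the speedup: only test where Y values differ
        n_lo = k + 1
        h_cond = (n_lo / n) * entropy_of(lo, n_lo) \
               + ((n - n_lo) / n) * entropy_of(hi, n - n_lo)
        if h_y - h_cond > best_ig:
            best_ig, best_t = h_y - h_cond, (xs[i] + xs[j]) / 2
    return best_ig, best_t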
If all records in X have identical values in all their attributes (this includes the case where the number of records R < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in Y are the same, return a Leaf Node predicting this value as the output.
Else
    For j = 1 .. M (the number of attributes)
        If the j’th attribute is categorical
            IGj = IG(Y|Xj)
        Else (the j’th attribute is real-valued)
            IGj = IG*(Y|Xj) from about four slides back
    Let j* = argmaxj IGj (this is the splitting attribute we’ll use)
    If the j*’th attribute is categorical then
        For each value v of the j*’th attribute
            Let Xv = subset of rows of X in which Xij* = v. Let Yv = corresponding subset of Y
            Let Childv = LearnUnprunedTree(Xv, Yv)
        Return a decision tree node splitting on the j*’th attribute. The number of children equals the number of values of the j*’th attribute, and the v’th child is Childv
    Else (the j*’th attribute is real-valued): let t be the best split threshold
        Let XLO = subset of rows of X in which Xij* <= t. Let YLO = corresponding subset of Y
        Let ChildLO = LearnUnprunedTree(XLO, YLO)
        Let XHI = subset of rows of X in which Xij* > t. Let YHI = corresponding subset of Y
        Let ChildHI = LearnUnprunedTree(XHI, YHI)
        Return a decision tree node splitting on the j*’th attribute, with two children corresponding to whether the j*’th attribute is above or below the given threshold
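The pseudocode translates fairly directly into runnable Python. The sketch below is mine, not from the slides (list-of-rows matrix, nested-dict tree, and a deliberately simple quadratic threshold search standing in for the fast IG* scan); it also folds in the "pedantic detail" noted below by returning a leaf whenever no attribute has positive gain.

import math, random
from collections import Counter

def _entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def _ig(xs, ys):
    """IG(Y|X) for a categorical column."""
    n, groups = len(ys), {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return _entropy(ys) - sum(len(g) / n * _entropy(g) for g in groups.values())

def _ig_real(xs, ys):
    """IG*(Y|X): (gain, threshold) for a real column; simple O(n^2) version."""
    best_g, best_t = 0.0, None
    for t in sorted(set(xs))[:-1]:
        g = _ig([x <= t for x in xs], ys)
        if g > best_g:
            best_g, best_t = g, t
    return best_g, best_t

def _majority(ys):
    """Majority output, breaking ties randomly."""
    top = max(Counter(ys).values())
    return random.choice([v for v, c in Counter(ys).items() if c == top])

def learn_unpruned_tree(X, Y, is_cat):
    """X: R rows of M attribute values; Y: R outputs; is_cat[j]: True if
    attribute j is categorical. Returns a nested-dict decision tree."""
    if len(set(Y)) == 1:                 # all outputs identical -> leaf
        return {'leaf': Y[0]}
    if len(X) < 2 or all(len(set(c)) == 1 for c in zip(*X)):
        return {'leaf': _majority(Y)}    # all rows identical -> majority leaf
    cols = [[row[j] for row in X] for j in range(len(X[0]))]
    gains = [_ig(c, Y) if is_cat[j] else _ig_real(c, Y)[0]
             for j, c in enumerate(cols)]
    j = max(range(len(gains)), key=gains.__getitem__)   # j* = argmax IGj
    if gains[j] <= 0:                    # pedantic detail: every split is useless
        return {'leaf': _majority(Y)}
    if is_cat[j]:                        # one child per value of the j*'th attribute
        kids = {}
        for v in set(cols[j]):
            idx = [i for i, x in enumerate(cols[j]) if x == v]
            kids[v] = learn_unpruned_tree([X[i] for i in idx],
                                          [Y[i] for i in idx], is_cat)
        return {'split': j, 'children': kids}
    t = _ig_real(cols[j], Y)[1]          # best threshold t for a real-valued j*
    lo = [i for i, x in enumerate(cols[j]) if x <= t]
    hi = [i for i, x in enumerate(cols[j]) if x > t]
    return {'split': j, 'threshold': t,
            'lo': learn_unpruned_tree([X[i] for i in lo], [Y[i] for i in lo], is_cat),
            'hi': learn_unpruned_tree([X[i] for i in hi], [Y[i] for i in hi], is_cat)}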
Pedantic detail: a third termination condition should occur if the best split attribute puts all its records in exactly one child (note that this means it and all other attributes have IG = 0).
[Binary categorical split: one branch where the attribute equals the value, one where it doesn’t]
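These branch labels describe an alternative to one-child-per-category: a binary split on a single categorical value. A short sketch under the same record-dict assumption (names are mine):

def binary_split(rows, attr, v):
    """Two children: records where attr equals v, and records where it doesn't."""
    return ([r for r in rows if r[attr] == v],
            [r for r in rows if r[attr] != v])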