Decision Trees
CS 165B
Spring 2012
Course outline
• Introduction (Ch. 1)
• Concept learning (Ch. 2)
• Decision trees (Ch. 3)
• Ensemble learning
• Neural Networks (Ch. 4)
• …
Schedule
Projects
• Project proposals are due by Friday 4/20.
• 2-person teams
• If you want to define your own project:
– Submit a 1-page proposal with references and ideas
– It needs to have a significant Machine Learning component
– You may do experimental work, theoretical work, a combination of both, or a critical survey of results in some specialized topic
• Originality is not mandatory but is encouraged.
• Try to make it interesting!
Decision tree learning
Decision tree representation
– Most popular method for representing discrete-valued target functions
– A decision tree represents a disjunction of conjunctions of attribute values
– A more general hypothesis representation than in concept learning
Training Examples
• Can be represented by logical formulas
[Figure: decision tree whose root splits on sunny / overcast / rain; the sunny branch tests Humidity (No / Yes), overcast is a Yes leaf, and the rain branch tests Wind (No / Yes)]
Representation in decision trees
Applications of Decision Trees
Top-Down Construction
Main loop:
1. Choose the “best” decision attribute (A) for next node
2. Assign A as decision attribute for node
3. For each value of A, create new descendant of node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified, STOP,
Else iterate over new leaf nodes
Grow tree just deep enough for perfect classification
– If possible (or can approximate at chosen depth)
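A minimal recursive sketch of this loop (Python; the dict-based example/tree encoding and the best_attribute scorer are assumptions of mine — the scorer would be the information gain defined a few slides later):

```python
# Sketch of the main loop above (assumed representation: each example is a dict
# mapping attribute -> value plus a 'label' key; best_attribute() is a scorer
# such as information gain, defined on a later slide).
def build_tree(examples, attributes):
    labels = [e['label'] for e in examples]
    if len(set(labels)) == 1:                     # training examples perfectly classified: STOP
        return labels[0]
    if not attributes:                            # nothing left to split on: majority vote
        return max(set(labels), key=labels.count)
    A = best_attribute(examples, attributes)      # step 1: choose the "best" attribute
    node = {'attribute': A, 'children': {}}       # step 2: A becomes the decision attribute
    for v in set(e[A] for e in examples):         # step 3: one descendant per value of A
        subset = [e for e in examples if e[A] == v]      # step 4: sort examples to branches
        rest = [a for a in attributes if a != A]
        node['children'][v] = build_tree(subset, rest)   # step 5: iterate on new leaves
    return node
```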
Which attribute is best?
Choosing Best Attribute?
• Consider 64 examples, 29+ and 35-
• Which one is better?
[Figure: two pairs of candidate splits of the (29+, 35−) examples on attributes A1 and A2, each with branches t and f; their class counts and entropies are worked out on the next slides]
Entropy
• A measure for
– uncertainty
– impurity
– information content
• Information theory: optimal length code assigns (- log2p) bits to
message having probability p
• S is a sample of training examples
– p+ is the proportion of positive examples in S
– p- is the proportion of negative examples in S
• Entropy of S: the expected number of bits needed, under the optimal code, to encode the class of a randomly drawn member of S
Entropy(S) = p+(−log2 p+) + p−(−log2 p−) = −p+ log2 p+ − p− log2 p−
• Can be generalized to more than two values
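As a quick illustration, the two-class entropy can be computed directly from class counts (a Python sketch; the function name and counts-based interface are my own):

```python
import math

def entropy(pos, neg):
    """Entropy of a sample containing `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(29, 35))   # ~0.993 for the (29+, 35-) sample on the next slides
print(entropy(9, 5))     # ~0.940 for the 14 PlayTennis examples {D1,...,D14}
```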
Entropy
Choosing Best Attribute?
• Consider 64 examples (29+, 35−), with E(S) = 0.993, and compute the entropies of each split:
• Which one is better?
  A1: t → 25+, 5−  (E = 0.650)   f → 4+, 30−  (E = 0.522)
  A2: t → 15+, 19− (E = 0.989)   f → 14+, 16− (E = 0.997)
• Which is better?
  A1: t → 21+, 5−  (E = 0.708)   f → 8+, 30−  (E = 0.742)
  A2: t → 18+, 33− (E = 0.937)   f → 11+, 2−  (E = 0.619)
Information Gain
• Gain(S,A): reduction in entropy after choosing attr. A
$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \dfrac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$
E(S) = 0.993 for the (29+, 35−) sample:
  First pair:  A1: t → 25+, 5−  (E = 0.650),  f → 4+, 30−  (E = 0.522)   Gain: 0.395
               A2: t → 15+, 19− (E = 0.989),  f → 14+, 16− (E = 0.997)   Gain: 0.000
  Second pair: A1: t → 21+, 5−  (E = 0.708),  f → 8+, 30−  (E = 0.742)   Gain: 0.265
               A2: t → 18+, 33− (E = 0.937),  f → 11+, 2−  (E = 0.619)   Gain: 0.121
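A small sketch that reproduces the gains in the second pair (Python, reusing the entropy helper sketched on the Entropy slide; a split is represented simply by the class counts in each branch):

```python
def information_gain(parent_counts, branches):
    """Gain of a split: parent entropy minus the size-weighted entropy of the branches.

    `parent_counts` is a (pos, neg) pair; `branches` is a list of (pos, neg) pairs.
    """
    total = sum(parent_counts)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent_counts) - weighted

print(information_gain((29, 35), [(21, 5), (8, 30)]))   # A1 in the second pair: ~0.265
print(information_gain((29, 35), [(18, 33), (11, 2)]))  # A2 in the second pair: ~0.121
```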
Gain function
• Gain measures how much an attribute can reduce uncertainty
• Its value lies between 0 and 1
• What is the significance of
– a gain of 0?
E.g., a 50/50 split of +/− both before and after discriminating on the attribute’s values
– a gain of 1?
E.g., going from “perfect uncertainty” to perfect certainty after splitting on a perfectly predictive attribute
Training Examples
[Figure: candidate splits of the training examples on Humidity and on Wind]
Sort the Training Examples
[Figure: partially grown tree. The root Outlook splits the 9+, 5− examples {D1,…,D14}; the Overcast branch is already a Yes leaf, while the Sunny and Rain branches are still open]
S_sunny = {D1, D2, D8, D9, D11}
Gain(S_sunny, Humidity) = 0.970
Gain(S_sunny, Temp) = 0.570
Gain(S_sunny, Wind) = 0.019
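To see where the 0.970 comes from (assuming the standard PlayTennis table, in which S_sunny holds 2 positive and 3 negative examples and Humidity separates them perfectly: High → 0+, 3−; Normal → 2+, 0−):

$\mathrm{Entropy}(S_{sunny}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.97$

$\mathrm{Gain}(S_{sunny}, \mathrm{Humidity}) = \mathrm{Entropy}(S_{sunny}) - \tfrac{3}{5}\cdot 0 - \tfrac{2}{5}\cdot 0 \approx 0.97$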
Final Decision Tree for Example
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
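The finished tree can also be written down as a small lookup structure; a sketch in Python (the nested-dict encoding and classify helper are my own, not part of the slides):

```python
# The final PlayTennis tree as nested dicts (hypothetical encoding).
TREE = {'attribute': 'Outlook',
        'children': {'Sunny':    {'attribute': 'Humidity',
                                  'children': {'High': 'No', 'Normal': 'Yes'}},
                     'Overcast': 'Yes',
                     'Rain':     {'attribute': 'Wind',
                                  'children': {'Strong': 'No', 'Weak': 'Yes'}}}}

def classify(tree, example):
    """Follow the tested attributes down the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree['children'][example[tree['attribute']]]
    return tree

print(classify(TREE, {'Outlook': 'Sunny', 'Humidity': 'Normal', 'Wind': 'Weak'}))  # Yes
```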
Hypothesis Space Search by ID3
• Hypothesis space (all possible trees) is complete!
– The target function is surely in there
Hypothesis Space Search in Decision Trees
• Conduct a search of the space of decision trees which
can represent all possible discrete functions.
Restriction bias vs. Preference bias
• Restriction bias (or Language bias)
– Incomplete hypothesis space
• Preference (or search) bias
– Incomplete search strategy
• Candidate Elimination has restriction bias
• ID3 has preference bias
• In most cases, we have both a restriction and a
preference bias.
Inductive Bias in ID3
Overfitting the Data
• Learning a tree that classifies the training data perfectly may
not lead to the tree with the best generalization performance.
- There may be noise in the training data that the tree is fitting
- The algorithm might be making decisions based on very little data
• A hypothesis h is said to overfit the training data if there is another hypothesis h′ such that h has smaller error than h′ on the training data but larger error than h′ on the test data.
[Plot: accuracy on the training data vs. accuracy on the testing data as a function of tree complexity]
Overfitting in Decision Trees
• Consider adding noisy training example (should be +):
Day Outlook Temp Humidity Wind Tennis?
D15 Sunny Hot Normal Strong No
[Figure: the learned decision tree, rooted at Outlook]
Overfitting - Example
[Figure: the overfitted tree adds an extra Wind test (Strong → No, Weak → Yes) to accommodate the noisy example]
Avoiding Overfitting
Reduced-Error Pruning
• A post-pruning, cross-validation approach
- Partition the training data into a “grow” set and a “validation” set.
- Build a complete tree for the “grow” data.
- Until accuracy on the validation set decreases, do:
For each non-leaf node in the tree:
Temporarily prune the subtree below it; replace it by a majority vote.
Test the accuracy of the hypothesis on the validation set.
Permanently prune the node whose removal gives the greatest increase in accuracy on the validation set.
• Problem: Uses less data to construct the tree
• Sometimes done at the rules level
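A compact sketch of this loop (Python, reusing the nested-dict trees and the classify helper sketched earlier; it also assumes each internal node stores a 'majority' label recorded while the tree was grown — that field is my own addition, not something stated on the slide):

```python
def accuracy(tree, validation):
    """Fraction of validation examples the tree classifies correctly."""
    return sum(classify(tree, e) == e['label'] for e in validation) / len(validation)

def internal_nodes(tree, parent=None, branch=None):
    """Yield (node, parent, branch_value) for every non-root internal node."""
    if isinstance(tree, dict):
        if parent is not None:
            yield tree, parent, branch
        for value, child in tree['children'].items():
            yield from internal_nodes(child, tree, value)

def reduced_error_prune(tree, validation):
    """Greedily replace internal nodes by majority-vote leaves while accuracy
    on the validation set does not decrease (the root is never pruned here)."""
    while True:
        base = accuracy(tree, validation)
        best = None
        for node, parent, branch in internal_nodes(tree):
            parent['children'][branch] = node['majority']   # temporarily prune below this node
            delta = accuracy(tree, validation) - base
            parent['children'][branch] = node                # restore the subtree
            if best is None or delta > best[0]:
                best = (delta, parent, branch, node['majority'])
        if best is None or best[0] < 0:
            return tree                                      # accuracy would decrease: stop
        _, parent, branch, leaf = best
        parent['children'][branch] = leaf                    # prune the best node permanently
```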
Rule post-pruning
Example of rule post pruning
• IF (Outlook = Sunny) ^ (Humidity = High)
– THEN PlayTennis = No
• IF (Outlook = Sunny) ^ (Humidity = Normal)
– THEN PlayTennis = Yes
[Figure: the decision tree (rooted at Outlook) from which these rules are read off]
Extensions of basic algorithm
Continuous Valued Attributes
• Create a discrete attribute from continuous variables
– E.g., define critical Temperature = 82.5
• Candidate thresholds
– chosen by gain function
– can have more than one threshold
– typically where values change quickly
Candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85

Temp:    40  48  60  72  80  90
Tennis?   N   N   Y   Y   Y   N
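A sketch of how the candidates might be enumerated (Python; picking among them by information gain would reuse the gain computation from earlier slides):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent (sorted) values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps  = [40, 48, 60, 72, 80, 90]
labels = ['N', 'N', 'Y', 'Y', 'Y', 'N']
print(candidate_thresholds(temps, labels))   # [54.0, 85.0], i.e. Temp > 54 and Temp > 85
```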
Attributes with Many Values
• Problem:
– If attribute has many values, Gain will select it (why?)
– E.g., a birthdate attribute with 365 possible values
Attributes with many values
• Problem: Gain will select attribute with many values
• One approach: use GainRatio instead
$\mathrm{GainRatio}(S, A) = \dfrac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$

$\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \dfrac{|S_i|}{|S|} \log_2 \dfrac{|S_i|}{|S|}$

SplitInformation is the entropy of the partitioning itself; it penalizes a higher number of partitions.
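A sketch in Python (function names are my own; the subset sizes |S_i| are passed in directly):

```python
import math

def split_information(sizes):
    """Entropy of the partition itself, computed from the subset sizes |S_i|."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

# A many-valued attribute (e.g. 365 birthdates, one example each) is penalized heavily:
print(split_information([1] * 365))   # ~8.51 bits
print(split_information([32, 32]))    # 1.0 bit for an even binary split
```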
Attributes with Costs
• Consider
– medical diagnosis: BloodTest has cost $150, Pulse has a cost of $5.
– robotics: Width-From-1ft has a cost of 23 sec., Width-From-2ft a cost of 10 sec.
• How to learn a consistent tree with low expected cost?
• Replace gain by
– Tan and Schlimmer (1990):
$\dfrac{\mathrm{Gain}^2(S, A)}{\mathrm{Cost}(A)}$
– Nunez (1988):
$\dfrac{2^{\mathrm{Gain}(S, A)} - 1}{(\mathrm{Cost}(A) + 1)^w}$
where w ∈ [0, 1] determines the importance of cost
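For illustration only (Python, with hypothetical gain and cost numbers, not taken from the slides):

```python
def tan_schlimmer(gain, cost):
    """Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=1.0):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# The same gain achieved by a $150 test vs. a $5 test: the cheaper one scores higher.
print(nunez(0.5, 150), nunez(0.5, 5))
```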
Gini Index
• Another sensible measure of impurity (i and j are classes):
$\mathrm{Gini}(S) = \sum_{i \neq j} p_i\, p_j = 1 - \sum_i p_i^2$
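A sketch of the computation from class counts (Python; the function name is my own):

```python
def gini(counts):
    """Gini impurity from class counts: 1 - sum_i p_i^2  (= sum over i != j of p_i * p_j)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([29, 35]))   # ~0.496 for the (29+, 35-) sample used earlier
print(gini([30, 0]))    # 0.0 for a pure node
```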
Gini Index for Color
[Figure: the examples split on Color? into red, green, and yellow branches, with the Gini index evaluated for the partition]
Gain of Gini Index
Three Impurity Measures
Decision Trees as Features
• Rather than using decision trees to represent the target function, use
small decision trees as features
Regression Tree
• Similar to classification
• Use a set of attributes to predict the value (instead
of a class label)
• Instead of computing information gain, compute
the sum of squared errors
• Partition the attribute space into a set of
rectangular subspaces, each with its own predictor
– The simplest predictor is a constant value
Rectilinear Division
• A regression tree is a piecewise constant function of the
input attributes
[Figure: a regression tree that tests X1 and X2 against thresholds t1–t4, and the corresponding partition of the (X1, X2) plane into rectangles r1–r5, one constant prediction per rectangle]
Growing Regression Trees
• The best split is the one that reduces the variance the most:
$\Delta I(LS, A) = \mathrm{var}_{y|LS}\{y\} - \sum_{a} \dfrac{|LS_a|}{|LS|}\, \mathrm{var}_{y|LS_a}\{y\}$
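A sketch of the same computation (Python; the list-of-groups interface is my own choice):

```python
def variance(ys):
    """Population variance of a list of outputs."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(ys, groups):
    """Variance of all outputs minus the size-weighted variance of each group of the split."""
    n = len(ys)
    return variance(ys) - sum(len(g) / n * variance(g) for g in groups)

# Hypothetical outputs: a split separating the small from the large y values
ys = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
print(variance_reduction(ys, [ys[:3], ys[3:]]))   # most of the variance is removed
```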
Regression Tree Pruning
• Exactly the same algorithms apply: pre-pruning
and post-pruning.
• In post-pruning, the tree that minimizes the squared error on the validation set is selected.
• In practice, pruning is more important in regression because full trees are much more complex (often every object has a different output value, and hence the full tree has as many leaves as there are objects in the learning sample).
When Are Decision Trees Useful ?
• Advantages
– Very fast: can handle very large datasets with many
attributes
– Flexible: several attribute types, classification and
regression problems, missing values…
– Interpretability: provide rules and attribute importance
• Disadvantages
– Instability of the trees (high variance)
– Not always competitive with other algorithms in terms
of accuracy
History of Decision Tree Research
• Hunt and colleagues in Psychology used full-search decision tree methods to model human concept learning in the 1960s
Summary
• Decision trees are practical for concept learning
• Basic information measure and gain function for best-first search of the space of decision trees
• ID3 procedure
– search space is complete
– Preference for shorter trees
• Overfitting is an important issue with various solutions
• Many variations and extensions possible
References
• Classification and Regression Trees, L. Breiman et al., Wadsworth, 1984.
• C4.5: Programs for Machine Learning, J. R. Quinlan, Morgan Kaufmann, 1993.
• Random Forests, L. Breiman, Machine Learning 45 (1): 5–32, 2001.
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman, New York: Springer Verlag, 2001.
• Constructing Optimal Binary Decision Trees is NP-complete, L. Hyafil and R. L. Rivest, Information Processing Letters 5 (1): 15–17, 1976.
Software
• In R:
– Packages tree and rpart
• C4.5:
– http://www.cse.unsw.edu.au/~quinlan
• Weka
– http://www.cs.waikato.ac.nz/ml/weka