Decision Trees
Slides are assembled from various online sources, with grateful acknowledgement to all who made them publicly available on the web.
Decision Tree Classifier
(Pictured: Ross Quinlan, creator of the ID3 and C4.5 decision tree algorithms.)
[Figure: insects plotted by Abdomen Length (x-axis, 1-10) against Antenna Length (y-axis, 1-10), separated by the decision tree below.]
Abdomen Length > 7.1?
  yes → Katydid
  no  → Antenna Length > 6.0?
          yes → Katydid
          no  → Grasshopper
Example tree
[Figure: decision tree learned from a training set with attributes Refund, Marital Status (MarSt), and Taxable Income (TaxInc), predicting Cheat.]
Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               ≥ 80K → YES
          Married → NO
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Tracing the record down the tree:
1. Root: Refund = No → follow the "No" branch to the MarSt node.
2. MarSt = Married → follow the "Married" branch, reaching the leaf NO.
3. Assign Cheat to "No".
A tiny traversal routine implementing this walk is sketched below.
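To make the walk concrete, here is a minimal Python sketch. The nested-dict encoding of the tree and the classify helper are illustrative choices, not part of the slides; by assumption, an income of exactly 80K is sent to the "≥ 80K" branch.

# Hypothetical nested-dict encoding of the example tree above.
tree = {"attr": "Refund",
        "Yes": "NO",
        "No": {"attr": "MarSt",
               "Married": "NO",
               "Single": {"attr": "TaxInc", "<80K": "NO", ">=80K": "YES"},
               "Divorced": {"attr": "TaxInc", "<80K": "NO", ">=80K": "YES"}}}

def classify(record, node):
    # Descend from the root until a leaf (a plain string) is reached.
    while isinstance(node, dict):
        attr = node["attr"]
        if attr == "TaxInc":                 # continuous test: threshold at 80K
            key = "<80K" if record[attr] < 80_000 else ">=80K"
        else:
            key = record[attr]
        node = node[key]
    return node

print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}, tree))  # -> NO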
How to Build Decision Trees
• Greedy strategy:
– Split the records based on an attribute test that optimizes a chosen criterion.
• Issues:
– Determining how to split the records:
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determining when to stop splitting.
A minimal sketch of the greedy recursion follows below.
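This is an illustrative Python sketch of the greedy strategy, not a specific algorithm from the slides. Records are assumed to be dicts keyed by attribute name, and criterion is any scoring function for an attribute test (for example, the information gain defined later).

from collections import Counter

def build_tree(records, labels, attributes, criterion, depth=0, max_depth=5):
    """Greedy top-down induction: choose the best test, split, recurse."""
    # Stop splitting: pure node, no attributes left, or depth limit reached.
    if len(set(labels)) == 1 or not attributes or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    # Greedy step: the attribute whose test scores best under the criterion.
    attr = max(attributes, key=lambda a: criterion(records, labels, a))
    node = {"attr": attr, "children": {}}
    for value in {r[attr] for r in records}:
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_records, sub_labels = map(list, zip(*subset))
        node["children"][value] = build_tree(
            sub_records, sub_labels,
            [a for a in attributes if a != attr], criterion, depth + 1, max_depth)
    return node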
How to specify the attribute test condition
• Nominal attributes, multi-way split: use as many partitions as there are distinct values.
  CarType → Family | Sports | Luxury
• Nominal attributes, binary split: divide the values into two subsets.
  CarType → {Sports, Luxury} vs. {Family}   OR   CarType → {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal Attributes
• Ordinal values can also be grouped into binary splits.
• What about this split? Size → {Small, Large} vs. {Medium}
  (It violates the ordering Small < Medium < Large, so it should not be used.)
Splitting Based on Continuous Attributes
A continuous attribute is split with a binary test against a threshold (details below).
Which split is better? Compare the class distributions of the resulting nodes:
• C0: 5, C1: 5 — non-homogeneous, high degree of impurity
• C0: 9, C1: 1 — homogeneous, low degree of impurity
Nodes with homogeneous class distributions are preferred, which motivates the impurity measures below.
Measures of Node Impurity
• Entropy
• Gini Index
• Misclassification error
Entropy
• The entropy (impurity) of a set of examples S, relative to a binary classification, is:
  $Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$
  where $p_1$ is the proportion of positive examples in S and $p_0$ the proportion of negative examples.
• Example, for S the PlayTennis data set shown below:
  $E(\text{Outlook}_{rain}) = [3+, 2-]: -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$
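A direct translation of the formula into Python (a sketch; the helper name entropy is ours):

import math

def entropy(pos, neg):
    """Entropy of a binary-labeled set with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(3, 2), 3))   # Outlook = Rain, [3+, 2-] -> 0.971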
Example
S: the PlayTennis data set.

Day  | Outlook  | Temperature | Humidity | Wind   | PlayTennis
D1   | Sunny    | Hot         | High     | Weak   | No
D2   | Sunny    | Hot         | High     | Strong | No
D3   | Overcast | Hot         | High     | Weak   | Yes
D4   | Rain     | Mild        | High     | Weak   | Yes
D5   | Rain     | Cool        | Normal   | Weak   | Yes
D6   | Rain     | Cool        | Normal   | Strong | No
D7   | Overcast | Cool        | Normal   | Strong | Yes
D8   | Sunny    | Mild        | High     | Weak   | No
D9   | Sunny    | Cool        | Normal   | Weak   | Yes
D10  | Rain     | Mild        | Normal   | Weak   | Yes
D11  | Sunny    | Mild        | Normal   | Strong | Yes
D12  | Overcast | Mild        | High     | Strong | Yes
D13  | Overcast | Hot         | Normal   | Weak   | Yes
D14  | Rain     | Mild        | High     | Strong | No

Entropy of each attribute-value subset:
$E(\text{Temperature}_{hot}) = [2+, 2-]: -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1$
$E(\text{Temperature}_{mild}) = [4+, 2-]: -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} = 0.918$
$E(\text{Temperature}_{cool}) = [3+, 1-]: -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.811$
$E(\text{Humidity}_{high}) = [3+, 4-]: -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} = 0.985$
$E(\text{Humidity}_{normal}) = [6+, 1-]: -\tfrac{6}{7}\log_2\tfrac{6}{7} - \tfrac{1}{7}\log_2\tfrac{1}{7} = 0.592$
Example
The information gain of an attribute F over a data set S is the reduction in entropy achieved by splitting on F:

$Gain(S, F) = Entropy(S) - \sum_{v \in Values(F)} \frac{|S_v|}{|S|} Entropy(S_v)$

At the root, with S the full PlayTennis data set ($Entropy(S) = 0.94$):

$Gain(S, \text{Outlook}) = 0.94 - \frac{|S_{sunny}|}{|S|} Entropy(S_{sunny}) - \frac{|S_{overcast}|}{|S|} Entropy(S_{overcast}) - \frac{|S_{rain}|}{|S|} Entropy(S_{rain})$

Outlook is chosen at the root; it partitions S into:
• Sunny: D1, D2, D8, D9, D11 [2+, 3−] → ?
• Overcast: D3, D7, D12, D13 [4+, 0−] → Yes
• Rain: D4, D5, D6, D10, D14 [3+, 2−] → ?

Next, grow the sunny branch. Candidate attribute: Temperature.
$E(S_{sunny} \wedge \text{Temperature}_{hot}) = [0+, 2-] = 0$
$E(S_{sunny} \wedge \text{Temperature}_{mild}) = [1+, 1-] = 1$
$E(S_{sunny} \wedge \text{Temperature}_{cool}) = [1+, 0-] = 0$
$Gain(S_{sunny}, \text{Temperature}) = E(S_{sunny}) - \sum_{v} \frac{|S_{sunny} \wedge \text{Temperature}_v|}{|S_{sunny}|} E(S_{sunny} \wedge \text{Temperature}_v) = 0.971 - \tfrac{2}{5}\cdot 0 - \tfrac{2}{5}\cdot 1 - \tfrac{1}{5}\cdot 0 = 0.571$

A code sketch of this gain computation appears right below.
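The gain can be computed directly from the formula. A sketch reusing the entropy helper from the earlier block; examples is assumed to be a list of dicts (one per row of the table) with a "PlayTennis" label key:

from collections import defaultdict

def information_gain(examples, attr, label="PlayTennis"):
    """Gain(S, F) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    def H(rows):
        pos = sum(1 for r in rows if r[label] == "Yes")
        return entropy(pos, len(rows) - pos)   # entropy() from the earlier sketch
    partitions = defaultdict(list)             # S_v for each value v of attr
    for r in examples:
        partitions[r[attr]].append(r)
    return H(examples) - sum(
        len(part) / len(examples) * H(part) for part in partitions.values())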
Candidate attribute: Humidity.
$E(S_{sunny} \wedge \text{Humidity}_{high}) = [0+, 3-] = 0$
$E(S_{sunny} \wedge \text{Humidity}_{normal}) = [2+, 0-] = 0$
$Gain(S_{sunny}, \text{Humidity}) = E(S_{sunny}) - \tfrac{3}{5}\cdot 0 - \tfrac{2}{5}\cdot 0 = 0.971$
Candidate attribute: Wind.
$E(S_{sunny}) = [2+, 3-]: -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$
$E(S_{sunny} \wedge \text{Wind}_{weak}) = [1+, 2-] = 0.918$
$E(S_{sunny} \wedge \text{Wind}_{strong}) = [1+, 1-] = 1$
$Gain(S_{sunny}, \text{Wind}) = 0.971 - \tfrac{3}{5}\cdot 0.918 - \tfrac{2}{5}\cdot 1 = 0.020$

Humidity yields the highest gain (0.971, vs. 0.571 for Temperature and 0.020 for Wind), so the sunny branch is split on Humidity.
Example
The tree after choosing Outlook at the root and Humidity for the sunny branch:

S = {D1, ..., D14} [9+, 5−]
Outlook
• Sunny → {D1, D2, D8, D9, D11} [2+, 3−] → split on Humidity (High | Normal)
• Overcast → {D3, D7, D12, D13} [4+, 0−] → Yes
• Rain → {D4, D5, D6, D10, D14} [3+, 2−] → ?
A binary test on a continuous attribute A compares it against a threshold c:

$A_c = \begin{cases} true & \text{if } A < c \\ false & \text{otherwise} \end{cases}$

How to choose c?
[Figure: the Taxable Income values of the Tid/Refund/Marital Status/Cheat table, sorted, with candidate split positions between adjacent values; the class counts on either side of each candidate (No: 0|7, 1|6, 2|5, 3|4, 3|4, 3|4, 3|4, 4|3, 5|2, 6|1, 7|0) determine the purity of each candidate split.]
A sketch of this threshold search appears below.
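A sketch of the search for c, under the common convention that candidate thresholds are midpoints between adjacent sorted values; it reuses the entropy helper from the earlier block and assumes binary "Yes"/"No" labels:

def best_threshold(values, labels):
    """Scan candidate thresholds; return the one with lowest weighted impurity."""
    pairs = sorted(zip(values, labels))

    def H(group):
        pos = sum(1 for y in group if y == "Yes")
        return entropy(pos, len(group) - pos)   # entropy() from the earlier sketch

    best_c, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                             # no boundary between equal values
        c = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate split position
        left = [y for v, y in pairs if v < c]
        right = [y for v, y in pairs if v >= c]
        weighted = (len(left) * H(left) + len(right) * H(right)) / len(pairs)
        if weighted < best_impurity:
            best_c, best_impurity = c, weighted
    return best_c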
Issue with Information Gain
• There is a natural bias in the information gain measure: it favors attributes with many values over those with few.
• As an extreme example, consider the attribute Date. Because it has a very large number of possible values, it would have the highest information gain of any attribute.
• This is because Date alone perfectly predicts the target attribute over the training data.
• Thus it would be selected as the decision attribute for the root node, leading to a (very broad) tree of depth one that perfectly classifies the training data.
• Of course, this tree would fare poorly on subsequent examples: it is not a useful predictor, despite the fact that it perfectly separates the training data. (See the demonstration below.)
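The bias is easy to demonstrate with the information_gain sketch from above on a toy data set (the rows below are illustrative, not the PlayTennis table):

# An ID-like attribute such as Day produces one pure single-example partition
# per value, so its gain equals Entropy(S), the maximum possible value --
# even though Day is useless for predicting unseen days.
toy = [{"Day": f"D{i}", "Outlook": o, "PlayTennis": y}
       for i, (o, y) in enumerate(
           [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
            ("Rain", "Yes"), ("Rain", "Yes"), ("Rain", "No")], start=1)]

print(information_gain(toy, "Day"))      # 1.0: "perfect", but meaningless
print(information_gain(toy, "Outlook"))  # ~0.54: smaller, yet generalizes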
Measure of impurity: Gini index
• Gini index for a given node t:

$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$

where p(j | t) is the relative frequency of class j at node t.

C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
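The same table, computed directly from the definition (a sketch; the helper name gini is ours):

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from the per-class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(f"C1={c1}, C2={c2}: Gini={gini([c1, c2]):.3f}")
# -> 0.000, 0.278, 0.444, 0.500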
Practical Issues
Overfitting
• Underfitting: the model is too simple, so both training and test errors are large.
• Overfitting: the model is too complex, so training error keeps falling while test error rises.
Overfitting due to Noise
[Figure not recovered.]
Overfitting due to Insufficient Examples
• A lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region.
• An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
• Post-pruning:
– Grow the decision tree to its full size.
– Trim the nodes of the decision tree in a bottom-up fashion.
– If the generalization error improves after trimming, replace the sub-tree with a leaf node.
– The class label of the leaf node is determined by the majority class of the instances in the sub-tree.
A practical post-pruning sketch follows below.
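One practical way to post-prune, shown as a sketch rather than the exact procedure on the slide: scikit-learn's cost-complexity pruning grows the full tree, enumerates pruning strengths, and we keep the pruned tree that scores best on held-out data (standing in for the "generalization error improves" test). The data set here is just a convenient built-in example.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning levels derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Refit at each level and keep the tree with the best held-out accuracy.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val))
print(best.get_n_leaves(), best.score(X_val, y_val))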
Handling Missing Attribute Values