Chapter 3: Trees
Machine Learning – Theory & Practice
Classification Algorithms
A classification algorithm builds a classification model (a classifier) by training on historical data.

Example training data:

NAME  Balance   Age  Default
Mike  23,000    30   yes
Mary  51,100    40   yes
Bill  68,000    55   no
Jim   74,000    46   no
Dave  23,000    47   yes
Anne  100,000   49   no

An example rule learned from this data:
IF Balance >= 50K AND Age > 45 THEN Default = 'no'
Classification: Decision Trees
• DTs are very easy to understand
• Good for descriptive modeling too
[Figure: a one-split decision tree on Age: if Age < 45 predict Yes, if Age >= 45 predict No.]
Example Data Set

OUTLOOK   TEMP (F)  HUMIDITY  WINDY     CLASS
Sunny     79        90        Windy     No Play
Sunny     56        70        Nonwindy  Play
Sunny     60        90        Windy     No Play
Sunny     79        75        Windy     Play
Overcast  88        88        Nonwindy  Play
Overcast  63        75        Windy     Play
Overcast  88        95        Nonwindy  Play
Rainy     78        60        Nonwindy  Play
Rainy     66        70        Nonwindy  Play
Rainy     68        60        Windy     No Play
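The same ten examples as a small Python structure, so the calculations later in the chapter can be reproduced; the field names are illustrative, not taken from the slides.

```python
# The ten training examples from the slide, one record per row of the table.
play_data = [
    {"outlook": "Sunny",    "temp": 79, "humidity": 90, "windy": True,  "play": False},
    {"outlook": "Sunny",    "temp": 56, "humidity": 70, "windy": False, "play": True},
    {"outlook": "Sunny",    "temp": 60, "humidity": 90, "windy": True,  "play": False},
    {"outlook": "Sunny",    "temp": 79, "humidity": 75, "windy": True,  "play": True},
    {"outlook": "Overcast", "temp": 88, "humidity": 88, "windy": False, "play": True},
    {"outlook": "Overcast", "temp": 63, "humidity": 75, "windy": True,  "play": True},
    {"outlook": "Overcast", "temp": 88, "humidity": 95, "windy": False, "play": True},
    {"outlook": "Rainy",    "temp": 78, "humidity": 60, "windy": False, "play": True},
    {"outlook": "Rainy",    "temp": 66, "humidity": 70, "windy": False, "play": True},
    {"outlook": "Rainy",    "temp": 68, "humidity": 60, "windy": True,  "play": False},
]
```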
Example Decision Tree
[Figure, shown repeatedly as the tree is traced: OUTLOOK is the root split. Sunny → test Humidity (<=75 → Play, >75 → No Play); Overcast → Play; Rainy → test Windy (F → Play, T → No Play). The final version annotates example counts: 10 at the root, 4 and 2 at lower nodes.]
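As a sketch, the tree in this figure can be written directly as nested conditionals (the function name and argument order are my own):

```python
def example_tree(outlook, humidity, windy):
    """Classify one day using the decision tree from the figure above."""
    if outlook == "Sunny":
        return "Play" if humidity <= 75 else "No Play"
    if outlook == "Overcast":
        return "Play"
    # Rainy branch: split on Windy
    return "No Play" if windy else "Play"

# e.g. the first training example (Sunny, humidity 90, windy) -> "No Play"
print(example_tree("Sunny", 90, True))
```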
Basic idea:
1. Choose the "best" attribute a* for splitting the current training set D.
2. Separate training set D into subsets {D1, D2, ..., Dk}, where each subset Di contains the examples having the same value for a*.
3. Recurse on each subset Di until a stopping condition is met (see the sketch below).
[Figure: splitting the data on the Income attribute produces subsets S1 and S2.]
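A minimal sketch of this recursive procedure, assuming categorical attributes and entropy-based attribute selection; the impurity choice here is illustrative, and any of the impurity functions discussed later could be substituted.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(data, attributes, target):
    """Pick the attribute whose split leaves the smallest weighted child entropy."""
    def remainder(a):
        groups = Counter(row[a] for row in data)
        return sum(
            (count / len(data)) * entropy([r[target] for r in data if r[a] == v])
            for v, count in groups.items()
        )
    return min(attributes, key=remainder)

def build_tree(data, attributes, target):
    """Recursively build a decision tree represented as nested dicts."""
    labels = [row[target] for row in data]
    if len(set(labels)) == 1 or not attributes:        # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    a_star = best_attribute(data, attributes, target)  # step 1: choose best attribute
    tree = {a_star: {}}
    for value in set(row[a_star] for row in data):     # step 2: split into subsets
        subset = [row for row in data if row[a_star] == value]
        remaining = [a for a in attributes if a != a_star]
        tree[a_star][value] = build_tree(subset, remaining, target)  # step 3: recurse
    return tree
```

For example, build_tree(play_data, ["outlook", "windy"], "play") grows a small tree over the categorical attributes; numeric attributes such as Humidity would need threshold handling, which this sketch omits.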
Which Attribute to Choose?
[Figure: examples plotted in the (Income, Age) plane with axis-parallel thresholds a and b; the corresponding tree first tests Income > a (yes/no) and then Age > b (yes/no).]
Oblique Split
One class: (1,1), (2,1), (1,2), (2,2), (6,7), (7,7); the other class: (6,1), (7,1), (6,2), (7,2).
[Figure: the points plotted in the (X, Y) plane; a single oblique test, X - Y < 2 (yes/no), separates the two classes.]
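A quick check, not from the slides, that the single oblique test separates the two groups of points:

```python
# Points from the slide: group A should satisfy X - Y < 2, group B should not.
group_a = [(1, 1), (2, 1), (1, 2), (2, 2), (6, 7), (7, 7)]
group_b = [(6, 1), (7, 1), (6, 2), (7, 2)]

oblique = lambda x, y: x - y < 2                     # the oblique split from the figure
print(all(oblique(x, y) for x, y in group_a))        # True
print(all(not oblique(x, y) for x, y in group_b))    # True
```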
Summary
Pros:
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle both numerical and categorical features
Cons:
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Calculating Information Gain for OUTLOOK
• The ten examples contain 7 Play and 3 No Play, so the entropy before the split is -0.7 log 0.7 - 0.3 log 0.3 = 0.265 (base-10 logs).
• Entropy of the Sunny child (2 Play, 2 No Play) is -0.5 log 0.5 - 0.5 log 0.5 = 0.301.
• Entropy of the Rainy child (2 Play, 1 No Play) is 0.276; for the Overcast child (all Play) it is 0.
• The drop in impurity is therefore
• 0.265 - 0.4(0.301) - 0.3(0.0) - 0.3(0.276) = 0.062.
Solutions to overfitting:
1. Grow the tree until the algorithm stops, even if overfitting shows up, and then prune the tree as a post-processing step.
2. Stop growing the tree once its size reaches some specified limit.
Decision Tree Pruning
When to Stop Splitting?
Use cross-validation: a part of the data is kept aside as a validation set.
Stop splitting (or prune back) at the point where the results on the validation data are best.
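One way to realize the grow-then-prune strategy is cost-complexity pruning with a held-out validation split; the use of scikit-learn and synthetic data below is an illustration choice, not something prescribed by the slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a full tree, then consider the sequence of pruned subtrees it induces.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Pick the pruning level that scores best on the held-out validation data.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative values from floating point
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, best_score, pruned.get_n_leaves())
```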
Impurity Functions
• Entropy
• Gini Index
• Misclassification
Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Impurity Functions: Gini Index and Entropy
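For a two-class node with positive fraction p, the three impurity measures listed above can be written as short functions (a sketch; scaling conventions vary across textbooks):

```python
import math

def entropy(p):
    """Entropy impurity of a binary node with positive fraction p (base 2)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    """Gini index: 1 - p^2 - (1-p)^2 = 2p(1-p) for two classes."""
    return 2 * p * (1 - p)

def misclassification(p):
    """Misclassification impurity: error of predicting the majority class."""
    return min(p, 1 - p)

# All three peak at p = 0.5 (maximally impure) and vanish at p = 0 or 1 (pure).
for p in (0.0, 0.25, 0.5, 1.0):
    print(p, round(entropy(p), 3), round(gini(p), 3), misclassification(p))
```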
Decision Trees: Weak Learners
• 1-level decision trees: trees that classify examples on the basis of a single attribute.
• A program called 1R, which learns 1-rules from examples, was compared to C4 on 16 datasets commonly used in ML research.
• Individually, these rules are not powerful enough to classify a dataset well; for this reason they are called weak learners.
• The main result of comparing 1R and C4 is insight into the trade-off between simplicity and accuracy.
• 1R's rules are only a little less accurate (3.1 percentage points) than C4's pruned decision trees on almost all of the datasets.
Robert Holte, Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, Machine Learning, 1993.
https://www.cs.cornell.edu/courses/cs478/2000SP/lectures/decision-
Decision Stumps: UCI ML Datasets
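A minimal decision-stump learner in the 1R spirit, assuming categorical attributes; this is an illustrative sketch rather than Holte's exact 1R procedure (which also handles numeric attributes and missing values).

```python
from collections import Counter, defaultdict

def learn_stump(data, attributes, target):
    """Learn a 1-level tree: for each attribute, map each of its values to the
    majority class, and keep the attribute with the fewest training errors."""
    best = None
    for a in attributes:
        by_value = defaultdict(list)
        for row in data:
            by_value[row[a]].append(row[target])
        rule = {v: Counter(labels).most_common(1)[0][0] for v, labels in by_value.items()}
        errors = sum(rule[row[a]] != row[target] for row in data)
        if best is None or errors < best[2]:
            best = (a, rule, errors)
    attribute, rule, _ = best
    return lambda row: rule.get(row[attribute])

# Example on the play/no-play data defined earlier:
# stump = learn_stump(play_data, ["outlook", "windy"], "play")
# stump({"outlook": "Overcast", "windy": False})  -> True
```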
Example Classification Problem
[Figure: a two-class example; the first weak learner h1 is trained, and the reweighted distribution D2 for the next round emphasizes the examples h1 misclassifies.]
AdaBoost: Second Weak Learner
AdaBoost: Third Weak Learner
AdaBoost: Final Classifier
Training Error of AdaBoost
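A sketch of the standard AdaBoost loop with decision stumps as weak learners, to accompany the slides above; the toy data, the number of rounds, and the use of scikit-learn stumps are my own illustration choices, not the slides' example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=3):
    """Standard AdaBoost with depth-1 trees (decision stumps); y must be in {-1, +1}."""
    n = len(y)
    weights = np.full(n, 1.0 / n)          # D1: uniform weights over the training set
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = weights[pred != y].sum()                      # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # this weak learner's vote
        weights *= np.exp(-alpha * y * pred)                # up-weight the mistakes
        weights /= weights.sum()                            # next distribution D_{t+1}
        stumps.append(stump)
        alphas.append(alpha)
    def final_classifier(X_new):
        votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(votes)
    return final_classifier

# Toy usage: a small two-class point set with labels in {-1, +1}.
X = np.array([[1, 1], [2, 1], [1, 2], [6, 1], [7, 1], [6, 2], [2, 6], [6, 7]])
y = np.array([1, 1, 1, -1, -1, -1, 1, -1])
H = adaboost(X, y, n_rounds=3)
print((H(X) == y).mean())   # training accuracy of the combined classifier
```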
Regression
• Problem: learn a model F(x) that predicts y so as to minimize squared loss.
• You check this model and find it to be good, but it still makes errors.
• For example: F(x1) = 0.8 while y1 = 0.9, and F(x2) = 1.4 while y2 = 1.3. How can the model be improved?
• Constraints on the learners:
  • The existing model F cannot be changed: no parameter of F may be modified.
  • You can add another model (a regression tree) h, so that the new prediction is F(x) + h(x).
• A simple solution:
  • We would like to improve the model such that F(x1) + h(x1) = y1, F(x2) + h(x2) = y2, ..., F(xn) + h(xn) = yn.
  • Or equivalently, we want h(x1) = y1 - F(x1), ..., h(xn) = yn - F(xn), i.e. h should fit the residuals.
  • Loss function: L(y, F(x)) = (y - F(x))^2 / 2.
  • We want to minimize J = Σi L(yi, F(xi)) by adjusting the predictions F(x1), ..., F(xn).
  • This means ∂J/∂F(xi) = F(xi) - yi.
  • Hence yi - F(xi) = -∂J/∂F(xi).
  • It is of the form: the residual that h fits is exactly the negative gradient of the loss.
Gradient Boost
• For regression with squared loss,
residual ⟺ negative gradient
h fits the residual ⟺ h fits the negative gradient
Update F based on residual ⟺ Update F based on negative gradient
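A compact sketch of this residual-fitting loop for squared loss, using small scikit-learn regression trees as the h's; the learning rate, tree depth, and toy data are arbitrary illustration choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared loss: repeatedly fit a small regression
    tree h to the residuals (= negative gradients) and add it to F."""
    F = np.full(len(y), y.mean())        # initial model: predict the mean
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                                  # negative gradient of (y - F)^2 / 2
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += learning_rate * h.predict(X)                  # update F based on the residual
        trees.append(h)
    def predict(X_new):
        out = np.full(len(X_new), y.mean())
        for h in trees:
            out += learning_rate * h.predict(X_new)
        return out
    return predict

# Toy usage: recover y = sin(x) from noisy samples.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)
model = gradient_boost(X, y)
print(np.mean((model(X) - y) ** 2))      # training MSE after boosting
```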