
Models Based on Decision Trees
CHAPTER 3: MACHINE LEARNING – THEORY & PRACTICE

[Overview diagram: classifiers covered so far (KNNC, decision trees, and linear discriminant functions g(X) = Wᵗ X + b) alongside non-linear models such as neural networks and SVMs; see Andrew Moore's Tutorial on SVMs.]

Decision Trees
• The simplest and easiest-to-understand abstraction.
• Four coins a, b, c, and d: find the heavier coin.

                a+b > c+d ?
            Yes            No
          a > b ?        c > d ?
         Yes    No      Yes    No
          a      b       c      d

• A tree where each internal node is a decision node.
• It requires two weighings to decide.
• Leaf nodes are associated with outcomes.
• Each path from the root to a leaf is a rule.
Learning Rules
A learner (also called an inducer or induction algorithm) is a method or algorithm used to generalize a pattern from a set of examples.

Name   Balance   Age   Default
Mike    23,000    30   yes
Mary    51,100    40   yes
Bill    68,000    55   no
Jim     74,000    46   no
Dave    23,000    47   yes
Anne   100,000    49   no

Learner: induces a pattern from the examples.

Pattern:
IF Balance >= 50K and Age > 45
THEN Default = 'no'
ELSE Default = 'yes'
Classification Example

Training Data -> Classification Algorithm (builds a classification model using historical data) -> Classifier (Model)

NAME   Balance   Age   Default
Mike    23,000    30   yes
Mary    51,100    40   yes
Bill    68,000    55   no
Jim     74,000    46   no
Dave    23,000    47   yes
Anne   100,000    49   no

Classifier (Model):
IF Balance >= 50K and Age > 45
THEN Default = 'no'
Classification: Decision Trees
• Learn a set of {IF (condition) THEN (class)} rules.

Example: credit risk management
New applicant: (Mark, Balance = 88K, Age = 40)

                 Balance
            <50K        >=50K
            Yes           Age
                     <45       >=45
                     Yes        No

• Decision trees are very easy to understand.
• Good for descriptive modeling too.
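As a quick illustration (my own sketch, not from the slides), the same kind of rule can be induced with scikit-learn's DecisionTreeClassifier; on such a tiny sample the learned splits may be simpler than the hand-drawn tree.

```python
# Minimal sketch: fit a decision tree to the toy credit data above, then classify
# the new applicant (Mark, Balance=88K, Age=40).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: Balance, Age; labels: Default (yes/no)
X = np.array([[23000, 30], [51100, 40], [68000, 55],
              [74000, 46], [23000, 47], [100000, 49]])
y = np.array(["yes", "yes", "no", "no", "yes", "no"])

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Balance", "Age"]))

# The learned split(s) may differ from the hand-drawn tree on such a small sample.
print(tree.predict([[88000, 40]]))
```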
Example Data Set
OUTLOOK    TEMP (F)   HUMIDITY   WINDY      CLASS
Sunny        79         90       Windy      No Play
Sunny        56         70       Nonwindy   Play
Sunny        60         90       Windy      No Play
Sunny        79         75       Windy      Play
Overcast     88         88       Nonwindy   Play
Overcast     63         75       Windy      Play
Overcast     88         95       Nonwindy   Play
Rainy        78         60       Nonwindy   Play
Rainy        66         70       Nonwindy   Play
Rainy        68         60       Windy      No Play
Example Decision Tree

                    OUTLOOK
          Sunny    Overcast    Rainy
        Humidity     Play      Windy
      <=75    >75             T     F
      Play     NP            NP    Play

• Class labels (Play / No Play, NP) are associated with the leaf nodes.
• Each root-to-leaf path represents a rule:
  If (Outlook = Sunny) and (Humidity > 75) then No Play
Example Decision Tree

[Same tree as above: OUTLOOK at the root; Sunny leads to Humidity, Overcast to Play, Rainy to Windy]

• Classification involves making decisions at the internal nodes and moving down the appropriate branch until a leaf is reached.
• If (Outlook = Rainy), (Temp = 70), (Humidity = 65) and (Windy = True), then Class = ? -> No Play (NP)
Example Decision Tree

[Same tree as above]

• Irrelevant features are eliminated: for example, Temp does not appear in the tree.
• The tree can deal with both numerical and categorical features: Windy and Outlook are categorical; Temp and Humidity are numerical.
Example Decision Tree

[Same tree as above, annotated with the number of training patterns at each node: 10 at the OUTLOOK root, 4 at the Humidity node under Sunny, and 2 at the No Play leaf reached when Humidity > 75]

• The tree can be binary or non-binary.
• The rules are simple and easy to understand.
• A set of patterns is associated with each node.
Construction of Decision Trees
There are different ways to construct trees from data.
We will concentrate on the top-down, greedy search approach (a sketch of the procedure follows below).

Basic idea:
1. Choose the best attribute a* to place at the root of the tree.
2. Separate the training set D into subsets {D1, D2, ..., Dk}, where each subset Di contains the examples having the same value for a*.
3. Recursively apply the algorithm to each new subset until all examples have the same class or there are too few of them to split further.
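A minimal, self-contained sketch of this top-down greedy procedure (my own illustration; the function names are not from the book, and the best attribute is chosen here by the smallest weighted child entropy, one of the criteria discussed next):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(D, attributes):
    # Step 1: pick the attribute whose split leaves the smallest weighted child entropy.
    def weighted_entropy(a):
        total = 0.0
        for v in {x[a] for x, _ in D}:
            subset = [y for x, y in D if x[a] == v]
            total += len(subset) / len(D) * entropy(subset)
        return total
    return min(attributes, key=weighted_entropy)

def build_tree(D, attributes, min_samples=2):
    labels = [y for _, y in D]
    if len(set(labels)) == 1 or len(D) < min_samples or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}      # majority class
    a = best_attribute(D, attributes)
    children = {}
    for v in {x[a] for x, _ in D}:                                 # Step 2: split D on a*
        D_v = [(x, y) for x, y in D if x[a] == v]
        children[v] = build_tree(D_v, [b for b in attributes if b != a], min_samples)  # Step 3
    return {"attribute": a, "children": children}

# Tiny example: each training pattern is ({'Outlook': ..., 'Windy': ...}, class)
examples = [({"Outlook": "Sunny", "Windy": "Windy"}, "No Play"),
            ({"Outlook": "Sunny", "Windy": "Nonwindy"}, "Play"),
            ({"Outlook": "Overcast", "Windy": "Windy"}, "Play"),
            ({"Outlook": "Rainy", "Windy": "Windy"}, "No Play")]
print(build_tree(examples, ["Outlook", "Windy"]))
```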
Splitting Functions
• Type of test:
  • Axis-parallel test
  • Test based on a linear combination of features
  • Test based on a nonlinear combination of features

• Which attribute is the best one to split the data on?

The entropy associated with a random variable X is defined as

  H(X) = - Σ pi log pi

where the log is usually taken in base 2 (some of the worked examples below use base-10 logarithms).
Splitting Based on Entropy

[Figure: the sample plotted in the Age-Income plane, with a split on Income]

Income divides the sample into:
  S1 = {6+, 0-}, so H(S1) = 0
  S2 = {3+, 5-}, so H(S2) = -(3/8) log(3/8) - (5/8) log(5/8)
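Evaluating H(S2) numerically (my own quick check, using base-2 logs):

```python
import math
p_plus, p_minus = 3 / 8, 5 / 8
H_S2 = -p_plus * math.log2(p_plus) - p_minus * math.log2(p_minus)
print(round(H_S2, 3))   # about 0.954 bits; H(S1) = 0 because S1 is pure
```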
Which Attribute to Choose?

                 X   (40, 30, 30)
             = a /          \ = b
      (40, 10, 10)          (0, 20, 20)

• The entropy impurity at the split node is
  -0.4 log 0.4 - 0.3 log 0.3 - 0.3 log 0.3 ≈ 1.57
• The entropy of the left branch is
  -0.66 log 0.66 - 0.17 log 0.17 - 0.17 log 0.17 ≈ 1.25
• The entropy of the right branch is
  -0.5 log 0.5 - 0.5 log 0.5 = 1.0
• The drop in impurity is therefore
  1.57 - 0.6 × 1.25 - 0.4 × 1.0 ≈ 0.42
Which Attribute to Choose?

                 Y   (40, 30, 30)
             = c /          \ = d
       (0, 30, 30)          (40, 0, 0)

• The entropy impurity at the split node is
  -0.4 log 0.4 - 0.3 log 0.3 - 0.3 log 0.3 ≈ 1.57
• The entropy of the left branch is
  -0.5 log 0.5 - 0.5 log 0.5 = 1.0
• The entropy of the right branch is 0.0
• The drop in impurity is therefore
  1.57 - 0.6 × 1.0 - 0.4 × 0.0 ≈ 0.97, so Y gives the larger drop and is the better split.
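These two calculations are easy to check in code (my own sketch; all logs are base 2):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

print(round(info_gain((40, 30, 30), [(40, 10, 10), (0, 20, 20)]), 2))  # split on X, about 0.42
print(round(info_gain((40, 30, 30), [(0, 30, 30), (40, 0, 0)]), 2))    # split on Y, about 0.97
```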
Axis-Parallel Split

[Figure: the Age-Income plane partitioned by the axis-parallel thresholds a (on Income) and b (on Age), together with the corresponding tree: Income > a? at the root and Age > b? below its "yes" branch]
Oblique Split

[Figure: points (1,1), (2,1), (1,2), (2,2), (6,7), (7,7) of one class and (6,1), (7,1), (6,2), (7,2) of the other, separated by the single oblique test X - Y < 2 (yes/no)]
Summary
Pros:
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle both numerical and categorical features

Cons:
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Calculating Information Gain for OUTLOOK
• Entropy at the split node is -0.4 log 0.4 - 0.6 log 0.6 = 0.29 (base-10 logarithms are used in these worked examples)
• Entropy at the left child is -0.25 log 0.25 - 0.75 log 0.75 = 0.244
• Entropy at the right child is 0.278; for the middle node it is 0
• The drop in impurity is therefore
  0.29 - 0.4 × 0.244 - 0.3 × 0.0 - 0.3 × 0.278 = 0.11
• This is also called the Information Gain.
Calculating Information Gain for HUMIDITY
• Entropy at the split node is -0.4 log 0.4 - 0.6 log 0.6 = 0.29
• For one candidate threshold, the drop in impurity is
  0.29 - 0.7 [-0.7 log 0.7 - 0.3 log 0.3] - 0.3 [-0.33 log 0.33 - 0.66 log 0.66] = 0.02 (Information Gain)
• For a second candidate threshold, the drop in impurity is
  0.29 - 0.6 [-0.33 log 0.33 - 0.66 log 0.66] - 0.4 [-0.5 log 0.5 - 0.5 log 0.5] = 0.003 (Information Gain)
Calculating Information Gain for TEMP
• Entropy at the split node is -0.4 log 0.4 - 0.6 log 0.6 = 0.29
• For one candidate threshold, the drop in impurity is
  0.29 - 0.8 [-0.5 log 0.5 - 0.5 log 0.5] - 0.0 = 0.05 (Information Gain)
• For a second candidate threshold, the drop in impurity is
  0.29 - 0.5 [-0.6 log 0.6 - 0.4 log 0.4] - 0.5 [-0.8 log 0.8 - 0.2 log 0.2] = 0.035 (Information Gain)
Calculating Information Gain for WINDY
• Entropy at the split node is -0.4 log 0.4 - 0.6 log 0.6 = 0.29
• The drop in impurity is
  0.29 - 0.5 [-0.6 log 0.6 - 0.4 log 0.4] - 0.5 [-0.8 log 0.8 - 0.2 log 0.2] = 0.035 (Information Gain)

• So, OUTLOOK is the best feature: it has the largest Information Gain, 0.11.
• That is why OUTLOOK was used at the root node of the decision tree shown earlier for this data.
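The same calculation can be scripted. A small sketch of my own (not the course code) that computes the gain for each attribute of the weather table; the exact numbers depend on the log base, the thresholds chosen for the numerical attributes, and the class labels used, so they may not match the slide values exactly.

```python
# Information gain for each attribute of the weather table (base-10 logs, as on the slides).
import math
from collections import Counter

data = [  # (OUTLOOK, TEMP, HUMIDITY, WINDY, CLASS)
    ("Sunny", 79, 90, "Windy", "No Play"), ("Sunny", 56, 70, "Nonwindy", "Play"),
    ("Sunny", 60, 90, "Windy", "No Play"), ("Sunny", 79, 75, "Windy", "Play"),
    ("Overcast", 88, 88, "Nonwindy", "Play"), ("Overcast", 63, 75, "Windy", "Play"),
    ("Overcast", 88, 95, "Nonwindy", "Play"), ("Rainy", 78, 60, "Nonwindy", "Play"),
    ("Rainy", 66, 70, "Nonwindy", "Play"), ("Rainy", 68, 60, "Windy", "No Play"),
]

def entropy(labels, base=10):
    n = len(labels)
    return -sum(c / n * math.log(c / n, base) for c in Counter(labels).values())

def gain(key):
    labels = [row[-1] for row in data]
    groups = Counter(key(row) for row in data)
    weighted = sum(groups[v] / len(data) *
                   entropy([row[-1] for row in data if key(row) == v])
                   for v in groups)
    return entropy(labels) - weighted

print("OUTLOOK      ", round(gain(lambda r: r[0]), 3))
print("TEMP > 70    ", round(gain(lambda r: r[1] > 70), 3))   # threshold is my own choice
print("HUMIDITY<=75 ", round(gain(lambda r: r[2] <= 75), 3))
print("WINDY        ", round(gain(lambda r: r[3]), 3))
```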
Decision Tree: Iris Data

[Figure: decision tree learned on the Iris data; the splits use X[3] (Petal Width) and X[2] (Petal Length)]

Decision Tree: Breast Cancer Data
Both classification and feature selection at the same time!

• X = X[:, [7, 23]]: ACC = 0.9301 (KNNC with K = 11)
• X = X[:, [3, 7, 23]]: ACC = 0.9580
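A rough reconstruction of this experiment (my own sketch; I assume the scikit-learn breast-cancer data and an arbitrary 75/25 split, so the accuracies will not match the slide values exactly):

```python
# Use the feature columns selected by the decision tree, then compare a tree and KNNC on them.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X = X[:, [7, 23]]                     # features picked out by the tree, as on the slide
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

print(DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte))
print(KNeighborsClassifier(n_neighbors=11).fit(Xtr, ytr).score(Xte, yte))   # KNNC with K = 11
```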
Depth of the Tree: Overfitting

A tree overfits the data if we let it grow deep enough that it begins to capture "aberrations" in the data that harm the predictive power on unseen examples.

[Figure: Age-Income scatter plot in which a few isolated points, possibly just noise, force the tree to be grown deeper to capture them]
When to Stop Splitting?
• Stop when the reduction in impurity is small, that is, stop splitting when all the nodes are pure or almost pure.
• Training-sample threshold k: stop splitting when the number of samples at a node is ≤ k (alternatively, specified as a percentage of the training-sample size).
• Terminate the splitting process using a global criterion function. A possible criterion function is of the form
  α × (size of the tree) + (sum of the impurities of the leaf nodes).
• Cost = R(T) + α × (# leaf nodes), where R(T) is the error; R(T) goes down as the number of leaf nodes increases.
• α helps in balancing the two terms; α = 0 encourages a bigger tree (see the pruning sketch below).
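This cost-complexity trade-off is what scikit-learn exposes through the ccp_alpha parameter. A brief sketch of my own (the dataset is an assumption, used only for illustration):

```python
# Cost-complexity pruning: Cost = R(T) + alpha * (#leaves); larger alpha gives smaller trees.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(Xtr, ytr)
for alpha in path.ccp_alphas[::5]:            # try a few alphas along the pruning path
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(Xtr, ytr)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  test acc={tree.score(Xte, yte):.3f}")
```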
Overfitting the Data
Overfitting: the tree can grow in size, for example by having a unique path in the tree for every training pattern.

Solutions:
1. Grow the tree until the algorithm stops, even if the overfitting problem shows up, and then prune the tree as a post-processing step.
2. Stop growing the tree as soon as its size goes beyond some specified limit.
Decision Tree Pruning
When to Stop Splitting?
• Use cross-validation: a part of the data is kept aside for validation, and splitting is stopped when the best results are obtained on the validation data.
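For instance, the maximum depth can be chosen by cross-validation rather than growing the tree fully (a sketch of my own, with an assumed dataset):

```python
# Pick the tree depth that maximizes cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [1, 2, 3, 4, 5, None]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```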
Impurity Functions
• Entropy
• Gini Index
• Misclassification

[Figure from Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, John Wiley & Sons, 2001.]
Impurity Functions: Gini Index and Entropy

[Trees grown on the breast-cancer data with the two impurity functions select slightly different feature subsets:
  X = X[:, [7, 23]]: ACC = 0.9301 (KNNC)        X = X[:, [7, 22]]: ACC = 0.9230 (KNNC)
  X = X[:, [3, 7, 23]]: ACC = 0.9580 (K = 11)   X = X[:, [1, 7, 22]]: ACC = 0.9650 (K = 7)]
Variance Impurity
• An impurity suitable for a 2-class problem is i(n) = P(C1) P(C2).
• It is 0 at a node when the node has patterns of only C1 or only C2.
• This may be called the Variance Impurity. In the 2-class case, P(C2) = 1 - P(C1).
• It cannot be a polynomial of degree 1 in P(C1), of the form a P(C1) + b:
  • we need i(P(C1) = 0) = i(P(C1) = 1) = 0 as the condition to be satisfied;
  • P(C1) = 0 gives b = 0, and P(C1) = 1 gives a + b = 0, so a = 0.
• If we consider the form a P(C1)² + b P(C1) + c with the same conditions, then P(C1) = 0 gives c = 0, and P(C1) = 1 gives a + b = 0, so a = -b.
• So i(P(C1)) = a P(C1)² + b P(C1) + c = b P(C1) (1 - P(C1)) = b P(C1) P(C2), that is, i ∝ P(C1) P(C2).
• If i is seen as a binary random variable taking value 1 for C1 and 0 for C2, then E[i] = 1·P(C1) + 0·P(C2) = P(C1).
• Since Var[X] = E[X²] - (E[X])² for a random variable X,
  Var[i] = 1²·P(C1) + 0²·P(C2) - P(C1)² = P(C1) (1 - P(C1)) = P(C1) P(C2).
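For a 2-class node with p = P(C1), the impurity functions above can be compared directly; a small sketch of my own (all three vanish at p = 0 and p = 1 and peak at p = 0.5):

```python
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)            # 1 - p^2 - (1 - p)^2 for two classes

def misclassification(p):
    return min(p, 1 - p)

def variance(p):
    return p * (1 - p)                # the variance impurity P(C1) P(C2) derived above

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}  "
          f"misclassification={misclassification(p):.3f}  variance={variance(p):.3f}")
```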
Regions and Classes
Regions and Function Values
• Tree-based methods for regression involve segmenting the space of points into axis-aligned regions using axis-parallel splits, so the fitted function is piecewise constant over these regions.
• The splitting can be represented using a tree, where each internal node abstracts a decision and each leaf node represents a simpler region of similar patterns.
• The basic idea of these methods is to partition the space and identify some representative vectors.
• We make predictions in a regression problem by dividing the space of all possible values of the variables into Region_1, Region_2, ..., Region_k, corresponding to the k leaf nodes.
• Then, for every vector X that falls into a particular region (say Region_j), we make the same prediction.
• We used KNNs for this earlier; here a region is used instead.
Regression Based on Decision Trees

[Figures: a data set for regression, the decision tree fitted to it, and the resulting piecewise-constant regression function]
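A minimal regression-tree sketch of my own, illustrating the piecewise-constant prediction:

```python
# Fit a shallow regression tree to noisy 1-D data; the prediction is constant within
# each region found by the axis-parallel splits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[0.5], [2.5], [4.5]]))   # one constant value per region
```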
What is Boosting?
• A method for improving classifier accuracy
• Basic idea:
• Perform iterative search to locate the regions/examples that are
more difficult to predict.
• Through each iteration reward accurate predictions on these
regions
• Combine the rules from different iterations.
• Only requires that the underlying learning algorithm be better
than guessing.
Boosting: Weak Learners
• Boosting refers to combining multiple weak learners to get a strong learner.
• We can understand this definition by looking at a simple class of weak learners: 1-level decision trees (decision stumps), which classify examples on the basis of a single attribute.
• A program called 1R, which learns 1-rules from examples, was compared to C4 on 16 datasets commonly used in ML research.
• Individually, these rules are not powerful enough to classify the dataset well; therefore, they are called weak learners.
• The main result of comparing 1R and C4 is insight into the trade-off between simplicity and accuracy.
• 1R's rules are only a little less accurate (3.1 percentage points) than C4's pruned decision trees on almost all of the datasets.
• https://www.cs.cornell.edu/courses/cs478/2000SP/lectures/decision-
Decision Trees: Weak Learners

Robert Holte, "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets," Machine Learning, 1993.

Decision Stumps: UCI ML Datasets

Example Classification Problem

AdaBoost blog: https://mccormickml.com/2013/12/13/adaboost-tutorial/
AdaBoost: The Overall Idea
• Consider the diagram below.

[Figure: a sequence of decision stumps fitted to progressively reweighted data]

• A decision stump on feature 1 is the first weak learner (first box).
• We have 3 misclassified observations out of a total of 10 using this stump.
• We give larger weights to these 3 misclassified observations next; it becomes very important to classify them correctly, so the next decision stump moves towards the right-side edge in the second box.
• We repeat this process and combine the learners, weighted appropriately (a library-level sketch follows below).
• Which learner is more important?
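A library-level sketch of this loop (my own illustration, not the course code). In scikit-learn's AdaBoostClassifier the default base learner is already a depth-1 tree, i.e. a decision stump; the sample weights of misclassified points grow each round, and the stumps are combined with weights based on their accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Default base learner: a decision stump (depth-1 tree).
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(round(ada.score(X, y), 3))
print(ada.estimator_weights_[:5])   # more accurate weak learners receive larger weights
```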
Weak Learner - 1

Swap the Class Labels!

AdaBoost: Example 2-Class Data
D1(i) = 1/10: all the patterns are given equal weight initially.

AdaBoost: First Weak Learner
[Figure: the first weak learner h1 and the reweighted distribution D2]

AdaBoost: Second Weak Learner
AdaBoost: Third Weak Learner
AdaBoost: Final Classifier
Training Error of AdaBoost
Regression
• Problem: the task is to learn a model y = F(x) that minimizes squared loss.
• You check this model and find it to be good, but it makes errors.
• There are some mistakes: F(x1) = 0.8 while y1 = 0.9, and F(x2) = 1.4 while y2 = 1.3. How to improve this model?
• Constraints on the learners:
  • The model F cannot be changed: no change in any parameter of F.
  • You can add another model (a regression tree) h, so the new prediction will be F(x) + h(x).
• A simple solution:
  • We would like to improve the model such that F(xi) + h(xi) = yi for every i.
  • Or equivalently, we want h(xi) = yi - F(xi), i.e., h should fit the residuals.
  • Can any regression tree h achieve this goal?
Additive Models
• Some regression tree might be able to do this approximately.
• Just fit a regression tree h to the residuals, that is, to the data (xi, yi - F(xi)).
• The role of h is to compensate for the shortcomings of the existing F. If the new model F + h is still not satisfactory, we can add another regression tree, and so on.
• These additions improve the predictions on the training data, but will the procedure also work on test data?
• We are building a model, and the model can be applied to test data also.
• How is this related to gradient descent?
Gradient Boosting
• Gradient boosting: train many models sequentially.
• Each new model further minimizes the loss function (y = ax + b + e, where the error term e needs special attention) of the whole system using the gradient descent method.
• The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable y.
• The principal idea behind this algorithm is to construct new base learners that are maximally correlated with the negative gradient of the loss function of the whole ensemble.
Gradient Boosting
• Problem: given a set of variables x1, x2, and x3, we need to predict y ∈ ℝ.
• Gradient Boost algorithm:
  1. Start with the mean as the prediction over all variables.
  2. Calculate the deviation of each observation from the mean.
  3. Find the variable that best splits the deviations and find the value for the split. This split is taken as the latest learner.
  4. Calculate the errors of each observation from the mean on both sides of the split.
  5. Repeat steps 3 and 4 until the objective function is minimized.
  6. The final predictor is a weighted mean of all the learners. (A worked example follows.)
Gradient Boosting: Example

Data:                    x : 0    1    2     3    4
                         y : 1    2    2.9   4    5.1

Step 1. Initial model: f(x) = 3 (the mean of y).
Step 2. Residuals y - f(x):  -2   -1   -0.1   1    2.1
Step 3. Fit the first residual tree h(x) to these residuals (splits x > 2, x > 0, x > 3), giving leaf values
        h(0) = -2,  h(1) = h(2) = -0.55,  h(3) = 1,  h(4) = 2.1
Step 4. New residuals y - f(x) - h(x):  0   -0.45   0.45   0   0
Step 5. Fit a second residual tree g(x) to these (splits x > 1, x > 0, x > 2):
        g(0) = 0,  g(1) = -0.45,  g(2) = 0.45,  g(3) = g(4) = 0

The combined prediction is f(x) + h(x) + g(x); for example, at x = 2 it is 3 - 0.55 + 0.45 = 2.9, which matches y = 2.9.
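The residual-fitting loop in this example can be reproduced with shallow regression trees (my own sketch; the leaf values match the hand calculation up to how ties are split):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0], [1], [2], [3], [4]], dtype=float)
y = np.array([1, 2, 2.9, 4, 5.1])

pred = np.full_like(y, y.mean())            # step 1: start with the mean, f(x) = 3
for step in range(2):                       # two residual trees, h and g
    residual = y - pred                     # steps 2/4: deviations from the current model
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # steps 3/5
    pred = pred + tree.predict(X)
    print(f"after tree {step + 1}: residuals = {np.round(y - pred, 2)}")
```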
Gradient Descent
• Minimize a function by moving in the negative direction of the gradient:
  θ ← θ - ρ ∂J/∂θ
• How is this related to gradient boosting?
• Loss function: L(y, F(x)) = (y - F(x))² / 2
• We want to minimize J = Σi L(yi, F(xi)) by adjusting F(x1), F(x2), ..., F(xn).
• Note that F(x1), F(x2), ..., F(xn) are just some numbers, so we can treat them as parameters and take derivatives:
  ∂J/∂F(xi) = F(xi) - yi
Gradient Descent
• We can view the residual-fitting step as follows:
• yi - F(xi) is the negative of the gradient, that is, yi - F(xi) = -∂J/∂F(xi).
• This means that fitting h to the residuals is fitting h to the negative gradient.
• Hence the update is F(xi) ← F(xi) + h(xi) = F(xi) - ∂J/∂F(xi).
• It is of the form θ ← θ - ρ ∂J/∂θ, i.e., a gradient-descent step.
Gradient Boost
• For regression with squared loss:
  residual ⟺ negative gradient
  h fits the residual ⟺ h fits the negative gradient
  updating F based on the residual ⟺ updating F based on the negative gradient
• So, the model is updated using gradient descent!
• The concept of gradients is more general and useful than the concept of residuals (other loss functions can be plugged in).
• Hence the name gradient boosting.
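This residual/negative-gradient loop is what library implementations such as scikit-learn's GradientBoostingRegressor (and XGBoost) provide, with a learning rate playing the role of the step size ρ. A brief sketch of my own with synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=2,
                                random_state=0).fit(X, y)
print(round(gbr.score(X, y), 3))
```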


Random Forest
• The decision-tree classifier is greedy and computationally expensive when dealing with high-dimensional data sets.
• A random forest is an ensemble classifier that consists of many decision trees and outputs the majority class label among the outputs of the individual trees.
• If there are N data points and M features, then select bootstrap samples S1, S2, ..., Sk, where |Si| = N (sampled with replacement) and k >= 100, and build k decision trees.
• At each node of a decision tree, choose m << M features at random and find the best feature among them.
• Decide based on the majority class label among the k outputs (see the sketch below).
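A brief sketch of this procedure with scikit-learn (my own illustration; the dataset is an assumption): k trees, each grown on a bootstrap sample, with a random subset of features considered at every split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100,        # k = 100 trees
                            max_features="sqrt",     # m << M features tried at each node
                            bootstrap=True,          # |Si| = N, sampled with replacement
                            random_state=0)
print(round(cross_val_score(rf, X, y, cv=5).mean(), 3))
```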
Random Forest: Strength and Correlation

Leo Breiman, "Random Forests," Machine Learning, Volume 45, 2001.

[Experimental results:
  Train: (6237, 617), Test: (1560, 617), best accuracy: 0.9378
  Train: (7017, 617), Test: (780, 617), best accuracy: 0.9397]
Random Forest: Practical Considerations
• Splits are chosen according to a purity measure, e.g., squared error (regression) or the Gini index or entropy (classification).
• How to select the number of trees (the forest size)?
  • Build trees until the error no longer decreases.
• How to select m?
  • Try the recommended default, half of it, and twice it, and pick the best.
Features and Advantages
The advantages of random forest are:
• It is one of the most accurate learning algorithms available. For many
data sets, it produces a highly accurate classifier.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable deletion.
• It gives estimates of what variables are important in the classification.
• It generates an internal unbiased estimate of the generalization error
as the forest building progresses.
• It has an effective method for estimating missing data and maintains
accuracy when a large proportion of the data are missing.
Example Digit Patterns: XGBoost Classifier
Confusion Matrices: RF and XGB Classifiers

[Figure: confusion matrices for the Random Forest and XGBoost classifiers; Train: (6237, 617), Test: (1560, 617), best accuracy: 0.9378]