Note 6
Example:
• Goal: classify a record as “will accept credit card offer” or “will not accept”
• Rule might be “IF (Income >= 106) AND (Education < 1.5) AND (Family <= 2.5) THEN
Class = 0 (nonacceptor)”
• Also called CART, Decision Trees, or just Trees
• Rules are represented by tree diagrams
Key Ideas
Recursive partitioning: Repeatedly split the records into two parts so as to achieve maximum
homogeneity of outcome within each new part
Pruning the tree: Simplify the tree by pruning peripheral branches to avoid overfitting
Recursive Partitioning
• Pick a value of x_i, say s_i, that divides the training data into two (not necessarily equal)
portions
• Measure how “pure” or homogeneous each of the resulting portions is
• “Pure” = containing records of mostly one class (or, for prediction, records with similar
outcome values)
• Algorithm tries different values of x_i and s_i to maximize purity in the initial split
• After you get a “maximum purity” split, repeat the process for a second split (on any
variable), and so on
Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
64.8 17.2 non-owner
43.2 20.4 non-owner
84.0 17.6 non-owner
49.2 17.6 non-owner
59.4 16.0 non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner
How to split
• Order records according to one variable, say income
• Take a predictor value, say 60 (the first record) and divide records into those with income >=
60 and those < 60
• Measure resulting purity (homogeneity) of class in each resulting portion
• Try all other split values
• Repeat for other variable(s)
• Select the one variable & split that yields the most purity
Note: Categorical Variables
• Examine all possible ways in which the categories can be split.
• E.g., categories A, B, C can be split 3 ways
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
• With many categories, the number of possible binary splits becomes huge: m categories can be
split into two groups in 2^(m−1) − 1 ways (a small enumeration sketch follows)
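To make the count concrete, here is a small illustrative R sketch (not part of the original notes) that enumerates the distinct binary splits of a set of categories using base R only:
# enumerate all distinct binary splits of a set of categories
categories <- c("A", "B", "C", "D")
m <- length(categories)
# all nonempty proper subsets of the categories...
subsets <- unlist(lapply(1:(m - 1), function(k) combn(categories, k, simplify = FALSE)),
                  recursive = FALSE)
# ...keeping only those that contain the first category, so each split is counted once
splits <- Filter(function(s) categories[1] %in% s, subsets)
length(splits)   # 7
2^(m - 1) - 1    # also 7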
The first split: Income = 60
After All Splits
Measuring Impurity
Gini Index:
Gini index for rectangle A:
I(A) = 1 − Σ_{k=1}^{m} (p_k)^2
where p_k is the proportion of records in rectangle A that belong to class k (k = 1, …, m)
Entropy:
entropy(A) = − Σ_{k=1}^{m} p_k log2(p_k)
• Entropy ranges between 0 (most pure) and log2(m) (equal representation of classes)
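To make these measures concrete, here is a small R sketch (not part of the original notes) that computes the Gini index and entropy for a vector of class labels and evaluates the Income >= 60 split of the riding-mower data; it assumes the data have been read into mower.df as in the R code further below.
# impurity measures for a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]              # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}
# impurity of the two portions created by the split Income >= 60
left  <- mower.df$Ownership[mower.df$Income <  60]
right <- mower.df$Ownership[mower.df$Income >= 60]
gini(left); gini(right)
# overall impurity of the split = weighted average of the two portions
n <- nrow(mower.df)
(length(left) / n) * gini(left) + (length(right) / n) * gini(right)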
Impurity and Recursive Partitioning
• Obtain overall impurity measure (weighted avg. of individual rectangles)
• At each successive stage, compare this measure across all possible splits in all variables
• Choose the split that reduces impurity the most
• Chosen split points become nodes on the tree
R code:
library(rpart)
library(rpart.plot)
mower.df <- read.csv("RidingMowers.csv")
# fit a classification tree (the control setting below is an assumption; minsplit is lowered so
# this small dataset can be split beyond the root, matching the tree described after the plot)
class.tree <- rpart(Ownership ~ ., data = mower.df, method = "class",
                    control = rpart.control(minsplit = 2))
> prp(class.tree) # gives the following plot
# cp: complexity parameter. Any split that does not decrease the overall lack of fit by a factor of cp
# is not attempted. For instance, with anova splitting, this means that the overall R-squared must
# increase by cp at each step. The main role of this parameter is to save computing time by pruning
# off splits that are obviously not worthwhile. Essentially, the user informs the program that any
# split which does not improve the fit by cp will likely be pruned off by cross-validation, and hence
# the program need not pursue it. Default value: cp = 0.01.
# minsplit: the minimum number of observations that must exist in a node in order for a split to be
# attempted. Default value: minsplit = 20.
The first split is on Income; the next splits are on Lot Size for both the low-income group (at
Lot Size 21) and the high-income group (at Lot Size 20).
Example 2: Acceptance of Personal Loan
code for creating a default classification tree
library(rpart)
library(rpart.plot)
# load and preprocess the data (same file and columns as in the bagging/boosting example later)
bank.df <- read.csv("UniversalBank.csv")
bank.df <- bank.df[ , -c(1, 5)]  # drop ID and zip code columns
# partition
set.seed(1)
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]
# classification tree
default.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class")
# plot tree
prp(default.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10)
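As a small illustrative addition (not in the original listing), the number of terminal nodes (leaves) of this default tree can be counted with the same idiom used for the deeper tree below:
# count the leaves of the default tree (the deeper tree below has 53)
length(default.ct$frame$var[default.ct$frame$var == "<leaf>"])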
The Overfitting Problem
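The listing that creates the deeper tree referenced below did not survive in these notes. A hedged sketch is given here, following the same pattern as the default tree above; the control settings cp = 0 and minsplit = 1 are assumptions chosen so the tree keeps splitting until it fits the training data (nearly) perfectly.
# deeper, full-grown classification tree (cp = 0 and minsplit = 1 are assumed settings)
deeper.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class",
                   cp = 0, minsplit = 1)
# plot tree, shading the leaves
prp(deeper.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10,
    box.col = ifelse(deeper.ct$frame$var == "<leaf>", 'gray', 'white'))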
> length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"])
[1] 53
Full trees are too complex – they end up fitting noise, overfitting the data
code for classifying the validation data using a tree and computing the
confusion matrices and accuracy for the training and validation data
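The listing itself is not reproduced in these notes; mirroring the "repeat the code" block shown below for the deeper tree, it would look roughly like this (the Accuracy/Sensitivity/Specificity figures quoted below appear to come from caret's confusionMatrix(), which is shown here as an assumption about how they were produced):
# classify the training and validation data with the default tree and tabulate the results
default.ct.point.pred.train <- predict(default.ct, train.df, type = "class")
table(default.ct.point.pred.train, train.df$Personal.Loan)
default.ct.point.pred.valid <- predict(default.ct, valid.df, type = "class")
table(default.ct.point.pred.valid, valid.df$Personal.Loan)
# accuracy, sensitivity, and specificity via caret
library(caret)
confusionMatrix(default.ct.point.pred.train, as.factor(train.df$Personal.Loan))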
> table(default.ct.point.pred.train, train.df$Personal.Loan)
default.ct.point.pred.train 0 1
0 2696 26
1 13 265
Note:
Accuracy : 0.987 # = (2696+265)/(2696+26+13+265)
Sensitivity : 0.9952
Specificity : 0.9107
'Positive' Class : 0
> table(default.ct.point.pred.valid, valid.df$Personal.Loan)
default.ct.point.pred.valid 0 1
0 1792 18
1 19 171
>
> ### repeat the code for using the deeper tree
> default.ct.point.pred.train <- predict(deeper.ct, train.df,type = "class")
> table(default.ct.point.pred.train, train.df$Personal.Loan)
default.ct.point.pred.train 0 1
0 2709 0
1 0 291
>
> default.ct.point.pred.valid <- predict(deeper.ct, valid.df,type = "class")
> table(default.ct.point.pred.valid, valid.df$Personal.Loan)
default.ct.point.pred.valid 0 1
0 1788 19
1 23 170
>
Suppose a 2x2 confusion matrix table is denoted as follows (rows = predicted class, columns = reference class):
                 Reference
Predicted      Event    No Event
  Event          A          B
  No Event       C          D
PPV=(sensitivity∗prevalence)/((sensitivity∗prevalence)+((1−specificity)∗(1−prevalence)));
NPV=(specificity∗(1−prevalence))/(((1−sensitivity)∗prevalence)+((specificity)∗(1−prevalence)));
DetectionRate=A/(A+B+C+D); DetectionPrevalence=(A+B)/(A+B+C+D);
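As a check on these formulas, here is a small R sketch (not part of the original notes) that plugs in the training confusion matrix shown above, with class 0 treated as the "Event" (positive) class, as in the caret output:
# counts from the training confusion matrix above (positive class = 0)
A <- 2696; B <- 26; C <- 13; D <- 265
n <- A + B + C + D
accuracy    <- (A + D) / n      # 0.987
sensitivity <- A / (A + C)      # 0.9952
specificity <- D / (B + D)      # 0.9107
prevalence  <- (A + C) / n
# the formulas quoted above
PPV <- (sensitivity * prevalence) /
  ((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))
NPV <- (specificity * (1 - prevalence)) /
  (((1 - sensitivity) * prevalence) + (specificity * (1 - prevalence)))
detection.rate       <- A / n
detection.prevalence <- (A + B) / n
c(accuracy, sensitivity, specificity, PPV, NPV)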
Overfitting produces poor predictive performance – past a certain point in tree complexity, the
error rate on new data starts to increase.
Stopping tree growth - CHAID
• CHAID, older than CART, uses chi-square statistical test to limit tree growth
• Splitting stops when purity improvement is not statistically significant
One can think of different criteria for stopping tree growth before it starts overfitting the data.
Examples are tree depth (i.e., number of splits), minimum number of records in a terminal node, and
minimum reduction in impurity. In R’s rpart(), these correspond to arguments such as maxdepth,
minsplit/minbucket, and the complexity parameter cp. The problem is that it is not simple to
determine a good stopping point using such rules.
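For illustration, here is a hedged sketch (not from the original notes) of how such stopping rules might be imposed on the personal loan tree; the object name shallow.ct and the particular values are arbitrary examples, not recommendations:
# grow a shallow tree by imposing explicit stopping rules
shallow.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class",
                    control = rpart.control(maxdepth = 4,    # at most 4 levels of splits
                                            minbucket = 50,  # at least 50 records per terminal node
                                            cp = 0.005))     # minimum required improvement per split
prp(shallow.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10)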
Pruning the tree with the validation data solves the problem of overfitting, but it does not
address the problem of instability. Recall that the CART algorithm may be unstable in choosing
one or another variable for the top-level splits, and this effect then cascades down and produces
highly variable rule sets. The solution is to avoid relying on just one partition of the data into
training and validation. Rather, we do so repeatedly using cross-validation (see below), then
pool the results. Of course, just accumulating a set of different trees with their different rules
will not do much by itself. However, we can use the results from all those trees to learn how deep
to grow the original tree. In this process, we introduce a parameter that can measure, and
control, how deep we grow the tree. We will note this parameter value for each minimum-error
tree in the cross-validation process, take an average, then apply that average to limit tree growth
to this optimal depth when working with new data.
Tree instability
• If 2 or more variables are of roughly equal importance, which one CART chooses for the
first split can depend on the initial partition into training and validation
• A different partition into training/validation could lead to a different initial split
• This can cascade down and produce a very different tree from the first training/validation
partition
• Solution is to try many different training/validation splits – “cross validation”
Cross validation
Do many different partitions (“folds*”) into training and validation; grow and prune a tree for
each
Problem: We end up with lots of different pruned trees. Which one to choose?
Solution: Don’t choose a tree, choose a tree size:
For each iteration, record the cp that corresponds to the minimum validation error
Average these cp’s
With future data, grow tree to that optimum cp value
*typically folds are non-overlapping, i.e. data used in one validation fold will not be used in
others
Cross validation, “best pruned”
In the above procedure, we select the cp of the minimum-error tree
But… simpler is better: a slightly smaller tree might do just as well
Solution: add a cushion to the minimum error
Calculate the standard error of the cross-validation estimate – this gives a rough range for chance variation
Add the standard error to the minimum error to allow for chance variation
Choose the smallest tree whose error is within one std. error of the minimum error
You can then use the corresponding cp value when growing trees on future data (see the sketch below)
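A hedged sketch (not from the original notes) of how this "best-pruned" choice can be read off rpart's cptable; it assumes the cross-validated tree cv.ct that is fitted in the code further below:
# pick the cp of the smallest tree whose xerror is within one std. error of the minimum xerror
cpt <- cv.ct$cptable
min.idx   <- which.min(cpt[, "xerror"])
threshold <- cpt[min.idx, "xerror"] + cpt[min.idx, "xstd"]
best.idx  <- min(which(cpt[, "xerror"] <= threshold))  # cptable rows go from smallest to largest tree
best.cp   <- cpt[best.idx, "CP"]
best.pruned.ct <- prune(cv.ct, cp = best.cp)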
code for tabulating tree error as a function of the complexity parameter (CP)
Classification tree:
rpart(formula = Personal.Loan ~ ., data = train.df, method = "class",
cp = 1e-05, minsplit = 5, xval = 5)
n= 3000
Best-pruned tree obtained by fitting a full tree to the training data, pruning it using the cross-
validation data, and choosing the smallest tree within one standard error of the minimum xerror
tree
>
> set.seed(1)
> cv.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class", cp = 0.00001,
+                minsplit = 1, xval = 5)
>
> # minsplit is the minimum number of observations in a node for a split to be attempted.
> # xval is number K of folds in a K-fold cross-validation.
>
> printcp(cv.ct)
Classification tree:
rpart(formula = Personal.Loan ~ ., data = train.df, method = "class",
cp = 1e-05, minsplit = 1, xval = 5)
n= 3000
> # Print out the cp table of cross-validation errors. The R-squared for a regression tree is
> # 1 minus rel error. xerror (cross-validation error, where "x" stands for "cross") is a
> # scaled version of the overall average of the 5 out-of-sample errors across the 5 folds.
>
> pruned.ct <- prune(cv.ct, cp = 0.0154639)
> prp(pruned.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10,
+ box.col=ifelse(pruned.ct$frame$var == "<leaf>", 'gray', 'white'))
>
Regression Trees
The tree method can also be used for a numerical outcome variable. Regression trees for
prediction operate in much the same fashion as classification trees. The outcome variable (Y) is
a numerical variable in this case, but both the principle and the procedure are the same: many
splits are attempted, and for each, we measure “impurity” in each branch of the resulting tree.
Since the outcome is numerical, impurity is typically measured by the squared deviations from the
branch mean, and the prediction in a terminal node is the average outcome of the training records
in that node.
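A hedged sketch (not from the original notes) of fitting a regression tree with rpart; the file name and variable names are taken from the Toyota Corolla problem at the end of this note, and car.df is a hypothetical object name:
# regression tree: method = "anova" tells rpart the outcome is numerical
car.df <- read.csv("ToyotaCorolla.csv")
reg.tree <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP, data = car.df, method = "anova")
prp(reg.tree, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10)
# predictions in a terminal node are the average Price of the training records in that node
predict(reg.tree, head(car.df))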
9.8 Improving Prediction: Random Forests and Boosted Trees
Notwithstanding the transparency advantages of a single tree as described above, in a pure
prediction application, where visualizing a set of rules does not matter, better performance is
provided by several extensions to trees that combine results from multiple trees. These are
examples of ensembles (see Chapter 13). One popular multitree approach is random forests,
introduced by Breiman and Cutler. Random forests are a special case of bagging, a method for
improving predictive power by combining multiple classifiers or prediction algorithms. See
Chapter 13 for further details on bagging.
Description
randomForest implements Breiman’s random forest algorithm (based on Breiman and Cutler’s original
Fortran code) for classification and regression. It can also be used in unsupervised mode for
assessing proximities among data points.
Arguments:
xtest a data frame or matrix (like x) containing predictors for the test set.
ytest response for the test set.
ntree Number of trees to grow. This should not be set to too small a number, to ensure that every input
row gets predicted at least a few times.
mtry Number of variables randomly sampled as candidates at each split. Note that the default values are
different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
replace Should sampling of cases be done with or without replacement?
classwt Priors of the classes. Need not add up to one. Ignored for regression.
cutoff (Classification only) A vector of length equal to number of classes. The ‘winning’ class for an
observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k where k is
the number of classes (i.e., majority vote wins).
strata A (factor) variable that is used for stratified sampling.
sampsize Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of
strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be
drawn from the strata.
nodesize Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown
(and thus take less time). Note that the default values are different for classification (1) and regression
(5).
importance Should importance of predictors be assessed?
>
> library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.5.3
>
> ## random forest
> rf <- randomForest(as.factor(Personal.Loan) ~ ., data = train.df, ntree = 500,
+                    mtry = 4, nodesize = 5, importance = TRUE)
>
Unlike a single tree, results from a random forest cannot be displayed in a tree-like diagram,
thereby losing the interpretability that a single tree provides. However, random forests can
produce “variable importance” scores, which measure the relative contribution of the different
predictors. The importance score for a particular predictor is computed by summing up the
decrease in the Gini index for that predictor over all the trees in the forest. Figure 9.15 shows the
variable importance plots generated from the random forest model for the personal loan
example. We see that Income and Education have the highest scores, with Family being third.
Importance scores for the other predictors are considerably lower.
> ## variable importance plot
> varImpPlot(rf, type = 1)
>
> ## confusion matrix
> rf.pred <- predict(rf, valid.df)
> table(rf.pred, valid.df$Personal.Loan)
rf.pred 0 1
0 1801 19
1 10 170
The second type of multitree improvement is boosted trees. Here a sequence of trees is fitted, so
that each tree concentrates on misclassified records from the previous tree.
Boosting, like RF, is an ensemble method – but it uses an iterative approach in which each
successive tree focuses its attention on the records misclassified by the prior tree.
Validation-data confusion matrix for the boosted tree (cf. Table 9.5):
0 1
0 1803 17
1 8 172
>
Table 9.5 shows the result of running a boosted tree on the loan acceptance example that
we saw earlier. We can see that compared to the performance of the single pruned tree
(Table 9.3), the boosted tree has better performance on the validation data in terms of
overall accuracy and especially in terms of correct classification of 1’s—the rare class of
special interest. Where does boosting’s special talent for finding 1’s come from? When
one class is dominant (0’s constitute over 90% of the data here), basic classifiers are
tempted to classify cases as belonging to the dominant class, and the 1’s in this case
constitute most of the misclassifications with the single best-pruned tree. The boosting
algorithm concentrates on the misclassifications (which are mostly 1’s), so it is naturally
going to do well in reducing the misclassification of 1’s (from 18 in the single tree to 15 in
the boosted tree, in the validation set).
The adabag package in R can be used to generate bagged and boosted trees. Tables
13.1 and 13.2 show the R code and output producing a bagged tree and a boosted tree for
the personal loan data, and how they are used to generate classifications for the validation
set.
Example Of Bagging and Boosting Classification Trees on the Personal Loan Data: R code
> library(adabag)
> library(rpart)
> library(caret)
>
> bank.df <- read.csv("UniversalBank.csv")
> bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code columns.
>
> # transform Personal.Loan into categorical variable
> bank.df$Personal.Loan = as.factor(bank.df$Personal.Loan)
>
> # partition the data
> train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
> train.df <- bank.df[train.index, ]
> valid.df <- bank.df[-train.index, ]
>
> # single tree
> tr <- rpart(Personal.Loan ~ ., data = train.df)
> pred <- predict(tr, valid.df, type = "class")
> table(pred, valid.df$Personal.Loan)
pred 0 1
0 1803 13
1 10 174
>
Usage
bagging(formula, data, mfinal = 100, control, par=FALSE,...)
Arguments
formula a formula, as in the lm function.
data a data frame in which to interpret the variables named in the formula.
mfinal an integer, the number of iterations for which bagging is run or the number of trees
to use. Defaults to mfinal = 100 iterations.
control options that control details of the rpart algorithm. See rpart.control for
more details.
> # bagging
> bag <- bagging(Personal.Loan ~ ., data = train.df)
> pred <- predict(bag, valid.df, type = "class")
> table(pred$class, valid.df$Personal.Loan)
0 1
0 1806 25
1 7 162
>
Boosting:
Description Fits the AdaBoost.M1 (Freund and Schapire, 1996) and SAMME (Zhu et al., 2009) algorithms
using classification trees as single classifiers.
Usage
boosting(formula, data, boos = TRUE, mfinal = 100, coeflearn = 'Breiman', control,...)
Arguments
formula a formula, as in the lm function.
data a data frame in which to interpret the variables named in formula.
boos if TRUE (by default), a bootstrap sample of the training set is drawn using the weights for each
observation on that iteration. If FALSE, every observation is used with its weights.
mfinal an integer, the number of iterations for which boosting is run or the number of trees to use.
Defaults to mfinal=100 iterations.
coeflearn if ’Breiman’ (by default), alpha = (1/2) ln((1-err)/err) is used. If ’Freund’, alpha = ln((1-err)/err)
is used. In both cases the AdaBoost.M1 algorithm is used and alpha is the weight updating coefficient. On
the other hand, if coeflearn is ’Zhu’, the SAMME algorithm is implemented with alpha = ln((1-err)/err) +
ln(nclasses-1).
control options that control details of the rpart algorithm. See rpart.control for more details.
> # boosting
> boost <- boosting(Personal.Loan ~ ., data = train.df)
> pred <- predict(boost, valid.df, type = "class")
> table(pred$class, valid.df$Personal.Loan)
0 1
0 1810 18
1 3 169
>
Summary
• Classification and Regression Trees are an easily understandable and transparent method for
predicting or classifying new records
• A single tree is a graphical representation of a set of rules
• Tree growth must be stopped to avoid overfitting of the training data – cross-validation helps
you pick the right cp level to stop tree growth
• Ensembles (random forests, boosting) improve predictive performance, but you lose
interpretability and the rules embodied in a single tree
Problems
1. Competitive Auctions on eBay.com. The file eBayAuctions.csv contains
information on 1972 auctions that transacted on eBay.com during May–June 2004.
The goal is to use these data to build a model that will classify auctions as
competitive or noncompetitive. A competitive auction is defined as an auction with
at least two bids placed on the item auctioned. The data include variables that
describe the item (auction category), the seller (his/her eBay rating), and the
auction terms that the seller selected (auction duration, opening price, currency,
day-of-week of auction close). In addition, we have the price at which the auction
closed. The task is to predict whether or not the auction will be competitive.
Data Preprocessing. Convert variable Duration into a categorical variable. Split the
data into training (60%) and validation (40%) datasets.
a. Fit a classification tree using all predictors, using the best-pruned tree. To avoid
overfitting, set the minimum number of records in a terminal node to 50 (in
R: minbucket = 50). Also, set the maximum number of levels to be displayed at seven
(in R: maxdepth = 7). Write down the results in terms of rules. (Note: If you had to
slightly reduce the number of predictors due to software limitations, or for clarity of
presentation, which would be a good variable to choose?)
b. Is this model practical for predicting the outcome of a new auction?
c. Describe the interesting and uninteresting information that these rules provide.
d. Fit another classification tree (using the best-pruned tree, with a minimum number
of records per terminal node = 50 and maximum allowed number of displayed levels
= 7), this time only with predictors that can be used for predicting the outcome of a
new auction. Describe the resulting tree in terms of rules. Make sure to report the
smallest set of rules required for classification.
e. Plot the resulting tree on a scatter plot: Use the two axes for the two best
(quantitative) predictors. Each auction will appear as a point, with coordinates
corresponding to its values on those two predictors. Use different colors or symbols
to separate competitive and noncompetitive auctions. Draw lines (you can sketch
these by hand or use R) at the values that create splits. Does this splitting seem
reasonable with respect to the meaning of the two predictors? Does it seem to do a
good job of separating the two classes?
f. Examine the lift chart and the confusion matrix for the tree. What can you say about
the predictive performance of this model?
g. Based on this last tree, what can you conclude from these data about the chances of
an auction obtaining at least two bids and its relationship to the auction settings set
by the seller (duration, opening price, ending day, currency)? What would you
recommend for a seller as the strategy that will most likely lead to a competitive
auction?
3. Predicting Prices of Used Cars (Regression Trees). The
file ToyotaCorolla.csv contains the data on used cars (Toyota Corolla) on sale
during late summer of 2004 in the Netherlands. It has 1436 records containing
details on 38 attributes, including Price, Age, Kilometers, HP, and other
specifications. The goal is to predict the price of a used Toyota Corolla based on
its specifications. (The example in Section 9.19 is a subset of this dataset).
Data Preprocessing. Split the data into training (60%) and validation (40%) datasets.
a. Run a regression tree (RT) with outcome variable Price and predictors Age_08_04,
KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee,
Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows,
Sport_Model, and Tow_Bar. Keep the minimum number of records in a terminal
node to 1, maximum number of tree levels to 30, and cp = 0.001, to make the run
least restrictive.
i. Which appear to be the three or four most important car specifications for
predicting the car’s price?
ii. Compare the prediction errors of the training and validation sets by
examining their RMS error and by plotting the two boxplots. How does the
predictive performance of the validation set compare to the training set?
Why does this occur?
iii. How might we achieve better validation predictive performance at the
expense of training performance?
iv. Create a less deep tree by leaving the arguments cp, minbucket, and
maxdepth at their defaults. Compared to the deeper tree, what is the
predictive performance on the validation set?
b. Let us see the effect of turning the price variable into a categorical
variable. First, create a new variable that categorizes price into 20 bins. Now
repartition the data keeping Binned_Price instead of Price. Run a
classification tree with the same set of input variables as in the RT, and with
Binned_Price as the output variable. As in the less deep regression tree, leave
the arguments cp, minbucket, and maxdepth at their defaults.
i. Compare the tree generated by the CT with the one generated by the less
deep RT. Are they different? (Look at structure, the top predictors, size of
tree, etc.) Why?
ii. Predict the price, using the less deep RT and the CT, of a used Toyota Corolla
with the specifications listed in Table 9.6.
iii. Compare the predictions in terms of the predictors that were used, the
magnitude of the difference between the two predictions, and the
advantages and disadvantages of the two methods.
Table 9.6 Specifications For A Particular Toyota Corolla
Variable            Value
Age_08_04           77
KM                  117,000
Fuel_Type           Petrol
HP                  110
Automatic           No
Doors               5
Quarterly_Tax       100
Mfr_Guarantee       No
Guarantee_Period    3
Airco               Yes
Automatic_Airco     No
CD_Player           No
Powered_Windows     No
Sport_Model         No
Tow_Bar             Yes