Figure 1 A simple example of a classification tree describing the relationship between presence/absence of P. menziesii and explanatory factors of elevation (ELEV) and aspect (ASP) in the mountains of northern Utah. Thin-lined boxes indicate a node from which a split emerges; thick-lined boxes indicate a terminal node.
Least absolute deviations. This method minimizes the mean absolute deviation from the median within a node. The advantage of this over least squares is that it is not as sensitive to outliers and provides a more robust model. The disadvantage is insensitivity when dealing with data sets containing a large proportion of zeros.

For Classification Trees

There are many criteria by which node impurity is minimized in a classification problem, but four commonly used metrics include:

Misclassification error. The misclassification error is simply the proportion of observations in the node that are not members of the majority class in that node.

Gini index. Suppose there are a total of $K$ classes, each indexed by $k$. Let $\hat{p}_{mk}$ be the proportion of class $k$ observations in node $m$. The Gini index can then be written as $\sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$. This measure is frequently used in practice, and is more sensitive than the misclassification error to changes in node probability.

Entropy index. Also called the cross-entropy or deviance measure of impurity, the entropy index can be written $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$. This too is more sensitive than misclassification error to changes in node probability.

Twoing. Designed for multiclass problems, this approach favors separation between classes rather than node heterogeneity. Every multiclass split is treated as a binary problem. Splits that keep related classes together are favored. The approach offers the advantage of revealing similarities between classes and can be applied to ordered classes as well.
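To make these measures concrete, the following is a minimal sketch (not taken from this article) of how the misclassification error, Gini index, and entropy of a single node can be computed in Python with NumPy; the function name node_impurity and the example labels are illustrative assumptions.

```python
# A minimal sketch of the three impurity measures discussed above.
# y_node holds the class labels of the observations falling in one node m.
import numpy as np

def node_impurity(y_node):
    """Return misclassification error, Gini index, and entropy for one node."""
    _, counts = np.unique(y_node, return_counts=True)
    p = counts / counts.sum()              # p_mk: class proportions in the node
    misclass = 1.0 - p.max()               # 1 - proportion of the majority class
    gini = np.sum(p * (1.0 - p))           # sum_k p_mk (1 - p_mk)
    entropy = -np.sum(p * np.log(p))       # -sum_k p_mk log p_mk
    return misclass, gini, entropy

# Example: a node containing 6 'present' and 2 'absent' observations.
print(node_impurity(np.array(["present"] * 6 + ["absent"] * 2)))
```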
Pruning

A tree can be grown to be quite large, almost to the point where it fits the training data perfectly, that is, sometimes having just one observation in each leaf.
However, this results in overfitting and poor predictions on independent test sets. A tree may also be constructed that is too small and does not extract all the useful relationships that exist. Appropriate tree size can be determined in a number of ways. One way is to set a threshold for the reduction in the impurity measure, below which no split will be made. A preferred approach is to grow an overly large tree until some minimum node size is reached, and then prune the tree back to an optimal size. Optimal size can be determined using an independent test set or cross-validation (described below). In either case, what results is a tree of optimal size accompanied by an independent measure of its error rate.

Independent Test Set

If the sample size is sufficiently large, the data can be divided randomly into two subsets, one for training and the other for testing. Defining sufficiently large is problem specific, but one rule of thumb in classification problems is to allow a minimum of 200 observations for a binary classification model, with an additional 100 observations for each additional class. An overly large tree is grown on the training data. Then, using the test set, error rates are calculated for the full tree as well as for all smaller subtrees (i.e., trees having fewer terminal nodes than the full tree). Error rates for classification trees are typically the overall misclassification rate, while for regression problems, mean squared error or mean absolute deviation from the median are the criteria used to rank trees of different size. The subtree with the smallest error rate on the independent test set is then chosen as the optimal tree.
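The sketch below is an illustration of this idea, not code from the article. It assumes scikit-learn is available, and rather than pruning nested subtrees of one overly large tree, it approximates the procedure by varying the maximum number of terminal nodes directly and ranking the resulting trees by their misclassification rate on an independent test set.

```python
# Hedged sketch: test-set-based selection of tree size with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; in practice X, y would be the modeler's own observations.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Candidate tree sizes, from an overly large tree down to a stump.
test_error = {}
for n_leaves in [256, 128, 64, 32, 16, 8, 4, 2]:
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    # Overall misclassification rate on the independent test set.
    test_error[n_leaves] = np.mean(tree.predict(X_test) != y_test)

best_size = min(test_error, key=test_error.get)
print(test_error, "-> chosen size:", best_size)
```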
Cross-Validation

If the sample size is not large, it is necessary to retain all the data for training purposes. However, pruning and testing must be done using independent data. A way around the dilemma is v-fold cross-validation. Here, all the data are used to fit an initial overly large tree. The data are then divided into (usually) v = 10 subgroups, and 10 separate models are fit. The first model uses subgroups 1–9 for training and subgroup 10 for testing. The second model uses subgroups 1–8 and 10 for training and subgroup 9 for testing, and so on. In all cases, an independent test subgroup is available. These 10 test subgroups are then combined to give independent error rates for the initial overly large tree, which was fit using all the data. Pruning of this initial tree proceeds as it did in the case of the independent test set, where error rates are calculated for the full tree as well as for all smaller subtrees. The subtree with the smallest cross-validated error rate is then chosen as the optimal tree.

Questions often arise as to whether one should use an independent test set or cross-validated estimates of error rates. One thing to consider is that cross-validated error rates are based on models built with only 90% of the data. Consequently, they will not be as good as a model built with all of the data and will consistently result in slightly higher error rates, providing the modeler with a conservative, independent estimate of error. However, in regression tree applications in particular, this overestimate of error can be substantially higher than the truth, giving the modeler more incentive to find an independent test set.
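The sketch below (illustrative only, not the article's own code) mimics the 10-fold procedure with scikit-learn: for each candidate tree size, the predictions from the 10 held-out subgroups are combined into a single cross-validated misclassification rate. As in the previous sketch, tree size is varied directly via the maximum number of terminal nodes rather than by pruning nested subtrees.

```python
# Hedged sketch: 10-fold cross-validated error rates for trees of several sizes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)

cv_error = {}
for n_leaves in [64, 32, 16, 8, 4, 2]:
    wrong = 0
    for train_idx, test_idx in kfold.split(X):
        tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=1)
        tree.fit(X[train_idx], y[train_idx])
        # Each observation is predicted exactly once, by the model that
        # did not see it during training.
        wrong += np.sum(tree.predict(X[test_idx]) != y[test_idx])
    cv_error[n_leaves] = wrong / len(y)   # combined error over the 10 held-out folds

print(cv_error)
```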
1-SE Rule

Under both the testing and cross-validation approaches above, tree size was based on the minimum error rate. A slight modification of this strategy is often used, where the smallest tree is selected such that its error rate is within one standard error of the minimum. This results in more parsimonious trees, with little sacrifice in error.
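A small numerical sketch of the 1-SE rule follows (not from the article): given per-fold error rates for each candidate tree size, compute the mean and standard error, then take the smallest tree whose mean error lies within one standard error of the overall minimum. The fold errors below are made-up, purely illustrative numbers.

```python
# Hedged sketch of the 1-SE rule.
import numpy as np

# Per-fold misclassification rates for candidate trees with 16, 8, 4, and 2
# terminal nodes across 10 cross-validation folds (illustrative values only).
fold_errors = {
    16: np.array([0.20, 0.22, 0.18, 0.25, 0.21, 0.19, 0.23, 0.20, 0.24, 0.22]),
     8: np.array([0.21, 0.23, 0.20, 0.24, 0.22, 0.20, 0.22, 0.21, 0.23, 0.22]),
     4: np.array([0.26, 0.28, 0.25, 0.30, 0.27, 0.26, 0.29, 0.27, 0.28, 0.27]),
     2: np.array([0.35, 0.37, 0.34, 0.38, 0.36, 0.35, 0.37, 0.36, 0.38, 0.36]),
}

means = {size: e.mean() for size, e in fold_errors.items()}
ses = {size: e.std(ddof=1) / np.sqrt(len(e)) for size, e in fold_errors.items()}

best = min(means, key=means.get)         # tree size with the minimum mean error
threshold = means[best] + ses[best]      # minimum error plus one standard error
# Smallest tree (fewest leaves) whose mean error does not exceed the threshold.
chosen = min(size for size in means if means[size] <= threshold)
print("minimum-error size:", best, "| 1-SE choice:", chosen)
```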
Costs

The notion of costs is interlaced with the issues of splitting criteria and pruning, and is used in a number of ways in fitting and assessing classification trees.

Costs of Explanatory Variables and Misclassification

In many applications, some explanatory variables are much more expensive to collect or process than others. Preference may be given to choosing less expensive explanatory variables in the splitting process by assigning costs or scalings to be applied when considering splits. This way, the improvement made by splitting on a particular variable is downweighted by its cost in determining the final split.

Other times in practice, the consequences of misclassifying one class are greater than those of misclassifying another. It is therefore possible to give preference to correctly classifying certain classes, or even to assign specific costs to how an observation is misclassified, that is, which wrong class it falls in.
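One common, if partial, way to express such preferences in software is through class weights. The sketch below (not from the article) uses scikit-learn's class_weight argument to penalize errors on one class more heavily than the other; a full cost matrix assigning a different cost to each kind of wrong classification is not captured by this simple weighting.

```python
# Hedged sketch: weighting one class more heavily so the tree
# prefers to classify it correctly.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative, imbalanced two-class data.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2],
                           random_state=2)

# Treat a missed observation of class 1 as five times as costly
# as a missed observation of class 0.
weighted_tree = DecisionTreeClassifier(class_weight={0: 1, 1: 5},
                                       max_leaf_nodes=16, random_state=2)
weighted_tree.fit(X, y)

unweighted_tree = DecisionTreeClassifier(max_leaf_nodes=16, random_state=2)
unweighted_tree.fit(X, y)

# The weighted tree will generally assign class 1 to a larger share of
# observations than the unweighted tree does.
print((weighted_tree.predict(X) == 1).mean(),
      (unweighted_tree.predict(X) == 1).mean())
```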
Cost of Tree Complexity

As discussed in the pruning section, an overly large tree can easily be grown to some user-defined minimum node size. Often, though, the final tree selected through pruning is substantially smaller than the original overly large tree.
In the case of regression trees, the final tree may be 10 times smaller. This can represent a substantial amount of wasted computing time. Consequently, one can specify a penalty for cost complexity, which is equal to the resubstitution error rate (the error obtained using just the training data) plus some penalty parameter multiplied by the number of nodes. A very large tree will have a low misclassification rate but a high penalty, while a small tree will have a high misclassification rate but a low penalty. Cost complexity can be used to reduce the size of the initial overly large tree grown prior to pruning, which can greatly improve computational efficiency, particularly when cross-validation is being used.

One process that combines the cross-validation and cost-complexity ideas is to generate a sequence of trees of increasing size by gradually decreasing the penalty parameter in the cost-complexity approach. Tenfold cross-validation is then applied to this relatively small set of trees to choose the smallest tree whose error falls within one standard error of the minimum. Because a modeler might see a different tree size chosen each time a tenfold cross-validation procedure is run, multiple (e.g., 50) tenfold runs may be performed, with the most frequently appearing tree size chosen.
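As a hedged illustration of this combined procedure (the article does not prescribe any particular software), scikit-learn exposes the cost-complexity penalty directly: cost_complexity_pruning_path returns the sequence of penalty values at which pruning the overly large tree changes its size, and each candidate can then be scored by tenfold cross-validation.

```python
# Hedged sketch: cost-complexity pruning combined with tenfold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=3)

# Penalty values at which pruning the overly large tree changes its size.
path = DecisionTreeClassifier(random_state=3).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]        # drop the last value, which leaves only the root

cv_error = []
for alpha in alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=3)
    scores = cross_val_score(tree, X, y, cv=10)    # accuracy in each of 10 folds
    cv_error.append(1.0 - scores.mean())           # cross-validated misclassification

best_alpha = alphas[int(np.argmin(cv_error))]
print("penalty chosen by minimum cross-validated error:", best_alpha)
```

The 1-SE selection sketched earlier could be applied to these cross-validated errors in place of the simple minimum, and the loop could be repeated (e.g., 50 times) with the most frequently chosen tree size retained, as described above.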
Additional Tree-Fitting Issues

Although the main issues in fitting classification and regression trees revolve around splitting, pruning, and costs, numerous other details remain. Several of these are discussed below.

Heteroscedasticity

In the case of regression trees, heteroscedasticity, or the tendency for higher-valued responses to have more variation, can be problematic. Because regression trees seek to minimize within-node impurity, there will be a tendency to split nodes with high variance, yet the observations within such a node may, in fact, belong together. The remedy is to apply variance-stabilizing transformations to the response, as one would do in a linear regression problem. Although regression trees are invariant to monotonic transformations of the explanatory variables, transformations such as the natural log or square root may be appropriate for the response variable.
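For example (an illustrative sketch, not from the article), a regression tree can simply be fit to a log-transformed response, with predictions back-transformed afterwards; scikit-learn's DecisionTreeRegressor and synthetic data are assumed here.

```python
# Hedged sketch: variance-stabilizing log transform of the response
# before fitting a regression tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(300, 2))
# An illustrative response whose spread grows with its mean (heteroscedastic).
y = np.exp(0.3 * X[:, 0] + rng.normal(scale=0.4, size=300))

tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=4)
tree.fit(X, np.log(y))                 # fit on the transformed response
pred = np.exp(tree.predict(X))         # back-transform predictions to original scale
print(pred[:5])
```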
Linear Structure

Classification and regression trees are not particularly useful when it comes to deciphering linear relationships, having no choice but to produce a long line of splits on the same variable. If the modeler suspects strong linear relationships, small trees can first be fit to the data to partition it into a few more similar groups, and standard parametric models can then be run on these groups. Another alternative, available in some software packages, is to create linear combinations of the explanatory variables and enter these as new explanatory variables for the tree.
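A hedged sketch of the first alternative: fit a deliberately small regression tree, use its terminal nodes to partition the data, and then fit an ordinary linear regression within each group. scikit-learn's apply method, which returns the terminal node of each observation, is used here; the data and variable names are illustrative assumptions.

```python
# Hedged sketch: a small tree to partition the data, then a linear model per group.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(400, 2))
# Illustrative response: group-dependent linear structure plus noise.
y = np.where(X[:, 0] > 0, 2.0 * X[:, 1] + 1.0, -1.5 * X[:, 1]) + rng.normal(size=400)

# Step 1: a deliberately small tree defines a few groups (terminal nodes).
small_tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=5).fit(X, y)
groups = small_tree.apply(X)           # terminal-node id for each observation

# Step 2: a standard parametric model within each group.
models = {}
for g in np.unique(groups):
    mask = groups == g
    models[g] = LinearRegression().fit(X[mask], y[mask])
    print(f"node {g}: n = {mask.sum()}, coefficients = {models[g].coef_}")
```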
Competitors and Surrogates

It should be noted that when selecting splits, classification and regression trees may track the competitive splits at each decision point along the way. A competitive split is one that results in nearly as pure a node as the chosen split. Classification and regression trees may also keep track of surrogate variables. Use of a surrogate variable at a given split results in a similar node impurity measure (as would a competitor) but also mimics the chosen split itself in terms of which, and how many, observations go which way in the split.

Missing Values

As mentioned before, one of the advantages of classification and regression trees is their ability to accommodate missing values. If a response variable is missing, that observation can be excluded from the analysis or, in the case of a classification problem, treated as a new class (e.g., missing) to identify any potential patterns in the loss of information. If explanatory variables are missing, trees can use surrogate variables in their place to determine the split. Alternatively, an observation can be passed to the next node using a variable that is not missing for that observation.

Observation Weights

There are a number of instances where it might be desirable to give more weight to certain observations in the training set. Examples include a training sample with a disproportionate number of cases in certain classes, or data collected under a stratified design with one stratum having greater or lesser sampling intensity. In these cases, observations can be weighted to reflect the importance each should bear.
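A minimal sketch of observation weighting (not from the article), assuming scikit-learn: weights reflecting unequal sampling intensity are passed to the tree through the sample_weight argument of fit; the stratum definition below is purely illustrative.

```python
# Hedged sketch: observation weights reflecting unequal sampling intensity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=6)

# Pretend the first 100 observations came from a stratum sampled four times
# as intensively as the rest, so each carries one quarter of the weight.
weights = np.ones(len(y))
weights[:100] = 0.25

tree = DecisionTreeClassifier(max_leaf_nodes=16, random_state=6)
tree.fit(X, y, sample_weight=weights)   # weights enter the impurity calculations
print(tree.get_n_leaves())
```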
Variable Importance

The importance of individual explanatory variables can be determined by measuring the proportion of variability accounted for by splits associated with each explanatory variable. Alternatively, one may address variable