Decision Tree
Katydid or Grasshopper?
For any domain of interest, we can measure features. For an insect, these might include: abdomen length, thorax length, antennae length, mandible size, spiracle diameter, and leg length.
We can store the measured features in a database. The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a previously unseen instance

My_Collection
Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid
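As a concrete sketch (not part of the original slides, and assuming scikit-learn as the toolchain), this is how the stated problem looks in code, using the two features and class labels from the table above:

# A minimal sketch: train a decision tree on My_Collection, then
# predict the class label of a previously unseen instance.
from sklearn.tree import DecisionTreeClassifier

# (abdomen length, antennae length) and class, copied from the table
X = [[2.7, 5.5], [8.0, 9.1], [0.9, 4.7], [1.1, 3.1], [5.4, 8.5],
     [2.9, 1.9], [6.1, 6.6], [0.5, 1.0], [8.3, 6.6], [8.1, 4.7]]
y = ["Grasshopper", "Katydid", "Grasshopper", "Grasshopper", "Katydid",
     "Grasshopper", "Katydid", "Grasshopper", "Katydid", "Katydid"]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[7.0, 5.0]]))  # an unseen insect: abdomen 7.0, antennae 5.0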
[Scatter plot: Antenna Length (1–10) vs. Abdomen Length (1–10), with the Grasshoppers and Katydids from My_Collection marked]
We will also use this larger dataset as a motivating example…

[Scatter plot: Antenna Length (1–10) vs. Abdomen Length (1–10) for the larger dataset]

Each of these data objects is called…
• exemplars
• (training) examples
• instances
• tuples
Decision Tree Classifier (Ross Quinlan)

[Scatter plot: the Antenna Length vs. Abdomen Length space, partitioned by the tree’s axis-parallel tests]

Abdomen Length > 7.1?
• yes: Katydid
• no: Antenna Length > 6.0?
  – yes: Katydid
  – no: Grasshopper
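Written out as code, the tree is just nested threshold tests; a Python sketch using the thresholds shown above:

def classify(abdomen_length, antenna_length):
    # Root test: long abdomens are Katydids regardless of antennae.
    if abdomen_length > 7.1:
        return "Katydid"
    # Otherwise, fall through to the antenna-length test.
    if antenna_length > 6.0:
        return "Katydid"
    return "Grasshopper"

print(classify(8.1, 4.7))  # instance 10 from My_Collection -> Katydid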
Antennae shorter than body?
• Yes: Grasshopper
• No: 3 Tarsi?
  – No: Cricket
  – Yes: …
[Table of persons with Hair Length, Weight, Age, and Class; the final row is an unseen instance: Comic, 8”, 290, 38, ?]
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified… So we simply recurse! This time we find that we can split the under-160 group on Hair Length, and we are done!
We don’t need to keep the data around, just the test conditions:

Weight <= 160?
• yes: Hair Length <= 2?
  – yes: Male
  – no: Female
• no: Male

How would these people be classified?

It is trivial to convert Decision Trees to rules…
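For example, each root-to-leaf path of the tree above becomes one rule. A Python sketch of the resulting rule set (feature names as on the slide):

# Rule 1: Weight > 160                       -> Male
# Rule 2: Weight <= 160 and Hair Length <= 2 -> Male
# Rule 3: Weight <= 160 and Hair Length > 2  -> Female
def classify(weight, hair_length):
    if weight > 160:
        return "Male"
    if hair_length <= 2:
        return "Male"
    return "Female"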
For example, the rule “Wears green?” perfectly classifies the data; so does “Mother’s name is Jacqueline?”; so does “Has blue shoes”…
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
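A sketch of the postpruning idea, assuming scikit-learn (not named in the slides): grow a full tree on the training data, generate the sequence of progressively pruned trees via cost-complexity pruning, and keep the tree that scores best on a held-out validation set.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Each ccp_alpha along the path corresponds to one progressively pruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),  # pick the "best pruned tree"
)
print(best.get_n_leaves(), best.score(X_val, y_val))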
Advantages/Disadvantages of Decision Trees
• Advantages:
– Easy to understand (Doctors love them!)
– Easy to generate rules
• Disadvantages:
– May suffer from overfitting.
– Classifies by rectangular partitioning (so does
not handle correlated features very well).
– Can be quite large – pruning is necessary.
– Does not handle streaming data easily
Classification and Prediction
Why Classification? A motivating application
• Credit approval
– A bank wants to classify its customers based on whether they are
expected to pay back their approved loans
– The history of past customers is used to train the classifier
– The classifier provides rules, which identify potentially reliable
future customers
– Classification rule:
• If age = “31...40” and income = high then credit_rating = excellent
– Future customers
• Paul: age = 35, income = high → excellent credit rating
• John: age = 20, income = medium → fair credit rating
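The rule above is simple enough to state directly in code; a sketch (a real classifier would learn many such rules from the customer history):

def credit_rating(age, income):
    # The learned rule from the slide; everything else defaults to "fair".
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    return "fair"

print(credit_rating(35, "high"))    # Paul -> excellent
print(credit_rating(20, "medium"))  # John -> fair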
Descriptive and Predictive Modeling
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test samples is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
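A minimal sketch of the two steps, assuming scikit-learn (the iris data stands in for any labeled dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Keep the test set independent of the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # step 1: model construction
print("accuracy:", model.score(X_test, y_test))         # step 2: model usage / accuracy estimate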
Classification Process (1): Model Construction

[Diagram: the Training Data is fed to the Classification Algorithms, which output the Classifier (Model)]

Classification Process (2): Use the Model in Prediction

[Diagram: the Classifier is applied to the Testing Data to estimate accuracy, then to Unseen Data]

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Mellisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
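A runnable Python sketch of the construction phase described above (illustrative only: attribute selection is simplified to a fixed order, where real induction would pick the best attribute at each node, e.g. by information gain):

from collections import Counter

def build_tree(rows, attrs):
    # All training examples start at the root; stop on a pure node
    # or when no attributes remain, returning the majority class.
    classes = [r["class"] for r in rows]
    if len(set(classes)) == 1 or not attrs:
        return Counter(classes).most_common(1)[0][0]
    # Partition the examples on the chosen attribute and recurse.
    attr, rest = attrs[0], attrs[1:]
    return {attr: {v: build_tree([r for r in rows if r[attr] == v], rest)
                   for v in {r[attr] for r in rows}}}

toy = [{"student": "no",  "income": "high", "class": "no"},
       {"student": "yes", "income": "high", "class": "yes"},
       {"student": "yes", "income": "low",  "class": "yes"}]
print(build_tree(toy, ["student", "income"]))  # e.g. {'student': {'no': 'no', 'yes': 'yes'}}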
Ways of attribute splitting
• Multi-way split on a categorical attribute, one branch per value: {single} {married} {divorced}
• Binary split by grouping values: {single} vs. {married, divorced}
• Split on a continuous attribute by discretizing into ranges: {<10K}, {>10K, <25K}, {>25K, <50K}
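A small Python sketch (an assumed example, not from the slides) contrasting the three split styles:

def multiway_split(status):
    # One branch per categorical value: {single} {married} {divorced}
    return status

def binary_split(status):
    # Grouped binary split: {single} vs. {married, divorced}
    return "single" if status == "single" else "married_or_divorced"

def income_range(income_k):
    # Continuous attribute discretized into the ranges above (in $K)
    if income_k < 10:
        return "<10K"
    if income_k < 25:
        return ">10K, <25K"
    return ">25K, <50K"

print(binary_split("divorced"), income_range(18))  # married_or_divorced >10K, <25K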
Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
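As a sketch of what an induction algorithm computes on this table (not shown in the slides themselves), the following Python snippet calculates the information gain of the age attribute; its gain of about 0.246 bits is the highest of the four attributes, which is why age becomes the root split in the tree below:

from math import log2
from collections import Counter

ages = ["<=30", "<=30", "31...40", ">40", ">40", ">40", "31...40",
        "<=30", "<=30", ">40", "<=30", "31...40", "31...40", ">40"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

info_d = entropy(buys)  # expected information of the whole table, ~0.940 bits
info_age = sum(         # weighted entropy after splitting on age, ~0.694 bits
    ages.count(v) / len(ages) * entropy([b for a, b in zip(ages, buys) if a == v])
    for v in set(ages))
print("Gain(age) =", round(info_d - info_age, 3))  # ~0.246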
Output: A Decision Tree for “buys_computer”

age?
• <=30: student?
  – no: no
  – yes: yes
• 31…40: yes
• >40: credit_rating?
  – excellent: no
  – fair: yes
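The same tree as nested tests, in a Python sketch (branch labels follow the tree above):

def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31...40":
        return "yes"
    # age > 40: the decision depends on credit rating.
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer(">40", "no", "excellent"))  # -> no, matching the last training row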