Decision Trees
Function Approximation
Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }
• Input: training examples ⟨xi, yi⟩ of the unknown target function f
Decision Tree
• A possible decision tree for the data:
Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector
– e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f : X → Y
– Y is discrete-valued
• Set of function hypotheses H = { h | h : X → Y }
– each hypothesis h is a decision tree
– each tree sorts x to a leaf, which assigns a label y
Train the model:
(X, Y) → learner → model
model ← classifier.train(X, Y)
Apply the model to new data:
• Given: new unlabeled instance x ∼ D(X)
x → model → y_prediction
y_prediction ← model.predict(x)
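As a concrete illustration of this train/predict interface, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the integer encoding of the weather features and the toy data are invented for illustration, not taken from the slides.

# Minimal sketch of the train/predict workflow with scikit-learn.
# The toy data and feature encoding below are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row is a feature vector <Humidity, Wind, Outlook, Temp>,
# encoded as integers (e.g., Humidity: 0 = low, 1 = high).
X = [[0, 0, 2, 1],
     [1, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 2, 0]]
Y = [1, 0, 1, 0]                              # discrete labels y

model = DecisionTreeClassifier().fit(X, Y)    # model <- classifier.train(X, Y)

x_new = [[0, 1, 1, 1]]                        # new unlabeled instance x
y_prediction = model.predict(x_new)           # y_prediction <- model.predict(x)
print(y_prediction)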
Example Application: A Tree to Predict Caesarean Section Risk
[Figure: a decision tree over the features Color, Size, and Shape. The root tests Color (green / blue / red): the green branch tests Size (big → −, small → +); the blue branch predicts +; the red branch tests Shape (square → a further Size test: big → −, small → +; round → +).]
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles (see the sketch below)
• Each rectangular region is labeled with one label
– or a probability distribution over labels
[Figure: the axis-parallel decision boundary induced by a decision tree over two features.]
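To make the axis-parallel picture concrete, here is an illustrative sketch (the two features, thresholds, and labels are invented): a depth-2 tree over numeric features is just nested threshold tests, so each leaf covers a rectangle of the plane.

# Illustrative depth-2 tree over two numeric features x1, x2.
# The thresholds and labels are invented; the point is that each
# root-to-leaf path is a conjunction of axis-parallel threshold
# tests, so each leaf covers a rectangle of the feature space.
def predict(x1: float, x2: float) -> str:
    if x1 <= 0.5:                 # vertical split at x1 = 0.5
        if x2 <= 0.3:             # horizontal split at x2 = 0.3
            return "-"            # rectangle: x1 <= 0.5, x2 <= 0.3
        return "+"                # rectangle: x1 <= 0.5, x2 > 0.3
    if x2 <= 0.7:                 # horizontal split at x2 = 0.7
        return "+"                # rectangle: x1 > 0.5, x2 <= 0.7
    return "-"                    # rectangle: x1 > 0.5, x2 > 0.7

print(predict(0.2, 0.9))   # falls in the rectangle x1 <= 0.5, x2 > 0.3 -> "+"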
Expressiveness
• Decision trees can represent any boolean function of the input attributes
• In the worst case, the tree will require exponentially many nodes (see the parity sketch below)
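A standard worst-case illustration (this example is not from the slides) is the parity function: flipping any single input bit flips the label, so no root-to-leaf path can stop early and the smallest exact tree has 2^n leaves. A tiny sketch of the counting argument:

from itertools import product

# Illustrative worst case (not from the slides): n-bit parity.
# Because flipping any single input bit flips the label, no path can stop
# early; the smallest exact decision tree is the complete tree with 2**n
# leaves, one per full assignment of the n boolean features.
def parity(bits):
    return sum(bits) % 2

n = 4
assignments = list(product([0, 1], repeat=n))   # one leaf per assignment
print(f"{len(assignments)} leaves needed for {n}-bit parity")   # 2**4 = 16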
Expressiveness
Decision trees have a variable-sized hypothesis space
• As the #nodes (or depth) increases, the hypothesis
space grows
– Depth 1 (“decision stump”): can represent any boolean function of one feature
– Depth 2: any boolean function of two features, and some involving three features, e.g., (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3) (see the sketch below)
– etc.
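To see why depth 2 suffices for (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3), let the root test x1; each branch then needs only one further test. A quick sketch (my own, not from the slides) that checks this against the formula on all eight inputs:

from itertools import product

def formula(x1, x2, x3):
    # (x1 AND x2) OR (NOT x1 AND NOT x3)
    return (x1 and x2) or (not x1 and not x3)

def depth2_tree(x1, x2, x3):
    # Root tests x1; each branch needs only one further test,
    # so the tree has depth 2 even though three features appear.
    if x1:
        return bool(x2)        # right branch: leaves decided by x2
    return not x3              # left branch: leaves decided by x3

# The depth-2 tree agrees with the formula on every input.
assert all(formula(*b) == depth2_tree(*b)
           for b in product([False, True], repeat=3))
print("depth-2 tree matches the formula on all 8 inputs")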
Which split is more informative: Patrons? or Type?
Based on slide by Pedro Domingos
Impurity
Based on slide by Pedro Domingos
Entropy: a common way to measure impurity
Entropy H(X) of a random variable X with n possible values:
H(X) = − ∑_{i=1}^{n} P(X = i) · log₂ P(X = i)
Based on slide by Pedro Domingos
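As a quick sanity check of the definition, here is a small helper (my own sketch, not from the slides) that computes H from a list of class counts:

from math import log2

def entropy(counts):
    """Entropy (in bits) of a discrete distribution given by class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # 0 * log2(0) taken as 0
    return -sum(p * log2(p) for p in probs)

print(entropy([1, 1]))    # 1.0    (uniform over 2 values: maximally impure)
print(entropy([1, 12]))   # ~0.391 (the child node used in the example below)
print(entropy([13, 0]))   # 0.0    (pure node)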
From Entropy to Information Gain
Entropy H(X) of a random variable X: H(X) = − ∑_i P(X = i) · log₂ P(X = i)
Conditional entropy of X given Y: H(X | Y) = ∑_v P(Y = v) · H(X | Y = v)
Information Gain = H(X) − H(X | Y): the reduction in entropy of X from knowing the value of Y
Child entropy of the 13-example child (1 example of one class, 12 of the other):
impurity = − (1/13) · log₂(1/13) − (12/13) · log₂(12/13) = 0.391
(Weighted) average entropy of children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615
Information Gain = 0.996 − 0.615 = 0.38
Based on slide by Pedro Domingos
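These numbers can be reproduced in a few lines of Python. The class counts (1, 12) for the size-13 child appear in the slide; the counts (13, 4) for the size-17 child and (14, 16) for the parent are inferred from the quoted entropies 0.787 and 0.996, so treat this purely as a sketch of the arithmetic.

# Recomputing the worked information-gain example. Counts marked "inferred"
# are reconstructed from the quoted entropies, not stated on the slide.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

parent = entropy([14, 16])                                  # ~0.997 (inferred counts)
child_a = entropy([13, 4])                                  # ~0.787 (inferred counts)
child_b = entropy([1, 12])                                  # ~0.391 (from the slide)
avg_children = (17 / 30) * child_a + (13 / 30) * child_b    # ~0.615
print(round(parent - avg_children, 3))                      # information gain ~0.38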
Entropy-Based Automatic Decision Tree Construction
Based on slide by Pedro Domingos
Using Information Gain to Construct a Decision Tree
• Choose the attribute A with the highest information gain for the full training set X at the root of the tree.
• Construct child nodes for each value v1, v2, …, vk of A. Each child has an associated subset of the training vectors in which A takes that particular value, e.g., X_v1 = {x ∈ X | value(A) = v1}.
• Repeat recursively on each subset. Till when?
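A compact sketch of this recursive procedure (an illustrative ID3-style implementation, not the exact code behind the slides): pick the attribute with the highest information gain, split on each of its values, and recurse, stopping here when a node is pure or no attributes remain.

from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of the parent minus the weighted entropy of its children."""
    parent = entropy(labels)
    children = {}
    for row, y in zip(rows, labels):
        children.setdefault(row[attr], []).append(y)
    weighted = sum(len(ys) / len(labels) * entropy(ys) for ys in children.values())
    return parent - weighted

def build_tree(rows, labels, attrs):
    # Stop when the node is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]       # leaf: majority label
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {"attr": best, "children": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        tree["children"][value] = build_tree(sub_rows, sub_labels, attrs - {best})
    return tree

# Toy data (invented): rows are dicts of attribute -> value.
rows = [{"Outlook": "sun", "Wind": "weak"}, {"Outlook": "sun", "Wind": "strong"},
        {"Outlook": "rain", "Wind": "weak"}, {"Outlook": "rain", "Wind": "strong"}]
labels = ["yes", "no", "yes", "yes"]
print(build_tree(rows, labels, {"Outlook", "Wind"}))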