Decision Trees and Overfitting
January 11, 2011
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
Today:
• What is machine learning?
• Decision tree learning
• Course logistics

Readings:
• “The Discipline of ML”
• Mitchell, Chapter 3
• Bishop, Chapter 14.4
Machine Learning:
Study of algorithms that
• improve their performance P
• at some task T
• with experience E
(e.g., T = playing checkers, P = percent of games won, E = opportunity to play practice games against itself)
Learning to Predict Emergency C-Sections
[Sims et al., 2000]

Learning to detect objects in images
(Prof. H. Schneiderman)
Learning to classify text documents

Reading a noun (vs verb) from fMRI brain activity
[Rustandi et al., 2005]
Machine Learning - Practice

Application areas:
• Speech recognition
• Object recognition
• Mining databases
• Control learning
• Text analysis

Families of methods:
• Supervised learning
• Bayesian networks
• Hidden Markov models
• Unsupervised clustering
• Reinforcement learning
• ...
[Diagram: Machine learning at the intersection of Computer science, Statistics, Animal learning (Cognitive science, Psychology, Neuroscience), Economics and Organizational Behavior, Adaptive Control Theory, and Evolution]
Machine Learning in Computer Science
Function approximation

Problem Setting:
• Set of possible instances X
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }

Input:
• Training examples {<x(i), y(i)>} of unknown target function f
  (the superscript (i) denotes the ith training example)

Output:
• Hypothesis h ∈ H that best approximates target function f
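Concretely, learning is then a search over H for the hypothesis with the lowest error on the training examples. Below is a minimal Python sketch of this setting; the training data and the three candidate hypotheses are invented for illustration, not part of the lecture.

```python
# Minimal sketch: learning as search over a hypothesis space H.
# Training examples {<x(i), y(i)>}: each x is a feature vector, y its label.
train = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1), ((0, 0), 0)]

# A tiny hypothesis space H = { h | h : X -> Y } (hypothetical candidates).
H = {
    "first-feature":  lambda x: x[0],
    "second-feature": lambda x: x[1],
    "always-one":     lambda x: 1,
}

def training_error(h):
    """Fraction of training examples that h misclassifies."""
    return sum(h(x) != y for x, y in train) / len(train)

# Output: the h in H that best approximates f on the observed sample.
best_name, best_h = min(H.items(), key=lambda kv: training_error(kv[1]))
print(best_name, training_error(best_h))  # -> first-feature 0.0
```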
Decision Tree Learning

Problem Setting:
• Set of possible instances X
  – each instance x in X is a feature vector
  – e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f : X → Y
  – Y is discrete valued
• Set of function hypotheses H = { h | h : X → Y }
  – each hypothesis h is a decision tree
  – a tree sorts x to a leaf, which assigns y

Input:
• Training examples {<x(i), y(i)>} of unknown target function f

Output:
• Hypothesis h ∈ H that best approximates target function f
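To make the hypothesis representation concrete, here is a sketch of one decision tree written as nested Python dicts, together with the "sort x to a leaf" step; the tree, labels, and instance are illustrative assumptions.

```python
# One decision-tree hypothesis h: internal nodes test a feature,
# leaves assign a value of Y (tree and labels are hypothetical).
tree = {
    "feature": "Outlook",
    "branches": {
        "sunny": {"feature": "Humidity",
                  "branches": {"low": "Yes", "high": "No"}},
        "rain":  {"feature": "Wind",
                  "branches": {"weak": "Yes", "strong": "No"}},
        "overcast": "Yes",
    },
}

def classify(node, x):
    """Sort instance x down the tree to a leaf, which assigns y."""
    while isinstance(node, dict):                  # internal node: test feature
        node = node["branches"][x[node["feature"]]]
    return node                                    # leaf: the assigned label y

x = {"Humidity": "low", "Wind": "weak", "Outlook": "rain", "Temp": "hot"}
print(classify(tree, x))  # -> Yes
```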
Decision Trees

Suppose X = <X1, …, Xn>
where the Xi are boolean variables.
How would you represent Y = X2 ∧ X5 as a tree? Y = X2 ∨ X5?
Top-Down Induction of Decision Trees
[ID3, C4.5, Quinlan]

node = Root
Main loop:
1. A ← the “best” decision attribute for the next node
2. Assign A as decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to leaf nodes
5. If training examples are perfectly classified, then STOP; else iterate over new leaf nodes
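In Python, the same greedy loop might look like the sketch below, assuming discrete-valued attributes and training examples given as (feature-dict, label) pairs; the scoring function that picks the “best” attribute (information gain, defined on the next slides) is passed in. All names here are illustrative.

```python
from collections import Counter

def id3(examples, attrs, score):
    """Greedy top-down decision tree induction (ID3-style sketch).

    examples: list of (x, y) pairs, where x is a dict of attribute values
    attrs:    set of attribute names still available for splitting
    score:    function (examples, attr) -> split quality (e.g., information gain)
    """
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:
        return labels[0]                      # perfectly classified: STOP at a leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]       # nothing left: majority
    best = max(attrs, key=lambda a: score(examples, a))   # 1. pick "best" attribute
    node = {"feature": best, "branches": {}}              # 2. assign it to this node
    for value in set(x[best] for x, _ in examples):       # 3. one descendant per value
        subset = [(x, y) for x, y in examples if x[best] == value]    # 4. sort examples
        node["branches"][value] = id3(subset, attrs - {best}, score)  # 5. iterate
    return node
```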
Entropy

Entropy H(X) of a random variable X with n possible values:

H(X) = -\sum_{i=1}^{n} P(X = i) \log_2 P(X = i)
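In code, the definition is direct; a small sketch (the input is a list of probabilities, one per possible value of X):

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_i P(X=i) * log2 P(X=i); zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([1.0]))       # deterministic variable: 0.0 bits
```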
Sample Entropy

For a sample S of training examples, let p_+ be the proportion of positive examples in S and p_- the proportion of negative examples. Entropy measures the impurity of S:

H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
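As a worked example with an assumed sample of 9 positive and 5 negative training examples:

H(S) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) \approx 0.940 bits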
Information Gain

Information Gain is the mutual information between input attribute A and target variable Y: the expected reduction in entropy of Y for sample S, due to sorting S on attribute A:

Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

where S_v is the subset of S for which A = v.
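A Python sketch of the same quantity, with sample entropy repeated so the snippet stands alone; the small dataset at the bottom is hypothetical.

```python
from collections import Counter
from math import log2

def sample_entropy(labels):
    """H(S) over the label proportions in sample S (previous slide)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    gain = sample_entropy([y for _, y in examples])
    for value in set(x[attr] for x, _ in examples):
        s_v = [y for x, y in examples if x[attr] == value]
        gain -= len(s_v) / len(examples) * sample_entropy(s_v)
    return gain

# Hypothetical usage: how informative is "Wind" about the label?
data = [({"Wind": "weak"}, "+"), ({"Wind": "weak"}, "+"),
        ({"Wind": "strong"}, "-"), ({"Wind": "strong"}, "+")]
print(round(information_gain(data, "Wind"), 3))  # -> 0.311
```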
Decision Tree Learning Applet
• http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html
Why Prefer Short Hypotheses? (Occam’s Razor)

Argument in favor:
• There are fewer short hypotheses than long ones
  – so a short hypothesis that fits the data is less likely to be a statistical coincidence
  – whereas it is highly probable that a sufficiently complex hypothesis will fit the data by chance

Argument opposed:
• There are also fewer hypotheses with a prime number of nodes and attributes beginning with “Z”
• What’s so special about “short” hypotheses?
Reduced-Error Pruning

Split data into training and validation set.
Create tree that classifies training set correctly.
Then, do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each possible node (plus the subtree below it)
2. Greedily remove the node whose removal most improves validation set accuracy
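A Python sketch of this procedure, reusing the nested-dict tree representation from the earlier sketches; every name and helper here is an illustrative assumption, not the lecture’s own code.

```python
from collections import Counter
import copy

def classify(node, x):
    while isinstance(node, dict):
        node = node["branches"][x[node["feature"]]]
    return node

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def subtrees(node, path=()):
    """Yield (path, subtree) for every internal node of the tree."""
    if isinstance(node, dict):
        yield path, node
        for v, child in node["branches"].items():
            yield from subtrees(child, path + (v,))

def leaf_labels(node):
    """All leaf labels under node (used to form a majority-vote leaf)."""
    if not isinstance(node, dict):
        return [node]
    return [l for c in node["branches"].values() for l in leaf_labels(c)]

def prune_at(tree, path, leaf):
    """Copy of tree with the subtree at `path` replaced by `leaf`."""
    tree = copy.deepcopy(tree)
    if not path:
        return leaf
    node = tree
    for v in path[:-1]:
        node = node["branches"][v]
    node["branches"][path[-1]] = leaf
    return tree

def reduced_error_prune(tree, validation):
    """Do until further pruning is harmful: greedily apply the single
    prune that yields the best validation accuracy."""
    while isinstance(tree, dict):
        candidates = [prune_at(tree, p, Counter(leaf_labels(n)).most_common(1)[0][0])
                      for p, n in subtrees(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < accuracy(tree, validation):
            return tree                  # further pruning is harmful: stop
        tree = best
    return tree
```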
What you should know:
• Well-posed function approximation problems:
  – Instance space, X
  – Sample of labeled training data { <x(i), y(i)> }
  – Hypothesis space, H = { f : X → Y }