Machine Learning
Robert E. Schapire
www.cs.princeton.edu/schapire
Machine learning studies how to automatically learn to make accurate predictions based on past observations.
Classification problems: classify examples into a given set of categories.
[diagram: labeled training examples → classification rule; new example → predicted classification]
Examples of Classification Problems
- bioinformatics: classify proteins according to their function; predict if a patient will respond to a particular drug/therapy based on microarray profiles; predict if a molecular structure is a small-molecule binding site
- text categorization (e.g., spam filtering)
- fraud detection
- optical character recognition
- natural-language processing (e.g., spoken language understanding)
- market segmentation (e.g., predict if a customer will respond to a promotion)
Characteristics of Modern Machine Learning
- primary goal: highly accurate predictions on test data; the goal is not to uncover the underlying "truth"
- methods should be general purpose, fully automatic, and "off-the-shelf"; however, in practice, incorporation of prior, human knowledge is crucial
- rich interplay between theory and practice
- emphasis on methods that can handle large datasets
advantages:
- often much more accurate than human-crafted rules (since data driven)
- humans are often incapable of expressing what they know (e.g., rules of English, or how to recognize letters), but can easily classify examples
- automatic method to search for hypotheses explaining the data
- cheap and flexible: can apply to any learning task
disadvantages:
- need a lot of labeled data
- error prone: usually impossible to get perfect accuracy
- often difficult to discern what was learned
This Talk
- conditions for accurate learning
- two state-of-the-art algorithms: boosting, support-vector machines
Example: Good versus Evil
problem: identify people as good or bad from their appearance

training data:
name      sex     mask  cape  tie  ears  smokes  class
batman    male    yes   yes   no   yes   no      Good
robin     male    yes   yes   no   no    no      Good
alfred    male    no    no    yes  no    no      Good
penguin   male    no    no    yes  no    yes     Bad
catwoman  female  yes   no    no   yes   no      Bad
joker     male    no    no    no   no    no      Bad

test data:
batgirl   female  yes   yes   no   yes   no      ??
riddler   male    yes   no    no   no    no      ??
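As a quick sketch (illustrative code, not from the talk), a brute-force search over single-attribute rules on the training table above shows that no single attribute classifies everyone correctly:

```python
# Find the single-attribute rule with the fewest training mistakes on the
# good-versus-evil table above (pure Python, no libraries needed).

train = [
    # (sex,     mask,  cape,  tie,   ears,  smokes, label)
    ("male",   "yes", "yes", "no",  "yes", "no",  "Good"),  # batman
    ("male",   "yes", "yes", "no",  "no",  "no",  "Good"),  # robin
    ("male",   "no",  "no",  "yes", "no",  "no",  "Good"),  # alfred
    ("male",   "no",  "no",  "yes", "no",  "yes", "Bad"),   # penguin
    ("female", "yes", "no",  "no",  "yes", "no",  "Bad"),   # catwoman
    ("male",   "no",  "no",  "no",  "no",  "no",  "Bad"),   # joker
]
attrs = ["sex", "mask", "cape", "tie", "ears", "smokes"]

def errors(attr_idx, value, label_if_match):
    """Training mistakes of the rule: attr == value -> label_if_match, else the other label."""
    other = "Bad" if label_if_match == "Good" else "Good"
    return sum((label_if_match if row[attr_idx] == value else other) != row[-1]
               for row in train)

best = min(
    (errors(i, v, lab), attrs[i], v, lab)
    for i in range(len(attrs))
    for v in {row[i] for row in train}
    for lab in ("Good", "Bad")
)
print(best)  # -> (1, 'cape', 'no', 'Bad'): "no cape -> bad" gets only alfred wrong
```

Even the best one-attribute rule (a cape test) makes one training mistake, which motivates combining attributes into a richer classifier.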
An Example Classifier
[decision tree:
 tie?
   no  → cape?   no → bad,  yes → good
   yes → smokes? no → good, yes → bad]
[a second, much larger decision tree splitting on mask, smokes, ears, tie, cape, and sex; it also fits the training data, but is far more complex]
problem: can't tell the best classifier complexity from training error alone
controlling overfitting is the central problem of machine learning
- classifiers must be expressive enough to fit the training data (so that true patterns are fully captured)
- BUT: classifiers that are too complex may overfit (capture noise or spurious patterns in the data)
[figure: accuracy on training data keeps rising with classifier complexity, while accuracy on test data peaks and then declines as overfitting sets in]
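A tiny illustration of the point (this experiment is my own, not from the talk): a classifier complex enough to memorize the training set fits even pure-noise labels perfectly, yet generalizes at chance level.

```python
# Memorization as the extreme of overfitting: a lookup-table "classifier"
# achieves zero training error on random labels but learns nothing useful.
import random

rng = random.Random(0)                                  # fixed seed: deterministic
train = [(x, rng.choice([0, 1])) for x in range(100)]   # labels are pure noise
test = [(x, rng.choice([0, 1])) for x in range(100, 200)]

table = dict(train)                                     # memorize every example

def predict(x):
    return table.get(x, 0)                              # default class on unseen inputs

train_err = sum(predict(x) != y for x, y in train) / len(train)
test_err = sum(predict(x) != y for x, y in test) / len(test)
print(f"train error = {train_err:.2f}, test error = {test_err:.2f}")
```

Training error is exactly zero while test error sits near 0.5 (chance for two balanced classes), the gap the figure above depicts.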
classifiers should be as simple as possible, but no simpler; "simplicity" is closely related to prior expectations
good test performance follows from: enough training examples; good performance on the training set; and a classifier that is not too complex ("Occam's razor")
measure complexity by:
- number of bits needed to write down the classifier
- number of parameters
- VC-dimension
Boosting
Example: Spam Filtering
problem: filter out spam (junk email)
example messages labeled by hand:
From: [email protected] ...    (non-spam)
From: [email protected] ...    (spam)
- easy to find rules of thumb that are often correct: if "buy now" occurs in the message, then predict spam
- hard to find a single rule that is very highly accurate
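Such a rule of thumb can be sketched in a few lines (the messages below are made up for illustration):

```python
# A single "rule of thumb" spam classifier: often correct, never highly accurate.
def buy_now_rule(message):
    """Predict 'spam' iff the phrase 'buy now' occurs in the message."""
    return "spam" if "buy now" in message.lower() else "non-spam"

print(buy_now_rule("Limited time offer: BUY NOW and save!"))              # spam
print(buy_now_rule("Are we still on for lunch tomorrow?"))                # non-spam
print(buy_now_rule("Analysts say the best time to buy now is unclear"))   # spam (the rule errs)
```

The third message shows why one such rule is not enough on its own: boosting's job is to combine many of them.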
The Boosting Approach
- devise computer program for deriving rough rules of thumb
- apply procedure to a subset of the emails; obtain a rule of thumb
- apply to a 2nd subset of the emails; obtain a 2nd rule of thumb
- repeat T times
Details
- how to choose examples on each round? concentrate on the hardest examples (those most often misclassified by previous rules of thumb)
- how to combine rules of thumb into a single prediction rule? take a (weighted) majority vote of the rules of thumb
- can prove: if we can always find weak rules of thumb slightly better than random guessing (say, 51% accuracy), then we can learn almost perfectly (say, 99% accuracy) using boosting
- initialize weights D1 to be uniform across training examples
- for t = 1, ..., T:
  - train weak classifier (rule of thumb) ht on Dt
  - compute new weights Dt+1: decrease the weight of examples correctly classified by ht; increase the weight of examples incorrectly classified by ht
- output final classifier Hfinal = weighted majority vote of h1, ..., hT
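The steps above can be sketched as a minimal AdaBoost in pure Python; the 1-D dataset and the threshold "stumps" used as weak classifiers are my own illustrative assumptions:

```python
# Minimal AdaBoost with decision stumps on a toy 1-D problem.
import math

# toy data: no single threshold rule labels both +/- runs correctly
X = [0, 1, 2, 3, 4, 5, 6, 7]
y = [+1, +1, -1, -1, +1, +1, -1, -1]

def make_stump(thresh, sign):
    # weak classifier: predict `sign` left of the threshold, `-sign` to the right
    return lambda x: sign if x < thresh else -sign

stumps = [make_stump(t + 0.5, s) for t in range(8) for s in (+1, -1)]

def adaboost(T):
    n = len(X)
    D = [1.0 / n] * n                                   # uniform initial weights
    ensemble = []                                       # list of (alpha_t, h_t)
    for _ in range(T):
        # "train" the weak classifier: pick the stump minimizing weighted error on D
        h = min(stumps, key=lambda g: sum(D[i] for i in range(n) if g(X[i]) != y[i]))
        eps = sum(D[i] for i in range(n) if h(X[i]) != y[i])
        alpha = 0.5 * math.log((1 - eps) / eps)         # this round's vote weight
        # reweight: mistakes go up, correct examples go down; then renormalize
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, h))
    # final classifier: weighted majority vote of the weak classifiers
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

H = adaboost(T=3)
train_err = sum(H(X[i]) != y[i] for i in range(len(X))) / len(X)
print(train_err)  # 0.0: three weak stumps combine into a perfect fit of this data
```

Each stump alone misclassifies at least two of the eight points; three rounds of reweighting are enough for the vote to fit the training set exactly.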
Toy Example
- Round 1: weak classifier h1, error ε1 = 0.30, weight α1 = 0.42
- Round 2: weak classifier h2, error ε2 = 0.21, weight α2 = 0.65
- Round 3: weak classifier h3, error ε3 = 0.14, weight α3 = 0.92
- Final classifier: Hfinal = sign(0.42 h1 + 0.65 h2 + 0.92 h3)
[figures showing the weight distributions D1, D2, D3 and the three weak classifiers omitted]
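The round weights αt above come from AdaBoost's standard formula αt = ½ ln((1 − εt)/εt); a quick check reproduces them (small differences from 0.65 and 0.92 arise because the displayed εt are themselves rounded to two decimals):

```python
# Recompute the vote weights of the toy example from the round errors.
import math

def alpha(eps):
    # AdaBoost's weight for a weak classifier with weighted error eps
    return 0.5 * math.log((1 - eps) / eps)

for t, eps in enumerate((0.30, 0.21, 0.14), start=1):
    print(f"round {t}: eps = {eps:.2f} -> alpha = {alpha(eps):.2f}")
```

Note that smaller error yields a larger weight, so better rules of thumb get a bigger say in the final vote.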
Theory of Boosting
- assume each weak classifier is slightly better than random
- can prove the training error drops to zero exponentially fast
- even so, one would naively expect significant overfitting, since a large number of rounds implies a large final classifier
- surprisingly, boosting usually does not overfit
[figure: error (%) versus number of rounds T]
- test error continues to drop even after the training error is zero!
- explanation: with more rounds of boosting, the final classifier becomes more confident in its predictions
- increase in confidence implies better test error (regardless of the number of rounds)
Support-Vector Machines
Geometry of SVMs
- choose the hyperplane that maximizes the minimum margin
- intuitively: want to separate the +'s from the −'s as much as possible
- margin = measure of confidence
- support vectors = examples closest to the hyperplane
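The margin of a candidate hyperplane is easy to compute directly; here is a sketch with hypothetical 2-D data and a hand-picked (not optimized) hyperplane:

```python
# Margin of a separating hyperplane w.x + b = 0:
# the smallest distance y * (w.x + b) / ||w|| over the training examples.
import math

# hypothetical data: two +1 points above the line x2 = x1, two -1 points below
points = [((1.0, 3.0), +1), ((2.0, 5.0), +1), ((3.0, 1.0), -1), ((5.0, 2.0), -1)]
w, b = (-1.0, 1.0), 0.0    # the line x2 = x1, written as w.x + b = 0

def margin(w, b, data):
    # positive for every example exactly when the hyperplane separates the data
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * p[0] + w[1] * p[1] + b) / norm for p, y in data)

m = margin(w, b, points)
print(m)  # ~1.414, attained by (1,3) and (3,1): those two are the support vectors
```

An SVM would search over (w, b) to make this minimum as large as possible; the points attaining it are the support vectors.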
What If Not Linearly Separable?
- answer #1: penalize each point by the distance it must be moved to obtain a large margin
- answer #2: map into a higher dimensional space in which the data becomes linearly separable
Example
- hyperplane in the mapped space has the form a + b x1 + c x2 + d x1 x2 + e x1² + f x2² = 0, i.e., a conic in the original space
- linearly separable in the mapped space
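To make this concrete, here is a sketch (the data and the quadratic feature map are my own illustrative choices): points separable only by a circle become linearly separable after mapping (x1, x2) to (x1, x2, x1·x2, x1², x2²).

```python
# Circle-separable data becomes linearly separable under a quadratic feature map.
def phi(x1, x2):
    # quadratic feature map consistent with the conic form above
    return (x1, x2, x1 * x2, x1 ** 2, x2 ** 2)

# hypothetical labels: +1 inside the unit circle, -1 outside (no line separates these)
data = [((0.0, 0.5), +1), ((0.5, 0.0), +1),
        ((2.0, 0.0), -1), ((0.0, -2.0), -1), ((1.5, 1.5), -1)]

# hyperplane a + b.z = 0 in the mapped space; this choice of coefficients
# corresponds to the conic 1 - x1^2 - x2^2 = 0, i.e. the unit circle
a, b = 1.0, (0.0, 0.0, 0.0, -1.0, -1.0)

separable = all(
    y * (a + sum(bi * zi for bi, zi in zip(b, phi(*x)))) > 0 for x, y in data
)
print(separable)  # True: every example lands on the correct side of the hyperplane
```

The linear decision boundary in the 5-dimensional mapped space is exactly the unit circle back in the original plane.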
Higher Dimensions Don't (Necessarily) Hurt
- may project to a very high dimensional space
- statistically, this may not hurt, since the VC-dimension of large-margin hyperplanes is independent of the number of dimensions (roughly (R/δ)², where R bounds the radius of the examples and δ is the margin)
- computationally, only need to be able to compute inner products Φ(x)·Φ(z) in the mapped space
- if these inner products can be computed efficiently via a kernel function K(x, z) = Φ(x)·Φ(z), then this permits efficient computation of SVMs in very high dimensions (the "kernel trick")
- many kernels have been proposed and studied
- provides power, versatility, and an opportunity for the incorporation of prior knowledge
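As a sketch of the kernel trick (the degree-2 polynomial kernel is my choice of example): the kernel value, computed entirely in the original 2-D space, equals an inner product in a 6-dimensional feature space that never has to be formed explicitly.

```python
# Verify that K(x, z) = (1 + x.z)^2 on R^2 equals phi(x).phi(z) for an
# explicit 6-dimensional feature map phi.
import math

def K(x, z):
    # polynomial kernel of degree 2, computed in the original space
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # explicit feature map whose inner product reproduces K
    r2 = math.sqrt(2)
    return (1.0, r2 * x[0], r2 * x[1], x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(p * q for p, q in zip(phi(x), phi(z)))
print(round(explicit, 6), K(x, z))  # 4.0 4.0: identical up to rounding
```

Training an SVM only ever needs K(x, z), so the cost is independent of the (possibly huge) mapped dimension.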
Significance of SVMs and Boosting
- grounded in rich theory with provable guarantees
- flexible and general purpose
- fast and easy to use
- "off-the-shelf" and fully automatic
- able to work effectively in very high dimensional spaces
Summary
- central issues in machine learning: avoidance of overfitting; balance between simplicity and fit to the data
- quick look at two learning algorithms: boosting and SVMs
- many other algorithms not covered: decision trees, neural networks, nearest neighbor algorithms, Naive Bayes, bagging, ...
Other Machine Learning Problem Areas
- supervised learning: classification; regression (predict real-valued labels); rare class / cost-sensitive learning
- unsupervised learning (no labels): clustering; density estimation
- semi-supervised learning: in practice, unlabeled examples are much cheaper than labeled examples; how to take advantage of both labeled and unlabeled examples
- active learning: how to carefully select which unlabeled examples to have labeled
Further reading

On machine learning in general:
- Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
- Luc Devroye, László Györfi and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
- Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification (2nd ed.). Wiley, 2000.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
- Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
- Tom M. Mitchell. Machine Learning. McGraw Hill, 1997.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.

Boosting:
- Ron Meir and Gunnar Rätsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003. http://www.boosting.org/papers/MeiRae03.pdf
- Robert E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. http://www.cs.princeton.edu/schapire/boost.html
- Many more papers, tutorials, etc. available at www.boosting.org.

Support-vector machines:
- Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. See www.support-vector.net.
- Many more papers, tutorials, etc. available at www.kernel-machines.org.