Introduction to Machine Learning

Rob Schapire Princeton University

www.cs.princeton.edu/~schapire

Machine Learning studies how to automatically learn to make accurate predictions based on past observations.

Classification problems: classify examples into a given set of categories.

[Diagram: labeled training examples → machine learning algorithm → classification rule; new example → classification rule → predicted classification]
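As an illustration of this pipeline (not part of the original slides; the synthetic dataset and the choice of a decision-tree learner are my own assumptions), a minimal scikit-learn sketch:

```python
# Sketch of the pipeline above: labeled training examples go into a learning
# algorithm, which outputs a classification rule applied to new examples.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# labeled examples (features X, labels y); purely synthetic, for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3)   # the "machine learning algorithm"
clf.fit(X_train, y_train)                   # learn a classification rule
y_pred = clf.predict(X_test)                # predicted classification for new examples
print("test accuracy:", clf.score(X_test, y_test))
```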

Examples of Classification Problems

• bioinformatics
  • classify proteins according to their function
  • predict if patient will respond to particular drug/therapy based on microarray profiles
  • predict if molecular structure is a small-molecule binding site
• text categorization (e.g., spam filtering)
• fraud detection
• optical character recognition
• machine vision (e.g., face detection)
• natural-language processing (e.g., spoken language understanding)
• market segmentation (e.g., predict if customer will respond to promotion)

Characteristics of Modern Machine Learning

• primary goal: highly accurate predictions on test data
  • goal is not to uncover underlying "truth"
• methods should be general purpose, fully automatic and "off-the-shelf"
  • however, in practice, incorporation of prior, human knowledge is crucial
• rich interplay between theory and practice
• emphasis on methods that can handle large datasets

Why Use Machine Learning?

advantages:
• often much more accurate than human-crafted rules (since data driven)
• humans often incapable of expressing what they know (e.g., rules of English, or how to recognize letters), but can easily classify examples
• automatic method to search for hypotheses explaining data
• cheap and flexible; can apply to any learning task

disadvantages:
• need a lot of labeled data
• error prone; usually impossible to get perfect accuracy
• often difficult to discern what was learned

This Talk

• conditions for accurate learning
• two state-of-the-art algorithms:
  • boosting
  • support-vector machines

Conditions for Accurate Learning

Example: Good versus Evil

problem: identify people as good or bad from their appearance

training data:
            sex     mask  cape  tie  ears  smokes  class
  batman    male    yes   yes   no   yes   no      Good
  robin     male    yes   yes   no   no    no      Good
  alfred    male    no    no    yes  no    no      Good
  penguin   male    no    no    yes  no    yes     Bad
  catwoman  female  yes   no    no   yes   no      Bad
  joker     male    no    no    no   no    no      Bad

test data:
  batgirl   female  yes   yes   no   yes   no      ??
  riddler   male    yes   no    no   no    no      ??

An Example Classifier

[Figure: decision tree]
  tie?
    no  → cape?
            no  → bad
            yes → good
    yes → smokes?
            no  → good
            yes → bad
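Written out as code, the same rule might look like this (a hypothetical Python rendering of the tree above, using the attribute names from the table):

```python
# The example classifier above, written as explicit rules.
# Each person is a dict of attribute -> "yes"/"no" (this tree never uses sex, mask or ears).
def classify(person):
    if person["tie"] == "no":
        # no tie: decide by cape
        return "good" if person["cape"] == "yes" else "bad"
    # wears a tie: decide by smoking
    return "good" if person["smokes"] == "no" else "bad"

batman = {"tie": "no", "cape": "yes", "smokes": "no"}
penguin = {"tie": "yes", "cape": "no", "smokes": "yes"}
print(classify(batman), classify(penguin))  # good bad
```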

Another Possible Classifier

[Figure: a much larger decision tree, splitting repeatedly on mask, smokes, ears, tie, cape and sex]

• perfectly classifies training data
• BUT: intuitively, overly complex

Yet Another Possible Classifier

[Figure: decision tree]
  sex?
    male   → good
    female → bad

• overly simple: doesn't even fit available data
• problem: can't tell best classifier complexity from training error

Complexity versus Accuracy on an Actual Dataset

[Figure: accuracy on training data and on test data as a function of classifier complexity (tree size)]

• controlling overfitting is the central problem of machine learning
• classifiers must be expressive enough to fit training data (so that true patterns are fully captured)
• BUT: classifiers that are too complex may overfit (capture noise or spurious patterns in the data)

Building an Accurate Classifier

for good test performance, need:
• enough training examples
• good performance on training set
• classifier that is not too complex ("Occam's razor")

• classifiers should be as simple as possible, but no simpler
• simplicity closely related to prior expectations
• measure complexity by:
  • number of bits needed to write down
  • number of parameters
  • VC-dimension

Theory

can prove: with high probability,

  (generalization error) ≤ (training error) + Õ( √(d/m) )

where d = VC-dimension, m = number of training examples
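To get a rough feel for the √(d/m) term (my numeric sketch, not from the slides; the Õ hides constants and log factors, so the numbers are only indicative):

```python
# Rough feel for the sqrt(d/m) term in the bound above: more examples or a
# lower-complexity classifier class shrinks the gap between training and test error.
from math import sqrt

for d, m in [(10, 100), (10, 10_000), (1000, 10_000)]:
    print(f"VC-dim d={d:5d}, examples m={m:6d}: sqrt(d/m) = {sqrt(d / m):.3f}")
```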

Boosting

Example: Spam Filtering

problem: filter out spam (junk email)

gather large collection of examples of spam and non-spam:

  From: [email protected]   "Rob, can you review a paper..."       → non-spam
  From: [email protected]   "Earn money without working!!!! ..."   → spam
  ...

main observation:
• easy to find rules of thumb that are often correct
  • e.g., if "buy now" occurs in message, then predict spam
• hard to find single rule that is very highly accurate

The Boosting Approach

• devise computer program for deriving rough rules of thumb
• apply procedure to subset of emails
• obtain rule of thumb
• apply to 2nd subset of emails
• obtain 2nd rule of thumb
• repeat T times

Details

• how to choose examples on each round?
  • concentrate on hardest examples (those most often misclassified by previous rules of thumb)
• how to combine rules of thumb into single prediction rule?
  • take (weighted) majority vote of rules of thumb

can prove: if can always find weak rules of thumb slightly better than random guessing (51% accuracy), then can learn almost perfectly (99% accuracy) using boosting

AdaBoost

given training examples:
• initialize weights D1 to be uniform across training examples
• for t = 1, ..., T:
  • train weak classifier (rule of thumb) ht on Dt
  • compute new weights Dt+1:
    • decrease weight of examples correctly classified by ht
    • increase weight of examples incorrectly classified by ht
• output final classifier Hfinal = weighted majority vote of h1, ..., hT
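The slide leaves the exact weighting formulas implicit; a compact NumPy sketch using the standard AdaBoost choices (vote weight α_t = ½ ln((1 − ε_t)/ε_t) and multiplicative reweighting), with single-feature decision stumps as the weak classifiers, might look like this. The stump learner and all names are illustrative, not Schapire's code.

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: best single-feature threshold rule under weights D."""
    n, d = X.shape
    best = None
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, feature index, threshold, sign)

def stump_predict(X, stump):
    _, j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T=10):
    """Labels y must be +1/-1.  Returns a list of (alpha_t, stump_t)."""
    n = len(y)
    D = np.full(n, 1.0 / n)                       # D1: uniform weights
    ensemble = []
    for t in range(T):
        stump = train_stump(X, y, D)              # train weak classifier on D_t
        err = max(stump[0], 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # standard AdaBoost vote weight
        pred = stump_predict(X, stump)
        D *= np.exp(-alpha * y * pred)            # decrease correct, increase incorrect
        D /= D.sum()                              # renormalize to get D_{t+1}
        ensemble.append((alpha, stump))
    return ensemble

def H_final(X, ensemble):
    """Weighted majority vote of the weak classifiers."""
    scores = sum(a * stump_predict(X, s) for a, s in ensemble)
    return np.sign(scores)
```

Boosting libraries such as scikit-learn's AdaBoostClassifier implement the same loop; this sketch just makes the reweighting explicit.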

Toy Example

weak classifiers = vertical or horizontal half-planes

[Figure: three rounds of boosting on a small 2-d dataset, showing the reweighted distributions D1, D2, D3 and the chosen half-plane classifiers h1, h2, h3]

Round 1: train h1 on D1; ε1 = 0.30, α1 = 0.42
Round 2: train h2 on D2; ε2 = 0.21, α2 = 0.65
Round 3: train h3 on D3; ε3 = 0.14, α3 = 0.92
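As a sanity check (mine, not from the slides), these vote weights are consistent with the standard AdaBoost rule α_t = ½ ln((1 − ε_t)/ε_t), up to rounding of the displayed ε_t values:

```python
from math import log

for t, eps in enumerate([0.30, 0.21, 0.14], start=1):
    alpha = 0.5 * log((1 - eps) / eps)
    print(f"round {t}: eps={eps:.2f}  ->  alpha = {alpha:.2f}")
# round 1: eps=0.30  ->  alpha = 0.42
# round 2: eps=0.21  ->  alpha = 0.66   (slide shows 0.65; the epsilons are rounded)
# round 3: eps=0.14  ->  alpha = 0.91   (slide shows 0.92)
```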

Final Classifier

  Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )

[Figure: the three half-plane classifiers combined with weights 0.42, 0.65 and 0.92, and the decision region of the resulting weighted vote]

Theory of Boosting

• assume each weak classifier is slightly better than random
• can prove training error drops to zero exponentially fast
• even so, naively expect significant overfitting, since a large number of rounds implies a large final classifier
• surprisingly, usually does not overfit

Theory of Boosting (cont.)

[Figure: error (%) versus number of rounds T (10 to 1000, log scale) for boosting C4.5 on the "letter" dataset; curves for training error and test error, plus the test error of C4.5 alone]

• test error does not increase, even after 1000 rounds
• test error continues to drop even after training error is zero!
• explanation: with more rounds of boosting, final classifier becomes more confident in its predictions
• increase in confidence implies better test error (regardless of number of rounds)

Support-Vector Machines

Geometry of SVMs

given linearly separable data

margin = distance to separating hyperplane

• choose hyperplane that maximizes minimum margin
• intuitively: want to separate +'s from −'s as much as possible
• margin = measure of confidence
• support vectors = examples closest to hyperplane
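A minimal scikit-learn sketch of this picture (my illustration, not from the slides): with linearly separable data and a very large C, SVC approximates the hard-margin maximum-margin hyperplane and exposes the support vectors it found.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated clusters, so a separating hyperplane exists
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard margin: maximize the minimum margin
clf.fit(X, y)

w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)    # geometric margin of the max-margin hyperplane
print("support vectors (examples closest to the hyperplane):")
print(clf.support_vectors_)
print("margin:", margin)
```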

Theoretical Justification

let γ = minimum margin
    R = radius of enclosing sphere

then

  VC-dim ≤ (R/γ)²

• in contrast, unconstrained hyperplanes in Rⁿ have VC-dim = (# parameters) = n + 1
• so larger margins imply lower complexity, independent of the number of dimensions

What If Not Linearly Separable?

• answer #1: penalize each point by the distance it must be moved to obtain a large margin
• answer #2: map into a higher dimensional space in which the data becomes linearly separable

Example

[Figure: dataset in the plane that is not linearly separable]

map x = (x1, x2) → Φ(x) = (1, x1, x2, x1x2, x1², x2²)

hyperplane in mapped space has form

  a + b x1 + c x2 + d x1x2 + e x1² + f x2² = 0

i.e., a conic in the original space, so data separable by a conic becomes linearly separable in the mapped space

Higher Dimensions Don't (Necessarily) Hurt

• may project to very high dimensional space
• statistically, may not hurt, since VC-dimension is independent of number of dimensions ((R/γ)²)
• computationally, only need to be able to compute inner products Φ(x) · Φ(z)
  • sometimes can do this very efficiently using kernels

Example (cont.)

modify slightly:

  Φ(x) = (1, √2 x1, √2 x2, √2 x1x2, x1², x2²)

then

  Φ(x) · Φ(z) = 1 + 2 x1z1 + 2 x2z2 + 2 x1x2z1z2 + x1²z1² + x2²z2²
              = (1 + x1z1 + x2z2)²
              = (1 + x · z)²

• in general, for polynomial of degree d, use (1 + x · z)^d
• very efficient, even though finding hyperplane in O(n^d) dimensions
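A quick numerical check of this identity (my own verification sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

def phi(v):
    # explicit degree-2 feature map from the slide
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1**2, x2**2])

lhs = phi(x) @ phi(z)          # inner product in the 6-dimensional mapped space
rhs = (1 + x @ z) ** 2         # polynomial kernel evaluated in the original 2-d space
print(np.isclose(lhs, rhs))    # True
```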

Kernels

kernel = function K for computing

  K(x, z) = Φ(x) · Φ(z)

• permits efficient computation of SVMs in very high dimensions
• many kernels have been proposed and studied
• provides power, versatility and opportunity for incorporation of prior knowledge
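As a usage sketch (my example, not from the slides): scikit-learn's SVC with a degree-2 polynomial kernel, gamma=1 and coef0=1 evaluates exactly the (1 + x · z)² kernel above, so data like the earlier non-linearly-separable example can be separated without ever constructing the mapped features.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# concentric circles: not linearly separable in the original two dimensions
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# kernel K(x, z) = (1 + x.z)^2, computed without forming the feature map explicitly
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```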

Significance of SVMs and Boosting

• grounded in rich theory with provable guarantees
• flexible and general purpose
• fast and easy to use ("off-the-shelf" and fully automatic)
• able to work effectively in very high dimensional spaces
• perform well empirically in many experiments and in many applications

Summary

• central issues in machine learning:
  • avoidance of overfitting
  • balance between simplicity and fit to data
• quick look at two learning algorithms: boosting and SVMs
• many other algorithms not covered:
  • decision trees
  • neural networks
  • nearest neighbor algorithms
  • Naive Bayes
  • bagging
  • ...
• also, classification is just one of many problems studied in machine learning

Other Machine Learning Problem Areas

• supervised learning
  • classification
  • regression (predict real-valued labels)
  • rare class / cost-sensitive learning
• unsupervised learning (no labels)
  • clustering
  • density estimation
• semi-supervised learning
  • in practice, unlabeled examples are much cheaper than labeled examples
  • how to take advantage of both labeled and unlabeled examples
• active learning
  • how to carefully select which unlabeled examples to have labeled

Further reading

On machine learning in general:
• Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
• Luc Devroye, László Györfi and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
• Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification (2nd ed.). Wiley, 2000.
• Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
• Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
• Tom M. Mitchell. Machine Learning. McGraw Hill, 1997.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.

Boosting:
• Ron Meir and Gunnar Rätsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003. http://www.boosting.org/papers/MeiRae03.pdf
• Robert E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. http://www.cs.princeton.edu/~schapire/boost.html
• Many more papers, tutorials, etc. available at www.boosting.org.

Support-vector machines:
• Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. See www.support-vector.net.
• Many more papers, tutorials, etc. available at www.kernel-machines.org.
