
Introduction to Machine Learning

Learning theory: generalization and VC dimension
Yifeng Tao
School of Computer Science
Carnegie Mellon University

Slides adapted from Eric Xing

Yifeng Tao Carnegie Mellon University 1


Outline
o Computational learning theories
  o PAC framework
  o Agnostic framework
o VC dimension

o These results cover both the case of finite |H| and the case of infinite |H| with finite VC(H), in both the PAC and agnostic frameworks.



Generalizability of Learning
o In machine learning it is really the generalization error that we care about, but most learning algorithms fit their models to the training set.
o Why should doing well on the training set tell us anything about generalization error? Specifically, can we relate error on the training set to generalization error?
oAre there conditions under which we can actually prove that learning
algorithms will work well?
oLecture 1:

[Slide from Eric Xing]



What General Laws Constrain Inductive Learning?
o Want a theory to relate:
  o Number of training examples: m
  o Complexity of hypothesis/concept space: H
  o Accuracy of approximation to the target concept: ε
  o Probability of successful learning: 1 − δ
o All the results are stated in O(…) form



Prototypical concept learning task
oBinary classification
o Everything we'll say here generalizes to other settings, including regression and multi-class classification problems.
oGiven:
o Instances X: Possible days, each described by the attributes Sky, AirTemp,
Humidity, Wind, Water, Forecast
o Target function c: EnjoySport: X → {0, 1}
o Hypothesis space H: conjunctions of literals, e.g.
o (?, Cold, High, ?, ?, ¬EnjoySport)
o Training examples S: iid positive and negative examples of the target
function
o (x1, c(x1)), ... (xm, c(xm))
oDetermine:
o A hypothesis h in H such that h(x) is "good" w.r.t c(x) for all x in S?
o A hypothesis h in H such that h(x) is "good" w.r.t c(x) for all x in the true
distribution D?
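To make the setting concrete, a conjunction of literals can be encoded as a per-attribute constraint, with '?' meaning "any value". This is a minimal sketch: the helper `make_conjunction` and its encoding are illustrative (it handles '?' and positive literals only), not part of the lecture.

```python
# A conjunction-of-literals hypothesis over the EnjoySport attributes.
# '?' matches any value; only positive literals and '?' are handled in
# this sketch. (The helper name and encoding are illustrative.)
def make_conjunction(constraints):
    def h(x):
        return int(all(v == '?' or x[a] == v for a, v in constraints.items()))
    return h

h = make_conjunction({'Sky': '?', 'AirTemp': 'Cold', 'Humidity': 'High',
                      'Wind': '?', 'Water': '?', 'Forecast': '?'})

x = {'Sky': 'Sunny', 'AirTemp': 'Cold', 'Humidity': 'High',
     'Wind': 'Strong', 'Water': 'Warm', 'Forecast': 'Same'}
print(h(x))  # 1: x satisfies every non-'?' constraint
```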



Sample Complexity
oHow many training examples m are sufficient to learn the target
concept?
oTraining scenarios:
o If learner proposes instances, as queries to teacher
o Learner proposes instance x, teacher provides c(x)
o If teacher (who knows c) provides training examples
o Teacher provides a sequence of m examples of the form (x, c(x))
o If some random process (e.g., nature) proposes instances
o Instance x generated randomly, teacher provides c(x)



Two Basic Competing Models



Protocol



True error of a hypothesis

o Definition: The true error (denoted ε_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
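In symbols, the definition above reads:

```latex
\varepsilon_D(h) \;=\; \Pr_{x \sim D}\bigl[\, h(x) \neq c(x) \,\bigr]
```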



Two notions of error
o Training error (a.k.a. empirical risk or empirical error) of hypothesis h with respect to target concept c
  o How often h(x) ≠ c(x) over training instances from S

o True error (a.k.a. generalization error, test error) of hypothesis h with respect to c
  o How often h(x) ≠ c(x) over future instances drawn i.i.d. from D
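The two notions can be sketched in code. The toy concept c, hypothesis h, and uniform distribution below are made up for illustration:

```python
import random

random.seed(0)

# Toy target concept c and hypothesis h over instances x in {0, ..., 9};
# both are made up for illustration. Under the uniform distribution D,
# h errs exactly on x == 5, so its true error is 0.1.
c = lambda x: int(x >= 5)
h = lambda x: int(x >= 6)

def empirical_error(h, S):
    """Training error: fraction of pairs (x, y) in S that h misclassifies."""
    return sum(h(x) != y for x, y in S) / len(S)

S = [(x, c(x)) for x in (random.randrange(10) for _ in range(20))]
print(empirical_error(h, S))  # training error on this particular sample S
```

The training error depends on the particular sample S; the true error is a fixed property of h, c, and D.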



The Union Bound
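The standard statement, used repeatedly in the proofs that follow: for any events A_1, …, A_k (not necessarily independent),

```latex
\Pr\!\left[\bigcup_{i=1}^{k} A_i\right] \;\le\; \sum_{i=1}^{k} \Pr\left[A_i\right]
```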



Hoeffding inequality
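The standard form used in these bounds: for i.i.d. Bernoulli(φ) random variables Z_1, …, Z_m with empirical mean φ̂, and any γ > 0,

```latex
\Pr\bigl[\,|\phi - \hat{\phi}| > \gamma\,\bigr] \;\le\; 2\exp\!\left(-2\gamma^2 m\right)
```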



Version Space



Consistent Learner
o A learner is consistent if it outputs a hypothesis that perfectly fits the training data
  o This is a quite reasonable learning strategy
o Every consistent learner outputs a hypothesis belonging to the version space
o We want to know how such a hypothesis generalizes
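A brute-force sketch of the version space on a tiny made-up hypothesis class (not from the lecture): enumerate all conjunctions over two boolean attributes and keep those consistent with the training set.

```python
from itertools import product

# Hypothesis class: all conjunctions over two boolean attributes,
# encoded as (a, b) with values in {0, 1, '?'} ('?' = don't care).
def h_of(spec):
    a, b = spec
    return lambda x: int((a == '?' or x[0] == a) and (b == '?' or x[1] == b))

H = list(product([0, 1, '?'], repeat=2))

# Training data for the target c(x) = [x0 == 1 and x1 == 1]
S = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0)]

# Version space: hypotheses consistent with every training example
VS = [spec for spec in H if all(h_of(spec)(x) == y for x, y in S)]
print(VS)  # [(1, 1)] -- every consistent learner outputs an element of this list
```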



Probably Approximately Correct

o Double "hedging":
  o Approximately
  o Probably
o Need both!



Exhausting the version space

(figure: the version space VS_{H,D})



How many examples will ε-exhaust the VS



Proof

[Slide from Eric Xing and David Sontag]



What it means
o [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most |H|e^(−εm)

o Suppose we want this probability to be at most δ

o How many training examples suffice?

o If m ≥ (1/ε)(ln|H| + ln(1/δ)), then |H|e^(−εm) ≤ δ
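Requiring the Haussler bound |H|e^(−εm) to be at most δ and solving for m gives the standard sample-complexity result:

```latex
|H|\, e^{-\varepsilon m} \;\le\; \delta
\quad\Longleftrightarrow\quad
m \;\ge\; \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```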



Learning Conjunctions of Boolean Literals



PAC Learnability
o A concept class is PAC learnable if some algorithm
  o requires no more than polynomial computation per training example, and
  o uses no more than a polynomial number of samples
o Theorem: the class of conjunctions of Boolean literals is PAC learnable



How about EnjoySport?



PAC-Learning



Agnostic Learning



Empirical Risk Minimization Paradigm



The Case of Finite H
o H = {h1, ..., hk} consists of k hypotheses
o We would like to give guarantees on the generalization error of ĥ
o First, we will show that the empirical error ε̂(h) is a reliable estimate of ε(h) for all h
o Second, we will show that this implies an upper bound on the generalization error of ĥ
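For reference, the two steps combine into the standard finite-H guarantee (reconstructed here, since the slide bodies are not included): with ε̂ the training error over m i.i.d. examples,

```latex
\Pr\!\left[\forall h \in H:\ |\varepsilon(h) - \hat{\varepsilon}(h)|
\le \sqrt{\tfrac{1}{2m}\log\tfrac{2k}{\delta}}\right] \;\ge\; 1 - \delta,
\qquad\text{hence}\qquad
\varepsilon(\hat{h}) \;\le\; \min_{h \in H} \varepsilon(h)
+ 2\sqrt{\tfrac{1}{2m}\log\tfrac{2k}{\delta}}
```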



Misclassification Probability



Uniform Convergence





Sample Complexity



Generalization Error Bound



Agnostic framework



What if H is not finite?
oCan’t use our result for infinite H

oNeed some other measure of complexity for H


o Vapnik-Chervonenkis (VC) dimension!



How do we characterize “power”?
oDifferent machines have different amounts of “power”.
oTradeoff between:
o More power: Can model more complex classifiers but might overfit
o Less power: Not going to overfit, but restricted in what it can model

oHow do we characterize the amount of power?



Shattering a Set of Instances



Three Instances Shattered
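Shattering can be checked by brute force. A sketch with 1-D threshold classifiers (an illustrative class, not the slide's 2-D example): they can shatter any single point but no pair of points, so their VC dimension is 1.

```python
from itertools import product

def labelings(points, hyps):
    """Set of distinct label vectors the hypothesis class induces on points."""
    return {tuple(h(x) for x in points) for h in hyps}

def shattered(points, hyps):
    """True iff every one of the 2^n labelings of the points is achievable."""
    return len(labelings(points, hyps)) == 2 ** len(points)

# 1-D threshold classifiers h_t(x) = 1 if x >= t else 0
thresholds = [lambda x, t=t: int(x >= t) for t in (-10, -1, 0.5, 1.5, 10)]

print(shattered([0], thresholds))     # True: one point can get label 0 or 1
print(shattered([0, 1], thresholds))  # False: labeling (1, 0) is unachievable
```

Since a threshold assigns 1 to everything above t, the left point can never be labeled 1 while the right point is labeled 0, which is why no 2-point set is shattered.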



The Vapnik-Chervonenkis Dimension



VC dimension: examples

[Slide from Eric Xing and David Sontag]





The VC Dimension and the Number of Parameters
o The VC dimension thus gives concreteness to the notion of the capacity of a given hypothesis class H.
o Is it true that learning machines with many parameters have high VC dimension, while learning machines with few parameters have low VC dimension?

o No: there is an infinite-VC-dimension function family with just one parameter!



An infinite-VC function with just one parameter



Sample Complexity from VC Dimension
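The bound usually quoted in this setting (Blumer et al., 1989; stated here for reference since the slide body is not included): a consistent learner outputs a hypothesis with error at most ε, with probability at least 1 − δ, provided

```latex
m \;\ge\; \frac{1}{\varepsilon}\left(4\log_2\frac{2}{\delta}
+ 8\,\mathrm{VC}(H)\log_2\frac{13}{\varepsilon}\right)
```

Note that VC(H) has replaced ln|H| from the finite-H bound, so the result applies to infinite hypothesis classes.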



Consistency
o A learning process (model) is said to be consistent if the model error, measured on new data sampled from the same underlying distribution as the original sample, converges to the model error measured on the original sample as the original sample size increases.



Vapnik main theorem



Agnostic Learning: VC Bounds



Model convergence speed



How to control model generalization capacity
o Risk Expectation = Empirical Risk + Confidence Interval
o Minimizing the Empirical Risk alone will not always give good generalization capacity: one wants to minimize the sum of the Empirical Risk and the Confidence Interval
o What is important is not the numerical value of the Vapnik bound, which is most often too large to be of any practical use, but the fact that this bound is a non-decreasing function of the "richness" of the model family
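For binary classification with VC dimension d, the confidence-interval term has the classic Vapnik form (stated here for reference; η is the confidence parameter): with probability at least 1 − η,

```latex
R(h) \;\le\; R_{\mathrm{emp}}(h)
+ \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) - \ln\frac{\eta}{4}}{m}}
```

The second term grows with d, which is why restricting the "richness" of the model family tightens the bound.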



Structural Risk Minimization



SRM strategy



Putting SRM into action: linear models case
o There are many SRM-based strategies to build models.
o In the case of linear models
  y = wᵀx + b
o one wants to make ||w|| a controlled parameter: let us call H_C the family of linear models satisfying the constraint
  ||w|| < C
o Vapnik's theorem: when C decreases, the VC dimension d(H_C) decreases
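The hard constraint ||w|| < C corresponds, via Lagrangian duality, to adding a penalty λ||w||² to the empirical risk. A numpy sketch (the data and λ values are made up) showing that a larger penalty yields a smaller ||w||, i.e. a more restricted H_C:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form penalized least squares; larger lam means a tighter
    implicit budget on ||w|| (the intercept b is omitted in this sketch)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# As lam grows, ||w|| shrinks: the effective model family H_C gets smaller.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)  # monotonically decreasing
```

This is the same capacity-control mechanism that motivates margin maximization in SVMs.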



Putting SRM into action: linear models case

(figure: a linear model, with residuals y_i − wᵀx_i − b and decision values wᵀx_i + b)



Take away message
o Sample complexity varies with the learning setting
  o Learner actively queries the trainer
  o Examples are provided at random
o Within the PAC learning setting, we can bound the probability that the learner will output a hypothesis with a given error
  o for ANY consistent learner (case where c ∈ H)
  o for ANY "best fit" hypothesis (agnostic learning, where perhaps c ∉ H)
o VC dimension as a measure of the complexity of H
o Quantitative bounds characterizing bias/variance in the choice of H



Take home message


[Slide from Matt Gormley]



References
o Eric Xing, Ziv Bar-Joseph. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701/
o Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
o David Sontag. Introduction to Machine Learning: https://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture14.pdf

