Lecture Notes For ECE 695-09/08/03
$\forall P$, $\forall H$, $\forall \delta$, with probability $> 1 - \delta$, every $h \in H$ with $E_S(h) = 0$ satisfies
\[
E_P(h) < \frac{1}{m}\bigl(\ln(|H|) + \ln(1/\delta)\bigr), \tag{1}
\]
where $m = |S|$.
Equivalently: $\forall P$, $\forall H$, $\forall \epsilon$, $\forall \delta$, with probability $> 1 - \delta$, if some $h \in H$ has $E_S(h) = 0$ and
\[
m > \frac{1}{\epsilon}\bigl(\ln(|H|) + \ln(1/\delta)\bigr), \tag{2}
\]
then $E_P(h) < \epsilon$, where $m = |S|$.
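As a quick numeric illustration of (2) (a sketch in Python; the values of $|H|$, $\epsilon$, and $\delta$ below are arbitrary), the required sample size can be computed directly:

import math

def realizable_sample_size(H_size, eps, delta):
    """Smallest integer m with m > (1/eps) * (ln|H| + ln(1/delta)), as in (2)."""
    return math.floor((math.log(H_size) + math.log(1 / delta)) / eps) + 1

# Example: |H| = 2**20 hypotheses, eps = 0.05, delta = 0.01.
print(realizable_sample_size(2**20, 0.05, 0.01))  # about 370 samples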
We wish to analyze the error of hypotheses $h$ that do not necessarily have zero empirical error. The main tool that will be used is the Chernoff/Hoeffding bound.
Chernoff/Hoeffding Bound: Let $\{X_i\}_{i=1}^{m}$ be independent Bernoulli random variables, each with expectation $p$, $0 \le p \le 1$. Then
\[
\Pr\left[\,\Bigl|\frac{1}{m}\sum_{i=1}^{m} X_i - p\Bigr| > \epsilon\,\right] < 2 e^{-2 m \epsilon^2}.
\]
In our case, we use it as follows: Fix any $h : X \to \{0,1\}$ and let
\[
X_i =
\begin{cases}
1 & \text{if } h(x_i) \neq y_i,\\
0 & \text{otherwise.}
\end{cases}
\]
Note that $E(X_i) = E_P(h)$. By Hoeffding's inequality,
\[
\Pr\bigl[\,|E_S(h) - E_P(h)| > \epsilon\,\bigr] < 2 e^{-2 m \epsilon^2}.
\]
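The following small simulation (my own sketch; the error rate p_err standing in for $E_P(h)$ and all other parameters are made up) checks that the frequency of large deviations $|E_S(h) - E_P(h)| > \epsilon$ indeed stays below $2 e^{-2m\epsilon^2}$:

import numpy as np

rng = np.random.default_rng(0)
m, eps, p_err, trials = 200, 0.1, 0.3, 20000  # p_err plays the role of E_P(h)

# Each row is one draw of S: X_i = 1 iff h misclassifies the i-th example,
# so the row mean is the empirical error E_S(h).
X = rng.random((trials, m)) < p_err
empirical_errors = X.mean(axis=1)

observed = np.mean(np.abs(empirical_errors - p_err) > eps)
hoeffding = 2 * np.exp(-2 * m * eps**2)
print(f"observed deviation frequency: {observed:.4f}")
print(f"Hoeffding bound 2e^(-2m eps^2): {hoeffding:.4f}")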
It can also be shown that
\[
\Pr\bigl[\,E_P(h) - E_S(h) > \epsilon\,\bigr] < e^{-2 m \epsilon^2}. \tag{3}
\]
Exercise: Prove that $\forall P$, $\forall H$, $\forall m$, $\forall \delta$, with probability $> 1 - \delta$, for all $h \in H$,
\[
E_P(h) - E_S(h) < \sqrt{\frac{\ln(|H|) + \ln(1/\delta)}{2m}}, \tag{4}
\]
where $m = |S|$.
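For concreteness (again a sketch with arbitrary numbers), the right-hand side of (4) can be evaluated for a few sample sizes; since $m$ enters under a square root, halving the deviation requires roughly four times as much data:

import math

def uniform_deviation(H_size, m, delta):
    """Right-hand side of (4): sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

for m in (100, 400, 1600):
    print(m, round(uniform_deviation(2**20, m, 0.01), 4))
# The deviation roughly halves each time m is quadrupled.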
Bounds depending on description length: Given $H$, we fix a description of the hypotheses in $H$ as follows. Let $\overline{(\cdot)} : H \to \{0,1\}^*$ map each $h \in H$ to a binary string of finite length. Thus $\bar{h}$ is the string coding $h$, and let $|\bar{h}|$ denote the description length of $h$. Assume that the above binary mapping is prefix-free. Then, by Kraft's inequality,
\[
\sum_{h \in H} 2^{-|\bar{h}|} \le 1. \tag{5}
\]
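As a quick sanity check of (5) (a made-up example, not from the notes), the prefix-free code {0, 10, 110, 111} satisfies Kraft's inequality with equality:

# Lengths of the prefix-free codewords {0, 10, 110, 111}.
lengths = [1, 2, 3, 3]
kraft_sum = sum(2 ** (-l) for l in lengths)
print(kraft_sum)  # 0.5 + 0.25 + 0.125 + 0.125 = 1.0 <= 1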
Suppose that we have different error tolerances for different $h$; viz., let $\epsilon_h$ be the error tolerance for $h$. Set
\[
\epsilon_h = \sqrt{\frac{|\bar{h}| \ln 2 + \ln(1/\delta)}{2m}},
\]
and by (3) we get, for every fixed $h$,
\[
\Pr\bigl[\,E_P(h) - E_S(h) > \epsilon_h\,\bigr] < \delta \, 2^{-|\bar{h}|}.
\]
Using the union bound and Kraft's inequality (5) (the per-hypothesis failure probabilities $\delta\, 2^{-|\bar{h}|}$ sum to at most $\delta$), we can conclude that
\[
\Pr\bigl(\exists\, h \in H : E_P(h) - E_S(h) > \epsilon_h\bigr) < \delta,
\]
or equivalently: with probability at least $1 - \delta$, $\forall h \in H$, $E_P(h) - E_S(h) \le \epsilon_h$.
More generally, we can say that $\forall P$, $\forall H$, $\forall m$, $\forall \delta$, with probability $> 1 - \delta$, for all $h \in H$,
\[
E_P(h) - E_S(h) < \sqrt{\frac{|\bar{h}| \ln 2 + \ln(1/\delta)}{2m}}, \tag{6}
\]
where $m = |S|$.
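Bound (6) has an Occam's-razor flavor: hypotheses with shorter descriptions receive tighter guarantees. A sketch evaluating its right-hand side for a few (illustrative) description lengths:

import math

def description_length_bound(desc_len_bits, m, delta):
    """Right-hand side of (6): sqrt((|h-bar| ln 2 + ln(1/delta)) / (2m))."""
    return math.sqrt((desc_len_bits * math.log(2) + math.log(1 / delta)) / (2 * m))

m, delta = 1000, 0.01
for bits in (10, 100, 1000):
    print(bits, round(description_length_bound(bits, m, delta), 4))
# Short hypotheses (few bits) are penalized less than long ones.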
Bayesian Approach: We model our prior knowledge by a prior distribution over the hypotheses. This is equivalent to specifying a particular description of $H$. For example, for a dyadic (2-adic) distribution $q(h)$ over $H$ we can choose binary Huffman coding as the description mapping. Then $q(h) = 2^{-|\bar{h}|}$ and
\[
E_P(h) - E_S(h) < \sqrt{\frac{\ln(1/q(h)) + \ln(1/\delta)}{2m}}.
\]
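To connect the two views concretely (a small sketch with an arbitrary dyadic prior), note that $q(h) = 2^{-|\bar{h}|}$ gives $\ln(1/q(h)) = |\bar{h}| \ln 2$, so the Bayesian bound coincides with (6):

import math

# A dyadic prior: probabilities are negative powers of 2 and sum to 1, so the
# code lengths |h-bar| = -log2 q(h) are achievable by a prefix-free (Huffman) code.
prior = {"h1": 0.5, "h2": 0.25, "h3": 0.125, "h4": 0.125}

for h, q in prior.items():
    code_len = -math.log2(q)                       # |h-bar| in bits
    assert math.isclose(math.log(1 / q), code_len * math.log(2))
    print(h, int(code_len), round(math.log(1 / q), 4))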
Relations between different predictors: Note that there exists some $f : X \to \{0,1\}$ minimizing $E_P(\cdot)$. It is given by
\[
f(x) =
\begin{cases}
1 & \text{if } \dfrac{P(x,1)}{P(x,0)+P(x,1)} > 0.5,\\
0 & \text{otherwise.}
\end{cases} \tag{7}
\]
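A minimal sketch of (7): given a (made-up) joint distribution table P(x, y), the minimizer $f$ simply predicts the more probable label for each $x$:

# Hypothetical joint distribution P(x, y) over X = {a, b} and labels {0, 1}.
P = {("a", 0): 0.1, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.2}

def f(x):
    """Bayes-optimal predictor of (7): predict 1 iff P(y=1 | x) > 0.5."""
    p1 = P[(x, 1)] / (P[(x, 0)] + P[(x, 1)])
    return 1 if p1 > 0.5 else 0

print(f("a"), f("b"))  # 1 0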
Of course, since $P$ is unknown, $f$ is unknown. So let us constrain ourselves to a class $H$ of predictors. Let $\hat{h}, h^* \in H$ be such that, for all $h \in H$,
\[
E_S(\hat{h}) \le E_S(h), \tag{8}
\]
\[
E_P(h^*) \le E_P(h); \tag{9}
\]
that is, $\hat{h}$ minimizes the empirical error over $H$ and $h^*$ minimizes the true error over $H$.
Then it follows from our arguments that, with high probability,
\[
E_P(\hat{h}) < E_S(\hat{h}) + \epsilon \le E_S(h^*) + \epsilon < E_P(h^*) + 2\epsilon.
\]
Here the first inequality applies the Hoeffding-type bound to $\hat{h}$, the second uses (8), and the last applies the bound in the other direction to $h^*$, so that $E_S(h^*) < E_P(h^*) + \epsilon$. This gives us a handle on the difference between the true error of the samplewise optimal predictor $\hat{h}$ and that of the best predictor $h^*$ (in terms of minimizing the true error).
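Finally, a self-contained simulation of this comparison (the threshold hypothesis class, the noisy-label distribution, and $\delta = 0.05$ are all invented for illustration): draw $S$, pick the empirical minimizer $\hat{h}$, and compare its true error with that of the best-in-class predictor $h^*$:

import numpy as np

rng = np.random.default_rng(1)
m, delta = 500, 0.05

# Hypothesis class H: threshold classifiers h_t(x) = 1{x >= t} on [0, 1].
thresholds = np.linspace(0.0, 1.0, 21)

# Uniform deviation from (4) for this finite class (delta chosen arbitrarily).
eps = np.sqrt((np.log(len(thresholds)) + np.log(1 / delta)) / (2 * m))

# Data: x ~ Uniform[0, 1]; true label 1{x >= 0.37}, flipped with probability 0.1.
def sample(n):
    x = rng.random(n)
    y = (x >= 0.37).astype(int) ^ (rng.random(n) < 0.1)
    return x, y

def errors(x, y):
    """Misclassification rate of every h_t in H on the sample (x, y)."""
    preds = (x[None, :] >= thresholds[:, None]).astype(int)
    return (preds != y[None, :]).mean(axis=1)

xs, ys = sample(m)
h_hat = int(np.argmin(errors(xs, ys)))      # empirical minimizer of E_S

xt, yt = sample(200_000)                    # large test set approximates E_P
true_err = errors(xt, yt)
h_star = int(np.argmin(true_err))           # (approximate) true minimizer

print("E_P(h_hat)  ~", round(true_err[h_hat], 3))
print("E_P(h_star) ~", round(true_err[h_star], 3))
print("within 2*eps?", bool(true_err[h_hat] <= true_err[h_star] + 2 * eps))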