Bayesian Hypothesis Testing
In a wide range of applications, one must make decisions based on a set of data. As
we’ve discussed, just a few examples among many include medical diagnosis, voice and
face recognition, DNA sequence analysis, autonomous vehicle navigation, and digital
communication. In general, the available data are noisy, incomplete, or otherwise
imperfect, and thus the decisions produced will not always be correct. However, we
would like to use a decision process that is as good as possible in an appropriate sense.
We adopt the Bayesian approach, which is among the oldest approaches to inference problems, and model the hypothesis as a (categorical) random variable, denoting it using $H$, which takes one of $M$ possible values $H_0, H_1, \ldots, H_{M-1}$.
Accordingly, in Bayesian hypothesis testing, a complete description of our models
consists of the a priori probabilities
$$p_H(H_m), \quad m = 0, 1, \ldots, M-1,$$
together with a characterization of the observed data under each hypothesis, which
takes the form of the conditional probability distributions⁸
$$p_{y|H}(\cdot \mid H_m), \quad m = 0, 1, \ldots, M-1. \tag{2.1}$$
We emphasize that in this basic version of the problem, the constituent distributions are all fully specified; e.g., there are no unknown parameters associated with any of them.⁹
Consistent with the notational conventions we have established, in the case of discrete-valued data, (2.1) denotes a collection of probability mass functions, while in the case of continuous-valued data it denotes a collection of probability density functions. In our development, the appropriate interpretation will be clear from context.¹⁰
We use $\mathcal{Y}$ to denote the set of possible observations. For example, we might have $\mathcal{Y} = \{\text{heads}, \text{tails}\}$, $\mathcal{Y} = \{\spadesuit, \heartsuit, \diamondsuit, \clubsuit\}$, or $\mathcal{Y} = \mathbb{Z}$ in the discrete case, or $\mathcal{Y} = [0, 1]$, $\mathcal{Y} = [0, 1)$, or $\mathcal{Y} = \mathbb{R}^k$ for some integer $k > 0$ in the continuous case. For generality,
we use vector notation for the data $y$ regardless of whether it is discrete- or continuous-valued, though obviously in the discrete case we can always equivalently interpret the
data as a scalar variable over a larger alphabet. In practice, the application often
dictates which abstraction is most natural.
Of course, a complete characterization of our knowledge of the correct hypothesis
based on our observations is the set of a posteriori probabilities
$$p_{H|y}(H_m \mid y), \quad m = 0, 1, \ldots, M-1. \tag{2.3}$$
⁸As related terminology, the function $p_{y|H}(y \mid \cdot)$, where $y$ is the observed data, is referred to as the likelihood function.
⁹This may be because the mechanism for generating the data is perfectly understood from physical laws, or because both distributions have been accurately learned by some (supervised) training procedure. As our development proceeds, we will relax these assumptions.
¹⁰It is also worth pointing out that an equivalent specification of the constituent models is in the form
$$\bigl\{ p_{H|y}(H_m \mid \cdot),\; p_y(\cdot) \bigr\}, \quad m \in \{0, 1, \ldots, M-1\}. \tag{2.2}$$
Our choice emphasizes an interpretation of the hypothesis testing problem as solving a kind of inverse problem in which $y$ is "generated" from $H$. By contrast, (2.2) emphasizes a "forward model" view, in which, e.g., a label $H$ is "generated" from $y$. Nevertheless, as you might hope, the solution to the problem is invariant to the choice of viewpoint.
The distribution of possible values of $H$ is often referred to as our belief about the hypothesis. From this perspective, we can view the a priori probabilities as our prior belief, and view (2.3) as the revision of our belief based on having observed the data $y$. The belief update is, of course, computed from the particular data $y$ based on the model via Bayes' Rule:
$$p_{H|y}(H_m \mid y) = \frac{p_{y|H}(y \mid H_m)\, p_H(H_m)}{\displaystyle\sum_{m'} p_{y|H}(y \mid H_{m'})\, p_H(H_{m'})}.$$
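As a concrete illustration, this belief update is a short computation in code. The following minimal sketch is hypothetical (the priors and likelihood values are chosen purely for illustration) and assumes only NumPy:

```python
import numpy as np

def posterior(priors, likelihoods):
    """Bayes' Rule: combine the priors p_H(H_m) with the likelihoods
    p_{y|H}(y|H_m), evaluated at the observed y, to get p_{H|y}(H_m|y)."""
    joint = np.asarray(priors) * np.asarray(likelihoods)  # numerator terms
    return joint / joint.sum()                            # normalizing sum over m'

# Hypothetical two-hypothesis example: priors (0.7, 0.3), and an observation
# whose likelihood is 0.2 under H0 and 0.5 under H1.
print(posterior([0.7, 0.3], [0.2, 0.5]))  # -> [0.4828 0.5172] (approximately)
```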
Example 2.1 (Coin Flipping). Suppose we are handed one of two coins, and then flip it and observe the outcome $y \in \{\text{heads}, \text{tails}\}$. Under hypothesis $H_0$ the coin is fair, while under hypothesis $H_1$ it is biased toward heads. Based on the outcome $y$, we need to decide whether we were given the fair or biased coin.
Example 2.2 (Detection of Averages). Suppose we are trying to distinguish whether
someone is reporting an average of two numbers or not. Specifically, under hypothesis
$H_0$ we observe a number $y = u \in \mathcal{Y} = [0, 1]$ that is uniformly distributed, i.e., $u \sim U([0, 1])$. But under hypothesis $H_1$ we observe the average of two such numbers, i.e., $y = (u + v)/2$, where $v \sim U([0, 1])$ with $u$ and $v$ independent. Then it is straightforward to verify from the convolution property of distributions of sums of independent random variables that
$$p_{y|H}(y \mid H_0) = 1, \qquad 0 \le y \le 1,$$
$$p_{y|H}(y \mid H_1) = 2\bigl(1 - |2y - 1|\bigr), \qquad 0 \le y \le 1.$$
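The triangular density under $H_1$ is easy to confirm empirically; the following sketch simulates $y = (u + v)/2$ and compares a histogram against $2(1 - |2y - 1|)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = (rng.uniform(size=n) + rng.uniform(size=n)) / 2   # y = (u + v)/2 under H1

# Compare the empirical histogram against the density 2(1 - |2y - 1|).
edges = np.linspace(0, 1, 21)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = (edges[:-1] + edges[1:]) / 2
density = 2 * (1 - np.abs(2 * centers - 1))
print(np.max(np.abs(hist - density)))  # small (on the order of 1e-2)
```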
Example 2.3 (Signal Detection in Gaussian Noise). Suppose that under hypothesis $H_m$ a known signal value $s_m$ is observed in additive zero-mean Gaussian noise of variance $\sigma^2$, where $s_1 > s_0$, so that
$$p_{y|H}(y \mid H_0) = N(y; s_0, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y - s_0)^2/(2\sigma^2)}$$
$$p_{y|H}(y \mid H_1) = N(y; s_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y - s_1)^2/(2\sigma^2)}. \tag{2.6}$$
In addition, if 0’s and 1’s are equally likely to be transmitted we would set the a priori
probabilities to
$$P_0 = P_1 = 1/2.$$
Figure 2.1: A decision rule partitions the observation space $\mathcal{Y}$ into decision regions $\mathcal{Y}_0$ and $\mathcal{Y}_1$.
A decision rule is a mapping $\hat{H} : \mathcal{Y} \to \{H_0, H_1\}$. Observe that choosing the function $\hat{H}(\cdot)$ is equivalent to partitioning the observation space $\mathcal{Y}$ into two disjoint decision regions, corresponding to the values of $y$ for which each of the two possible decisions is made. Specifically, we use $\mathcal{Y}_m$ to denote those values of $y \in \mathcal{Y}$ for which our rule decides $H_m$, i.e.,
$$\mathcal{Y}_0 = \bigl\{ y \in \mathcal{Y} : \hat{H}(y) = H_0 \bigr\} \quad \text{and} \quad \mathcal{Y}_1 = \bigl\{ y \in \mathcal{Y} : \hat{H}(y) = H_1 \bigr\}. \tag{2.7}$$
To rank candidate decision rules, we introduce a cost function $C(\cdot\,, \cdot)$, using $C_{ij} \triangleq C(H_j, H_i)$ to denote the cost of deciding that the hypothesis is $\hat{H} = H_i$ when the correct hypothesis is $H = H_j$. Then the optimum decision rule takes the form
$$\hat{H}(\cdot) = \operatorname*{arg\,min}_{f(\cdot)} \varphi(f),$$
where the expected cost, which is referred to as the Bayes risk, is
$$\varphi(f) \triangleq E\bigl[ C(H, f(y)) \bigr], \tag{2.10}$$
and where the expectation in (2.10) is over both y and H, and where f (·) is a decision
rule.
Generally, the application dictates an appropriate choice of the costs $C_{ij}$. For example, a symmetric cost function of the form $C_{ij} = 1 - \mathbb{1}_{i=j}$, i.e.,
$$C_{00} = C_{11} = 0, \qquad C_{01} = C_{10} = 1, \tag{2.11}$$
corresponds to seeking a decision rule that minimizes the probability of a decision
(i.e., classification) error. This is often referred to as the 0-1 loss function, and is used
when seeking to optimize a classifier’s accuracy. However, there are many applications
for which such symmetric cost functions are not well matched. For example, in a medical diagnosis problem where $H_0$ denotes the hypothesis that the patient does not have a particular disease and $H_1$ the hypothesis that they do, we would typically want to select cost assignments such that $C_{01} \gg C_{10}$.
Definition 2.1. A set of costs $\{C_{ij}\}$ is valid if the cost of a correct decision is lower than the cost of an incorrect decision, i.e., $C_{jj} < C_{ij}$ whenever $i \neq j$.
Theorem 2.1. Given a priori probabilities $P_0$ and $P_1$, data $y$, observation models $p_{y|H}(\cdot \mid H_0)$ and $p_{y|H}(\cdot \mid H_1)$, and valid costs $C_{00}$, $C_{01}$, $C_{10}$, $C_{11}$, the Bayesian decision rule takes the form
$$L(y) \triangleq \frac{p_{y|H}(y \mid H_1)}{p_{y|H}(y \mid H_0)} \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; \frac{P_0\,(C_{10} - C_{00})}{P_1\,(C_{01} - C_{11})} \triangleq \eta, \tag{2.12}$$
i.e., the decision is $H_1$ when $L(y) > \eta$, the decision is $H_0$ when $L(y) < \eta$, and the decision can be made arbitrarily when $L(y) = \eta$.¹¹

¹¹As such, the decision rule can also be expressed in the equivalent form $\hat{H}(y) = H_{\mathbb{1}_{L(y) > \eta}}$.
Some standard terminology is associated with rules of the form (2.12). In particular,
the left-hand side of (2.12) is referred to as the likelihood ratio, and thus (2.12) is
referred to as a likelihood ratio test (LRT).
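In code, an LRT is a direct transcription of (2.12). The sketch below is a minimal illustration rather than a fixed API: it takes the two likelihood functions, the priors, and a valid cost assignment, forms the threshold $\eta$, and compares. The Gaussian models used in the demonstration follow (2.6), with the illustrative values $s_0 = 0$, $s_1 = 1$, $\sigma = 1$.

```python
import math

def lrt(y, lik0, lik1, P0, P1, C00=0.0, C01=1.0, C10=1.0, C11=0.0):
    """Bayesian binary decision rule (2.12): decide H1 iff L(y) > eta.

    lik0 and lik1 are callables returning p_{y|H}(y|H0) and p_{y|H}(y|H1);
    the default costs are the 0-1 assignment (2.11)."""
    eta = (P0 * (C10 - C00)) / (P1 * (C01 - C11))  # threshold from priors/costs
    L = lik1(y) / lik0(y)                          # likelihood ratio
    return 1 if L > eta else 0

# Illustrative Gaussian models (2.6) with s0 = 0, s1 = 1, sigma = 1.
def gauss(y, s, sigma=1.0):
    return math.exp(-(y - s) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

print(lrt(0.7, lambda y: gauss(y, 0.0), lambda y: gauss(y, 1.0), 0.5, 0.5))  # -> 1
print(lrt(0.3, lambda y: gauss(y, 0.0), lambda y: gauss(y, 1.0), 0.5, 0.5))  # -> 0
```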
Proof of Theorem 2.1. Consider an arbitrary but fixed decision rule f (·). In terms of
this generic f (·), the Bayes risk can be expanded in the form
$$\varphi(f) = E\bigl[ C(H, f(y)) \bigr] = E\Bigl[ E\bigl[ C(H, f(y)) \bigm| y \bigr] \Bigr] = \int \tilde{\varphi}\bigl( f(y), y \bigr)\, p_y(y) \, dy, \tag{2.13}$$
with
$$\tilde{\varphi}(\hat{H}, y) \triangleq E\bigl[ C(H, \hat{H}) \bigm| y \bigr], \tag{2.14}$$
and where to obtain the second equality in (2.13) we have used iterated expectation. Note from (2.13) that since $p_y(y)$ is nonnegative, we minimize $\varphi(f)$ if we minimize $\tilde{\varphi}(f(y), y)$ for each particular value of $y$. Hence, we can determine the optimum decision rule $\hat{H}(\cdot)$ on a point-by-point basis, i.e., $\hat{H}(y)$ for each $y$.
Accordingly, let's consider a particular (observation) point $y = y^*$. For this point, if we choose the assignment
$$\hat{H}(y^*) = H_0,$$
then our conditional expectation (2.14) takes the value
$$\tilde{\varphi}(H_0, y^*) = C_{00}\, p_{H|y}(H_0 \mid y^*) + C_{01}\, p_{H|y}(H_1 \mid y^*). \tag{2.15}$$
Alternatively, if we choose the assignment
$$\hat{H}(y^*) = H_1,$$
then our conditional expectation (2.14) takes the value
$$\tilde{\varphi}(H_1, y^*) = C_{10}\, p_{H|y}(H_0 \mid y^*) + C_{11}\, p_{H|y}(H_1 \mid y^*). \tag{2.16}$$
Hence, the optimum assignment for the value $y^*$ is simply the choice corresponding to the smaller of (2.15) and (2.16). It is convenient to express this optimum decision rule using the following notation (now replacing our particular observation $y^*$ with a generic observation $y$):
$$C_{00}\, p_{H|y}(H_0 \mid y) + C_{01}\, p_{H|y}(H_1 \mid y) \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; C_{10}\, p_{H|y}(H_0 \mid y) + C_{11}\, p_{H|y}(H_1 \mid y). \tag{2.17}$$
Note that when the two sides of (2.17) are equal, then either assignment is equally good; both have the same effect on the objective function (2.13).
A minor rearrangement of the terms in (2.17) results in
$$(C_{01} - C_{11})\, p_{H|y}(H_1 \mid y) \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; (C_{10} - C_{00})\, p_{H|y}(H_0 \mid y). \tag{2.18}$$
Since for any valid choice of costs the terms in parentheses in (2.18) are both positive, we can equivalently write (2.18) in the form¹²
$$\frac{p_{H|y}(H_1 \mid y)}{p_{H|y}(H_0 \mid y)} \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; \frac{C_{10} - C_{00}}{C_{01} - C_{11}}, \tag{2.19}$$
which expresses the optimum decision rule in terms of our beliefs, i.e., of the a posteriori
probabilities
$$p_{H|y}(H_m \mid y) = \frac{p_{y|H}(y \mid H_m)\, P_m}{p_{y|H}(y \mid H_0)\, P_0 + p_{y|H}(y \mid H_1)\, P_1}, \quad m = 0, 1. \tag{2.20}$$
Finally, when we substitute (2.20) into (2.19) and multiply both sides by $P_0/P_1$, we obtain the decision rule in its final form (2.12), directly in terms of the candidate models.

¹²Technically, we have to be careful about dividing by zero here if $p_{H|y}(H_0 \mid y) = 0$. To simplify our exposition, however, as we discuss in Section 2.2.2, it is natural to restrict our attention to the case where $p_{H|y}(H_m \mid y) > 0$ for $m = 0, 1$.
Figure 2.2: An implementation of a likelihood ratio test that represents the Bayesian binary decision rule, i.e., a classifier.
Even when a decision rule is implemented via a neural network for practical reasons, the performance of the likelihood ratio test serves as a valuable performance bound, and thus its analysis is insightful. For the purposes of such analysis, it is also important to emphasize that $L = L(y)$ is a random variable, i.e., it takes on a different value in each experiment. As such, we will frequently be interested in evaluating its probability density function (or at least moments such as its mean and variance) under each of $H_0$ and $H_1$. Such densities can be derived using the usual method of events.
In carrying out such analysis, it follows immediately from its definition in (2.12) that, depending on the problem, some values of $y$ may lead to $L(y)$ being zero or infinite. In particular, the former occurs when $p_{y|H}(y \mid H_1) = 0$ but $p_{y|H}(y \mid H_0) > 0$, which is an indication that values in a neighborhood of $y$ effectively cannot occur under $H_1$ but can under $H_0$. In this case, there will be values of $y$ for which we'll effectively know with certainty that the correct hypothesis is $H_0$. When the likelihood ratio is infinite, corresponding to a division-by-zero scenario, an analogous situation exists, but with the roles of $H_0$ and $H_1$ reversed. These cases where such perfect decisions are possible are referred to as singular decision scenarios. In some practical problems, these scenarios do in fact occur. However, in other cases they suggest a potential lack of robustness in the data modeling, i.e., that some source of inherent uncertainty may be missing from the model. In any event, to simplify our development for the remainder of the topic we will largely restrict our attention to the case where $0 < L(y) < \infty$ for all $y$.
While the likelihood ratio distills the observed data into a single scalar for the purpose of making an optimum decision, the threshold $\eta$ for the test plays a complementary role, and serves to bias the decision making in a deliberate and precise manner. In particular, from (2.12) we see that $\eta$ distills the relevant details of the cost function and a priori probabilities into a single scalar. Furthermore, this information is combined in a manner that is intuitively satisfying. For example, as (2.12) also reflects, an increase in $P_0$ means that $H_0$ is more likely, so $\eta$ is increased to appropriately bias the test toward deciding $H_0$ for any particular observation. Similarly, an increase in $C_{10}$ means that deciding $H_1$ when $H_0$ is true is more costly, so $\eta$ is increased to appropriately bias the test toward deciding $H_0$ to offset this risk. Finally, note that adding a constant to the cost function (i.e., to all $C_{ij}$) has, as we would anticipate, no effect on the threshold. Hence, without loss of generality we may set at least one of the correct-decision costs, i.e., $C_{00}$ or $C_{11}$, to zero.
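To attach illustrative numbers to this behavior (these values are ours, for illustration only): under the 0-1 cost assignment (2.11), increasing the prior from $P_0 = 1/2$ to $P_0 = 0.9$ raises the threshold from $\eta = 1$ to $\eta = 0.9/0.1 = 9$; alternatively, keeping $P_0 = P_1 = 1/2$ but raising the cost to $C_{10} = 9$ likewise gives $\eta = (0.5 \times 9)/(0.5 \times 1) = 9$. Either way, the likelihood ratio must exceed $9$, i.e., the data must favor $H_1$ nine times more strongly than $H_0$, before the rule decides $H_1$.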
Finally, it is important to emphasize that the likelihood ratio test (2.12) indirectly determines the decision regions (2.7). In particular, we have
$$\mathcal{Y}_1 = \{ y \in \mathcal{Y} : L(y) > \eta \} \quad \text{and} \quad \mathcal{Y}_0 = \{ y \in \mathcal{Y} : L(y) < \eta \}.$$
As Fig. 2.1 suggests, while a decision rule expressed in the measurement data space $\mathcal{Y}$ can be complicated,¹⁵ (2.12) tells us that the observations can be transformed into a one-dimensional space defined via $L = L(y)$ where the decision regions have a particularly simple form: the decision $\hat{H}(L) = H_0$ is made whenever $L$ lies to the left of some point $\eta$ on the line, and $\hat{H}(L) = H_1$ whenever $L$ lies to the right. (Equivalently, such rules are sometimes expressed in terms of belief functions, rather than likelihood ratios, but each is a proxy for the other.)

¹⁵Indeed, neither of the respective sets $\mathcal{Y}_0$ and $\mathcal{Y}_1$ is even connected in general.
An important special case arises with the 0-1 cost assignment (2.11), for which the Bayes risk (2.10) is simply the probability of a decision error. The corresponding decision rule in this case can be obtained as a special case of (2.12).
Corollary 2.1. The minimum probability-of-error decision rule takes the form
$$\hat{H}(y) = \operatorname*{arg\,max}_{H_m,\; m \in \{0,1\}} p_{H|y}(H_m \mid y). \tag{2.23}$$
The rule (2.23), in which one chooses the hypothesis for which our belief is largest, is referred to as the maximum a posteriori (MAP) decision rule.
Proof. Instead of specializing (2.12), we specialize the equivalent test (2.18), from
which we obtain a form of the minimum probability-of-error test expressed in terms of
the a posteriori probabilities for the problem, viz.,
$$p_{H|y}(H_1 \mid y) \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; p_{H|y}(H_0 \mid y). \tag{2.24}$$
From (2.24) we see that the desired decision rule can be expressed in the form (2.23).
Still further simplification is possible when the hypotheses are equally likely ($P_0 = P_1 = 1/2$). In this case, we have the following.

Corollary 2.2. When the hypotheses are equally likely, the minimum probability-of-error decision rule takes the form
$$\hat{H}(y) = \operatorname*{arg\,max}_{H_m,\; m \in \{0,1\}} p_{y|H}(y \mid H_m). \tag{2.25}$$
The rule (2.25), which is referred to as the maximum likelihood (ML) decision rule,
chooses the hypothesis for which the corresponding likelihood function is largest.
Proof. Specializing (2.12) we obtain
$$\frac{p_{y|H}(y \mid H_1)}{p_{y|H}(y \mid H_0)} \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; 1, \tag{2.26}$$
or, equivalently,
$$p_{y|H}(y \mid H_1) \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; p_{y|H}(y \mid H_0),$$
whence (2.25).
Example 2.4 (Coin Flipping, Continued). Returning to our scenario of Example 2.1, suppose we are equally likely to have been handed either of the two coins. Then to minimize the probability of an error in deciding whether we flipped the fair or biased coin, we use the ML rule
$$L(y) \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; 1,$$
whence
$$\hat{H}(y) = \begin{cases} H_1 & y = \text{heads} \\ H_0 & y = \text{tails}. \end{cases}$$
The resulting error probability is then
$$P_e = P(\text{heads} \mid H_0)\, P_0 + P(\text{tails} \mid H_1)\, P_1 = \frac{1}{4} + \frac{1}{2}\, P(\text{tails} \mid H_1) < \frac{1}{2},$$
since the biased coin produces tails with probability less than $1/2$. Evidently, we can do at least a little better than just random guessing, which yields an error probability of $1/2$.
Figure 2.3: The likelihood ratio for the average detection problem of Example 2.2 (and its continuation Example 2.5).
Example 2.5 (Detection of Averages, Continued). Returning to the scenario of Example 2.2, suppose the two hypotheses are equally likely. Then the minimum probability-of-error decision rule is the ML rule
$$L(y) \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; 1,$$
where
$$L(y) = \frac{p_{y|H}(y \mid H_1)}{p_{y|H}(y \mid H_0)} = p_{y|H}(y \mid H_1) = 2\bigl(1 - |2y - 1|\bigr), \qquad 0 \le y \le 1,$$
as depicted in Fig. 2.3. We can equivalently express the decision rule in the form
$$\hat{H}(y) = \begin{cases} H_1 & 1/4 \le y \le 3/4 \\ H_0 & \text{otherwise}, \end{cases}$$
and the resulting probability of error is
$$\begin{aligned}
P_e &\triangleq P\bigl( \hat{H}(y) \neq H \bigr) \\
&= P\bigl( \hat{H}(y) = H_1 \bigm| H = H_0 \bigr) P_0 + P\bigl( \hat{H}(y) = H_0 \bigm| H = H_1 \bigr) P_1 \\
&= \frac{1}{2} P\bigl( L(y) \ge 1 \bigm| H = H_0 \bigr) + \frac{1}{2} P\bigl( L(y) < 1 \bigm| H = H_1 \bigr) \\
&= \frac{1}{2} \Bigl( P\bigl( y \in [1/4, 3/4] \bigm| H = H_0 \bigr) + P\bigl( y \in [0, 1/4) \cup (3/4, 1] \bigm| H = H_1 \bigr) \Bigr) \\
&= \frac{1}{2} \Biggl( \underbrace{\int_{1/4}^{3/4} 1 \, dy}_{=1/2} + \underbrace{\int_0^{1/4} 4y \, dy}_{=1/8} + \underbrace{\int_{3/4}^{1} 4(1 - y) \, dy}_{=1/8} \Biggr) = \frac{3}{8}.
\end{aligned}$$
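A quick Monte Carlo sanity check of this result (a sketch; the decision rule is hard-coded from the expression above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Draw equally likely hypotheses, then draw y accordingly.
h = rng.integers(0, 2, size=n)                    # true hypothesis
u, v = rng.uniform(size=n), rng.uniform(size=n)
y = np.where(h == 0, u, (u + v) / 2)              # H0: uniform; H1: average

h_hat = ((y >= 0.25) & (y <= 0.75)).astype(int)   # decide H1 on [1/4, 3/4]
print(np.mean(h_hat != h))                        # approx 3/8 = 0.375
```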
Example 2.6 (Signal Detection in Gaussian Noise, Continued). Returning to the scenario of Example 2.3, since the logarithm is monotonically increasing, we can equivalently compare the log-likelihood ratio $\ell(y) \triangleq \ln L(y)$ against $\ln \eta$, which with the models (2.6) gives
$$\ell(y) = \frac{1}{2\sigma^2} \bigl[ (y - s_0)^2 - (y - s_1)^2 \bigr] \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; \ln \eta. \tag{2.28}$$
Expanding the quadratics and canceling terms in (2.28), and using that $s_1 > s_0$, we obtain the test in the simple form
$$y \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; \frac{s_1 + s_0}{2} + \frac{\sigma^2 \ln \eta}{s_1 - s_0} \triangleq \gamma. \tag{2.29}$$
When $\eta = 1$, so that the test is the ML rule of (2.26), the threshold is $\gamma = (s_0 + s_1)/2$, and (2.29) is equivalent to the minimum-distance rule
$$|y - s_0| \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; |y - s_1|,$$
i.e., we decide in favor of whichever signal value is closer to the observation.
When, in addition, the hypotheses are equally likely, the resulting error probability is straightforward to compute. In particular, we have
$$\begin{aligned}
P_e &\triangleq P\bigl( \hat{H}(y) \neq H \bigr) \\
&= P\bigl( \hat{H}(y) = H_1 \bigm| H = H_0 \bigr) P_0 + P\bigl( \hat{H}(y) = H_0 \bigm| H = H_1 \bigr) P_1 \\
&= \frac{1}{2} P\bigl( L(y) \ge 1 \bigm| H = H_0 \bigr) + \frac{1}{2} P\bigl( L(y) < 1 \bigm| H = H_1 \bigr) \\
&= \frac{1}{2} \biggl[ P\Bigl( y \ge \frac{s_0 + s_1}{2} \Bigm| H = H_0 \Bigr) + P\Bigl( y < \frac{s_0 + s_1}{2} \Bigm| H = H_1 \Bigr) \biggr] \\
&= \frac{1}{2} \biggl[ P\Bigl( \frac{y - s_0}{\sigma} \ge \frac{s_1 - s_0}{2\sigma} \Bigm| H = H_0 \Bigr) + P\Bigl( \frac{y - s_1}{\sigma} < \frac{s_0 - s_1}{2\sigma} \Bigm| H = H_1 \Bigr) \biggr] \\
&= \frac{1}{2} \biggl[ Q\Bigl( \frac{s_1 - s_0}{2\sigma} \Bigr) + \Bigl( 1 - Q\Bigl( \frac{s_0 - s_1}{2\sigma} \Bigr) \Bigr) \biggr] && \text{(2.30)} \\
&= Q\Bigl( \frac{s_1 - s_0}{2\sigma} \Bigr) && \text{(2.31)} \\
&= Q\Bigl( \frac{d}{2} \Bigr), && \text{(2.32)}
\end{aligned}$$
where to obtain (2.30) we have used that (as introduced in the first installment of the notes)
$$Q(\alpha) \triangleq \frac{1}{\sqrt{2\pi}} \int_\alpha^\infty e^{-t^2/2} \, dt \tag{2.33}$$
is the area under the tail of the unit Gaussian density, where to obtain (2.31) we have used the symmetry relation
$$Q(\alpha) = 1 - Q(-\alpha), \tag{2.34}$$
which follows from (2.33), and where in (2.32) the parameter
$$d \triangleq \frac{s_1 - s_0}{\sigma} > 0$$
is a measure of the distance between the candidate signals, normalized by the noise level.¹⁶ Since $Q(\alpha) \le \frac{1}{2} e^{-\alpha^2/2}$ for $\alpha \ge 0$, we have $P_e = Q(d/2) \le \frac{1}{2} e^{-d^2/8}$, from which it follows that the error probability decays (at least) exponentially with the squared-distance.
¹⁶Later in the subject, we will develop a broader perspective on this measure of distance, and the associated geometry more generally, but we don't need that yet.
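Numerically, $Q(\cdot)$ can be evaluated via the complementary error function, using the identity $Q(\alpha) = \frac{1}{2}\operatorname{erfc}(\alpha/\sqrt{2})$; the sketch below tabulates the exact error probability (2.32) alongside the exponential bound for a few illustrative values of $d$.

```python
import math

def Q(alpha):
    """Tail of the standard Gaussian: Q(a) = 0.5 * erfc(a / sqrt(2))."""
    return 0.5 * math.erfc(alpha / math.sqrt(2))

for d in (1.0, 2.0, 4.0):
    Pe = Q(d / 2)                      # exact error probability (2.32)
    bound = 0.5 * math.exp(-d**2 / 8)  # exponential bound on Q(d/2)
    print(f"d = {d}: Pe = {Pe:.4e} <= {bound:.4e}")
```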
In Example 2.6, the decision regions in the $y$ coordinate have a particularly simple form:
$$\mathcal{Y}_0 = \{ y \in \mathbb{R} : y < \gamma \} \quad \text{and} \quad \mathcal{Y}_1 = \{ y \in \mathbb{R} : y > \gamma \}. \tag{2.35}$$
In other problems—even Gaussian ones—the decision regions can be more complicated,
as our final example illustrates.
Example 2.7. Suppose that a zero-mean Gaussian random variable has one of two possible variances, $\sigma_1^2$ or $\sigma_0^2$, where $\sigma_1^2 > \sigma_0^2$. Let the costs and prior probabilities be arbitrary. Then the likelihood ratio test for this problem takes the form
$$L(y) = \frac{\dfrac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-y^2/(2\sigma_1^2)}}{\dfrac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-y^2/(2\sigma_0^2)}} \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; \eta.$$
In this problem, it is a straightforward exercise to show that the test simplifies to one of the form
$$|y| \;\overset{\hat{H}(y)=H_1}{\underset{\hat{H}(y)=H_0}{\gtrless}}\; \sqrt{ \frac{2\sigma_0^2 \sigma_1^2}{\sigma_1^2 - \sigma_0^2} \ln\Bigl( \frac{\sigma_1}{\sigma_0}\, \eta \Bigr) } \triangleq \gamma.$$
Hence, the decision region $\mathcal{Y}_1$ is the union of two disconnected regions in this case, i.e.,
$$\mathcal{Y}_1 = \{ y \in \mathbb{R} : y > \gamma \} \cup \{ y \in \mathbb{R} : y < -\gamma \}.$$
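As a sanity check, the simplified two-sided test can be compared against a direct evaluation of the likelihood ratio; the following sketch uses the illustrative values $\sigma_0 = 1$, $\sigma_1 = 2$, and $\eta = 1$.

```python
import math

sigma0, sigma1, eta = 1.0, 2.0, 1.0  # illustrative values

def L(y):
    """Likelihood ratio of zero-mean Gaussians with variances sigma1^2, sigma0^2."""
    n1 = math.exp(-y**2 / (2 * sigma1**2)) / math.sqrt(2 * math.pi * sigma1**2)
    n0 = math.exp(-y**2 / (2 * sigma0**2)) / math.sqrt(2 * math.pi * sigma0**2)
    return n1 / n0

# Two-sided threshold gamma from the simplified test.
gamma = math.sqrt(2 * sigma0**2 * sigma1**2 / (sigma1**2 - sigma0**2)
                  * math.log(eta * sigma1 / sigma0))

for y in (-3.0, -1.0, 0.0, 1.0, 3.0):
    assert (L(y) > eta) == (abs(y) > gamma)  # the two tests agree
print(f"gamma = {gamma:.4f}")                # approx 1.3596 for these values
```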
2.4 Beyond Binary Classification

The Bayesian hypothesis testing framework we have developed naturally extends to the case of $M \ge 2$ hypotheses. As you might expect, the optimum decision rule for such $M$-ary hypothesis testing involves combining the outputs of multiple LRTs. However, what may not be apparent is that this rule requires only $M - 1$ features; e.g.,
$$L_m(y) \triangleq \frac{p_{y|H}(y \mid H_m)}{p_{y|H}(y \mid H_0)}, \quad m = 1, \ldots, M - 1,$$
i.e., for the purposes of a specific classification task, we need only have access to the $(M-1)$-dimensional vector representation $\bigl( L_1(y), \ldots, L_{M-1}(y) \bigr)$ of the data $y$. To see this most easily, it will help to develop more formally the concept of a sufficient statistic, to which we will return.
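To make this concrete, the following sketch (with entirely hypothetical discrete models) computes the $M - 1$ likelihood ratios and carries out the MAP decision using only those features together with the priors.

```python
import numpy as np

# Hypothetical models: M = 3 hypotheses and 4 possible discrete observations;
# row m holds the pmf p_{y|H}(. | H_m).
p_y_given_H = np.array([[0.40, 0.30, 0.20, 0.10],   # H0
                        [0.10, 0.20, 0.30, 0.40],   # H1
                        [0.25, 0.25, 0.25, 0.25]])  # H2
priors = np.full(3, 1 / 3)

def map_decision(y):
    """MAP rule computed from only the M-1 features L_1(y), ..., L_{M-1}(y)."""
    L = p_y_given_H[1:, y] / p_y_given_H[0, y]      # likelihood ratios vs. H0
    scores = np.concatenate(([1.0], L)) * priors    # proportional to posteriors
    return int(np.argmax(scores))

print([map_decision(y) for y in range(4)])  # -> [0, 0, 1, 1]
```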