
Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science


6.437 Inference and Information
Spring 2022

2 Bayesian Hypothesis Testing

2.1 Hypothesis Testing and Classification
2.2 Binary Hypothesis Testing
    2.2.1 Optimum Decision Rules: The Likelihood Ratio Test
    2.2.2 Interpreting and Implementing the Likelihood Ratio Test
    2.2.3 Maximum A Posteriori and Maximum Likelihood Decision Rules
2.3 Decision-Making Examples
2.4 Beyond Binary Classification

In a wide range of applications, one must make decisions based on a set of data. As
we’ve discussed, just a few examples among many include medical diagnosis, voice and
face recognition, DNA sequence analysis, autonomous vehicle navigation, and digital
communication. In general, the available data are noisy, incomplete, or otherwise
imperfect, and thus the decisions produced will not always be correct. However, we
would like to use a decision process that is as good as possible in an appropriate sense.

2.1 Hypothesis Testing and Classification


The most basic version of such problems is the domain of decision theory, and
a natural framework is in terms of what is referred to as hypothesis testing. In
this framework, each of the possible scenarios corresponds to a hypothesis. When
there are M hypotheses (M ≥ 2) we denote the set of possible hypotheses using
H = {H0, H1, . . . , HM−1}.⁷ For each of the possible hypotheses, there is a different
model for the observed data, and this is what we exploit to distinguish among them.
In the language of learning, we refer to H as the model family or hypothesis class, and
the observed data as test data.
The hypothesis testing problem can be equivalently interpreted as one of classifi-
cation. In the language of classification, we refer to H as a collection of (class) labels,
and the goal is to determine the correct label for some (test) data.
In our formulation, the observed collection of data is represented as a random
vector y, which may be discrete- or continuous-valued. There are a variety of ways to
model the hypotheses. In this section, we follow what is referred to as the Bayesian
approach, which is among the oldest approaches to inference problems, and model the
hypothesis as a (categorical) random variable, denoting it using H.

⁷ Note that H0 is sometimes referred to as the "null" hypothesis, particularly in asymmetric
problems where it has special significance.
Accordingly, in Bayesian hypothesis testing, a complete description of our models
consists of the a priori probabilities
    pH(Hm),  m = 0, 1, . . . , M − 1,

together with a characterization of the observed data under each hypothesis, which
takes the form of the conditional probability distributions⁸

    py|H(·|Hm),  m = 0, 1, . . . , M − 1.    (2.1)

We emphasize that in this basic version of the problem, the constituent distributions
are all fully specified; e.g., there are no unknown parameters associated with any of
them.⁹
Consistent with the notational conventions we have established, in the case of
discrete-valued data, (2.1) denotes a collection of probability mass functions, while
in the case of continuous-valued data it denotes a collection of probability density
functions. In our development, the appropriate interpretation will be clear from
context.¹⁰
We use Y to denote the set of possible observations. For example, we might have
Y = {heads, tails}, Y = {♠, ♥, ♦, ♣}, or Y = Z in the discrete case, or Y = [0, 1],
Y = [0, ∞), or Y = R^k for some integer k > 0 in the continuous case. For generality,
we use vector notation for the data y regardless of whether it is discrete- or continuous-
valued, though obviously in the discrete case we can always equivalently interpret the
data as a scalar variable over a larger alphabet. In practice, the application often
dictates which abstraction is most natural.
Of course, a complete characterization of our knowledge of the correct hypothesis
based on our observations is the set of a posteriori probabilities
    pH|y(Hm|y),  m = 0, 1, . . . , M − 1.    (2.3)
The distribution of possible values of H is often referred to as our belief about the
hypothesis. From this perspective, we can view the a priori probabilities as our prior
belief, and view (2.3) as the revision of our belief based on having observed the data
y. The belief update is, of course, computed from the particular data y based on the
model via Bayes' Rule:

\[
p_{H|y}(H_m|y) = \frac{p_{y|H}(y|H_m)\, p_H(H_m)}{\displaystyle\sum_{m'} p_{y|H}(y|H_{m'})\, p_H(H_{m'})}.
\]

⁸ As related terminology, the function py|H(y|·), where y is the observed data, is referred to as the
likelihood function.

⁹ This may be because the mechanism for generating the data is perfectly understood from physical
laws, or because both distributions have been accurately learned by some (supervised) training
procedure. As our development proceeds, we will relax these assumptions.

¹⁰ It is also worth pointing out that an equivalent specification of the constituent models is in the
form

    pH|y(Hm|·), py(·),  m ∈ {0, 1, . . . , M − 1}.    (2.2)

Our choice emphasizes an interpretation of the hypothesis testing problem as solving a kind of
inverse problem in which y is "generated" from H. By contrast, (2.2) emphasizes a "forward model"
view, in which, e.g., a label H is "generated" from y. Nevertheless, as you might hope, the solution
to the problem is invariant to the choice of viewpoint.
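As an aside, the belief update is mechanical enough to capture in a few lines of code. The following is a minimal sketch (ours, not from the notes) for a discrete observation alphabet; the particular priors and likelihood tables are illustrative placeholders that anticipate the two-coin model of Example 2.1 below.

```python
# A minimal sketch of the belief update via Bayes' Rule, for a discrete
# observation alphabet. All model numbers below are illustrative.

def posterior(priors, likelihoods, y):
    """Return the beliefs p_{H|y}(H_m | y) for m = 0, ..., M-1, given
    priors[m] = p_H(H_m) and likelihoods[m][y] = p_{y|H}(y | H_m)."""
    joint = [p * lik[y] for p, lik in zip(priors, likelihoods)]
    evidence = sum(joint)               # the denominator in Bayes' Rule
    return [j / evidence for j in joint]

priors = [0.5, 0.5]
likelihoods = [{"heads": 1/2, "tails": 1/2},   # H0: fair coin
               {"heads": 2/3, "tails": 1/3}]   # H1: biased coin
print(posterior(priors, likelihoods, "heads"))  # -> [0.4285..., 0.5714...]
```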

While the belief is a complete characterization of our knowledge of the true


hypothesis, in applications one must often go further and make a decision—i.e., guess
the hypothesis—based on this information. To make a good decision (i.e., an intelligent
guess) we need some measure of goodness, appropriately chosen for the application of
interest. In the sequel, we develop a framework for such decision-making, restricting
our attention to the binary (M = 2) case to simplify the exposition.

2.2 Binary Hypothesis Testing


Specializing to the binary case, our model consists of two components. One is the set
of prior probabilities
    P0 = pH(H0)
    P1 = pH(H1) = 1 − P0.    (2.4)
The second is the observation model, corresponding to the likelihood functions
    H0 : py|H(y|H0)
    H1 : py|H(y|H1).    (2.5)
As will become apparent, the development is essentially the same whether the
observations are discrete or continuous. The continuous case differs only in that
summations are replaced by integrals, along with the associated additional mathematical
subtleties. When not otherwise specified, we will choose this case for our analysis.
We begin with some simple example scenarios to which we will return later.
Example 2.1 (Coin Flipping). Suppose we are handed one of two coins, flip it, and
observe the outcome y ∈ Y = {heads, tails}. One of the coins is fair, and the other
is biased. Specifically, the two mass functions are

\[
p_{y|H}(y|H_0) = \begin{cases} 1/2 & y = \text{heads} \\ 1/2 & y = \text{tails} \end{cases}
\qquad\quad
p_{y|H}(y|H_1) = \begin{cases} 2/3 & y = \text{heads} \\ 1/3 & y = \text{tails}. \end{cases}
\]

Based on the outcome y, we need to decide whether we were given the fair or biased
coin.

Example 2.2 (Detection of Averages). Suppose we are trying to distinguish whether
someone is reporting an average of two numbers or not. Specifically, under hypothesis
H0 we observe a number y = u ∈ Y = [0, 1] that is uniformly distributed, i.e.,
u ∼ U([0, 1]). But under hypothesis H1 we observe the average of two such numbers,
i.e., y = (u + v)/2, where v ∼ U([0, 1]) with u and v independent. Then it is
straightforward to verify from the convolution property of distributions of sums of
independent random variables that

\[
p_{y|H}(y|H_0) = 1, \qquad p_{y|H}(y|H_1) = 2\bigl(1 - |2y - 1|\bigr), \qquad 0 \le y \le 1.
\]

Based on our measurement y, we need to decide which hypothesis is correct.

Example 2.3 (Communicating a Bit). Suppose a single bit of information m ∈ {0, 1}
is encoded into a codeword sm and sent over a communication channel, where s0 and
s1 are both deterministic, known quantities, and where without loss of generality
we let s1 > s0. Let's further assume that the channel is noisy; specifically, what is
received is

    y = sm + w,

where w is a zero-mean Gaussian random variable with variance σ² and independent
of H. From this information, we can readily construct the probability density for the
observation under each of the hypotheses, obtaining

\[
p_{y|H}(y|H_0) = N(y; s_0, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-s_0)^2/(2\sigma^2)},
\qquad
p_{y|H}(y|H_1) = N(y; s_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-s_1)^2/(2\sigma^2)}.
\tag{2.6}
\]
In addition, if 0’s and 1’s are equally likely to be transmitted we would set the a priori
probabilities to
P0 = P1 = 1/2.
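For readers who like to experiment, here is a minimal simulation sketch of this observation model; the particular values of s0, s1, and σ are our own illustrative choices, not specified in the notes.

```python
# Sketch: simulate the noisy-channel observation model y = s_m + w of
# Example 2.3. The values of s0, s1, and sigma are illustrative only.
import random

s0, s1, sigma = -1.0, 1.0, 0.5
P0 = 0.5                                   # a priori probability of H0

def draw_observation():
    """Draw a bit m from the prior, then return (m, y) with y = s_m + w."""
    m = 0 if random.random() < P0 else 1
    w = random.gauss(0.0, sigma)           # zero-mean Gaussian noise
    return m, (s0 if m == 0 else s1) + w

print(draw_observation())                  # e.g., (1, 1.43...)
```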

2.2.1 Optimum Decision Rules: The Likelihood Ratio Test


The solution to a hypothesis test is specified in terms of a decision rule (or, as
equivalent terminology, a classifier). For now, we focus on deterministic decision rules.
In practice, such classifiers take many different forms, often governed by architectural
constraints. Indeed, modern neural networks for classification are ubiquitous examples.
In this section, we begin to understand the form of an optimum classifier, against
which any other can be measured.
We begin with a suitable abstraction of the problem. In particular, a decision rule
is a function Ĥ(·) that uniquely maps every possible observation y ∈ Y to one of the
two hypotheses, i.e., Ĥ : Y → H, where H = {H0, H1}. From this perspective, we see

[Figure 2.1: The regions Y0 and Y1 as defined in (2.7) corresponding to an example
decision rule Ĥ(·), where Y is the observation alphabet.]

that choosing the function Ĥ(·) is equivalent to partitioning the observation space Y
into two disjoint decision regions, corresponding to the values of y for which each of
the two possible decisions are made. Specifically, we use Ym to denote those values of
y ∈ Y for which our rule decides Hm, i.e.,

    Y0 = {y ∈ Y : Ĥ(y) = H0}
    Y1 = {y ∈ Y : Ĥ(y) = H1}.    (2.7)

These regions are depicted schematically in Fig. 2.1.


Our goal, then, is to design this bi-valued function (equivalently the associated
decision regions Y0 and Y1 ) in such a way that the best possible performance is
obtained. In order to do this, we need to be able to quantify the notion of “best.” This
requires that we have a well-defined objective function corresponding to a suitable
measure of goodness. In the Bayesian approach, we use an objective function taking
the form of an expected cost (or, equivalently, loss). Specifically, we use

    C(Hj, Hi) ≜ Cij    (2.8)

to denote the cost of deciding that the hypothesis is Ĥ = Hi when the correct
hypothesis is H = Hj . Then the optimum decision rule takes the form

    Ĥ(·) = arg min_{f(·)} φ(f),    (2.9)

where the expected cost, which is referred to as the Bayes risk, is

    φ(f) ≜ E[ C(H, f(y)) ],    (2.10)

and where the expectation in (2.10) is over both y and H, and where f(·) is a decision
rule.
Generally, the application dictates an appropriate choice of the costs Cij . For
example, a symmetric cost function of the form Cij = 1 − 1{i=j}, i.e.,

    C00 = C11 = 0
    C01 = C10 = 1,    (2.11)
corresponds to seeking a decision rule that minimizes the probability of a decision
(i.e., classification) error. This is often referred to as the 0-1 loss function, and is used
when seeking to optimize a classifier’s accuracy. However, there are many applications
for which such symmetric cost functions are not well-matched. For example, in a
medical diagnosis problem where H0 denotes the hypothesis that the patient does not
have a particular disease and H1 that he does, we would typically want to select cost
assignments such that C01 ≫ C10.
Definition 2.1. A set of costs {Cij } is valid if the cost of a correct decision is lower
than the cost of an incorrect decision, i.e., Cjj < Cij whenever i ≠ j.
Theorem 2.1. Given a priori probabilities P0, P1, data y, observation models py|H(·|H0),
py|H(·|H1), and valid costs C00, C01, C10, C11, the Bayesian decision rule takes the form

\[
L(y) \triangleq \frac{p_{y|H}(y|H_1)}{p_{y|H}(y|H_0)}
\;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\;
\frac{P_0\,(C_{10}-C_{00})}{P_1\,(C_{01}-C_{11})} \triangleq \eta,
\tag{2.12}
\]

i.e., the decision is H1 when L(y) > η, the decision is H0 when L(y) < η, and the
decision can be made arbitrarily when L(y) = η.¹¹

¹¹ As such, the decision rule can also be expressed in the equivalent form Ĥ(y) = H_{1[L(y)>η]}.
Some standard terminology is associated with rules of the form (2.12). In particular,
the left-hand side of (2.12) is referred to as the likelihood ratio, and thus (2.12) is
referred to as a likelihood ratio test (LRT).
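To make the structure of the LRT concrete, here is a minimal sketch of (2.12) in code, with the likelihoods supplied as callables. The function names and the trailing usage example (which plugs in the densities of Example 2.2) are ours; recall that Cij is the cost of deciding Hi when Hj is true.

```python
# Sketch of the likelihood ratio test (2.12). Cij = cost of deciding H_i
# when H_j is the correct hypothesis; the costs must be "valid".

def make_lrt(p0, p1, P0, P1, C00, C01, C10, C11):
    """Return a decision rule y -> 0 or 1 given likelihoods p0(y), p1(y)."""
    eta = (P0 * (C10 - C00)) / (P1 * (C01 - C11))   # precomputable threshold
    def H_hat(y):
        L = p1(y) / p0(y)        # likelihood ratio (assumed finite, nonzero)
        return 1 if L > eta else 0   # ties L == eta broken arbitrarily as H0
    return H_hat

# Usage with the densities of Example 2.2 and 0-1 costs (so eta = 1):
rule = make_lrt(p0=lambda y: 1.0, p1=lambda y: 2 * (1 - abs(2 * y - 1)),
                P0=0.5, P1=0.5, C00=0, C01=1, C10=1, C11=0)
print(rule(0.5), rule(0.9))      # -> 1 0
```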
Proof of Theorem 2.1. Consider an arbitrary but fixed decision rule f(·). In terms of
this generic f(·), the Bayes risk can be expanded in the form

\[
\varphi(f) = E\bigl[C(H, f(y))\bigr]
           = E\Bigl[E\bigl[C(H, f(y)) \,\big|\, y\bigr]\Bigr]
           = \int \tilde{\varphi}(f(y), y)\, p_y(y)\, dy,
\tag{2.13}
\]

with

\[
\tilde{\varphi}(\hat{H}, y) \triangleq E\bigl[C(H, \hat{H}) \,\big|\, y = y\bigr],
\tag{2.14}
\]
and where to obtain the second equality in (2.13) we have used iterated expectation.
Note from (2.13) that since py(y) is nonnegative, it is clear that we minimize φ if
we minimize φ̃(f(y), y) for each particular value of y. Hence, we can determine the
optimum decision rule Ĥ(·) on a point-by-point basis, i.e., Ĥ(y) for each y.
Accordingly, let's consider a particular (observation) point y = y*. For this point,
if we choose the assignment Ĥ(y*) = H0, then our conditional expectation (2.14)
takes the value

    φ̃(H0, y*) = C00 pH|y(H0|y*) + C01 pH|y(H1|y*).    (2.15)

Alternatively, if we choose the assignment Ĥ(y*) = H1, then our conditional
expectation (2.14) takes the value

    φ̃(H1, y*) = C10 pH|y(H0|y*) + C11 pH|y(H1|y*).    (2.16)
Hence, the optimum assignment for the value y* is simply the choice corresponding
to the smaller of (2.15) and (2.16). It is convenient to express this optimum decision
rule using the following notation (now replacing our particular observation y* with a
generic observation y):

\[
C_{00}\, p_{H|y}(H_0|y) + C_{01}\, p_{H|y}(H_1|y)
\;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\;
C_{10}\, p_{H|y}(H_0|y) + C_{11}\, p_{H|y}(H_1|y).
\tag{2.17}
\]

Note that when the two sides of (2.17) are equal, then either assignment is equally
good—both have the same effect on the objective function (2.13).
A minor rearrangement of the terms in (2.17) results in

\[
(C_{01} - C_{11})\, p_{H|y}(H_1|y)
\;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\;
(C_{10} - C_{00})\, p_{H|y}(H_0|y).
\tag{2.18}
\]

Since for any valid choice of costs the terms in parentheses in (2.18) are both positive,
we can equivalently write (2.18) in the form12
Ĥ(y)=H1
pH|y (H1 |y) (C10 C00 )
R . (2.19)
pH|y (H0 |y) Ĥ(y)=H0
(C01 C11 )
12
Technically, we have to be careful about dividing by zero here if pH|y (H0 |y) = 0. To simplify our
exposition, however, as we discuss in Section 2.2.2, it is natural to restrict our attention to the case
where pH|y (Hm |y) > 0 for m = 0, 1.

18
which expresses the optimum decision rule in terms of our beliefs, i.e., of the a posteriori
probabilities
py|H (y|Hm ) Pm
pH|y (Hm |y) = , m = 0, 1. (2.20)
py|H (y|H0 ) P0 + py|H (y|H1 ) P1

Finally, when we substitute (2.20) into (2.19) and multiply both sides by P0 /P1 ,
we obtain the decision rule in its final form (2.12), directly in terms of the candidate
models.

2.2.2 Interpreting and Implementing the Likelihood Ratio Test


The decision rule (2.12) has some remarkable structure. First, note that the likelihood
ratio L(y) is constructed exclusively from the observation model and the data.
Meanwhile, the right-hand side of (2.12), i.e., η, is a precomputable threshold that is
determined exclusively from the a priori probabilities and costs.
The likelihood ratio is an example of what is classically referred to as a statistic,
i.e., it is a real-valued function of the data. In the learning literature, functions of the
data are equivalently referred to as features, particularly when the data are continuous;
when the data are discrete, such functions are sometimes referred to as embeddings.
Frequently, it is convenient to combine k such functions together to form a single
vector-valued one, i.e., one mapping Y to R^k. In practice, neural networks and other
inference architectures compute such features from the data, and make decisions based
on them.
From this perspective, Theorem 2.1 establishes important results. First, in Bayesian
binary hypothesis testing, no loss in performance need be incurred by first computing
a feature from the data and then making a decision based on the result. Second, if we
choose this architecture, then a feature to use that will ensure no loss is the likelihood
ratio L(·). In essence, Theorem 2.1 is telling us that L(y) summarizes everything we
need to know about the data y in order to make the best possible decision about
H. Phrased differently, in terms of our ability to make the optimum decision (in the
Bayesian sense in this case), knowledge of L(y) is as good as knowledge of the full
data vector y itself.
In the language of statistics, L(y) is an example of what is referred to as a sufficient
statistic. While we will develop the notion of a sufficient statistic more precisely and
in greater generality in a subsequent section of the notes, for now we note some
additional properties. As one such property, sufficient statistics are in general not
unique, and this is the case for Bayesian binary hypothesis testing. For example,
any invertible function of L(y) is also a sufficient statistic. Indeed, we can always
rewrite the likelihood ratio test in the form

\[
\ell(y) = g(L(y)) \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; g(\eta),
\tag{2.21}
\]

[Figure 2.2: An implementation of a likelihood ratio test that represents the Bayesian
binary decision rule, i.e., a classifier.]

where g(·) is some suitably chosen, monotonically increasing function. An especially
important example is the case corresponding to g(·) = log(·). Not only does this
particular choice simplify and facilitate implementation of many tests involving
distributions with exponential factors,¹³ such as Gaussians, it also arises naturally in
information-theoretic analyses of such hypothesis testing, as we will develop later.
Even more significantly, Theorem 2.1 tells us that we don't need several features
of the data—one is enough, so long as we choose it to be the likelihood ratio L(·) (or
some invertible transformation thereof). Indeed, L(y) is scalar-valued, i.e., L : Y → R,
regardless of the alphabet size |Y| if the data are discrete, and regardless of the
dimension K if the data are continuous with Y ⊂ R^K. As a result, implementation
of the Bayesian binary classifier—i.e., decision rule for binary hypothesis testing—is
especially simple, as depicted in Fig. 2.2.
Compared to the classifier in Fig. 2.2, neural network classifiers, even when there
are only two classes, are generally much more complicated. For example, they typically
compute not one but in fact a very large number of features of the data in carrying
out classification. Ultimately, there is a good reason for this. In particular, a neural
network architecture is designed to be able to express a variety of different classifiers,
where the parameters of the network control which classifier is being implemented.
From this perspective, it needs to be able to effectively implement the likelihood ratio
function L(·) for any hypothesis testing problem of interest. However, the range of
possible classification tasks and associated likelihood ratio functions is enormous, so
an exact fit is generally not possible, and the network must effectively approximate
this function by optimizing its parameters appropriately. But once we realize we
cannot implement the likelihood ratio exactly, Theorem 2.1 no longer applies, and
in particular, incorporating additional features can improve performance over that
achieved by a single "approximate" likelihood ratio feature. Ultimately, the range
and quality of approximations a network can produce is governed by its architecture
and number of parameters, and is effectively measured by what is referred to as the
expressive power of the network.¹⁴
¹³ We will discuss an important such family of distributions—exponential families—in detail in a
subsequent section of the notes.

¹⁴ Typically, in the design of such networks, the focus is on the network's ability to express different
belief functions, rather than likelihood ratios, but each is a proxy for the other.
Even when implementing a decision rule via a neural network for practical reasons,
the performance of the likelihood ratio test serves as a valuable performance bound,
and thus its analysis is insightful. For the purposes of such analysis, it is also important
to emphasize that L = L(y) is a random variable—i.e., it takes on a different value in
each experiment. As such, we will frequently be interested in evaluating its probability
density function—or at least moments such as its mean and variance—under each of
H0 and H1. Such densities can be derived using the usual method of events.
In carrying out such analysis, it follows immediately from its definition in (2.12)
that, depending on the problem, some values of y may lead to L(y) being zero or
infinite. In particular, the former occurs when py|H(y|H1) = 0 but py|H(y|H0) > 0,
which is an indication that values in a neighborhood of y effectively cannot occur
under H1 but can under H0. In this case, there will be values of y for which we'll
effectively know with certainty that the correct hypothesis is H0. When the likelihood
ratio is infinite, corresponding to a division-by-zero scenario, an analogous situation
exists, but with the roles of H0 and H1 reversed. These cases where such perfect
decisions are possible are referred to as singular decision scenarios. In some practical
problems, these scenarios do in fact occur. However, in other cases they suggest a
potential lack of robustness in the data modeling, i.e., that some source of inherent
uncertainty may be missing from the model. In any event, to simplify our development
for the remainder of the topic we will largely restrict our attention to the case where
0 < L(y) < ∞ for all y.
While the likelihood ratio focuses the observed data into a single scalar for the
purpose of making an optimum decision, the threshold η for the test plays a com-
plementary role, and serves to bias the decision making in a deliberate and precise
manner. In particular, from (2.12) we see that η focuses the relevant details of the cost
function and a priori probabilities into a single scalar. Furthermore, this information is
combined in a manner that is intuitively satisfying. For example, as (2.12) also reflects,
an increase in P0 means that H0 is more likely, so that η is increased to appropriately
bias the test toward deciding H0 for any particular observation. Similarly, an increase
in C10 means that deciding H1 when H0 is true is more costly, so η is increased to
appropriately bias the test toward deciding H0 to offset this risk. Finally, note that
adding a constant to the cost function (i.e., to all Cij) has, as we would anticipate, no
effect on the threshold. Hence, without loss of generality we may set at least one of
the correct decision costs—i.e., C00 or C11—to zero.
Finally, it is important to emphasize that the likelihood ratio test (2.12) indirectly
determines the decision regions (2.7). In particular, we have

    Y0 = {y ∈ Y : Ĥ(y) = H0} = {y ∈ Y : L(y) < η}
    Y1 = {y ∈ Y : Ĥ(y) = H1} = {y ∈ Y : L(y) > η}.    (2.22)

As Fig. 2.1 suggests, while a decision rule expressed in the measurement data space Y
can be complicated,¹⁵ (2.12) tells us that the observations can be transformed into
a one-dimensional space defined via L = L(y) where the decision regions have a
particularly simple form: the decision Ĥ(L) = H0 is made whenever L lies to the left
of some point η on the line, and Ĥ(L) = H1 whenever L lies to the right.

2.2.3 Maximum A Posteriori and Maximum Likelihood Decision Rules


An important cost assignment for many problems is that given by (2.11), which as
we recall corresponds to a minimum probability-of-error (or, equivalently, maximum
accuracy) criterion. Indeed, in this case, we have

    φ(Ĥ) = P(Ĥ(y) = H0, H = H1) + P(Ĥ(y) = H1, H = H0).

The corresponding decision rule in this case can be obtained as a special case of
(2.12).

Corollary 2.1. The minimum probability-of-error decision rule takes the form

    Ĥ(y) = arg max_{H ∈ {H0, H1}} pH|y(H|y).    (2.23)

The rule (2.23), in which one chooses the hypothesis for which our belief is largest, is
referred to as the maximum a posteriori (MAP) decision rule.
Proof. Instead of specializing (2.12), we specialize the equivalent test (2.18), from
which we obtain a form of the minimum probability-of-error test expressed in terms of
the a posteriori probabilities for the problem, viz.,

\[
p_{H|y}(H_1|y) \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; p_{H|y}(H_0|y).
\tag{2.24}
\]

From (2.24) we see that the desired decision rule can be expressed in the form (2.23).
Still further simplification is possible when the hypotheses are equally likely
(P0 = P1 = 1/2). In this case, we have the following.

Corollary 2.2. When the hypotheses are equally likely, the minimum probability-of-
error decision rule takes the form

    Ĥ(y) = arg max_{H ∈ {H0, H1}} py|H(y|H).    (2.25)

The rule (2.25), which is referred to as the maximum likelihood (ML) decision rule,
chooses the hypothesis for which the corresponding likelihood function is largest.
¹⁵ Indeed, neither of the respective sets Y0 and Y1 is even connected in general.

Proof. Specializing (2.12) we obtain

\[
\frac{p_{y|H}(y|H_1)}{p_{y|H}(y|H_0)} \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; 1,
\tag{2.26}
\]

or, equivalently,

\[
p_{y|H}(y|H_1) \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; p_{y|H}(y|H_0),
\]

whence (2.25).
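The two corollaries are easy to state in code as well. The sketch below (our own, using the coin model of Example 2.1 as the test case) implements the MAP rule (2.23) and the ML rule (2.25) for a finite observation alphabet.

```python
# Sketch of the MAP rule (2.23) and the ML rule (2.25) for finite alphabets.

def map_rule(y, priors, likelihoods):
    """Maximize p_H(H_m) p_{y|H}(y|H_m), which is equivalent to (2.23)
    since the evidence p_y(y) is common to all hypotheses."""
    return max(range(len(priors)), key=lambda m: priors[m] * likelihoods[m][y])

def ml_rule(y, likelihoods):
    """Maximize the likelihood p_{y|H}(y|H_m), as in (2.25)."""
    return max(range(len(likelihoods)), key=lambda m: likelihoods[m][y])

likelihoods = [{"heads": 1/2, "tails": 1/2},    # H0: fair coin
               {"heads": 2/3, "tails": 1/3}]    # H1: biased coin
print(map_rule("heads", [0.9, 0.1], likelihoods))  # strong prior on H0 -> 0
print(ml_rule("heads", likelihoods))               # ML ignores the prior -> 1
```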

2.3 Decision-Making Examples


We conclude with some simple examples that illustrate the use of the Bayesian
hypothesis testing framework.

Example 2.4 (Coin Flipping, Continued). Returning to the scenario of Example 2.1,
suppose we are equally likely to have been handed either of the two coins. Then to
minimize the probability of an error in deciding whether we flipped the fair or biased
coin, we use the ML rule

\[
L(y) \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; 1,
\]

where the likelihood ratio is

\[
L(y) = \begin{cases} 4/3 & y = \text{heads} \\ 2/3 & y = \text{tails}, \end{cases}
\]

whence

\[
\hat{H}(y) = \begin{cases} H_1 & y = \text{heads} \\ H_0 & y = \text{tails}. \end{cases}
\]

The resulting error probability is then

\[
\begin{aligned}
P\bigl(\hat{H}(y) \ne H\bigr)
&= P\bigl(\hat{H}(y) = H_1 \,\big|\, H = H_0\bigr) P_0 + P\bigl(\hat{H}(y) = H_0 \,\big|\, H = H_1\bigr) P_1 \\
&= \tfrac{1}{2} P\bigl(L(y) \ge 1 \,\big|\, H = H_0\bigr) + \tfrac{1}{2} P\bigl(L(y) < 1 \,\big|\, H = H_1\bigr) \\
&= \tfrac{1}{2} \Bigl( \underbrace{P(y = \text{heads} \mid H = H_0)}_{=1/2} + \underbrace{P(y = \text{tails} \mid H = H_1)}_{=1/3} \Bigr) = \frac{5}{12}.
\end{aligned}
\]

Evidently, we can do at least a little better than just random guessing, which yields
an error probability of 1/2.
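A quick enumeration (ours, not part of the notes) confirms the 5/12 figure:

```python
# Direct enumeration of the error probability in Example 2.4.
pmf = [{"heads": 1/2, "tails": 1/2},    # H0: fair coin
       {"heads": 2/3, "tails": 1/3}]    # H1: biased coin
priors = [1/2, 1/2]
H_hat = {"heads": 1, "tails": 0}        # the ML rule derived above

Pe = sum(priors[m] * pmf[m][y]
         for m in (0, 1) for y in ("heads", "tails") if H_hat[y] != m)
print(Pe, 5/12)                         # both print 0.4166...
```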

[Figure 2.3: The likelihood ratio for the average detection problem of Example 2.2
(and its continuation Example 2.5).]

Example 2.5 (Detection of Averages, Continued). Returning to the scenario of
Example 2.2, suppose that the two hypotheses are equally likely. Then the minimum
probability-of-error rule for deciding whether we are observing an average or not is
given by the likelihood ratio test (or ML decision rule)

\[
L(y) \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; 1,
\]

where

\[
L(y) = \frac{p_{y|H}(y|H_1)}{p_{y|H}(y|H_0)} = p_{y|H}(y|H_1) = 2\bigl(1 - |2y - 1|\bigr), \qquad 0 \le y \le 1,
\]

as depicted in Fig. 2.3. We can equivalently express the decision rule in the form

\[
\hat{H}(y) = \begin{cases} H_1 & 1/4 \le y \le 3/4 \\ H_0 & \text{otherwise}, \end{cases}
\]

using which we obtain the resulting error probability as

\[
\begin{aligned}
P_e &\triangleq P\bigl(\hat{H}(y) \ne H\bigr) \\
&= P\bigl(\hat{H}(y) = H_1 \,\big|\, H = H_0\bigr) P_0 + P\bigl(\hat{H}(y) = H_0 \,\big|\, H = H_1\bigr) P_1 \\
&= \tfrac{1}{2} P\bigl(L(y) \ge 1 \,\big|\, H = H_0\bigr) + \tfrac{1}{2} P\bigl(L(y) < 1 \,\big|\, H = H_1\bigr) \\
&= \tfrac{1}{2} \Bigl( P\bigl(y \in [1/4, 3/4] \,\big|\, H = H_0\bigr) + P\bigl(y \in [0, 1/4) \cup (3/4, 1] \,\big|\, H = H_1\bigr) \Bigr) \\
&= \tfrac{1}{2} \biggl( \underbrace{\int_{1/4}^{3/4} 1\, dy}_{=1/2} + \underbrace{\int_0^{1/4} 4y\, dy}_{=1/8} + \underbrace{\int_{3/4}^{1} 4(1-y)\, dy}_{=1/8} \biggr) = \frac{3}{8}.
\end{aligned}
\]
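As a sanity check on the 3/8 figure, here is a short Monte Carlo sketch (ours) of the experiment:

```python
# Monte Carlo check of the 3/8 error probability in Example 2.5.
import random

def trial():
    if random.random() < 0.5:                 # H0: a single uniform number
        y, truth = random.random(), 0
    else:                                     # H1: the average of two
        y, truth = (random.random() + random.random()) / 2, 1
    decision = 1 if 0.25 <= y <= 0.75 else 0  # the rule derived above
    return decision != truth

n = 200_000
print(sum(trial() for _ in range(n)) / n)     # approximately 0.375
```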

Example 2.6 (Communicating a Bit, Continued). Continuing with Example 2.3, we
obtain from (2.6) that the likelihood ratio test for this problem takes the form

\[
L(y) = \frac{\dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-s_1)^2/(2\sigma^2)}}{\dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-s_0)^2/(2\sigma^2)}}
\;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; \eta.
\tag{2.27}
\]

As (2.27) suggests—and as is generally the case in Gaussian problems—the natural
logarithm of the likelihood ratio is a more convenient sufficient statistic to work with
in this example. In this case, taking logarithms of both sides of (2.27) yields

\[
\ell(y) = \frac{1}{2\sigma^2}\Bigl[(y - s_0)^2 - (y - s_1)^2\Bigr]
\;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; \ln \eta.
\tag{2.28}
\]

Expanding the quadratics and canceling terms in (2.28), and using that s1 > s0, we
obtain the test in the simple form

\[
y \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\;
\frac{s_1 + s_0}{2} + \frac{\sigma^2 \ln \eta}{s_1 - s_0} \triangleq \gamma.
\tag{2.29}
\]

Note that with a minimum probability-of-error criterion, if P0 = P1 then ln η = 0
and we see immediately from (2.28) that the optimum test takes the form

\[
|y - s_0| \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; |y - s_1|,
\]

which corresponds to a "minimum-distance" decision rule, i.e.,

\[
\hat{H}(y) = H_{\hat{m}}, \qquad \hat{m} = \arg\min_{m \in \{0,1\}} |y - s_m|.
\]

This minimum-distance property turns out to hold in multidimensional Gaussian


problems as well, and leads to convenient analysis in terms of Euclidean geometry.
The error probability associated with the optimum decision rule (2.29) of Ex-
ample 2.6 is naturally expressed in “standard form” using Q-function notation. In

particular, we have

\[
\begin{aligned}
P_e &\triangleq P\bigl(\hat{H}(y) \ne H\bigr) \\
&= P\bigl(\hat{H}(y) = H_1 \,\big|\, H = H_0\bigr) P_0 + P\bigl(\hat{H}(y) = H_0 \,\big|\, H = H_1\bigr) P_1 \\
&= \tfrac{1}{2} P\bigl(L(y) \ge 1 \,\big|\, H = H_0\bigr) + \tfrac{1}{2} P\bigl(L(y) < 1 \,\big|\, H = H_1\bigr) \\
&= \tfrac{1}{2}\biggl[ P\Bigl(y \ge \frac{s_0 + s_1}{2} \,\Big|\, H = H_0\Bigr) + P\Bigl(y < \frac{s_0 + s_1}{2} \,\Big|\, H = H_1\Bigr) \biggr] \\
&= \tfrac{1}{2}\biggl[ P\Bigl(\frac{y - s_0}{\sigma} \ge \frac{s_1 - s_0}{2\sigma} \,\Big|\, H = H_0\Bigr) + P\Bigl(\frac{y - s_1}{\sigma} < \frac{s_0 - s_1}{2\sigma} \,\Big|\, H = H_1\Bigr) \biggr] \\
&= \tfrac{1}{2}\biggl[ Q\Bigl(\frac{s_1 - s_0}{2\sigma}\Bigr) + \Bigl(1 - Q\Bigl(\frac{s_0 - s_1}{2\sigma}\Bigr)\Bigr) \biggr] && (2.30) \\
&= Q\Bigl(\frac{s_1 - s_0}{2\sigma}\Bigr) && (2.31) \\
&= Q\Bigl(\frac{d}{2}\Bigr), && (2.32)
\end{aligned}
\]

where to obtain (2.30) we have used that (as introduced in the first installment of the
notes)

\[
Q(\alpha) \triangleq \frac{1}{\sqrt{2\pi}} \int_{\alpha}^{\infty} e^{-t^2/2}\, dt
\tag{2.33}
\]

is the area under the tail of the unit Gaussian density, where to obtain (2.31) we have
used the symmetry relation

\[
Q(\alpha) = 1 - Q(-\alpha),
\tag{2.34}
\]

which follows from (2.33), and where in (2.32) the parameter

\[
d \triangleq \frac{s_1 - s_0}{\sigma} > 0
\]

is a natural measure of "signal-to-noise ratio" or, more specifically, of "distance" between
the distributions characterizing the hypotheses.¹⁶
on d is shown in Fig. 2.4. As we would expect, larger values of d correspond to smaller
error probabilities. In fact, using geometry we will develop later, it can be shown that
\[
Q(\alpha) \le e^{-\alpha^2/2}, \qquad \alpha > 0,
\]

from which it follows that the error probability decays (at least) exponentially with
the squared-distance.
¹⁶ Later in the subject, we will develop a broader perspective on this measure of distance, and the
associated geometry more generally, but we don't need that yet.
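Numerically, Q(·) is available through the complementary error function via the identity Q(α) = erfc(α/√2)/2, so (2.32) and the exponential bound above can be evaluated with the Python standard library alone; the signal values below are our own illustrative choices.

```python
# Evaluate Pe = Q(d/2) from (2.32), using Q(a) = erfc(a / sqrt(2)) / 2.
from math import erfc, exp, sqrt

def Q(alpha):
    return 0.5 * erfc(alpha / sqrt(2.0))

s0, s1, sigma = -1.0, 1.0, 0.5          # illustrative values only
d = (s1 - s0) / sigma                   # "distance" between the hypotheses
print(Q(d / 2))                         # Pe of the optimum rule, ~0.0228
print(exp(-(d / 2) ** 2 / 2))           # the e^{-alpha^2/2} bound, ~0.135
```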

[Figure 2.4: The dependence of error probability Pe on the "distance" d = (s1 − s0)/σ
between Gaussian hypotheses N(s0, σ²) and N(s1, σ²) in Example 2.3.]

In Example 2.6, the decision regions in the y coordinate have a particularly simple
form:

    Y0 = {y ∈ R : y < γ}  and  Y1 = {y ∈ R : y > γ}.    (2.35)

In other problems—even Gaussian ones—the decision regions can be more complicated,
as our final example illustrates.

Example 2.7. Suppose that a zero-mean Gaussian random variable has one of two
possible variances, σ1² or σ0², where σ1² > σ0². Let the costs and prior probabilities be
arbitrary. Then the likelihood ratio test for this problem takes the form

\[
L(y) = \frac{\dfrac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-y^2/(2\sigma_1^2)}}{\dfrac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-y^2/(2\sigma_0^2)}}
\;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\; \eta.
\]

In this problem, it is a straightforward exercise to show that the test simplifies to one
of the form

\[
|y| \;\mathop{\gtrless}_{\hat{H}(y)=H_0}^{\hat{H}(y)=H_1}\;
\sqrt{\frac{2\sigma_0^2\sigma_1^2}{\sigma_1^2 - \sigma_0^2}\, \ln\Bigl(\eta\,\frac{\sigma_1}{\sigma_0}\Bigr)} \triangleq \gamma.
\]

Hence, the decision region Y1 is the union of two disconnected regions in this case, i.e.,

    Y1 = {y ∈ R : y > γ} ∪ {y ∈ R : y < −γ}.
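A small sketch (ours; all numbers illustrative) of this two-sided test:

```python
# Sketch of the two-variance Gaussian test of Example 2.7. Note that the
# threshold gamma is real only when eta * sigma1 / sigma0 >= 1.
from math import log, sqrt

sigma0, sigma1, eta = 1.0, 2.0, 1.0     # requires sigma1 > sigma0
gamma = sqrt(2 * sigma0**2 * sigma1**2 / (sigma1**2 - sigma0**2)
             * log(eta * sigma1 / sigma0))

def H_hat(y):
    """Decide H1 on the disconnected region |y| > gamma."""
    return 1 if abs(y) > gamma else 0

print(round(gamma, 3), H_hat(0.5), H_hat(3.0))   # -> 1.36 0 1
```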

2.4 Beyond Binary Classification
The Bayesian hypothesis testing framework we have developed naturally extends to
the case of M ≥ 2 hypotheses. As you might expect, the optimum decision rule
for such M-ary hypothesis testing involves combining the outputs of multiple LRTs.
However, what may not be apparent is that this rule only requires M − 1 features;
e.g.,

\[
L_m(y) \triangleq \frac{p_{y|H}(y|H_m)}{p_{y|H}(y|H_0)}, \qquad m = 1, \ldots, M-1,
\]

i.e., for the purposes of a specific classification task, we need only have access to the
(M − 1)-dimensional vector representation (L1(y), . . . , LM−1(y)) of the data y. To see
why, it is easiest to first develop the concept of a sufficient statistic more formally, to
which we will return.
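To preview what this looks like, the following sketch (ours, with made-up numbers) computes the minimum probability-of-error M-ary decision from only the M − 1 likelihood ratios and the priors, using the fact that maximizing the posterior is equivalent to maximizing Pm·Lm(y) once the common factor py|H(y|H0) is divided out.

```python
# Sketch: M-ary MAP decision from the (M-1)-dimensional feature
# [L_1(y), ..., L_{M-1}(y)], with L_0(y) = 1 by convention. Decide
# argmax_m P_m * L_m(y); the numbers below are made up for illustration.

def mary_map(ratios, priors):
    """ratios = [L_1(y), ..., L_{M-1}(y)], priors = [P_0, ..., P_{M-1}]."""
    scores = [priors[0]] + [P * L for P, L in zip(priors[1:], ratios)]
    return max(range(len(scores)), key=scores.__getitem__)

print(mary_map([2.0, 0.5], [0.2, 0.3, 0.5]))   # argmax(0.2, 0.6, 0.25) -> 1
```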
