An Introduction To Bayesian Statistics
Harvey Thornburg
Center for Computer Research in Music and Acoustics (CCRMA)
Department of Music, Stanford University
Stanford, California 94305
Bayesian Parameter Estimation
One way to indicate “prior information” about θ is simply to include these past trials in our estimate:

$$\hat{\theta} = \frac{10 + H}{10 + H + 20 + T}$$
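For instance (with hypothetical new counts of my own, not from the original slides), suppose the new data contain H = 5 heads and T = 15 tails:

```latex
\hat{\theta} = \frac{10 + 5}{10 + 5 + 20 + 15} = \frac{15}{50} = 0.3
```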
• Suppose that, due to a computer crash, we have lost the details of the experiment, and our memory has also failed (due to lack of sleep), so that we forget even the number of heads and tails (which are the sufficient statistics for the Bernoulli distribution). However, we believe the probability of heads is about 1/3, though this probability itself is somewhat uncertain, since we only performed 30 trials.
• In short, we claim to have a prior distribution over the
probability θ, which represents our prior belief.
Suppose this distribution is P(θ) and P(θ) ∼ Beta(10, 20):
$$g(\theta) = \frac{\theta^{9}(1-\theta)^{19}}{\int_0^1 \theta^{9}(1-\theta)^{19}\, d\theta}$$

[Figure: the Beta(10, 20) prior density g(θ) plotted over θ ∈ [0, 1]]
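As a quick illustration (my own sketch, not part of the original slides), the following evaluates this Beta(10, 20) prior numerically with SciPy:

```python
import numpy as np
from scipy import stats

# Beta(10, 20) prior over the probability of heads, theta.
# Its mean, 10 / (10 + 20) = 1/3, matches the prior belief above.
prior = stats.beta(a=10, b=20)

theta = np.linspace(0.0, 1.0, 501)
density = prior.pdf(theta)            # g(theta) evaluated on a grid

print("prior mean:", prior.mean())    # 0.333...
print("prior std :", prior.std())     # nonzero spread: only ~30 "prior trials"
```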
• We update to the posterior distribution P(θ | y_{1:H+T}) according to Bayes’ Rule:

$$P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)} = \frac{P(y \mid \theta)\, P(\theta)}{\int P(y \mid \theta)\, P(\theta)\, d\theta}$$
The term P(y|θ) is, as before, the likelihood function of θ. The marginal P(y) is obtained by integrating out θ:
$$P(y) = \int_0^1 P(y \mid \theta)\, P(\theta)\, d\theta$$
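A small numerical check (my own sketch; the counts H and T are hypothetical): applying Bayes’ Rule on a grid reproduces the conjugate closed-form posterior Beta(10 + H, 20 + T).

```python
import numpy as np
from scipy import stats

H, T = 7, 13                                  # hypothetical new heads and tails
theta = np.linspace(1e-6, 1 - 1e-6, 2001)
dtheta = theta[1] - theta[0]

prior = stats.beta(10, 20).pdf(theta)         # P(theta)
likelihood = theta**H * (1 - theta)**T        # P(y | theta), up to a constant

# Bayes' Rule: posterior = likelihood * prior / P(y), with P(y) by numerical integration
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)

# Conjugacy: the same posterior in closed form
closed_form = stats.beta(10 + H, 20 + T).pdf(theta)
print("max abs difference:", np.max(np.abs(posterior - closed_form)))  # very small
```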
• Key Point: The “output” of the Bayesian analysis is
not a single estimate of θ, but rather the entire
posterior distribution. The posterior distribution
summarizes all our “information” about θ. As we get
more data, if the samples are truly iid, the posterior
distribution will become more sharply peaked about a
single value.
• Of course, we can use this distribution to make inferences about θ. Suppose an “oracle” knows the true value of θ used to generate the samples, and we want to guess θ so as to minimize the mean squared error between our guess and the true value. (This is the same criterion by which we typically judge the maximum likelihood estimate.) We would choose the mean of the posterior distribution, because we know the conditional mean minimizes mean squared error.
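To spell out the last step (my addition): for any guess a, the posterior expected squared error is minimized at the posterior mean, since

```latex
\frac{\partial}{\partial a}\, E\!\left[(\theta - a)^2 \mid y_{1:N}\right]
  = -2\left(E[\theta \mid y_{1:N}] - a\right) = 0
  \;\Longrightarrow\; a = E[\theta \mid y_{1:N}].
```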
• Let our prior be Beta(H0, T0). Then

$$\hat{\theta} = E(\theta \mid y_{1:N}) = \frac{H_0 + H}{H_0 + H + T_0 + T}$$
• In the same way, we can do prediction: what is P(y_{N+1} = 1 | y_{1:N})?
$$\begin{aligned}
P(y_{N+1} = 1 \mid y_{1:N}) &= \int P(y_{N+1} = 1 \mid \theta, y_{1:N})\, P(\theta \mid y_{1:N})\, d\theta \\
&= \int P(y_{N+1} = 1 \mid \theta)\, P(\theta \mid y_{1:N})\, d\theta \\
&= \int \theta\, P(\theta \mid y_{1:N})\, d\theta \\
&= E(\theta \mid y_{1:N}) \\
&= \frac{H_0 + H}{H_0 + H + T_0 + T}
\end{aligned}$$
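A one-line numerical version of these formulas (my own sketch; H0, T0, H, T are placeholder counts):

```python
# Beta(H0, T0) prior plus H heads and T tails observed.
H0, T0 = 10, 20        # prior "pseudo-counts"
H, T = 7, 13           # hypothetical observed data

theta_hat = (H0 + H) / (H0 + H + T0 + T)   # posterior mean E(theta | y_1:N)
p_next_head = theta_hat                    # predictive P(y_{N+1} = 1 | y_1:N)

print(theta_hat)                           # 0.34 for these counts
```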
Bayesian Hypothesis Testing
• We optimize the tradeoff by comparing the likelihood
ratio to a nonnegative threshold, say exp(T ) > 0:
$$D^*(y_{1:N}) = \mathbf{1}\!\left\{\frac{f_{\theta_1}(y)}{f_{\theta_0}(y)} > \exp(T)\right\}$$
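As a concrete (assumed) instance, here is a sketch of this likelihood ratio test for Bernoulli data; the function name and the particular θ0, θ1 values are mine, not from the slides.

```python
import numpy as np

def bernoulli_lr_test(y, theta0, theta1, T):
    """Return D*(y_1:N) = 1 if f_theta1(y)/f_theta0(y) > exp(T), else 0."""
    y = np.asarray(y, dtype=bool)
    heads, tails = y.sum(), y.size - y.sum()
    log_lr = (heads * np.log(theta1 / theta0)
              + tails * np.log((1 - theta1) / (1 - theta0)))
    return int(log_lr > T)          # compare log-LR with T = log(threshold)

rng = np.random.default_rng(0)
y = rng.random(100) < 1/3           # data generated under theta = 1/3
print(bernoulli_lr_test(y, theta0=1/3, theta1=1/2, T=0.0))  # typically 0 here
```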
• When f, g belong to the same parametric family, we
adopt the shorthand K(θ0, θ1) rather than K(f_{θ0}, f_{θ1}). Then we have an additional fact. When
hypotheses are “close”, K-L distance behaves
approximately like the square of the Euclidean metric
in parameter (θ)-space. Specifically:
$$2K(\theta_0, \theta_1) \approx (\theta_1 - \theta_0)'\, J(\theta_0)\, (\theta_1 - \theta_0),$$
where J(θ0) is the Fisher information. The right hand
side is sometimes called the square of the
Mahalanobis distance.
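For example (my addition), in the Bernoulli case the Fisher information is J(θ) = 1/(θ(1 − θ)), so for nearby hypotheses

```latex
2K(\theta_0, \theta_1) \;\approx\; \frac{(\theta_1 - \theta_0)^2}{\theta_0\,(1 - \theta_0)}.
```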
• Furthermore, we may assume the hypotheses are “close” enough that J(θ0) ≈ J(θ1). Then the K-L information is also approximately symmetric.
• Practically, there remains the problem of choosing T, or of choosing “desirable” probabilities of miss and false alarm, which obey Stein’s lemma and hence also determine the required data size. We can solve for T given the error probabilities. However, it is often “unnatural” to specify these probabilities; instead, we are concerned about other, observable effects on the system. Hence, the usual scenario results in a lot of lost sleep, as we continually vary T, run simulations, and then observe some distant outcome.
• Fortunately, the Bayesian approach comes to the
rescue. Instead of optimizing a probability tradeoff,
we assign costs: C_M > 0 to a miss event and C_FA > 0 to a false alarm event. Additionally, we have a prior distribution on θ:

P(θ = θ1) = π1
• Let D(y_{1:N}) be the decision function as before. The Bayes risk, or expected cost, is as follows:

$$R(D) = \pi_1\, C_M\, P\!\left(D(y_{1:N}) = 0 \mid \theta = \theta_1\right) + (1 - \pi_1)\, C_{FA}\, P\!\left(D(y_{1:N}) = 1 \mid \theta = \theta_0\right)$$
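Minimizing this risk for each y (a standard result, added here for completeness) again gives a likelihood ratio test, now with the threshold fixed by the costs and the prior:

```latex
D^*(y_{1:N}) \;=\; \mathbf{1}\!\left\{
  \frac{f_{\theta_1}(y_{1:N})}{f_{\theta_0}(y_{1:N})}
  \;>\; \frac{(1 - \pi_1)\, C_{FA}}{\pi_1\, C_M}
\right\},
\qquad \text{i.e. } \exp(T) = \frac{(1 - \pi_1)\, C_{FA}}{\pi_1\, C_M}.
```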
Bayesian Sequential Analysis
• At each time n, we form the likelihood ratio:
$$S_n = \frac{f_{\theta_1}(y_{1:n})}{f_{\theta_0}(y_{1:n})}$$
• The expected increments s_n of the log-likelihood ratio under θ0 and θ1 are given by the K-L informations:

$$E(s_n \mid \theta_0) = -K(\theta_0, \theta_1), \qquad E(s_n \mid \theta_1) = K(\theta_1, \theta_0)$$
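A quick simulation (my own sketch, for Bernoulli hypotheses with θ0 = 1/3 and θ1 = 1/2) confirms these drifts empirically:

```python
import numpy as np

theta0, theta1 = 1/3, 1/2

def kl_bernoulli(p, q):
    """K-L information K(p, q) between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(1)
for true_theta, label in [(theta0, "under theta0"), (theta1, "under theta1")]:
    y = rng.random(200_000) < true_theta
    # per-sample log-likelihood-ratio increments s_n
    s = np.where(y, np.log(theta1 / theta0), np.log((1 - theta1) / (1 - theta0)))
    print(label, "empirical drift:", s.mean())

print("-K(theta0, theta1) =", -kl_bernoulli(theta0, theta1))
print(" K(theta1, theta0) =", kl_bernoulli(theta1, theta0))
```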
• Again, we have practical issues concerning how to choose the thresholds a, b. By invoking Wald’s equation, or some results from martingale theory, these are easily related to the probabilities of error at the stopping time of the test. However, the problem remains of how to choose both probabilities of error, since we have a three-way tradeoff with the average run lengths Eθ0(T), Eθ1(T)!
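For reference (my addition; this is the standard Wald result, with α and β the false alarm and miss probabilities, and A, B the upper and lower thresholds on the likelihood ratio S_n, to which the slides’ a, b presumably correspond, possibly on the log scale):

```latex
A \;\approx\; \frac{1 - \beta}{\alpha},
\qquad
B \;\approx\; \frac{\beta}{1 - \alpha}.
```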
• Fortunately, the Bayesian formulation comes to our rescue. We can again assign costs C_FA and C_M to the false alarm and miss events. We also include a cost proportional to the number of observations prior to stopping; let this cost equal the number of observations, which is T. The goal is to minimize the expected cost, or sequential Bayes risk. What is our prior information? Again, we must know P(θ = θ1) = π1.
• It turns out that the optimal Bayesian strategy is again an SPRT. This follows from the theory of optimal stopping. Suppose that at time n we have yet to make a decision concerning θ. We must decide among the following alternatives:
– Stop, and declare θ0 or θ1.
– Take one more observation.
• We choose to stop only when the minimum additional
cost of stopping is less than the minimum expected
additional cost of taking one more observation.
• We compute these costs using the posterior distribution of θ, i.e.,

π1(n) = P(θ = θ1 | y1:n)
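A sketch (my own, for Bernoulli hypotheses) of how π1(n) can be updated recursively as observations arrive:

```python
import numpy as np

def update_pi1(y, theta0, theta1, pi1_prior=0.5):
    """Recursively compute pi1(n) = P(theta = theta1 | y_1:n) for Bernoulli data."""
    pi1 = pi1_prior
    trajectory = []
    for yn in y:
        l1 = theta1 if yn else 1 - theta1      # f_theta1(y_n)
        l0 = theta0 if yn else 1 - theta0      # f_theta0(y_n)
        pi1 = pi1 * l1 / (pi1 * l1 + (1 - pi1) * l0)   # one-step Bayes update
        trajectory.append(pi1)
    return np.array(trajectory)

rng = np.random.default_rng(2)
y = rng.random(200) < 1/2                      # data actually drawn under theta1
pi1_n = update_pi1(y, theta0=1/3, theta1=1/2)
print(pi1_n[-1])                               # typically close to 1
```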
• Unfortunately, one cannot get a closed-form expression for the thresholds in terms of the costs, but the “Bayes” formulation at least allows us to incorporate prior information about the hypotheses.
• We will see a much richer extension to the problem of
Bayesian change detection.