Inference
Lecture notes for the course wi4455
Version 1.42
These lecture notes support the course “statistical inference” at Delft University of Technology. Just like Young and Smith [2005], this course aims to provide a concise account of the essential elements of statistical inference and theory, with particular emphasis on the contrasts between frequentist and Bayesian approaches. I believe it is very important to learn that there are different schools of thought on how to do proper statistics. Many people collect data, often with the goal of information-based decision making. Next, they seem to hope that statistics will give a clear-cut answer to the various questions they may have. This is usually not possible; I will try to teach you why. Clearly, there are always choices to be made in modelling the data. But even if two well-educated statisticians agree on the statistical model, this does not imply that they agree on the conclusions to be drawn. This does not invalidate the use of statistics, but you should be aware of the underlying causes for disagreement.
This syllabus is based on various sources. At many places I present the material as in either Young
and Smith [2005] or Schervish [1995]. The chapter on decision theory is based on chapters 7 and 8 from
Parmigiani and Inoue [2009]. At some places I have used the material from Shao [2003], Keener [2010] and
Ghosh et al. [2006].
Starred sections or theorems in these lecture notes are not part of the exam. A couple of articles that appeared in the statistical literature have been added. Though I strongly advise you to read these, they are not part of the exam material. Exercises are scattered throughout the text. Many are from Keener [2010], Schervish [1995] and Young and Smith [2005].
The content of Chapter 2 should be mostly familiar and for that reason I will go through this part rather
quickly.
Scripts corresponding to numerical examples can be found at https://github.com/fmeulen/WI4455.
These scripts are written in either R or Julia.
Thanks to Eni Musta, Marc Corstanje and Laura Middeldorp for providing many exercise solutions and checking parts of these notes. I thank Geurt Jongbloed and the students that took the course wi4455 for pointing out typos and mistakes and for suggesting improvements to the text. In particular, Joost Pluim (2017-2018).
Contents

1 Preliminaries
1.1 What is statistics about?
1.2 Statistical models
1.2.1 Densities
1.3 Some well known distributions
1.3.1 Exponential family models
1.4 Examples of statistical models
1.5 Different views on inferential statistics
1.5.1 Example: classical and Bayesian estimation for Bernoulli trials
1.5.2 A few intriguing examples
1.6 Stochastic convergence

4 Bayesian statistics
4.1 Setup
4.1.1 Definition of a Bayesian statistical experiment
4.1.2 Dominated Bayesian statistical models
4.1.3 Examples: dominated case
4.1.4 An example where the posterior is not dominated by the prior*
4.1.5 Basu’s example
4.1.6 Prediction
4.1.7 Do we need to discern between prior, likelihood, data, parameters?
4.1.8 Bayesian updating
4.1.9 Posterior mean, median, credible sets
4.1.10 An example
4.2 An application
4.2.1 Bayesian updating for linear regression
4.2.2 State-space models
4.3 Justifications for Bayesian inference
4.3.1 Exchangeability
4.4 Choosing the prior
4.4.1 Improper priors
4.4.2 Jeffreys’ prior
4.5 Hierarchical Bayesian models
4.6 Empirical Bayes
4.7 Bayesian asymptotics
4.7.1 Consistency
4.7.2 Asymptotic normality of the posterior

5 Bayesian computation
5.1 The Metropolis-Hastings algorithm
5.1.1 A general formulation of the Metropolis-Hastings algorithm*
5.1.2 Convergence of the Metropolis-Hastings algorithm
5.2 Examples of proposal kernels
5.3 Cycles, mixtures and Gibbs sampling
5.4 Applying MCMC methods to the Baseball data
5.4.1 MCMC algorithm for model 1
5.4.2 MCMC algorithm for model 2
5.4.3 Some simulation results based on model 2
5.5 Applying Gibbs sampling for missing data problems*
5.6 Variational Inference*
5.7 Probabilistic programming languages
5.8 Expectation Maximisation (EM) algorithm*
Chapter 1
Preliminaries
In this chapter we set notation and introduce some well-known statistical models, including exponential family models. Some examples are presented that cause disagreement among statisticians about the right type of inference. The final section is on stochastic convergence and includes definitions that will be used in later chapters.

1.1 What is statistics about?
Inference is the problem of turning data into knowledge, where knowledge often is expressed in
terms of entities that are not present in the data per se but are present in models that one uses to
interpret the data.
It is important to know that for statistics as a field of study, one usually discerns
• inferential statistics, which is about planning experiments, drawing conclusions from data and quan-
tifying the uncertainty of statements;
• statistical decision theory, which is about using the available information to choose among a number of alternative actions.

A clear informal introduction is given in the opening section of the book by Winkler [1972], which is well worth reading:
Applications of statistics occur in virtually all fields of endeavour – business, the social sciences, education, and so on, almost without end. Although the specific details differ somewhat in the different fields, the problems can all be treated with the general theory of statistics. To begin, it is convenient to identify three major branches of statistics: descriptive statistics, inferential statistics, and statistical decision theory. Descriptive statistics is a body of techniques for
the effective organization, summarization, and communication of data. When the “man on the
street” speaks of “statistics”, he usually means data organized by the methods of descriptive
statistics. Inferential statistics is a body of methods for arriving at conclusions extending be-
yond the immediate data. For example, given some information regarding a small subset of a
given population, what can be said about the entire population? Inferential statistics, then, refers
to the process of drawing conclusions or making predictions on the basis of limited informa-
tion. Finally, statistical decision theory goes one step further; instead of just making inferential
statements, the decision maker uses the available information to choose among a number of
alternative actions.
For most practitioners, descriptive statistics is probably most relevant. Tidying and preprocessing data for further analysis is typically time-consuming, but its importance cannot be overstated. This course is mostly concerned with inference and to a lesser extent with statistical decision theory (Chapter 6). A key component of inference is the concept of a statistical model.
1.2 Statistical models

If we assume X ∼ Pois(θ) (with θ > 0), this translates to assuming X takes values in {0, 1, 2, …} and

ℙ(X = x) = e^{−θ} θ^x∕x!, x = 0, 1, 2, … .
The function 𝑥 → ℙ(𝑋 = 𝑥) is the probability mass function. If we assume 𝑌 ∼ 𝐸𝑥𝑝 (𝜆) (with 𝜆 > 0), then
this translates to assuming 𝑌 takes values in [0, ∞) and
ℙ(a ≤ Y ≤ b) = ∫_a^b λe^{−λy} 𝟏_{[0,∞)}(y) dy, a ≤ b.
Here, 𝑓𝑌 (𝑦) = 𝜆𝑒−𝜆𝑦 𝟏[0,∞) (𝑦) is the probability density function. The distinction between “continuous”
and “discrete” random variables is restrictive and unnecessary. As a concrete example, suppose the random
variable Z is defined by

Z = X with probability w, Z = Y with probability 1 − w,   (1.1)
where w ∈ (0, 1). Clearly, Z is neither discrete nor continuous. Is it possible to define a density (similar to probability mass functions or density functions)? This is certainly not an exceptional example: consider for example Y ∼ Exp(1), c > 0, and define X = min(Y, c). Verify yourself that X is neither discrete nor continuous.
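The following small R sketch (our own illustration, not one of the course scripts) shows the last construction numerically: X = min(Y, c) puts positive mass at the single point c, while below c it behaves like a continuous random variable.

set.seed(1)
c0 <- 1.5                      # the cut-off c (named c0 to avoid masking R's c())
y <- rexp(100000, rate = 1)    # Y ~ Exp(1)
x <- pmin(y, c0)               # X = min(Y, c)
mean(x == c0)                  # point mass at c0; approx exp(-c0)
exp(-c0)                       # theoretical value P(Y >= c0)
mean(x < 1)                    # continuous part; approx 1 - exp(-1)
1 - exp(-1)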
Using the language of measure theory, we now make the notions of sample space and random variable
more precise. Actually, as we are also interested in describing probabilities of random vectors or even random
functions, we will talk about random quantities to refer to any of these.
We assume that there is a probability space (S, 𝒜, ℙ) underlying all calculations. A random quantity X is a measurable mapping

X : (S, 𝒜) → (𝒳, ℬ),

where 𝒳 (with σ-field ℬ) is called the sample space. Recall the definition of the pre-image of a set under a
map: if 𝜙 ∶ 𝐴 → 𝐵 is a map and 𝑉 ⊆ 𝐵, then the pre-image of 𝑉 under 𝜙 is defined by
preim𝜙 (𝑉 ) = {𝑎 ∈ 𝐴 ∣ 𝜙(𝑎) ∈ 𝑉 }.
Often, this set is denoted by φ^{−1}(V), but note that the definition of the pre-image does not require φ to be invertible. The distribution of X, denoted by P, is defined by

P(B) = ℙ(preim_X(B)) = ℙ(X ∈ B), B ∈ ℬ.

Other terminology for the same thing is “probability distribution of X” or “law of X”. If X depends on
the parameter 𝜃 one often writes P𝜃 for the distribution of 𝑋. The expectation of 𝑋 is written either as
𝔼𝑋 = ∫ 𝑋(𝑠) dℙ(𝑠) or E𝜃 𝑋 = ∫ 𝑥 dP𝜃 (𝑥), the latter showing explicitly the dependence on the parameter 𝜃.
Sometimes we write P𝑋 to denote the distribution of the random quantity 𝑋.
Example 1.1. Take (S, 𝒜, ℙ) = ([0, 1], ℬ([0, 1]), μ) with μ denoting Lebesgue measure on [0, 1]. Define X : [0, 1] → (0, ∞) by X(s) = −θ log(s); then P_θ([0, x]) = μ([e^{−x∕θ}, 1]) = 1 − e^{−x∕θ}. Hence P_θ is in fact the distribution of a random variable with the Exponential distribution with mean θ.
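A quick numerical illustration of this example in R (a sketch, not one of the course scripts): drawing s uniformly on [0, 1] and applying X(s) = −θ log(s) yields draws from the Exponential distribution with mean θ.

set.seed(1)
theta <- 2
s <- runif(100000)                # s ~ Unif(0, 1), i.e. distributed according to mu
x <- -theta * log(s)              # X(s) = -theta * log(s)
mean(x)                           # approx theta
ks.test(x, pexp, rate = 1/theta)  # compare with the Exp-distribution with mean theta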
While introductory texts in probability usually start from the underlying probability space, rather quickly
this space is no longer mentioned as genuine interest lies in the distribution of 𝑋.
Exercise 1.1 Consider the outcome of a throw with a fair coin. Give two distinct pairs of underly-
ing probability space and random variable that yield the same distribution P, assigning probability
1∕2 to tails and probability 1∕2 to heads.
All computer simulations are built from (pseudo-)random numbers, implying that the underlying probability space can be taken to be [0, 1]^ℕ with its Borel σ-algebra and Lebesgue (product) measure.
Definition 1.2. A statistical model (statistical experiment) is a family of probability measures {P_θ, θ ∈ Ω} on a measurable space (𝒳, ℬ). The set Ω is the parameter space. We denote the model by ℰ = (𝒳, ℬ, {P_θ, θ ∈ Ω}).
The measures P𝜃 are often called sampling probabilities.
Important note on notation: In many texts the letter Θ is used for the parameter space. In these notes,
the parameter space will always be denoted by Ω, and we reserve the symbol Θ for a random quantity (this
is particularly useful when considering Bayesian statistics, where all unknowns are treated as random quan-
tities).
In this course we will mainly be dealing with parametric models, where the dimension of Ω is finite. Non-
parametric models are models which are not parametric. Hence, in such a model at least one of the parameters
is infinite-dimensional. The infinite-dimensional parameter is usually a function or measure. Examples of
nonparametric models are
1. X₁, …, Xₙ ~iid P and it is assumed that x ↦ P((−∞, x]) is concave;
1.2.1 Densities
Often, it is convenient to specify statistical models by densities. In basic probability courses the density
refers to either the probability mass function or probability density function. We aim for more generality and
for that reason recap the following definitions from measure theory.
Definition 1.3. Suppose μ and ν are measures on the measurable space (𝒳, ℬ).

• A σ-finite measure μ is absolutely continuous with respect to the σ-finite measure ν if there exists a measurable function f : 𝒳 → [0, ∞] such that μ(A) = ∫_A f dν for all A ∈ ℬ. The function f is called the density (Radon-Nikodym derivative) of μ with respect to ν, and we write f = dμ∕dν.

• We write μ ≪ ν if for all A ∈ ℬ, ν(A) = 0 implies μ(A) = 0.
Theorem 1.4 (Radon-Nikodym). A 𝜎-finite measure 𝜇 is absolutely continuous with respect to the 𝜎-finite
measure 𝜈 if and only if 𝜇 ≪ 𝜈. The density is 𝜈-almost surely unique.
Definition 1.5. A statistical model {P_θ, θ ∈ Ω} is dominated if there exists a σ-finite measure ν : ℬ → [0, ∞] such that for all θ ∈ Ω we have P_θ ≪ ν.
By the Radon-Nikodym theorem, we may represent a dominated model in terms of probability densities f(·; θ) = (dP_θ∕dν)(·). Note that the dominating measure is not unique, and hence densities are not unique either. In these notes, the symbol ν is reserved for the dominating measure. Discrete and continuous random variables are now seen to be random variables with a density with respect to counting measure and Lebesgue measure respectively. The following example shows that mixed forms also exist. These do pop up in realistic applications of statistics!
Example 1.6. Suppose X ∼ Pois(θ) and Y ∼ Exp(λ). If we let ν_c denote counting measure on {0, 1, 2, …}, then

P_X(B) = ∫_B e^{−θ} θ^x∕x! dν_c(x).

If we let ν_ℓ denote Lebesgue measure on ℝ, then

P_Y(B) = ∫_B λe^{−λx} 𝟏_{[0,∞)}(x) dν_ℓ(x).

We claim that Z as defined in (1.1) has density

w (dP_X∕dν_c)(x) 𝟏_ℕ(x) + (1 − w) (dP_Y∕dν_ℓ)(x) 𝟏_{ℝ∖ℕ}(x)

with respect to ν_c + ν_ℓ (here ℕ = {0, 1, 2, …}). To see this, note that

P_X(B) = ∫_B (dP_X∕dν_c)(x) dν_c(x) = ∫_B (dP_X∕dν_c)(x) 𝟏_ℕ(x) dν_c(x)
= ∫_B (dP_X∕dν_c)(x) 𝟏_ℕ(x) d(ν_c + ν_ℓ)(x).

Similarly, we get

P_Y(B) = ∫_B (dP_Y∕dν_ℓ)(x) dν_ℓ(x) = ∫_B (dP_Y∕dν_ℓ)(x) 𝟏_{ℝ∖ℕ}(x) dν_ℓ(x)
= ∫_B (dP_Y∕dν_ℓ)(x) 𝟏_{ℝ∖ℕ}(x) d(ν_c + ν_ℓ)(x).

The result now follows by combining the results in the previous two displays when computing w P_X(B) + (1 − w) P_Y(B). Please do note that it is important to include the indicators 𝟏_ℕ and 𝟏_{ℝ∖ℕ} in the density!
In probability and statistics, product spaces are of particular importance. If (𝒳, ℬ, μ) and (𝒴, 𝒞, ν) are measure spaces, then there exists a unique measure μ × ν, called the product measure, on (𝒳 × 𝒴, ℬ ∨ 𝒞) such that

(μ × ν)(A × B) = μ(A)ν(B)

for all A ∈ ℬ and B ∈ 𝒞. Here ℬ ∨ 𝒞 is the smallest σ-field containing all sets A × B with A ∈ ℬ and B ∈ 𝒞. Suppose random variables X₁, …, Xₙ are independent; then the distribution of X = (X₁, …, Xₙ) is the product measure P_{X₁} × ⋯ × P_{Xₙ}, and if Xᵢ has density f_{Xᵢ} with respect to νᵢ, then X has density (x₁, …, xₙ) ↦ ∏ᵢ₌₁ⁿ f_{Xᵢ}(xᵢ) with respect to ν₁ × ⋯ × νₙ.
Notation 1.7. The density of a random quantity X will always be denoted by f_X, or simply f if there is no risk of confusion with densities of other random quantities. If the density depends on a parameter θ, then we write f_X(·; θ).
Exercise 1.3 Suppose X₁, X₂, X₃ ~iid Exp(θ). In introductory courses on statistics you learn about maximum likelihood estimation.

1. Derive the maximum likelihood estimator of θ in this fully observed case.

2. Now suppose we do not fully observe X₃. Instead, we only observe whether X₃ ∈ [3, 4) or not. This means that the data are given by the random vector X = (X₁, X₂, 𝟏_{[3,4)}(X₃)). Can you derive the maximum likelihood estimator in this case as well? Note that the likelihood is in fact the density of the distribution of X. In the fully observed case one can take Lebesgue measure on ℝ³ as dominating measure. What dominating measure can be taken for P_X in the partially observed case?
1.3 Some well known distributions

Example 1.8. If X = 1 with probability θ and 0 else, we write X ∼ Ber(θ) and say that X has the Bernoulli distribution with parameter θ. If X₁, …, Xₙ are independent Ber(θ)-random variables, then Y = Σᵢ₌₁ⁿ Xᵢ ∼ Bin(n, θ) and Y is said to have the Binomial distribution. The density of Y is given by

f_Y(y; θ) = \binom{n}{y} θ^y (1 − θ)^{n−y}, y = 0, 1, …, n.
Example 1.9. The Negative Binomial distribution arises by counting the number of successes in a sequence of independent and identically distributed Bernoulli trials (each with success probability θ) before a specified (non-random) number r of failures occurs. We write X ∼ NegBin(r, θ), for which

f_X(x; θ) = \binom{x+r−1}{x} θ^x (1 − θ)^r, x = 0, 1, … .
Exercise 1.4 Verify the form of the density in the preceding example.
Example 1.10. The Geometric distribution arises by counting the number of independent and identically distributed Bernoulli trials (each with success probability θ) necessary for obtaining a first success. If X denotes the number of trials, then X ∼ Geom(θ) and f_X(x; θ) = (1 − θ)^{x−1} θ, x = 1, 2, … .
Example 1.11. The Poisson distribution arises as a limiting case of the Bin(n, θ∕n)-distribution when n → ∞. We write X ∼ Pois(θ), so that f_X(x; θ) = e^{−θ} θ^x∕x!, x = 0, 1, … .
Example 1.12. If X is normally distributed with mean μ and variance σ² we write X ∼ N(μ, σ²). The parameter ρ = 1∕σ² is called the precision parameter. If X ∼ N(μ_X, σ²_X), Y ∼ N(μ_Y, σ²_Y), and X and Y are independent, then

cX + Y ∼ N(cμ_X + μ_Y, c²σ²_X + σ²_Y)

for c ∈ ℝ. The density of the N(μ, σ²)-distribution is denoted by φ(·; μ, σ²). The cumulative distribution function of the N(0, 1)-distribution is denoted by Φ. The upper α-quantile of the N(0, 1)-distribution is denoted by ξ_α, so that ℙ(Z ≥ ξ_α) = α for Z ∼ N(0, 1).
Definition 1.13. Suppose Z₁, …, Z_k are independent N(0, 1)-distributed random variables. Define Z = [Z₁ ⋯ Z_k]′. A k-dimensional random vector X has the multivariate normal distribution with mean vector μ and covariance matrix Σ if X has the same probability distribution as the vector μ + LZ, for a k × k matrix L with Σ = LL′ and k-dimensional vector μ. If Σ is nonsingular, the density of X is given by

f_X(x) = (2π)^{−k∕2} (det Σ)^{−1∕2} exp( −½ (x − μ)′ Σ^{−1} (x − μ) ).
More information can be found for instance on the Wikipedia page https://en.wikipedia.org/wiki/Multivariate_normal_distribution. For future reference we include the following lemma.
Lemma 1.14. If

X ∼ N(m, P)
Y ∣ X ∼ N(HX + u, R),

then (X, Y) is jointly normal:

(X, Y)′ ∼ N( (m, Hm + u)′, Σ ), with Σ = [ P, PH′ ; HP, HPH′ + R ]

(rows of Σ separated by semicolons). The covariance of Y can be easily remembered from the law of total variance:

Var Y = E[Var(Y ∣ X)] + Var(E[Y ∣ X]) = R + HPH′.
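A quick Monte Carlo sanity check of the lemma in R, in the scalar case (so m, P, H, u, R are numbers; a sketch of our own, not one of the course scripts):

set.seed(1)
m <- 1; P <- 2; H <- 0.5; u <- 0.3; R <- 0.8
x <- rnorm(100000, mean = m, sd = sqrt(P))          # X ~ N(m, P)
y <- rnorm(100000, mean = H * x + u, sd = sqrt(R))  # Y | X ~ N(HX + u, R)
c(mean(y), H * m + u)      # E Y = Hm + u
c(var(y), H * P * H + R)   # Var Y = HPH' + R (law of total variance)
c(cov(x, y), P * H)        # Cov(X, Y) = PH'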
Example 1.15. We say X ∼ Exp(θ) if X has density f_X(x; θ) = θe^{−θx} 𝟏_{[0,∞)}(x). If X₁, …, Xₙ are independent Exp(θ)-random variables, then Y = Σᵢ₌₁ⁿ Xᵢ has the Ga(n, θ)-distribution. We have

f_Y(y; n, θ) = (θⁿ∕Γ(n)) y^{n−1} e^{−θy} 𝟏_{[0,∞)}(y).

This definition applies also to noninteger values of n (provided that n > 0).
Example 1.16. We say X has the Beta density, denoted by X ∼ Be(α, β), if X has density

f_X(x; α, β) = (1∕B(α, β)) x^{α−1} (1 − x)^{β−1} 𝟏_{(0,1)}(x)

with respect to Lebesgue measure. Here B denotes the Beta-function and α, β > 0. We have E X = α∕(α + β) and Var X = αβ∕((α + β)²(α + β + 1)).
1.3.1 Exponential family models

Many of the parametric models introduced above share a common structure in terms of their densities. The family of densities {φ(·; μ, σ²), μ ∈ ℝ, σ² ∈ (0, ∞)} constitutes an important example.
Definition 1.18. A parametric family with parameter space Ω and density f_X(x; θ) with respect to a measure ν on (𝒳, ℬ) is called an exponential family if

f_X(x; θ) = c(θ) h(x) exp( Σᵢ₌₁ᵏ ξᵢ(θ) τᵢ(x) ).
Definition 1.19. In an exponential family the natural parameter is the vector ξ = (ξ₁(θ), …, ξ_k(θ)) and

Γ = { ξ ∈ ℝᵏ : ∫ h(x) exp( Σᵢ₌₁ᵏ ξᵢ τᵢ(x) ) dν(x) < ∞ }

is called the natural parameter space.
One advantage of this parametrisation is that the set Γ is convex. A proof is given in Theorem 2.62 in
Schervish [1995]. The quantities 𝜏𝑖 (𝑥) are sometimes referred to as natural statistics.
Example 1.20. If X₁, …, Xₙ ~iid N(μ, σ²) then X = (X₁, …, Xₙ) forms an exponential family with respect to Lebesgue measure on ℝⁿ with

h(x) = (2π)^{−n∕2},  τ₁(x) = Σᵢ₌₁ⁿ xᵢ,  τ₂(x) = Σᵢ₌₁ⁿ xᵢ²,
c(θ) = σ^{−n} exp( −nμ²∕(2σ²) ),  ξ₁(θ) = μ∕σ²,  ξ₂(θ) = −1∕(2σ²),

where θ = (μ, σ). The natural parameter space is given by Γ = ℝ × (−∞, 0).
Example 1.21 (Continuation of example 1.16). If X ∼ Be(α, β), then the parameter is given by θ = (α, β) and the distribution of X belongs to an exponential family with

h(x) = 𝟏_{(0,1)}(x)∕(x(1 − x)),  c(θ) = 1∕B(α, β),  ξ₁(θ) = α,  ξ₂(θ) = β,  τ₁(x) = log x,  τ₂(x) = log(1 − x).
If the natural parameter space contains an open set (in ℝᵏ), the family is said to be of full rank; else it is called curved.
Exercise 1.5
1. Prove that random samples from the following distributions form exponential families: Pois-
son, geometric, Gamma. What about the negative Binomial distribution?
2. Identify the natural statistics and the natural parameters in each case. What are the distribu-
tions of the natural statistics?
Exercise 1.6 Let Y₁, …, Yₙ be independent and identically distributed N(μ, μ²). Show that this model is an example of a curved exponential family.
Exercise 1.7 Suppose X₁, …, Xₙ are independent Bernoulli random variables with θᵢ the success probability for Xᵢ. Suppose these success probabilities are related to a sequence of variables t₁, …, tₙ, viewed as known constants, through

θᵢ = 1∕(1 + exp(−α − βtᵢ)), i = 1, …, n.

Show that the joint densities of X₁, …, Xₙ form a two-parameter exponential family, and identify the statistics τ₁ and τ₂.
Hint: In order to derive the statistics, first write down the joint density of the {Xᵢ}. After you have done this, derive an expression for log(θᵢ∕(1 − θᵢ)) and substitute this into the joint density.
1.4 Examples of statistical models

Example 1.22. In (linear) regression, the data consist of pairs (Xᵢ, Yᵢ) and it is assumed that

Yᵢ ∣ Xᵢ = x ∼ μ_θ(x) + εᵢ,

where {εᵢ} is assumed to be a sequence of mean-zero independent and identically distributed random variables and μ_θ(x) = θ′x. It is common to assume that εᵢ has a Normal distribution, but more heavily tailed distributions are also possible.
Generalised linear models generalise linear models by allowing for non-Normal response distributions, such as the Gamma, Poisson and Bernoulli distributions. In Poisson regression, one assumes

Yᵢ ∣ Xᵢ = x ∼ Pois( e^{μ_θ(x)} ),

while in logistic (Bernoulli) regression one assumes

Yᵢ ∣ Xᵢ = x ∼ Ber( ψ(μ_θ(x)) ),

with ψ(x) = (1 + e^{−x})^{−1}. In both cases μ_θ(x) is transformed such that the parameter only takes “valid” values (i.e. positive in case of the Poisson-distribution; in (0, 1) for the Bernoulli-distribution).
Example 1.23. Let k be a positive integer and suppose {πᵢ, i = 1, …, k} are numbers in [0, 1] that add to unity. A finite mixture experiment consists of the composite experiment in which the i-th experiment is chosen with probability πᵢ. As a simple example, suppose Y = 1 with probability π₁ and Y = 2 with probability π₂ = 1 − π₁. Suppose X ∣ Y = y ∼ N(μ_y, 1). We have

P_X(B) = ℙ(X ∈ B ∣ Y = 1)π₁ + ℙ(X ∈ B ∣ Y = 2)π₂ = π₁ ∫_B φ(x; μ₁, 1) dx + π₂ ∫_B φ(x; μ₂, 1) dx,

and hence the density of X with respect to Lebesgue measure is given by

f_X(x) = π₁ φ(x; μ₁, 1) + π₂ φ(x; μ₂, 1).

The joint distribution of (X, Y) can be found upon noting that

ℙ(X ∈ B, Y = y) = π_y ∫_B φ(x; μ_y, 1) dx, y ∈ {1, 2}.
[Figure 1.1: Simulation of a trajectory of the diffusion with b(x) = −0.2x − sin(4πx) and σ = 0.3, starting at 0. The sin-term in the drift causes multimodal behaviour, whereas the linear part of the drift ensures mean-reversion within the modes.]
Example 1.26. Consider a rod of length L with insulated sides that is given an initial temperature g(x) at x ∈ [0, L]. Suppose that u(x, t) is the temperature at x ∈ [0, L] at time t ≥ 0. Then u satisfies the heat equation

∂u∕∂t = θ ∂²u∕∂x².

If the temperature at the ends of the rod is kept at 0 degrees, then the boundary conditions to this partial differential equation are given by

u(0, t) = u(L, t) = 0.

Furthermore, from the initial temperature we obtain the initial condition u(x, 0) = g(x). Suppose (g(x), x ∈ [0, L]) is known and at time t = T the temperature is measured with noise at points x₁, …, x_K. The measurements can be modelled as independent realisations of Y₁, …, Y_K, where

Yᵢ = u(xᵢ, T) + εᵢ.

Assume that {εᵢ} ~iid N(0, σ²). The statistical problem is to estimate (θ, σ²). Here, the likelihood is easily written down, but computationally hard to evaluate.
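To give an impression of this measurement model, here is a small R sketch (with our own simple choices for g, θ, T and the grids; not one of the course scripts) that solves the heat equation with an explicit finite-difference scheme and then generates noisy observations at time T:

set.seed(1)
theta <- 0.1; L <- 1; T_end <- 0.25; sigma <- 0.02
M <- 50                                  # number of spatial intervals
dx <- L / M
dt <- 0.4 * dx^2 / theta                 # satisfies stability: theta*dt/dx^2 <= 1/2
x <- seq(0, L, length.out = M + 1)
u <- sin(pi * x / L)                     # initial temperature g(x)
for (k in 1:ceiling(T_end / dt)) {       # explicit Euler time steps
  u[2:M] <- u[2:M] + theta * dt / dx^2 * (u[3:(M + 1)] - 2 * u[2:M] + u[1:(M - 1)])
  u[1] <- 0; u[M + 1] <- 0               # boundary conditions u(0, t) = u(L, t) = 0
}
idx <- seq(5, M, by = 5)                 # measurement locations x_1, ..., x_K
Y <- u[idx] + rnorm(length(idx), sd = sigma)   # Y_i = u(x_i, T) + eps_i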
1.5 Different views on inferential statistics

• Objective Bayesian inference, using constant prior densities for unknowns, was prominent from 1775–1925, under the name inverse probability. The idea dates back to Laplace.
• By 1940, the prevailing statistical philosophies were either Fisherian (associated with Ronald Fisher)
or frequentist (associated with Jerzy Neyman and Egon Pearson).
• Subjective Bayesian analysis became prominent by 1960 and was strongly advocated by Savage and
Lindley.
It is fair to say that most people nowadays approach statistics from a frequentist point of view. Methods such
as maximum likelihood and Neyman-Pearson testing (using type I and type II errors) are parts of introductory
statistics courses, whereas Bayesian estimation quite often is not. Over the past 20 years Bayesian methods have flourished, and to a large extent this is due to tremendous advances in computational algorithms. Another development is often referred to as “big data”, which means that either the number of experimental units or the number of parameters in the statistical model is very large. It turns out that many ideas from Bayesian analysis can be used in these settings to derive estimators that enjoy favourable frequentist properties (as an example, the posterior mode can sometimes be interpreted as a regularised maximum likelihood estimator).
Because of the impact of choosing either the frequentist or Bayesian paradigm, it is no surprise that
there has been a lively debate on the correct way to perform statistical inference. To get an impression
of the debate you may wish to read chapter 16 of Jaynes [2003] (an online version is available at http://www-biba.inrialpes.fr/Jaynes/prob.html). Here is one quote from this book.
During the aforementioned period, the average worker in physics, chemistry, biology, medicine,
or economics with a need to analyze data could hardly be expected to understand theoretical
principles that did not exist, and so the approved methods of data analysis were conveyed to him
in many different, unrelated ad hoc recipes in “cookbooks” which, in effect, told one to “Do this... then do that... and don’t ask why”.
R.A. Fisher’s “Statistical Methods for Research Workers (1925)” was the most influential of
these cookbooks.
The book by Jaynes [2003] is very opinionated (but a must-read in my opinion); a somewhat milder discussion on the use of frequentist and Bayesian statistics can be found in two recent articles by the well-known statistician Efron (cf. Efron [1986] and Efron [2005]). The recent, very accessible book Clayton [2021] strongly advocates Jaynes’ approach (which is Bayesian, with probability being interpreted as information). To fight the reproducibility crisis (the inability to replicate experiments, especially in psychology and sociology), he proposes to stop teaching frequentist hypothesis testing, confidence intervals, sufficient statistics, and all those topics that have traditionally been taught in about any course on statistics. We take a somewhat milder point of view here, and briefly recap those topics in Chapter 2.
1.5.1 Example: classical and Bayesian estimation for Bernoulli trials

Suppose X₁, …, Xₙ ~iid Ber(θ), with θ ∈ [0, 1] unknown. Within classical statistics there are various ways to estimate θ, including the following.

(A) Maximum likelihood estimation: choose the estimator that maximises the likelihood

L(θ) = θ^S (1 − θ)^{n−S}, θ ∈ [0, 1], where S = Σᵢ₌₁ⁿ Xᵢ.

This gives Θ̂_MLE = X̄ₙ.
(B) Uniformly Minimum Variance Unbiased (UMVU) estimation: among all estimators T = d(X₁, …, Xₙ) that satisfy E_θ T = θ (if any exist), choose the estimator with minimum variance. It turns out that Θ̂_UMVU = X̄ₙ, which is just the maximum likelihood estimator.
(C) Minimax estimation: choose the estimator that minimises the maximum value of the Mean Square Error. Define the Mean Square Error of T for estimating θ by MSE(θ, T) = E_θ(T − θ)². The minimax estimator is defined as the estimator that minimises the worst case value of the Mean Square Error:

Θ̂_MM = argmin_T sup_{θ∈[0,1]} MSE(θ, T).

In this example it can be shown that

Θ̂_MM = (S + √n∕2)∕(n + √n).
While the third estimator is different from the other two, it is easily seen that asymptotically, when 𝑛 is large,
the difference diminishes.
Within Bayesian statistics all unknowns are treated as random quantities, including in particular the
parameter. In this sense, the only distinction between data and parameters hinges on what is observed. The
Bayesian statistician then proceeds by defining the joint distribution of (𝑋, Θ), where 𝑋 = (𝑋1 , … , 𝑋𝑛 ), in
the following way:
X₁, …, Xₙ ∣ Θ = θ ~iid Ber(θ),
Θ ∼ Be(α, β).
The first line in this hierarchy includes the assumption that 𝑋1 , … , 𝑋𝑛 are conditionally independent, instead
of independent. This implies that 𝑋1 , … , 𝑋𝑛 are exchangeable: an assumption that is considerably weaker
than independence. The second line in this hierarchy specifies the prior distribution for Θ. The need for
providing this distribution is often seen as a weak aspect of Bayesian statistics by frequentists (the idea being
that one can subjectively choose a prior and “anything” can result from that). In contrast, Bayesians consider it to be a key advantage of the Bayesian approach, as it enables the incorporation of available prior knowledge. Regarding the choice of prior: here we have chosen the Be(α, β)-distribution as prior, as this turns out to
simplify calculations. Once the joint distribution of (𝑋, Θ) has been specified, Bayesian statistics is concep-
tually straightforward (“conceptually”, as there may be computationally demanding problems remaining):
all inference should be based on the posterior distribution, which is the distribution of Θ conditional on 𝑋.
We have

f_{Θ∣X}(θ ∣ x) = f_{X∣Θ}(x ∣ θ) f_Θ(θ) ∕ ∫ f_{X∣Θ}(x ∣ θ) f_Θ(θ) dθ.
It turns out that Θ ∣ X ∼ Be(α + S, β + n − S). A point estimator can then for example be defined by the posterior mean

E[Θ ∣ X] = (α + S)∕(α + β + n).
It is interesting to see that Θ̂_MLE = Θ̂_UMVU = X̄ₙ is obtained upon letting both α and β tend to 0. Moreover, the minimax estimator Θ̂_MM is obtained by taking α = β = √n∕2. Obviously, if we would have taken another prior distribution, the posterior would change accordingly.
Whereas the various approaches to estimation are fundamentally different, asymptotically they all agree
that 𝑋̄ 𝑛 is a good estimator for 𝜃. The Bernstein-Von Mises theorem essentially states that this is even the
case for all prior distributions with density that is strictly positive on (0, 1).
When it comes to hypothesis testing, the differences between the frequentist and Bayesian approach become more pronounced. An entertaining and accessible introduction to Bayesian statistics is the article by Lindley and Phillips [1976].
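The following R sketch (with arbitrarily chosen n, θ and prior parameters; not one of the course scripts) computes the estimators discussed above on simulated data:

set.seed(1)
n <- 50; theta0 <- 0.3
x <- rbinom(n, size = 1, prob = theta0)
S <- sum(x)
S / n                                  # MLE = UMVU estimator
(S + sqrt(n) / 2) / (n + sqrt(n))      # minimax estimator
alpha <- 2; beta <- 2                  # prior Be(2, 2)
(alpha + S) / (alpha + beta + n)       # posterior mean; Theta | X ~ Be(alpha + S, beta + n - S)
curve(dbeta(x, alpha + S, beta + n - S), 0, 1)   # posterior density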
1.5.2 A few intriguing examples

Example 1.27. Suppose a person picks a secret integer θ and we get to see two independent observations X₁ and X₂, where Xᵢ equals θ + 1 or θ − 1, each with probability 1∕2. A natural estimator for θ is

Θ̂ = (X₁ + X₂)∕2.

Consider the confidence set for θ given by

C(X₁, X₂) = {(X₁ + X₂)∕2} if X₁ ≠ X₂, and C(X₁, X₂) = {X₁ − 1} if X₁ = X₂.

It is easily verified that

ℙ(C(X₁, X₂) contains θ ∣ X₁ ≠ X₂) = 1,
ℙ(C(X₁, X₂) contains θ ∣ X₁ = X₂) = 0.5,

so that unconditionally ℙ(C(X₁, X₂) contains θ) = 0.75, whatever the value of θ.
Within classical statistics, this is the confidence set that is to be reported. However, some statisticians find
it rather silly to report this confidence set: you are absolutely sure about the secret number when 𝑥1 ≠ 𝑥2 is
observed.
The following example is known as Stein’s paradox and dates back to 1956 (Stein [1956]).
Example 1.29. Suppose X₁, X₂, X₃ ~ind N(θ, 1). As the likelihood equals

L(θ; X) = (2π)^{−3∕2} exp( −½ Σᵢ₌₁³ (Xᵢ − θ)² ),

we find that the maximum likelihood estimator (MLE) equals Θ̂_MLE = X̄₃. In the chapter on statistical decision theory we will see that this estimator has many favourable properties, including that it is

• admissible;
• asymptotically efficient.

Now change the setting and suppose Xᵢ ~ind N(θᵢ, 1), i = 1, 2, 3, so that the unknown parameter is the vector θ = (θ₁, θ₂, θ₃) ∈ ℝ³. In this case the MLE equals Θ̂_MLE = X := (X₁, X₂, X₃). Quite surprisingly, Stein showed that the James-Stein estimator

Θ̂_JS = (1 − 1∕‖X‖²) X

satisfies

E_θ ‖Θ̂_JS − θ‖² < E_θ ‖Θ̂_MLE − θ‖²

for all θ. That is, the estimator Θ̂_JS improves upon the MLE. Put differently, the MLE is inadmissible! Convince yourself that this is somewhat counterintuitive: for estimating the mean of Xⱼ, we use all observations {Xᵢ}, while at the same time all {Xᵢ} are assumed independent.
Example 1.30. Consider the statistical model where Yᵢ ∼ Bin(nᵢ, pᵢ). Efron and Morris [1975] considered the example where i indexes baseball players and Yᵢ denotes the number of home runs out of nᵢ times at bat by the i-th player. If you are not into sports, you may also think of i indexing hospitals and Yᵢ being the number of successful operations out of nᵢ operations in the i-th hospital. A straightforward estimator for pᵢ is given by P̂ᵢ = Yᵢ∕nᵢ. Now, just as in the previous example, it turns out that better estimators can be found if we are interested in the aggregate performance of the estimators {P̂ᵢ}.
We can transform the problem to the setting of the preceding example by applying a variance stabilising transformation. Define

Xᵢ = √nᵢ arcsin( 2Yᵢ∕nᵢ − 1 );

then, for nᵢ large, we have that Xᵢ has approximately the N(μᵢ, 1)-distribution with

μᵢ = √nᵢ arcsin(2pᵢ − 1).

Cf. Exercise 1.11 ahead. We will return to this example when discussing hierarchical Bayesian models and Markov Chain Monte Carlo methods.
1.6 Stochastic convergence

Definition 1.31. A sequence of random vectors {Xₙ} is said to converge in distribution to a random vector X if

ℙ(Xₙ ≤ x) → ℙ(X ≤ x), n → ∞,

for every x at which the limit distribution function x ↦ ℙ(X ≤ x) is continuous. This is denoted by Xₙ ⇝ X.
Alternative names are weak convergence or convergence in law. The “Portmanteau theorem” (lemma 2.2 in van der Vaart [1998]) includes, among other characterisations, the equivalence of convergence in distribution to E f(Xₙ) → E f(X) for all bounded, continuous functions f.
Let 𝑑 be a distance function on ℝ𝑘 that generates the usual topology, for instance Euclidean distance.
Definition 1.32. A sequence of random vectors Xₙ is said to converge in probability to X if for all ε > 0

ℙ(d(Xₙ, X) > ε) → 0, n → ∞.

This is denoted Xₙ →p X. Hence, Xₙ →p X is equivalent to d(Xₙ, X) →p 0.
Definition 1.33. A sequence of random vectors Xₙ is said to converge almost surely to X if

ℙ( lim_{n→∞} Xₙ = X ) = 1.

This is denoted Xₙ →a.s. X.
We have

Xₙ →a.s. X implies Xₙ →p X implies Xₙ ⇝ X.
Exercise 1.9 Let X ∼ N(0, 1) and define Xₙ = (−1)ⁿ X. Show that the sequence {Xₙ}ₙ converges in distribution, but not in probability.
Exercise 1.10 Suppose random variables {Xₙ} are defined on ([0, 1], ℬ([0, 1]), λ) (where λ is Lebesgue measure) as follows: writing n = 2^m + k with 0 ≤ k < 2^m, let Xₙ = 𝟏_{[k2^{−m}, (k+1)2^{−m}]}. Show that Xₙ converges in probability to 0, but not almost surely.
Theorem 1.36 (Delta method). Let g = (g₁, …, g_m) : ℝᵏ → ℝᵐ be differentiable at θ, with derivative

g′_θ(h) = [ ∂₁g₁(θ) ⋯ ∂_k g₁(θ) ; ⋮ ; ∂₁g_m(θ) ⋯ ∂_k g_m(θ) ] h.

If Tₙ are random vectors with √n(Tₙ − θ) ⇝ T, then √n( g(Tₙ) − g(θ) ) ⇝ g′_θ(T).
The proof can for example be found in section 3.1 of van der Vaart [1998]. We end with a couple of appli-
cations of the Delta-method.
Example 1.37. Suppose X₁, …, Xₙ are iid with expectation θ and variance σ². By the central limit theorem,

√n (X̄ₙ − θ) ⇝ N(0, σ²).

By the Delta-method (with g(x) = x²) we immediately obtain the limiting distribution of X̄ₙ²:

√n (X̄ₙ² − θ²) ⇝ 2θ N(0, σ²) ∼ N(0, 4θ²σ²).
Example 1.38. Suppose X₁, …, Xₙ are iid with expectation θ and variance θ (as is for instance the case for the Pois(θ)-distribution). Applying the Delta-method with g(x) = √x gives

2√n ( √X̄ₙ − √θ ) ⇝ (1∕√θ) N(0, θ) ∼ N(0, 1).

As the variance in the limit is independent of θ, the transformation is called variance stabilising. The result in the preceding display can be used for deriving an asymptotic confidence interval for θ.
Writing σ² = σ²(θ), we see that the transformation g is obtained as

g(θ) = ∫₀^θ (1∕σ(x)) dx.
Example 1.39. In the multivariate case, by the central limit theorem, we obtain that (under weak moment assumptions)

√n ( [X̄ₙ, X̄²ₙ] − μ ) ⇝ N(0, Σ),

where X̄²ₙ = n⁻¹ Σᵢ₌₁ⁿ Xᵢ². Here μ and 0 are vectors in ℝ² and Σ is a 2 × 2 matrix. From this result the limiting distribution of Sₙ² = n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)² can be derived using the Delta-method, by noting that Sₙ² = g(X̄ₙ, X̄²ₙ) with g(x₁, x₂) = x₂ − x₁².
Exercise 1.11 Verify the claim from example 1.30: Let Y ∼ Bin(n, p) and define

Xₙ = √n arcsin( 2Y∕n − 1 ).

Show that for n large, Xₙ has approximately the N(μₙ, 1)-distribution with

μₙ = √n arcsin(2p − 1).

Hint: first note that if Y ∼ Bin(n, p), then by the central limit theorem

√n (Y∕n − p) ⇝ N(0, p(1 − p)), n → ∞.

Next, take g such that g′(x) = 1∕√(x(1 − x)).
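A numerical check of the variance stabilising transformation in R (n and p chosen arbitrarily; a sketch, not one of the course scripts):

set.seed(1)
n <- 400; p <- 0.3
y <- rbinom(100000, size = n, prob = p)
x <- sqrt(n) * asin(2 * y / n - 1)
c(mean(x), sqrt(n) * asin(2 * p - 1))   # mean approx mu_n
var(x)                                   # approx 1, whatever p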
Chapter 2

Topics from classical statistics
2.1 Sufficiency
A statistic 𝑇 is a measurable function of the data 𝑋. With slight abuse of notation, we often denote 𝑇 (𝑋) by
𝑇 , so that 𝑇 is both used to denote the mapping and the random quantity 𝑇 (𝑋). The value that 𝑇 assumes
when 𝑋 = 𝑥 is denoted by 𝑇 (𝑥). A sufficient statistic is a statistic with a special property.
Definition 2.1. Let 𝑋 be a sample from an unknown probability measure in the set {P𝜃 , 𝜃 ∈ Ω}. A statistic
𝑇 is said to be sufficient for 𝜃 ∈ Ω if the conditional distribution of 𝑋 given 𝑇 is known (does not depend
on 𝜃).
Once we observe X and compute a sufficient statistic T(X), the original data X do not contain any further information concerning the unknown θ. In case X ∼ P_θ has a density with respect to counting measure (i.e. X is discrete), we have

ℙ(X = x; θ) = Σ_t ℙ(X = x ∣ T = t) ℙ(T = t; θ).   (2.1)
This illustrates that we can sample 𝑋 by first using 𝜃 to sample 𝑇 , and next sample 𝑋 conditional on 𝑇 .
Note that {T = t} has probability zero for “continuous” random variables, and in that case measure theory is required to give meaning to Equation (2.1). This is further complicated upon noticing that (X, T(X)) “lives” on the graph {(x, T(x)), x ∈ 𝒳} (hence, if one wants to talk about densities, the dominating measure will not be a product measure).
Finding sufficient statistics from the definition tends to be complicated. The following theorem simplifies
this task.
Theorem 2.2 (Factorisation theorem). Suppose that X is a sample from P_θ, where {P_θ, θ ∈ Ω} is a family of probability measures on (ℝⁿ, ℬⁿ) dominated by a σ-finite measure ν. Then T is sufficient for θ ∈ Ω if and only if there are nonnegative Borel functions h (which does not depend on θ) on (ℝⁿ, ℬⁿ) and g_θ on the range of T such that

(dP_θ∕dν)(x) = g_θ(T(x)) h(x).
Proof. We give the proof in case X is discrete. One direction is easy: suppose T is sufficient for θ; then

ℙ(X = x; θ) = ℙ(X = x ∣ T = T(x)) ℙ(T = T(x); θ) = h(x) g_θ(T(x)),

where h(x) = ℙ(X = x ∣ T = T(x)) does not depend on θ by sufficiency, and g_θ(t) = ℙ(T = t; θ). In the continuous case, the proof is much harder due to measure-theoretic technicalities.
There are many sufficient statistics for a family {P𝜃 , 𝜃 ∈ Ω}. If 𝑇 is a sufficient statistic and 𝑇 = ℎ(𝑆), where
ℎ is measurable and 𝑆 is another statistic, then 𝑆 is sufficient. This follows trivially from the factorisation
theorem. This motivates the following definition.
Definition 2.3. Let T : 𝒳 → 𝒯 be a sufficient statistic for θ ∈ Ω. T is called a minimal sufficient statistic if, for any other statistic S : 𝒳 → 𝒮 that is sufficient for θ ∈ Ω, there is a measurable function h : 𝒮 → 𝒯 such that T = h(S), P_θ-a.s. for all θ.
Example 2.4. Suppose X₁, …, Xₙ ~iid N(0, σ²). The following are all sufficient statistics:

T₁(X) = (X₁, …, Xₙ),  T₂(X) = (X₁², …, Xₙ²),  T₃(X) = Σᵢ₌₁ⁿ Xᵢ².
Lemma 2.5. Suppose 𝑆 and 𝑇 are both minimal sufficient statistics for 𝜃 ∈ Ω. Then there exists a function
ℎ, which is injective on the range of 𝑆, such that 𝑇 = ℎ(𝑆) 𝑃𝜃 -a.s.
Proof. Since T is minimal sufficient, there exists a measurable function h such that T = h(S). Similarly, there exists a measurable function h̃ such that S = h̃(T). Suppose x and x′ are such that h(S(x)) = h(S(x′)). This means that T(x) = T(x′), which implies

S(x) = h̃(T(x)) = h̃(T(x′)) = S(x′).

This proves that h is injective on the range of S.
Hence the minimal sufficient statistic is unique in the sense that two statistics that are one-to-one functions
of each other can be treated as one statistic. Establishing that a sufficient statistic is minimal sufficient can
be hard. One useful criterion is the following
Lemma 2.6. Suppose that for each θ ∈ Ω, P_θ has density f(x; θ) = g_θ(T(x))h(x) with respect to a dominating measure ν. If, for all x and y, the identity f(x; θ) = c f(y; θ) for all θ (for some c = c(x, y)) implies T(x) = T(y), then T is minimal sufficient.
Proof*. We give a proof by contradiction. The proof consists of a few steps:

• To prove that T is minimal sufficient, it suffices to show that for any sufficient statistic T∗ there is a mapping r with T = r(T∗). For this, it suffices that for all x, y: if T∗(x) = T∗(y), then T(x) = T(y).

• Now suppose T is not minimal sufficient. Then there exist a sufficient statistic T∗ and x, y such that T∗(x) = T∗(y), but T(x) ≠ T(y).

• Use the factorisation theorem:

f(x; θ) = g_θ(T∗(x)) h(x) = g_θ(T∗(y)) h(x) = (h(x)∕h(y)) f(y; θ).

• By the assumption in the lemma, this implies T(x) = T(y). Hence we have reached a contradiction.
Another way of establishing minimal sufficiency is to show that a sufficient statistic is complete and apply the Lehmann-Scheffé theorem. Details are in Section 2.8.
Exercise 2.1 Let φ be a positive (Borel) function on ℝ such that ∫_a^b φ(x) dx < ∞ for any pair θ = (a, b) with −∞ < a < b < ∞. Let Ω = {θ = (a, b) ∈ ℝ² : a < b}. Define

f(x; θ) = c(θ) φ(x) 𝟏_{(a,b)}(x),

with c(θ) such that ∫ f(x; θ) dx = 1. Then {f(·; θ), θ ∈ Ω} is called a truncation family. Suppose X₁, …, Xₙ ~iid f(·; θ). Let X = (X₁, …, Xₙ). Show that T(X) = (X₍₁₎, X₍ₙ₎) is sufficient.
Suppose X₁, …, Xₙ is an iid sample from an exponential family with density c(θ)h(x) exp( Σᵢ₌₁ᵏ ξᵢ(θ)τᵢ(x) ). The joint density is then

∏ⱼ₌₁ⁿ f(xⱼ; θ) = c(θ)ⁿ { ∏ⱼ₌₁ⁿ h(xⱼ) } exp( Σᵢ₌₁ᵏ ξᵢ(θ) Σⱼ₌₁ⁿ τᵢ(xⱼ) ).

Hence, if we define tᵢ(X) = Σⱼ₌₁ⁿ τᵢ(Xⱼ), then T(X) = (t₁(X), …, t_k(X)) is sufficient for θ (as a consequence of the factorisation theorem). This statistic is sometimes called the natural sufficient statistic.
Lemma 2.7. If 𝑋 has an exponential family distribution, then so does the natural sufficient statistic 𝑇 (𝑋),
and the natural parameter for 𝑇 is the same as for 𝑋.
Proof. We only give the proof in the “discrete setting”. Fix a vector y = (y₁, …, y_k) and let

𝒳_y = {x : t₁(x) = y₁, …, t_k(x) = y_k}.

Then

ℙ(t₁(X) = y₁, …, t_k(X) = y_k; θ) = Σ_{x∈𝒳_y} ℙ(X = x; θ)
= c(θ)ⁿ Σ_{x∈𝒳_y} { ∏ⱼ₌₁ⁿ h(xⱼ) } exp( Σᵢ₌₁ᵏ ξᵢ(θ) tᵢ(x) )
= c(θ)ⁿ h₀(y) exp( Σᵢ₌₁ᵏ ξᵢ(θ) yᵢ ),

where h₀(y) = Σ_{x∈𝒳_y} ∏ⱼ₌₁ⁿ h(xⱼ).
Theorem 2.8. If the natural parameter space Γ of an exponential family contains an open set in ℝᵏ, then the natural sufficient statistic T(X) is complete and sufficient.
A proof is given in Chapter 2 of Schervish [1995].
As examples, the natural sufficient statistics for samples from the normal, exponential, Poisson and Bernoulli distributions are complete.
Exercise 2.2 [YS exercise 6.2.] Find a minimal sufficient statistic for 𝜃 based on an independent
sample of size 𝑛 from each of the following distributions:
2. 𝑈 𝑛𝑖𝑓 (𝜃 − 1, 𝜃 + 1);
2.2 Information measures

Definition 2.9 (FI regularity conditions). Let {P_θ, θ ∈ Ω}, with Ω ⊂ ℝᵏ, be dominated by ν with densities f_X(·; θ). The FI regularity conditions are said to hold if:

1. There exists B with ν(B) = 0 such that for all θ, ∂f_X(x; θ)∕∂θᵢ exists for x ∉ B and each i.

2. ∫ f_X(x; θ) dν(x) can be differentiated under the integral sign with respect to each coordinate of θ.

3. The set C = {x : f_X(x; θ) > 0} is the same for all θ.

The third condition rules out the Unif(0, θ)-distribution for example, which is an important counterexample to keep in mind.
Definition 2.10. Assume the FI regularity conditions hold. The score function is defined by s(θ; x) = ∇_θ log f_X(x; θ), with components sᵢ(θ; x) = (∂∕∂θᵢ) log f_X(x; θ). The Fisher information matrix I(θ; X) about θ based on X is defined as the matrix with elements

I_{i,j}(θ) = Cov_θ( sᵢ(θ; X), sⱼ(θ; X) ).
As an example, for a scale family with densities f_X(x; a) = a⁻¹ f(x∕a) on (0, ∞), a computation shows that

I(a) = (1∕a²) ∫₀^∞ ( 1 + u f′(u)∕f(u) )² f(u) du.
Under FI regularity conditions we can find a different representation for the Fisher information.
Since

−(∂²∕∂θᵢ∂θⱼ) log f_X(X; θ) = −( (∂²∕∂θᵢ∂θⱼ) f_X(X; θ) )∕f_X(X; θ) + ( (∂∕∂θᵢ) f_X(X; θ) )( (∂∕∂θⱼ) f_X(X; θ) )∕f_X(X; θ)²,

and (assuming we may also differentiate twice under the integral sign) E_θ[ ( (∂²∕∂θᵢ∂θⱼ) f_X(X; θ) )∕f_X(X; θ) ] = ∫ (∂²∕∂θᵢ∂θⱼ) f_X(x; θ) dν(x) = 0, we find that

−E_θ[ (∂²∕∂θᵢ∂θⱼ) log f_X(X; θ) ] = E_θ[ ( (∂∕∂θᵢ) log f_X(X; θ) )( (∂∕∂θⱼ) log f_X(X; θ) ) ].
The following lemma states that Fisher information is additive in case of independent random quantities.
Lemma 2.15. Suppose X₁, …, Xₙ are independent and the Fisher information is Iᵢ(θ) for each Xᵢ. If X = (X₁, …, Xₙ), then the Fisher information I_{1:n}(θ) of X is given by

I_{1:n}(θ) = Σᵢ₌₁ⁿ Iᵢ(θ).
For an exponential family in its natural parametrisation, with density c(π)h(x) exp( Σᵢ πᵢτᵢ(x) ), the Fisher information takes a particularly simple form:

I_{i,j}(π) = −(∂²∕∂πᵢ∂πⱼ) log c(π).

FI under reparametrisation
FI under reparametrisation
If we change the parametrisation of a statistical model, then its Fisher information changes accordingly. That
is, Fisher information is not invariant under reparametrisation. A simple example illustrates this.
Example 2.16. Suppose that X ∼ Exp(θ). Then it is easy to verify that I(θ) = 1∕θ². Now suppose we use a different parametrisation:

ψ = g(θ) = 1∕θ² ⟺ θ = g⁻¹(ψ) = 1∕√ψ.

While maybe less convenient, a parametrisation in terms of ψ yields the same model as under θ. A straightforward calculation shows that I(ψ) = 1∕(4ψ²). Now we see that notation is tricky here. Let’s consider X ∼ Exp(θ) as parametrisation 1, and Y ∼ Exp(1∕√ψ) as parametrisation 2. Then we write I₁(θ) = θ⁻² and I₂(ψ) = 1∕(4ψ²). The point we want to make in this example is that I₂(ψ) cannot be obtained from I₁(θ) by simply plugging in the relation between ψ and θ, i.e. θ = 1∕√ψ. This is what is meant by “lack of invariance under reparametrisation”.
Lemma 2.17. Let g : ℝ → ℝ be a one-to-one, onto differentiable mapping with differentiable inverse. Assume X ∼ f_X(·; θ), where θ is one-dimensional. Let ψ = g(θ) and Y ∼ f_Y(·; ψ) := f_X(·; g⁻¹(ψ)). Then

I₂(ψ) = I₁(g⁻¹(ψ)) | (d∕dψ) g⁻¹(ψ) |².
Proof. As we assume the FI regularity conditions, the score function has mean zero and hence

I₂(ψ) = Var_ψ( (d∕dψ) log f_Y(Y; ψ) ) = E_ψ[ ( (d∕dψ) log f_Y(Y; ψ) )² ]
= ∫ ( (d∕dψ) log f_Y(y; ψ) )² f_Y(y; ψ) dy
= ∫ ( (d∕dψ) log f_X(y; g⁻¹(ψ)) )² f_X(y; g⁻¹(ψ)) dy.

By the chain rule,

(d∕dψ) log f_X(y; g⁻¹(ψ)) = [ (d∕dθ) log f_X(y; θ) ]|_{θ=g⁻¹(ψ)} · (d∕dψ) g⁻¹(ψ).

Substituting this gives

I₂(ψ) = | (d∕dψ) g⁻¹(ψ) |² [ ∫ ( (d∕dθ) log f_X(y; θ) )² f_X(y; θ) dy ]|_{θ=g⁻¹(ψ)},

which can be written more compactly as

I₂(ψ) = I₁(θ) | dθ∕dψ |².
Remark 2.18. If we replace log f_Y(y; θ) by an arbitrary smooth function R(y; θ), the lemma still holds.
The Kullback-Leibler (KL) information between probability measures P and Q is defined by

KL(P, Q) = ∫ log( dP∕dQ ) dP

if P is absolutely continuous with respect to Q and the integral is meaningful, and KL(P, Q) = ∞ otherwise.
In case P and Q have densities p and q with respect to a common measure ν, this implies

KL(P, Q) = ∫ log( p(x)∕q(x) ) p(x) dν(x).
In case of parametric families, where X ∼ P_θ and P_θ ≪ ν for all θ with densities f_X(·; θ), we write

KL_X(θ, ψ) := E_θ log( f_X(X; θ)∕f_X(X; ψ) ), θ, ψ ∈ Ω.
To help understand the interpretation of KL(P, Q), consider the hypothesis testing problem

H₀ : X ∼ Q versus H₁ : X ∼ P.

The log-likelihood ratio is given by

λ(X) = log( (dP∕dQ)(X) ).

If H₁ is true, we expect E_P λ(X) = KL(P, Q) to be large. This gives the interpretation that the larger KL(P, Q) is, the easier it is to discriminate between P and Q on the basis of the data.
KL information is also known as KL distance though in general 𝐾𝐿(𝑃 , 𝑄) ≠ 𝐾𝐿(𝑄, 𝑃 ) so that 𝐾𝐿(⋅, ⋅)
is not a true distance. If 𝐾𝐿(𝑃 , 𝑄) = 0 though, we can conclude that 𝑃 = 𝑄.
KL information is always nonnegative. By Jensen’s inequality (applied to the concave function x ↦ log x),

E_P log( q(X)∕p(X) ) ≤ log E_P ( q(X)∕p(X) ) ≤ log 1 = 0.

Therefore

KL(P, Q) = −E_P log( q(X)∕p(X) ) ≥ 0.

As x ↦ log x is strictly concave, we can only have equality if P = Q.
Exercise 2.6 If 𝑋 ∼ 𝑁 (𝜃, 1), show that 𝐾𝐿𝑋 (𝜃, 𝜓) = (𝜃 − 𝜓)2 ∕2. In this case 𝐾𝐿𝑋 (𝜃, 𝜓) is a
multiple of squared Euclidean distance.
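A numerical check of this formula in R, computing the KL information by numerical integration for θ = 0 and ψ = 1 (a sketch, not one of the course scripts):

kl_integrand <- function(x) {
  dnorm(x, 0, 1) * (dnorm(x, 0, 1, log = TRUE) - dnorm(x, 1, 1, log = TRUE))
}
integrate(kl_integrand, -Inf, Inf)$value   # equals (0 - 1)^2 / 2 = 0.5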
Exercise 2.8 (Schervish [1995], exercise 45 in chapter 2.) Suppose that 𝑋 ∼ 𝑈 𝑛𝑖𝑓 (0, 𝜃). Find
the Kullback-Leibler information 𝐾𝐿𝑋 (𝜃1 , 𝜃2 ) for all pairs (𝜃1 , 𝜃2 ).
In contrast to Fisher information, KL information is invariant under reparametrisation: with ψᵢ = g(θᵢ) and f_Y(·; ψ) := f_X(·; g⁻¹(ψ)) as in Lemma 2.17,

KL_Y(ψ₁, ψ₂) = ∫ log( f_Y(y; ψ₁)∕f_Y(y; ψ₂) ) f_Y(y; ψ₁) dy = ∫ log( f_X(y; θ₁)∕f_X(y; θ₂) ) f_X(y; θ₁) dy = KL_X(θ₁, θ₂).

Note that the lack of invariance under reparametrisation for Fisher information is caused by the derivative in its definition and is a consequence of the chain rule.
2.3 Parameter estimation

Unbiased estimators do not always behave sensibly. As a first illustration, suppose X ∼ Ber(θ) and d(X) is an unbiased estimator of θ, so that E_θ d(X) = (1 − θ)d(0) + θ d(1) = θ for all θ ∈ (0, 1); this forces d(0) = 0 and d(1) = 1, i.e. d(X) = X. Now d is supposed to estimate θ ∈ (0, 1), but can only take the values {0, 1}.
Exercise 2.9 Suppose X ∼ Pois(θ) and suppose we wish to estimate e^{−3θ} by φ(X).

1. Show that unbiasedness of φ(X) requires

Σ_{k=0}^∞ φ(k) e^{−θ} θᵏ∕k! = e^{−3θ}.

2. Prove that (−2)^X is unbiased for e^{−3θ}. Do you think this is a sensible estimator?
Historically, a strategy commonly advocated by classical statisticians is the following: in the search for optimal estimators, choose the estimator with minimal variance among all unbiased estimators.
Definition 2.25. An estimator φ(X) for θ is called UMVU (Uniformly Minimum Variance Unbiased) if it is unbiased, has finite variance, and for every other unbiased estimator ψ(X) of θ we have

Var_θ φ(X) ≤ Var_θ ψ(X) for all θ ∈ Ω.
There are various problems with this kind of “optimality criterion”. First of all, as we have just seen, unbiased estimators need not exist. Moreover, they may be very hard to derive. Second, what we really want is that some dissimilarity measure between φ(X) and θ is small. In statistical decision theory this is called a loss function. Say we take L(φ, θ) = (φ(X) − θ)². Then E_θ L(φ, θ) = (E_θ φ(X) − θ)² + Var_θ φ(X), the usual bias-variance trade-off of the mean square error. Searching for φ to minimise E_θ L(φ, θ) is different from deriving a UMVU estimator. By insisting on unbiasedness, we may in fact get bad estimators! So the take-home message is: unbiasedness, though intuitively appealing, is not at all an optimality criterion and shouldn’t play a role in deriving estimators.
There are at least two strategies for showing that a particular estimator is UMVU for θ (or possibly g(θ)):

• Applying the Lehmann-Scheffé theorem, which states that an unbiased estimator that is a function of a complete sufficient statistic is UMVU. Details are in Section 2.8.
• Showing that the Cramér-Rao lower bound is attained. We state and prove this result below.
In certain cases a lower bound on the variance of unbiased estimators can be derived. The best known
of these bounds is the Cramér-Rao lower bound. For simplicity we assume the parameter 𝜃 to be one-
dimensional.
Theorem 2.26. Assume Ω ⊂ ℝ and let 𝜙(𝑋) be a one-dimensional statistic with E𝜃 |𝜙(𝑋)| < ∞ for all 𝜃.
Suppose the FI regularity conditions of definition 2.9 are satisfied, 𝐼(𝜃) > 0 and also that ∫ 𝜙(𝑥)𝑓𝑋 (𝑥; 𝜃)𝑑𝜈(𝑥)
can be differentiated under the integral sign with respect to 𝜃. Then
Var_θ φ(X) ≥ ( (d∕dθ) E_θ φ(X) )² ∕ I(θ).
Proof. Let D = C ∩ Bᶜ, with B and C as in definition 2.9, so that for all θ, P_θ(D) = 1. We have

(d∕dθ) E_θ φ(X) = (d∕dθ) ∫ φ(x) f_X(x; θ) dν(x) = ∫ φ(x) (d∕dθ) f_X(x; θ) dν(x)
= ∫ φ(x) s(θ; x) f_X(x; θ) dν(x) = E_θ[ φ(X) s(θ; X) ],

where s(θ; x) = (d∕dθ) log f_X(x; θ). Upon taking φ ≡ 1 we get E_θ s(θ; X) = 0. Hence

(d∕dθ) E_θ φ(X) = E_θ[ (φ(X) − E_θ φ(X)) s(θ; X) ].

By the Cauchy-Schwarz inequality¹, the square of the right-hand side is bounded by Var_θ φ(X) · E_θ[s(θ; X)²] = Var_θ φ(X) I(θ), which proves the claim.
Exercise 2.11 If X₁, …, Xₙ ~iid N(θ, σ²), then verify that X̄ₙ is unbiased for θ and that the Cramér-Rao bound is met.
Now suppose φ(X) is an unbiased estimator for θ, i.e. E_θ φ(X) = θ. By the Cramér-Rao bound, the smallest possible variance for any unbiased estimator is 1∕I(θ) (as in this case (d∕dθ) E_θ φ(X) = 1). It is a natural question whether there exists an estimator that attains the bound.
Inspecting the proof of theorem 2.26, the lower bound is achieved when the ≤ in the Cauchy-Schwarz inequality is in fact an equality. This is only the case if φ(X) − E_θ φ(X) = φ(X) − θ and (d∕dθ) log f_X(X; θ) are linearly related, which entails

(d∕dθ) log f_X(x; θ) = λ(θ) (φ(x) − θ) for all θ,

for a certain mapping λ : Ω → ℝ. This implies that f_X should satisfy

f_X(x; θ) = exp( A(θ) φ(x) + B(θ) + C(x) )

for some functions A, B and C. This is to be compared with a one-parameter exponential family with a one-dimensional sufficient statistic τ(X):

f_X(x; θ) = c(θ) h(x) exp( ξ(θ) τ(x) ).

Hence, the Cramér-Rao lower bound can only be sharp if we are within the exponential family, and an UMVU estimator for θ attaining the bound only exists if E_θ τ(X) = θ.
¹ For vectors x, y ∈ ℝⁿ, this inequality states that |⟨x, y⟩| ≤ ‖x‖‖y‖, with equality if and only if y = λx for some λ ∈ ℝ. The inequality holds much more generally (in L²-spaces). A particular formulation is the following: if X and Y are random variables with E X = E Y = 0, E X² < ∞ and E Y² < ∞, then |E[XY]| ≤ √(E X² · E Y²), with equality if and only if Y = λX for some λ ∈ ℝ.
2.3.2 Maximum likelihood estimation

The score equation is given by s(θ; X) = 0, which is to be interpreted as an equation in θ. In many cases maximum likelihood estimators are obtained by solving the score equation.
Example 2.28. Suppose the data consist of independent realisations of pairs (Xᵢ, Yᵢ) and it is assumed that

Yᵢ ∣ Xᵢ = xᵢ ∼ N(θ′xᵢ, σ²).

Assume the marginal density of Xᵢ is given by f_X(·; η), where η is an unknown parameter. The likelihood is given by

L(θ, σ², η; 𝒟) = ∏ᵢ₌₁ⁿ (2πσ²)^{−1∕2} exp( −(Yᵢ − θ′Xᵢ)²∕(2σ²) ) f_X(Xᵢ; η)
= (2πσ²)^{−n∕2} exp( −(1∕(2σ²)) Σᵢ₌₁ⁿ (Yᵢ − θ′Xᵢ)² ) ∏ᵢ₌₁ⁿ f_X(Xᵢ; η),

with 𝒟 = {(Xᵢ, Yᵢ), 1 ≤ i ≤ n}. This implies that the maximum likelihood estimator for θ is found by minimising

θ ↦ Σᵢ₌₁ⁿ (Yᵢ − θ′Xᵢ)².

For this reason these estimators are also called least squares estimators.
Exercise 2.12 Verify that the maximum likelihood estimator for 𝜃 is unbiased.
If the likelihood is not smooth, the MLE may not be obtained as a zero of the score function.
Exercise 2.13 Verify that if X₁, …, Xₙ ~iid Unif(0, θ), then the MLE for θ equals X₍ₙ₎ = max(X₁, …, Xₙ).
The principle of maximum likelihood estimation does not necessarily result in a unique estimator.
Example 2.29. Suppose X₁, …, Xₙ ~iid Unif(θ − 1∕2, θ + 1∕2). The likelihood function is

L(θ) = ∏ᵢ₌₁ⁿ 𝟏_{[θ−1∕2, θ+1∕2]}(Xᵢ) = 𝟏_{[X₍ₙ₎−1∕2, X₍₁₎+1∕2]}(θ).

So any Θ̂ₙ ∈ [X₍ₙ₎ − 1∕2, X₍₁₎ + 1∕2] is an MLE. In particular, for αₙ ∈ [0, 1],

Θ̂ₙ = αₙ (X₍ₙ₎ − 1∕2) + (1 − αₙ)(X₍₁₎ + 1∕2)

is an MLE.
If u : Ω → Ω̄ is a bijective function, then instead of parametrising the model using the parameter θ ∈ Ω, we can also parametrise it using θ̄ = u(θ) ∈ Ω̄. From the definition of a maximum likelihood estimator it then follows that if Θ̂ is an MLE for θ, then u(Θ̂) is an MLE for u(θ). This means that the MLE is invariant under reparametrisation. For an arbitrary map u, not necessarily bijective, we define the (a) MLE for u(θ) by u(Θ̂). Now u(Θ̂) maximises

ψ ↦ sup_{θ∈Ω : u(θ)=ψ} L(θ; X).
Example 2.30. Suppose Y₁, …, Yₙ ~iid N(μ, σ²) and let Xᵢ = e^{Yᵢ}, so that the Xᵢ have a lognormal distribution. We can use the invariance of the MLE to find maximum likelihood estimators for the mean and variance of the Xᵢ. Some tedious computations (which you need not carry out here) reveal that

E X₁ = e^{μ+σ²∕2} =: ξ(μ, σ²),  Var X₁ = (e^{σ²} − 1) e^{2μ+σ²} =: D(μ, σ²).

To find the MLE for E X₁ and Var X₁ we can therefore simply first find the MLE for (μ, σ²) and then plug these into the expressions of the preceding display. But since Yᵢ ∼ N(μ, σ²), we have μ̂_MLE = Ȳₙ and σ̂²_MLE = n⁻¹ Σᵢ₌₁ⁿ (Yᵢ − Ȳₙ)². Hence we immediately obtain that the MLE’s for E X₁ and Var X₁ equal ξ(μ̂_MLE, σ̂²_MLE) and D(μ̂_MLE, σ̂²_MLE) respectively.
Exercise 2.15 [YS exercise 8.1.] Let 𝑋1 , … , 𝑋𝑛 be a random sample of size 𝑛 ≥ 3 from the
exponential distribution with mean 1∕𝜃.
1. Find a sufficient statistic 𝑇 (𝑋) for 𝜃 and write down its density.
2. Find an unbiased estimator of θ that is a function of T(X).

3. Calculate the Cramér-Rao lower bound for the variance of an unbiased estimator, and explain why you would not expect the bound to be attained in this example. Confirm this by calculating the variance of your unbiased estimator and comment on its behaviour as n → ∞.
2.3.3 Asymptotics

Consistency of maximum likelihood estimators

Assume X₁, …, Xₙ are independent and identically distributed with density f(·; θ). The likelihood is given by Lₙ(θ) = ∏ᵢ₌₁ⁿ f(Xᵢ; θ). Suppose the model is identifiable:

θ₁ ≠ θ₂ implies P_{θ₁} ≠ P_{θ₂}.

Lemma 2.31. Assume KL_{X₁}(θ₀, θ) < ∞. Then for each θ₀ and each θ ≠ θ₀,

lim_{n→∞} P_{θ₀}( Lₙ(θ₀) > Lₙ(θ) ) = 1.

Proof. Note that Lₙ(θ₀) > Lₙ(θ) precisely if

Mₙ(θ) := (1∕n) Σᵢ₌₁ⁿ log( f(Xᵢ; θ₀)∕f(Xᵢ; θ) ) > 0.

By the weak law of large numbers, Mₙ(θ) converges in probability (under P_{θ₀}) to

E_{θ₀} log( f(X; θ₀)∕f(X; θ) ) = KL(θ₀, θ).

By identifiability, there exists an ε > 0 such that KL(θ₀, θ) > ε. Now

Mₙ ≥ KL(θ₀, θ) − |Mₙ − KL(θ₀, θ)|.

So if |Mₙ − KL(θ₀, θ)| < ε∕2, then Mₙ > −ε∕2 + ε = ε∕2 > 0. Hence

P_{θ₀}( |Mₙ − KL(θ₀, θ)| < ε∕2 ) ≤ P_{θ₀}( Mₙ > 0 ),

and the left-hand side tends to one as n → ∞.
This lemma suggests that the MLE should be consistent, as the likelihood is maximal at the value 𝜃0
asymptotically. However, some further conditions are needed for obtaining consistency. Precise conditions
require solid knowledge of various stochastic convergence concepts and for this reason we do not go into
details. Chapter 7.3 of Schervish [1995] and Chapter 5 of van der Vaart [1998] are good starting points for
further reading.
Asymptotic normality
In certain “nice” settings, the MLE turns out to be asymptotically Normal. Finding sufficient conditions for
such a result is really part of the subject that is known as “asymptotic statistics”. A clear treatment of the
topic is van der Vaart [1998], from which we adapt some results of Chapter 5 (section 5).
Definition 2.32. A statistical model {P_θ, θ ∈ Ω} is called differentiable in quadratic mean if there exists a measurable vector-valued function x ↦ ℓ̇(x; θ₀) such that, as θ → θ₀,

∫ [ √f(x; θ) − √f(x; θ₀) − ½ (θ − θ₀)′ ℓ̇(x; θ₀) √f(x; θ₀) ]² ν(dx) = o( ‖θ − θ₀‖² ).
Note that if for every 𝑥 the map 𝜃 ↦ √𝑓(𝑥; 𝜃) is differentiable, then
(𝜕∕𝜕𝜃)√𝑓(𝑥; 𝜃) = (2√𝑓(𝑥; 𝜃))⁻¹ (𝜕∕𝜕𝜃)𝑓(𝑥; 𝜃) = ½ ((𝜕∕𝜕𝜃) log 𝑓(𝑥; 𝜃)) √𝑓(𝑥; 𝜃),
and then a Taylor expansion of 𝜃 ↦ √𝑓(𝑥; 𝜃) at 𝜃0 suggests that we can take 𝓁̇ to be the score function, i.e. (𝜕∕𝜕𝜃) log 𝑓(𝑥; 𝜃). However, differentiability in quadratic mean does not require existence of (𝜕∕𝜕𝜃)𝑓(𝑥; 𝜃) for every 𝑥. The following theorem is adapted from Theorem 5.39 in van der Vaart [1998].
Theorem 2.33. Suppose that the model {P𝜃, 𝜃 ∈ Ω} is differentiable in quadratic mean at an inner point 𝜃0 ∈ Ω ⊂ ℝᵏ. Furthermore, suppose that there exists a measurable function 𝓁̇ with ∫ 𝓁̇(𝑥)² 𝑓(𝑥; 𝜃0) 𝜈( d𝑥) < ∞ such that, for every 𝜃1 and 𝜃2 in a neighbourhood of 𝜃0,
| log 𝑓(𝑥; 𝜃1) − log 𝑓(𝑥; 𝜃2)| ≤ 𝓁̇(𝑥)‖𝜃1 − 𝜃2‖.
If the Fisher information matrix 𝐼(𝜃0) is nonsingular and Θ̂𝑛 is consistent, then
√𝑛(Θ̂𝑛 − 𝜃0) ⇝ 𝑁(0, 𝐼(𝜃0)⁻¹) as 𝑛 → ∞.
The take-home message from this theorem is that in certain cases maximum likelihood estimators are
asymptotically unbiased and attain the Cramér-Rao variance asymptotically. The latter property is called
asymptotic efficiency. While such efficiency is often put forward as a selling point of maximum likelihood,
it may give the wrong impression that the MLE is unique in this sense. As Basu [1975] (page 35) wrote
We all know that under certain circumstances the ML method works rather satisfactorily in
an asymptotic sense. But the community of practising statisticians are not always informed of
the fact that under the same circumstances the Bayesian method: “Begin with a reasonable prior
measure 𝑞 of your belief in the various possible values of 𝜃, match it with the likelihood function
generated by the data, and then estimate 𝜃 by the mode of the posterior distribution so obtained”,
will work as well as the ML method, because the two methods are asymptotically equivalent.
Exercise 2.17 A certain amount of smoothness of the map 𝜃 ↦ 𝑓(⋅; 𝜃) is essential for obtaining asymptotic normality. Suppose 𝑋1, … , 𝑋𝑛 ∼iid 𝑈𝑛𝑖𝑓(0, 𝜃). Prove that the MLE is given by 𝑋(𝑛) = max(𝑋1, … , 𝑋𝑛). Show that −𝑛(𝑋(𝑛) − 𝜃) ⇝ 𝐸𝑥𝑝(1∕𝜃).
Exercise 2.18 Suppose 𝑋1, … , 𝑋𝑛 ∼ind 𝑁(𝜃, 1), with 𝜃 ≥ 0.
1. Show that the MLE is given by Θ̂𝑛 = 𝑋̄𝑛 𝟏{𝑋̄𝑛 ≥ 0}.
2. (*, advanced) Show that if 𝜃 > 0, √𝑛(Θ̂𝑛 − 𝜃) ⇝ 𝑁(0, 1).
Hint: The idea is that if 𝜃 > 0, then 𝑋̄𝑛 > 0 and then Θ̂𝑛 = 𝑋̄𝑛. The asymptotic distribution then follows from the central limit theorem. It requires a bit of work to make this precise. Define the event 𝐴𝑛 = {𝑋̄𝑛 ≥ 0}. By the law of large numbers, if 𝜃 > 0, then P𝜃(𝐴𝑛) → 1 (𝑛 → ∞). Let 𝟏{𝐴} denote the indicator function of a set 𝐴 (hence 𝟏{𝐴} = 1 if 𝐴 holds true, else it is zero). Note that P𝜃(𝐴) = E𝜃 𝟏{𝐴}. Take any 𝑢 ∈ ℝ and write
ℙ𝜃(√𝑛(Θ̂𝑛 − 𝜃) ≤ 𝑢) = 𝔼𝜃 𝟏{√𝑛(Θ̂𝑛 − 𝜃) ≤ 𝑢} = 𝔼𝜃[𝟏{√𝑛(Θ̂𝑛 − 𝜃) ≤ 𝑢}𝟏{𝐴𝑛}] + 𝔼𝜃[𝟏{√𝑛(Θ̂𝑛 − 𝜃) ≤ 𝑢}𝟏{𝐴𝑛ᶜ}].
3. Show that for 𝜃 = 0,
P0(√𝑛 Θ̂𝑛 ≤ 𝑥) = 0 if 𝑥 < 0, P0(√𝑛 Θ̂𝑛 ≤ 𝑥) = 1∕2 if 𝑥 = 0, and P0(√𝑛 Θ̂𝑛 ≤ 𝑥) = Φ(𝑥) if 𝑥 > 0.
Hence, in this case the limit distribution is not normal, but a mixture of a point mass at zero
and the standard normal distribution.
Exercise 2.19 [YS exercise 8.10.] A random sample 𝑋1, … , 𝑋𝑛 is taken from the normal distribution 𝑁(𝜇, 1).
1. Show that the maximum likelihood estimator 𝜇̂𝑛 of 𝜇 is the minimum variance unbiased estimator (show that the Cramér-Rao lower bound is attained).
(a) Show that P𝜇 (𝑇𝑛 ≠ 𝜇̂ 𝑛 ) tends to one when 𝜇 = 0 but to zero if 𝜇 > 0.
(b) Derive that for 𝜇 = 0, the asymptotic distribution of 𝑇𝑛 is 𝑁 (0, 1∕(4𝑛)).
(c) Derive that for 𝜇 > 0, the asymptotic distribution of 𝑇𝑛 is 𝑁 (0, 1∕𝑛).
Compare the MSE of 𝜇̂𝑛 and 𝑇𝑛 both in case 𝜇 = 0 and 𝜇 > 0. Is 𝑇𝑛 a sensible estimator in practice?
with respect to the dominating measure 𝜈. Suppose that the natural parameter space Ω is an open subset of ℝᵏ. Let Θ̂𝑛 be the MLE of 𝜃 based on 𝑋1, … , 𝑋𝑛 (if it exists). Then lim𝑛→∞ P𝜃(Θ̂𝑛 exists) = 1 and under P𝜃,
√𝑛(Θ̂𝑛 − 𝜃) ⇝ 𝑁(0, 𝐼(𝜃)⁻¹).
This theorem implies that the MLE is consistent and that if 𝑔 ∶ Ω → ℝ has continuous partial derivatives, then 𝑔(Θ̂𝑛) is an asymptotically efficient estimator of 𝑔(𝜃) (Schervish [1995], Corollary 7.58 and Corollary 7.59).
⋯ although all efforts at a proof of the general existence of [asymptotically] efficient estimates
⋯ as well as a proof of the efficiency of ML estimates were obviously inaccurate and although
accurate proofs of similar statements always referred not to the general case but to particular
classes of estimates ⋯ a general belief became established that the above statements are true in
the most general sense.
Indeed, there are examples where the MLE is inconsistent. The following example is classical.
Example 2.35 (Neyman-Scott problem). Let 𝑋𝑖𝑗 ∼ind 𝑁(𝜃𝑖, 𝜎²), 𝑖 = 1, … , 𝑛 and 𝑗 = 1, … , 𝑟. The unique MLE is given by
Θ̂𝑖 = (1∕𝑟) ∑ⱼ₌₁ʳ 𝑋𝑖𝑗 and Σ̂² = (1∕(𝑟𝑛)) ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ʳ (𝑋𝑖𝑗 − Θ̂𝑖)².
These can be derived by taking partial derivatives of the loglikelihood function and equating these to zero.
To see that Σ̂² is not consistent, note that the statistics 𝑆𝑖² = ∑ⱼ₌₁ʳ (𝑋𝑖𝑗 − Θ̂𝑖)² are independent and identically distributed with expectation E 𝑆𝑖² = (𝑟 − 1)𝜎². Therefore 𝑛⁻¹ ∑ᵢ₌₁ⁿ 𝑆𝑖² →p (𝑟 − 1)𝜎² and hence Σ̂² →p ((𝑟 − 1)∕𝑟) 𝜎².
In this example, the number of “nuisance parameters” (parameters which are in the model but not of direct interest), {𝜃𝑖}ᵢ₌₁ⁿ, grows with 𝑛.
In case the MLE is inconsistent, it is common to fix this problem by either adjusting the likelihood function (in ways that can appear to be a bit ad hoc) or the estimator. As an example, in the previous Neyman-Scott problem, it is obvious that the estimator (𝑟∕(𝑟 − 1)) Σ̂² is consistent for estimating 𝜎², as the simulation sketch below illustrates.
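A minimal Julia simulation sketch (the parameter values, including 𝜃𝑖 ∼ 𝑁(0, 1), are illustrative assumptions) showing the inconsistency of Σ̂² and the effect of the correction factor:

    using Random, Statistics
    Random.seed!(1)
    n, r, σ = 10_000, 3, 2.0
    θ = randn(n)                          # nuisance means θ1, …, θn
    X = θ .+ σ .* randn(n, r)             # Xij ~ N(θi, σ²), independent
    thetahat = mean(X, dims = 2)          # the Θ̂i
    S2 = sum((X .- thetahat) .^ 2) / (n * r)         # the MLE Σ̂²
    (S2, (r - 1) / r * σ^2, r / (r - 1) * S2)        # Σ̂² ≈ (r−1)σ²/r; corrected ≈ σ²
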
In nonparametric estimation problems inconsistency of the MLE is not uncommon. See for instance Sec-
tion 2.2 of Groeneboom and Jongbloed [2014] for an example on nonparametric estimation of a decreasing
density.
2.4 Hypothesis testing
Suppose we wish to test 𝐻0 ∶ 𝜃 ∈ Ω0 against 𝐻1 ∶ 𝜃 ∈ Ω1, where Ω0 and Ω1 are disjoint subsets of Ω. A test consists of a statistic 𝑇 and a critical region 𝐶: we reject 𝐻0 if 𝑇 ∈ 𝐶. The power function of the test is defined by 𝛽(𝜃) = P𝜃(𝑇 ∈ 𝐶). Ideally, we would like to have 𝛽(𝜃) = 0 if 𝜃 ∈ Ω0 and 𝛽(𝜃) = 1 if 𝜃 ∈ Ω1, but due to randomness this is of course impossible.
An important question concerns the choice of 𝑇 and 𝐶 for a particular statistical model. For the moment
suppose we have chosen some statistic 𝑇 . The critical region should be such that if 𝑇 ∈ 𝐶 we have an
indication that 𝐻1 may be true. In simple settings, this is sort of clear. In general, there is no rule for this,
and there may be situations where it is unclear how to choose 𝐶 (examples are given in Clayton [2021]). For
now, let’s neglect these issues. The classical Neyman-Pearson theory then treats 𝐻0 and 𝐻1 asymmetrically
and prescribes to choose the region 𝐶 as follows:
1. First choose a significance level 𝛼 ∈ [0, 1] and take 𝐶 such that the inequality
sup𝜃∈Ω0 P𝜃(𝑇 ∈ 𝐶) ≤ 𝛼 (2.2)
is satisfied. This ensures the probability of a type I error (incorrectly rejecting 𝐻0) to be bounded by 𝛼.
2. Conditional on (2.2), maximise 𝛽(𝜃) for all 𝜃 ∈ Ω1 by choosing the volume of 𝐶 as large as possible.
This ensures minimising the type II error (incorrectly not rejecting 𝐻0 ).
If 𝐶 is given, the number sup𝜃∈Ω0 𝛽(𝜃) is called the significance level of the test.
For choosing a test statistic we need to define an optimality criterion.
Definition 2.38. Suppose (𝑇 , 𝐶) is a test with significance level 𝛼. It is called Uniformly Most Powerful
(UMP) if
P𝜃(𝑇 ∈ 𝐶) ≥ P𝜃(𝑇̃ ∈ 𝐶̃) for all 𝜃 ∈ Ω1
for all tests (𝑇̃, 𝐶̃) with significance level 𝛼 (for the same hypothesis testing problem).
Does such a test exist? To be able to answer this question we have to include randomised tests. These
are tests where the decision for rejecting the null hypothesis not only depends on the observation 𝑋, but on an
additional independent random variable 𝑈 as well. Few statisticians will recommend such a test for practical
purposes. However, including randomised tests enables us to establish UMP-tests in certain settings. Before
we give an example of a randomised test, we note that for a nonrandomised test the power function can be
written as
𝛽(𝜃) = E𝜃 𝜙(𝑋) where 𝜙(𝑋) = 𝟏{𝑇 (𝑋)∈𝐶} .
This function 𝜙 is called the critical function of the test: it gives the probability to reject the null hypothesis
when 𝑋 is observed. Clearly, 𝜙(𝑋) ∈ {0, 1} in case of a nonrandomised test.
Example 2.39. Suppose 𝑋 ∼ 𝐵𝑖𝑛 (10, 𝜃) and we wish to test 𝐻0 ∶ 𝜃 ≤ 1∕2 versus 𝐻1 ∶ 𝜃 > 1∕2. As
we only have one observation it is natural to take 𝑇 (𝑋) = 𝑋. We will reject 𝐻0 for large values of 𝑋 and
hence the critical region is of the form 𝐶 = {𝑐, 𝑐 + 1, … , 10}. To obtain maximal power of the test under
the alternative hypothesis, we wish to attain significance level 𝛼 = 0.05 exactly (equality in equation (2.2)).
Since
sup𝜃≤1∕2 P𝜃(𝑇 ∈ 𝐶) = P1∕2(𝑋 ≥ 𝑐) = 0.055 if 𝑐 = 8 and 0.011 if 𝑐 = 9, (2.3)
we see that this is impossible. As a remedy, one can choose 𝑐 = 8 with probability 𝛾 and 𝑐 = 9 with
probability 1 − 𝛾, where we choose 𝛾 such that
0.055𝛾 + 0.011(1 − 𝛾) = 𝛼.
Taking 𝛼 = 0.05 gives 𝛾 ≈ 0.89 and leads to the notion of a randomised test. The probability 𝛾 is chosen such that on average (if we were to repeat the testing procedure infinitely often) significance level 𝛼 is obtained.
To make this mathematically precise, define
𝜙(𝑥, 𝑢) = 𝟏{𝑥 ≥ 9} + 𝟏{𝑥 = 8}𝟏{𝑢 ≤ 𝛾},
where 𝑈 ∼ 𝑈𝑛𝑖𝑓(0, 1) is independent of 𝑋. The interpretation of this function is that we reject when 𝜙(𝑋, 𝑈) = 1. We have
E𝜃 𝜙(𝑋, 𝑈 ) = E𝜃 E[𝜙(𝑋, 𝑈 ) ∣ 𝑋],
the inner expectation on the right-hand-side being over 𝑈 .
The critical function of the randomised test is defined by 𝜙(𝑥) = E[𝜙(𝑥, 𝑈)]. This function gives the probability of rejecting 𝐻0 when 𝑋 = 𝑥 is observed. In this example, we have 𝜙(𝑥) = 𝟏{𝑥 ≥ 9} + 𝛾𝟏{𝑥 = 8}. So we always reject when 𝑋 ∈ {9, 10} and if 𝑋 = 8 we reject with probability 𝛾. As a check on our calculations, we establish that
E1∕2 𝜙(𝑋, 𝑈) = P1∕2(𝑋 ≥ 9) + 𝛾 P1∕2(𝑋 = 8) ≈ 0.011 + 0.89 × 0.044 ≈ 0.05.
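These numbers are easily verified in a few lines of Julia (assuming the Distributions package):

    using Distributions
    B = Binomial(10, 0.5)
    p8 = ccdf(B, 7)                 # P(X ≥ 8) ≈ 0.055
    p9 = ccdf(B, 8)                 # P(X ≥ 9) ≈ 0.011
    γ = (0.05 - p9) / (p8 - p9)     # ≈ 0.89, as in equation (2.4) below with k̄ = 8
    level = p9 + γ * pdf(B, 8)      # attained level: exactly 0.05
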
Suppose Ω0 = {𝜃0} and the test statistic is such that we reject for large values. We claim that we can always find values 𝑘̄ and 𝛾 ∈ [0, 1] such that the test
𝜙(𝑥) = 𝟏{𝑇(𝑥) > 𝑘̄} + 𝛾 𝟏{𝑇(𝑥) = 𝑘̄}
satisfies E𝜃0 𝜙(𝑋) = 𝛼. The construction goes as follows: first find the smallest value 𝑘̄ such that P𝜃0(𝑇 > 𝑘̄) ≤ 𝛼 (if P𝜃0(𝑇 ≥ 𝑘̄) happens to be exactly 𝛼 no randomisation is required and this value of 𝑘̄ gives a nonrandomised test of level 𝛼 with 𝛾 = 0). Using equation (2.3) of the example we have 𝑘̄ = 8 in that case. If we define
If we define
𝛾 = (𝛼 − P𝜃0(𝑇 > 𝑘̄)) ∕ (P𝜃0(𝑇 ≥ 𝑘̄) − P𝜃0(𝑇 > 𝑘̄)), (2.4)
then we have E𝜃0 𝜙(𝑋) = 𝛼, as claimed.
We now turn to the question of choosing the test statistic. In case Ω0 = {𝜃0 } and Ω1 = {𝜃1 } there
is a clear answer to this question. If the distribution under the null or alternative hypothesis is completely
specified we call the hypothesis simple. So here we consider the case of testing a simple versus a simple
hypothesis. This case is easy to analyse as the power function for a test 𝜙 has only two values: E0 𝜙 and E1 𝜙
(E𝑖 denotes expectation under P𝜃𝑖 ). We look for a test that
maximises E1𝜙(𝑋) subject to E0𝜙(𝑋) ≤ 𝛼. (2.5)
Denote the density of 𝑃𝜃𝑖 by 𝑓𝑖. The following lemma is known as the Neyman-Pearson lemma.
Lemma 2.41. Define the test
𝜙∗(𝑥) = 1 if 𝑓1(𝑥) > 𝑘𝑓0(𝑥), 𝜙∗(𝑥) = 𝛾(𝑥) if 𝑓1(𝑥) = 𝑘𝑓0(𝑥), and 𝜙∗(𝑥) = 0 if 𝑓1(𝑥) < 𝑘𝑓0(𝑥),
where 𝑘 ≥ 0 is a constant and 𝛾 ∶ 𝒳 → [0, 1] an arbitrary function. Then
1. Optimality: Let 𝛼 ∗ = E0 𝜙∗ . For any 𝑘 and 𝛾(𝑥), 𝜙∗ is the test that maximises E1 𝜙 over all tests 𝜙
with E0 𝜙 ≤ 𝛼 ∗ . That is 𝜙∗ is the most powerful size 𝛼 ∗ -test.
2. Existence: Given 𝛼 ∈ (0, 1), there exist constants 𝑘 and 𝛾 such that the test 𝜙∗ with 𝛾(𝑥) = 𝛾 has size
exactly equal to 𝛼.
3. Uniqueness: If the test 𝜙 has size 𝛼 and is of maximum power amongst all possible tests with level at
most 𝛼, then 𝜙𝟏𝐵 = 𝜙∗ 𝟏𝐵 , where 𝐵 = {𝑥 ∶ 𝑓1 (𝑥) ≠ 𝑘𝑓0 (𝑥)}.
Proof. Note that the existence part was already proved in the previous discussion (take 𝑇 (𝑥) = 𝑓1 (𝑥)∕𝑓0 (𝑥)).
For the optimality result, let 𝜙 be any test with E0 𝜙 ≤ 𝛼 ∗ . Define
𝑈(𝑥) = (𝜙∗(𝑥) − 𝜙(𝑥))(𝑓1(𝑥) − 𝑘𝑓0(𝑥))
and verify that 𝑈(𝑥) ≥ 0. Hence
0 ≤ ∫ 𝑈(𝑥) 𝜈( d𝑥) = (E1𝜙∗ − E1𝜙) − 𝑘(E0𝜙∗ − E0𝜙) ≤ E1𝜙∗ − E1𝜙,
since E0𝜙 ≤ 𝛼∗ = E0𝜙∗ and 𝑘 ≥ 0. This proves optimality.
Remark 2.42. The NP-setup has become a default choice for many scientists, but has been criticised a lot
(the same holds for 𝑝-values). Instead of bounding the type I error by 𝛼 one could for example also derive a
test by choosing the critical region 𝐶 to minimise a linear combination of the type I and type II errors. Take 𝑘 > 0 and suppose Ω = {𝜃0, 𝜃1} (the simplest case). Instead of the NP-setup we could choose the critical region to minimise
P𝜃0(𝑇 ∈ 𝐶) + 𝑘 P𝜃1(𝑇 ∉ 𝐶).
The form of the critical region is just as with the standard NP-test, but the criterion to decide to reject is
derived differently.
Example 2.43. Suppose 𝑋 ∼ 𝐸𝑥𝑝(𝜃) and consider 𝐻0 ∶ 𝜃 = 𝜃0 = 1 (say) against 𝐻1 ∶ 𝜃 = 𝜃1 where 𝜃1 > 1. We reject 𝐻0 when
𝑓𝑋(𝑥; 𝜃1)∕𝑓𝑋(𝑥; 𝜃0) = 𝜃1 exp(−(𝜃1 − 1)𝑥) ≥ 𝑘.
This is the case when 𝑥 < (𝜃1 − 1)⁻¹ log(𝜃1∕𝑘) =∶ 𝑘′. Solving P𝜃0(𝑋 < 𝑘′) = 𝛼 gives 𝑘′ = − log(1 − 𝛼). Hence, according to the Neyman-Pearson lemma, the optimal test rejects when 𝜙(𝑋) = 1, where 𝜙(𝑥) = 𝟏{𝑥 < − log(1 − 𝛼)}.
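A small Julia sketch of this test (note that Exponential in Distributions.jl is parametrised by its mean, hence the 1∕𝜃1; the value 𝜃1 = 3 is an illustrative assumption):

    using Distributions
    α, θ1 = 0.05, 3.0
    k = -log(1 - α)                       # reject H0 iff x < k
    power = cdf(Exponential(1 / θ1), k)   # P_{θ1}(X < k) = 1 − (1−α)^{θ1} ≈ 0.14
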
Existence of uniformly most powerful tests is a great deal to expect. It is asking the Neyman-Pearson
test for simple vs. simple hypothesis to be the same for every pair of simple hypotheses contained within 𝐻0
and 𝐻1 . In example 2.43 it is easy to see that the derived test is UMP for the hypothesis 𝐻0 ∶ 𝜃 = 𝜃0 vs.
𝐻1 ∶ 𝜃 > 𝜃0 since the critical function of the test 𝐻0 ∶ 𝜃 = 𝜃0 versus 𝐻1 ∶ 𝜃 = 𝜃1 does not depend on 𝜃1 .
This argument holds more generally in case of monotone likelihood ratios and a one-sided alternative hy-
pothesis.
Definition 2.44. The family of densities {𝑓 (⋅; 𝜃), 𝜃 ∈ Ω ⊆ ℝ} is said to be of increasing monotone like-
lihood ratio (MLR) if there exists a function 𝑡(𝑥) such that the likelihood ratio 𝑓 (𝑥; 𝜃2 )∕𝑓 (𝑥; 𝜃1 ) is a non-
decreasing function of 𝑡(𝑥) whenever 𝜃2 ≥ 𝜃1 .
Suppose 𝑋1, … , 𝑋𝑛 are i.i.d. with density 𝑓(𝑥; 𝜃) = 𝑐(𝜃)ℎ(𝑥) exp(𝜃𝜏(𝑥)). This is a one-parameter exponential family parametrised in terms of its natural parameter 𝜃. Set 𝑋 = (𝑋1, … , 𝑋𝑛). Then
𝑓𝑋(𝑥) = 𝑐(𝜃)ⁿ ∏ᵢ₌₁ⁿ ℎ(𝑥𝑖) exp(𝜃 ∑ᵢ₌₁ⁿ 𝜏(𝑥𝑖)).
Define 𝑡(𝑥) = ∑ᵢ₌₁ⁿ 𝜏(𝑥𝑖). For any 𝜃2 ≥ 𝜃1 we have
𝑓(𝑥; 𝜃2)∕𝑓(𝑥; 𝜃1) = (𝑐(𝜃2)∕𝑐(𝜃1))ⁿ exp((𝜃2 − 𝜃1)𝑡(𝑥)),
which is non-decreasing in 𝑡(𝑥). Hence the family is of increasing MLR with respect to 𝑡(𝑥).
Theorem 2.46 (Young and Smith [2005], theorem 4.2). Suppose 𝑋 has a distribution from a family which is
of increasing MLR with respect to the statistic 𝑡(𝑋) and that we wish to test 𝐻 ∶ 𝜃 ≤ 𝜃0 against 𝐻𝐴 ∶ 𝜃 > 𝜃0 .
Suppose the distribution of 𝑋 is absolutely continuous with respect to Lebesgue measure on ℝ𝑘 .
1. The test
𝜙∗(𝑥) = 1 if 𝑡(𝑥) > 𝑡0 and 𝜙∗(𝑥) = 0 if 𝑡(𝑥) ≤ 𝑡0
is uniformly most powerful among all tests of its size.
2. Given some 𝛼 ∈ (0, 1], there exists some 𝑡0 such that the test 𝜙∗ has size exactly 𝛼.
Two sided hypothesis testing involves testing 𝐻0 ∶ 𝜃 ∈ [𝜃1 , 𝜃2 ] vs. 𝐻1 ∶ 𝜃 ∈ (−∞, 𝜃1 ) ∪ (𝜃2 , ∞), where
𝜃1 < 𝜃2 or 𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻1 ∶ 𝜃 ≠ 𝜃0 . Quoting from Young and Smith [2005] (section 7.1):
In this situation, we cannot in general expect to find a UMP test, even for nice families, such
as the monotone likelihood ratio or exponential family models. The reason is obvious: if we
construct a Neyman-Pearson test of say 𝜃 = 𝜃0 against 𝜃 = 𝜃1 for some 𝜃1 ≠ 𝜃0 , the test takes
quite a different form when 𝜃1 > 𝜃0 from when 𝜃1 < 𝜃0 . We simply cannot expect one test to be
most powerful in both cases simultaneously.
Exercise 2.21 [YS exercise 4.2.] Let 𝑋1, … , 𝑋𝑛 be independent 𝑁(𝜇, 𝜎²) random variables, where 𝜎² > 0 is a known constant and 𝜇 ∈ ℝ is unknown. Show that 𝑋 = (𝑋1, … , 𝑋𝑛) has
monotone likelihood ratio. Given 𝛼 ∈ (0, 1) and 𝜇0 ∈ ℝ, construct a uniformly most powerful test
of size 𝛼 of 𝐻0 ∶ 𝜇 ≤ 𝜇0 against 𝐻1 ∶ 𝜇 > 𝜇0 , expressing the critical region in terms of the standard
normal distribution function Φ.
Exercise 2.22 Let 𝑋1, … , 𝑋𝑛 ∼iid 𝐸𝑥𝑝(𝜃). Consider 𝐻0 ∶ 𝜃 ≤ 1 against 𝐻1 ∶ 𝜃 > 1. Derive a
uniformly most powerful test of size 𝛼.
Exercise 2.23 [YS exercise 4.5.] Let 𝑋1, … , 𝑋𝑛 ∼iid 𝑈𝑛𝑖𝑓(0, 𝜃).
1. Show that there exists a uniformly most powerful size 𝛼 test of 𝐻0 ∶ 𝜃 = 𝜃0 against 𝐻1 ∶ 𝜃 >
𝜃0 and find its form.
2. Show that the test that rejects 𝐻0 if 𝑥(𝑛) > 𝜃0 or 𝑥(𝑛) ≤ 𝑏, where 𝑏 = 𝜃0𝛼^{1∕𝑛}, is a uniformly most powerful test of size 𝛼 for testing 𝐻0 against 𝐻1′ ∶ 𝜃 ≠ 𝜃0. (Note that in a more “regular” situation a UMP test of 𝐻0 against 𝐻1′ does not exist.)
Exercise 2.24 Let 𝜎12 > 0 and 𝜎22 > 0. Suppose 𝑋1 ∼ 𝑁(0, 𝜎12 ) and 𝑋2 ∼ 𝑁(0, 𝜎22 ) are inde-
pendent. Find the critical region for the best symmetric test 𝐻0 ∶ 𝑋1 ∼ 𝑁(0, 𝜎12 ), 𝑋2 ∼ 𝑁(0, 𝜎22 )
against 𝐻1 ∶ 𝑋2 ∼ 𝑁(0, 𝜎12 ), 𝑋1 ∼ 𝑁(0, 𝜎22 ).
A symmetric test here is a test that takes the opposite action if the two data values are switched, so
𝜙(𝑥1 , 𝑥2 ) = 1 − 𝜙(𝑥2 , 𝑥1 ). For a symmetric test the error probabilities under 𝐻0 and 𝐻1 will be
equal.
Note that it is not just the data 𝐷 that enters here: we also consider hypothetical data we have not seen (and may never see). The issues with 𝑝-values have been reported for decades, yet they dominate statistical reporting in the literature.
Bayesian statistics, to be fully discussed in Chapter 4, focusses on something entirely different:
ℙ(𝐻0 ∣ 𝐷) = ℙ(𝐷 ∣ 𝐻0)ℙ(𝐻0)∕ℙ(𝐷) = ℙ(𝐷 ∣ 𝐻0)ℙ(𝐻0) ∕ (ℙ(𝐷 ∣ 𝐻0)ℙ(𝐻0) + ℙ(𝐷 ∣ 𝐻1)ℙ(𝐻1)), (2.6)
the final equality being true if 𝐻1 is of the form not(𝐻0). So we answer the question what the probability of the hypothesis is, in light of the data. Doesn’t that sound more natural? Equation (2.6) is just Bayes’ formula, which can be rewritten as
ℙ(𝐻1 ∣ 𝐷)∕ℙ(𝐻0 ∣ 𝐷) = (ℙ(𝐷 ∣ 𝐻1)∕ℙ(𝐷 ∣ 𝐻0)) × (ℙ(𝐻1)∕ℙ(𝐻0)).
This reads
posterior odds = likelihood ratio × prior odds .
It is this prior odds term that frequentists object to. In the specific setting where 𝑋 ∼ 𝑓(⋅; 𝜃) and 𝐻0 ∶ 𝜃 = 𝜃0 and 𝐻1 ∶ 𝜃 = 𝜃1, we have
ℙ(𝐷 ∣ 𝐻1)∕ℙ(𝐷 ∣ 𝐻0) = 𝑓(𝑥; 𝜃1)∕𝑓(𝑥; 𝜃0).
The Neyman-Pearson lemma states that this is the optimal frequentist test statistic. Therefore, at least in this
simple setting, both approaches agree that the likelihood-ratio is of central importance. However, as we will
see, the way this is used to reach a decision (accept or reject) is fundamentally different.
2.6 P-values
Suppose (𝑇 , 𝐶𝛼 ) is a (nonrandomised) test with significance level 𝛼 and we observe a realisation 𝑡 from 𝑇 .
The rule is to reject the null hypothesis when 𝑡 ∈ 𝐶𝛼 . This gives a binary decision rule: either reject or not.
A common criticism is that this is not informative enough and one should also provide a measure of strength
of evidence in favour (or against) the null hypothesis. A perfect candidate would be the probability that the
null hypothesis is true. However, as the parameter is supposed to be fixed in classical statistics this measure
is not available (in Bayesian statistics it is).
Suppose we test 𝐻0 ∶ 𝜃 ∈ Ω0 . Because of the defining relation for 𝐶𝛼 :
sup𝜃∈Ω0 P𝜃(𝑇 ∈ 𝐶𝛼) ≤ 𝛼,
decreasing 𝛼 yields a critical region of smaller volume. Now if 𝛼 = 1, then certainly 𝑡 ∈ 𝐶𝛼 . By successively
decreasing the value of 𝛼, at a certain value we will no longer have that 𝑡 ∈ 𝐶𝛼 . The tipping point is precisely
what the 𝑝-value is.
Definition 2.47. Suppose (𝑇 , 𝐶𝛼 ) is a (nonrandomised) test with significance level 𝛼 and we observe a real-
isation 𝑡 from 𝑇 . The p-value is defined as
𝑝 = inf {𝛼 ∈ [0, 1] ∶ 𝑡 ∈ 𝐶𝛼 }.
Hence, it is the smallest value of 𝛼 for which we reject the null hypothesis when observing the realisation 𝑡. Since 𝑝 ≤ 𝛼0 if and only if 𝑡 ∈ 𝐶𝛼0, we can use the 𝑝-value to decide upon rejection of the hypothesis for any level 𝛼0 ∈ [0, 1].
Use of 𝑝-values has been much disputed in the statistical literature. For an interesting discussion, see
Schervish [1996] where it is shown that the common informal use of 𝑝-values as measures of support or
evidence for hypotheses has serious logical flaws.
As an example, suppose 𝑋 ∼ 𝐵𝑖𝑛(𝑛, 𝜃) and we test 𝐻0 ∶ 𝜃 = 1∕2 against 𝐻1 ∶ 𝜃 > 1∕2, rejecting for large values of 𝑋. Suppose we observe 𝑥 = 𝑛∕2 + √𝑛 successes. Then the 𝑝-value satisfies
P1∕2(𝑋 ≥ 𝑛∕2 + √𝑛) ≈ P(𝑍 ≥ 2) ≈ 0.023,
where 𝑍 ∼ 𝑁(0, 1). The first ≈-sign is obtained by applying the central limit theorem, which requires 𝑛 to be sufficiently large. Hence it appears we have evidence against 𝜃 = 1∕2 when using significance level 𝛼 = 0.05, the evidence being the same for all values of 𝑛 for which the central-limit-theorem approximation is valid. However, the observed proportion of successes is given by
𝑥∕𝑛 = 1∕2 + 1∕√𝑛,
which tends to 1∕2 as 𝑛 → ∞.
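The following Julia sketch (assuming the Distributions package) computes the exact binomial 𝑝-value for a few values of 𝑛, confirming that it stays near 0.023 while the observed proportion tends to 1∕2:

    using Distributions
    # p-value for H0: θ = 1/2 when x = n/2 + √n successes are observed
    pval(n) = ccdf(Binomial(n, 0.5), ceil(Int, n / 2 + sqrt(n)) - 1)   # P(X ≥ x)
    [(n, pval(n), 0.5 + 1 / sqrt(n)) for n in (100, 10_000, 1_000_000)]
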
In many practical settings of hypothesis testing, 𝐻0 is by definition not true. Hence, with a sufficiently large sample size, 𝐻0 is always rejected (as it should be). But then we only find out what we already knew before collecting data! 3
Furthermore, there are problems with the interpretation of rejecting a null hypothesis. This can be seen
from the following line of reasoning which is often seen in journal articles (taken from an article by Cohen
[1994])
3 An argument in favour of point-null hypothesis testing 𝐻0 ∶ 𝜃 = 𝜃0 is that it should be viewed as a limiting case of 𝐻0 ∶ 𝜃 ∈ (𝜃0 − 𝜀, 𝜃0 + 𝜀), with 𝜀 ↓ 0. As we will see later, point-null hypothesis testing also gives certain specific difficulties for Bayesian statistics.
If 𝐻0 is true, then this result (statistical significance) would probably not occur.
This result has occurred.
Then 𝐻0 is probably not true and therefore formally invalid.
The first line corresponds to finding a critical region 𝐶 for a test statistic 𝑇 . The second line corresponds to
discovering that the observed value of 𝑇 , denoted by 𝑡, satisfies 𝑡 ∈ 𝐶.
This reasoning is incorrect, as seen from the following
If a person is an American, then he is probably not a member of congress. True, right?
This person is a member of congress.
Therefore, he is probably not an American.
Only if the word “probably” were absent and the first line (the major premise) were correct would this line of reasoning be valid. Hence, there is a logical flaw in frequentist hypothesis testing.
2. For testing the difference of two proportions one can think of the 𝑧-statistic, the proportion test (with or without continuity correction), the 𝜒²-test, Fisher’s exact test, McNemar’s test, or logistic regression. Now choose the statistic that gives a low 𝑝-value, and don’t tell about the other tests. 5
3. If the test has a two-sided alternative, reformulate it as a one-sided alternative. This will allow you to halve the 𝑝-value.
Our last option, if you cannot lower your p-value any other way, is to change what is ac-
cepted as publishable. So, instead of a 𝑝-value of 0.05, use 0.10 and just state that this is the
level you consider as statistically significant. I haven’t seen any other number besides 0.10,
however, so if your p-value is larger than this the best you can do is to claim that your results
are “suggestive” or “in the expected direction”. Don’t scoff, because this sometimes works.
You can really only get away with this in secondary and tertiary journals (which luckily are
increasing in number) or in certain fields where the standard of evidence is low, or when
your finding is one which people want to be true. This worked for second-hand smoking
studies, for example, and currently works for anything said to be negatively associated with
global warming.
This is about the right time to start reading the article “Abandon Statistical Significance” by McShane et al., which is at the very end of these lecture notes. It illustrates that there is much doubt about hypothesis testing using 𝑝-values, as there are many questionable aspects to the procedure. Another account against the use of 𝑝-values for statistical evidence is Meester [2019] (in Dutch).
2.7 Confidence sets
Definition 2.49. Let 𝑔 ∶ Ω → 𝐺 be a function, let 𝜂 be the collection of all subsets of 𝐺, and let 𝑅 ∶ 𝒳 → 𝜂
be a function. The function 𝑅 is a level-𝛾 confidence set for 𝑔(𝜃) if for every 𝜃 ∈ Ω
P𝜃 (𝑔(𝜃) ∈ 𝑅(𝑋)) ≥ 𝛾
(an implicit assumption is that {𝑥 ∶ 𝑔(𝜃) ∈ 𝑅(𝑥)} is measurable). The confidence set is exact if P𝜃 (𝑔(𝜃) ∈
𝑅(𝑋)) = 𝛾 for all 𝜃 ∈ Ω. If inf 𝜃∈Ω P𝜃 (𝑔(𝜃) ∈ 𝑅(𝑋)) > 𝛾, the confidence set is conservative.
To highlight that it is really the set that is random (and not the parameter 𝜃) one sometimes writes
P𝜃 (𝑅(𝑋) ∋ 𝑔(𝜃)) ≥ 𝛾
5 Choosing just one favourable test is cheating, but one can state all results. If these are in conflict, that may indicate that the evidence for the alternative hypothesis is perhaps not so strong.
as it is common to put the random quantity on the left in probability statements (this is of course not nec-
essary). The interpretation of confidence sets is somewhat subtle: on repeated replication of the statistical
experiment, the fraction of sets that contain the “true” parameter 𝜃 is at least 𝛾; for a given dataset, the
confidence set either contains 𝜃, or not (in practice we don’t know which of the two cases applies).
The construction of confidence sets is a bit of an art. It is super easy to find a conservative confidence set: just take Ω! Naturally, we aim for a set that has confidence as close as possible to 𝛾 while being of
minimal size. There are some tricks/procedures for constructing confidence sets:
In case a pivotal quantity exists, the construction of a confidence set is relatively easy.
Definition 2.50. A pivot is a function ℎ ∶ 𝒳 × Ω → ℝ whose distribution does not depend on the parameter.
Suppose ℎ(𝑋, 𝜃) is pivotal. If its distribution is known, then we can find constants 𝑐ℓ and 𝑐𝑟 such that
P𝜃(𝑐ℓ ≤ ℎ(𝑋, 𝜃) ≤ 𝑐𝑟) = 𝛾 for all 𝜃 ∈ Ω. (2.7)
Suppose for ease of exposition that Ω ⊂ ℝ. If moreover ℎ(𝑋, 𝜃) is of reasonably manageable form, we can
rewrite this relation as
P𝜃 (𝐿(𝑋) ≤ 𝜃 ≤ 𝑈 (𝑋)) = 𝛾 for all 𝜃 ∈ Ω.
This would then deliver the exact level-𝛾 confidence set 𝑅(𝑋) = [𝐿(𝑋), 𝑈 (𝑋)].
Example 2.51. Assume 𝑋1, … , 𝑋𝑛 ∼iid 𝑁(𝜃, 𝜎²), where both 𝜃 and 𝜎 are unknown. Let
𝑋̄𝑛 = (1∕𝑛) ∑ᵢ₌₁ⁿ 𝑋𝑖 and 𝑆𝑛² = (1∕(𝑛 − 1)) ∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄𝑛)².
Then, if 𝑋 = (𝑋1, … , 𝑋𝑛),
ℎ(𝑋, 𝜃) = √𝑛 (𝑋̄𝑛 − 𝜃)∕𝑆𝑛 ∼ 𝑡(𝑛 − 1)
is pivotal. Hence, using the quantiles of the 𝑡(𝑛 − 1)-distribution, we can find constants 𝑐ℓ and 𝑐𝑟 such that (2.7) holds under P(𝜃,𝜎²). In this case we can rewrite this expression as
P(𝜃,𝜎²)(𝑋̄𝑛 − 𝑐𝑟 𝑆𝑛∕√𝑛 ≤ 𝜃 ≤ 𝑋̄𝑛 − 𝑐ℓ 𝑆𝑛∕√𝑛) = 𝛾.
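In Julia the resulting 𝑡-interval can be computed as follows (a minimal sketch with simulated data; the values 𝜃 = 1.5, 𝜎 = 2 and 𝑛 = 20 are illustrative assumptions):

    using Distributions, Random, Statistics
    Random.seed!(2)
    x = 1.5 .+ 2.0 .* randn(20)               # simulated data
    n, γ = length(x), 0.95
    c = quantile(TDist(n - 1), (1 + γ) / 2)   # by symmetry cr = −cℓ = c
    m, s = mean(x), std(x)                    # std uses the n − 1 denominator, matching Sn
    (m - c * s / sqrt(n), m + c * s / sqrt(n))   # exact level-γ confidence interval for θ
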
Exercise 2.26 Suppose 𝑋1, … , 𝑋𝑛 ∼iid 𝑈𝑛𝑖𝑓(0, 𝜃). Show that 𝑋(𝑛)∕𝜃 is pivotal and construct a
confidence interval for 𝜃 of level 𝛾.
Pivots need not exist. In that case one can try to find an asymptotic pivot to derive an asymptotic confidence set. Suppose the MLE of a one-dimensional parameter is asymptotically normal:
√(𝑛𝐼(𝜃0)) (Θ̂𝑛 − 𝜃0) ⇝ 𝑁(0, 1).
Suppose 𝐼̂(𝜃0) is a weakly consistent estimator for 𝐼(𝜃0). Then
√(𝑛𝐼̂(𝜃0)) (Θ̂𝑛 − 𝜃0) ⇝ 𝑁(0, 1),
and this asymptotic pivot yields a confidence interval in the usual way. (It has been argued that the coverage probability of the confidence interval is closer to 1 − 𝛼 when using the observed Fisher information instead of the plug-in estimator for the Fisher information.)
There is a direct relation between point null hypothesis testing and constructing confidence sets. The
idea is that a 1 − 𝛼 confidence set is obtained from those values 𝜃0 for which the hypothesis 𝐻0 ∶ 𝑔(𝜃) = 𝜃0
is not rejected at level 𝛼. As a consequence, confidence sets can be derived from hypothesis tests and vice
versa.
Proposition 2.52 (Schervish [1995], proposition 5.48). Let 𝑔 ∶ Ω → 𝐺 be a function.
• For each 𝑦 ∈ 𝐺, let 𝜙𝑦 be a level-𝛼 nonrandomised test of 𝐻0 ∶ 𝑔(𝜃) = 𝑦. Let 𝑅(𝑥) = {𝑦 ∶ 𝜙𝑦(𝑥) = 0}. Then 𝑅 is a level-(1 − 𝛼) confidence set for 𝑔(𝜃).
• Conversely, let 𝑅 be a level-(1 − 𝛼) confidence set for 𝑔(𝜃). Then, for each 𝑦 ∈ 𝐺, the test 𝜙𝑦(𝑥) = 𝟏{𝑦∉𝑅(𝑥)} is a level-𝛼 nonrandomised test of 𝐻0 ∶ 𝑔(𝜃) = 𝑦.
Proof. For the first statement, note that P𝜃(𝑔(𝜃) ∈ 𝑅(𝑋)) = P𝜃(𝜙𝑔(𝜃)(𝑋) = 0) = 1 − E𝜃 𝜙𝑔(𝜃)(𝑋). The final equality holds since the test is nonrandomised and hence takes values in {0, 1}. The result follows since sup𝑔(𝜃)=𝑦 E𝜃 𝜙𝑦(𝑋) ≤ 𝛼.
For the second statement, the power function satisfies 𝛽(𝜃) = E𝜃 𝜙𝑦(𝑋) = P𝜃(𝑦 ∉ 𝑅(𝑋)). Hence
sup𝑔(𝜃)=𝑦 𝛽(𝜃) = 1 − inf𝑔(𝜃)=𝑦 P𝜃(𝑦 ∈ 𝑅(𝑋)) = 1 − inf𝑔(𝜃)=𝑦 P𝜃(𝑔(𝜃) ∈ 𝑅(𝑋)) ≤ 𝛼,
where the last inequality uses that P𝜃(𝑔(𝜃) ∈ 𝑅(𝑋)) ≥ 1 − 𝛼.
As in many settings no “best” hypothesis test exists, it is not surprising that constructing a “best” confi-
dence interval (set) is far from trivial.
2.8 Some additional results on complete statistics*
If 𝑇 is complete and 𝑆 = ℎ(𝑇 ), then 𝑆 is complete. The Lehmann-Scheffé theorem states that a
complete sufficient statistic is minimal sufficient.
Theorem 2.54. Suppose 𝑇 is sufficient and complete for 𝜃 ∈ Ω, then 𝑇 is minimal sufficient.
Proof. Let 𝑆 be a minimal sufficient statistic. Then there exists a measurable function ℎ such that 𝑆 = ℎ(𝑇 ).
Define 𝑔(𝑆) = E[𝑇 ∣ 𝑆], which does not depend on 𝜃, as 𝑆 is sufficient. Taking expectations, we get
E𝜃 [𝑔(𝑆)] = E𝜃 [E[𝑇 ∣ 𝑆]] = E𝜃 [𝑇 ]. Hence E𝜃 [𝑇 − 𝑔(𝑆)] = 0 for all 𝜃. By completeness, this implies
𝑇 = 𝑔(𝑆), P𝜃 -a.s. As both 𝑆 = ℎ(𝑇 ) and 𝑇 = 𝑔(𝑆), we conclude that 𝑆 and 𝑇 are one-to-one functions of
each other. Hence 𝑇 is minimal sufficient.
Theorem 2.55 (Lehmann-Scheffé). If 𝑇 is a complete statistic, then all unbiased estimators of 𝑔(𝜃) that are
functions of 𝑇 alone are equal P𝜃 -a.s. for all 𝜃. If there exists an unbiased estimator that is a function of a
complete sufficient statistic, then it is UMVU.
Proof. Suppose both 𝜙1 (𝑇 ) and 𝜙2 (𝑇 ) are unbiased estimators for 𝑔(𝜃). Then
( )
E𝜃 𝜙1 (𝑇 ) − 𝜙2 (𝑇 ) = 0
for all 𝜃 and by completeness it follows that 𝜙1 (𝑇 ) = 𝜙2 (𝑇 ), P𝜃 -a.s. This proves the first statement.
For the second statement, suppose there is an unbiased estimator 𝜙(𝑋) with finite variance. Define 𝜙3(𝑇) = E[𝜙(𝑋) ∣ 𝑇]. Then 𝜙3(𝑇) is unbiased and its risk is no larger than that of 𝜙(𝑋); the inequality follows from the conditional Jensen inequality and 𝑥 ↦ 𝐿(𝑥) being convex.
Exercise 2.27 Suppose 𝑋 ∼ 𝐸𝑥𝑝 (𝜃). Show that no unbiased estimator for 𝜃 exists. Proceed as
follows:
3. As this implies that 𝜙(𝑋) = 0, conclude that 𝜙(𝑋) cannot be unbiased for 𝜃.
Chapter 3
The likelihood principle
In this chapter we discuss three principles: the likelihood, sufficiency and conditionality principles.
Definition 3.1 (Weak Sufficiency principle (WSP)). Two observations 𝑥 and 𝑥′ factorising through the
same value of a sufficient statistic 𝑇 , that is, such that 𝑇 (𝑥) = 𝑇 (𝑥′ ), must lead to the same inference on 𝜃.
The second principle that we discuss here can be attributed to Fisher [1959] and Barnard [1949] and was
formalised by Birnbaum [1962]. In its definition, the notion of information is to be considered in the general
sense of the collection of all possible inferences on 𝜃.
Definition 3.2 (Likelihood principle (LP)). The information brought by an observation 𝑥 about 𝜃 is entirely
contained in the likelihood function 𝐿(𝜃; 𝑥). Moreover, if 𝑥 and 𝑥′ are two observations depending on the
same parameter 𝜃 (possibly in different experiments), such that there exists a constant 𝑐 satisfying 𝐿(𝜃; 𝑥) =
𝑐𝐿′ (𝜃; 𝑥′ ) for every 𝜃, they bring the same information about 𝜃 and must lead to identical inferences.
The likelihood principle says this: the likelihood function, long known to be a minimal sufficient
statistic, is much more than merely a sufficient statistic, for given the likelihood function in
which an experiment has resulted, everything else about the experiment – what its plan was,
what different data might have resulted from it, the conditional distributions of statistics under given parameter values, and so on – is irrelevant.
Under this principle, quantities that depend on the sampling distribution of a statistic, which is in general not a function of the likelihood function alone, are irrelevant for statistical inference. This principle is only valid when (i) inference is about the same parameter 𝜃, and (ii) 𝜃 includes every unknown factor of the model.
Example 3.3. Suppose 9 Bernoulli trials resulted in success and 3 Bernoulli trials in failure. In case we assume 12 observations were to be collected, the number of successes 𝑋 has the 𝐵𝑖𝑛(12, 𝜃)-distribution and the likelihood for 𝑥 = 9 equals
𝐿(𝜃; 𝑥) = C(12, 9) 𝜃⁹(1 − 𝜃)³,
where C(𝑛, 𝑘) denotes the binomial coefficient. In case we assume we continued experimenting until we had obtained 3 failures, the number of successes 𝑋′ has the negative Binomial distribution: 𝑋′ ∼ 𝑁𝑒𝑔𝐵𝑖𝑛(3, 𝜃). In this case
𝐿′(𝜃; 𝑥′) = C(11, 9) 𝜃⁹(1 − 𝜃)³.
The likelihood principle implies that inference on 𝜃 should be identical for both models. Hence, the rule that determines when to stop gathering more samples is irrelevant.
Now consider the problem of testing 𝐻0 ∶ 𝜃 = 1∕2 versus 𝐻1 ∶ 𝜃 > 1∕2. Assuming the Binomial distribution, the 𝑝-value equals
(C(12, 9) + C(12, 10) + C(12, 11) + C(12, 12)) (1∕2)¹² ≈ 0.073.
Assuming the negative Binomial distribution, the 𝑝-value equals ∑𝑥≥9 C(𝑥 + 2, 𝑥) (1∕2)^(𝑥+3) ≈ 0.033. Using significance level 𝛼 = 0.05, the calculated 𝑝-values lead to different conclusions. From this simple example we can clearly see that as the 𝑝-value depends on a tail probability, it violates the likelihood principle. This conclusion generalises to all frequentist hypothesis testing: classical hypothesis testing does not satisfy the LP.
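Both tail probabilities are easily computed in Julia (assuming the Distributions package; in Distributions’ convention 𝑋′ here follows NegativeBinomial(3, 1 − 𝜃), which for 𝜃 = 1∕2 is NegativeBinomial(3, 0.5)):

    using Distributions
    p_binom  = ccdf(Binomial(12, 0.5), 8)           # P(X ≥ 9) ≈ 0.073
    p_negbin = ccdf(NegativeBinomial(3, 0.5), 8)    # P(X′ ≥ 9) ≈ 0.033
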
Related to this example you may wish to read the first section of the article Lindley and Phillips [1976].
In this work the Bernoulli experiment consists of throwing a metal drawing pin (thumb tack) and observing
whether it fell with the point uppermost (abbreviated to U) or with the point resting on the table, downwards
(abbreviated to D). Regarding the criterion to stop sampling they remark:
In other words, the significance (in the technical sense) to be associated with the hypothesis of
equal chances depends heavily on what other results could have been achieved besides the 9 U’s
and 3 D’s reported. Thus was (10,2) or (10,3) an alternative possibility? And yet it is rare for
anyone to ask a scientist for this information. In fact, in the little experiment with the drawing
pin I continued tossing until my wife said "Coffee’s ready". Exactly how a significance test is to
be performed in these circumstances is unclear to me.
Since the Bayesian approach is entirely based on the posterior distribution, which only depends on 𝑥 through the likelihood, the LP is automatically satisfied in a Bayesian setting.1 However, when it comes to estimation, the LP does not imply Bayesian estimators: maximum likelihood, for example, is a frequentist implementation of the LP.
1 However, in case Jeffreys’ prior is used in a Bayesian analysis, the likelihood principle is violated.
Certainly not all statisticians accept the LP. Indeed, there is still controversy among researchers about its validity and implications. 2 However, Birnbaum [1962] proved that if you accept the WSP together with the following principle, then you necessarily must accept the LP. We give a precise statement in section 3.2.
Definition 3.4 (Conditionality Principle (CP)). If two experiments on the parameter 𝜃, ℰ1 and ℰ2, are available and if one of these two experiments is selected with probability 𝑝, the resulting inference on 𝜃 should only depend on the selected experiment.
Note that it is assumed that the probability 𝑝 does not depend on 𝜃. Both the WSP and the CP appear to be reasonable. As they imply the LP, this has far-reaching consequences.
To illustrate the point of the conditionality principle, we consider an example. It is copied from the recent
article by Gandenberger [2015] and resembles the example given in chapter 7.2 of Young and Smith [2005].
Example 3.5. Suppose you work in a laboratory that contains three thermometers, T1, T2, and
T3. All three thermometers produce measurements that are normally distributed about the true
temperature being measured. The variance of T1’s measurements is equal to that of T2’s but
much smaller than that of T3’s. T1 belongs to your colleague John, so he always gets to use
it. T2 and T3 are common lab property, so there are frequent disputes over the use of T2. One
day, you and another colleague both want to use T2, so you toss a fair coin to decide who gets
it. You win the toss and take T2. That day, you and John happen to be performing identical
experiments that involve testing whether the temperature of your respective indistinguishable
samples of some substance is greater than 0◦ C or not. John uses T1 to measure his sample
and finds that his result is just statistically significantly different from 0◦ C. John celebrates and
begins making plans to publish his result. You use T2 to measure your sample and happen to
measure exactly the same value as John. You celebrate as well and begin to think about how
you can beat John to publication. “Not so fast”, John says. “Your experiment was different from
mine. I was bound to use T1 all along, whereas you had only a 50% chance of using T2. You
need to include that fact in your calculations. When you do, you’ll find that your result is no
longer significant.”
Gandenberger [2015] comments on this example
According to radically ‘behaviouristic’ forms of frequentism, John may be correct. You per-
formed a mixture experiment by flipping a coin to decide which of two thermometers to use,
and thus which of two component experiments to perform. The uniformly most powerful level
𝛼 test for that mixture experiment does not consist of performing the uniformly most power-
ful level 𝛼 test for whichever component experiment is actually performed. Instead, it involves
accepting probability of Type I error greater than 𝛼 when T3 is used in exchange for a proba-
bility of Type I error less than 𝛼 when T2 is used, in such a way that the probability of Type I
error for the mixture experiment as a whole remains 𝛼 (see Cox [1958], p. 360). Most statisti-
cians, including most frequentists, reject this line of reasoning. It seems suspicious for at least
three reasons. First, the claim that your measurement warrants different conclusions from John’s
seems bizarre. They are numerically identical measurements from indistinguishable samples of
the same substance made using measuring instruments with the same stochastic properties. The
only difference between your procedures is that John was ‘bound’ to use the thermometer he
used, whereas you had a 50% chance of using a less precise thermometer. It seems odd to claim
2 The topic of this chapter is complicated, but I feel you should at least have taken notice of the principles discussed. Somewhat confusingly, the three principles are not always stated in exactly the same form in the literature. I am not a specialist on this topic and have tried to summarise some of the main findings in an accessible way.
that the fact that you could have used an instrument other than the one you actually used is rele-
vant to the interpretation of the measurement you actually got using the instrument you actually
used. Second, the claim that John was “bound” to use T1 warrants scrutiny. Suppose that he
had won that thermometer on a bet he made ten years ago that he had a 50% chance of winning,
and that if he hadn’t won that bet, he would have been using T3 for his measurements. Accord-
ing to his own reasoning, this fact would mean that his result is not statistically significant after
all. The implication that one might have to take into account a bet made ten years ago that has
nothing to do with the system of interest to analyse John’s experiment is hard to swallow. In
fact, this problem is much deeper than the fanciful example of John winning the thermometer
in a bet would suggest. If John’s use of T1 as opposed to some other thermometer with differ-
ent stochastic properties was a nontrivial result of any random process at any point in the past
that was independent of the temperature being measured, then the denial of Weak Condition-
ality Principle as applied to this example implies that John analysed his data using a procedure
that fails to track evidential meaning. Third, at the time of your analysis you know which ther-
mometer you received. How could it be better epistemically to fail to take that knowledge into
account?
The following is a numerical illustration of this example. Suppose 𝛽 ∈ [0, 1] and 𝑋 has density
𝑓(𝑥; 𝜇) = 𝛽𝜙(𝑥; 𝜇, 1) + (1 − 𝛽)𝜙(𝑥; 𝜇, 𝜎²),
where 𝜙(⋅; 𝜇, 𝜎²) denotes the density of the 𝑁(𝜇, 𝜎²)-distribution. Assume 𝜎² is known and for one experiment 𝛽 = 1 and for the other experiment 𝛽 = 1∕2. Assume we wish to test 𝐻0 ∶ 𝜇 = 0 versus 𝐻1 ∶ 𝜇 ≠ 0 at level 𝛼 = 0.05. In case 𝛽 = 1, we reject when |𝑥| > 1.96. If 𝛽 = 1∕2 we reject when |𝑥| > 𝑘, where 𝑘 solves
(1∕2) ∫𝑘^∞ 𝜙(𝑥; 0, 1) d𝑥 + (1∕2) ∫𝑘^∞ 𝜙(𝑥; 0, 𝜎²) d𝑥 = 0.025.
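There is no closed-form solution for 𝑘, but bisection suffices. A Julia sketch (the value 𝜎 = 3 is an illustrative assumption; note that Normal in Distributions.jl takes the standard deviation, not the variance, as its second argument):

    using Distributions
    # two-sided cutoff k in the mixture experiment with β = 1/2
    function cutoff(σ; α = 0.05, lo = 0.0, hi = 50.0)
        excess(k) = 0.5 * ccdf(Normal(0, 1), k) + 0.5 * ccdf(Normal(0, σ), k) - α / 2
        for _ in 1:60                      # bisection on the decreasing function excess
            mid = (lo + hi) / 2
            excess(mid) > 0 ? (lo = mid) : (hi = mid)
        end
        (lo + hi) / 2
    end
    cutoff(3.0)    # larger than 1.96, smaller than 1.96σ
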
This means that if you accept the intuitively reasonable principles CP and WSP, you necessarily accept
the LP and its consequences.
Exercise 3.1 [YS exercise 7.4.] Suppose 𝑋 ∼ 𝑁 (𝜃, 1) or 𝑋 ∼ 𝑁 (𝜃, 4), depending on whether the
outcome, 𝑌 , of tossing a fair coin is heads (𝑦 = 1) or tails (𝑦 = 0). It is desired to test 𝐻0 ∶ 𝜃 = −1
against 𝐻1 ∶ 𝜃 = 1. Show that the most powerful (unconditional) size 𝛼 = 0.05 test is the test with
rejection region given by 𝑥 ≥ 0.598 if 𝑦 = 1 and 𝑥 ≥ 2.392 if 𝑦 = 0.
Suppose instead that we condition on the outcome of the coin toss in construction of the tests. Verify
that, given 𝑦 = 1, the resulting most powerful size 𝛼 = 0.05 test would reject if 𝑥 ≥ 0.645, while,
given 𝑦 = 0, the rejection region would be 𝑥 ≥ 2.290.
Exercise 3.2 ⟨Berger and Wolpert (1988), exercise 1.17 in Robert [2007].⟩ Consider an experiment with outcomes in {1, 2, 3} and probability mass functions 𝑓(⋅; 𝜃), 𝜃 ∈ {0, 1} given by
𝑥         1     2     3
𝑓(𝑥; 0)   0.9   0.05  0.05
𝑓(𝑥; 1)   0.1   0.05  0.85
2. Now suppose you get the realisation 𝑥 = 2. The frequentist test then rejects. Based on the
likelihood ratio, is there strong evidence to reject?
Exercise 3.3 ⟨Cox (1958), exercise 1.20 in Robert [2007].⟩ In a research laboratory, a physical
quantity 𝜃 can be measured by a precise but often busy machine, which provides a measurement
𝑋1 ∼ 𝑁 (𝜃, 0.1), with probability 0.5, or through a less precise but always available machine, which
gives 𝑋2 ∼ 𝑁 (𝜃, 10). Suppose you obtain an observation and wish to make a confidence interval
for 𝜃, say with confidence 95%. If you take into account that both machines could have been se-
lected, show that the half-width of the confidence interval equals 5.19, while the half-width of the
confidence interval obtained from the precise machine equals 0.62.
Hint: In order to derive the half-width for the case where both machines could have been selected,
first think about what the density will be in that case. See the previous page for an example of such
a combined density. Given this density, try to derive the half-width.
Exercise 3.4 Show by means of an example that the principle of unbiased estimation does not
respect the likelihood principle.
Hint: construct unbiased estimators for 𝜃 in example 3.3
Definition 3.7. If one makes the same inference on 𝜃 if one performs ℰ1 and observes 𝑥1 or performs ℰ2 and observes 𝑥2, then we write
(ℰ1, 𝑥1) ∼ (ℰ2, 𝑥2)
and say that (ℰ1, 𝑥1) and (ℰ2, 𝑥2) are equivalent.
If we assume the measures P𝜃,1 and P𝜃,2 admit densities 𝑓 (1) (⋅; 𝜃) and 𝑓 (2) (⋅; 𝜃) respectively, then we can
reformulate
Note [LP] gives equivalence of different experiments, while [WSP] is concerned with a single experiment.
Proof of theorem 3.6. Assume 𝑓(1)(𝑥1; 𝜃) = 𝑐𝑓(2)(𝑥2; 𝜃) for all 𝜃 ∈ Ω and some 𝑐 > 0. Consider the mixture experiment
ℰ = (1∕(1 + 𝑐)) ℰ1 + (𝑐∕(1 + 𝑐)) ℰ2.
Within the experiment ℰ, the points (1, 𝑥1) and (2, 𝑥2) have densities
(1∕(1 + 𝑐)) 𝑓(1)(𝑥1; 𝜃) and (𝑐∕(1 + 𝑐)) 𝑓(2)(𝑥2; 𝜃)
respectively. These are the same as we assumed that 𝑓(1)(𝑥1; 𝜃) = 𝑐𝑓(2)(𝑥2; 𝜃).
This equality, together with the existence of a sufficient statistic 𝑆 in the mixture experiment ℰ, implies (use lemma 2.6) that 𝑆((1, 𝑥1)) = 𝑆((2, 𝑥2)). By the sufficiency principle this gives (ℰ, (1, 𝑥1)) ∼ (ℰ, (2, 𝑥2)), and by the conditionality principle (ℰ, (𝑖, 𝑥𝑖)) ∼ (ℰ𝑖, 𝑥𝑖) for 𝑖 = 1, 2. Combining these equivalences yields (ℰ1, 𝑥1) ∼ (ℰ2, 𝑥2), which is the assertion of the LP.
Chapter 4
Bayesian statistics
This chapter gives an introduction to the Bayesian approach to statistics. We discuss some justifications, such as exchangeability. We discuss prior specification, an essential requirement for Bayesian inference. Subsequently, graphical models and Bayesian models are connected by means of hierarchical models. In the final sections we discuss empirical Bayes methods and Bayesian asymptotics.
4.1 Setup
Bayesian statistics is based on the “axiom” that uncertainties can only be described with probability. As the
Bayesian statistician D.V. Lindley put it:
Whatever way uncertainty is approached, probability is the only sound way to think about it.
This means that, for anything we don’t know, the only logical way to describe our beliefs is probability.
In particular, the belief about the parameter 𝜃 in a statistical model should be described by a probability
distribution. One way to view this distribution is that it reflects our state of knowledge (information) about
the parameter before seeing any data. By contrast, in classical statistics, 𝜃 is considered a fixed unknown
parameter. As within Bayesian statistics the parameter is considered to be a random quantity, we denote it
with a capital letter Θ. Its distribution 𝜇Θ is called the prior distribution. The density of the data 𝑋, is
then in fact to be interpreted as the conditional density of 𝑋 given Θ (with respect to some 𝜎-finite measure
𝜈). Hence, whereas we would write 𝑓𝑋 (𝑥; 𝜃) in classical statistics, we now write 𝑓𝑋∣Θ (𝑥 ∣ 𝜃). Note that the
“∣”-sign really denotes “conditional on” and for that reason within classical statistics we wrote “;” to merely
say “depends on 𝜃 as well”. Now densities need not always exist, as illustrated in some of the examples
that follow. For that reason, we first give a general definition of a Bayesian statistical model and afterwards
specialise to the case where densities exist. The following definition turns out to be important.
Definition 4.1. A Markov kernel with source (𝑋, 𝒜) and target (𝑌, ℬ) is a map 𝜅 ∶ 𝑋 × ℬ → [0, 1] such that
(i) for each 𝑥 ∈ 𝑋, the map 𝐵 ↦ 𝜅(𝑥, 𝐵) is a probability measure on (𝑌, ℬ);
(ii) for each 𝐵 ∈ ℬ, the map 𝑥 ↦ 𝜅(𝑥, 𝐵) is 𝒜-measurable.
Hence, compared to the definition of a statistical experiment there is additionally the prior that is part of the definition. The decomposition in (4.1) says informally that to sample from (Θ, 𝑋) one first samples 𝜃 from 𝜇Θ and next 𝑥 from P𝜃(⋅). In Bayesian statistics we are interested in the “reverse” way of sampling: first sample 𝑥 from its marginal, followed by sampling 𝜃 conditional on 𝑥. By marginalisation, we can define the predictive measure 𝜇𝑋(𝐵) = 𝜇(Θ,𝑋)(Ω × 𝐵). If there exists a Markov kernel Π such that 𝜇(Θ,𝑋)(𝐴 × 𝐵) = ∫𝐵 Π𝑥(𝐴) 𝜇𝑋( d𝑥), then Π is called the posterior distribution of Θ given 𝑋. In a dominated experiment with likelihood 𝐿(𝜃; 𝑥) one can write, with 𝑓𝑋(𝑥) ∶= ∫Ω 𝐿(𝜃; 𝑥) 𝜇Θ( d𝜃),
𝜇(Θ,𝑋)(𝐴 × 𝐵) = ∫𝐵 ∫𝐴 (𝐿(𝜃; 𝑥)∕𝑓𝑋(𝑥)) 𝜇Θ( d𝜃) 𝜇𝑋( d𝑥) (4.4)
(provided that 𝑓𝑋(𝑥) > 0; see Remark 4.2). First, this shows that the measure 𝜇Θ ⊗ 𝜇𝑋 acts in a natural way as a dominating measure:
d𝜇(Θ,𝑋)∕d(𝜇Θ ⊗ 𝜇𝑋)(𝜃, 𝑥) = 𝐿(𝜃; 𝑥)∕𝑓𝑋(𝑥). (4.5)
Secondly, from (4.4) it follows that the posterior is given by
Π𝑥(𝐴) = ∫𝐴 (𝐿(𝜃; 𝑥)∕𝑓𝑋(𝑥)) 𝜇Θ( d𝜃).
If we additionally assume 𝜇Θ ≪ 𝜉 and denote d𝜇Θ∕d𝜉(𝜃) = 𝑓Θ(𝜃), then
Π𝑥(𝐴) = ∫𝐴 (𝐿(𝜃; 𝑥)𝑓Θ(𝜃) ∕ ∫Ω 𝐿(𝜃; 𝑥)𝑓Θ(𝜃) 𝜉( d𝜃)) 𝜉( d𝜃).
Clearly, the posterior depends on the data only via the likelihood: the likelihood principle is satisfied. Considering 𝑥 fixed, we see from this expression that the posterior measure is dominated by the measure 𝜉 and has density
𝑓Θ∣𝑋(𝜃 ∣ 𝑥) ∶= dΠ𝑥∕d𝜉(𝜃) = 𝐿(𝜃; 𝑥)𝑓Θ(𝜃) ∕ ∫Ω 𝐿(𝜃; 𝑥)𝑓Θ(𝜃) 𝜉( d𝜃).
Note that the term in the denominator does not depend on 𝜃. Therefore, one often sees this equation written as
𝑓Θ∣𝑋(𝜃 ∣ 𝑥) ∝ 𝐿(𝜃; 𝑥)𝑓Θ(𝜃),
which reads
posterior density ∝ likelihood × prior density.
This is the formula often seen in introductory books on Bayesian statistics.
Remark 4.2. This is a bit of a technical remark. One may wonder what happens if 𝑓𝑋(𝑥) = 0 in (4.4). Formally, rather than (4.5), we can define
d𝜇(Θ,𝑋)∕d(𝜇Θ ⊗ 𝜇𝑋)(𝜃, 𝑥) = 𝑘(𝜃, 𝑥)
with 𝑘(𝜃, 𝑥) = 𝐿(𝜃; 𝑥)∕𝑓𝑋(𝑥) if 𝑓𝑋(𝑥) ≠ 0 and 𝑘(𝜃, 𝑥) = 𝑐 if 𝑓𝑋(𝑥) = 0, where 𝑐 ∈ ℝ is arbitrary. Let 𝐶0 = {𝑥 ∶ 𝑓𝑋(𝑥) = 0}. This is valid, since 𝜇𝑋(𝐶0) = 0, so that the definition of 𝑘 on Ω × 𝐶0 does not matter.
The following example is very important. It shows that a standard measurement model with the Normal
distribution is fully tractable, if the variance is assumed to be known and a Gaussian prior on the mean is
used.
Suppose
𝑋1, … , 𝑋𝑛 ∣ Θ = 𝜃 ∼iid 𝑁(𝜃, 𝜎²) and Θ ∼ 𝑁(𝜇0, 𝜎0²),
where we assume 𝜎² to be known. The parameters 𝜇0 and 𝜎0² are also assumed to be known and are part of the prior specification. If we set 𝑋 = (𝑋1, … , 𝑋𝑛), then
Θ ∣ 𝑋 ∼ 𝑁(𝜇1, 𝜎1²)
with
1∕𝜎1² = 1∕𝜎0² + 𝑛∕𝜎² (4.6)
and
𝜇1 = ((1∕𝜎0²)∕(1∕𝜎1²)) 𝜇0 + ((𝑛∕𝜎²)∕(1∕𝜎1²)) 𝑋̄𝑛 = 𝑤𝑛𝜇0 + (1 − 𝑤𝑛)𝑋̄𝑛,
where 𝑤𝑛 = 𝜎²∕(𝜎² + 𝑛𝜎0²) is the ratio of the prior precision over the posterior precision. Equation (4.6) reads: posterior precision equals prior precision + data precision (precision being defined as the inverse of the variance). Note that the posterior mean is a convex combination of the prior mean 𝜇0 and the sample average 𝑋̄𝑛 and that 𝑤𝑛 → 0 if 𝑛 → ∞. Note that for large values of 𝑛, 𝜎1² ≈ 𝜎²∕𝑛 and hence 𝜇1 ≈ 𝑋̄𝑛. One says that as the sample size gets large the prior gets “washed away”.
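A direct transcription of these formulas into Julia (a sketch; the data in the final line are simulated purely for illustration):

    using Statistics
    # posterior in the Normal–Normal model with σ² known, transcribing (4.6)
    function posterior(x, sigma2, mu0, sigma02)
        n = length(x)
        prec1 = 1 / sigma02 + n / sigma2   # posterior precision = prior + data precision
        sigma12 = 1 / prec1
        w = (1 / sigma02) / prec1          # wn: ratio of prior to posterior precision
        mu1 = w * mu0 + (1 - w) * mean(x)
        return mu1, sigma12
    end
    posterior(2 .+ randn(50), 1.0, 0.0, 10.0)   # posterior mean close to the sample average
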
Exercise 4.2 Suppose 𝑋1, … , 𝑋𝑛 ∣ Θ = 𝜃 ∼iid 𝑃𝑜𝑖𝑠(𝜃) and Θ ∼ 𝐺𝑎(𝛼, 𝛽). Derive the posterior
distribution and compute the posterior mean.
Let 𝜇Θ denote the prior measure and assume it has density 𝑓Θ. We have
𝜇(Θ,𝑋)(𝐴 × 𝐵) = ∫𝐴 (𝑝𝟏𝐵(𝜃) + (1 − 𝑝) ∫𝐵 𝑓𝑍(𝑥 − 𝜃) d𝑥) 𝑓Θ(𝜃) d𝜃.
Rearranging (using Fubini’s theorem and the definition of 𝑓𝑋) yields
𝜇(Θ,𝑋)(𝐴 × 𝐵) = ∫𝐵 ((𝑝𝑓Θ(𝑥)𝟏𝐴(𝑥) + (1 − 𝑝) ∫𝐴 𝑓𝑍(𝑥 − 𝜃)𝑓Θ(𝜃) d𝜃) ∕ 𝑓𝑋(𝑥)) 𝜇𝑋( d𝑥).
Therefore, the posterior is given by
Π𝑥(𝐴) = (𝑝𝑓Θ(𝑥)𝟏𝐴(𝑥) + (1 − 𝑝) ∫𝐴 𝑓𝑍(𝑥 − 𝜃)𝑓Θ(𝜃) d𝜃) ∕ 𝑓𝑋(𝑥).
In particular, the posterior measure assigns probability 𝑝𝑓Θ(𝑥)∕𝑓𝑋(𝑥) to {𝑥}. Now whereas the prior is absolutely continuous with respect to Lebesgue measure, the posterior is not.
a ticket from an urn with tickets labeled 𝑎𝑖𝜃, where the 𝑖-th ticket is chosen with probability 𝑝𝑖, 1 ≤ 𝑖 ≤ 𝑁 (the preceding simply corresponds to 𝑝𝑖 = 1∕𝑁). We have
P𝜃(𝐵) = ∑ᵢ₌₁ᴺ 𝑝𝑖𝟏𝐵(𝑎𝑖𝜃).
Then, using the substitution 𝑢𝑖 = 𝑎𝑖𝜃 at the third equality sign and writing 𝑤𝑖(𝑥) = 𝑝𝑖𝑚(𝑥∕𝑎𝑖)𝑎𝑖⁻¹, we obtain (take 𝐴 = (0, ∞))
𝜇𝑋(𝐵) = ∫𝐵 𝑓𝑋(𝑥) d𝑥 with 𝑓𝑋(𝑥) = ∑ᵢ₌₁ᴺ 𝑤𝑖(𝑥).
If 𝑥 ↦ 𝑚(𝑥) is smooth, then the posterior mean
Θ̃ ∶= ∫ 𝜃 Π𝑋( d𝜃) = ∑ⱼ₌₁ᴺ 𝑤̄ⱼ(𝑋) 𝑋∕𝑎ⱼ, with 𝑤̄ⱼ(𝑥) = 𝑤ⱼ(𝑥)∕∑ᵢ₌₁ᴺ 𝑤𝑖(𝑥),
will inherit this smooth behaviour. Note the completely different behaviour when compared to the MLE. From this example we clearly see that the MLE and posterior mean can behave radically differently: in the former case we look for a maximiser, whereas in the latter case we average over the parameter space.
Exercise 4.1. Verify that 𝑤𝑖 (𝑥) can be interpreted as the probability of {𝑋 = 𝑎𝑖 𝜃}.
Exercise 4.2. There is an alternative way to write the model, where we set 𝑋 = 𝛼Θ and (𝛼, Θ) gets assigned a prior distribution, the second line of which denotes the categorical distribution that puts mass 𝑝𝑖 on 𝑖. Show using marginalisation that
𝜇𝑋(𝐶) = ∫𝐶 ∑ᵢ₌₁ᴺ 𝑤𝑖(𝑥) d𝑥.
4.1.6 Prediction
The Bayesian paradigm also accommodates inference about future observables in a natural way. If 𝑌 denotes a future observation, then in a dominated Bayesian statistical experiment the posterior predictive density of 𝑌 is given by
𝑓𝑌∣𝑋(𝑦 ∣ 𝑥) = ∫ 𝑓Θ,𝑌∣𝑋(𝜃, 𝑦 ∣ 𝑥) 𝜉( d𝜃) = ∫ (𝑓𝑋,𝑌∣Θ(𝑥, 𝑦 ∣ 𝜃)𝑓Θ(𝜃) ∕ 𝑓𝑋(𝑥)) 𝜉( d𝜃). (4.8)
The formula simplifies in case one assumes that 𝑋 and 𝑌 are independent, conditional on Θ. In that case
𝑓𝑌∣𝑋(𝑦 ∣ 𝑥) = ∫ 𝑓𝑌∣Θ(𝑦 ∣ 𝜃) 𝑓Θ∣𝑋(𝜃 ∣ 𝑥) 𝜉( d𝜃).
Alternatively, one could make 𝑌 part of Θ, obtain the posterior distribution and marginalise out the components of Θ that do not involve 𝑌.
A typical setting here would involve a sequence of conditionally independent random variables {𝑋𝑖}ᵢ₌₁ⁿ⁺ᵐ given Θ, where 𝑋 = (𝑋1, … , 𝑋𝑛) and 𝑌 = (𝑋𝑛+1, … , 𝑋𝑛+𝑚). Another important setting is that of state-space models, which we will discuss later on. Note that the preceding display shows that in prediction we average over the posterior. So rather than fixing one value of the parameter (as is usually done in prediction for frequentist statistics), uncertainty on the parameter is properly taken into account.
Example 4.5 (Laplace’s Law of Succession). Laplace considered the problem of determining the probability
that the sun will rise tomorrow on the assumption that it has risen 𝑛 times in succession.
He made the following assumptions:
1. The probability of the sun rising on any day is constant and unknown.
2. This unknown probability is a random variable Θ which is uniformly distributed on [0, 1]. This reflects
the total ignorance of the probability of the sun rising (or not).
3. Successive sunrises are independent events (actually, independent conditional on the parameter Θ).
If 𝑋 = (𝑋1, … , 𝑋𝑛), then
𝑓Θ∣𝑋(𝜃 ∣ 𝑥) ∝ 𝜃^𝑆 (1 − 𝜃)^(𝑛−𝑆) 𝟏[0,1](𝜃),
where 𝑆 = ∑ᵢ₌₁ⁿ 𝑋𝑖. Hence the posterior satisfies Θ ∣ 𝑋 ∼ 𝐵𝑒(𝑆 + 1, 𝑛 − 𝑆 + 1). Using equation (4.8) we get
𝑓𝑋𝑛+1∣𝑋(1 ∣ 𝑥) = ∫₀¹ 𝜃 (𝑛 + 1)𝜃ⁿ d𝜃 = 𝔼[Θ ∣ 𝑋] = (𝑛 + 1)∕(𝑛 + 2),
where 𝑥 ∈ ℝⁿ is the vector with all elements equal to 1 (so that 𝑆 = 𝑛). Laplace’s intuition for this result was that we essentially have 2 extra observations, as we assume that both “rise” and “not rise” can happen. This explains the factor 𝑛 + 2 in the denominator.
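As a one-line check of Laplace’s rule in Julia (a sketch of the closed-form answer only):

    succession(n) = (n + 1) / (n + 2)   # posterior predictive probability of one more sunrise
    succession.((1, 10, 100, 10_000))   # increases towards 1 as the observed run grows
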
A more general point of view is to consider 𝜃 and 𝑥 containing all unobserved and observed variables
respectively. Then the Bayesian point of view can be summarised as deriving the distribution of the unob-
served variables conditional on the observed variables. In prediction, that amounts to adding future values 𝑦 to 𝜃. Then we would have a Bayesian experiment with unobserved variable 𝜃′ ∶= (𝜃, 𝑦) and observed
variable 𝑥. Then the predictive distribution is simply the 𝑦-marginal of the distribution of 𝜃 ′ ∣ 𝑥. From this
one sees that it is not always entirely clear what should be called likelihood or prior (is the distribution on 𝑦
like a prior or likelihood?). The main point is that the distinction between likelihood and prior is irrelevant:
the important distinction is between observed and non-observed variables. We further illustrate this in the
upcoming section.
Important note on Bayesian notation. Often, “Bayesian notation” is used. This amounts to writing 𝑝(𝑦 ∣ 𝑥) instead of 𝑓𝑌∣𝑋(𝑦 ∣ 𝑥). Be aware that this can be a bit tricky: as an example, the expression 𝑝(𝑦² ∣ 𝑥) is meant to denote 𝑓𝑌²∣𝑋(𝑦² ∣ 𝑥) and NOT 𝑓𝑌∣𝑋(𝑦² ∣ 𝑥). An advantage of this notation is that it enhances readability of formulas.
Here, we do not fully specify the densities 𝑓 and 𝑔. The third line though imposes the restriction that 𝜇, 𝛽 and 𝜎 are a priori independent. This is a Bayesian analogue of the regression model. In frequentist statistics, one
usually directly writes down the conditional distribution of 𝑌𝑖 ∣ 𝑋𝑖 , assuming the 𝑋𝑖 are fixed. Why? First
of all, it is convenient, as it may be difficult to find a distribution for the 𝑋𝑖 , especially when this is a high-
dimensional vector. Now let’s see what the Bayesian approach gives us. Note that the following variables are
involved (I switch to Bayesian notation, writing all variables in lower-case): 𝑦1 , … , 𝑦𝑛 , 𝑥1 , … , 𝑥𝑛 , 𝜇, 𝛽, 𝜎 2 .
The variables that are observed are 𝑦1 , … , 𝑦𝑛 and 𝑥1 , … , 𝑥𝑛 . Hence, the posterior distribution satisfies
𝑝(𝜇, 𝛽, 𝜎² ∣ 𝑦1, … , 𝑦𝑛, 𝑥1, … , 𝑥𝑛) ∝ 𝑝(𝜇, 𝛽, 𝜎²) ∏ᵢ₌₁ⁿ 𝑝(𝑦𝑖 ∣ 𝑥𝑖, 𝜇, 𝜎²)𝑝(𝑥𝑖 ∣ 𝛽)
= (𝑝(𝛽) ∏ᵢ₌₁ⁿ 𝑝(𝑥𝑖 ∣ 𝛽)) (𝑝(𝜇)𝑝(𝜎²) ∏ᵢ₌₁ⁿ 𝑝(𝑦𝑖 ∣ 𝑥𝑖, 𝜇, 𝜎²)).
Hence, because the model prescribes that 𝛽 and (𝜇, 𝜎 2 ) are statistically independent (which is an assumption
made by the statistician), the posterior distribution of (𝛽, 𝜎 2 ) can be obtained by only specifying the condi-
tional distribution of 𝑦𝑖 ∣ 𝑥𝑖 . This sheds light on the underlying assumption that explains why we can think
of 𝑥1 , … , 𝑥𝑛 as being fixed.
It gets more interesting when there are “missing data”, a somewhat strange name for referring to the
case where for some couple (𝑥𝑗 , 𝑦𝑗 ), either 𝑥𝑗 or 𝑦𝑗 is not observed. If 𝑦𝑗 is not observed, then the posterior
density is 𝑝(𝜇, 𝛽, 𝜎 2 , 𝑦𝑗 ∣ 𝑥1 , … 𝑥𝑛 , {𝑦𝑖 , 𝑖 ≠ 𝑗}), which is proportional to the expression in the preceding
display. So in this case, a posteriori, (𝑦𝑗 , 𝜇, 𝜎 2 ) and 𝛽 are independent. Now would you call 𝑦𝑗 a parameter,
observation, missing observation,...? The point is, it is not necessary to think about this. 1 We simply write
the hierarchical model as above (we come back to hierarchical modelling later in this chapter) and condition
on observed variables.
Exercise 4.3 What if some 𝑥𝑗 is not observed, while 𝑦𝑗 is observed? Do we still have posterior independence?
Write 𝑥ⁿ = (𝑥1, … , 𝑥𝑛), so that 𝑥ⁿ⁺¹ = (𝑥ⁿ, 𝑥𝑛+1). Then
𝑝(𝜃 ∣ 𝑥ⁿ⁺¹) = 𝑝(𝜃 ∣ 𝑥ⁿ, 𝑥𝑛+1) = 𝑝(𝜃, 𝑥𝑛+1 ∣ 𝑥ⁿ)∕𝑝(𝑥𝑛+1 ∣ 𝑥ⁿ) = 𝑝(𝑥𝑛+1 ∣ 𝜃, 𝑥ⁿ)𝑝(𝜃 ∣ 𝑥ⁿ)∕𝑝(𝑥𝑛+1 ∣ 𝑥ⁿ) = 𝑝(𝑥𝑛+1 ∣ 𝜃)𝑝(𝜃 ∣ 𝑥ⁿ)∕𝑝(𝑥𝑛+1 ∣ 𝑥ⁿ),
the final equality holding when the observations are conditionally independent given 𝜃. Hence
𝑝(𝜃 ∣ 𝑥ⁿ⁺¹) ∝ 𝑝(𝑥𝑛+1 ∣ 𝜃)𝑝(𝜃 ∣ 𝑥ⁿ),
which reveals that the posterior based on 𝑛 observations serves as a prior for the next observation. Intuitively this makes much sense, and it also explains why Bayesian inference easily incorporates sequential processing of data. Moreover, even if one observes 𝑥ⁿ at once, it may be computationally advantageous to compute the posterior sequentially using Bayesian updating, as the sketch below illustrates.
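A minimal Julia sketch of sequential updating in the Beta-Bernoulli model (the data vector is an illustrative assumption); the sequentially updated posterior coincides with the batch posterior:

    using Distributions
    # the posterior after x1,…,xn serves as the prior for x_{n+1}
    update(prior::Beta, x) = Beta(prior.α + x, prior.β + 1 - x)
    data = [1, 0, 1, 1, 0, 1]
    post = foldl(update, data; init = Beta(1, 1))               # one update per observation
    batch = Beta(1 + sum(data), 1 + length(data) - sum(data))   # identical to post
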
Definition 4.7. Assume 𝜃 is one-dimensional. The posterior 𝛾-quantile is defined as the 𝛾-quantile of the
posterior distribution.
1 In my opinion, thinking about variables and dividing these into observed and nonobserved variables is a very useful thing. I was much influenced by the talk https://www.youtube.com/watch?v=yakg94HyWdE&t=6s by Richard McElreath, who works in anthropology and is the author of a much acclaimed book on Bayesian statistics.
Both the posterior mean and posterior median are point estimators. The Bayesian equivalent of a confidence set is a credible set, which is a set with prescribed posterior probability.
Definition 4.8. The set 𝐴 is a level 𝛾-credible set for 𝜃 if Π𝑋 (𝐴) ≥ 𝛾.
In contrast to confidence sets, the interpretation of credible sets is straightforward. Nevertheless, just like confidence sets, credible sets are not unique. It is not obvious how to choose a credible set when, for example, the posterior density is multimodal.
The previous three summary measures of the posterior distribution appear a bit ad hoc. A formal way to derive these uses a loss function (details are in Chapter 6).
4.1.10 An example
Example 4.9. This example is taken from Berger [2006]. Suppose we are dealing with a medical problem where within a population the probability that someone has a particular disease is given by θ₀. Hence, if D = {patient has the disease} then θ₀ = ℙ(D). A diagnostic test results in either a positive (P) or a negative (N) reading. Let
θ₁ = ℙ(P ∣ D),  θ₂ = ℙ(P ∣ Dᶜ).
By Bayes' theorem:
ψ = ℙ(D ∣ P) = θ₀θ₁ / ( θ₀θ₁ + (1 − θ₀)θ₂ ) = g(θ₀, θ₁, θ₂) (say). (4.9)
The statistical problem is as follows: based on data Xᵢ ∼ Bin(nᵢ, θᵢ), i = 0, 1, 2 (arising from medical studies), find a 100(1 − α)% confidence or credible set for ψ. At first sight this may seem like a very simple problem, but it is not straightforward how to construct a confidence set from a classical perspective.
Within the Bayesian framework, all unknown quantities are assigned a probability distribution. We choose
Xᵢ ∣ Θᵢ = θᵢ ∼ind Bin(nᵢ, θᵢ)
Θᵢ ∼ind Be(a, b).
It is easy to verify that Θᵢ ∣ Xᵢ ∼ Be(Xᵢ + a, nᵢ − Xᵢ + b), i = 0, 1, 2 (independently). To construct the desired credible set, we use Monte Carlo simulation:
1. For i ∈ {0, 1, 2}, draw Θᵢ ∣ Xᵢ = xᵢ ∼ Be(xᵢ + a, nᵢ − xᵢ + b) to obtain realisations θ₀, θ₁, θ₂.
2. Set 𝜓 = 𝑔(𝜃0 , 𝜃1 , 𝜃2 ).
3. Repeat steps (1) and (2) a large number of times (say 𝐵 times) to obtain from step (2) the numbers
𝜓 (1) , … , 𝜓 (𝐵) .
Finally, use the 𝛼∕2-th upper and lower quantiles of 𝜓 (1) , … , 𝜓 (𝐵) to form the desired credible set.
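In Julia, this scheme takes only a few lines (a sketch using the Distributions package; the data below are those of the simulation discussed next):

using Statistics, Distributions

n = [17_000, 10, 100];  x = [163, 8, 4];  a, b = 1, 1   # data and Be(1, 1) prior
g(θ0, θ1, θ2) = θ0 * θ1 / (θ0 * θ1 + (1 - θ0) * θ2)     # equation (4.9)

B = 10_000
ψ = Vector{Float64}(undef, B)
for m in 1:B
    θ = [rand(Beta(x[i] + a, n[i] - x[i] + b)) for i in 1:3]  # step 1
    ψ[m] = g(θ...)                                            # step 2
end
quantile(ψ, 0.025), quantile(ψ, 0.975)   # endpoints of the 95% credible interval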
The use of Monte Carlo simulation in this example features more generally in Bayesian statistics. Only in very specific situations can the posterior density be computed in closed form.
In Figure 4.1 we present histograms of posterior samples when the data are simulated with
(n₀, θ₀) = (17000, 0.01),  (n₁, θ₁) = (10, 0.9),  (n₂, θ₂) = (100, 0.05),
leading to x₀ = 163, x₁ = 8 and x₂ = 4. The prior parameters were taken a = b = 1. The value of ψ corresponding to the θ-values we used for generating the data equals 0.154, which is well within the top-left histogram.
Figure 4.1: Numerical illustration to Example 4.9. Here, for generating the data we took (n₀, θ₀) = (17000, 0.01), (n₁, θ₁) = (10, 0.9) and (n₂, θ₂) = (100, 0.05), leading to data x₀ = 163, x₁ = 8 and x₂ = 4. The results are based on 10 000 Monte Carlo draws from the posterior. Note that the data-generating value of ψ equals 0.154.
Note that the histogram for θ₀ is very peaked around 0.01, which should not come as a surprise, as n₀ = 17000 is fairly large.
Exercise 4.4 [YS exercise 3.2.] Suppose that Y, the number of heads in n tosses of a coin, is binomially distributed with index n and parameter θ, and that the prior distribution on Θ is Be(α, β).
2. Suppose that, prior to any tossing, the coin seems to be fair, so that we would take α = β. Suppose also that tossing yields 1 tail and n − 1 heads. How large should n be in order that we would just give odds of 2 to 1 in favour of a head occurring at the next toss? Show that for α = β = 1 we obtain n = 4.
Exercise 4.5 Suppose X = (X₁, …, Xₙ) is a random sample from the Unif(0, θ) distribution.
1. The Pareto family of distributions, with parameters a and b, prescribes density f_Θ(θ) = a bᵃ / θ^{1+a} 𝟏_{(b,∞)}(θ) (a, b > 0). Derive the posterior.
2. Suppose b is small; verify that the maximum likelihood estimator and the posterior mode are approximately equal. Note that the statistical model for computing the MLE is different, namely X₁, …, Xₙ ∼iid Unif(0, θ).
4. Verify that if b is small, then asymptotically (n large) the predictive density is approximately uniform on [0, x₍ₙ₎] (which would be the classical approach, where an estimator, in this case the MLE, is plugged in).
Note that the Julia script pareto_example.jl gives a Monte Carlo scheme to approximate the posterior density.
Exercise 4.6 [YS exercise 3.12.] Assume θ ∈ ℝ and that the posterior density θ ↦ f_{Θ∣X}(θ ∣ x) is unimodal. Show that if we choose θ₁ < θ₂ to minimise θ₂ − θ₁ subject to
∫_{θ₁}^{θ₂} f_{Θ∣X}(θ ∣ x) dθ = 1 − α
for given α ∈ (0, 1), then we have f_{Θ∣X}(θ₁ ∣ x) = f_{Θ∣X}(θ₂ ∣ x).
Hint: in order to solve this minimisation problem, apply the method of Lagrange multipliers.
4.2 An application
4.2.1 Bayesian updating for linear regression
In the following we use “Bayesian notation” throughout. Furthermore, we use the multivariate normal dis-
tribution, see Section 1.3 for its definition.
Suppose we have observations y₁, …, yₙ satisfying a linear regression model
yᵢ = θ₁ + θ₂tᵢ + εᵢ,  εᵢ ∼ind N(0, σ²).
The times t₁ < t₂ < ⋯ are the observation times. We assume for simplicity that σ² is known. If we define
Hᵢ = [1 tᵢ]  and  θ = [θ₁, θ₂]′,
then yᵢ = Hᵢθ + εᵢ.
Define y = [y₁ ⋯ yₙ]′. The likelihood is given by
L(θ; y) = p(y ∣ θ) = ∏_{i=1}^n (2πσ²)^{−1/2} exp( −(yᵢ − Hᵢθ)² / (2σ²) ).
That is,
p(y ∣ θ) = (2πσ²)^{−n/2} exp( −½ (y − Hθ)′ (σ²Iₙ)⁻¹ (y − Hθ) ),
where H is the n × 2 matrix with rows H₁, …, Hₙ, so its i-th row equals [1 tᵢ].
Clearly, y ∣ θ ∼ Nₙ(Hθ, σ²Iₙ). We take θ ∼ N₂(m₀, P₀) a priori.

Exercise 4.7 Show that θ ∣ y ∼ N₂(ν, C), where
C⁻¹ = H′σ⁻²H + P₀⁻¹
and
ν = C (H′σ⁻²y + P₀⁻¹m₀).
That is, both the prior and posterior distribution are normal. Put differently, the chosen prior is conjugate for the given statistical model.
Hint: you may use the results in Example 4.4 here.
Bayesian updating refers to the following observation: if we let y_{1:k} = [y₁ ⋯ yₖ]′, then
p(θ ∣ y_{1:k}) ∝ p(y_{1:k} ∣ θ) p(θ)
= p(yₖ ∣ θ) p(y_{1:k−1} ∣ θ) p(θ)
∝ p(yₖ ∣ θ) p(θ ∣ y_{1:k−1}).
The equality on the second line follows from y_{1:k−1} and yₖ being independent, conditional on θ. Therefore, if we wish to find the posterior after k observations, we can obtain it by considering only the k-th observation coming in, with prior distribution for θ equal to the posterior of θ based on the first k − 1 observations.
Exercise 4.8 Suppose that θ ∣ y_{1:k} ∼ N(mₖ, Pₖ). Show that
Pₖ⁻¹ = Hₖ′σ⁻²Hₖ + Pₖ₋₁⁻¹
and
mₖ = Pₖ (Hₖ′σ⁻²yₖ + Pₖ₋₁⁻¹mₖ₋₁).
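As an illustration, here is a Julia sketch of this one-observation-at-a-time update (the design matrix, noise level and prior below are made-up values; processing all observations sequentially yields the same posterior as the batch formulas with C and ν above):

using LinearAlgebra

# Sequential updating for yₖ = Hₖθ + εₖ, εₖ ~ N(0, σ²), prior θ ~ N(m0, P0).
function sequential_update(H, y, σ², m0, P0)
    m, P = m0, P0
    for k in eachindex(y)
        h = H[k, :]                               # Hₖ′ as a column vector
        Pk = inv(h * h' / σ² + inv(P))            # Pₖ⁻¹ = Hₖ′σ⁻²Hₖ + Pₖ₋₁⁻¹
        m = Pk * (h * y[k] / σ² + inv(P) * m)     # mₖ
        P = Pk
    end
    return m, P
end

t = 1.0:10.0
H = [ones(length(t)) t]                            # rows Hᵢ = [1 tᵢ]
y = H * [1.0, 2.0] + 0.5 * randn(length(t))        # simulated data, σ = 0.5
m, P = sequential_update(H, y, 0.25, zeros(2), Matrix(10.0I, 2, 2))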
Exercise 4.10 Verify the formulas in (4.10) for the prediction step of the Kalman filter. Hints:
1. Note that the distribution of 𝜃𝑘 ∣ 𝑦1∶𝑘−1 can be obtained as the marginal distribution of
(𝜃𝑘 , 𝜃𝑘−1 ) ∣ 𝑦1∶𝑘−1 .
2. Explain why
𝑝(𝜃𝑘 , 𝜃𝑘−1 ∣ 𝑦1∶𝑘−1 ) = 𝑝(𝜃𝑘 ∣ 𝜃𝑘−1 )𝑝(𝜃𝑘−1 ∣ 𝑦1∶𝑘−1 ).
3. Apply lemma 1.14 to deduce that the joint distribution of (𝜃𝑘 , 𝜃𝑘−1 ) ∣ 𝑦1∶𝑘−1 is multivariate
normal with the given parameters.
The Kalman filter is implemented in all main engineering packages such as Matlab, Python, R and Ju-
lia. Normality and linearity in its description lead to the closed-form expressions of the filter. In case of
nonlinearity there are many related algorithms such as the extended and unscented Kalman filter.
Remark 4.10. Many familiar time-series models such as autoregressive and moving average processes can
be written as a state-space model.
4.3 Justifications for Bayesian inference
4.3.1 Exchangeability
Suppose that the random variables 𝑋1 , … , 𝑋𝑛 represent the results of successive tosses of a coin, with values
1 and 0 corresponding to the results “Heads” and “Tails” respectively. Analysing the meaning of the usual
frequentist model under which the {𝑋𝑖 }𝑛𝑖=1 are independent and identically distributed with 𝜃 = ℙ(𝑋𝑖 = 1)
fixed, the condition of independence would imply, for example, that
ℙ(𝑋𝑛 = 𝑥𝑛 ∣ 𝑋1 = 𝑥1 , … , 𝑋𝑛−1 = 𝑥𝑛−1 ) = ℙ(𝑋𝑛 = 𝑥𝑛 )
and, therefore, the results of the first 𝑛 − 1 tosses would not change my uncertainty about the result of
the 𝑛-th toss. The classical statistician would naturally react that this is true, but we do learn about the un-
known parameter 𝜃. Yet, the independence assumption seems to be unnatural (author’s opinion!) and indeed,
Bayesians often motivate their models by the weaker notion of exchangeability. Regarding the example just
given, Schervish [1995] (page 7) remarks that
It seems unfortunate that so much machinery as assumptions of mutual independence and the
existence of a mysterious fixed but unknown 𝜃 must be introduced to describe what seems, on
the surface, to be a relatively simple situation.
Definition 4.11. A finite set 𝑋1 , … , 𝑋𝑛 of random quantities is said to be exchangeable if every permu-
tation of (𝑋1 , … , 𝑋𝑛 ) has the same joint distribution as every other permutation. An infinite collection is
exchangeable if every finite subcollection is exchangeable.
The motivation for the definition of exchangeability is to express symmetry of beliefs about the random quantities in the weakest possible way. The definition merely says that the labeling of the random quantities is immaterial. If X₁, …, Xₙ are IID (Independent and Identically Distributed), then X₁, …, Xₙ are exchangeable: the joint density ∏_{i=1}^n f(xᵢ) is invariant under permutations of its arguments. The converse is not true, and hence assuming exchangeability is a weaker requirement than IID.
² I believe in this sense Jaynes was way ahead in his thinking.
4.4 Choosing the prior
A common objection to Bayesian statistics is that the choice of prior is subjective. A number of counterarguments can be given.
1. Frequentist procedures are not objective either: for example, the chosen significance level is often arbitrary. Moreover, the choice of statistical model is usually not objective either, and much more influential (and this influence remains even in the large-sample limit). Often, for the same problem, multiple tests exist, none of which can be classified as optimal. For a pre-specified significance level, different tests may give rise to conflicting conclusions. Choosing among the tests appears subjective.
2. The influence of the prior distribution usually vanishes as the sample size increases (see section 4.7).
3. In certain cases one can choose a prior that has least influence on the resulting posterior. Deriving
such a prior, relative to the statistical model, is key to what is called objective Bayesian statistics.
4. Under certain assumptions, any “admissible” statistical procedure is essentially Bayesian (we make
this precise in chapter 6).
5. “Optimal” frequentist procedures are Bayesian procedures with respect to a particular prior.
We go into more detail on the final two points of this list in Chapter 6. It is fair to say that the choice of prior can be very influential in small samples, giving the statistician the possibility to take a prior that biases the posterior in a favourable direction. Think of a pharmaceutical company choosing the prior on the efficacy of a medicament itself! The point is that there do exist reasonable priors on which consensus can be reached. However, it is very hard to define in general what “reasonable” means here. Hence, in Bayesian statistics, at some stage the statistician has to trust his/her prior and investigate in which sense this choice affects the posterior. If its influence is strong, a common approach is to make the prior more robust by spreading out its mass. Note that in frequentist statistics there is no prior, but tests/confidence intervals are often based on asymptotic derivations; then, rather than trusting a prior, the statistician has to trust that the sample size is sufficiently large to justify asymptotic approximations.
Conjugate priors constitute a convenient class of priors.
Definition 4.16. If f_Θ ∈ 𝒟 implies f_{Θ∣X} ∈ 𝒟 for some class of densities 𝒟, then we call the class 𝒟 conjugate for the statistical model.
Note that conjugacy is a property of the model together with the family of priors considered. Conjugate
priors are very convenient and often lead to a relatively simple formula for the posterior density. We give a
number of examples.
Exercise 4.13 [YS exercise 5.3.] Find the general form of a conjugate prior density for 𝜃 in a
Bayesian analysis of the one-parameter exponential family density
Another way of obtaining a prior is based on subjective grounds of the experimenter, or on historical data. In the former case one speaks of subjective Bayesian procedures. In contrast, the objective Bayes school tries to choose the prior that affects the resulting posterior in a minimal way. A first thing that comes to mind for an objective choice of prior is to take the uniform prior over the parameter set (assuming it is bounded, for simplicity). This won't work, as can be seen from the following example.
Example 4.20. Suppose 𝑋 ∣ Θ = 𝜃 ∼ 𝐵𝑒𝑟 (𝜃) and take the uniform prior over the parameter set: Θ ∼
𝑈 𝑛𝑖𝑓 (0, 1). The prior is supposed to express ignorance about 𝜃. However, if we are ignorant about 𝜃, we
are ignorant about 𝜃 2 as well (for example). If we define Ψ = Θ2 , then
f_Ψ(ψ) = f_Θ(√ψ) · 1/(2√ψ) = 1/(2√ψ) 𝟏_{[0,1]}(ψ).
This implies that small realisations of Ψ are more probable than large realisations and we are no longer
ignorant about Ψ.
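The computation is easily checked by simulation; a three-line Julia sanity check (illustrative only):

using Statistics

θ = rand(10^6)            # Θ ~ Unif(0, 1): "ignorance" about θ
ψ = θ .^ 2                # Ψ = Θ²
mean(ψ .< 1/4)            # ≈ 0.5, not 1/4: small values of Ψ are favoured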
We can set this up more formally, following an example presented in Kleijn [2020]. Suppose we have a statistical model 𝒫 that is parametrised in two different ways. Say we have
φ₁ : (0, 1) → 𝒫 with φ₁(τ) = N(0, τ)
and
φ₂ : (0, 1) → 𝒫 with φ₂(σ) = N(0, σ²).
From a statistical modelling perspective, these two parametrisations are equivalent (in both cases we have a normal distribution whose variance is assumed to lie in (0, 1)). More generally, assume
φ₁ : Ω₁ → 𝒫 and φ₂ : Ω₂ → 𝒫.
For the following derivation, we assume each Ωᵢ is a measurable space with σ-algebra 𝒢ᵢ, and that 𝒫 is a measurable space with σ-algebra ℬ.³ Assume in addition that φ₁, φ₁⁻¹, φ₂, φ₂⁻¹ are measurable. Assuming Ω₁ to be bounded, we can define
Π₁(A) = μ(A)/μ(Ω₁),  A ∈ 𝒢₁,
μ denoting Lebesgue measure. Hence Π₁ is the rescaled Lebesgue measure on Ω₁, representing the uniform distribution on Ω₁. This measure induces a measure on 𝒫 by
Π₁′(B) = (Π₁ ∘ φ₁⁻¹)(B),  B ∈ ℬ.
But then, using φ₂ we can pull back to Ω₂ and obtain the measure⁴
Π₂(C) = (Π₁′ ∘ φ₂)(C) = ( Π₁ ∘ (φ₁⁻¹ ∘ φ₂) )(C),  C ∈ 𝒢₂.
Hence, starting with a uniform prior on Ω1 , we obtain a prior on Ω2 that in general will not be uniform. Let’s
check this for our example:
τ(σ) := (φ₁⁻¹ ∘ φ₂)(σ) = φ₁⁻¹( N(0, σ²) ) = σ².
³ Formally, we should state precisely the topology and the σ-algebras involved; this is somewhat out of scope for this course.
⁴ If you are familiar with differential geometry, you may recognise a resemblance with moving along overlapping charts.
Hence, as Π₂(C) = (Π₁ ∘ τ)(C), we get that the induced density on Ω₂ is given by
π₂(σ) = π₁(τ(σ)) |dτ/dσ| = 1 · 2σ = 2σ. (4.11)
So, taking parametrisation φ₂, the prior is non-uniform and favours values close to 1 (if for example Ω₁ = (0, 1)).
A natural question is whether it is possible to choose a prior that is invariant under reparametrisation. What is meant by this? It means that we have a way of constructing the prior such that constructing it in one parametrisation and transforming to another parametrisation gives the same result as constructing it directly in that other parametrisation. In certain cases this is possible; the construction dates back to 1946 and is due to Jeffreys. Often, it leads to “improper priors”, a concept that we will now discuss.
A prior f_Θ is called improper if
∫ f_Θ(θ) dθ = ∞.
Improper priors do not follow the rules of probability theory and should be interpreted as expressing degrees of belief. They can only be used as long as the posterior is proper. The idea is that, since all Bayesian inference depends only on the posterior, the prior actually being a density is not really important. As you can guess, not all statisticians agree on using improper priors! Moreover, it is really not guaranteed that the posterior is proper. As an example, suppose X ∣ Θ = θ ∼ Bin(n, θ) and we use the improper prior
f_Θ(θ) ∝ 1 / ( θ(1 − θ) )
(this prior is known as Haldane's prior). It follows that
f_{Θ∣X}(θ ∣ x) ∝ ( 1 / (θ(1 − θ)) ) θˣ(1 − θ)^{n−x} = θ^{x−1}(1 − θ)^{n−x−1}.
Now suppose x = 0 is observed. Then, because θ ↦ 1/θ is not integrable near zero, the map θ ↦ f_{Θ∣X}(θ ∣ 0) fails to be integrable. Hence there is no well-defined posterior in this case.
The following example shows why improper priors can be a natural choice sometimes.
Example 4.22. Let g be a density (with respect to Lebesgue measure) on ℝ and assume the location model
f_{X∣Θ}(x ∣ θ) = g(x − θ),
where θ ∈ ℝ is the location parameter. If no prior information is available, it makes sense to assume the prior weight of an interval [a, b] is proportional to b − a. Therefore, the prior is proportional to Lebesgue measure on ℝ. We write f_Θ(θ) ∝ k (k ∈ ℝ), or f_Θ(θ) ∝ 1.
Exercise 4.14 Derive the posterior in case 𝑔 is the density of the standard normal distribution.
2. their ability to recover usual estimators like maximum likelihood within the Bayesian paradigm.
Often, improper priors arise as limits of a sequence of proper prior distributions. They are generally more
acceptable to non-Bayesians, as they can lead to estimators with frequentist validation, such as minimaxity
(related to improper least favourable priors, Cf. chapter 6).
Jeffreys' prior is defined by f_Θ(θ) ∝ √I(θ), with I(θ) the Fisher information.
1. Historically, it was the first approach to deal with the problem of Example 4.20, by constructing a prior that is “invariant under reparametrisation”.
Let us check this for the example with φ₁ : (0, 1) → 𝒫, φ₁(τ) = N(0, τ), and φ₂ : (0, 1) → 𝒫, φ₂(σ) = N(0, σ²). It is easily verified that Jeffreys' prior indeed transforms correctly. First, it is elementary to verify that
I₁(τ) = 1/(2τ²)  and  I₂(σ) = 2/σ².
Now, starting from the parametrisation in terms of τ, Jeffreys' prior is given by
π₁(τ) = √I₁(τ) = 1/(τ√2).
Transforming π₁ to the σ-parametrisation via τ(σ) = σ² gives π₁(τ(σ)) |dτ/dσ| = (1/(σ²√2)) · 2σ = √2/σ. This is exactly equal to the prior we get when we start off from I₂(σ), which gives
π₂(σ) = √I₂(σ) = √2/σ.
In many complicated models, it is difficult to compute the Fisher information and obtain an expression for Jeffreys' prior. Moreover, in case the Fisher information does not exist, a clear problem arises. So this prior is definitely not “the solution” in general to the problem of selecting a prior in an objective way; even calling it objective is a bit vague. But it does resolve the reparametrisation issue for prior selection in some statistical models.
where the second equality follows from the substitution θ = g⁻¹(ψ). This implies that for measurable subsets A ⊂ ℝ we have
( ∫ f_{X∣Ψ}(x ∣ ψ) 𝟏_{g(A)}(ψ) f_Ψ(ψ) dψ ) / ( ∫ f_{X∣Ψ}(x ∣ ψ) f_Ψ(ψ) dψ ) = ( ∫ f_{X∣Ψ}(x ∣ g(θ)) 𝟏_{g(A)}(g(θ)) f_Θ(θ) dθ ) / ( ∫ f_{X∣Ψ}(x ∣ g(θ)) f_Θ(θ) dθ ).
Example 4.25 (Jeffreys' prior for location families). Suppose Y has density f_Y. Let θ ∈ ℝ. If X = Y + θ, then f_X(x ∣ θ) = f_Y(x − θ). The family of densities {f_X(⋅ ∣ θ), θ ∈ ℝ} is called a location family of densities. By Example 2.13,
I(θ) = ∫ f_Y′(u)² / f_Y(u) du.
Since this does not depend on θ, it follows that Jeffreys' prior is given by f_Θ(θ) ∝ 1, the (improper) uniform prior on ℝ.
Example 4.26. Suppose X₁, …, Xₙ ∣ Θ = θ ∼ind N(θ, 1). Let X = (X₁, …, Xₙ). This is a location family and using Jeffreys' prior we get
f_{Θ∣X}(θ ∣ X) ∝ exp( −½ ∑_{i=1}^n (Xᵢ − θ)² ).
If X = (X₁, …, Xₙ), then the posterior satisfies Θ ∣ X ∼ N(X̄ₙ, 1/n). The posterior mean equals E[Θ ∣ X] = X̄ₙ and a 95% credible set is given by
[X̄ₙ − 1.96/√n, X̄ₙ + 1.96/√n].
The classical 95% confidence interval for θ is the same, but the interpretation of the two intervals is completely different!
Example 4.27 (Jeffreys' prior for scale families). Suppose Y has density f_Y. Let θ > 0. If X = θY, then f_X(x ∣ θ) = (1/θ) f_Y(x/θ). Cf. Example 1.17. By Example 2.13,
I(θ) = (1/θ²) ∫₀^∞ ( 1 + u f_Y′(u)/f_Y(u) )² f_Y(u) du.
As the integral does not depend on θ, Jeffreys' prior is given by f_Θ(θ) ∝ 1/θ, the (improper) scale-invariant prior.
This example implies that Jeffreys' prior does not satisfy the likelihood principle (cf. Chapter 3). For this reason, Jeffreys' priors are not universally accepted by Bayesians. A further complication is that it is not always easy to calculate the Jeffreys' prior for a given statistical model. As a final note, especially in multivariate cases, Jeffreys' prior can perform unsatisfactorily. The interested reader can consult for example Berger et al. [2015]. The suggested fix is to use reference priors. We do not go into details here.
Exercise 4.15 * [YS exercise 3.9.] Let X₁, …, Xₙ ∣ μ, σ ∼iid N(μ, σ²). Let X̄ = n⁻¹ ∑_{i=1}^n Xᵢ and S² = (n − 1)⁻¹ ∑_{i=1}^n (Xᵢ − X̄)². Assume
Exercise 4.16 Suppose X ∼ Ber(θ). Use Jeffreys' prior to compute the posterior probability that θ ∈ (0, 1/2). Next assume X ∼ Ber(√η). Use Jeffreys' prior to compute the posterior probability that η ∈ (0, 1/4). Compare the results and explain.
Exercise 4.17 Show that if we parametrise the Poisson distribution by √θ instead of θ, the Jeffreys prior is uniform. Construct a general transformation θ ↦ g(θ) such that the Jeffreys prior is uniform (possibly improper). You may assume that θ is one-dimensional.
Hint: use Lemma 2.17 (on Fisher information under reparametrisation).
Exercise 4.18 Suppose we have data {Xᵢ, i = 1, …, n} and assume the following hierarchical model:
Xᵢ ∣ Θᵢ = θᵢ ∼ind N(θᵢ, 1)
Θ₁, …, Θₙ ∣ T = τ ∼iid N(0, τ²)
f_T(τ) ∝ 1/τ.
The prior on τ is improper and motivated as a Jeffreys' prior (note that N(0, τ²) belongs to a scale family). Investigate whether the posterior for τ is a proper density.
4.5 Hierarchical Bayesian models
Hierarchical modeling is a fundamental concept in Bayesian statistics. The basic idea is that parameters are endowed with distributions which may themselves introduce new parameters, and this construction recurses. A common motif in hierarchical modeling is that of the conditionally
independent hierarchy, in which a set of parameters are coupled by making their distributions
depend on a shared underlying parameter. These distributions are often taken to be identical,
based on an assertion of exchangeability and an appeal to de Finetti’s theorem. Hierarchies help
to unify statistics, providing a Bayesian interpretation of frequentist concepts such as shrinkage
and random effects. Hierarchies also provide ways to specify non-standard distributional forms,
obtained as integrals over underlying parameters. They play a role in computational practice
in the guise of variable augmentation. These advantages are well appreciated in the world of
parametric modeling, and few Bayesian parametric modelers fail to make use of some aspect of
hierarchical modeling in their work.
Definition 4.30. A hierarchical Bayesian model is a Bayesian statistical model (p(x ∣ θ), p(θ)) in which the prior distribution p(θ) is decomposed into conditional distributions p(θ ∣ θ₁), p(θ₁ ∣ θ₂), …, p(θₙ₋₁ ∣ θₙ) and a marginal p(θₙ), such that
p(θ) = ∫ p(θ ∣ θ₁) p(θ₁ ∣ θ₂) ⋯ p(θₙ₋₁ ∣ θₙ) p(θₙ) dθ₁ ⋯ dθₙ.
The parameters θᵢ are called the hyperparameters of level i (1 ≤ i ≤ n).
The popularity of such models is partly due to their flexibility in modelling complex dependencies in
the data, but also due to the existence of computational algorithms to draw from the posterior (most notably
Markov Chain Monte Carlo methods, see chapter 5).
Example 4.31. Suppose in a medical trial there are I treatment groups. Denote by Xᵢⱼ the response of subject j in treatment group i. Within each group, we model the data as exchangeable:
Xᵢⱼ ∣ Θᵢ = θᵢ ∼ind N(θᵢ, 1)
Θ₁, …, Θ_I ∼ind N(ν, τ²)
for known values of ν and τ². The hyperparameters are ν and τ² in this case. Additional layers in the hierarchical model can be introduced by placing priors on ν and/or τ² as well. If this is pursued, dependence is created among all Xᵢⱼ.
Example 4.32. Suppose that a survey is conducted in 𝐼 cities. Each person surveyed is asked a yes-no
question. Denote by 𝑋𝑖𝑗 the response of person 𝑗 in city 𝑖. Set 𝑋𝑖𝑗 = 1 if the answer is “yes” and 𝑋𝑖𝑗 = 0 if
it is “no”. If we model the data within a city as exchangeable, then a possible model is given by
Xᵢⱼ ∣ Θᵢ = θᵢ ∼ind Ber(θᵢ)
Θ₁, …, Θ_I ∼ind Be(α, β).
Example 4.33. In the baseball data example introduced in section 3.4.1 in Young and Smith [2005], we
observe 𝑦𝑖 which is the number of home-runs out of 𝑛𝑖 times at bat (𝑖 = 1, … , 𝑛, 𝑛 = 17) from pre-season
data. Two models are proposed:
1. A model where the response is transformed and then modelled by a normal distribution. Define Xᵢ = √nᵢ arcsin(2Yᵢ/nᵢ − 1); then the model is given by:
Xᵢ ∣ Mᵢ = μᵢ ∼ind N(μᵢ, 1)
Mᵢ ∣ Θ = θ, T = τ ∼iid N(θ, τ²)
f_{(Θ,T)}(θ, τ) ∝ τ^{−1−2α*} e^{−β*/τ²}.
To avoid excessive notational overhead one sometimes writes xᵢ ∣ μᵢ ∼ind N(μᵢ, 1) instead of Xᵢ ∣ Mᵢ = μᵢ ∼ind N(μᵢ, 1) (for example). Then all quantities are written in lower-case.
4.6 Empirical Bayes
Suppose
X ∣ Θ = θ ∼ f_{X∣Θ}(⋅ ∣ θ),
Θ ∼ f_Θ(θ; η),
where η is the hyperparameter. A truly Bayesian analysis requires specification of the value of η. If there is insufficient prior information on η, the idea of empirical Bayes methods is to estimate η from the marginal density f_X. This can for example be done in a way similar to maximum likelihood, by defining
η̂ = argmax_η f_X(x; η). (4.12)
The “posterior” obtained by the empirical Bayes method is the “ordinary” posterior, with η̂ substituted for η. Empirical Bayes methods are neither classical nor Bayesian. It is observed that estimators obtained in this way are often “good” in terms of classical optimality criteria, such as minimaxity (details follow in Section 6.3 of Chapter 6).
Example 4.34. Suppose X₁, …, Xₙ ∣ Θ ∼iid Pois(Θ) and Θ ∼ Ga(a, b) a priori. The hyperparameter is given by (a, b). The marginal density of the data is given by
f_{X₁,…,Xₙ}(x₁, …, xₙ; a, b) = ∫ ∏_{i=1}^n f_{Xᵢ}(xᵢ ∣ θ) f_Θ(θ) dθ
= ∫₀^∞ e^{−nθ} (θ^S / C) (b^a / Γ(a)) θ^{a−1} e^{−bθ} dθ
= ( b^a / (CΓ(a)) ) ∫₀^∞ θ^{S+a−1} e^{−(b+n)θ} dθ = ( b^a / (CΓ(a)) ) Γ(S + a) / (b + n)^{S+a},
with S = ∑_{i=1}^n Xᵢ and C = ∏_{i=1}^n (Xᵢ!). The final equality follows upon noting that the integrand is proportional to the Ga(S + a, b + n) density. The empirical Bayes estimator for (a, b) is defined by
(â^{(EB)}, b̂^{(EB)}) = argmax_{(a,b) ∈ (0,∞)²} ( a log b − log Γ(a) + log Γ(S + a) − (S + a) log(b + n) ).
It is not entirely clear whether such a maximiser exists, and it certainly cannot be calculated in closed form. If we fix b = 1, for example, the problem gets easier (fixing a = 1 instead would correspond to an Exp(b) prior).
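With b fixed to 1, the remaining one-dimensional maximisation over a can be done numerically. A Julia sketch (the Optim and SpecialFunctions packages and the value of S are assumptions for illustration):

using Optim, SpecialFunctions

# Marginal log-likelihood in a with b = 1 fixed; the -log C term does not depend on a.
S, n = 42, 10                              # S = Σxᵢ from hypothetical Poisson counts
ℓ(a) = -loggamma(a) + loggamma(S + a) - (S + a) * log(1 + n)

res = optimize(a -> -ℓ(a), 1e-6, 100.0)    # maximise ℓ over a ∈ (0, 100] (Brent's method)
â = Optim.minimizer(res)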
Example 4.35. Suppose X ∣ Θ = θ ∼ N(θ, 1) and Θ ∼ N(0, A) a priori. Note that this prior is conjugate for the given statistical model (i.e. the posterior has a normal distribution). The “hyperparameter” for this model is A. The posterior mean is given by AX/(A + 1). For the marginal density of X, write X = Θ + Z with Z ∼ N(0, 1) independent of Θ; this shows that X ∼ N(0, 1 + A). The empirical Bayes method postulates that we find A by maximising f_X(X; A) over A ≥ 0. This gives Â = max(0, X² − 1).
Example 4.36. Suppose we define a family of priors for a fixed ε ∈ [0, 1] by letting
f_Θ^{(ε,q)}(θ) = (1 − ε)π(θ) + ε q(θ).
Here π is a fixed density and q can be any density. The idea is that ε is close to zero and π is the prior density the statistician has in mind. The family of densities {f_Θ^{(ε,q)}} is called a contamination family. We will choose q in an empirical Bayes fashion. Note that
f_X(x) = (1 − ε) ∫ f_{X∣Θ}(x ∣ θ)π(θ) dθ + ε ∫ f_{X∣Θ}(x ∣ θ)q(θ) dθ ≤ (1 − ε) ∫ f_{X∣Θ}(x ∣ θ)π(θ) dθ + ε f_{X∣Θ}(x ∣ θ̂),
where θ̂ is the maximum likelihood estimate of θ. It follows that the empirical Bayes prior is given by f_Θ(⋅) = (1 − ε)π(⋅) + ε δ_{θ̂}(⋅). This is a mixture of π and a Dirac measure at θ̂.
Exercise 4.19 Suppose X₁, …, Xₙ are conditionally independent given Θ = θ, each with the N(θ, 1) distribution. Assume Θ ∼ N(0, τ²) a priori. Here τ² is a hyperparameter.
1. Show that
Θ ∣ X₁, …, Xₙ ∼ N( τ² ∑_{i=1}^n Xᵢ / (nτ² + 1), τ² / (nτ² + 1) )
and conclude that the posterior mean is given by
( τ² / (nτ² + 1) ) nX̄ₙ.
Hint: first write down the joint density of (Θ, X₁, …, Xₙ) and then integrate out θ. First show that this integral equals
(2π)^{−(n+1)/2} (1/τ) e^{−½ ∑_{i=1}^n xᵢ²} ∫ exp( Sθ − θ²/(2uₙ) ) dθ,
where uₙ⁻¹ = n + τ⁻² and S = ∑_{i=1}^n xᵢ. Then show that this integral equals
√(2πuₙ) exp( ½ S² uₙ ).
3. Derive an estimator for the hyperparameter η := τ² as in (4.12), i.e. as the maximiser of the marginal likelihood. Verify that this estimator is given by
τ̂² = max{ 0, ( (∑_{i=1}^n xᵢ)² − n ) / n² }.
This leads to the empirical Bayes estimator
Θ̂_EB = ( τ̂² / (nτ̂² + 1) ) nX̄ₙ.
For the setting of the previous exercise we can now compare the following three estimators for θ:
• the posterior mean ( τ² / (1 + nτ²) ) nX̄ₙ;
• the derived empirical Bayes estimator ( τ̂² / (1 + nτ̂²) ) nX̄ₙ, with τ̂² = max( 0, ((nX̄ₙ)² − n) / n² );
• the maximum likelihood estimator X̄ₙ.
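The Monte Carlo comparison shown in Figures 4.2 and 4.3 below can be reproduced along the following lines (a Julia sketch; the values of n, θ and the number of replications are illustrative choices):

using Statistics

# One Monte Carlo replication: data X₁,…,Xₙ ~ N(θ, 1); return the three estimates.
function three_estimators(θ, n; τ² = 10.0)
    x̄ = mean(θ .+ randn(n))                        # X̄ₙ
    τ̂² = max(0, ((n * x̄)^2 - n) / n^2)             # empirical Bayes estimate of τ²
    return (τ² / (1 + n * τ²) * n * x̄,             # posterior mean
            τ̂² / (1 + n * τ̂²) * n * x̄,             # empirical Bayes estimator
            x̄)                                      # maximum likelihood estimator
end

reps = [three_estimators(2.0, 100) for _ in 1:10_000]
[mean((getindex.(reps, k) .- 2.0) .^ 2) for k in 1:3]   # compare mean squared errors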
Figure 4.2: Comparison of posterior mean (Bayes) with τ² = 10, empirical Bayes (Emp. Bayes) and maximum likelihood estimator by a Monte Carlo study. Each Monte Carlo sample is sampled as X₁, …, Xₙ ∼iid N(θ, 1) with θ = 2.
We compare the performance of these estimators by a Monte Carlo study, where we took τ² = 10 in the posterior mean and 10⁴ Monte Carlo samples. Each Monte Carlo sample is sampled as X₁, …, Xₙ ∼iid N(θ, 1). Figures 4.2 and 4.3 show the results for θ = 2 and θ = 0 respectively. In case θ = 2 all three estimators perform roughly the same. In case θ = 0 the situation is rather different: the empirical Bayes estimator is exactly equal to zero whenever (nX̄ₙ)² − n < 0, i.e. when |X̄ₙ| < 1/√n. This comes at the cost of slightly worse behaviour than the posterior mean and maximum likelihood estimator at approximately ±0.15.
Figure 4.3: Comparison of posterior mean (Bayes) with τ² = 10, empirical Bayes (Emp. Bayes) and maximum likelihood estimator by a Monte Carlo study. Each Monte Carlo sample is sampled as X₁, …, Xₙ ∼iid N(θ, 1) with θ = 0. Note the difference in vertical scale of the middle and lower figures.
(d) Combine parts (a) and (b) (or (c)) to find empirical Bayes estimators for Θ1 , … , Θ𝑝 .
Exercise 4.21 [YS exercise 3.13.] Let X ∼ Bin(n, θ) and consider a conjugate Be(α, β) prior distribution for θ.
1. Show that if we reparametrise from (α, β) to (μ, M), where μ = α/(α + β) and M = α + β, the marginal distribution of X is of beta-binomial form:
ℙ(X = x) = ( Γ(M) / (Γ(Mμ)Γ(M(1 − μ))) ) (n choose x) Γ(x + Mμ)Γ(n − x + M(1 − μ)) / Γ(n + M).
2. Verify that the marginal expectation and variance of X/n are respectively
E[X/n] = μ,  Var(X/n) = ( μ(1 − μ)/n ) ( 1 + (n − 1)/(M + 1) ).
In calculating Var(X/n) it is handy to use the law of total variance:
Var(X/n) = E[ Var(X/n ∣ θ) ] + Var( E[X/n ∣ θ] ).
4.7 Bayesian asymptotics
A natural question is whether, under the assumption X ∼ P_{θ₀}, the posterior Π_{𝐗ₙ} converges to δ_{θ₀} as n → ∞. That is, whether the posterior measure concentrates at the “true” parameter θ₀ asymptotically. If this is the case, we say the posterior is consistent.
Theorem 2.33 shows that under certain conditions, maximum likelihood estimators are asymptotically normal with asymptotic variance the inverse of the Fisher information. It turns out that, under mild conditions, the posterior distribution takes the same normal shape in the large-sample limit.
4.7.1 Consistency
We start with an example.
Example 4.37. Take X₁, …, Xₙ ∣ Θ = θ ∼ind Ga(2, θ). Then
f_{𝐗ₙ∣Θ}(𝐗ₙ ∣ θ) = ∏_{i=1}^n θ² Xᵢ e^{−θXᵢ}.
|θ − θ₀| ≤ |θ − m(x)| + |m(x) − θ₀|.
Hence
Π_x( {θ : |θ − m(x)| < ε/2 and |m(x) − θ₀| < ε/2} ) ≤ Π_x( {θ : |θ − θ₀| < ε} ),
where we used Chebyshev's inequality at the second inequality. Now substitute 𝐗ₙ for x and consider the limit n → ∞ under P_{θ₀}. Then v(𝐗ₙ) converges in probability to zero, by assumption. Taking the expectation under θ₀ of the second term, we see that this term tends to zero in L¹, as we assumed m(𝐗ₙ) − θ₀ to converge in probability to zero. As convergence in L¹ is stronger than convergence in probability, the second term converges in probability to zero. We conclude that 1 − Π_{𝐗ₙ}(B(θ₀, ε)) converges to zero under P_{θ₀}.
E𝜃0 [Θ ∣ 𝑋1 , … , 𝑋𝑛 ] → 𝜃0 , Var 𝜃0 (Θ ∣ 𝑋1 , … , 𝑋𝑛 ) → 0.
The following exercise shows that if the parameter set is finite, posterior consistency can be proved under
mild conditions.
μ_X(B) = ∫_B ∫_Ω ∏_{i=1}^n f(xᵢ ∣ θ) dμ_Θ(θ) dx₁ ⋯ dxₙ.
Π_x({θ_ℓ}) = ξ_ℓ L(θ_ℓ, x) / ∑_{k=1}^p ξ_k L(θ_k, x).
4. Show that
Π_x({θ_ℓ}) = ξ_ℓ / ∑_{k=1}^p ξ_k e^{−n z_{k,ℓ}(x)}
with
z_{k,ℓ}(x) = (1/n) log( L(θ_ℓ, x) / L(θ_k, x) ).
5. Now take a frequentist point of view, where we assume the data are generated with θ_ℓ. Assume the model is identifiable. Consider the estimator T = Π_X({θ_ℓ}) (note that the stochasticity is in X). Show that T converges in probability to 1 as n → ∞. In other words, the posterior concentrates on {θ_ℓ} asymptotically.
Hint: law of large numbers, Kullback-Leibler divergence.
log f_{Θ∣𝐗ₙ}(θ ∣ 𝐗ₙ) = log f_{Θ∣𝐗ₙ}(Θ̃ₙ ∣ 𝐗ₙ) + (θ − Θ̃ₙ) (∂/∂θ) log f_{Θ∣𝐗ₙ}(θ ∣ 𝐗ₙ) |_{θ=Θ̃ₙ} − ½ (θ − Θ̃ₙ)² Ĩₙ + ⋯
≈ log f_{Θ∣𝐗ₙ}(Θ̃ₙ ∣ 𝐗ₙ) − ½ (θ − Θ̃ₙ)² Ĩₙ,
since the first-order term vanishes at the posterior mode Θ̃ₙ,
where
Ĩₙ = −(∂²/∂θ²) log f_{Θ∣𝐗ₙ}(θ ∣ 𝐗ₙ) |_{θ=Θ̃ₙ}.
Hence
f_{Θ∣𝐗ₙ}(θ ∣ 𝐗ₙ) ∝ exp( −½ (θ − Θ̃ₙ)² Ĩₙ ),
which means that the posterior distribution of Θ is approximately N(Θ̃ₙ, Ĩₙ⁻¹).
Denote the k-th derivative of the loglikelihood with respect to θ by l⁽ᵏ⁾(θ ∣ 𝐗ₙ). Define the following assumptions:
(A2) log f_{X₁∣Θ}(x ∣ θ) is thrice differentiable with respect to θ in a neighbourhood (θ₀ − δ, θ₀ + δ) of θ₀. The expectations E_{θ₀} l⁽¹⁾(θ₀ ∣ X₁) and E_{θ₀} l⁽²⁾(θ₀ ∣ X₁) are both finite.
(A3) Interchange of the order of integration with respect to P_{θ₀} and differentiation at θ₀ is justified, so that
E_{θ₀} l⁽¹⁾(θ₀ ∣ X₁) = 0 and E_{θ₀} l⁽²⁾(θ₀ ∣ X₁) = −E_{θ₀}( l⁽¹⁾(θ₀ ∣ X₁) )².
Furthermore, I(θ₀; X₁) = E_{θ₀}( l⁽¹⁾(θ₀ ∣ X₁) )² < ∞.
The condition that is hardest to check is (A4). Common assumptions for deriving frequentist asymptotic results involve the behaviour of l(θ ∣ 𝐗ₙ) in a neighbourhood of θ₀. Since Bayesian estimators involve integration over the whole of Ω, it is also necessary to control l(θ ∣ 𝐗ₙ) at a distance from θ₀. Condition (A4) turns this requirement into a formal mathematical condition. Note that
( l(θ ∣ 𝐗ₙ) − l(θ₀ ∣ 𝐗ₙ) ) / n = (1/n) ∑_{i=1}^n log( f_{Xᵢ∣Θ}(xᵢ ∣ θ) / f_{Xᵢ∣Θ}(xᵢ ∣ θ₀) ).
Suppose the sequence Θ̂ₙ is strongly consistent for θ₀; in that case, it can be shown that Θ̂ₙ can be taken to satisfy the score equation l⁽¹⁾(θ ∣ 𝐗ₙ) = 0.
Theorem 4.39 (Theorem 4.2 in Ghosh et al. [2006]). In addition to (A1)-(A4), assume
• Θ̂ₙ is a strongly consistent solution of the score equation.
If Ψₙ = √n (Θ − Θ̂ₙ), then
lim_{n→∞} ∫ | f_{Ψₙ∣𝐗ₙ}(ψ ∣ 𝐗ₙ) − φ(ψ; 0, I(θ₀)⁻¹) | dψ = 0.
A closely related quantity, used in Corollary 4.40 below, is
Îₙ = −(∂²/∂θ²) log f_{𝐗ₙ∣Θ}(𝐗ₙ ∣ θ) |_{θ=Θ̂ₙ}.
Some remarks on this result:
• It shows that if we have a large number of observations, the prior does not matter (the prior is “washed away”/“overridden” by the data). Note that we assume the number of parameters in the statistical model does not grow with n, as sometimes happens to be the case in hierarchical models.
• The normal approximation to the posterior justifies summarising the posterior by its mean and standard deviation.
• It offers computational simplicity: whereas computation of the posterior mean is hard, computation of the posterior mode is usually much easier.
Proof of Theorem 4.39*.⁶ In the proof we write lₙ(θ) instead of l(θ ∣ 𝐗ₙ). For the posterior we have
f_{Θ∣𝐗ₙ}(θ ∣ 𝐗ₙ) ∝ ( ∏_{i=1}^n f_{Xᵢ∣Θ}(Xᵢ ∣ θ) / f_{Xᵢ∣Θ}(Xᵢ ∣ Θ̂ₙ) ) f_Θ(θ) = exp( lₙ(θ) − lₙ(Θ̂ₙ) ) f_Θ(θ).
Let Ψₙ = √n (Θ − Θ̂ₙ); then
f_{Ψₙ∣𝐗ₙ}(ψ ∣ 𝐗ₙ) ∝ exp( lₙ(Θ̂ₙ + ψ/√n) − lₙ(Θ̂ₙ) ) f_Θ(Θ̂ₙ + ψ/√n) =: hₙ(ψ)
(the Jacobian term equals 1/√n and is absorbed into the proportionality constant). If we define Cₙ = ∫ hₙ(ψ) dψ, then f_{Ψₙ∣𝐗ₙ}(ψ ∣ 𝐗ₙ) = Cₙ⁻¹ hₙ(ψ). We wish to show that
∫ | Cₙ⁻¹ hₙ(ψ) − √(I(θ₀)/(2π)) e^{−½ψ²I(θ₀)} | dψ
tends to zero. This integral can be bounded by
Cₙ⁻¹ ∫ | hₙ(ψ) − f_Θ(θ₀) e^{−½ψ²I(θ₀)} | dψ + ∫ | Cₙ⁻¹ f_Θ(θ₀) e^{−½ψ²I(θ₀)} − √(I(θ₀)/(2π)) e^{−½ψ²I(θ₀)} | dψ.
Suppose
Bₙ := ∫ | hₙ(ψ) − f_Θ(θ₀) e^{−½ψ²I(θ₀)} | dψ → 0.
Then Cₙ → ∫ f_Θ(θ₀) e^{−½ψ²I(θ₀)} dψ = f_Θ(θ₀) √(2π/I(θ₀)), which implies both terms in the bound tend to zero.
⁶ The proof is not part of the exam.
Hence, it suffices to show Bₙ → 0. We do this by separately considering the integrals over A₁ = {ψ : |ψ| > δ₀√n} and A₂ = {ψ : |ψ| ≤ δ₀√n}. Denote these integrals by I and II respectively. In the following, the limits are understood to hold with P_{θ₀}-probability 1.
Bounding I: the integral over domain A₁ is bounded by
∫_{A₁} hₙ(ψ) dψ + ∫_{A₁} f_Θ(θ₀) e^{−½ψ²I(θ₀)} dψ.
It is easy to see that the second integral tends to zero. For the first integral, if ψ ∈ A₁ we have lₙ(Θ̂ₙ + ψ/√n) − lₙ(Θ̂ₙ) < −εn for n sufficiently large, and hence the integral tends to zero as n → ∞.
for n sufficiently large. The final inequality follows from Îₙ → I(θ₀). It is easy to see that the integral of exp( −¼ψ²I(θ₀) ) over A₂ tends to zero as n → ∞.
The following corollary is proved in Schervish [1995] (Theorem 7.101). It shows that posterior probabilities converge in probability under P_{θ₀}, for any (Borel) subset B of ℝ.
Corollary 4.40. Define Λₙ = √Îₙ (Θ − Θ̂ₙ). Under “regularity conditions”,
ℙ(Λₙ ∈ B ∣ 𝐗ₙ) →p Φ(B) under P_{θ₀}, as n → ∞.
Here Φ(B) denotes the probability that a standard normal random variable lies in B.
Chapter 5
Bayesian computation
The posterior distribution is virtually always intractable. The field of Bayesian computation is centred on
computational techniques to approximate the posterior distribution or sample from it.
5.1 The Metropolis-Hastings algorithm
Except in exceptionally simple cases, the integral in the denominator of Bayes' formula, which is the normalising constant f_X(x), is intractable. Its evaluation poses a possibly high-dimensional integration problem when the dimension of the parameter is large. Markov chain Monte Carlo methods are a collection of techniques that can be used to obtain (dependent) samples from the posterior distribution. The main algorithm is the Metropolis-Hastings (MH) algorithm.
Definition 5.1. A Markov chain Monte Carlo (MCMC) method for sampling from a distribution 𝜋 is any
method producing an ergodic Markov chain whose stationary distribution is 𝜋.
For ease of exposition, we first consider the problem of sampling from a probability mass function π, supported on a countable set 𝒳, where π may be known only up to a normalising constant.
to the point y. Put differently, for each x, y ↦ q(x, y) is a probability vector; it plays the role of a conditional probability. As such, q is referred to as the proposal density. The output of this algorithm is a Markov chain {Xₙ} that has π as invariant distribution. Under weak additional assumptions,
(1/N) ∑_{n=1}^N g(Xₙ) → E_π g(X) almost surely, as N → ∞,
for π-integrable functions g (a precise statement is given by Theorem 5.8). From this description it is apparent that there is huge freedom in choosing q.
Definition 5.2. The Metropolis-Hastings (MH) algorithm is the algorithm by which a Markov chain is constructed which evolves xₙ = x to xₙ₊₁ by the following steps:
1. Draw a proposal y ∼ q(x, ⋅).
2. Compute
α(x, y) = min( 1, (π(y) q(y, x)) / (π(x) q(x, y)) ).
3. Set xₙ₊₁ = y with probability α(x, y), and xₙ₊₁ = x with probability 1 − α(x, y).
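A generic Julia implementation of these steps might look as follows (a sketch: logπ is the log of the target density, known only up to an additive constant; we use a symmetric uniform random-walk proposal, for which the ratio q(y, x)/q(x, y) cancels):

# Random-walk Metropolis-Hastings for a one-dimensional target.
function metropolis_hastings(logπ, x0, N; η = 1.0)
    chain = Vector{Float64}(undef, N)
    x = x0
    for n in 1:N
        y = x + η * (2rand() - 1)              # step 1: y ~ Unif(x − η, x + η)
        if log(rand()) < logπ(y) - logπ(x)     # steps 2-3: accept w.p. min(1, π(y)/π(x))
            x = y
        end
        chain[n] = x
    end
    return chain
end

# Dependent draws from Be(2.7, 6.3), cf. Example 5.9 below:
logπ_beta(θ) = 0 < θ < 1 ? 1.7log(θ) + 5.3log(1 - θ) : -Inf
chain = metropolis_hastings(logπ_beta, 0.5, 5_000; η = 1.0)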
For computing 𝛼(𝑥, 𝑦) it suffices to know 𝜋 up to a proportionality constant. Within Bayesian statistics,
this is a very attractive property of the algorithm, as it avoids computing 𝑓𝑋 (𝑥).
To understand the algorithm, note that the transition probabilities of the Markov chain defined by the MH-algorithm are given by
p(x, y) = q(x, y) α(x, y) for x ≠ y,
p(x, x) = q(x, x) + ∑_{z≠x} q(x, z)(1 − α(x, z)).
Hence, the MH-acceptance rule adjusts the transition probabilities from q to p. Then for y ≠ x,
π(x) p(x, y) = π(x) q(x, y) min( 1, (π(y) q(y, x)) / (π(x) q(x, y)) ) = min( π(x) q(x, y), π(y) q(y, x) ).
As the right-hand side is symmetric in x and y, it follows that π(x)p(x, y) = π(y)p(y, x). This relation is generally referred to as detailed balance. Summing over x on both sides gives
π(y) = ∑ₓ π(x) p(x, y).
This reveals that 𝜋 is invariant for the chain: if we draw 𝑥 according to 𝜋 and let the chain evolve, the
distribution at all future times will be exactly 𝜋.
Definition 5.3. Let (E, ℰ) denote the measurable state space. The Markov chain with transition kernel P is invariant with respect to π if
∫ π(dx) P(x, dy) = π(dy)
as measures on (E, ℰ). The Markov chain is said to satisfy detailed balance with respect to π if
π(dx) P(x, dy) = π(dy) P(y, dx)
as measures on (E × E, ℰ ⊗ ℰ). The resulting Markov chain is then said to be reversible with respect to π.
By integrating the detailed balance relation with respect to 𝑥 it follows that if a Markov chain is reversible
with respect to 𝜋 then it is invariant with respect to 𝜋.
The MH-algorithm accepts or rejects proposals from a Markov kernel Q to produce a Markov chain with kernel P which is reversible with respect to π. Thus, the chain evolves xₙ = x to xₙ₊₁ by the following steps.
Proof. Suppose Xₙ = x. To evolve the chain to time n + 1 we independently draw U ∼ Unif(0, 1) and Yₙ₊₁ ∼ Q(x, ⋅). The algorithm prescribes that Xₙ₊₁ either equals Xₙ or Yₙ₊₁, depending on the event {U < α(Xₙ, Yₙ₊₁)}. Hence
P(x, B) = ℙ(Xₙ₊₁ ∈ B ∣ Xₙ = x) = ℙ(Yₙ₊₁ ∈ B, U < α(Xₙ, Yₙ₊₁) ∣ Xₙ = x) + ℙ(Xₙ ∈ B, U ≥ α(Xₙ, Yₙ₊₁) ∣ Xₙ = x).
Given the kernel Q, a key question is how α : E × E → [0, 1] should be chosen to ensure P(x, dy) satisfies detailed balance with respect to π. The following theorem is due to Tierney [1998] (Theorem 2): the kernel P defined in (5.1) satisfies detailed balance with respect to π if and only if
α(x, y) r(x, y) = α(y, x),
where r denotes the density ratio appearing in that theorem. In particular, the choice α_MH(x, y) = min(1, r(y, x)) will imply detailed balance, since
α_MH(x, y) r(x, y) = min( r(x, y), r(y, x) r(x, y) ) = min( r(x, y), 1 ) = α_MH(y, x).
This is a general formulation of the MH-algorithm which even applies to infinite-dimensional settings. The result gets less abstract and easier to comprehend in case there is a common dominating measure. Suppose there is a measure μ such that π(dx) = π(x) μ(dx) and Q(x, dy) = q(x, y) μ(dy); then r(y, x) = (π(y) q(y, x)) / (π(x) q(x, y)), and α_MH reduces to the acceptance probability of Definition 5.2.
Definition 5.6. A Markov chain is 𝜋-irreducible if, for any initial state, it has positive probability of entering
any set to which 𝜋 assigns positive probability.
Proposition 5.7. Assume that the proposal kernel Q has density q. If q(x, y) > 0 for all x, y ∈ supp(π), then the induced MH-chain is π-irreducible.
The assumption is natural, as the induced MH-chain can only reach points that are proposed according
to 𝑄.
The following theorem gives sufficient conditions for convergence of MH-Markov chains.
Theorem 5.8 (Robert and Casella [2004], Theorem 7.4). Suppose the Metropolis-Hastings chain (Xᵢ, i ≥ 1) is π-irreducible.
1. If ∫ |h(x)| π(x) dx < ∞, then
lim_{n→∞} (1/n) ∑_{i=1}^n h(Xᵢ) = ∫ h(x) π(x) dx  π-a.s.
5.2 Examples of proposal kernels
2. Independent proposals. Here we take Q(x, dy) = q̄(y) dy with q̄ a probability density; hence the proposal is independent of the current state. In this case
α(x, y) = min( 1, (π(y) q̄(x)) / (π(x) q̄(y)) ).
The acceptance probability is maximised for q̄ = π, which is of course intractable. This does show that ideally q̄ resembles π.
² The total variation distance between two probability measures P and Q equals ‖P − Q‖_TV = sup_A |P(A) − Q(A)| (where the supremum is over all measurable sets). In case P and Q admit densities p and q with respect to a common dominating measure ν, then ‖P − Q‖_TV = ½ ∫ |p(x) − q(x)| ν(dx).
³ Optimal refers to the Markov chain for which E‖Xₙ − Xₙ₋₁‖² is maximal, where the expectation is over all proposed moves (including rejected ones), when the Markov chain has reached its stationary regime. There are various possible notions of optimality; we refer to Chapter 4 in Brooks et al. [2011] for additional information. Citing from that source: “Best is to find reasonably large proposed moves which are reasonably likely to be accepted.”
Figure 5.1: Output of the MH algorithm with independent Unif(0, 1) proposals. Top: trace plot. Bottom: histogram.
3. Langevin adjusted proposals. Here we choose a tuning parameter h > 0 and set
y = x + (h/2) ∇ log π(x) + √h Z,
where Z ∼ N(0, 1). Hence the proposal density q(x, ⋅) is that of the N( x + (h/2) ∇ log π(x), h ) distribution. The rationale behind this choice is that π is invariant for the Langevin diffusion
dXₜ = (A/2) ∇ log π(Xₜ) dt + √A dWₜ.
Here W is a Wiener process. The proposal above follows upon Euler discretisation of this stochastic differential equation (with A = 1 and step size h). The MH-acceptance probability corrects for the discretisation error made.
The derivation of the Langevin adjusted proposal is an example of a general strategy for finding MH-algorithms: construct a stochastic process that has π as invariant distribution. Ideally this process can be simulated without error. Otherwise, the MH-acceptance rule will correct for any (discretisation) error in simulating the process, to ensure the resulting Markov chain has π as its invariant distribution.
Example 5.9. Suppose we wish to simulate from the Be(a, b) distribution with a = 2.7 and b = 6.3. Of course, there exist direct ways of simulating independent realisations of the beta distribution; here we use the MH-algorithm to generate dependent draws. First, we use an independence sampler, where the proposals are independent draws from the Unif(0, 1) distribution. The results are in Figure 5.1. Next, we also use random-walk proposals: given the current state x we propose x′ := x + U with U ∼ Unif(−η, η); see Figures 5.2-5.4 for η = 10, 1 and 0.1.
Figure 5.2: Output of the MH algorithm with random walk proposals with η = 10. Top: trace plot. Bottom: histogram. Average acceptance probability equals 0.023.
Figure 5.3: Output of the MH algorithm with random walk proposals with η = 1. Top: trace plot. Bottom: histogram. Average acceptance probability equals 0.224.
Figure 5.4: Output of the MH algorithm with random walk proposals with η = 0.1. Top: trace plot. Bottom: histogram. Average acceptance probability equals 0.844.
Different proposal kernels Q₁, …, Q_p can be combined, either by cycling through them one after the other, or by forming the mixture kernel
Q(x, dy) = ∑_{i=1}^p wᵢ Qᵢ(x, dy).
As an application of cycling kernels, we consider One-at-a-time MH. Suppose we wish to generate samples
from 𝜋(𝑥) and that we split 𝑥 into two parts: 𝑥 = (𝑥1 , 𝑥2 ). Assume we have
• a proposal density 𝑄1 (𝑥1 , d𝑦1 ∣ 𝑥2 ) for the 1st component (𝑥2 fixed);
• a proposal density 𝑄2 (𝑥2 , d𝑦2 ∣ 𝑥1 ) for the 2nd component (𝑥1 fixed).
For ease of exposition, we assume 𝑄1 (𝑥1 , d𝑦1 ∣ 𝑥2 ) and 𝜋( d𝑥1 , 𝑥2 ) are dominated by a common dominating
measure and the corresponding densities are denoted by 𝑞1 (𝑥1 , 𝑦1 ∣ 𝑥2 ) and 𝜋(𝑥1 , 𝑥2 ) (and similarly for 𝑄2 ).
Suppose we have (𝑥1 , 𝑥2 ), then we evolve the chain by the following steps.
1. Draw y₁ ∼ q₁(x₁, ⋅ ∣ x₂) and accept with probability
α₁ = min( 1, (π(y₁, x₂) q₁(y₁, x₁ ∣ x₂)) / (π(x₁, x₂) q₁(x₁, y₁ ∣ x₂)) );
else set y₁ = x₁.
2. Draw y₂ ∼ q₂(x₂, ⋅ ∣ y₁) and accept with probability
α₂ = min( 1, (π(y₁, y₂) q₂(y₂, x₂ ∣ y₁)) / (π(y₁, x₂) q₂(x₂, y₂ ∣ y₁)) );
else set y₂ = x₂.
A special case is obtained by taking
q₁(x₁, y₁ ∣ x₂) = π(y₁ ∣ x₂),  q₂(x₂, y₂ ∣ x₁) = π(y₂ ∣ x₁).
Then
α₁ = min( 1, (π(y₁, x₂) π(x₁ ∣ x₂)) / (π(x₁, x₂) π(y₁ ∣ x₂)) ) = 1,
and similarly α₂ = 1. Hence in this case the acceptance probabilities equal 1. This algorithm is known as the Gibbs sampler. It prescribes to iteratively sample from the “full conditionals”.
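As a small illustration, a Julia sketch of the Gibbs sampler for the bivariate normal target with standard normal marginals and correlation ρ = 0.9 (cf. Exercise 5.1 below); the full conditionals are x₁ ∣ x₂ ∼ N(ρx₂, 1 − ρ²) and symmetrically for x₂:

using Statistics

function gibbs_bivariate_normal(N; ρ = 0.9)
    chain = Matrix{Float64}(undef, N, 2)
    x1, x2, s = 0.0, 0.0, sqrt(1 - ρ^2)    # s: conditional standard deviation
    for n in 1:N
        x1 = ρ * x2 + s * randn()          # draw from π(x₁ | x₂), accepted w.p. 1
        x2 = ρ * x1 + s * randn()          # draw from π(x₂ | x₁), accepted w.p. 1
        chain[n, :] .= (x1, x2)
    end
    return chain
end

chain = gibbs_bivariate_normal(10_000)
cor(chain[:, 1], chain[:, 2])              # ≈ 0.9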
Exercise 5.1. Suppose we wish to simulate from
Note that this is a mixture density. Of course, there is a simple direct way to sample from this density, but suppose we do not know this and wish to apply the MH-algorithm. Consider
• random walk MH, where Z ∼ Unif(−1, 1) and σ = 1;
1. Implement both methods and experiment with how they perform on simulated data.
2. Repeat for the bivariate normal distribution, where the marginals are standard normal and the correlation between the components is 0.9. For random walk MH, you can for example update both components iteratively, as in the Gibbs sampler.
3. Update θ ∣ μ, τ², X: sample
θ ∼ N( μ̄ₙ, τ²/n ).
4. Update τ² ∣ θ, X, μ: sample
τ⁻² ∼ Ga( n/2 + α*, β* + ½ ∑_{i=1}^n (μᵢ − θ)² ).
Therefore, conditional on (μ, θ, Y), v is Gamma distributed with shape parameter α* + n/2 and rate parameter β* + ½ ∑ᵢ (μᵢ − θ)².
In model 2, steps 1, 3 and 4 are the same as for model 1. For step 2 (updating μ), note that
p(μᵢ ∣ Yᵢ, τ², θ) ∝ ( e^{Yᵢμᵢ} / (1 + e^{μᵢ})^{nᵢ} ) exp( −(μᵢ − θ)² / (2τ²) ).
At the third equality we use Fubini's theorem; the final equality holds true provided that y_new and y are independent conditionally on θ. From this derivation it follows that the predictive mean in case of n_new trials equals n_new E[θ ∣ y] (take g the identity map).
In the lower panel of Figure 5.5 we compare the estimated number of home-runs based on pre-season data with the actual number of home-runs. For this, we computed the sum of the squared deviations. For the mean of the predictive distribution and the MLE these numbers are 2008 and 9051 respectively. This means that overall the Bayesian approach shows a large improvement.
Hence, overall, the estimates based on the posterior mean appear to be better than those based on maximum likelihood when considering all players together. Note in particular the difference in estimates for the players Sosa and Vaughn. Improving the estimators by learning from other players is often referred to as borrowing strength from others.
Here 𝑖 = 1, … , 𝑛. Derive the Gibbs-sampler for drawing from the posterior of (𝜃1 , … , 𝜃𝑛 , 𝛽).
Figure 5.5: Comparison of maximum likelihood estimators and posterior mean estimators for 𝑝𝑖 (based
on model 2). Top: posterior mean and maximum likelihood estimator based on pre-season data. Bottom:
estimated number of homeruns during the season based on both the mean of the predictive distribution and
the maximum likelihood estimator. In blue the observed number of homeruns during the season have been
added.
1. Find the form of the joint posterior distribution of (μ, τ) ∣ X, where X = (X₁, …, Xₙ). Note that this is not of standard form. Show that the conditional (posterior) distributions (also known as full conditionals) are of simple forms:
μ ∣ τ, X ∼ N( (τ ∑_{i=1}^n Xᵢ + κξ) / (τn + κ), 1 / (τn + κ) ),
τ ∣ μ, X ∼ Ga( α + n/2, β + ½ ∑_{i=1}^n (Xᵢ − μ)² ).
2. How can the derived distributions be used to devise an MCMC algorithm to draw from the posterior of (μ, τ)?
Note that specifying the marginal distribution p(x) of x is not necessary, due to the choice of model in (5.2). Logistic regression corresponds to the particular choice
p(y ∣ x, θ) = ∏_{i=1}^n p(yᵢ ∣ xᵢ, θ) = ∏_{i=1}^n ( 1/(1 + e^{−xᵢᵀθ}) )^{yᵢ} ( 1 − 1/(1 + e^{−xᵢᵀθ}) )^{1−yᵢ}.
This turns the inference problem into an optimisation problem. As the posterior is intractable, so is KL(q, π), but we have the decomposition
KL(q, π) = E_q log( q(Θ)/π(Θ) ) = E_q log q(Θ) − E_q log f_{Θ,X}(Θ, x) + log f_X(x).
Now if we define the Evidence Lower BOund by (the dependence on x is suppressed in the notation)
ELBO(q) = E_q log f_{Θ,X}(Θ, x) − E_q log q(Θ),
then
q* = argmax_{q ∈ 𝒬} ELBO(q).
The “trick” is now to take the class 𝒬 sufficiently large to get a good approximation, while still being able to compute ELBO(q) for q ∈ 𝒬.
We can decompose the ELBO as follows:
ELBO(q) = E_q log f_Θ(Θ) + E_q log f_{X∣Θ}(x ∣ Θ) − E_q log q(Θ) = E_q ℓ(Θ ∣ x) − KL(q, f_Θ),
where ℓ(Θ ∣ x) = log f_{X∣Θ}(x ∣ Θ) is the loglikelihood. So the VI-approximation balances two terms:
• the first term encourages densities that place their mass on θs that explain the observed data;
• the second term encourages densities close to the prior.
So the variational objective mirrors the usual balance between likelihood and prior.
Probably the most popular class of approximating densities is the mean field variational family, where it is assumed that if θ ∈ ℝᵐ then
q(θ) = ∏_{j=1}^m qⱼ(θⱼ).
This simply means that the joint density factorises into its marginals. Even for the mean field family, deriving an expression for ELBO(q) usually requires lengthy, tricky calculations. For examples we refer to chapter 21 in Murphy [2012].
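To make the objective concrete: ELBO(q) can always be estimated by Monte Carlo with draws from q. Below is a Julia sketch for a toy model (N(θ, 1) data with an N(0, 1) prior and a Gaussian q; the data and all numbers are made up, and in this conjugate example the optimal q is of course known exactly):

using Distributions, Statistics

x = randn(20) .+ 1.0                       # simulated data (assumption: θ₀ = 1)
prior = Normal(0, 1)

# ELBO(q) = E_q[ log f(x | θ) + log f_Θ(θ) − log q(θ) ], estimated with M draws from q.
function elbo(m, s; M = 10_000)
    q = Normal(m, s)
    θ = rand(q, M)
    loglik = [sum(logpdf.(Normal(t, 1), x)) for t in θ]
    return mean(loglik .+ logpdf.(prior, θ) .- logpdf.(q, θ))
end

n = length(x)
elbo(n * mean(x) / (n + 1), sqrt(1 / (n + 1)))   # q = exact posterior: maximal ELBO
elbo(0.0, 1.0)                                    # q = the prior: smaller ELBO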
Figure 5.6: The blue curves are the contours of the true density 𝜋. The red curves are approximations. Left:
using 𝐾𝐿(𝜋, 𝑞) leads to a 𝑞 that tries to cover both modes. Middle and right: using 𝐾𝐿(𝑞, 𝜋) forces 𝑞 to
choose one of the modes. This is what is done using Variational Inference.
Figure 5.7: The blue curves are the contours of the true density π. The red curves are approximations. Left: using KL(π, q) leads to a q that tries to cover the full support of π. Middle and right: using KL(q, π) forces q to choose one of the modes. This is what is done using Variational Inference.
Exercise 5.2. Show that log f_X(x) ≥ ELBO(q). Hence log f_X(x), sometimes called the log-evidence in machine learning, is lower bounded by ELBO(q). This explains the nomenclature.
Note however that for intractable π (which is what we are dealing with), the expectation in KL(π, q) is over π and is therefore intractable. It is easy to see that KL(q, π) is infinite if π(θ) = 0 while q(θ) > 0 (in a neighbourhood). We say that KL(q, π) is zero-forcing for q, and hence q* will typically underestimate the support of π. Some terminology: q* and q° are known as the information projection and reverse information projection of π on 𝒬.
To clearly see the difference between the two types of projection, consider Figures 5.6 and 5.7, which are taken from chapter 21 in Murphy [2012]. Here, the blue curves represent contour lines of the density of π, whereas the red curves represent contour lines of the best approximation, which in this case is taken to be a distribution with elliptic contours (i.e. a multivariate normal distribution).
2. Let
θ⁽ⁱ⁾ = argmax_θ Q(θ, θ⁽ⁱ⁻¹⁾), (5.5)
where
Q(θ, θ⁽ⁱ⁻¹⁾) = ∫ log f_{X,Z}(x, z; θ) f_{Z∣X}(z ∣ x; θ⁽ⁱ⁻¹⁾) dz.
3. Check for convergence of either the loglikelihood or the parameter values. If the convergence criterion is not satisfied, increase i by 1 and return to step (2).
Commonly, (x, z) are referred to as the “full data”. The second step requires computing the expected full loglikelihood, under the distribution of Z ∣ X with the current iterate of the parameter.
There are various ways to motivate the EM-algorithm. Here, we follow the exposition in Särkkä [2013]
(Section 12.2.3). Let 𝑞 be a probability density. By Jensen’s inequality
q^{(i+1)}_{θ⁽ⁱ⁾}(z) = f_{Z∣X}(z ∣ x; θ⁽ⁱ⁾),
where we add the subscript to q to highlight its dependence on θ⁽ⁱ⁾ (the dependence on x is dropped, as the data are fixed). Next, note that the second step satisfies
argmax_θ F(q^{(i+1)}, θ) = argmax_θ ∫ q^{(i+1)}_{θ⁽ⁱ⁾}(z) log f_{X,Z}(x, z; θ) dz = argmax_θ Q(θ, θ⁽ⁱ⁾).
For a full discussion and conditions for the algorithm to converge to the MLE we refer to Section 9.4 in
Bishop [2006] and Section 7.2 in Groeneboom and Jongbloed [2014].
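As an illustration, here is a compact Julia sketch of the EM iteration for a two-component Gaussian mixture with unit variances, where the latent component labels zᵢ play the role of the unobserved part of the “full data” (the model and the data are assumptions for illustration; both the E- and M-step are available in closed form here):

using Distributions, Statistics

# EM for x ~ w N(μ₁, 1) + (1 − w) N(μ₂, 1); zᵢ ∈ {1, 2} are the latent labels.
function em_mixture(x; iters = 200)
    w, μ1, μ2 = 0.5, minimum(x), maximum(x)           # crude initialisation
    for _ in 1:iters
        # E-step: γᵢ = P(zᵢ = 1 | xᵢ; current parameters)
        p1 = w .* pdf.(Normal(μ1, 1), x)
        p2 = (1 - w) .* pdf.(Normal(μ2, 1), x)
        γ = p1 ./ (p1 .+ p2)
        # M-step: maximise Q(θ, θ⁽ⁱ⁻¹⁾), which is available in closed form
        w = mean(γ)
        μ1 = sum(γ .* x) / sum(γ)
        μ2 = sum((1 .- γ) .* x) / sum(1 .- γ)
    end
    return w, μ1, μ2
end

x = [randn(100) .- 2; randn(100) .+ 2]                # simulated mixture data
em_mixture(x)                                          # ≈ (0.5, −2, 2)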
Exercise 5.3. Verify that the maximum a posteriori (MAP) estimator can also be approximated using the EM-algorithm, by changing (5.5) to θ⁽ⁱ⁾ = argmax_θ Q(θ, θ⁽ⁱ⁻¹⁾) + log f_Θ(θ).
Chapter 6
Statistical decision theory
In this chapter we show that many statistical methods, such as point estimation and hypothesis testing, are
part of a general framework provided by statistical decision theory. We discuss optimality of statistical
decision rules and connect these to both classical and Bayesian statistics.
6.1 Introduction
Statistical decision theory is about making decisions under uncertainty. It provides a unifying framework to think about what makes a “good” estimator, test, or more generally a statistical procedure. Within this framework one often speaks of “decision rules”. Tests, confidence sets and estimators all turn out to be examples of decision rules. The conceptual framework is due to A. Wald (1939). You might wonder, “What's new?”. Whereas “conventional statistics” is only directed towards the use of sampling information (data) in making inferences about θ, decision theory combines the sampling information with knowledge of the consequences of our decisions. Using a loss function and the data X, a decision rule maps the data X to an action a. Depending on the objective of the decision, actions can be “estimate the parameter θ using estimator Θ̂”, “reject the hypothesis that θ ∈ Ω₀ ⊂ Ω”, or “change the treatment of all patients within one month”.
Definition 6.1. Suppose 𝑋 takes values in ℝᵏ and 𝑋 ∼ 𝑃𝜃, where 𝜃 ∈ Ω. Denote by 𝔸 the set of allowable
actions and let 𝒜 be a 𝜎-field on 𝔸. The measurable space (𝔸, 𝒜) is called the action space. A decision
rule is a measurable function 𝑑 ∶ (ℝᵏ, ℬ(ℝᵏ)) → (𝔸, 𝒜). If a decision rule 𝑑 is chosen, we take action
𝑑(𝑋) ∈ 𝔸 when 𝑋 is observed.
The construction or selection of decision rules requires a criterion that expresses a preference ordering on
decision rules.
Definition 6.2. The loss function 𝐿 is a measurable function 𝐿 ∶ Ω × 𝔸 → [0, ∞). If 𝑋 = 𝑥 is observed
then the loss for using decision rule 𝑑 is given by 𝐿(𝜃, 𝑑(𝑥)) when the parameter value equals 𝜃.
Definition 6.3. The risk function of a decision rule 𝑑 is defined by
𝑅(𝜃, 𝑑) = E𝜃 𝐿(𝜃, 𝑑(𝑋)),
the expected loss incurred by 𝑑 when the parameter value equals 𝜃.

Example 6.4. In estimation problems 𝔸 = Ω. A decision rule is usually called an estimator in this case.
Common loss functions include:
• 𝐿₂-loss: 𝐿(𝜃, 𝑎) = ‖𝜃 − 𝑎‖². The corresponding risk is known as the Mean Squared Error. If Ω ⊂ ℝ then
𝑅(𝜃, 𝑑) = E𝜃(𝑑(𝑋) − 𝜃)² = Var_𝜃 𝑑(𝑋) + (E𝜃[𝜃 − 𝑑(𝑋)])²,
which is the bias-variance decomposition.
• 𝐿1 -loss: 𝐿(𝜃, 𝑎) = ‖𝜃 − 𝑎‖.
• Large deviation loss: choose 𝑐 > 0 and define
𝐿𝑐 (𝜃, 𝑎) = 𝟏{‖𝜃 − 𝑎‖ > 𝑐}.
Example 6.5. The following example is taken from Berger [1985] (chapter 1). Suppose a drug company has
to decide whether or not to market a new pain reliever and suppose that there are two factors affecting the
decision
1. 𝜂: proportion of people for which the drug will be effective;
2. 𝜃: proportion of the market that the drug will capture.
For ease of exposition we only take 𝜃 into account. Sample information for 𝜃 can be obtained by interviewing
people. Suppose the sample size equals 𝑛 and let 𝑋 denote the number of people that will buy the drug. If
people decide independently on buying, then we may assume 𝑋 ∼ 𝐵𝑖𝑛 (𝑛, 𝜃). The maximum likelihood
estimator for 𝜃 is given by 𝑋∕𝑛.
Suppose now that overestimation is considered twice as costly as underestimation. Should we still use
the MLE as estimator? Within a decision-theoretic perspective, we can deal with this problem by specifying
an appropriate loss-function. In this case we can take
𝐿(𝜃, 𝑎) = 𝜃 − 𝑎 if 𝜃 > 𝑎, and 𝐿(𝜃, 𝑎) = 2(𝑎 − 𝜃) if 𝜃 ≤ 𝑎,
where 𝑎 ∈ [0, 1]. Further optimality criteria for choosing a decision rule (an estimator in this case) are required
for deciding which rule to use.
for deciding which rule to use.
Example 6.6. In the hypothesis testing problem
𝐻0 ∶ 𝜃 ∈ Ω0 𝐻1 ∶ 𝜃 ∈ Ω1 ,
a decision rule is called a test. The action space can be taken 𝔸 = {𝑎0 , 𝑎1 } with 𝑎0 = {accept 𝐻0 } and
𝑎1 = {accept 𝐻1 }. We can make two errors: erroneously accepting 𝐻0 and erroneously accepting 𝐻1 . In
case we find these errors equally important, the following loss function is appropriate:
𝐿(𝜃, 𝑎₀) = 0 if 𝜃 ∈ Ω₀ and 1 if 𝜃 ∈ Ω₁;    𝐿(𝜃, 𝑎₁) = 1 if 𝜃 ∈ Ω₀ and 0 if 𝜃 ∈ Ω₁.
This loss function is called zero-one loss. For a given decision rule 𝑑 we have
𝑅(𝜃, 𝑑) = 𝐿(𝜃, 𝑎0 )P𝜃 (𝑑(𝑋) = 𝑎0 ) + 𝐿(𝜃, 𝑎1 )P𝜃 (𝑑(𝑋) = 𝑎1 )
as 𝑑 can only take 2 values. Hence
𝑅(𝜃, 𝑑) = P𝜃(𝑑(𝑋) = 𝑎₁) if 𝜃 ∈ Ω₀, and 𝑅(𝜃, 𝑑) = P𝜃(𝑑(𝑋) = 𝑎₀) if 𝜃 ∈ Ω₁.
These probabilities correspond to type I and type II errors in classical hypothesis testing. If we identify 𝑎0
with 0 and 𝑎1 with 1, then we might also have chosen 𝔸 = {0, 1}. In that case the decision rule is of the
form 𝟏𝐶 (𝑋), the set 𝐶 being the critical region of the test.
6.2 Comparing decision rules
Within the framework of decision theory, statistical inference has an interpretation as a random game
with two players: “nature” and the “statistician”. In this game we fix
• an action space 𝔸;
• a loss function 𝐿(𝜃, 𝑎) defined on Ω × 𝔸: the cost of taking action 𝑎 ∈ 𝔸 when the parameter is 𝜃.
The game is then played as follows:
1. nature picks a “true” parameter value 𝜃₀ ∈ Ω;
2. the statistician observes 𝑋 ∼ P_{𝜃₀} and plays 𝑎 ∈ 𝔸 in response, where 𝑎 = 𝑑(𝑋) is determined by the
statistician's decision rule;
3. the statistician incurs the loss 𝐿(𝜃₀, 𝑎).
The goal is to find an “optimal” decision rule (in a sense to be made precise shortly), which is a mapping 𝑑
from the sample space to 𝔸.
Example 6.7. Suppose 𝑋 ∼ 𝑁 (𝜃, 1), 𝜃 ∈ ℝ and we wish to estimate 𝜃. We consider the rules 𝑑(𝑋) = 𝑋
and 𝑑 ′ (𝑋) = 2. Obviously, for most values of 𝜃, 𝑑 ′ is a silly choice, but in case 𝜃 = 2 it is perfect. Under
𝐿₂-loss,
𝑅(𝜃, 𝑑) = 1 and 𝑅(𝜃, 𝑑′) = (2 − 𝜃)²,
from which we see that 𝑑′ has strictly smaller risk than 𝑑 if 𝜃 ∈ (1, 3).
A decision rule 𝑑 is at least as good as 𝑑′ if 𝑅(𝜃, 𝑑) ≤ 𝑅(𝜃, 𝑑′) for all 𝜃 ∈ Ω. It is better if, in addition, strict
inequality holds for at least one 𝜃. In that case we say that 𝑑′ is dominated by 𝑑 and call 𝑑′ inadmissible. A
decision rule that is not inadmissible is called admissible.
Strictly speaking, one should speak of admissibility with respect to a given loss function. Admissibility
rules out certain decision rules (which therefore should not be used), but it is a very weak requirement. In
Example 6.7, for instance, 𝑑′ is admissible, though few statisticians would consider it a good choice. Maximum
likelihood estimators need not be admissible, as the following example shows.
Example 6.9. Suppose 𝑋₁, …, 𝑋ₙ ∣ Θ = 𝜃 are independent 𝐸𝑥𝑝(𝜃). The maximum likelihood estimator is
given by 𝑑(𝑋) = 1∕𝑋̄ₙ. For 𝐿₂-loss it is inadmissible. To see why, we first note that Σᵢ₌₁ⁿ 𝑋ᵢ ∼ 𝐺𝑎(𝑛, 𝜃), which
implies
E𝜃 𝑑(𝑋) = 𝑛 ∫₀^∞ (1∕𝑧)(𝜃ⁿ∕Γ(𝑛)) 𝑧^{𝑛−1} 𝑒^{−𝜃𝑧} d𝑧 = (𝑛∕(𝑛 − 1)) 𝜃.
Therefore, the decision rule 𝑑′(𝑋) = ((𝑛 − 1)∕𝑛) 𝑑(𝑋) is unbiased for estimating 𝜃. Since
𝑅(𝜃, 𝑑′) = Var 𝑑′(𝑋) = ((𝑛 − 1)∕𝑛)² Var 𝑑(𝑋) < Var 𝑑(𝑋) ≤ 𝑅(𝜃, 𝑑),
𝑑′ improves upon 𝑑 for every 𝜃, so 𝑑 is inadmissible.
Another famous example of an inadmissible estimator is given in the next subsection.
6.2.1 The James-Stein estimator

Suppose 𝑋 ∼ 𝑁_𝑝(𝜃, 𝐼_𝑝) and consider estimating 𝜃 ∈ ℝᵖ under 𝐿₂-loss on the basis of the single observation
𝑋. The maximum likelihood estimator Θ̂ = 𝑋 has constant risk
E[Σᵢ₌₁ᵖ (Θ̂ᵢ − 𝜃ᵢ)²] = Σᵢ₌₁ᵖ E(𝑋ᵢ − 𝜃ᵢ)² = 𝑝.
For 𝛼 > 0, consider the James-Stein (JS) type estimator Θ̂^(𝐽𝑆) = (1 − 𝛼∕‖𝑋‖²)𝑋. Then
‖𝜃 − Θ̂^(𝐽𝑆)‖² = ‖𝜃 − 𝑋 + 𝛼𝑋∕‖𝑋‖²‖² = ‖𝜃 − 𝑋‖² + 2𝛼 𝑋ᵀ(𝜃 − 𝑋)∕‖𝑋‖² + 𝛼²∕‖𝑋‖².
We take expectations on both sides of this equation. Define for each 𝑖 the function ℎᵢ ∶ ℝᵖ → ℝ by
ℎᵢ(𝑥) = 𝑥ᵢ ∕ Σⱼ₌₁ᵖ 𝑥ⱼ².
By Stein's identity (the lemma below), E[(𝑋ᵢ − 𝜃ᵢ)ℎᵢ(𝑋)] = E[(𝜕ℎᵢ∕𝜕𝑥ᵢ)(𝑋)], and a direct computation gives
Σᵢ₌₁ᵖ (𝜕ℎᵢ∕𝜕𝑥ᵢ)(𝑥) = (𝑝 − 2)∕‖𝑥‖². This implies
E[‖𝜃 − Θ̂^(𝐽𝑆)‖²] = 𝑝 − (2𝛼(𝑝 − 2) − 𝛼²) E[1∕‖𝑋‖²].
The term 2𝛼(𝑝 − 2) − 𝛼² is strictly positive when 0 < 𝛼 < 2(𝑝 − 2), so for 𝑝 ≥ 3 and such 𝛼, Θ̂^(𝐽𝑆) has strictly
smaller risk than the MLE, for every 𝜃.
Lemma (Stein's identity). Suppose 𝑋 ∼ 𝑁(𝜇, 𝜎²) and ℎ ∶ ℝ → ℝ is absolutely continuous with E|ℎ′(𝑋)| < ∞.
Then E[(𝑋 − 𝜇)ℎ(𝑋)] = 𝜎² E[ℎ′(𝑋)].

Proof. First assume 𝜇 = 0 and 𝜎 = 1. Without loss of generality we can assume ℎ(0) = 0. By Fubini's
theorem,
∫₀^∞ 𝑥ℎ(𝑥)𝑒^{−𝑥²∕2} d𝑥 = ∫₀^∞ [∫₀^𝑥 ℎ′(𝑦) d𝑦] 𝑥𝑒^{−𝑥²∕2} d𝑥 = ∫₀^∞ ℎ′(𝑦) [∫_𝑦^∞ 𝑥𝑒^{−𝑥²∕2} d𝑥] d𝑦 = ∫₀^∞ ℎ′(𝑦)𝑒^{−𝑦²∕2} d𝑦.
A similar computation applies to the integral over (−∞, 0], and combining the two gives the result for 𝜇 = 0,
𝜎 = 1. For general 𝜇 and 𝜎, apply this to 𝑔(𝑧) = ℎ(𝜇 + 𝜎𝑧): with 𝑍 ∼ 𝑁(0, 1),
E[(𝑋 − 𝜇)ℎ(𝑋)] = 𝜎E[𝑍𝑔(𝑍)] = 𝜎E[𝑔′(𝑍)] = 𝜎²E[ℎ′(𝑋)].
Here we used the result for 𝜇 = 0 and 𝜎 = 1 at the second equality sign.
The JS-estimator has an empirical Bayes interpretation. For that, consider the model
𝑋ᵢ ∣ Θᵢ = 𝜃ᵢ ∼ 𝑁(𝜃ᵢ, 1), independently for 𝑖 = 1, …, 𝑝,
Θ₁, …, Θ_𝑝 ∼ 𝑁(0, 𝐴), independently.
Under this model E[Θᵢ ∣ 𝑋] = (1 − 1∕(𝐴 + 1))𝑋ᵢ, and (𝑝 − 2)∕‖𝑋‖² is an unbiased estimator of 1∕(𝐴 + 1);
substituting it into the posterior mean yields the JS-estimator with 𝛼 = 𝑝 − 2.

Proof. It is well known that 𝑌 ∶= ‖𝑋‖²∕(𝐴 + 1) ∼ 𝜒²_𝑝, as marginally the 𝑋ᵢ are independent 𝑁(0, 𝐴 + 1).
To prove the result, rewrite
(𝑝 − 2)∕‖𝑋‖² = (1∕(𝐴 + 1)) ⋅ (𝑝 − 2)∕𝑌
and use that E[1∕𝑌] = 1∕(𝑝 − 2) for 𝑌 ∼ 𝜒²_𝑝 when 𝑝 ≥ 3.
The way in which 𝐴 is estimated is not prescribed by the empirical Bayes method, so we can alternatively
use maximum likelihood to estimate 𝐴. It is not too hard to establish that 𝑋₁, …, 𝑋_𝑝 are marginally independent,
each with the 𝑁(0, 1 + 𝐴) distribution. An estimator for 𝐴 can therefore be obtained as the maximiser of
𝐴 ↦ (2𝜋(1 + 𝐴))^{−𝑝∕2} exp(−‖𝑋‖²∕(2(1 + 𝐴)))
over 𝐴 ∈ [0, ∞). This problem has solution 𝐴̂ = (‖𝑋‖²∕𝑝 − 1)₊, where (𝑥)₊ = max(𝑥, 0). The resulting
estimator is known as the truncated James-Stein estimator.
Although the JS-estimator strictly dominates the MLE, it is itself inadmissible. The use of seemingly
unrelated data in forming estimators is sometimes referred to as borrowing strength. Bayesian hierarchical
modelling incorporates this idea automatically by introducing dependence via additional layers in the prior
specification. If the sole purpose is estimating one specific 𝜃ᵢ, then 𝑋ᵢ is admissible (and minimax, a concept
that we will define in the next section).
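The dominance claim is easily checked by simulation; a small Julia sketch (the dimension, true parameter and replication count are arbitrary choices of mine):

p, nrep = 10, 10_000
θ = fill(1.0, p)                         # arbitrary true parameter vector
α = p - 2                                # the classical James-Stein choice
mse_mle, mse_js = 0.0, 0.0
for _ in 1:nrep
    x = θ .+ randn(p)                    # X ∼ N_p(θ, I)
    js = (1 - α / sum(abs2, x)) .* x     # (1 − α∕‖X‖²) X
    global mse_mle += sum(abs2, x .- θ) / nrep
    global mse_js  += sum(abs2, js .- θ) / nrep
end
println((mle = mse_mle, js = mse_js))    # the JS risk comes out well below p = 10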
6.3 Minimax and Bayes decision rules

The first notion of optimality compares decision rules by their maximum risk: prefer 𝑑 to 𝑑′ if
sup_𝜃 𝑅(𝜃, 𝑑) < sup_𝜃 𝑅(𝜃, 𝑑′). Alternatively, 𝑑 is minimax if
sup_{𝜃∈Ω} 𝑅(𝜃, 𝑑) = inf_{𝑑′} sup_{𝜃∈Ω} 𝑅(𝜃, 𝑑′).
This criterion chooses the decision rule which behaves best in the worst-case scenario.
The second notion of optimality is called Bayes optimality. Here we assign weights to different values of 𝜃
by means of a prior density 𝑓Θ . The Bayes risk is defined as a weighted average of the risk with respect to
this prior.
Definition 6.14. Bayes risk principle: in comparing decision rules, choose the rule 𝑑 with the smallest
value of
𝑟(𝜇Θ, 𝑑) = ∫ 𝑅(𝜃, 𝑑) d𝜇Θ(𝜃),
where 𝜇Θ is a probability measure on Ω. The number 𝑟(𝜇Θ , 𝑑) is known as the Bayes risk of 𝑑 with respect
to 𝜇Θ . Any decision rule that minimises the Bayes risk is called a Bayes decision rule.
Both sup𝜃∈Ω 𝑅(𝜃, 𝑑) and 𝑟(𝜇Θ , 𝑑) are nonnegative real numbers that depend on the chosen decision rule,
but not on 𝑥 (as the risk function is defined by taking an expectation over 𝑋 under P𝜃 ).
Definition 6.15. The posterior risk of 𝑑 using prior measure 𝜇Θ is defined as
∫ 𝐿(𝜃, 𝑑(𝑥)) Π𝑥( d𝜃),
where Π𝑥 denotes the posterior distribution of Θ given 𝑋 = 𝑥. Under mild conditions, a rule that minimises
the posterior risk for 𝜇Θ-almost every 𝑥 is a Bayes rule.
Remark 6.17. In classical statistics, an estimator is considered good if it is close to the true value on average
or in the long-run (pre-trial): the loss is averaged by taking an expectation over 𝑋. In Bayesian statistics
the loss is averaged by taking an expectation over Θ ∣ 𝑋. This is an expectation after seeing the data and is
sometimes called post-trial.
From here, we can perhaps most clearly see how fundamentally different classical and Bayesian statistics
are (as opposed to how Bayesian methods are sometimes incorrectly introduced, as merely adding a prior to
the framework of classical statistics). Classical statistics bases inference on the distribution of 𝑑(𝑋) under P𝜃,
an average over all data sets that might have been observed; Bayesian statistics bases inference on the
distribution of Θ given the data that actually were observed.
Exercise 6.1 [YS exercise 2.1.] Let 𝑋 ∼ 𝑈 𝑛𝑖𝑓 (0, 𝜃), where 𝜃 > 0 is unknown. Let the action
space be [0, ∞) and the loss function 𝐿(𝜃, 𝑑) = (𝜃 − 𝑑)2 , where 𝑑 is the action chosen. Consider
the decision rules
𝑑𝜇 (𝑥) = 𝜇𝑥, 𝜇 ≥ 0.
For what value of 𝜇 is 𝑑𝜇 unbiased? Show that 𝜇 = 3∕2 is a necessary condition for 𝑑𝜇 to be
admissible.
Exercise 6.2 The risks for five decision rules 𝛿1 , … , 𝛿5 depend on the value of a positive-valued
parameter 𝜃. The risks are given in the table below
𝛿1 𝛿2 𝛿3 𝛿4 𝛿5
0≤𝜃<1 10 10 7 6 8
1≤𝜃<2 8 11 8 5 10
2≤𝜃 15 11 12 14 14
4. Suppose 𝜃 has a uniform distribution on [0, 5]. Which is the Bayes rule and what is the Bayes
risk for that rule?
Exercise 6.3 [YS exercise 3.5.] At a critical stage in the development of a new aeroplane, a de-
cision must be taken to continue or to abandon the project. The financial viability of the project
can be measured by a parameter 𝜃 ∈ (0, 1), the project being profitable if 𝜃 > 1∕2. Data 𝑥 provide
information about 𝜃. We assume 𝑥 to be a realisation of 𝑋 ∼ P𝜃 .
• If 𝜃 < 1∕2, the cost to the taxpayer of continuing the project is 1∕2 − 𝜃 (in units of billion
dollars), whereas if 𝜃 > 1∕2 it is zero (since the project will be privatised if profitable).
• If 𝜃 > 1∕2 the cost of abandoning the project is 𝜃 − 1∕2 (due to contractual arrangements for
purchasing the aeroplane from the French), whereas if 𝜃 < 1∕2 it is zero.
2. Derive the Bayes decision rule in terms of the posterior mean of 𝜃 by choosing the decision
rule that minimises the posterior risk.
3. The Minister of Aviation has prior density 6𝜃(1−𝜃) for 𝜃. The Prime Minister has prior density
4𝜃 3 . The prototype aeroplane is subjected to trials, each independently having probability 𝜃
of success, and the data 𝑥 consist of the total number of trials required for the first successful
result to be obtained. For what values of 𝑥 will there be serious ministerial disagreement?
Exercise 6.4 Suppose we have a single observation, 𝑋, which comes from a distribution with
density function 𝑓𝜃, with 𝜃 ∈ {0, 1}, and we want to test
𝐻₀ ∶ 𝑓(𝑥) = 𝑓₀(𝑥) = 𝟏_{[0,1]}(𝑥)
against
𝐻₁ ∶ 𝑓(𝑥) = 𝑓₁(𝑥) = 2𝑥𝟏_{[0,1]}(𝑥).
1. Using Neyman-Pearson, show that the best critical region for the likelihood ratio test of 𝐻0
versus 𝐻1 is given by 𝑋 ≥ 𝐵 for some constant 𝐵.
2. Consider now choosing 𝐵 using decision theory. Suppose the losses incurred by a type II
error is four times the loss of a type I error. Consider decision rules 𝑑𝐵 which choose 𝐻1 if
𝑋 ≥ 𝐵.
(a) Write down the loss function when the action space is 𝔸 = {𝑎₀, 𝑎₁} with 𝑎₀ = {accept 𝐻₀}
and 𝑎₁ = {accept 𝐻₁}.
(b) Calculate the risks 𝑅(0, 𝑑𝐵 ) and 𝑅(1, 𝑑𝐵 ) as functions of 𝐵. Use this to find the value
of 𝐵 which gives the minimax rule.
(c) Calculate the Bayes risk, when the prior probabilities are 1∕4 and 3∕4 for 𝐻0 and 𝐻1
respectively, and find the value of 𝐵 which gives the Bayes rule.
Having introduced admissibility, minimax rules and Bayes rules it is interesting to see how these concepts
relate. This is a broad field from which we can only discuss a few main results. Typical questions that are
part of statistical decision theory include:
3. Are all admissible rules Bayes for some prior? (complete class theorem)
4. Are Bayes rules minimax? (extended Bayes rules which are equaliser rules are minimax)
5. Are minimax rules Bayes for some prior 𝜋? (requires existence of a least favourable prior).
In brackets we have given hints or partial answers to these questions (some concepts appearing in these
answers will be discussed as we proceed along this chapter). One relation is easy: the Bayes risk is always
smaller than (or equal to) the maximum risk:
𝑟(𝜇Θ, 𝑑) = ∫ 𝑅(𝜃, 𝑑) d𝜇Θ(𝜃) ≤ sup_{𝜃∈Ω} 𝑅(𝜃, 𝑑).
In Section 6.8 we will investigate the role of sufficient statistics in the questions posed above.
Related to question 3 we introduce the following definition.
Definition 6.18.
1. A class 𝒟 of decision rules is complete if for any decision rule 𝑑 ∉ 𝒟 there exists a decision rule
𝑑′ ∈ 𝒟 that dominates 𝑑.
2. A class 𝒟 of decision rules is essentially complete if for any decision rule 𝑑 ∉ 𝒟 there exists a decision
rule 𝑑′ ∈ 𝒟 such that 𝑅(𝜃, 𝑑′) ≤ 𝑅(𝜃, 𝑑) for all 𝜃.
Theorem 6.19. Assume Ω is finite. If the prior measure 𝜇Θ satisfies 𝜇Θ ({𝜃}) > 0 for all 𝜃 ∈ Ω, then a Bayes
rule with respect to 𝜇Θ is admissible.
Theorem 6.21. Suppose Ω is a subset of the real line. Assume that the risk functions 𝑅(𝜃, 𝑑) are continuous
in 𝜃 for all decision rules 𝑑. Suppose that for any 𝜀 > 0 and any 𝜃 the interval (𝜃 − 𝜀, 𝜃 + 𝜀) has positive
probability under the prior 𝜇Θ . Then a Bayes rule with respect to 𝜇Θ is admissible.
Proof. Suppose that 𝑑̄ is a Bayes rule with respect to 𝜇Θ , but not admissible. Then there exists another
decision rule 𝑑′ such that
𝑅(𝜃, 𝑑′) ≤ 𝑅(𝜃, 𝑑̄) for all 𝜃 ∈ Ω, and 𝑅(𝜃₀, 𝑑′) < 𝑅(𝜃₀, 𝑑̄) for some 𝜃₀ ∈ Ω.
From the continuity of the risk functions it follows that there exists 𝜀 > 0 such that 𝑅(𝜃, 𝑑′) < 𝑅(𝜃, 𝑑̄) for all
𝜃 ∈ 𝐼 ∶= (𝜃₀ − 𝜀, 𝜃₀ + 𝜀). Hence, by assumption, 𝜇Θ(𝐼) > 0. As a result,
𝑟(𝜇Θ, 𝑑′) = ∫_Ω 𝑅(𝜃, 𝑑′) 𝜇Θ( d𝜃) < ∫_Ω 𝑅(𝜃, 𝑑̄) 𝜇Θ( d𝜃) = 𝑟(𝜇Θ, 𝑑̄),
which contradicts that 𝑑̄ is a Bayes rule.
In case the loss function is strictly convex, admissibility of Bayes rules follows from the following theo-
rem.
Theorem 6.22. Assume the action space 𝔸 is a convex subset of ℝ𝑚 and that all P𝜃 are absolutely continuous
with respect to each other. If 𝐿(𝜃, ⋅) is strictly convex for all 𝜃, then for any probability measure Π on Ω, the
Bayes rule 𝑑Π is admissible.
Proof. Suppose 𝑑Π is not admissible. Then there exists 𝑑0 such that 𝑅(𝜃, 𝑑0 ) ≤ 𝑅(𝜃, 𝑑Π ) with strict inequality
for some 𝜃. Define a new decision rule
𝑑Π (𝑥) + 𝑑0 (𝑥)
𝑑1 (𝑥) = ,
2
then 𝑑₁(𝑥) ∈ 𝔸 by convexity of 𝔸. By strict convexity of 𝐿(𝜃, ⋅), for all 𝜃 we have
𝑅(𝜃, 𝑑₁) ≤ ½𝑅(𝜃, 𝑑Π) + ½𝑅(𝜃, 𝑑₀) ≤ 𝑅(𝜃, 𝑑Π),
where the first inequality is strict whenever P𝜃(𝑑₀(𝑋) ≠ 𝑑Π(𝑋)) > 0. As 𝑑₀ and 𝑑Π do not have identical risk
functions, this happens for some 𝜃, and by the mutual absolute continuity of the P𝜃 then for all 𝜃. Integrating
over Π gives 𝑟(Π, 𝑑₁) < 𝑟(Π, 𝑑Π), contradicting that 𝑑Π is a Bayes rule.

6.5 Bayes rules in various settings

First consider estimation of 𝜃 under 𝐿₂-loss, i.e. 𝐿(𝜃, 𝑎) = ‖𝜃 − 𝑎‖². The posterior risk equals
∫ ‖𝜃 − 𝑑‖² Π𝑥( d𝜃)
and the Bayes rule minimises this expression with respect to 𝑑. Taking partial derivatives with respect
to each element 𝑑ᵢ of the vector 𝑑 and equating to zero, we easily derive that the Bayes rule is given by
𝑑 = ∫ 𝜃 Π𝑥( d𝜃), which is the posterior mean.
If Ω ⊂ ℝ and we take
𝐿(𝜃, 𝑎) = 𝑐₁(𝜃 − 𝑎) if 𝑎 ≤ 𝜃, and 𝐿(𝜃, 𝑎) = 𝑐₂(𝑎 − 𝜃) if 𝑎 > 𝜃,    (6.1)
then it follows that if 𝑎 minimises the posterior risk, it satisfies the equation
Π𝑥((−∞, 𝑎)) = 𝑐₁∕(𝑐₁ + 𝑐₂).    (6.2)
Hence, the Bayes rule for this loss is the 𝑐1 ∕(𝑐1 +𝑐2 )-quantile of the posterior. In particular, taking 𝑐1 = 𝑐2 = 1
the resulting point estimator is the posterior median.
Exercise 6.6 Prove (6.2). Hint: consider the posterior risk and differentiate the latter with respect
to 𝑎.
Finally, consider large deviation loss, which is defined by 𝐿𝑐 (𝜃, 𝑎) = 𝟏{‖𝜃 − 𝑎‖ > 𝑐} for a fixed 𝑐 > 0.
The Bayes rule is the value of 𝑎 that has the largest posterior mass in a ball with radius 𝑐. If the prior is
approximately flat, then this 𝑎 will be close to the value maximising the likelihood. For this loss function
minimising the posterior risk is equivalent to maximising
𝑑 ↦ ∫_{𝜃∶ ‖𝜃−𝑑‖≤𝑐} 𝑓Θ∣𝑋(𝜃 ∣ 𝑥) d𝜃.
Upon letting 𝑐 ↓ 0, the Bayes rule tends to the value of 𝜃 for which 𝑓Θ∣𝑋(𝜃 ∣ 𝑥) is maximal (if such a value
exists). When it exists, we call argmax_𝜃 𝑓Θ∣𝑋(𝜃 ∣ 𝑥) the posterior mode, also known as the maximum a
posteriori (MAP) estimator.
Example 6.23. This is a continuation of Example 6.5. Suppose we decide to take a 𝑈𝑛𝑖𝑓(0, 1)-prior on Θ.
Then Θ ∣ 𝑋 = 𝑥 ∼ 𝐵𝑒(𝑥 + 1, 𝑛 − 𝑥 + 1). The loss function that takes into account that overestimation is
twice as costly as underestimation is of the form (6.1) with 𝑐₁ = 1 and 𝑐₂ = 2. The Bayes rule is therefore
the 1∕3-quantile of the posterior.
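In Julia this Bayes rule is a one-liner (a small sketch; the values 𝑛 = 20 and 𝑥 = 7 are made up for illustration):

using Distributions

n, x = 20, 7                      # hypothetical: 7 of 20 interviewees would buy
post = Beta(x + 1, n - x + 1)     # posterior under the Unif(0,1) prior
println(quantile(post, 1/3))      # Bayes rule for loss (6.1) with c₁ = 1, c₂ = 2
println(mean(post))               # posterior mean, for comparison: larger, as it ignores the asymmetry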
Next, consider testing 𝐻₀ ∶ 𝜃 ∈ Ω₀ versus 𝐻₁ ∶ 𝜃 ∈ Ω₁ with action space 𝔸 = {𝑎₀, 𝑎₁} and loss function
given by the following table:

        𝜃 ∈ Ω₀    𝜃 ∈ Ω₁
𝑎₀       0         𝐿₀
𝑎₁       𝐿₁        0
The Bayes action is 𝑎0 if 𝐿0 Π𝑥 (Ω1 ) < 𝐿1 Π𝑥 (Ω0 ). That is, we take action 𝑎0 if
𝐿₀∕𝐿₁ < Π𝑥(Ω₀)∕Π𝑥(Ω₁).    (6.3)
For hypothesis testing, assume the prior 𝜇Θ is of the form
𝜇Θ = 𝜋₀ 𝜇Θ^(0) + 𝜋₁ 𝜇Θ^(1).
Here 𝜋₀ + 𝜋₁ = 1 and 𝜋₀ is the prior probability that hypothesis 𝐻₀ is true. Furthermore, 𝜇Θ^(0) and 𝜇Θ^(1) are
prior (probability) measures supported on Ω₀ and Ω₁ respectively, implying that 𝜇Θ is a probability measure
on Ω. If we assume P𝜃 ≪ 𝜈 and denote 𝐿(𝜃; 𝑥) = (dP𝜃∕d𝜈)(𝑥), then
Π𝑥(Ω₀)∕Π𝑥(Ω₁) = [𝜋₀ ∫_{Ω₀} 𝐿(𝜃; 𝑥) 𝜇Θ^(0)( d𝜃)] ∕ [𝜋₁ ∫_{Ω₁} 𝐿(𝜃; 𝑥) 𝜇Θ^(1)( d𝜃)] = (𝜋₀∕𝜋₁) × 𝐵₀,₁(𝑥),    (6.4)
where
𝐵₀,₁(𝑥) = ∫_{Ω₀} 𝐿(𝜃; 𝑥) 𝜇Θ^(0)( d𝜃) ∕ ∫_{Ω₁} 𝐿(𝜃; 𝑥) 𝜇Θ^(1)( d𝜃)
is known as the Bayes factor of 𝐻₀ versus 𝐻₁.
Jeffreys advocated the use of the Bayes factor as a direct and intuitive measure of evidence to be used in
alternative to, say, 𝑝-values, for evidence against a hypothesis. A good discussion on Bayes factors can be
found in section 4.4 of Young and Smith [2005]. Note that the formulation of the prior in terms of measures
rather than densities easily incorporates point-null hypothesis testing: if Ω0 = {𝜃0 } then 𝜇Θ(0) = 𝛿𝜃0 (Dirac
mass at 𝜃0 ). Equation (6.4) then implies
Π𝑥({𝜃₀}) = (1 + 𝜋₁∕(𝜋₀𝐵₀,₁(𝑥)))^{−1}.
Exercise 6.7 Verify the preceding calculation. Check that if 𝜋0 = 𝜋1 = 1∕2 and 𝐵0,1 (𝑥) = 1, then
Π𝑥 ({𝜃0 }) = 1∕2.
If both Ω0 = {𝜃0 } and Ω1 = {𝜃1 } (point null versus a single alternative testing), we have 𝜇Θ(0) = 𝛿𝜃0 and
𝜇Θ(1) = 𝛿𝜃1 . In this case the Bayes factor is the likelihood ratio and we have
Π𝑥(Ω₀)∕Π𝑥(Ω₁) = (𝜋₀∕𝜋₁) × 𝐿(𝜃₀; 𝑥)∕𝐿(𝜃₁; 𝑥).
Compared to the Neyman-Pearson lemma, the data enter in exactly the same way (via the likelihood ratio),
but the decision to accept/reject a hypothesis is made on completely different criteria.
The differences between classical and Bayesian hypothesis testing are clearly illustrated by the following
example, an instance of what is known as Lindley's paradox.
Example 6.24. Assume 𝑋₁, …, 𝑋ₙ are iid 𝐸𝑥𝑝(𝜃). Suppose we wish to test 𝐻₀ ∶ 𝜃 = 1 versus 𝐻₁ ∶ 𝜃 ≠ 1.
Let 𝑆ₙ = Σᵢ₌₁ⁿ 𝑋ᵢ. Under the null hypothesis 𝑇 ∶= (𝑆ₙ − 𝑛)∕√𝑛 has asymptotically a standard normal
distribution. Hence the test with test function 𝜙(𝑋₁, …, 𝑋ₙ) = 𝟏{|𝑇| ≥ 𝜉_{𝛼∕2}} has significance level 𝛼 in the
large sample limit. (Of course, for this testing problem there is no need to use asymptotics, as 𝑆ₙ ∼ 𝐺𝑎(𝑛, 𝜃),
but it turns out to be convenient for the remainder of this example.) Note that if we observe
𝑠ₙ = 𝑛 + 𝜉_{𝛼∕2}√𝑛    (6.5)
then |𝑇| = 𝜉_{𝛼∕2}, so the level-𝛼 test (just) rejects, for any value of 𝑛.
We now also view the problem from the Bayesian point of view using a prior measure on 𝜃. Let 𝜋0 and
𝜋1 be nonnegative and add to one. Define
𝜇Θ(𝐴) = 𝜋₀ 𝛿_{1}(𝐴) + 𝜋₁ ∫_𝐴 (𝑏^𝑎∕Γ(𝑎)) 𝜃^{𝑎−1} 𝑒^{−𝑏𝜃} 𝟏_{[0,∞)}(𝜃) d𝜃,
for 𝐴 a Borel set in ℝ, where 𝛿_{1}(𝐴) = 𝟏_𝐴(1) (writing the measure, instead of the density 𝑓Θ, is somewhat
more convenient due to the occurrence of the Dirac mass). Writing 𝑠 = Σᵢ 𝑥ᵢ, the posterior odds are given by
[𝜋₁ ∫₀^∞ 𝜃ⁿ 𝑒^{−𝜃𝑠} Γ(𝑎)^{−1} 𝑏^𝑎 𝜃^{𝑎−1} 𝑒^{−𝑏𝜃} d𝜃] ∕ [𝜋₀ 𝑒^{−𝑠}] =∶ (𝜋₁∕𝜋₀) 𝐵ₙ,
with
𝐵ₙ = 𝑒^𝑠 (Γ(𝑛 + 𝑎)∕Γ(𝑎)) 𝑏^𝑎∕(𝑏 + 𝑠)^{𝑛+𝑎}.
To simplify, we take 𝑎 = 𝑏 = 1, so that 𝐵ₙ = 𝑒^𝑠 𝑛! (1 + 𝑠)^{−𝑛−1}. Applying Stirling's formula,
𝑛! ∼ √(2𝜋𝑛) 𝑛ⁿ 𝑒^{−𝑛}, we get
𝐵ₙ ∼ √(2𝜋∕𝑛) 𝑒^{𝑠−𝑛} ((1 + 𝑠)∕𝑛)^{−(𝑛+1)}.
Next, we take 𝑠 = 𝑠ₙ, with 𝑠ₙ as defined in (6.5). This gives
𝐵ₙ ∼ √(2𝜋∕𝑛) 𝑒^{𝜉_{𝛼∕2}√𝑛} (1 + 1∕𝑛 + 𝜉_{𝛼∕2}∕√𝑛)^{−(𝑛+1)}.
This behaves asymptotically as a constant multiple of √(2𝜋∕𝑛) and therefore tends to zero. So in case the
observations satisfy (6.5), the frequentist test rejects for any value of 𝑛, whereas the posterior probability of
the alternative hypothesis tends to zero. Therefore, the conclusions from both methods are rather different.
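The asymptotics are visible already for moderate 𝑛. A quick Julia check (taking 𝛼 = 0.05, so 𝜉_{𝛼∕2} ≈ 1.96 is an assumption of this sketch):

using SpecialFunctions   # provides loggamma

ξ = 1.96
logB(n, s) = s + loggamma(n + 1) - (n + 1) * log(1 + s)   # log 𝐵ₙ for a = b = 1
for n in (10, 100, 1_000, 10_000)
    s = n + ξ * sqrt(n)                    # observations on the rejection boundary, cf. (6.5)
    println((n = n, B = exp(logB(n, s))))  # 𝐵ₙ decreases to zero like a multiple of √(2𝜋∕𝑛)
end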
Exercise 6.8 Suppose 𝑋₁, …, 𝑋ₙ ∼ 𝑁(𝜃, 𝜎²) and assume 𝜎² is known. We consider testing
𝐻₀ ∶ 𝜃 ≤ 𝜃₀ versus 𝐻₁ ∶ 𝜃 > 𝜃₀.
3. Compare the computed posterior probability with the 𝑝-value and comment on this.
Now consider the binary classification problem where 𝔸 = Ω = {0, 1}. Then
Θ_{𝑀𝐴𝑃} = 𝟏{𝑓𝑋∣Θ(𝑥 ∣ 1)∕𝑓𝑋∣Θ(𝑥 ∣ 0) ≥ 𝑓Θ(0)∕𝑓Θ(1)},
in which we recognise both the likelihood ratio (Bayes factor) and the prior ratio.
As an example, we consider the binary detection problem from the communications literature. Here, an emitter
outputs either “0” or “1” with a priori probabilities 𝑓Θ(0) and 𝑓Θ(1) respectively. Each digit is transmitted
through a noisy channel that adds to it a 𝑁(0, 𝜎²) random variable. This is the simplest, but most common,
model for channel noise in digital communications. As this model stipulates that 𝑋 ∣ Θ = 𝜃 ∼ 𝑁(𝜃, 𝜎²),
we get
Θ_{𝑀𝐴𝑃} = 𝟏{2𝑋 − 1 ≥ 2𝜎² log(𝑓Θ(0)∕𝑓Θ(1))}.
Exercise 6.10
2. (*) Now suppose 𝜎² is unknown. Set 𝑢 = 1∕𝜎² and consider 𝑢 to be a realisation of the random
variable 𝑈, which gets assigned the 𝐺𝑎(𝛼, 𝛽)-distribution. Find the MAP estimator.
Hint: 𝑓Θ∣𝑋(𝜃 ∣ 𝑥) = ∫₀^∞ 𝑓Θ,𝑈∣𝑋(𝜃, 𝑢 ∣ 𝑥) d𝑢.
3. (*) Derive the Bayes rule when the loss for incorrectly deciding “1” equals 1, but the loss for
incorrectly deciding “0” equals 𝑐 > 0.
We extend the binary classification problem to the case of a known finite number of classes. Linear
discriminant analysis is concerned with the following statistical model: suppose 𝑋 = (𝑋1 , … , 𝑋𝑛 ) satisfies
𝑋 = 𝜃 𝕀 + 𝑍, with 𝑍 ∼ 𝑁 (0, 𝐶) .
Here 𝕀 denotes the vector with 𝑛 times a 1 and 𝑍 is assumed to be multivariate Normally distributed, with co-
variance matrix 𝐶. If 𝐶 is diagonal, then the 𝑋𝑖 are independent, else not. We assume 𝜃 ∈ Ω = {𝜃1 , … , 𝜃𝑀 }.
The MAP estimator is defined by
Θ_{𝑀𝐴𝑃} = argmax_{1≤𝑖≤𝑀} (−½‖𝑋 − 𝜃ᵢ𝕀‖²_𝐶 + log 𝑓Θ(𝜃ᵢ)),
where ‖𝑣‖²_𝐶 ∶= 𝑣ᵀ𝐶^{−1}𝑣 denotes the squared Mahalanobis norm.
Decision theory also applies to prediction problems: based on data 𝑋, suppose we wish to predict a related
random quantity 𝑌 taking values in a set 𝒴.

Definition 6.25. The predictive loss function¹ 𝐿 is a measurable function from 𝒴 × 𝔸 to [0, ∞). If (𝑋, 𝑌) =
(𝑥, 𝑦) is observed then the loss for using decision rule 𝑑 is given by 𝐿(𝑦, 𝑑(𝑥)).
Just like the definition of posterior risk is obtained by averaging the loss over the posterior of Θ (Cf.
definition 6.15), we now define the predictive risk by averaging the loss over the predictive distribution of 𝑌 .
Definition 6.26. The predictive risk of 𝑑 with respect to the prior 𝜇Θ is defined as
∫ 𝐿(𝑦, 𝑑(𝑥)) 𝑓𝑌∣𝑋(𝑦 ∣ 𝑥) d𝑦,
where 𝑓𝑌∣𝑋 is the density of the predictive distribution of 𝑌 given 𝑋 = 𝑥 (which depends on 𝜇Θ).
6.6 Finite decision problems

Example 6.27. This example is extracted from Exercise 4.1 in YS. Suppose the random variable 𝑋 has one
of two possible densities:
𝑓𝜃(𝑥) = 𝜃𝑒^{−𝜃𝑥}𝟏_{[0,∞)}(𝑥), 𝜃 ∈ {1, 2}.
This is clearly a finite decision problem, the parameter set Ω being equal to {1, 2}.
We consider the family of decision rules
𝑑_𝜇(𝑥) = 1 if 𝑥 ≥ 𝜇, and 𝑑_𝜇(𝑥) = 2 if 𝑥 < 𝜇,
indexed by 𝜇 ∈ [0, ∞]. To completely specify the decision problem, define the loss function to be 𝐿(𝜃, 𝑑) =
|𝜃 − 𝑑|. The risk function is then given by
𝑅(𝜃, 𝑑𝜇 ) = E𝜃 |𝜃 − 𝑑𝜇 (𝑋)|
= E𝜃 |𝜃 − 𝑑𝜇 (𝑋)|𝟏{𝑋≥𝜇} + E𝜃 |𝜃 − 𝑑𝜇 (𝑋)|𝟏{𝑋<𝜇}
= |𝜃 − 1|P𝜃 (𝑋 ≥ 𝜇) + |𝜃 − 2|P𝜃 (𝑋 < 𝜇)
= |𝜃 − 1|𝑒−𝜃𝜇 + |𝜃 − 2|(1 − 𝑒−𝜃𝜇 ).
¹ I am not aware of common terminology.
By sketching 𝑅(1, 𝑑_𝜇) = 1 − 𝑒^{−𝜇} and 𝑅(2, 𝑑_𝜇) = 𝑒^{−2𝜇} as functions of 𝜇, it becomes clear that the
minimax rule satisfies
1 − 𝑒^{−𝜇} = 𝑒^{−2𝜇}.
Letting 𝜉 = 𝑒^{−𝜇} this becomes a quadratic equation in 𝜉 with only one positive solution, which is equal to
𝜉_minmax = (−1 + √5)∕2. Hence
𝜇_minmax = −log((−1 + √5)∕2) ≈ 0.48.
For a given prior on Ω, the Bayes rule can be determined. If {1} gets prior probability 𝜋₁ then the Bayes risk
equals
𝜋₁𝑅(1, 𝑑_𝜇) + (1 − 𝜋₁)𝑅(2, 𝑑_𝜇) = 𝜋₁(1 − 𝜉) + (1 − 𝜋₁)𝜉².
Minimising this function over all 𝜉 > 0 is straightforward and gives 𝜉_Bayes = 𝜋₁∕(2(1 − 𝜋₁)) and
𝜇_Bayes = −log 𝜉_Bayes (provided 𝜉_Bayes < 1).
Exercise 6.11 For what prior mass function for 𝜃 does the minimax rule coincide with the Bayes
rule?
Example 6.28. Suppose Ω = {𝜃₀, 𝜃₁} and the action set is given by 𝔸 = {𝑎₀, 𝑎₁}. Assume the loss function
is given by the following table:

𝜃∖𝑎     𝑎₀    𝑎₁
𝜃₀       0     2
𝜃₁       1     0
Suppose the statistician gets to see 𝑋 ∼ 𝐵𝑒𝑟(𝜃), with 𝜃 ∈ Ω. Since both the sample space {0, 1} and
the action space 𝔸 are finite, we are able to write down all (non-randomised) decision rules. These are
𝑑₁(𝑋) = 𝑎₀,    𝑑₂(𝑋) = 𝑎₁,
𝑑₃(𝑋) = 𝑎₀ if 𝑋 = 0 and 𝑎₁ if 𝑋 = 1,    𝑑₄(𝑋) = 𝑎₁ if 𝑋 = 0 and 𝑎₀ if 𝑋 = 1.
For computing the risk function of these decision rules we use that
𝑅(𝜃₀, 𝑑) = 2 P_{𝜃₀}(𝑑(𝑋) = 𝑎₁) and 𝑅(𝜃₁, 𝑑) = P_{𝜃₁}(𝑑(𝑋) = 𝑎₀),
and that P𝜃(𝑋 = 1) = 𝜃.
For computing the minimax decision rule in this example, it turns out that we need to consider randomised
decision rules.
Randomised decision rules are decision rules that include an “external” randomisation (think for example
of randomised tests). Here we give the definition of a randomised decision rule when a finite number of
(nonrandom) decision rules is given to us (the general definition is somewhat involved).
Definition 6.29. Let 𝑝ᵢ ∈ [0, 1], 𝑖 = 1, …, 𝐼, be such that Σᵢ₌₁^𝐼 𝑝ᵢ = 1. Suppose 𝑑₁, …, 𝑑_𝐼 are decision
rules. A randomised decision rule 𝑑* is obtained by choosing rule 𝑑ᵢ with probability 𝑝ᵢ. The loss of the
randomised decision rule 𝑑* is defined by
𝐿(𝜃, 𝑑*) = Σᵢ₌₁^𝐼 𝑝ᵢ 𝐿(𝜃, 𝑑ᵢ).
On pages 12 and 13 of Young and Smith [2005] it is explained how minimax and Bayes decision rules
can be found using the geometry of the risk set in case 𝑘 = 2. Precise mathematical statements that justify
these derivations are given in section 3.2.4 of Schervish [1995].
Definition 6.30. Suppose Ω = {𝜃₁, 𝜃₂, …, 𝜃_𝑘}. The risk set is defined as
𝒮 = {(𝑅(𝜃₁, 𝑑), …, 𝑅(𝜃_𝑘, 𝑑)) ∶ 𝑑 a randomised decision rule} ⊂ ℝᵏ.

Theorem 6.31. The risk set 𝒮 is convex.

Proof. Suppose 𝑧, 𝑤 ∈ 𝒮 and 𝛼 ∈ [0, 1]. Suppose 𝑧 corresponds to decision rule 𝑑_𝑧 and 𝑤 to decision rule
𝑑_𝑤. For 𝑖 = 1, …, 𝑘,
𝛼𝑧ᵢ + (1 − 𝛼)𝑤ᵢ = 𝛼𝑅(𝜃ᵢ, 𝑑_𝑧) + (1 − 𝛼)𝑅(𝜃ᵢ, 𝑑_𝑤) = 𝑅(𝜃ᵢ, 𝑑*),
with 𝑑* the randomised decision rule that equals 𝑑_𝑧 with probability 𝛼 and 𝑑_𝑤 with probability 1 − 𝛼.
Example 6.32 (Continuation of example 6.28). Take 𝜃₀ = 1∕3 and 𝜃₁ = 3∕4. Using P𝜃(𝑋 = 1) = 𝜃, the risk
points of 𝑑₁, …, 𝑑₄ are (0, 1), (2, 0), (2∕3, 1∕4) and (4∕3, 3∕4) respectively, and the risk set is the convex hull
of these four points. The minimax (randomised) rule minimises max{𝑅(𝜃₀, 𝑑*), 𝑅(𝜃₁, 𝑑*)}.
A sketch of the risk set then reveals that 𝑑4 is inadmissible (any probability assigned to this rule can
better be assigned to 𝑑3 to lower the maximal risk). Just as in example 6.27 the minimax rule is the rule for
which 𝑅(𝜃0 , 𝑑 ∗ ) = 𝑅(𝜃1 , 𝑑 ∗ ). From the form of the risk set it then follows that the minimax rule is of the
form
𝑝₁ (0, 1) + 𝑝₃ (2∕3, 1∕4)
with 𝑝₁ + 𝑝₃ = 1. Both coordinates are equal for 𝑝₁ = 5∕17, resulting in the minimax risk being equal to
8∕17. Note that this number is indeed smaller than the maximum risk of the best non-randomised estimator
(which is 𝑑₃, with maximum risk equal to 2∕3).
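These numbers are easily verified (a two-line Julia check):

r1, r3 = (0.0, 1.0), (2/3, 1/4)          # risk points of d₁ and d₃
println((5/17) .* r1 .+ (12/17) .* r3)   # both coordinates equal 8∕17 ≈ 0.47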
In the previous example the minimax estimator is a randomised estimator. The need of an external
randomisation device (completely independent of the data) is unappealing to many statisticians, but a con-
sequence of taking the maximum risk as optimality criterion.
Example 6.33 (Continuation of example 6.28). We now turn to Bayes rules. Let 𝜂 be the prior probability
of {𝜃₀}. We wish to find the rule 𝑑ᵢ (𝑖 ∈ {1, 2, 3, 4}) for which
𝑐 = 𝜂𝑅(𝜃₀, 𝑑) + (1 − 𝜂)𝑅(𝜃₁, 𝑑)
is smallest. Rewriting this equation gives
𝑅(𝜃₁, 𝑑) = 𝑐∕(1 − 𝜂) − (𝜂∕(1 − 𝜂)) 𝑅(𝜃₀, 𝑑).
As an example, suppose 𝜂 = 1∕2. Then the Bayes rule is found where a line with slope −1 touches the risk
set: this is at 𝑑₃.
Exercise 6.12 Suppose 𝜂 = 3∕4 in the preceding example; what is the Bayes rule? Is it unique?
Repeat for 𝜂 = 9∕17.
The previous exercise shows that Bayes rules need not be unique, and may be randomised. However, for
any randomised rule there is a non-randomised rule with the same Bayes risk. Hence, within the Bayesian
setup randomised rules are not needed, which can hardly be surprising when thinking about the likelihood
principle.
Exercise 6.13 In example 6.28 assume 𝜃₀ = 1∕10 and 𝜃₁ = 1∕2.
2. Find the minimax estimator and the prior for which it is a Bayes rule.
Theorem 6.34 (Minimax theorem, Schervish [1995] page 172). Suppose that the loss function is bounded
below and Ω is finite. Then
sup_𝜇 inf_𝑑 𝑟(𝜇, 𝑑) = inf_𝑑 sup_{𝜃∈Ω} 𝑅(𝜃, 𝑑),    (6.6)
and the supremum on the left-hand side is attained. A prior 𝜇₀ attaining the supremum in (6.6) is called least
favourable. If the risk set 𝒮 is closed from below, then there is a minimax rule that is a Bayes rule with
respect to 𝜇₀.
Exercise 6.14 [YS exercise 2.5.] Bacteria are distributed at random in a fluid, with mean density
𝜃 per unit volume, for some 𝜃 ∈ 𝐻 ⊆ [0, ∞). This means that the number of bacteria in a sample
of volume 𝑣 has the 𝑃𝑜𝑖𝑠(𝜃𝑣)-distribution. We remove a sample of volume 𝑣 from the fluid and
test it for the presence or absence of bacteria. On the basis of this information we have to decide
whether there are any bacteria in the fluid at all. An incorrect decision will result in a loss of 1, a
correct decision in no loss.
(a) Suppose 𝐻 = [0, ∞). The problem can be cast into the decision framework by defining the
action space 𝔸 = {𝑎₀, 𝑎₁} with 𝑎₀ = {decide 𝜃 = 0} and 𝑎₁ = {decide 𝜃 > 0}. Describe all
the non-randomised decision rules for this problem and calculate their risk. Which of these
rules are admissible?
(b) Now suppose 𝐻 = {0, 1}. Show that the minimax rule, i.e. the rule minimising
sup_{𝜃∈𝐻} 𝑅(𝜃, 𝑑), where 𝑅(𝜃, 𝑑) is the expected loss in applying 𝑑 under 𝑃𝜃, is a randomised
rule, where the randomisation consists of tossing a coin that lands heads with probability
(1 + 𝑒^𝑣)^{−1}.
(c) Now suppose again that 𝐻 = [0, ∞), just as under (a). Determine the Bayes decision rules
and Bayes risk for the prior
𝜇Θ( d𝜃) = (1∕3) 𝛿₀( d𝜃) + (2∕3) 𝑒^{−𝜃} d𝜃.
Hint: The Bayes risk of the rule 𝑑 is given by
(1∕3) 𝑅(0, 𝑑) + (2∕3) ∫₀^∞ 𝑅(𝜃, 𝑑) 𝑒^{−𝜃} d𝜃.
(d) If it costs additionally 𝑣∕24 to test a sample of volume 𝑣 (so 𝑣∕24 is added to the loss-function),
what is the optimal volume to test? What if the cost is 1∕6 per unit volume?
Exercise 6.15 [YS exercise 2.4] An unmanned rocket is being launched in order to place in orbit
an important new communications satellite. At the time of launching, a certain crucial electronic
component is either functioning or not functioning. In the control centre there is a warning light
that is not completely reliable. If the crucial component is not functioning, the warning light goes
on with probability 2∕3; if the component is functioning, it goes on with probability 1∕4. At the
time of launching, an observer notes whether the warning light is on or off. It must then be decided
immediately whether or not to launch the rocket.
There is no loss associated with launching the rocket with the component functioning, or aborting
the launch when the component is not functioning. However, if the rocket is launched when the
component is not functioning, the satellite will fail to reach the desired orbit. The Space Shuttle
mission required to rescue the satellite and place it in the correct orbit will cost 10 billion dollars.
Delays caused by the decision not to launch when the component is functioning result, through lost
revenue, in a loss of 5 billion dollars.
Suppose that the prior probability that the component is not functioning is 𝜓 = 2∕5. If the warning
light does not go on, what is the decision according to the Bayes rule? For what values of the prior
probability 𝜓 is the Bayes decision to launch the rocket, even if the warning light comes on?
Hints: The actions can be defined as
𝑎₀ = do not launch the rocket,    𝑎₁ = launch the rocket.
The observation is
𝑋 = 1 if the warning light turns on, and 𝑋 = 0 if it does not.
First write down the loss function and all non randomised decision rules. Next, compute the risk
function.
6.7 Minimax-Bayes connections

Theorem 6.35 (Neyman-Pearson fundamental lemma, Schervish [1995] theorem 3.87). Let Ω = 𝔸 =
{0, 1} and assume the loss function is given by the following table:
𝜃∖𝑎     0     1
0        0     𝑘₀
1        𝑘₁    0

with 𝑘₀ > 0 and 𝑘₁ > 0. Let 𝑓ᵢ(𝑥) = d𝑃ᵢ∕d𝜈(𝑥) for 𝑖 = 0, 1, where 𝜈 = 𝑃₀ + 𝑃₁. Let 𝒟 denote the class of
all rules with test function of one of the following forms:
• For 0 < 𝑘 < ∞,
𝜙_𝑘(𝑥) = 1 if 𝑓₁(𝑥) > 𝑘𝑓₀(𝑥), 𝜙_𝑘(𝑥) = 0 if 𝑓₁(𝑥) < 𝑘𝑓₀(𝑥), and 𝜙_𝑘(𝑥) ∈ [0, 1] arbitrary (measurable) where 𝑓₁(𝑥) = 𝑘𝑓₀(𝑥).
• For 𝑘 = 0,
𝜙₀(𝑥) = 1 if 𝑓₁(𝑥) > 0, and 𝜙₀(𝑥) = 0 if 𝑓₁(𝑥) = 0.
• For 𝑘 = ∞,
𝜙_∞(𝑥) = 1 if 𝑓₀(𝑥) = 0, and 𝜙_∞(𝑥) = 0 if 𝑓₀(𝑥) > 0.
Then 𝒟 is a minimal complete class.
Obviously, in the setting of this theorem there is no need to look for tests other than likelihood ratio
tests. As Parmigiani and Inoue [2009] put it (page 168):
With this result we have come in a complete circle: the Neyman-Pearson theory was the seed
that started Wald’s statistical decision theory: minimal completeness is the ultimate rationality
endorsement for a statistical approach within that theory – all and only the rules generated by
the approach are worth considering. The Neyman-Pearson tests are a minimal complete class.
Also, for each of these tests we can find a prior for which that test is the formal Bayes rule. What
is left to argue about?
Theorem 6.36 (Complete class theorem, Schervish [1995] theorem 3.95). Suppose |Ω| = 𝑘, the loss function
is bounded below, and the risk set is closed from below. Then the set of all Bayes rules is a complete class, and
the set of admissible Bayes rules is a minimal complete class. These are also the rules whose risk functions
are on the boundary of the risk set.
Definition 6.37. The decision rule 𝑑0 is called extended Bayes if for each 𝜀 > 0 there exists a (proper) prior
𝜇_𝜀 such that
𝑟(𝜇_𝜀, 𝑑₀) ≤ 𝜀 + inf_𝑑 𝑟(𝜇_𝜀, 𝑑).
A Bayes rule is always extended Bayes. The concept of extended Bayes rule involves a weaker require-
ment than Bayes rule because:
• the prior may depend on 𝜀;
• the Bayes risk need only be attained by 𝑑0 up to an amount 𝜀.
Example 6.38. If 𝑋 ∼ 𝑁 (𝜃, 1), 𝐿(𝜃, 𝑎) = (𝜃 − 𝑎)2 , then 𝑑0 = 𝑋 is not a Bayes rule by Proposition 6.52
ahead (note that 𝑑0 is unbiased for estimating 𝜃 and that this proposition says that Bayes rules are necessarily
biased). However, it is an extended Bayes rule, as we now show.
As 𝑅(𝜃, 𝑑₀) = E𝜃((𝜃 − 𝑋)²) = 1, we have 𝑟(𝜇, 𝑑₀) = 1 for any prior 𝜇. Take the 𝑁(0, 𝜎²)-prior on 𝜃,
and denote it by 𝜇_𝜎. It follows that 𝑟(𝜇_𝜎, 𝑑₀) = 1.
As we assume quadratic loss, the Bayes rule is given by the posterior mean,
𝑑_𝜎(𝑋) = (𝜎²∕(1 + 𝜎²)) 𝑋.
Cf. Example 4.4. This rule shrinks the observation 𝑋 towards the prior mean (which is assumed zero here).
The risk function of 𝑑_𝜎 is given by
𝑅(𝜃, 𝑑_𝜎) = E𝜃[(𝜃 − 𝜎²(1 + 𝜎²)^{−1}𝑋)²] = 𝜃² − (2𝜎²∕(1 + 𝜎²))𝜃² + (𝜎⁴∕(1 + 𝜎²)²)(1 + 𝜃²) = 𝜃²∕(1 + 𝜎²)² + 𝜎⁴∕(1 + 𝜎²)².
This implies
𝑟(𝜇_𝜎, 𝑑_𝜎) = E_{𝜃∼𝜇_𝜎} 𝑅(𝜃, 𝑑_𝜎) = 𝜎²∕(1 + 𝜎²).
For any 𝜎 > 0 we have
𝑟(𝜇_𝜎, 𝑑₀) = 1 = (1 − 𝜎²∕(1 + 𝜎²)) + 𝑟(𝜇_𝜎, 𝑑_𝜎).
As 𝑑_𝜎 minimises 𝑟(𝜇_𝜎, 𝑑) over all decision rules (it is the Bayes rule!), it follows that
𝑟(𝜇_𝜎, 𝑑₀) = 1 = (1 − 𝜎²∕(1 + 𝜎²)) + inf_𝑑 𝑟(𝜇_𝜎, 𝑑).
The claim now follows upon taking 𝜎 = 𝜎_𝜀 such that 𝜀 = 1∕(1 + 𝜎_𝜀²).
The idea of the proof is to take a class of prior distributions for which the posterior is tractable. The
following example is similar in spirit.
Example 6.39. Suppose 𝑋 ∼ 𝑃𝑜𝑖𝑠(𝜃) and the loss function is given by 𝐿(𝜃, 𝑎) = (𝜃 − 𝑎)²∕𝜃. We prove that
𝑑₀(𝑋) = 𝑋 is extended Bayes. First note that for any prior 𝜇
𝑟(𝜇, 𝑑₀) = ∫ E𝜃[(𝜃 − 𝑋)²∕𝜃] 𝜇( d𝜃) = 1.
Take a sequence of priors 𝜇_𝜆 such that under 𝜇_𝜆, Θ ∼ 𝐸𝑥𝑝(𝜆). Then
𝑓Θ∣𝑋(𝜃 ∣ 𝑥) ∝ 𝜃^𝑥 𝑒^{−(1+𝜆)𝜃}
and hence Θ ∣ 𝑋 ∼ 𝐺𝑎(𝑋 + 1, 1 + 𝜆). To find the Bayes rule for the given loss function, we need to choose
𝑑 to minimise
𝑑 ↦ ∫ (𝑑 − 𝜃)² 𝜃^{𝑥−1} 𝑒^{−(1+𝜆)𝜃} d𝜃.
But this minimiser is exactly the mean of the 𝐺𝑎(𝑋, 1 + 𝜆)-distribution and thus the Bayes rule is given by
𝑑_𝜆 = 𝑋∕(1 + 𝜆). To compute the Bayes risk of 𝑑_𝜆, first note that
𝑅(𝜃, 𝑑_𝜆) = (1∕𝜃) E[(𝑋∕(1 + 𝜆) − 𝜃)²] = (1 + 𝜆²𝜃)∕(1 + 𝜆)².
We leave it as an exercise to show that then 𝑟(𝜇_𝜆, 𝑑_𝜆) = 1∕(1 + 𝜆), so that 𝑟(𝜇_𝜆, 𝑑₀) = 1 = 𝜆∕(1 + 𝜆) + inf_𝑑 𝑟(𝜇_𝜆, 𝑑);
taking 𝜆 = 𝜆_𝜀 with 𝜀 = 𝜆_𝜀∕(1 + 𝜆_𝜀) shows that 𝑑₀ is extended Bayes.
Both of the decision rules 𝑑₀ in the previous examples have constant risk; such rules are called equaliser
rules. The following theorem shows that an extended Bayes rule with constant risk is minimax, generalising
the two examples just given.
Theorem 6.41. Suppose 𝑑0 is extended Bayes and 𝑅(𝜃, 𝑑0 ) is constant for all 𝜃. Then 𝑑0 is minimax.
Proof. Suppose 𝑅(𝜃, 𝑑₀) = 𝐶 and that 𝑑₀ is not minimax. Then there exists a rule 𝑑′ for which
sup_𝜃 𝑅(𝜃, 𝑑′) < 𝐶, so let sup_𝜃 𝑅(𝜃, 𝑑′) = 𝐶 − 𝜀 for some 𝜀 > 0. As 𝑑₀ is extended Bayes, we can find a
prior 𝜇_{𝜀∕2} such that
𝐶 = 𝑟(𝜇_{𝜀∕2}, 𝑑₀) ≤ 𝜀∕2 + inf_𝑑 𝑟(𝜇_{𝜀∕2}, 𝑑) ≤ 𝜀∕2 + 𝑟(𝜇_{𝜀∕2}, 𝑑′) ≤ 𝜀∕2 + 𝐶 − 𝜀 < 𝐶,
a contradiction.
Example 6.42. Suppose that 𝑋 ∼ 𝑁 (𝜃, 1) and the loss function is given by 𝐿(𝜃, 𝑎) = (𝜃 −𝑎)2 . Then 𝑑0 = 𝑋
is extended Bayes. As 𝑅(𝜃, 𝑑0 ) = 1, it follows that 𝑋 is minimax.
Theorem 6.43. Suppose 𝑑₀ is a decision rule and 𝜇₁, 𝜇₂, … is a sequence of priors with corresponding Bayes
rules 𝑑₁, 𝑑₂, … such that
1. sup_𝜃 𝑅(𝜃, 𝑑₀) ≤ 𝐶;
2. lim_{𝑛→∞} 𝑟(𝜇ₙ, 𝑑ₙ) = 𝐶.
Then 𝑑₀ is minimax.

Proof. Suppose 𝑑₀ is not minimax. Then there exists a 𝑑′ for which sup_𝜃 𝑅(𝜃, 𝑑′) < sup_𝜃 𝑅(𝜃, 𝑑₀) ≤ 𝐶.
Hence there exists an 𝜀 > 0 such that 𝑅(𝜃, 𝑑′) < 𝐶 − 𝜀 for all 𝜃. As 𝑟(𝜇ₙ, 𝑑ₙ) → 𝐶, we can find an 𝑛 for which
𝑟(𝜇ₙ, 𝑑ₙ) > 𝐶 − 𝜀∕2. So we have
𝑟(𝜇ₙ, 𝑑′) ≤ sup_𝜃 𝑅(𝜃, 𝑑′) < 𝐶 − 𝜀 < 𝐶 − 𝜀∕2 < 𝑟(𝜇ₙ, 𝑑ₙ).
But then 𝑑ₙ cannot be Bayes with respect to 𝜇ₙ, which is a contradiction.
Example 6.44. Suppose 𝑋 ∼ 𝑁_𝑚(𝜃, 𝐼) and we estimate 𝜃 under 𝐿₂-loss with 𝑑₀(𝑋) = 𝑋. Let 𝜇ₙ be the
prior such that Θ ∼ 𝑁_𝑚(0, 𝑛𝐼). The Bayes rule is
𝑑ₙ(𝑋) = 𝑛𝑋∕(𝑛 + 1).
The Bayes risk satisfies
𝑟(𝜇ₙ, 𝑑ₙ) = 𝑚𝑛∕(𝑛 + 1),
which converges to 𝑚 as 𝑛 → ∞. As 𝑅(𝜃, 𝑑₀) = 𝑚 is constant, it follows from Theorem 6.43 that 𝑋 is minimax.
Example 6.45. Suppose 𝑋₁, …, 𝑋ₙ are independent 𝑁(𝜃, 1). For estimation of 𝜃 under 𝐿₂-loss, 𝑑(𝑋) = 𝑋̄ₙ
is minimax. To see this, first note that 𝑋̄ₙ is an equaliser rule:
E[(𝑋̄ₙ − 𝜃)²] = 1∕𝑛.
Now take 𝜇_𝑘 the 𝑁(0, 𝑘)-prior and let 𝑑_𝑘 denote the corresponding Bayes rule. Some computations show that
𝑟(𝜇_𝑘, 𝑑_𝑘) = (𝑘∕𝑛)∕(𝑘 + 1∕𝑛),
which tends to 1∕𝑛 as 𝑘 → ∞.
Definition 6.46. A prior 𝜇₀ for which 𝑟(𝜇, 𝑑_𝜇) is maximised is called a least favourable prior:
𝑟(𝜇₀, 𝑑_{𝜇₀}) = sup_𝜇 𝑟(𝜇, 𝑑_𝜇).

Theorem 6.47. Suppose 𝑑_𝜇 is a Bayes rule with respect to the prior 𝜇 and sup_𝜃 𝑅(𝜃, 𝑑_𝜇) = 𝑟(𝜇, 𝑑_𝜇). Then
1. 𝑑_𝜇 is minimax.
2. If 𝑑_𝜇 is unique Bayes with respect to 𝜇, then 𝑑_𝜇 is unique minimax.
3. 𝜇 is least favourable.
Proof. Let 𝑑 be another rule. Then
sup_𝜃 𝑅(𝜃, 𝑑_𝜇) = 𝑟(𝜇, 𝑑_𝜇) ≤ 𝑟(𝜇, 𝑑) ≤ sup_𝜃 𝑅(𝜃, 𝑑).
Hence 𝑑_𝜇 is minimax. If 𝑑_𝜇 is unique Bayes, then the first inequality is strict, and then 𝑑_𝜇 is unique minimax.
Let 𝜇* be another prior distribution. Then
𝑟(𝜇*, 𝑑_{𝜇*}) ≤ 𝑟(𝜇*, 𝑑_𝜇) ≤ sup_𝜃 𝑅(𝜃, 𝑑_𝜇) = 𝑟(𝜇, 𝑑_𝜇),
so 𝜇 is least favourable.
Example 6.48. Suppose 𝑋 ∣ Θ = 𝜃 ∼ 𝐵𝑖𝑛(𝑛, 𝜃) and consider squared error loss. Take Θ ∼ 𝐵𝑒(𝑎, 𝑏); then
the Bayes rule is given by
𝑑(𝑋) = E[Θ ∣ 𝑋] = (𝑎 + 𝑋)∕(𝑎 + 𝑏 + 𝑛).
If 𝑎 = 𝑏 = √𝑛∕2, then
𝑅(𝜃, 𝑑) = E[(𝑑(𝑋) − 𝜃)²]
does not depend on 𝜃. It follows from Theorem 6.47 that 𝑑(𝑋) is the unique minimax rule and that the
𝐵𝑒(√𝑛∕2, √𝑛∕2)-prior is least favourable.
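The constancy of the risk is quickly verified numerically; a Julia sketch (the value 𝑛 = 25 and the grid of 𝜃 values are arbitrary choices of mine):

n = 25
a = sqrt(n) / 2                                  # prior parameters a = b = √n∕2
d(x) = (a + x) / (2a + n)                        # posterior mean
binpmf(θ, x) = binomial(n, x) * θ^x * (1 - θ)^(n - x)
risk(θ) = sum(binpmf(θ, x) * (d(x) - θ)^2 for x in 0:n)
println([risk(θ) for θ in 0.1:0.2:0.9])          # all entries coincide: constant risk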
Exercise 6.16 [YS exercise 2.7.] In the context of a finite decision problem, decide whether each
of the following statements is true, providing a proof or counterexample as appropriate.
1. The Bayes risk of a minimax rule is never greater than the minimax risk.
Exercise 6.17 [YS exercise 3.4.] Suppose 𝑋 ∣ Θ = 𝜃 ∼ 𝐵𝑖𝑛 (𝑛, 𝜃) and Θ ∼ 𝑈 𝑛𝑖𝑓 (0, 1). Consider
loss function
𝐿(𝜃, 𝑑) = (𝜃 − 𝑑)² ∕ (𝜃(1 − 𝜃)).
Derive the Bayes rule. Is it minimax?
Hint: In order to show that the Bayes rule is minimax, show that the risk of the Bayes rule is constant
and apply Theorem 6.41.
Exercise 6.18 * Let Θ = [0, 1), 𝔸 = [0, 1] and 𝐿(𝜃, 𝑎) = (𝜃 − 𝑎)2 ∕(1 − 𝜃). Suppose 𝑋 is a random
variable with probability mass function
P𝜃 (𝑋 = 𝑥) = (1 − 𝜃)𝜃 𝑥 , 𝑥 = 0, 1, 2, …
1. Write the risk function 𝑅(𝜃, 𝑑) for a decision rule 𝑑 as a power series in 𝜃.
2. Show that the only nonrandomised equaliser rule is 𝑑(0) = 1∕2, 𝑑(1) = 𝑑(2) = ⋯ = 1.
Hint: Proving that the given rule is an equaliser rule should not be too hard. Proving that it
is the only equaliser rule is somewhat harder. A first step consists of showing that for 𝓁 ≥ 2,
an equaliser rule satisfies 𝑑(𝓁 − 1) ≥ 𝑑(𝓁).
6.8 The role of sufficient statistics

The Rao-Blackwell theorem asserts that for convex loss functions we only need to consider decision rules
that depend on sufficient statistics.
Theorem 6.49 (Rao-Blackwell theorem). Suppose the action space 𝔸 of a statistical decision problem is a
convex subset of ℝ𝑚 and that for all 𝜃 ∈ Θ, 𝐿(𝜃, 𝑎) is a convex function of 𝑎. Suppose also that 𝑇 is sufficient
for 𝜃 and 𝑑0 is a nonrandomised decision rule such that E𝜃 [‖𝑑0 (𝑋)‖] < ∞. Define
𝑑₁(𝑇) = E𝜃[𝑑₀(𝑋) ∣ 𝑇].
Then 𝑅(𝜃, 𝑑₁) ≤ 𝑅(𝜃, 𝑑₀) for all 𝜃.

Proof. By the conditional version of Jensen's inequality, for each 𝜃,
𝐿(𝜃, 𝑑₁(𝑇)) = 𝐿(𝜃, E𝜃[𝑑₀(𝑋) ∣ 𝑇]) ≤ E𝜃[𝐿(𝜃, 𝑑₀(𝑋)) ∣ 𝑇].
Now take the expectation on both sides with respect to the distribution of 𝑇 under 𝑋 ∼ P𝜃 to conclude that
𝑅(𝜃, 𝑑₁) ≤ 𝑅(𝜃, 𝑑₀).
Note that, by sufficiency, E𝜃 [𝑑0 (𝑋) ∣ 𝑇 ] does not depend on 𝜃. The result gives a direct recipe for
improving estimators, as illustrated by the following examples.
Example 6.50. Suppose 𝑋1 , … , 𝑋𝑛 are independent 𝑁(𝜃, 1) random variables. Suppose we wish to estimate
𝜂 = ℙ𝜃 (𝑋1 ≤ 𝑐) = Φ(𝑐 − 𝜃) under quadratic loss
A naive estimator that is unbiased for 𝜂 is given by 𝑆 = 𝑛^{−1} Σᵢ₌₁ⁿ 𝟏{𝑋ᵢ ≤ 𝑐}. However, this estimator is
not a function of the sufficient statistic 𝑇 = Σᵢ₌₁ⁿ 𝑋ᵢ. As the loss function is convex in 𝑎, we obtain an improved
estimator by calculating
E[𝑆 ∣ 𝑇] = P𝜃(𝑋₁ ≤ 𝑐 ∣ 𝑇) = Φ((𝑐 − 𝑇∕𝑛) ∕ √((𝑛 − 1)∕𝑛)).
The conditional expectation in fact does not depend on 𝜃 (by sufficiency). The final equality follows from general
results on the multivariate normal distribution (you can skip the details of this computation if you wish).
Exercise 6.19 [YS exercise 6.3.] Independent factory-produced items are packed in boxes each
containing 𝑘 items. The probability that an item is in working order is 𝜃 with 0 < 𝜃 < 1. A sample
of 𝑛 boxes are chosen for testing, and 𝑋𝑖 , the number of working items in the 𝑖-th box, is noted.
Thus 𝑋1 , … , 𝑋𝑛 are a sample from a binomial distribution, 𝐵𝑖𝑛 (𝑘, 𝜃), with index 𝑘 and parameter
𝜃. It is required to estimate the probability, 𝜃 𝑘 , that all items in a box are in working order. Find the
minimum variance unbiased estimator, justifying your answer. Proceed along the following steps:
1. Show that 𝑇 = Σᵢ₌₁ⁿ 𝑋ᵢ is sufficient for 𝜃.
2. Show that 𝑆 = 𝑛^{−1} Σᵢ₌₁ⁿ 𝟏{𝑋ᵢ = 𝑘} is an unbiased estimator of 𝜃ᵏ.
Example 6.51. Suppose interest lies in approximating the integral 𝐼 = ∫ ℎ(𝑥)P(𝑑𝑥) where P is a probability
measure and ℎ a measurable function on ℝ𝑘 such that ∫ |ℎ(𝑥)|P(𝑑𝑥) < ∞. Especially when 𝑘 is large,
Monte-Carlo simulation is a common way to approximate 𝐼. Hence, suppose 𝑋1 , … , 𝑋𝑛 are draws from P,
then a straightforward (unbiased) estimator is defined by
𝐼̂ = (1∕𝑛) Σᵢ₌₁ⁿ ℎ(𝑋ᵢ).
Now suppose each draw can be decomposed as 𝑋ᵢ = (𝑌ᵢ, 𝑍ᵢ) and that 𝑔(𝑦) ∶= E[ℎ(𝑋) ∣ 𝑌 = 𝑦] can be
computed in closed form. The conditioned estimator
𝐼̃ = (1∕𝑛) Σᵢ₌₁ⁿ 𝑔(𝑌ᵢ)
is unbiased as well, and by the conditional Jensen argument underlying the Rao-Blackwell theorem,
Var 𝑔(𝑌) ≤ Var ℎ(𝑋).
This implies that the Mean Squared Error of 𝐼̂ can never be smaller than that of 𝐼̃ and one should prefer 𝐼̃
for estimating 𝐼 (provided that the computational effort of evaluating ℎ(𝑋) and 𝑔(𝑌 ) is comparable). The
Rao-Blackwell theorem asserts that ideally we decompose 𝑋 such that we condition on a sufficient statistic.
Consult page 50 of Young and Smith [2005] to see a nice example of this procedure, which is referred to as
Rao-Blackwellisation.
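For a concrete (hypothetical) instance of this in Julia, take 𝐼 = P(𝑌 + 𝑍 > 0) with 𝑌, 𝑍 independent standard normal, so that conditioning on 𝑌 gives 𝑔(𝑦) = E[𝟏{𝑦 + 𝑍 > 0}] = Φ(𝑦):

using Distributions, Statistics

Φ(y) = cdf(Normal(), y)
n = 10_000
Y, Z = randn(n), randn(n)
Î = mean((Y .+ Z) .> 0)       # plain Monte Carlo: h(X) = 1{Y + Z > 0}
Ĩ = mean(Φ.(Y))               # conditioned (Rao-Blackwellised) estimator
println((Î, Ĩ))               # both ≈ 1∕2, but Ĩ is markedly less variable

(The variances are 1∕(4𝑛) and 1∕(12𝑛) respectively, so the conditioned estimator is roughly three times more efficient here.)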
6.9 Bayes rules and unbiasedness
Proposition 6.52. Suppose 𝑔 ∶ Ω → ℝ is measurable. Let 𝑑 be a Bayes rule for estimating 𝑔(𝜃) under 𝐿2
loss using the prior 𝜇. If 𝑟(𝜇, 𝑑) ≠ 0, then 𝑑 is biased for 𝑔(𝜃).
Proof. Suppose 𝑑 is unbiased for 𝑔(𝜃), i.e. E𝜃[𝑑(𝑋)] = 𝑔(𝜃) for all 𝜃. Under 𝐿₂-loss the Bayes rule is the
posterior mean, 𝑑(𝑋) = E[𝑔(Θ) ∣ 𝑋]. By conditioning on Θ,
E[𝑔(Θ)𝑑(𝑋)] = E[𝑔(Θ) E[𝑑(𝑋) ∣ Θ]] = E[𝑔(Θ)²].
By conditioning on 𝑋,
E[𝑔(Θ)𝑑(𝑋)] = E[E[𝑔(Θ)𝑑(𝑋) ∣ 𝑋]] = E[𝑑(𝑋) E[𝑔(Θ) ∣ 𝑋]] = E[𝑑(𝑋)²].
Hence 𝑟(𝜇, 𝑑) = E[(𝑔(Θ) − 𝑑(𝑋))²] = E[𝑔(Θ)²] + E[𝑑(𝑋)²] − 2E[𝑔(Θ)𝑑(𝑋)] = 0, contradicting the
assumption 𝑟(𝜇, 𝑑) ≠ 0. Therefore an unbiased rule with 𝑟(𝜇, 𝑑) ≠ 0 cannot be Bayes; equivalently, such a
Bayes rule is biased for 𝑔(𝜃).
Bibliography
G. A. Barnard. Statistical inference. J. Roy. Statist. Soc. Ser. B., 11:115–139; discussion, 139–149, 1949.
ISSN 0035-9246.
Debabrata Basu. Statistical information and likelihood [with discussion]. Sankhyā: The Indian Journal of
Statistics, Series A, pages 1–71, 1975.
James Berger. The case for objective Bayesian analysis. Bayesian Anal., 1(3):385–402, 2006. ISSN 1936-
0975.
James O. Berger. Statistical decision theory and Bayesian analysis. Springer Series in Statistics. Springer-
Verlag, New York, second edition, 1985. ISBN 0-387-96098-8. doi: 10.1007/978-1-4757-4286-2. URL
http://dx.doi.org/10.1007/978-1-4757-4286-2.
James O. Berger, Jose M. Bernardo, and Dongchu Sun. Overall objective priors. Bayesian Anal., 10(1):189–
221, 2015. ISSN 1936-0975. doi: 10.1214/14-BA915. URL http://dx.doi.org/10.1214/14-BA915.
Jose-M. Bernardo and Adrian F. M. Smith. Bayesian theory. Wiley Series in Probability and Mathematical
Statistics: Probability and Mathematical Statistics. John Wiley & Sons, Ltd., Chichester, 1994. ISBN
0-471-92416-4. doi: 10.1002/9780470316870. URL http://dx.doi.org/10.1002/9780470316870.
Allan Birnbaum. On the foundations of statistical inference. J. Amer. Statist. Assoc., 57:269–326, 1962.
ISSN 0162-1459.
Christopher M Bishop. Pattern recognition and machine learning, volume 4. Springer, 2006.
William M. Briggs. Breaking the Law of Averages. Real-Life Probability and Statistics in Plain English.
2008. URL http://wmbriggs.com/public/briggs_breaking_law_averages.pdf.
Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo.
CRC Press, 2011.
Aubrey Clayton. Bernoulli’s fallacy. In Bernoulli’s Fallacy. Columbia University Press, 2021.
Jacob Cohen. The earth is round (p < .05). American Psychologist, 49:997–1003, 1994.
National Research Council. Frontiers in Massive Data Analysis. Committee on the Analysis of Massive Data,
Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Their Applications
& Division on Engineering and Physical Sciences. The National Academies Press, 2013.
M. Dashti and A. M. Stuart. The Bayesian Approach To Inverse Problems. ArXiv e-prints, February 2013.
B. Efron. Why isn’t everyone a Bayesian? Amer. Statist., 40(1):1–11, 1986. ISSN 0003-1305. doi: 10.2307/
2683105. URL http://dx.doi.org/10.2307/2683105. With discussion and a reply by the author.
Bradley Efron. Bayesians, frequentists, and scientists. J. Amer. Statist. Assoc., 100(469):1–5, 2005.
ISSN 0162-1459. doi: 10.1198/016214505000000033. URL http://dx.doi.org/10.1198/
016214505000000033.
Bradley Efron and David V. Hinkley. Assessing the accuracy of the maximum likelihood estimator: observed
versus expected Fisher information. Biometrika, 65(3):457–487, 1978. ISSN 0006-3444. doi: 10.1093/
biomet/65.3.457. URL http://dx.doi.org/10.1093/biomet/65.3.457. With comments by Ole
Barndorff-Nielsen, A. T. James, G. K. Robinson and D. A. Sprott and a reply by the authors.
Bradley Efron and Carl Morris. Data analysis using Stein's estimator and its generalizations. Journal of the
American Statistical Association, 70(350):311–319, 1975.
Ronald A. Fisher. Mathematical probability in the natural sciences. Technometrics, 1:21–29, 1959. ISSN
0040-1706.
Greg Gandenberger. A new proof of the likelihood principle. British J. Philos. Sci., 66(3):475–503, 2015.
ISSN 0007-0882. doi: 10.1093/bjps/axt039. URL http://dx.doi.org/10.1093/bjps/axt039.
Jayanta K. Ghosh, Mohan Delampady, and Tapas Samanta. An introduction to Bayesian analysis. Springer
Texts in Statistics. Springer, New York, 2006. ISBN 978-0387-40084-6; 0-387-40084-2. Theory and
methods.
Piet Groeneboom and Geurt Jongbloed. Nonparametric estimation under shape constraints, volume 38 of
Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York,
2014. ISBN 978-0-521-86401-5. doi: 10.1017/CBO9781139020893. URL http://dx.doi.org/10.
1017/CBO9781139020893. Estimators, algorithms and asymptotics.
E. T. Jaynes. Probability theory. Cambridge University Press, Cambridge, 2003. ISBN 0-521-59271-2. doi:
10.1017/CBO9780511790423. URL http://dx.doi.org/10.1017/CBO9780511790423. The logic of
science, Edited and with a foreword by G. Larry Bretthorst.
Robert W. Keener. Theoretical statistics. Springer Texts in Statistics. Springer, New York, 2010.
ISBN 978-0-387-93838-7. doi: 10.1007/978-0-387-93839-4. URL http://dx.doi.org/10.1007/
978-0-387-93839-4. Topics for a core course.
B.J.K. Kleijn. The frequentist theory of Bayesian statistics. Springer Verlag, New York, 2020.
Lucien LeCam. On some asymptotic properties of maximum likelihood estimates and related Bayes’ esti-
mates. Univ. California Publ. Statist., 1:277–329, 1953.
D. V. Lindley and L. D. Phillips. Inference for a Bernoulli process (a Bayesian view). Amer. Statist., 30(3):
112–119, 1976. ISSN 0003-1305.
Dennis V. Lindley. The 1988 wald memorial lectures: The present position in bayesian statistics. Statistical
Science, 5(1):44–65, 1990. ISSN 08834237. URL http://www.jstor.org/stable/2245880.
R. McElreath. Statistical Rethinking: a Bayesian Course with Examples in R and Stan. Chapman and
Hall–CRC, 2015.
Ronald Meester. Waarom p-waardes niet gebruikt mogen worden als statistisch bewijs. Nieuw archief voor
de wiskunde, 2019.
Kevin P. Murphy. Machine learning: a probabilistic perspective. MIT Press, Cambridge, MA, 2012.
Giovanni Parmigiani and Lurdes Y. T. Inoue. Decision theory. Wiley Series in Probability and Statistics.
John Wiley & Sons, Ltd., Chichester, 2009. ISBN 978-0-471-49657-1. doi: 10.1002/9780470746684.
URL http://dx.doi.org/10.1002/9780470746684. Principles and approaches, With contributions
by Hedibert F. Lopes.
Christian P. Robert. The Bayesian choice. Springer Texts in Statistics. Springer, New York, second edition,
2007. ISBN 978-0-387-71598-8. From decision-theoretic foundations to computational implementation.
Christian P. Robert and George Casella. Monte Carlo statistical methods. Springer Texts in Statis-
tics. Springer-Verlag, New York, second edition, 2004. ISBN 0-387-21239-6. doi: 10.1007/
978-1-4757-4145-2. URL http://dx.doi.org/10.1007/978-1-4757-4145-2.
Simo Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013.
Leonard J. Savage. The foundations of statistics reconsidered. In Proceedings of the Fourth Berkeley Sym-
posium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics,
pages 575–586, Berkeley, Calif., 1961. University of California Press. URL http://projecteuclid.
org/euclid.bsmsp/1200512183.
Mark J. Schervish. Theory of statistics. Springer Series in Statistics. Springer-Verlag, New York,
1995. ISBN 0-387-94546-6. doi: 10.1007/978-1-4612-4250-5. URL http://dx.doi.org/10.1007/
978-1-4612-4250-5.
Mark J. Schervish. 𝑃 values: what they are and what they are not. Amer. Statist., 50(3):203–206, 1996.
ISSN 0003-1305. doi: 10.2307/2684655. URL http://dx.doi.org/10.2307/2684655.
Jun Shao. Mathematical statistics. Springer Texts in Statistics. Springer-Verlag, New York, second edition,
2003. ISBN 0-387-95382-5. doi: 10.1007/b97553. URL http://dx.doi.org/10.1007/b97553.
Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955,
vol. I, pages 197–206. University of California Press, Berkeley and Los Angeles, 1956.
Y. W. Teh and M. I. Jordan. Hierarchical Bayesian nonparametric models with applications. In N. Hjort,
C. Holmes, P. Muller, and S. Walker, editors, Bayesian Nonparametrics: Principles and Practice. Cam-
bridge University Press, 2010.
Luke Tierney. A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab., 8(1):1–9,
1998. ISSN 1050-5164. doi: 10.1214/aoap/1027961031. URL http://dx.doi.org/10.1214/aoap/
1027961031.
A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, Cambridge, 1998. ISBN 0-521-49603-9; 0-521-78450-6. doi:
10.1017/CBO9780511802256. URL http://dx.doi.org/10.1017/CBO9780511802256.
Robert L. Winkler. An introduction to Bayesian inference and decision / Robert L. Winkler. Holt, Rinehart
and Winston New York, 1972. ISBN 0030813271.
Shelemyahu Zacks. Examples and problems in mathematical statistics. John Wiley & Sons, Inc., Hoboken,
NJ, 2014. ISBN 978-1-118-60550-9.