STM Summary Notes
Dr Roberto Trotta
January 23, 2010
This is a summary of the notes I used for the nine-lecture course Statistics of Measurement
for the 2nd year physics degree at Imperial College London. While I hope the material will be
useful for revision, you should not rely solely on it, but complement it with your own notes from
the lectures (in particular, covering examples and derivations, which are mostly omitted from
this summary) and other textbook sources.
Every effort has been made to correct any typos, but invariably some will remain. I would be
grateful if you could point them out to me by e-mailing your corrections to: [email protected].
A list of errata corrige will be posted and maintained on Blackboard.
Contents
1 Probabilities
2 Random variables, parent distributions and samples
3 Discrete probability distributions
4 Properties of discrete distributions: expectation value and variance
5 Properties of continuous distributions
6 The Gaussian (or Normal) distribution
7 The Central Limit Theorem
8 The likelihood function
9 The Maximum Likelihood Principle
10 Confidence intervals
11 Propagation of errors
12 Bayesian statistics
1 Probabilities
Let A, B, C, . . . denote propositions (e.g., that a coin toss gives tails). Let Ω describe
the sample space of the experiment, i.e., Ω is a list of all the possible outcomes of the
experiment (for the coin tossing example, Ω = {T, H}, where T denotes tails and H
denotes heads).
When the experiment is performed, the outcome O (e.g., coin lands T) is selected with
probability

P(O) = \frac{n(O)}{n},  (1)

where n(O) is the number of possibilities in Ω favourable to O and n is the total number
of possibilities in Ω.
Frequentist definition of probability: the number of times an event occurs divided by the
total number of events, in the limit of an infinite series of equiprobable trials.
The joint probability of A and B is the probability of A and B happening together, and is
denoted by P(A, B).
The conditional probability of A given B is the probability of A happening given that B
has happened, and is denoted by P(A|B).
The sum rule:

P(A) + P(Ā) = 1,  (2)

where Ā denotes the proposition "not A".
The product rule:

P(A, B) = P(A|B)P(B).  (3)

By inverting the order of A and B we obtain that

P(B, A) = P(B|A)P(A)  (4)

and because P(A, B) = P(B, A), we obtain Bayes theorem by equating Eqs. (3) and (4):

P(A|B) = \frac{P(B|A)P(A)}{P(B)}.  (5)
The marginalisation rule (follows from the two rules above):

P(A) = P(A, B_1) + P(A, B_2) + · · · = \sum_i P(A, B_i) = \sum_i P(A|B_i)P(B_i),  (6)

where the sum is over all possible outcomes for proposition B.
Two propositions (or events) are said to be independent if and only if
P(A, B) = P(A)P(B). (7)
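To make these rules concrete, here is a small numerical sketch (not part of the original notes; all numbers are hypothetical) that applies the marginalisation rule, Eq. (6), and Bayes theorem, Eq. (5), to a fictitious diagnostic test in Python:

```python
# Hypothetical diagnostic test: A = "has condition", B = "test is positive".
p_A = 0.01                    # P(A), prevalence (made-up number)
p_B_given_A = 0.95            # P(B|A), true positive rate (made-up number)
p_B_given_notA = 0.05         # P(B|not A), false positive rate (made-up number)

# Marginalisation rule, Eq. (6): P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes theorem, Eq. (5): P(A|B) = P(B|A)P(A)/P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A|B) = {p_A_given_B:.3f}, while P(B|A) = {p_B_given_A}")
```

Note how P(A|B) ≈ 0.16 differs sharply from P(B|A) = 0.95: the order of conditioning matters.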
2 Random variables, parent distributions and samples
Let us assume we have repeatedly measured a certain property (e.g., the temperature of
an object, or a sequence of H/T in coin tossing). We have thus collected a series of real
numbers, x̂_1, x̂_2, . . . , x̂_N, which we call samples (always denoted by symbols with a hat).
A random variable (RV) is a function mapping the sample space Ω of possible outcomes of a
random process to the real numbers (samples), with probability given by the corresponding
probability of outcome, defined in Eq. (1).
A RV can be discrete (only a countable number of outcomes is possible, such as in coin
tossing) or continuous (an uncountable number of outcomes is possible, such as in a
temperature measurement). It is mathematically subtle to carry out the passage from a
discrete to a continuous RV, although as physicists we won't bother too much with
mathematical rigour.
A discrete RV has an associated probability mass function (pmf), which gives the probability
of each outcome. So for example P(X = x_i) = P_i gives the probability of the RV
X assuming the value x_i. In the following we shall use the shorthand notation P(x_i) to
mean P(X = x_i). If X is the RV associated with tossing a fair die once, then P_i = 1/6 for
i = 1, . . . , 6, where x_i is the outcome "the face with i pips comes up".
A continuous RV has an associated probability density function (pdf), denoted by p(X).
The quantity p(x)dx gives the probability that the RV X assumes a value between x and
x + dx.
The choice of parent distribution (i.e., which pmf or pdf to associate with a given random
process) is dictated by the nature of the random process one is investigating.
For a discrete pmf, the cumulative probability distribution function (cdf) is given by

C(x_i) = \sum_{j=1}^{i} P(x_j).  (8)

The cdf gives the probability that the RV X takes on a value less than or equal to x_i,
i.e. C(x_i) = P(X ≤ x_i).
For a continuous pdf, the cdf is given by

P(x) = \int_{−∞}^{x} p(y) dy,  (9)

with the same interpretation as above, i.e. it is the probability that the RV X takes a
value smaller than x.
3 Discrete probability distributions
The uniform distribution: for n equiprobable outcomes between 1 and n, the uniform
discrete distribution is given by

P(r) = \begin{cases} 1/n & \text{for } 1 ≤ r ≤ n \\ 0 & \text{otherwise} \end{cases}  (10)
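As a quick numerical illustration (my addition, not part of the original notes), the pmf of Eq. (10) and the cdf of Eq. (8) can be tabulated directly for the fair-die case n = 6:

```python
import numpy as np

n = 6                                  # fair die: n equiprobable outcomes
pmf = np.full(n, 1.0 / n)              # P(r) = 1/n, Eq. (10)
cdf = np.cumsum(pmf)                   # C(r) = P(1) + ... + P(r), Eq. (8)

for r in range(1, n + 1):
    print(f"r = {r}:  P(r) = {pmf[r - 1]:.3f}   C(r) = {cdf[r - 1]:.3f}")
# C(6) = 1, as required for a normalised pmf.
```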
Figure 1: Left panel: uniform discrete distribution for n = 6. Right panel: the corresponding
cdf.
It is plotted in Fig. 1, alongside its cdf, for the case of the tossing of a fair die (n = 6).
The binomial distribution: the binomial describes the probability of obtaining r suc-
cesses in a sequence of n trials, each of which has probability p of success. Here, "success"
can be defined as one specific outcome in a binary process (e.g., H/T, blue/red, 1/0, etc).
The binomial distribution B(n, p) is given by:

P(r|n, p) ≡ B(n, p) = \binom{n}{r} p^r (1 − p)^{n−r},  (11)

where the "choose" symbol is defined as

\binom{n}{r} ≡ \frac{n!}{(n − r)! r!}  (12)

for 0 ≤ r ≤ n (remember, 0! = 1). Some examples of the binomial for different choices of
n, p are plotted in Fig. 2.
The derivation of the binomial distribution proceeds from considering the probability of
obtaining r successes in n trials (a factor p^r), while at the same time obtaining n − r
failures (a factor (1 − p)^{n−r}). The combinatorial factor in front is derived from
considering the number of permutations that lead to the same total number of successes.
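As a sanity check (my own illustration, not from the notes), Eq. (11) can be evaluated with math.comb and compared against the frequency of successes in simulated trials; n = 10 and p = 0.3 are arbitrary choices:

```python
import math
import numpy as np

def binomial_pmf(r, n, p):
    """P(r|n, p) from Eq. (11)."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.3
rng = np.random.default_rng(42)
draws = rng.binomial(n, p, size=100_000)   # simulated numbers of successes

for r in (0, 3, 6):
    mc = np.mean(draws == r)               # Monte Carlo estimate of P(r|n, p)
    print(f"r = {r}: analytic = {binomial_pmf(r, n, p):.4f}, simulated = {mc:.4f}")
```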
The Poisson distribution: the Poisson distribution describes the probability of the
number of events in a process where events occur with a fixed average rate and indepen-
dently of each other. E.g.: the number of galaxies in the sky, the number of murders in
London, the number of planes landing at Heathrow, the number of photons arriving at a
photomultiplier, etc.

Figure 2: Some examples of the binomial distribution, Eq. (11), for different choices of n, p, and
its corresponding cdf.

Assume that λ is the probability of an event occurring per unit time (with λ = constant:
this is the definition of a Poisson process!). The probability of r events happening in a time
t is given by the Poisson distribution:

P(r|λ, t) ≡ Poisson(λ) = \frac{(λt)^r}{r!} e^{−λt}.  (13)

Notice that this is a discrete pmf in the number of events r, and not a continuous pdf in t
(which is a fixed parameter). The probability of getting r events in a unit time interval is
obtained from the above equation by setting t = 1.
The Poisson distribution of Eq. (13) is plotted in Fig. 3 as a function of r for a few choices
of λ (notice that in the figure t = 1 has been assumed, in the appropriate units). The
derivation of the Poisson distribution follows from considering the probability of 1 event
taking place in a small time interval Δt, then taking the limit Δt → dt → 0. It can also
be shown that the Poisson distribution arises from the Binomial in the limit N → ∞,
p → 0 with λ = pN held fixed.
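The binomial-to-Poisson limit quoted above is easy to verify numerically; the following sketch (an addition of mine, with λ = 3 and r = 2 chosen arbitrarily) holds λ = pN fixed while increasing the number of trials:

```python
import math

def poisson_pmf(r, lam):
    """Eq. (13) with t = 1."""
    return lam**r * math.exp(-lam) / math.factorial(r)

def binomial_pmf(r, n, p):
    """Eq. (11)."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

lam, r = 3.0, 2
for n in (10, 100, 1000):
    p = lam / n                        # keep lam = p*n fixed as n grows
    print(f"n = {n:4d}:  B = {binomial_pmf(r, n, p):.5f}   Poisson = {poisson_pmf(r, lam):.5f}")
```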
4 Properties of discrete distributions: expectation value and variance
The discrete distributions above depend on parameters (such as p for the Binomial, λ for
the Poisson), which control the shape of the distribution. If we know the value of the param-
eters, we can deduce what we will observe when we obtain samples from the distributions
via the measurement process. This is the subject of probability theory, which concerns
itself with the theoretical properties of the distributions. The inverse problem of making
inferences about the parameters from the observed samples is the subject of statistical
inference, addressed later.
Two important properties of distributions are the expectation value (which controls the
location of the distribution) and the variance or dispersion (which controls how much the
distribution is spread out). Expectation value and variance are functions of a RV.
The expectation value E(X) (often called the mean, or expected value¹) of the discrete
RV X is defined as

E(X) = ⟨X⟩ ≡ \sum_i x_i P_i.  (14)
For example, for the tossing of a fair die (which follows the uniform discrete distribution,
Eq. (10)), the expectation value is given by E(X) = \sum_i i \cdot \frac{1}{6} = 21/6.
The variance or dispersion Var(X) of the discrete RV X is defined as

Var(X) ≡ E[(X − E(X))²] = E(X²) − E(X)².  (15)
The square root of the variance is often called the standard deviation and is usually denoted
by the symbol σ, so that Var(X) = σ². For the above example of die tossing, the variance
is given by

Var(X) = \sum_i (x_i − ⟨X⟩)² P_i = \sum_i x_i² P_i − \left(\sum_i x_i P_i\right)² = \sum_i i² \frac{1}{6} − \left(\frac{21}{6}\right)² = \frac{105}{36}.  (16)
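A quick Monte Carlo check of Eqs. (14) and (16) (my addition, not in the original notes): simulate many die rolls and compare the sample mean and variance against the exact values 21/6 and 105/36:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1_000_000)   # fair-die outcomes 1..6

print("sample mean:", rolls.mean(), "  exact:", 21 / 6)
print("sample variance:", rolls.var(), "  exact:", 105 / 36)
```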
¹ We prefer not to use the term "mean" to avoid confusion with the sample mean.
Figure 3: Some examples of the Poisson distribution, Eq. (13), for different choices of λ, and its
corresponding cdf.
For the Binomial distribution of Eq. (11), the expectation value and variance are given by:

E(X) = np,  Var(X) = np(1 − p).  (17)

For the Poisson distribution of Eq. (13), the expectation value and variance are given by:

E(X) = λ,  Var(X) = λ.  (18)
5 Properties of continuous distributions

As we did above for the discrete distributions, we now define the following properties for
continuous distributions.
The expectation value E(X) of the continuous RV X is defined as

E(X) = ⟨X⟩ ≡ \int x p(x) dx.  (19)

The variance or dispersion Var(X) of the continuous RV X is defined as

Var(X) ≡ E[(X − E(X))²] = E(X²) − E(X)² = \int x² p(x) dx − \left(\int x p(x) dx\right)².  (20)
The exponential distribution: the exponential distribution describes the time one has
to wait between two consecutive events in a Poisson process, e.g. the waiting time between
two radioactive particle decays, or the time between cars passing by a certain point on a
road, or (swapping time for length) the distance between galaxies in the sky.
To derive the exponential distribution, one can consider the arrival time of Poisson-dis-
tributed counts, for example the arrival time of customers in a queue, then derive the
probability density that the first person arrives at time t by considering the probability
(which is Poisson distributed) that nobody arrives in the interval [0, t] and then that one
person arrives during the interval [t, t + Δt]. Taking the limit Δt → 0, it follows that the
probability density for observing precisely 1 event at time t is given by

P(1 event at time t|λ) = λ e^{−λt},  (21)

where λ is the mean number of events per unit time. This is the exponential distribution.
If we have already waited for a time s for the first event to occur (and no event has
occurred), then the probability that we have to wait for another time t before the first
event happens satisfies

P(T > t + s|T > s) = P(T > t).  (22)

This means that, having waited for time s without the event occurring, the time we can
expect to have to wait has the same distribution as the time we have to wait from the
beginning. The exponential distribution has no memory of the fact that a time s has
already elapsed.
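This memoryless property can be verified in one line (a step not spelled out in the notes) from Eq. (21): the survival function is P(T > t) = \int_t^{∞} λe^{−λy} dy = e^{−λt}, so that

P(T > t + s|T > s) = \frac{P(T > t + s)}{P(T > s)} = \frac{e^{−λ(t+s)}}{e^{−λs}} = e^{−λt} = P(T > t).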
For the exponential distribution of Eq. (21), the expectation value and variance are given
by

E(t) = 1/λ,  Var(t) = 1/λ².  (23)
Figure 4: Two examples of the Gaussian distribution, Eq. (24), for different choices of μ, σ, and
its corresponding cdf. It is clear that the expectation value μ controls the location of the pdf,
while σ controls its width.
6 The Gaussian (or Normal) distribution
The Gaussian pdf (often called the Normal distribution) is perhaps the most important
distribution. It is used as a default in many situations involving continuous RVs (the reason
becomes clear once we have studied the Central Limit Theorem, section 7). A heuristic
derivation of how the Gaussian arises follows from the example of darts throwing (given
in the lecture).
The Gaussian pdf is a continuous distribution with mean μ and standard deviation σ,
given by

p(x|μ, σ) = \frac{1}{\sqrt{2π}σ} \exp\left(−\frac{1}{2}\frac{(x − μ)²}{σ²}\right),  (24)

and it is plotted in Fig. 4 for two different choices of μ, σ. The Gaussian is the famous
bell-shaped curve.
For the Gaussian distribution of Eq. (24), the expectation value and variance are given by:

E(X) = μ,  Var(X) = σ².  (25)
It can be shown that the Gaussian arises from the Binomial in the limit n → ∞ and
from the Poisson distribution in the limit λ → ∞. As shown in Fig. 5, the Gaussian
approximation to either the Binomial or the Poisson distribution is very good even for
fairly moderate values of n and λ.
Figure 5: Gaussian approximation to the Binomial (left panel) and the Poisson distribution
(right panel). The solid curve gives in each case the Gaussian approximation to each pmf.

The probability content of the Gaussian for a given symmetric interval around the mean
of width σ on each side is given by

P(μ − σ < x < μ + σ) = \int_{μ−σ}^{μ+σ} \frac{1}{\sqrt{2π}σ} \exp\left(−\frac{1}{2}\frac{(x − μ)²}{σ²}\right) dx  (26)
  = \frac{2}{\sqrt{π}} \int_0^{1/\sqrt{2}} \exp(−y²) dy  (27)
  = erf(1/\sqrt{2}),  (28)
where the error function erf is defined as

erf(x) = \frac{2}{\sqrt{π}} \int_0^x \exp(−y²) dy,  (29)

and can be found by numerical integration (it is also often tabulated and available as a
built-in function in most mathematical software). Also recall the useful integral:
\int \exp\left(−\frac{1}{2}\frac{(x − μ)²}{σ²}\right) dx = \sqrt{2π}σ.  (30)
Eq. (26) allows one to find the probability content of the Gaussian pdf for any symmetric
interval around the mean: an interval extending κσ on each side of the mean contains a
fraction erf(κ/\sqrt{2}) of the probability. Some commonly used values are given in Table 1.
In particular, the usual notation, e.g. for a measurement of a temperature of the form
T = (100 ± 1) K, means that 1 K is the 1σ error bar. This means that 68.3% of the
probability is contained within 1 K of the mean.
The discovery threshold in particle physics is traditionally set at 5σ. This means that
one needs to have a probability in excess of 1 − 5.7 × 10⁻⁷ before being able to claim the
discovery of a new effect.
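The probability contents quoted above are easy to regenerate; this snippet (my addition) evaluates erf(κ/√2) for a few values of κ using Python's built-in error function:

```python
import math

# Probability content of the interval mu ± kappa*sigma (generalising Eq. (28)).
for kappa in (1, 2, 3, 5):
    content = math.erf(kappa / math.sqrt(2))
    print(f"{kappa} sigma: {content:.7f}   two-sided tail: {1 - content:.1e}")
# kappa = 1 gives 0.683 (the 1-sigma error bar); kappa = 5 leaves a tail of 5.7e-07.
```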
7 The Central Limit Theorem

The Central Limit Theorem (CLT): let X_1, X_2, . . . , X_N be independent RVs with
expectation values μ_i and variances σ_i². Then, in the limit N → ∞, the variable

Y = \frac{\sum_{i=1}^{N} X_i − \sum_{i=1}^{N} μ_i}{\sqrt{\sum_{i=1}^{N} σ_i²}}  (31)

is distributed as a Gaussian with expectation value 0 and unit variance.
Proof: not required. Very simple using characteristic functions.
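A simulation makes the CLT tangible; in this sketch (my own, with arbitrarily chosen exponential inputs) the standardised sum of Eq. (31) comes out Gaussian even though the individual RVs are far from Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
draws = rng.exponential(scale=2.0, size=(100_000, N))    # non-Gaussian RVs

mu_i, var_i = 2.0, 4.0                  # mean and variance of each exponential RV
Y = (draws.sum(axis=1) - N * mu_i) / np.sqrt(N * var_i)  # Eq. (31)

print("E(Y):", Y.mean())                         # ~ 0
print("Var(Y):", Y.var())                        # ~ 1
print("P(|Y| < 1):", np.mean(np.abs(Y) < 1))     # ~ 0.683, as for a unit Gaussian
```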
8 The likelihood function
The problem of inference can be stated as follows: given a collection of samples,
x̂_1, x̂_2, . . . , x̂_N, and a generating random process, what can be said about the properties
of the underlying probability distribution?
Figure 6: The likelihood function for the probability of heads (θ) for the coin tossing example.
Schematically, we have that:

pdf — e.g., Gaussian with a given (μ, σ) → Probability of observation
Underlying (μ, σ) → Observed events  (32)
The connection between the two domains is given by the likelihood function.
Given a pdf or a pmf p(X|θ), where X represents a random variable and θ a collection
of parameters describing the shape of the pdf (e.g., for a Gaussian θ = {μ, σ}), and the
observed data x̂ = {x̂_1, x̂_2, . . . , x̂_N}, the likelihood function L (or likelihood for short) is
defined as

L(θ) = p(X = x̂|θ),  (33)

i.e., the probability, as a function of the parameters θ, of observing the data that have
been obtained. Notice that the likelihood is not a pdf in θ.
Example: in tossing a coin, let θ be the probability of obtaining heads in one throw.
Suppose we make n = 5 flips and obtain the sequence x̂ = {H, T, T, T, T}. The likelihood
is obtained by taking the Binomial, Eq. (11), and replacing for r the number of heads
obtained (r = 1) in n = 5 trials. Thus

L(θ) = \binom{5}{1} θ¹(1 − θ)⁴ = 5θ(1 − θ)⁴,  (34)

which is plotted as a function of θ in Fig. 6.
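Evaluating Eq. (34) on a grid (a small sketch of mine, not from the notes) confirms what Fig. 6 shows, namely that the likelihood peaks at θ = r/n = 0.2:

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)
L = 5 * theta * (1 - theta)**4          # Eq. (34)

print("theta at the maximum:", theta[np.argmax(L)])   # 0.2 = r/n
```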
This example leads to the formulation of the Maximum Likelihood Principle (see below):
if we are trying to determine the value of θ given what we have observed (the sequence
of H/T), we should choose the value that maximises the likelihood. Notice that this is
not necessarily the same as maximising the probability of θ. Doing so requires the use of
Bayes theorem, see section 12.
A common problem is how to estimate the mean and the standard deviation of a Gaussian.
Given a list of samples x̂ = {x̂_1, x̂_2, . . . , x̂_N}, the estimator for the (unknown) mean μ of
the underlying Gaussian they are drawn from is given by

\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x̂_i,  (35)

i.e., the sample mean. The law of large numbers implies that

\lim_{N→∞} \bar{x} = μ.  (36)

This means that for large samples, the estimated sample mean converges to the true mean
of the distribution. In fact a stronger property holds: E(\bar{x}) = μ for any N, and an
estimator with this property is said to be unbiased.
The estimator for the variance (and hence the standard deviation) of the Gaussian is
given by

\hat{σ}² = \frac{1}{N − 1}\sum_{i=1}^{N} (x̂_i − \bar{x})².  (37)

Notice the 1/(N − 1) factor in front, which ensures that the estimator is unbiased.
Furthermore

\lim_{N→∞} \hat{σ}² = σ²,  (38)

i.e., the above estimator converges to the true value for a large number of samples.
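In Python the two estimators of Eqs. (35) and (37) are one-liners; this sketch (my addition, with μ = 10 and σ = 3 chosen arbitrarily) checks them on simulated Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=10_000)   # samples with mu = 10, sigma = 3

xbar = x.mean()           # Eq. (35), the sample mean
s2 = x.var(ddof=1)        # Eq. (37): ddof=1 selects the unbiased 1/(N-1) normalisation

print("estimated mean:", xbar)
print("estimated sigma:", np.sqrt(s2))
```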
9 The Maximum Likelihood Principle
The Maximum Likelihood Principle (MLP): given the likelihood function L(θ) and seeking
to determine the parameter θ, we should choose the value of θ in such a way that the value
of the likelihood is maximised. The Maximum Likelihood Estimator (MLE) for θ is thus

θ_ML ≡ \arg\max_θ L(θ).  (39)

Properties of the MLE: it is asymptotically unbiased (i.e., θ_ML → θ for N → ∞) and it is
asymptotically the minimum variance estimator, i.e. the one with the smallest errors.
To find the MLE, we maximise the likelihood by requiring its first derivative to be zero
and its second derivative to be negative:

\frac{∂L(θ)}{∂θ}\Big|_{θ_ML} = 0,  and  \frac{∂²L(θ)}{∂θ²}\Big|_{θ_ML} < 0.  (40)
In practice, it is often more convenient to maximise the logarithm of the likelihood (the
log-likelihood) instead. Since log is a monotonic function, maximising the likelihood is
the same as maximising the log-likelihood. So one often uses

\frac{∂ \ln L(θ)}{∂θ}\Big|_{θ_ML} = 0,  and  \frac{∂² \ln L(θ)}{∂θ²}\Big|_{θ_ML} < 0.  (41)
Application of the MLP to a Gaussian likelihood: for N independent samples from a
Gaussian distribution, the joint likelihood function is given by

L(θ) = p(x̂|θ) = \prod_{i=1}^{N} \frac{1}{\sqrt{2π}σ} \exp\left(−\frac{1}{2}\frac{(x̂_i − μ)²}{σ²}\right),  (42)

where θ = {μ, σ} are the mean and standard deviation of the distribution. Note: often
the Gaussian above is written as

L(θ) = L_0 \exp\left(−χ²/2\right),  (43)

where the so-called chi-squared is defined as

χ² = \sum_i \frac{(x̂_i − μ)²}{σ²}.  (44)
The MLE for the mean is obtained by solving

\frac{∂ \ln L}{∂μ} = 0  ⟹  μ_ML = \frac{1}{N}\sum_{i=1}^{N} x̂_i,  (45)

i.e., the MLE for the mean is just the sample mean that we already encountered above.
The MLE for σ works out to be

\frac{∂ \ln L}{∂σ} = 0  ⟹  σ²_ML = \frac{1}{N}\sum_{i=1}^{N} (x̂_i − μ)²,  (46)

which however is biased, because E(σ²_ML) = \left(1 − \frac{1}{N}\right)σ² ≠ σ² (for finite N).
In order to obtain an unbiased estimator we replace the factor 1/N by 1/(N − 1). Also,
because the true μ is usually unknown, we replace it in Eq. (46) by the MLE estimator,
μ_ML. Thus an unbiased estimator for the variance is

\hat{σ}² = \frac{1}{N − 1}\sum_{i=1}^{N} (x̂_i − μ_ML)².  (47)
MLE recipe:
1. Write down the likelihood. This depends on the kind of random process you are
considering.
2. Find the best-fit value of the parameter θ by maximising the likelihood L as a function
of θ. This is your MLE, θ_ML.
3. Evaluate the uncertainty on θ_ML (see next section).
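As a worked illustration of steps 1 and 2 (my own sketch; the data are invented and the random process is taken to be Poisson, Eq. (13) with t = 1), the log-likelihood can be maximised numerically on a grid:

```python
import numpy as np

counts = np.array([3, 5, 4, 6, 2])          # step 1: invented Poisson-process counts

lam = np.linspace(0.1, 10.0, 2000)          # candidate values of the rate lambda
# log-likelihood: ln L = sum_i [r_i ln(lam) - lam]  (the -ln(r_i!) term is lambda-independent)
lnL = np.sum(counts[:, None] * np.log(lam) - lam, axis=0)

lam_ml = lam[np.argmax(lnL)]                # step 2: maximise ln L
print("lambda_ML:", lam_ml, "  sample mean:", counts.mean())
```

The maximum lands at the sample mean, as expected for a Poisson likelihood; step 3, the uncertainty on λ_ML, is the subject of the next section.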
10 Confidence intervals
Consider a general likelihood function L(θ), and let us expand ln L around its maximum:

ln L(θ) = ln L(θ_ML) + \frac{∂ \ln L(θ)}{∂θ}\Big|_{θ_ML}(θ − θ_ML) + \frac{1}{2}\frac{∂² \ln L(θ)}{∂θ²}\Big|_{θ_ML}(θ − θ_ML)² + . . .  (48)
The second term on the RHS vanishes (by definition of the Maximum Likelihood value),
hence we can approximate the likelihood as

L(θ) ≈ L(θ_ML) \exp\left(−\frac{1}{2}\frac{(θ − θ_ML)²}{σ_θ²}\right),  (49)

with

\frac{1}{σ_θ²} = −\frac{∂² \ln L(θ)}{∂θ²}\Big|_{θ_ML}.  (50)
So a general likelihood function can be approximated as a Gaussian around its peak, as
shown by Eq. (49).
Application: going back to the example given by the likelihood of Eq. (42), we can use
the above result to estimate the width of the likelihood function around the peak. This
expresses the uncertainty in our estimation of the mean, Eq. (45). Applying Eq. (50) to
the likelihood of Eq. (42) we obtain

σ_μ = σ/\sqrt{N}  (51)

(this result can also be derived directly by manipulating the likelihood function). This
means that the standard deviation (i.e., the uncertainty) on our ML estimate for the mean
is proportional to 1/\sqrt{N}: the precision improves with the square root of the number of
samples. The interval [μ_ML − σ_μ < μ < μ_ML + σ_μ] is a 68.3% confidence interval
(a 1σ interval), while [μ_ML − 2σ_μ < μ < μ_ML + 2σ_μ] is a 95.4%
confidence interval (a 2σ interval).
One has to be careful with the interpretation of confidence intervals, as this is often mis-
understood! Interpretation: if we were to repeat an experiment many times, and each
time report the observed 100α% confidence interval, we would be correct 100α% of the
time. This means that (ideally) a 100α% confidence interval contains the true value of
the parameter 100α% of the time.
In a frequentist sense, it does not make sense to talk about the probability of θ. This
is because every time the experiment is performed we get a different realization (different
samples), hence a different numerical value for the confidence interval. Each time, either
the true value of θ is inside the reported confidence interval (in which case the probability
of it being inside is 1) or the true value is outside (in which case its probability of being
inside is 0). Confidence intervals do not give the probability of the parameter! In order to
do that, you need Bayes theorem.
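The repeated-experiment interpretation can be simulated directly; in this sketch (mine, with arbitrary true values) the reported 1σ interval built from Eq. (51) covers the true mean close to 68.3% of the time:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma, N, trials = 5.0, 2.0, 25, 10_000

hits = 0
for _ in range(trials):
    x = rng.normal(mu_true, sigma, N)
    mu_ml = x.mean()                        # Eq. (45)
    sigma_mu = sigma / np.sqrt(N)           # Eq. (51)
    hits += (mu_ml - sigma_mu < mu_true < mu_ml + sigma_mu)

print("coverage of the 1-sigma interval:", hits / trials)   # ~ 0.683
```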
11 Propagation of errors
Suppose we have measured a quantity x, obtaining a measurement x̂ ± σ_x. How do we
propagate the measurement onto another variable y = y(x)?
Taylor expanding y(x) around x̂ we obtain:

y(x) ≈ y(x̂) + (x − x̂) \frac{∂y}{∂x}\Big|_{x=x̂} + . . .  (52)
Truncating the expansion at linear order, the expectation value of y is given by:

E(y) = E(y(x̂)) + \frac{∂y}{∂x}\Big|_{x=x̂} E(x − x̂) = y(x̂),  (53)

because E(x − x̂) = 0.
The variance of y is given by:

V(y) = E([y(x) − E(y(x))]²) = E([y(x) − y(x̂)]²) = \left(\frac{∂y}{∂x}\Big|_{x=x̂}\right)² σ_x².  (54)

So the variance on y is related to the variance on x by

σ_y² = \left(\frac{∂y}{∂x}\Big|_{x=x̂}\right)² σ_x².  (55)
Generalization to functions of several variables: if y = y(x_1, . . . , x_N) then

σ_y² = \sum_{i=1}^{N} \left(\frac{∂y}{∂x_i}\Big|_{x=x̂}\right)² σ_{x_i}².  (56)
Special cases:
1. Linear relationship: y = ax. Then σ_y = |a| σ_x.
2. Product or ratio: e.g. y(x_1, x_2) = x_1 x_2 or y(x_1, x_2) = x_1/x_2. Then

\frac{σ_y²}{y²} = \frac{σ_{x_1}²}{x_1²} + \frac{σ_{x_2}²}{x_2²}.  (57)
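Eq. (57) is easy to verify by Monte Carlo; in this sketch (my addition, with made-up central values and errors) a product of two noisy quantities is simulated directly:

```python
import numpy as np

rng = np.random.default_rng(4)
x1, s1 = 10.0, 0.3          # hypothetical measurement x1 ± s1
x2, s2 = 4.0, 0.2           # hypothetical measurement x2 ± s2

y = rng.normal(x1, s1, 1_000_000) * rng.normal(x2, s2, 1_000_000)
sigma_y = abs(x1 * x2) * np.sqrt((s1 / x1)**2 + (s2 / x2)**2)   # Eq. (57)

print("Monte Carlo sigma_y:", y.std())
print("Eq. (57)    sigma_y:", sigma_y)
```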
Systematic vs random errors: errors are often divided into these two categories. Any mea-
surement is subject to statistical fluctuations, which means that if we repeat the same
measurement we will obtain a slightly different outcome every time. This is a statistical
(or random) error. Random errors manifest themselves as noise in the measurement,
which leads to variability in the data each time a measurement is made.
On the other hand, systematic errors do not lead to variability in the measurement, but
are the cause for data to be systematically off all the time (e.g., measuring a current in
A while the apparatus really gives mA would lead to a factor of 1000 systematic error all
the time). Systematic errors are usually more difficult to track down. They might arise
from experimental mistakes, or because of unmodelled (or unrecognized) effects in the
system you are measuring.
12 Bayesian statistics
Bayes theorem, Eq. (5), encapsulates the notion of probability as degree of belief. The
Bayesian outlook on probability is more general than the frequentist one, as the former
can deal with unrepeatable situations that the latter cannot address.
We replace in Bayes theorem, Eq. (5), A → θ (the parameters) and B → d (the observed
data, or samples), obtaining

P(θ|d) = \frac{P(d|θ)P(θ)}{P(d)}.  (58)
On the LHS, P(θ|d) is the posterior probability for θ (or posterior for short), and it
represents our degree of belief about the value of θ after we have seen the data d.
On the RHS, P(d|θ) = L(θ) is the likelihood we already encountered. It is the probability of
the data given a certain value of the parameters. The quantity P(θ) is the prior probability
distribution (or prior for short). It represents our degree of belief in the value of θ
before we see the data. This is an essential ingredient of Bayesian statistics. In the
denominator, P(d) is a normalizing constant (often called the evidence), that ensures
that the posterior is normalized to unity:

P(d) = \int dθ P(d|θ)P(θ).  (59)
The evidence is important for Bayesian model selection (not covered in this course).
Interpretation: Bayes theorem relates the posterior probability for θ (i.e., what we know
about the parameter after seeing the data) to the likelihood. It can be thought of as
a general rule to update our knowledge about a quantity (here, θ) from the prior to
the posterior. A result known as Cox's theorem shows that Bayes theorem is the unique
generalization of Boolean algebra in the presence of uncertainty.
Remember that in general P(θ|d) ≠ P(d|θ) (see the example of the pregnant woman), i.e.
the posterior and the likelihood are two different quantities with different meanings!
Bayesian inference works by updating our state of knowledge about a parameter (or hy-
pothesis) as new data flow in. The posterior from a previous cycle of observations becomes
the prior for the next. The price we have to pay is that we have to start somewhere by
specifying an initial prior, which is not determined by the theory, but needs to be given
by the user. The prior should represent fairly the state of knowledge of the user about the
quantity of interest. Eventually, the posterior will converge to a unique (objective) result
even if different scientists start from different priors (provided their priors are non-zero in
regions of parameter space where the likelihood is large). See Fig. 7 for an illustration.
There is a vast literature about how to select a prior in an appropriate way. Some aspects
are fairly obvious: if your parameter θ describes a quantity that has e.g. to be strictly
positive (such as the number of photons in a detector, or an amplitude), then the prior
will be 0 for values θ < 0.
Figure 7: Converging views in Bayesian inference. Two scientists having different prior beliefs
p(θ) about the value of a quantity θ (panel (a), the two curves representing two different priors)
observe one datum with likelihood L(θ) (panel (b)), after which their posteriors p(θ|d) (panel
(c), obtained via Bayes theorem, Eq. (5)) represent their updated states of knowledge on the
parameter. This posterior then becomes the prior for the next observation. After observing 100
data points, the two posteriors have become essentially indistinguishable (panel (d)).
A standard (but by no means trivial) choice is to take a uniform prior (also called a flat
prior) on θ, defined as:

P(θ) = \begin{cases} \frac{1}{θ_{max} − θ_{min}} & \text{for } θ_{min} ≤ θ ≤ θ_{max} \\ 0 & \text{otherwise} \end{cases}  (60)
With this choice of prior in Bayes theorem, Eq. (58), the posterior becomes functionally
identical to the likelihood up to a proportionality constant:

P(θ|d) ∝ P(d|θ) = L(θ).  (61)

In this case, all of our previous results about the likelihood carry over (but with a different
interpretation). In particular, the probability content of an interval around the mean for
the posterior should be interpreted as a statement about our degree of belief in the value
of θ (differently from confidence intervals for the likelihood).
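The flat-prior posterior of Eq. (61) can be computed on a grid; this sketch (mine, not from the notes) reuses the coin-tossing likelihood of Eq. (34) and normalises by the evidence as in Eq. (59):

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(theta)              # flat prior on [0, 1], Eq. (60)

likelihood = 5 * theta * (1 - theta)**4  # Eq. (34): r = 1 heads in n = 5 tosses
posterior = likelihood * prior
posterior /= np.trapz(posterior, theta)  # divide by the evidence, Eq. (59)

print("posterior mean of theta:", np.trapz(theta * posterior, theta))   # ~ 0.286
```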
Under a change of variable ψ = ψ(θ), the prior transforms according to:

P(ψ) = P(θ) \left|\frac{dθ}{dψ}\right|.  (62)

In particular, a flat prior on θ is no longer flat in ψ if the variable transformation is
non-linear.