Stat520 Ch.4
3 Proof by Topic
A
• Accuracy and complexity for Bayesian linear regression, 497
• Accuracy and complexity for Bayesian linear regression with known covariance, 521
• Accuracy and complexity for the univariate Gaussian, 358
• Accuracy and complexity for the univariate Gaussian with known variance, 370
• Addition law of probability, 13
• Addition of the differential entropy upon multiplication with a constant, 92
• Addition of the differential entropy upon multiplication with invertible matrix, 94
• Additivity of the Kullback-Leibler divergence for independent distributions, 109
• Additivity of the variance for independent random variables, 59
• Akaike information criterion for multiple linear regression, 489
• Application of Cochran’s theorem to two-way analysis of variance, 397
B
• Bayes’ rule, 133
• Bayes’ theorem, 132
• Bayesian information criterion for multiple linear regression, 490
• Best linear unbiased estimator for the inverse general linear model, 537
• Binomial test, 550
• Brier scoring rule is strictly proper scoring rule, 140
C
• Characteristic function of a function of a random variable, 35
• Chi-squared distribution is a special case of gamma distribution, 252
• Combined posterior distribution for Bayesian linear regression when analyzing conditionally independent data sets, 509
• Combined posterior distributions in terms of individual posterior distributions obtained from conditionally independent data, 127
• Concavity of the Shannon entropy, 86
• Conditional distributions of the multivariate normal distribution, 299
• Conditional distributions of the normal-gamma distribution, 318
• Conjugate prior distribution for Bayesian linear regression, 492
• Conjugate prior distribution for Bayesian linear regression with known covariance, 515
• Conjugate prior distribution for binomial observations, 554
• Conjugate prior distribution for multinomial observations, 564
• Conjugate prior distribution for multivariate Bayesian linear regression, 543
• Conjugate prior distribution for Poisson-distributed data, 573
• Conjugate prior distribution for the Poisson distribution with exposure values, 580
• Conjugate prior distribution for the univariate Gaussian, 351
• Conjugate prior distribution for the univariate Gaussian with known variance, 364
• Construction of confidence intervals using Wilks’ theorem, 114
• Construction of unbiased estimator for variance, 602
• Construction of unbiased estimator for variance in multiple linear regression, 603
• Continuous uniform distribution maximizes differential entropy for fixed range, 184
• Convexity of the cross-entropy, 88
• Convexity of the Kullback-Leibler divergence, 108
$$U_i = \frac{X_i - \mu}{\sigma} \; , \tag{5}$$

which follows a standard normal distribution (→ II/3.2.4):

$$U_i \sim \mathcal{N}(0, 1) \; . \tag{6}$$
Then, the sum of squared random variables Ui can be rewritten as
$$\begin{split}
\sum_{i=1}^{n} U_i^2 &= \sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2 \\
&= \sum_{i=1}^{n} \left( \frac{(X_i - \bar{X}) + (\bar{X} - \mu)}{\sigma} \right)^2 \\
&= \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2} + \sum_{i=1}^{n} \frac{(\bar{X} - \mu)^2}{\sigma^2} + 2 \sum_{i=1}^{n} \frac{(X_i - \bar{X})(\bar{X} - \mu)}{\sigma^2} \\
&= \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma} \right)^2 + \sum_{i=1}^{n} \left( \frac{\bar{X} - \mu}{\sigma} \right)^2 + 2 \, \frac{\bar{X} - \mu}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X}) \; .
\end{split} \tag{7}$$
Because the sum of deviations from the sample mean is zero,

$$\begin{split}
\sum_{i=1}^{n} (X_i - \bar{X}) &= \sum_{i=1}^{n} X_i - n \bar{X} \\
&= \sum_{i=1}^{n} X_i - n \cdot \frac{1}{n} \sum_{i=1}^{n} X_i \\
&= \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i \\
&= 0 \; ,
\end{split} \tag{8}$$

the last term in (7) vanishes and the sum of squares reduces to

$$\sum_{i=1}^{n} U_i^2 = \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma} \right)^2 + \sum_{i=1}^{n} \left( \frac{\bar{X} - \mu}{\sigma} \right)^2 \; . \tag{9}$$
Cochran’s theorem states that, if a sum of squared standard normal (→ II/3.2.3) random variables (→ I/1.2.2) can be written as a sum of quadratic forms

$$\sum_{i=1}^{n} U_i^2 = \sum_{j=1}^{m} Q_j \quad \text{where} \quad Q_j = \sum_{k=1}^{n} \sum_{l=1}^{n} U_k B_{kl}^{(j)} U_l \quad \text{with} \quad \sum_{j=1}^{m} B^{(j)} = I_n \; , \tag{10}$$

then, provided the ranks $r_j = \mathrm{rank}(B^{(j)})$ sum to $n$, the $Q_j$ are independent and each $Q_j$ follows a chi-squared distribution with $r_j$ degrees of freedom.
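As an illustrative numerical check of this application (not part of the original proof), the decomposition in (9) corresponds to the quadratic-form matrices $B^{(1)} = I_n - \frac{1}{n} J_n$ (the centering matrix, rank $n-1$) and $B^{(2)} = \frac{1}{n} J_n$ (rank 1), where $J_n$ is the all-ones matrix. A minimal sketch using numpy, with arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 8, 2.0, 1.5
x = rng.normal(mu, sigma, size=n)

u = (x - mu) / sigma                      # standardized variables U_i, eq. (5)
J = np.ones((n, n)) / n                   # averaging matrix (1/n) * ones
B1 = np.eye(n) - J                        # centering matrix, rank n-1
B2 = J                                    # rank-1 projection onto the mean

print(np.allclose(B1 + B2, np.eye(n)))    # True: the B^(j) sum to I_n, eq. (10)

Q1 = u @ B1 @ u                           # quadratic form for the within-sample part
Q2 = u @ B2 @ u                           # quadratic form for the sample-mean part

# These match the two sums on the right-hand side of eq. (9):
print(np.isclose(Q1, np.sum(((x - x.mean()) / sigma) ** 2)))   # True
print(np.isclose(Q2, n * ((x.mean() - mu) / sigma) ** 2))      # True
print(np.isclose(Q1 + Q2, np.sum(u ** 2)))                     # True
```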
Proof: We are interested in the probability that Y equals a number m. According to the law of
marginal probability (→ I/1.3.3) or the law of total probability (→ I/1.4.7), this probability can be
expressed as:
$$\Pr(Y = m) = \sum_{k=0}^{\infty} \Pr(Y = m \mid X = k) \cdot \Pr(X = k) \; . \tag{4}$$
Since, by definitions (2) and (1), Pr(X = k) = 0 when k > n and Pr(Y = m|X = k) = 0 when
k < m, we have:
$$\Pr(Y = m) = \sum_{k=m}^{n} \Pr(Y = m \mid X = k) \cdot \Pr(X = k) \; . \tag{5}$$
Now we can take the probability mass function of the binomial distribution (→ II/1.3.2) and plug it
in for the terms in the sum of (5) to get:
$$\Pr(Y = m) = \sum_{k=m}^{n} \binom{k}{m} q^m (1-q)^{k-m} \cdot \binom{n}{k} p^k (1-p)^{n-k} \; . \tag{6}$$
Applying the binomial coefficient identity $\binom{n}{k} \binom{k}{m} = \binom{n}{m} \binom{n-m}{k-m}$ and rearranging the terms, we have:
$$\Pr(Y = m) = \sum_{k=m}^{n} \binom{n}{m} \binom{n-m}{k-m} p^k q^m (1-p)^{n-k} (1-q)^{k-m} \; . \tag{7}$$
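The binomial coefficient identity used here is easy to spot-check; a short Python snippet (standard library only, not part of the original proof) verifies it over a small grid of values:

```python
from math import comb

# Spot-check C(n,k)*C(k,m) == C(n,m)*C(n-m,k-m) for small n, k, m.
for n in range(12):
    for k in range(n + 1):
        for m in range(k + 1):
            assert comb(n, k) * comb(k, m) == comb(n, m) * comb(n - m, k - m)
print("identity holds on the tested grid")
```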
Now we partition $p^k = p^m \cdot p^{k-m}$ and pull all terms not dependent on $k$ out of the sum:

$$\begin{split}
\Pr(Y = m) &= \binom{n}{m} p^m q^m \sum_{k=m}^{n} \binom{n-m}{k-m} p^{k-m} (1-p)^{n-k} (1-q)^{k-m} \\
&= \binom{n}{m} (pq)^m \sum_{k=m}^{n} \binom{n-m}{k-m} \bigl( p(1-q) \bigr)^{k-m} (1-p)^{n-k} \; .
\end{split} \tag{8}$$
Substituting $i = k - m$ into the remaining sum and applying the binomial theorem, we obtain

$$\sum_{i=0}^{n-m} \binom{n-m}{i} (p - pq)^i (1-p)^{n-m-i} = \bigl( (p - pq) + (1 - p) \bigr)^{n-m} \; . \tag{11}$$
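Since $(p - pq) + (1 - p) = 1 - pq$, the sum in (11) equals $(1 - pq)^{n-m}$, so $\Pr(Y = m) = \binom{n}{m} (pq)^m (1 - pq)^{n-m}$, i.e. $Y$ follows a binomial distribution with parameters $n$ and $pq$. A brief numerical sanity check of this conclusion, assuming scipy is available and using hypothetical parameter values:

```python
from math import comb
from scipy.stats import binom

# Hypothetical parameters for a spot check.
n, p, q = 10, 0.6, 0.3

for m in range(n + 1):
    # Marginal of Y via eq. (5)/(6): sum over the intermediate variable X = k.
    direct = sum(
        comb(k, m) * q**m * (1 - q)**(k - m) * comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(m, n + 1)
    )
    assert abs(direct - binom.pmf(m, n, p * q)) < 1e-12
print("marginal of Y matches Bin(n, p*q)")
```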
$$\begin{split}
\mu_n &= \frac{\lambda_0 \mu_0 + \tau n \bar{y}}{\lambda_0 + \tau n} \\
\lambda_n &= \lambda_0 + \tau n
\end{split} \tag{4}$$

with the sample mean (→ I/1.10.2) $\bar{y}$ and the inverse variance or precision (→ I/1.11.12) $\tau = 1/\sigma^2$.
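Before the proof, a minimal numerical sketch of this update rule, assuming (as the notation above suggests) that $\lambda_0$ and $\lambda_n$ denote the prior and posterior precision of $\mu$; the data and parameter values below are hypothetical and numpy is assumed:

```python
import numpy as np

def posterior_params(y, sigma2, mu0, lambda0):
    """Posterior mean and precision for a normal prior N(mu0, 1/lambda0)
    on the mean of i.i.d. N(mu, sigma2) data, following eq. (4)."""
    n = len(y)
    tau = 1.0 / sigma2                    # observation precision
    ybar = np.mean(y)                     # sample mean
    lambda_n = lambda0 + tau * n          # posterior precision
    mu_n = (lambda0 * mu0 + tau * n * ybar) / lambda_n
    return mu_n, lambda_n

# Hypothetical example values.
rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=50)
print(posterior_params(y, sigma2=4.0, mu0=0.0, lambda0=0.1))
```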
Proof: According to the law of marginal probability (→ I/1.3.3), the model evidence (→ I/5.1.11)
for this model is:
$$p(y \mid m) = \int p(y \mid \mu) \, p(\mu) \, d\mu \; . \tag{5}$$
According to the law of conditional probability (→ I/1.3.4), the integrand is equivalent to the joint
likelihood (→ I/5.1.5):
$$p(y \mid m) = \int p(y, \mu) \, d\mu \; . \tag{6}$$
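As a sanity check of (5) and (6) that is not part of the original proof, the integral can be evaluated numerically for a single observation, in which case the model evidence is the density of $\mathcal{N}(\mu_0, \sigma^2 + 1/\lambda_0)$ evaluated at $y$ (a standard fact about sums of independent normals). A sketch assuming scipy and hypothetical values:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical single observation: y ~ N(mu, sigma2), prior mu ~ N(mu0, 1/lambda0).
y, sigma2, mu0, lambda0 = 1.7, 2.0, 0.0, 0.5

# Model evidence via eq. (5): integrate likelihood times prior over mu.
evidence, _ = quad(
    lambda mu: norm.pdf(y, loc=mu, scale=np.sqrt(sigma2))
    * norm.pdf(mu, loc=mu0, scale=np.sqrt(1.0 / lambda0)),
    -np.inf, np.inf,
)

# For one observation the marginal of y is N(mu0, sigma2 + 1/lambda0).
print(np.isclose(evidence, norm.pdf(y, loc=mu0, scale=np.sqrt(sigma2 + 1.0 / lambda0))))
```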
$$\begin{split}
p(y \mid \mu, \sigma^2) &= \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \sigma^2) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right] \\
&= \left( \sqrt{\frac{1}{2\pi\sigma^2}} \right)^n \cdot \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right]
\end{split} \tag{7}$$
$$\begin{split}
p(y \mid \mu, \sigma^2) &= \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \sigma^2) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right] \\
&= \frac{1}{(\sqrt{2\pi\sigma^2})^n} \cdot \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right]
\end{split} \tag{3}$$
$$\begin{split}
p(y \mid \mu, \tau) &= \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \tau^{-1}) \\
&= \prod_{i=1}^{n} \sqrt{\frac{\tau}{2\pi}} \cdot \exp\left[ -\frac{\tau}{2} (y_i - \mu)^2 \right] \\
&= \left( \sqrt{\frac{\tau}{2\pi}} \right)^n \cdot \exp\left[ -\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right]
\end{split} \tag{4}$$
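A quick numerical check, not part of the original derivation, that the closed form in (4) matches the product of normal densities; scipy is assumed and the data are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data and parameters for a spot check of eq. (4).
rng = np.random.default_rng(2)
y = rng.normal(1.0, 0.8, size=20)
mu, tau = 1.2, 1.0 / 0.8**2
n = len(y)

# Product of normal densities (first line of eq. (4)).
lhs = np.prod(norm.pdf(y, loc=mu, scale=np.sqrt(1.0 / tau)))

# Closed-form expression (last line of eq. (4)).
rhs = (tau / (2 * np.pi)) ** (n / 2) * np.exp(-tau / 2 * np.sum((y - mu) ** 2))

print(np.isclose(lhs, rhs))   # True
```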
$$\begin{split}
p(y \mid \mu, \tau) &= \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\left[ -\frac{\tau}{2} \sum_{i=1}^{n} \left( y_i^2 - 2\mu y_i + \mu^2 \right) \right] \\
&= \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\left[ -\frac{\tau}{2} \left( \sum_{i=1}^{n} y_i^2 - 2\mu \sum_{i=1}^{n} y_i + n\mu^2 \right) \right] \\
&= \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\left[ -\frac{\tau}{2} \left( y^{\mathrm{T}} y - 2\mu n \bar{y} + n\mu^2 \right) \right] \\
&= \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\left[ -\frac{\tau n}{2} \left( \frac{1}{n} y^{\mathrm{T}} y - 2\mu \bar{y} + \mu^2 \right) \right]
\end{split} \tag{6}$$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the mean of the data points and $y^{\mathrm{T}} y = \sum_{i=1}^{n} y_i^2$ is the sum of squared data points.
Completing the square over $\mu$ finally gives

$$p(y \mid \mu, \tau) = \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\left[ -\frac{\tau n}{2} \left( (\mu - \bar{y})^2 - \bar{y}^2 + \frac{1}{n} y^{\mathrm{T}} y \right) \right] \; . \tag{7}$$
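To confirm the completed square, a short numerical check (not part of the original proof) that (7) reproduces the likelihood in (4), with hypothetical data and numpy assumed:

```python
import numpy as np

# Hypothetical data for a spot check that eq. (7) equals the likelihood in eq. (4).
rng = np.random.default_rng(3)
y = rng.normal(0.5, 1.5, size=15)
mu, tau = 0.3, 1.0 / 1.5**2
n, ybar = len(y), y.mean()

# Likelihood as in eq. (4).
plain = (tau / (2 * np.pi)) ** (n / 2) * np.exp(-tau / 2 * np.sum((y - mu) ** 2))

# Completed-square form as in eq. (7).
completed = (
    np.sqrt(1 / (2 * np.pi) ** n)
    * tau ** (n / 2)
    * np.exp(-tau * n / 2 * ((mu - ybar) ** 2 - ybar**2 + (y @ y) / n))
)
print(np.isclose(plain, completed))   # True
```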