Stat520 Ch.4

The document provides a comprehensive overview of various statistical concepts and theorems, including Bayesian linear regression, probability laws, and distributions. It covers topics such as the addition law of probability, Bayes' theorem, and Cochran's theorem, along with proofs and applications related to univariate Gaussian data. The content is organized by topic, making it a useful reference for statistical analysis and modeling.


3 Proof by Topic
A
• Accuracy and complexity for Bayesian linear regression, 497
• Accuracy and complexity for Bayesian linear regression with known covariance, 521
• Accuracy and complexity for the univariate Gaussian, 358
• Accuracy and complexity for the univariate Gaussian with known variance, 370
• Addition law of probability, 13
• Addition of the differential entropy upon multiplication with a constant, 92
• Addition of the differential entropy upon multiplication with invertible matrix, 94
• Additivity of the Kullback-Leibler divergence for independent distributions, 109
• Additivity of the variance for independent random variables, 59
• Akaike information criterion for multiple linear regression, 489
• Application of Cochran’s theorem to two-way analysis of variance, 397

B
• Bayes’ rule, 133
• Bayes’ theorem, 132
• Bayesian information criterion for multiple linear regression, 490
• Best linear unbiased estimator for the inverse general linear model, 537
• Binomial test, 550
• Brier scoring rule is strictly proper scoring rule, 140

C
• Characteristic function of a function of a random variable, 35
• Chi-squared distribution is a special case of gamma distribution, 252
• Combined posterior distribution for Bayesian linear regression when analyzing conditionally independent data sets, 509
• Combined posterior distributions in terms of individual posterior distributions obtained from conditionally independent data, 127
• Concavity of the Shannon entropy, 86
• Conditional distributions of the multivariate normal distribution, 299
• Conditional distributions of the normal-gamma distribution, 318
• Conjugate prior distribution for Bayesian linear regression, 492
• Conjugate prior distribution for Bayesian linear regression with known covariance, 515
• Conjugate prior distribution for binomial observations, 554
• Conjugate prior distribution for multinomial observations, 564
• Conjugate prior distribution for multivariate Bayesian linear regression, 543
• Conjugate prior distribution for Poisson-distributed data, 573
• Conjugate prior distribution for the Poisson distribution with exposure values, 580
• Conjugate prior distribution for the univariate Gaussian, 351
• Conjugate prior distribution for the univariate Gaussian with known variance, 364
• Construction of confidence intervals using Wilks’ theorem, 114
• Construction of unbiased estimator for variance, 602
• Construction of unbiased estimator for variance in multiple linear regression, 603
• Continuous uniform distribution maximizes differential entropy for fixed range, 184
• Convexity of the cross-entropy, 88
• Convexity of the Kullback-Leibler divergence, 108

$$U_i = \frac{X_i - \mu}{\sigma} \qquad (5)$$

which follows a standard normal distribution (→ II/3.2.4):

$$U_i \sim \mathcal{N}(0, 1) \; . \qquad (6)$$
Then, the sum of squared random variables $U_i$ can be rewritten as

$$\begin{split}
\sum_{i=1}^{n} U_i^2 &= \sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2 \\
&= \sum_{i=1}^{n} \left( \frac{(X_i - \bar{X}) + (\bar{X} - \mu)}{\sigma} \right)^2 \\
&= \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2} + \sum_{i=1}^{n} \frac{(\bar{X} - \mu)^2}{\sigma^2} + 2 \sum_{i=1}^{n} \frac{(X_i - \bar{X})(\bar{X} - \mu)}{\sigma^2} \\
&= \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma} \right)^2 + \sum_{i=1}^{n} \left( \frac{\bar{X} - \mu}{\sigma} \right)^2 + 2 \, \frac{\bar{X} - \mu}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X}) \; . \qquad (7)
\end{split}$$

Because the following sum is zero,

$$\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n \bar{X} = \sum_{i=1}^{n} X_i - n \cdot \frac{1}{n} \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i = 0 \; , \qquad (8)$$

the third term disappears, i.e.

$$\sum_{i=1}^{n} U_i^2 = \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma} \right)^2 + \sum_{i=1}^{n} \left( \frac{\bar{X} - \mu}{\sigma} \right)^2 \; . \qquad (9)$$
Cochran's theorem states that, if a sum of squared standard normal (→ II/3.2.3) random variables (→ I/1.2.2) can be written as a sum of quadratic forms

$$\sum_{i=1}^{n} U_i^2 = \sum_{j=1}^{m} Q_j \quad \text{where} \quad Q_j = \sum_{k=1}^{n} \sum_{l=1}^{n} U_k B_{kl}^{(j)} U_l \quad \text{with} \quad \sum_{j=1}^{m} B^{(j)} = I_n \quad \text{and} \quad r_j = \mathrm{rank}\left(B^{(j)}\right) \; , \qquad (10)$$

then the $Q_j$ are independent random variables and each $Q_j$ follows a chi-squared distribution with $r_j$ degrees of freedom, provided that $\sum_{j=1}^{m} r_j = n$.
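The decomposition in equations (7)–(9) can also be checked numerically. The following is a minimal simulation sketch (not part of the original text), assuming NumPy is available; the sample size, parameters, and seed are arbitrary illustration values. The two terms of (9) correspond to quadratic forms of rank $n - 1$ and $1$, so their simulated means should be close to $n - 1$ and $1$, the expectations of the respective chi-squared distributions.

```python
import numpy as np

# Monte Carlo check of the decomposition in equations (7)-(9).
# sum(U_i^2) splits into Q1 = sum(((X_i - Xbar)/sigma)^2), which should
# follow chi^2(n-1), and Q2 = n*((Xbar - mu)/sigma)^2, which should
# follow chi^2(1). All parameter values below are arbitrary.
rng = np.random.default_rng(42)
n, mu, sigma, reps = 10, 3.0, 2.0, 100_000

X = rng.normal(mu, sigma, size=(reps, n))
Xbar = X.mean(axis=1, keepdims=True)

Q1 = ((X - Xbar) ** 2).sum(axis=1) / sigma ** 2   # first term of (9)
Q2 = n * ((Xbar[:, 0] - mu) / sigma) ** 2         # second term of (9)
U2 = (((X - mu) / sigma) ** 2).sum(axis=1)        # left-hand side of (9)

assert np.allclose(U2, Q1 + Q2)   # the identity holds for every sample

print(Q1.mean())  # close to n - 1 = 9, i.e. E[chi^2(n-1)]
print(Q2.mean())  # close to 1, i.e. E[chi^2(1)]
```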



$$Y \sim \mathrm{Bin}(n, pq) \; . \qquad (3)$$

Proof: We are interested in the probability that $Y$ equals a number $m$. According to the law of marginal probability (→ I/1.3.3) or the law of total probability (→ I/1.4.7), this probability can be expressed as:

$$\Pr(Y = m) = \sum_{k=0}^{\infty} \Pr(Y = m \mid X = k) \cdot \Pr(X = k) \; . \qquad (4)$$

Since, by definitions (2) and (1), Pr(X = k) = 0 when k > n and Pr(Y = m|X = k) = 0 when
k < m, we have:
$$\Pr(Y = m) = \sum_{k=m}^{n} \Pr(Y = m \mid X = k) \cdot \Pr(X = k) \; . \qquad (5)$$

Now we can take the probability mass function of the binomial distribution (→ II/1.3.2) and plug it
in for the terms in the sum of (5) to get:
$$\Pr(Y = m) = \sum_{k=m}^{n} \binom{k}{m} q^m (1 - q)^{k-m} \cdot \binom{n}{k} p^k (1 - p)^{n-k} \; . \qquad (6)$$
Applying the binomial coefficient identity

$$\binom{n}{k} \binom{k}{m} = \binom{n}{m} \binom{n-m}{k-m}$$

and rearranging the terms, we have:

$$\Pr(Y = m) = \sum_{k=m}^{n} \binom{n}{m} \binom{n-m}{k-m} p^k q^m (1 - p)^{n-k} (1 - q)^{k-m} \; . \qquad (7)$$

Now we partition $p^k = p^m \cdot p^{k-m}$ and pull all terms independent of $k$ out of the sum:

$$\begin{split}
\Pr(Y = m) &= \binom{n}{m} p^m q^m \sum_{k=m}^{n} \binom{n-m}{k-m} p^{k-m} (1 - p)^{n-k} (1 - q)^{k-m} \\
&= \binom{n}{m} (pq)^m \sum_{k=m}^{n} \binom{n-m}{k-m} \left( p (1 - q) \right)^{k-m} (1 - p)^{n-k} \; . \qquad (8)
\end{split}$$

Then we substitute $i = k - m$, such that $k = i + m$:

$$\Pr(Y = m) = \binom{n}{m} (pq)^m \sum_{i=0}^{n-m} \binom{n-m}{i} (p - pq)^i (1 - p)^{n-m-i} \; . \qquad (9)$$
According to the binomial theorem

$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k \; , \qquad (10)$$
the sum in equation (9) is equal to

$$\sum_{i=0}^{n-m} \binom{n-m}{i} (p - pq)^i (1 - p)^{n-m-i} = \left( (p - pq) + (1 - p) \right)^{n-m} \; . \qquad (11)$$

Since $(p - pq) + (1 - p) = 1 - pq$, this yields

$$\Pr(Y = m) = \binom{n}{m} (pq)^m (1 - pq)^{n-m} \; ,$$

which is the probability mass function of the binomial distribution $\mathrm{Bin}(n, pq)$, completing the proof of (3).
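As a sanity check on this result (not part of the original text), the two-stage sampling scheme can be simulated and the empirical distribution of $Y$ compared with the $\mathrm{Bin}(n, pq)$ probability mass function. This sketch assumes NumPy and SciPy are available; the values of $n$, $p$, $q$, and the seed are arbitrary.

```python
import numpy as np
from scipy import stats

# Simulate X ~ Bin(n, p) and Y | X ~ Bin(X, q), then compare the
# empirical distribution of Y with the Bin(n, p*q) pmf from (3).
rng = np.random.default_rng(0)
n, p, q, reps = 12, 0.7, 0.4, 200_000

X = rng.binomial(n, p, size=reps)
Y = rng.binomial(X, q)   # second-stage "thinning" of the successes

for m in range(5):
    empirical = np.mean(Y == m)
    exact = stats.binom.pmf(m, n, p * q)
    print(f"m={m}: empirical={empirical:.4f}, Bin(n, pq) pmf={exact:.4f}")
```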

1.2.8 Log model evidence

Theorem: Let

$$m : \; y = \{y_1, \ldots, y_n\} \; , \quad y_i \sim \mathcal{N}(\mu, \sigma^2) \; , \quad i = 1, \ldots, n \qquad (1)$$

be a univariate Gaussian data set (→ III/1.2.1) with unknown mean $\mu$ and known variance $\sigma^2$. Moreover, assume a normal distribution (→ III/1.2.6) over the model parameter $\mu$:

$$p(\mu) = \mathcal{N}(\mu; \mu_0, \lambda_0^{-1}) \; . \qquad (2)$$

Then, the log model evidence (→ IV/??) for this model is

$$\log p(y \mid m) = \frac{n}{2} \log \frac{\tau}{2\pi} + \frac{1}{2} \log \frac{\lambda_0}{\lambda_n} - \frac{1}{2} \left( \tau y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \; , \qquad (3)$$

where the posterior hyperparameters (→ I/5.1.7) are given by

$$\begin{split}
\mu_n &= \frac{\lambda_0 \mu_0 + \tau n \bar{y}}{\lambda_0 + \tau n} \\
\lambda_n &= \lambda_0 + \tau n
\end{split} \qquad (4)$$

with the sample mean (→ I/1.10.2) $\bar{y}$ and the inverse variance or precision (→ I/1.11.12) $\tau = 1/\sigma^2$.
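Before turning to the proof, the closed form (3)–(4) can be verified numerically against brute-force integration of the marginal likelihood. The sketch below (not part of the original text) assumes NumPy and SciPy are available; the data, prior hyperparameters, and seed are arbitrary illustration values.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Compare the closed-form log model evidence (3)-(4) with direct
# numerical integration of p(y|m) = integral of p(y|mu) p(mu) dmu.
rng = np.random.default_rng(1)
n, sigma, mu0, lam0 = 5, 1.5, 0.0, 0.25
tau = 1 / sigma ** 2
y = rng.normal(2.0, sigma, size=n)
ybar = y.mean()

# posterior hyperparameters, equation (4)
lam_n = lam0 + tau * n
mu_n = (lam0 * mu0 + tau * n * ybar) / lam_n

# closed form, equation (3)
lme = (n / 2 * np.log(tau / (2 * np.pi))
       + 0.5 * np.log(lam0 / lam_n)
       - 0.5 * (tau * (y @ y) + lam0 * mu0 ** 2 - lam_n * mu_n ** 2))

# brute-force marginal likelihood p(y|m)
def integrand(m_):
    return (np.exp(stats.norm.logpdf(y, m_, sigma).sum())
            * stats.norm.pdf(m_, mu0, 1 / np.sqrt(lam0)))

evidence, _ = quad(integrand, -20.0, 20.0, points=[mu_n])
print(lme, np.log(evidence))  # the two numbers should agree closely
```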

Proof: According to the law of marginal probability (→ I/1.3.3), the model evidence (→ I/5.1.11)
for this model is:
$$p(y \mid m) = \int p(y \mid \mu) \, p(\mu) \, d\mu \; . \qquad (5)$$

According to the law of conditional probability (→ I/1.3.4), the integrand is equivalent to the joint
likelihood (→ I/5.1.5):
$$p(y \mid m) = \int p(y, \mu) \, d\mu \; . \qquad (6)$$

Equation (1) implies the following likelihood function (→ I/5.1.2)

$$\begin{split}
p(y \mid \mu, \sigma^2) &= \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \sigma^2) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi} \, \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right] \\
&= \left( \sqrt{\frac{1}{2\pi\sigma^2}} \right)^n \cdot \exp \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right] \qquad (7)
\end{split}$$

which, for mathematical convenience, can also be parametrized as



$$\begin{split}
p(y \mid \mu, \tau) &= \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \tau^{-1}) \\
&= \prod_{i=1}^{n} \sqrt{\frac{\tau}{2\pi}} \cdot \exp \left[ -\frac{\tau}{2} (y_i - \mu)^2 \right] \\
&= \left( \sqrt{\frac{\tau}{2\pi}} \right)^n \cdot \exp \left[ -\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right] \qquad (8)
\end{split}$$

using the inverse variance or precision $\tau = 1/\sigma^2$.

Separating constant and variable terms, we have:

$$p(y \mid \mu, \tau) = \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp \left[ -\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right] \; . \qquad (9)$$
Expanding the product in the exponent, we have

$$\begin{split}
p(y \mid \mu, \tau) &= \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp \left[ -\frac{\tau}{2} \sum_{i=1}^{n} \left( y_i^2 - 2 \mu y_i + \mu^2 \right) \right] \\
&= \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp \left[ -\frac{\tau}{2} \left( \sum_{i=1}^{n} y_i^2 - 2 \mu \sum_{i=1}^{n} y_i + n \mu^2 \right) \right] \\
&= \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp \left[ -\frac{\tau}{2} \left( y^\mathrm{T} y - 2 \mu n \bar{y} + n \mu^2 \right) \right] \\
&= \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp \left[ -\frac{\tau n}{2} \left( \frac{1}{n} y^\mathrm{T} y - 2 \mu \bar{y} + \mu^2 \right) \right] \qquad (10)
\end{split}$$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the mean of the data points and $y^\mathrm{T} y = \sum_{i=1}^{n} y_i^2$ is the sum of squared data points.
Completing the square over $\mu$ finally gives

$$p(y \mid \mu, \tau) = \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp \left[ -\frac{\tau n}{2} \left( (\mu - \bar{y})^2 - \bar{y}^2 + \frac{1}{n} y^\mathrm{T} y \right) \right] \; . \qquad (11)$$
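As a quick numerical confirmation (not in the original text), the completed-square form (11) can be compared with the product form (8) for arbitrary data and a grid of values of $\mu$; the two expressions should coincide exactly. The sketch assumes NumPy; all values are arbitrary.

```python
import numpy as np

# Check that the completed square in (11) reproduces the likelihood (8).
rng = np.random.default_rng(7)
n, sigma = 8, 1.3
tau = 1 / sigma ** 2
y = rng.normal(0.5, sigma, size=n)
ybar, yTy = y.mean(), y @ y

for mu in np.linspace(-2.0, 3.0, 5):
    # product form, equation (8)
    direct = (tau / (2 * np.pi)) ** (n / 2) \
        * np.exp(-tau / 2 * ((y - mu) ** 2).sum())
    # completed-square form, equation (11)
    squared = np.sqrt(1 / (2 * np.pi) ** n) * tau ** (n / 2) \
        * np.exp(-tau * n / 2 * ((mu - ybar) ** 2 - ybar ** 2 + yTy / n))
    assert np.isclose(direct, squared)

print("equations (8) and (11) agree on all test points")
```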
