Lecture3_statistics
Menu
1. Conjugate Priors
Conjugate Priors
• so the posterior given data with n₁ ones and n₀ zeroes has the form …
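As a concrete sketch (my own illustration, not from the slides): for Bernoulli data the Beta family is conjugate, so a Beta(a, b) prior combined with data containing n₁ ones and n₀ zeroes gives a Beta(a + n₁, b + n₀) posterior.

```python
def posterior_params(a, b, xs):
    """Beta(a, b) prior + Bernoulli observations -> Beta(a + n1, b + n0) posterior."""
    n1 = sum(xs)
    n0 = len(xs) - n1
    return a + n1, b + n0

# uniform prior Beta(1, 1) and data with three ones, one zero
a_post, b_post = posterior_params(1, 1, [1, 1, 0, 1])
print(a_post, b_post)              # -> 4 2
print(a_post / (a_post + b_post))  # posterior mean: (n1 + 1)/(n + 2) = 2/3
```

With the uniform Beta(1, 1) prior, the posterior mean is exactly the Laplace rule (n₁ + 1)/(n + 2) that reappears later in these slides.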
Laplace Approximation
Bayes Marginal Likelihood
For all full k-dimensional exponential families, in mean-value, canonical or standard parameterization, we have, with θ̂_n ∈ Θ the MLE based on x^n and |⋅| the determinant:

$$-\log p_W(x^n) = -\log p_{\hat\theta_n}(x^n) + \frac{k}{2}\log\frac{n}{2\pi} - \log w(\hat\theta_n) + \frac{1}{2}\log|I(\hat\theta_n)| + o(1)$$

uniformly for all x₁, x₂, … with, for all large n, θ̂_n in a compact subset of the parameter space, where we assume the prior density w is continuous and > 0.
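The approximation is easy to check numerically. The sketch below is my own illustration (not from the slides), assuming the Bernoulli model with a uniform prior, so k = 1, w ≡ 1 and I(θ) = 1/(θ(1−θ)); the exact Bayes marginal likelihood is a Beta function, and the difference from the Laplace formula is the o(1) term.

```python
import math

def exact_neg_log_marginal(n1, n0):
    """Exact -log p_W(x^n) with uniform prior on Bernoulli: p_W(x^n) = B(n1+1, n0+1)."""
    return -(math.lgamma(n1 + 1) + math.lgamma(n0 + 1) - math.lgamma(n1 + n0 + 2))

def laplace_neg_log_marginal(n1, n0):
    """Right-hand side of the Laplace approximation, without the o(1) term."""
    n = n1 + n0
    th = n1 / n                          # MLE
    neg_log_ml = -(n1 * math.log(th) + n0 * math.log(1 - th))
    fisher = 1.0 / (th * (1 - th))       # Fisher information of the Bernoulli model
    k = 1
    return neg_log_ml + (k / 2) * math.log(n / (2 * math.pi)) + 0.5 * math.log(fisher)

for n in (10, 100, 1000):
    n1 = int(0.6 * n)
    gap = exact_neg_log_marginal(n1, n - n1) - laplace_neg_log_marginal(n1, n - n1)
    print(n, round(gap, 5))   # the gap shrinks toward 0 as n grows
```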
• Recall the “Bayes Factor” method for hypothesis testing: evidence for H₁ against H₀ is given by p_{W₁}(X^n)/p_{W₀}(X^n)
• More generally, the Bayes Factor method for model selection tells us to pick, from a finite or countable set of models H_γ = {P_θ : θ ∈ Θ_γ}, γ ∈ Γ, the one maximizing the Bayes marginal likelihood:

$$\arg\max_{\gamma\in\Gamma}\; p_{W_\gamma}(X^n) = \arg\min_{\gamma\in\Gamma}\; -\log p_{W_\gamma}(X^n)$$

• Laplace-approximating each p_{W_γ} and ignoring terms that do not depend on n, we get the celebrated BIC: take

$$\arg\min_{\gamma\in\Gamma}\; -\log p_{\hat\theta_{\gamma,n}}(X^n) + \frac{k_\gamma}{2}\log n$$
BIC (Bayesian Information Criterion)
• BIC: take

$$\arg\min_{\gamma\in\Gamma}\; -\log p_{\hat\theta_{\gamma,n}}(X^n) + \frac{k_\gamma}{2}\log n$$
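A minimal sketch of BIC model selection (my own illustration; the counts are made up): choosing between the 0-parameter null model Ber(1/2) and the full 1-parameter Bernoulli model.

```python
import math

def bic_score(neg_log_lik, k, n):
    """BIC: -log p_mle(x^n) + (k/2) log n; smaller is better."""
    return neg_log_lik + 0.5 * k * math.log(n)

def neg_log_lik_bernoulli(n1, n, theta):
    return -(n1 * math.log(theta) + (n - n1) * math.log(1 - theta))

n, n1 = 200, 130   # hypothetical sample: 130 ones out of 200
null = bic_score(neg_log_lik_bernoulli(n1, n, 0.5), k=0, n=n)      # H0: theta = 1/2 fixed
alt = bic_score(neg_log_lik_bernoulli(n1, n, n1 / n), k=1, n=n)    # full model, MLE plugged in
print(null, alt)   # alt < null here: BIC selects the 1-parameter model
```

The (k_γ/2)·log n penalty is what keeps the richer model from winning automatically: with n₁ closer to n/2, the null would score lower.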
Recall: the Bayes factor with simple null defines a test martingale, which we can rewrite as a product of conditional e-variables:

$$\frac{p_W(X^n)}{p_0(X^n)} = \prod_{i=1}^n S_i \quad\text{with}\quad S_i = \frac{p_{W|X^{i-1}}(X_i)}{p_0(X_i)} \quad\text{and}\quad w(\theta \mid X^{i-1}) = \frac{p_\theta(X^{i-1})\, w(\theta)}{\int p_\theta(X^{i-1})\, w(\theta)\, d\theta}$$

…and we can think of each component as the factor by which our wealth changes in the i-th round, in a sequential betting game in which we would not expect to gain money if the null were true.
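The telescoping identity above can be verified numerically. In this sketch (my own illustration, assuming a Beta(a, b) prior over Bernoulli θ and null p₀ = Ber(1/2)), the running product of conditional e-variables equals the Bayes factor computed directly from the marginal likelihood.

```python
import math

def beta_marginal(xs, a=1.0, b=1.0):
    """p_W(x^n) under a Beta(a, b) mixture of Bernoullis: B(a+n1, b+n0)/B(a, b)."""
    n1 = sum(xs); n0 = len(xs) - n1
    logB = lambda p, q: math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    return math.exp(logB(a + n1, b + n0) - logB(a, b))

def wealth_by_rounds(xs, a=1.0, b=1.0):
    """Product of conditional e-variables S_i = p_{W|past}(x_i) / p_0(x_i), p_0 = Ber(1/2)."""
    wealth, n1, n = 1.0, 0, 0
    for x in xs:
        pred = (a + n1) / (a + b + n) if x == 1 else (b + (n - n1)) / (a + b + n)
        wealth *= pred / 0.5
        n1 += x; n += 1
    return wealth

xs = [1, 1, 0, 1, 1, 1, 0, 1]
direct = beta_marginal(xs) / 0.5 ** len(xs)
print(direct, wealth_by_rounds(xs))   # the two coincide
```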
best worst-case sequential betting strategy

Recall: the Bayes factor with simple null defines a test martingale, which we can rewrite as a product of conditional e-variables; in the i.i.d. case:

$$S^{(n)}_\theta := \frac{p_\theta(X^n)}{p_0(X^n)} = \prod_{i=1}^n S_{i,\theta} \quad\text{with}\quad S_{i,\theta} = \frac{p_\theta(X_i)}{p_0(X_i)}$$

$$S^{(n)}_W := \frac{p_W(X^n)}{p_0(X^n)} = \prod_{i=1}^n S_{i,W} \quad\text{with}\quad S_{i,W} = \frac{p_{W|X^{i-1}}(X_i)}{p_0(X_i)}$$

S_i: the factor by which our wealth changes in the i-th round, in a sequential betting game in which we would not expect to gain money if the null were true.

With hindsight, the best strategy, the one that would have made us richest among all S^{(n)}_θ and even all S^{(n)}_W, is S^{(n)}_{θ̂_n}. Unfortunately we can’t play this strategy! (why?)
Jeffreys’ prior:
best worst-case sequential betting strategy

Recall: the Bayes factor with simple null defines a test martingale, which we can rewrite as a product of conditional e-variables; in the i.i.d. case:

$$S^{(n)}_W := \frac{p_W(X^n)}{p_0(X^n)} = \prod_{i=1}^n S_{i,W} \quad\text{with}\quad S_{i,W} = \frac{p_{W|X^{i-1}}(X_i)}{p_0(X_i)}$$

With hindsight, the best strategy, the one that would have made us richest among all S^{(n)}_θ and even all S^{(n)}_W, is S^{(n)}_{θ̂_n}.

$$-\log p_W(x^n) = -\log p_{\hat\theta_n}(x^n) + \frac{k}{2}\log\frac{n}{2\pi} - \log w(\hat\theta_n) + \frac{1}{2}\log|I(\hat\theta_n)| + o(1)$$

uniformly for all x₁, x₂, … with, for all large n, θ̂_n in a compact subset of the parameter space, i.e. for any compact C ⊂ int(Θ):

$$\sup_{x^n :\, \hat\theta_n \in C}\; \log\frac{p_{\hat\theta_n}(x^n)}{p_W(x^n)} \;\to\; \frac{k}{2}\log\frac{n}{2\pi} - \log\frac{w(\hat\theta_n)}{\sqrt{|I(\hat\theta_n)|}}$$
Jeffreys prior gives the
optimal relative betting strategy

$$\sup_{x^n :\, \hat\theta_n \in C}\; \log\frac{p_{\hat\theta_n}(x^n)}{p_W(x^n)} \;\to\; \frac{k}{2}\log\frac{n}{2\pi} - \log\frac{w(\hat\theta_n)}{\sqrt{|I(\hat\theta_n)|}}$$

so if we plug in the Jeffreys prior W_J on C for w, we find:

$$\log\frac{p_{\hat\theta_n}(x^n)}{p_{W_J}(x^n)} \;\to\; \frac{k}{2}\log\frac{n}{2\pi} + \log \int_C \sqrt{|I(\theta)|}\, d\theta$$
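For the Bernoulli model (my own illustration) the constant in the last display is explicit: √I(θ) = 1/√(θ(1−θ)) integrates to π over (0, 1), so the Jeffreys-prior strategy trails the hindsight-best S_{θ̂} by only (1/2)·log(n/2π) + log π.

```python
import math

# Bernoulli Fisher information: I(theta) = 1/(theta * (1 - theta))
sqrt_I = lambda t: (t * (1 - t)) ** -0.5

# midpoint rule on (0, 1); the endpoint singularities are integrable,
# and the midpoint rule never evaluates the integrand at 0 or 1
N = 500_000
integral = sum(sqrt_I((i + 0.5) / N) for i in range(N)) / N
print(integral, math.pi)   # the integral of sqrt(I) over (0, 1) equals pi
```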
Wilks’ Theorem/Phenomenon
Composite H₁

• If you think H₀ is wrong, but you do not know which alternative is true, then… you can try to learn p₁
• Use a p̄₁ that better and better mimics the true, or just “best”, fixed p₁

Example: H₀: X_i ∼ Ber(1/2), H₁: X_i ∼ Ber(θ), θ ≠ 1/2. Set:

$$\bar p_1(X_{n+1} = 1 \mid x^n) := \frac{n_1 + 1}{n + 2}, \quad\text{where } n_1 \text{ is the number of 1s in } x^n$$

…we use notation for conditional probabilities, but we should really think of p̄₁ as a sequential betting strategy, with the “conditional probabilities” indicating how to bet/invest in the next round, given the past data.
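A sketch of this betting strategy in action (my own illustration; both data sequences are made up): against data that really are biased, the plug-in p̄₁ accumulates log-wealth against the null, while against fair-coin-like data it loses only a little.

```python
import math

def laplace_rule(n1, n):
    """The slide's plug-in predictor: pbar_1(X_{n+1} = 1 | x^n) = (n1 + 1)/(n + 2)."""
    return (n1 + 1) / (n + 2)

def log_wealth(xs):
    """log of prod_i pbar_1(x_i | past) / p_0(x_i), with null p_0 = Ber(1/2)."""
    lw, n1 = 0.0, 0
    for n, x in enumerate(xs):
        p = laplace_rule(n1, n)
        lw += math.log((p if x == 1 else 1 - p) / 0.5)
        n1 += x
    return lw

biased = [1, 1, 1, 1, 0] * 40   # 200 outcomes, 80% ones
fair = [1, 0] * 100             # 200 outcomes, 50% ones
print(log_wealth(biased), log_wealth(fair))  # clearly positive vs. slightly negative
```

The small loss on fair data is exactly the logarithmic regret term discussed on the next slides.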
Composite H₁ and The Oracle

If you had access to an oracle that would tell you “If H₁ is true, then the data come from this-and-this P_θ with θ ∈ Θ₁”, then you would like to use the GRO e-process S^{(n)}_θ = p_θ(X^n)/p_0(X^n).

Sensible goal: use an e-process S that is not much worse than the oracle’s.
S^{(n)}_W will have this property!
Laplace+Wilks

$$\mathbf{E}_{P_\theta}\!\left[-\log p_W(X^n)\right] = \mathbf{E}_{P_\theta}\!\left[-\log p_\theta(X^n)\right] + \frac{k}{2}\log\frac{n}{2\pi} - \log w(\theta) + \frac{1}{2}\log|I(\theta)| - \frac{k}{2} + o(1)$$

so

$$\mathbf{E}_{P_\theta}\!\left[\log S^{(n)}_W\right] = \mathbf{E}_{P_\theta}\!\left[\log \frac{p_\theta(X^n)}{p_0(X^n)}\right] - \frac{k}{2}\log\frac{n}{2\pi} + \log w(\theta) - \frac{1}{2}\log|I(\theta)| + \frac{k}{2} + o(1)$$

…within (k/2)·log n of optimal: pretty good, since E_{P_θ}[log(p_θ(X^n)/p_0(X^n))] is linear in n for i.i.d. data.
Laplace+Wilks, details

Prequential plug-in methods typically also lose a (k/2)·log n + O(1) term
(but they often perform a bit worse in practice)
Anytime-Valid Confidence Intervals
revisited

AV CIs vs. Bayesian Credible Sets
… always wider than the Bayes credible posterior interval based on the same prior
Priors optimal for anticipated n∗

$$\theta_1 = \theta - \sqrt{\frac{\log n^*}{2\, I(\theta)\, n^*}} = \theta - \sqrt{\frac{\log n^*}{2\, n^*}} \cdot \frac{1}{\sqrt{I(\theta)}}$$