
Foundations of Statistics and Machine Learning:

testing and uncertainty quantification with e-values


(and their link to likelihood, betting)

Menu

1. Conjugate Priors

2. Laplace Approximation of the Bayes marginal likelihood; BIC

3. Using the Laplace Approximation to quantify growth (“e-power”) for simple null and composite alternative
Conjugate Priors

• Informally, a family of priors $\{ \pi_\rho : \rho \in R \}$ for an exponential family $\mathcal{M}$ is conjugate if the prior and the posterior have the same functional form. This greatly facilitates computations.
• Rather than formalizing this we will give a few examples.
• The family of beta distributions is conjugate to the Bernoulli model
• The normal location family is conjugate to the normal location family
(‘self-conjugacy’)
• The inverse Gamma family is conjugate to the normal scale family
Beta Bernoulli Conjugacy

The Beta distribution with parameters $(\alpha, \beta)$ is the distribution on $[0,1]$ with density

$\pi_{\alpha,\beta}(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$

• so the posterior given data with $n_1$ ones and $n_0$ zeroes has the form

$\pi_{\alpha,\beta}(\theta \mid x^n) \propto \theta^{n_1}(1-\theta)^{n_0}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$

…so it is a Beta distribution with parameters $(\alpha + n_1, \beta + n_0)$

Uniform prior: $\pi_{1,1}$. Jeffreys prior: $\pi_{1/2,1/2}$. Haldane prior: $\pi_{0,0}$.
Beta Bernoulli Conjugacy

The Beta distribution with parameters $(\alpha, \beta)$ is the distribution on $[0,1]$ with density

$\pi_{\alpha,\beta}(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$

• Uniform prior is $\pi_{1,1}$. Jeffreys prior is $\pi_{1/2,1/2}$.

• Question for you: why the $-1$ in the definition?


Beta Bernoulli Conjugacy

The Beta distribution with parameters $(\alpha, \beta)$ is the distribution on $[0,1]$ with density

$\pi_{\alpha,\beta}(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$

• Uniform prior is $\pi_{1,1}$. Jeffreys prior is $\pi_{1/2,1/2}$.

• Question for you: why the $-1$ in the definition?

• Answer: nice form of the predictive distribution: $p(X_{n+1} = 1 \mid X^n) = \dfrac{n_1 + \alpha}{n + \alpha + \beta}$

Jeffreys prior: an alternative to Laplace's rule of succession!
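A minimal sketch of this conjugate update and predictive formula (my own illustration, not from the slides; the helper names are made up):

```python
import numpy as np

def beta_bernoulli_posterior(x, alpha=0.5, beta=0.5):
    """Beta(alpha, beta) prior + Bernoulli data x  ->  Beta(alpha + n1, beta + n0) posterior."""
    x = np.asarray(x)
    n1 = int(x.sum())            # number of ones
    n0 = x.size - n1             # number of zeros
    return alpha + n1, beta + n0

def predictive_prob_one(x, alpha=0.5, beta=0.5):
    """p(X_{n+1} = 1 | x^n) = (n1 + alpha) / (n + alpha + beta)."""
    a_post, b_post = beta_bernoulli_posterior(x, alpha, beta)
    return a_post / (a_post + b_post)

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=20)
print(beta_bernoulli_posterior(x, 1, 1))    # uniform prior (Laplace's rule of succession)
print(predictive_prob_one(x))               # Jeffreys prior (alpha = beta = 1/2)
```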


Conjugacy of Normal

The normal location family is self-conjugate:

Let $\mathcal{M} = \{ p_\mu : \mu \in \mathbb{R} \}$ be the family of normal densities with mean $\mu$ and fixed variance $\sigma^2$, and let

$w(\mu) \propto \exp\!\left(-\frac{(\mu - \mu_0)^2}{2\rho_1^2}\right)$

be the density of a normal with mean $\mu_0$ and variance $\rho_1^2 = \sigma^2/k$. Then the Bayes posterior is normal:

$w(\mu \mid X^n) \propto e^{-\frac{\sum_{i=1..n}(X_i - \mu)^2}{2\sigma^2} - \frac{k(\mu_0 - \mu)^2}{2\sigma^2}} \propto e^{-\frac{(\mu - \bar\mu)^2}{2\sigma^2_{n+k}}}$

with $\bar\mu$ a $(k,n)$-weighted average of $\mu_0$ and the sample mean, and $\sigma^2_{n+k} = \sigma^2/(n+k)$

…again $k$ counts additional ‘virtual’ data points
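A matching sketch for the self-conjugate normal location family (again my own illustration, not part of the slides), with the prior variance written as $\sigma^2/k$ so that $k$ plays the role of the ‘virtual’ data points:

```python
import numpy as np

def normal_location_posterior(x, mu0, k, sigma2=1.0):
    """Data X_i ~ N(mu, sigma2), prior mu ~ N(mu0, sigma2 / k).

    Returns the posterior mean (a (k, n)-weighted average of mu0 and the sample
    mean) and the posterior variance sigma2 / (n + k).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    post_mean = (k * mu0 + n * x.mean()) / (k + n)
    post_var = sigma2 / (n + k)
    return post_mean, post_var

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50)
print(normal_location_posterior(x, mu0=0.0, k=1))
```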


Proper, Jeffreys, Conjugate

• Jeffreys prior: sometimes, but not always, improper
  • proper for Bernoulli; improper for Gaussian, exponential, geometric, Poisson, …

• Jeffreys prior: usually, but not always, conjugate
  • conjugate for Bernoulli, Gaussian location, Gaussian scale
    (not conjugate for some multivariate models or “truncated” 1-dim models such as exponential distributions restricted to [0,1])

• Improper priors: sometimes, but not always, conjugate
  • Haldane prior: conjugate + improper for Bernoulli
  • Jeffreys prior: conjugate + improper for Gaussian location/scale
Menu

1. Conjugate Priors

2. Laplace Approximation of the Bayes marginal likelihood; BIC

3. Using the Laplace Approximation to quantify growth (“e-power”) for simple null and composite alternative
Laplace Approximation
Bayes Marginal Likelihood
For all full $k$-dimensional exponential families, in the mean-value, canonical or standard parameterization, we have, with $\hat\theta_n \in \Theta$ the MLE based on $x^n$ and $|\cdot|$ the determinant:

$-\log p_W(x^n) = -\log p_{\hat\theta_n}(x^n) + \frac{k}{2}\log\frac{n}{2\pi} - \log w(\hat\theta_n) + \frac{1}{2}\log|I(\hat\theta_n)| + o(1)$

uniformly for all $x_1, x_2, \ldots$ with, for all large $n$, $\hat\theta_n$ in a compact subset of the parameter space, where we assume the prior density $w$ is continuous and $> 0$.

• Note: clearly $-\log p_W(x^n) + \log p_{\hat\theta_n}(x^n)$ must be positive. This result quantifies by how much. DERIVATION ON BLACKBOARD
Laplace Approximation
Bayes Marginal Likelihood
For all full exponential families, in the mean-value, canonical or standard parameterization, we have, with $\hat\theta_n \in \Theta$ the MLE based on $x^n$ and $|\cdot|$ the determinant:

$-\log p_W(x^n) = -\log p_{\hat\theta_n}(x^n) + \frac{k}{2}\log\frac{n}{2\pi} - \log w(\hat\theta_n) + \frac{1}{2}\log|I(\hat\theta_n)| + o(1)$

uniformly for all $x_1, x_2, \ldots$ with $\hat\theta_n$ in a compact subset of $\Theta$ for large $n$.

By this we mean, with $LHS$ denoting the left-hand side and $RHS$ the right-hand side: for any compact $C \subset \mathrm{int}(\Theta)$,

$\sup_{x^n :\, \hat\theta_n \in C} |LHS_n - RHS_n| \to 0$
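As a quick numerical sanity check of this approximation (my own sketch, not from the lecture), one can compare the exact Bernoulli marginal likelihood under a Beta(a, b) prior with the Laplace formula above, using $w$ = the Beta density and Fisher information $I(\theta) = 1/(\theta(1-\theta))$:

```python
import numpy as np
from scipy.special import betaln
from scipy.stats import beta as beta_dist

def exact_neg_log_marginal(n1, n0, a=1.0, b=1.0):
    """Exact -log p_W(x^n) for a specific Bernoulli sequence under a Beta(a, b) prior."""
    return -(betaln(a + n1, b + n0) - betaln(a, b))

def laplace_neg_log_marginal(n1, n0, a=1.0, b=1.0):
    """-log p_MLE(x^n) + (k/2) log(n/2pi) - log w(theta_hat) + 0.5 log |I(theta_hat)|, k = 1."""
    n = n1 + n0
    th = n1 / n                                    # MLE
    neg_log_mle = -(n1 * np.log(th) + n0 * np.log(1 - th))
    fisher = 1.0 / (th * (1 - th))                 # Fisher information of one Bernoulli outcome
    return (neg_log_mle + 0.5 * np.log(n / (2 * np.pi))
            - beta_dist.logpdf(th, a, b) + 0.5 * np.log(fisher))

for n in (20, 100, 1000):
    n1 = int(0.3 * n)
    print(n, round(exact_neg_log_marginal(n1, n - n1), 3),
          round(laplace_neg_log_marginal(n1, n - n1), 3))   # difference is o(1) in n
```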
BIC (Bayesian Information Criterion)

• Recall the “Bayes Factor” method for hypothesis testing: evidence for $H_1$ against $H_0$ is given by $p_{W_1}(X^n)/p_{W_0}(X^n)$

• More generally, the Bayes Factor method for model selection tells us to pick, from a finite or countable set of models $H_\gamma = \{P_\theta : \theta \in \Theta_\gamma\}$, $\gamma \in \Gamma$, the one maximizing the Bayes marginal likelihood:

$\arg\max_{\gamma \in \Gamma}\; p_{W_\gamma}(X^n) = \arg\min_{\gamma \in \Gamma}\; -\log p_{W_\gamma}(X^n)$

• Laplace-approximating each $p_{W_\gamma}$ and ignoring terms that do not depend on $n$, we get the celebrated BIC: take

$\arg\min_{\gamma \in \Gamma}\; -\log p_{\hat\theta_{\gamma,n}}(X^n) + \frac{k_\gamma}{2}\log n$
BIC (Bayesian Information Criterion)

• BIC: take

$\arg\min_{\gamma \in \Gamma}\; -\log p_{\hat\theta_{\gamma,n}}(X^n) + \frac{k_\gamma}{2}\log n$

This incorporates some very crude form of Ockham’s Razor...

• It has been used 10000s of times in the literature. I recommend against it though, even with a Bayesian hat on… it ignores very important $O(1)$ terms
(BLACKBOARD: SICK 2-SAMPLE BERNOULLI MODEL)

Note that in general BIC does not have an e-process interpretation, except when comparing two models and one has 0 parameters (simple null)
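For concreteness, a minimal sketch of the BIC criterion on a two-sample Bernoulli problem (my own illustration; the “sick” blackboard example itself is not reproduced here): model 0 has one shared success probability (k = 1), model 1 has one probability per group (k = 2).

```python
import numpy as np

def bernoulli_max_loglik(x):
    """Maximized log-likelihood of an i.i.d. Bernoulli model for binary data x."""
    x = np.asarray(x)
    n, n1 = x.size, int(x.sum())
    th = n1 / n
    if th == 0.0 or th == 1.0:      # degenerate MLE: all zeros or all ones
        return 0.0
    return n1 * np.log(th) + (n - n1) * np.log(1 - th)

def bic_score(max_loglik, k, n):
    """BIC: -log p_MLE(x^n) + (k/2) log n  (smaller is better)."""
    return -max_loglik + 0.5 * k * np.log(n)

rng = np.random.default_rng(2)
xa = rng.binomial(1, 0.50, size=200)     # group A
xb = rng.binomial(1, 0.60, size=200)     # group B
n = xa.size + xb.size

bic0 = bic_score(bernoulli_max_loglik(np.concatenate([xa, xb])), k=1, n=n)
bic1 = bic_score(bernoulli_max_loglik(xa) + bernoulli_max_loglik(xb), k=2, n=n)
print("shared theta:", round(bic0, 2), " separate thetas:", round(bic1, 2))
```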
best worst-case sequential betting strategy

Recall: the Bayes factor with a simple null defines a test martingale, which we can rewrite as a product of conditional e-variables:

$\frac{p_W(X^n)}{p_0(X^n)} = \prod_{i=1}^{n} S_i$ with $S_i = \frac{p_{W \mid X^{i-1}}(X_i)}{p_0(X_i)}$, where $p_{W \mid X^{i-1}}(X_i) = \int p_\theta(X_i)\, w(\theta \mid X^{i-1})\, d\theta$ and $w(\theta \mid X^{i-1}) \propto p_\theta(X^{i-1})\, w(\theta)$

…and we can think of each component as the factor by which our wealth changes in the $i$-th round, in a sequential betting game in which we would not expect to gain money if the null were true
best worst-case sequential betting strategy

Recall: the Bayes factor with a simple null defines a test martingale, which we can rewrite as a product of conditional e-variables; in the i.i.d. case:

$S^{(n)}_\theta := \frac{p_\theta(X^n)}{p_0(X^n)} = \prod_{i=1}^{n} S_{i,\theta}$ with $S_{i,\theta} = \frac{p_\theta(X_i)}{p_0(X_i)}$

$S^{(n)}_W := \frac{p_W(X^n)}{p_0(X^n)} = \prod_{i=1}^{n} S_{i,W}$ with $S_{i,W} = \frac{p_{W \mid X^{i-1}}(X_i)}{p_0(X_i)}$

$S_i$: the factor by which our wealth changes in the $i$-th round, in a sequential betting game in which we would not expect to gain money if the null were true

With hindsight, the best strategy, the one that would have made us richest among all $S^{(n)}_\theta$ and even all $S^{(n)}_W$, is $S^{(n)}_{\hat\theta_n}$. Unfortunately we can’t play this strategy! (why?)
Jeffreys’ prior:
best worst-case sequential betting strategy
Recall: the Bayes factor with a simple null defines a test martingale, which we can rewrite as a product of conditional e-variables; in the i.i.d. case:

$S^{(n)}_W := \frac{p_W(X^n)}{p_0(X^n)} = \prod_{i=1}^{n} S_{i,W}$ with $S_{i,W} = \frac{p_{W \mid X^{i-1}}(X_i)}{p_0(X_i)}$

With hindsight, the best strategy, the one that would have made us richest among all $S^{(n)}_\theta$ and even all $S^{(n)}_W$, is $S^{(n)}_{\hat\theta_n}$.

• Sensible goal: become as rich as possible compared to the optimal-with-hindsight $\hat\theta_n$, making the ratio as small as possible in the worst case

• Jeffreys’ prior approximately achieves this
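A small simulation sketch of this claim for the Bernoulli family with point null $\theta_0 = 1/2$ (my own illustration, not lecture code): the hindsight-best wealth $\log S^{(n)}_{\hat\theta_n}$ minus the Jeffreys-mixture wealth $\log S^{(n)}_W$ should grow roughly like $\frac12 \log n$.

```python
import numpy as np
from scipy.special import betaln

def log_sw_jeffreys(x, theta0=0.5):
    """log S_W^(n) = log p_W(x^n) - log p_{theta0}(x^n), with W = Beta(1/2, 1/2) (Jeffreys)."""
    x = np.asarray(x)
    n, n1 = x.size, int(x.sum())
    log_marginal = betaln(0.5 + n1, 0.5 + n - n1) - betaln(0.5, 0.5)
    log_null = n1 * np.log(theta0) + (n - n1) * np.log(1 - theta0)
    return log_marginal - log_null

def log_s_hindsight(x, theta0=0.5):
    """log S_{theta_hat}^(n): wealth of the (unplayable) hindsight-best single theta."""
    x = np.asarray(x)
    n, n1 = x.size, int(x.sum())
    th = np.clip(n1 / n, 1e-12, 1 - 1e-12)
    log_null = n1 * np.log(theta0) + (n - n1) * np.log(1 - theta0)
    return n1 * np.log(th) + (n - n1) * np.log(1 - th) - log_null

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.6, size=2000)
for n in (100, 500, 2000):
    gap = log_s_hindsight(x[:n]) - log_sw_jeffreys(x[:n])
    print(n, round(float(gap), 2), "vs  0.5*log(n) =", round(0.5 * np.log(n), 2))
```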
Jeffreys prior and sequential betting

$-\log p_W(x^n) = -\log p_{\hat\theta_n}(x^n) + \frac{k}{2}\log\frac{n}{2\pi} - \log w(\hat\theta_n) + \frac{1}{2}\log|I(\hat\theta_n)| + o(1)$

uniformly for all $x_1, x_2, \ldots$ with, for all large $n$, $\hat\theta_n$ in a compact subset of the parameter space, i.e. for any compact $C \subset \mathrm{int}(\Theta)$,

$\sup_{x^n :\, \hat\theta_n \in C}\; \log\frac{p_{\hat\theta_n}(x^n)}{p_W(x^n)} \;\to\; \frac{k}{2}\log\frac{n}{2\pi} - \log\frac{w(\hat\theta_n)}{\sqrt{|I(\hat\theta_n)|}}$
Jeffreys prior gives the
optimal relative betting strategy
$\sup_{x^n :\, \hat\theta_n \in C}\; \log\frac{p_{\hat\theta_n}(x^n)}{p_W(x^n)} \;\to\; \frac{k}{2}\log\frac{n}{2\pi} - \log\frac{w(\hat\theta_n)}{\sqrt{|I(\hat\theta_n)|}}$

so if we plug in the Jeffreys prior $W_J$ on $C$ for $w$, we find:

$\log\frac{p_{\hat\theta_n}(x^n)}{p_{W_J}(x^n)} \;\to\; \frac{k}{2}\log\frac{n}{2\pi} + \log\int_C \sqrt{|I(\theta)|}\, d\theta$

• For any other prior $W'$ on $C$, there exists $\theta'$ with $w'(\theta') < w_J(\theta')$ (WHY?), so if $\theta' = \hat\theta_n$ for some $x^n$, then for large $n$ the (red) term $-\log w(\hat\theta_n)$ is larger, so the worst-case performance relative to the hindsight-best $\hat\theta_n$ will be worse
Jeffreys prior gives the
optimal relative betting strategy
so if we plug in the Jeffreys prior $W_J$ on $C$ for $w$, we find:

$\log\frac{p_{W_J}(x^n)}{p_{\hat\theta_n}(x^n)} \;\to\; -\frac{k}{2}\log\frac{n}{2\pi} - \log\int_C \sqrt{|I(\theta)|}\, d\theta$

• The applicability of this result is limited though: for most exponential families, Jeffreys’ prior is improper. This means that if we take a sequence of compact sets $C_1, C_2, \ldots$ such that “$C_i \to \Theta$”, we get that $\int_{C_i} \sqrt{|I(\theta)|}\, d\theta \to \infty$, and then we’re in trouble: we want to take “large” $C$, but then we get bad worst-case behaviour

• It is meaningful for Bernoulli (Jeffreys’ prior proper)
Laplace Approximation and
Expected Log Optimality
• We now show how to move from the individual-data-sequence analysis that we just did to an expected logarithmic growth analysis (our optimality criterion for e-variables)
Wilks’ Theorem/Phenomenon

• Wilks’ theorem (1939!): suppose $X_1, X_2, \ldots$ are i.i.d. $\sim P_\theta$. Then

$U_n := \log\frac{p_{\hat\theta_n}(X^n)}{p_\theta(X^n)}$

converges in probability and in expectation to $\tfrac12$ times a $\chi^2$-distributed random variable with $k$ degrees of freedom. Consequence:

$\mathbf{E}_{P_\theta}[U_n] = \frac{k}{2} + o(1)$

…asymptotically, surprisingly, no dependence on $n$!
• holds for all full exponential families, and in fact much more generally;
• holds exactly, nonasymptotically, for the normal location family
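A quick simulation sketch of the Wilks phenomenon for the Bernoulli family, $k = 1$ (my own illustration, not lecture code): the average of $U_n$ over many replications should be close to $k/2 = 0.5$, essentially regardless of $n$.

```python
import numpy as np

def u_n(x, theta):
    """U_n = log p_{theta_hat_n}(X^n) - log p_theta(X^n) for Bernoulli data x."""
    n, n1 = x.size, int(x.sum())
    th = np.clip(n1 / n, 1e-12, 1 - 1e-12)      # MLE (clipped away from 0 and 1)
    loglik = lambda t: n1 * np.log(t) + (n - n1) * np.log(1 - t)
    return loglik(th) - loglik(theta)

rng = np.random.default_rng(4)
theta = 0.3
for n in (50, 200, 1000):
    sims = [u_n(rng.binomial(1, theta, size=n), theta) for _ in range(2000)]
    print(n, round(float(np.mean(sims)), 3))    # all close to k/2 = 0.5
```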
Simple 𝑯𝟏 and 𝑯𝟎 , log-optimal betting

Recall: the “best” e-variable picks $s^*(X)$ maximizing

$\mathbf{E}_{X \sim P_1}[\log s(X)]$ over all e-variables $s(X)$ for $H_0$

• the maximum is achieved for $s^*(X) = \dfrac{p_1(X)}{p_0(X)}$
Composite 𝑯𝟏

• If you think $H_0$ is wrong, but you do not know which alternative is true, then… you can try to learn $p_1$
• Use a $\bar p_1$ that better and better mimics the true, or just “best”, fixed $p_1$

Example: $H_0: X_i \sim \mathrm{Ber}(\tfrac12)$, $H_1: X_i \sim \mathrm{Ber}(\theta)$, $\theta \neq \tfrac12$: set

$\bar p_1(X_{n+1} = 1 \mid x^n) := \dfrac{n_1 + 1}{n + 2}$, where $n_1$ is the number of 1s in $x^n$

…we use notation for conditional probabilities, but we should really think of $\bar p_1$ as a sequential betting strategy, with the “conditional probabilities” indicating how to bet/invest in the next round, given the past data (see the sketch below)
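A sketch of this plug-in strategy as a running product of conditional e-variables (my own code, not the lecture's): in round $i$ we bet with $\bar p_1(\cdot \mid x^{i-1})$ given by the rule of succession above and divide by the null probability $\tfrac12$.

```python
import numpy as np

def plugin_e_process(x, theta0=0.5):
    """Running wealth  prod_{i<=n} pbar1(X_i | X^{i-1}) / p0(X_i)  for H0: X_i ~ Ber(theta0),
    where pbar1(X_i = 1 | x^{i-1}) = (n1 + 1) / ((i - 1) + 2)   (rule of succession)."""
    wealth, n1, path = 1.0, 0, []
    for i, xi in enumerate(x):              # i = number of past observations
        p_one = (n1 + 1) / (i + 2)          # plug-in probability of the next outcome being 1
        pred = p_one if xi == 1 else 1.0 - p_one
        null = theta0 if xi == 1 else 1.0 - theta0
        wealth *= pred / null               # conditional e-variable for this round
        n1 += xi
        path.append(wealth)
    return np.array(path)

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.7, size=500)          # data from the alternative
print(plugin_e_process(x)[[9, 99, 499]])    # wealth after 10, 100 and 500 rounds
```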
Composite 𝑯𝟏 and The Oracle

Two general strategies for learning $P_1 \in H_1$:

• “prequential plug-in” (or simply “plug-in”) vs.
• “method-of-mixtures” (or, in the present simple context, simply “Bayesian”):

$\frac{p_W(X^n)}{p_0(X^n)} = \prod_{i=1}^{n} S_i$ with $S_i = \frac{p_{W \mid X^{i-1}}(X_i)}{p_0(X_i)}$

If you had access to an oracle that would tell you “if $H_1$ is true, then the data come from this-and-this $P_\theta$ with $\theta \in \Theta_1$”, then you would like to use the GRO e-process $S^{(n)}_\theta = p_\theta(X^n)/p_0(X^n)$.

Sensible goal: use an e-process $S$ that is not much worse than the oracle’s. $S^{(n)}_W$ will have this property!
Laplace+Wilks

For all $\theta \in \mathrm{int}(\Theta)$:

$\mathbf{E}_{P_\theta}[-\log p_W(X^n)]$
$= \mathbf{E}_{P_\theta}\!\big[-\log p_{\hat\theta_n}(X^n) + \tfrac{k}{2}\log\tfrac{n}{2\pi} - \log w(\hat\theta_n) + \tfrac12 \log|I(\hat\theta_n)|\big] + o(1)$
$= \mathbf{E}_{P_\theta}\!\big[-\log p_\theta(X^n) + \tfrac{k}{2}\log\tfrac{n}{2\pi} - \log w(\theta) + \tfrac12 \log|I(\theta)| - \tfrac{k}{2}\big] + o(1)$

so

$\mathbf{E}_{P_\theta}\!\left[\log S^{(n)}_W(X^n)\right] = \mathbf{E}_{P_\theta}\!\left[\log\frac{p_\theta(X^n)}{p_0(X^n)}\right] - \frac{k}{2}\log\frac{n}{2\pi} + \log w(\theta) - \frac12\log|I(\theta)| + \frac{k}{2} + o(1)$

…within $\frac{k}{2}\log n$ of optimal: pretty good, since $\mathbf{E}_{P_\theta}\!\left[\log\frac{p_\theta(X^n)}{p_0(X^n)}\right]$ is linear in $n$ for i.i.d. data
Laplace+Wilks, details

• In the first step we acted as if data “outside of the compact set $C$” played no role. For exponential families this is indeed the case
  • intuitively plausible by the law of large numbers, but by no means easy to prove (Clarke and Barron, 1990, 1994)
• The second step crucially used continuity and positivity of the prior density $w$ and of the Fisher information (the latter holds for exponential families)
• Again, we may try to optimize by using the Jeffreys prior, but that only works in the rare cases (like Bernoulli) where it is proper
• Nevertheless, this suggests using an improper Jeffreys prior, just to see what happens… (we will see this in the coming weeks)
The Ubiquitous $\frac{k}{2}\log n$

• Bayes/Method of Mixtures: by having a $k$-dimensional composite exponential family alternative, rather than a simple alternative, using a standard (continuous + strictly positive) prior, you lose a $\frac{k}{2}\log n + O(1)$ term in expected logarithmic growth

• Prequential plug-in methods typically also lose a $\frac{k}{2}\log n + O(1)$ term (but they often perform a bit worse in practice)
Anytime-Valid Confidence Intervals
revisited

• e-processes can be used to construct AVCIs


• Given the model $\{P_\theta : \theta \in \Theta\}$, let $S_\theta$ be an e-process for $H_0 = \{P_\theta\}$, for each $\theta \in \Theta$:
Anytime-Valid Confidence Intervals

• e-processes can be used to construct AVCIs


• Given the model $\{P_\theta : \theta \in \Theta\}$, let $S_\theta$ be an e-process for $H_0 = \{P_\theta\}$, for each $\theta \in \Theta$:

  The $\theta$ you have not been able to reject (i.e. those with $S_\theta(X^n) < 1/\alpha$)
AV CIs for normal location family
• $p_\theta(x^n) = \frac{1}{(2\pi)^{n/2}} \exp\!\left(-\frac12 \sum_{i \le n} (x_i - \theta)^2\right)$

• Equip with a normal prior $\theta \sim W = N(0, \rho^2)$

• The Bayes factor relative to $H_0 = \{P_\theta\}$ is given by

$S_\theta = \frac{\int p_{\theta'}(x^n)\, w(\theta')\, d\theta'}{p_\theta(x^n)} = \frac{p_W(x^n)}{p_\theta(x^n)}$

$\mathrm{CI}_{n,1-\alpha} = \left\{\theta : \frac{p_\theta(x^n)}{p_W(x^n)} > \alpha\right\}$

…always wider than the Bayes credible posterior interval based on the same prior
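A numerical sketch of this AVCI (my own illustration, not lecture code): for $\sigma = 1$ the marginal $p_W(x^n)$ has a closed form, and the set $\{\theta : p_\theta(x^n)/p_W(x^n) > \alpha\}$ is an interval around the sample mean, always wider than the standard CI.

```python
import numpy as np
from scipy.stats import norm

def av_ci_normal(x, rho=10.0, alpha=0.05):
    """AVCI {theta : p_theta(x^n) / p_W(x^n) > alpha} for X_i ~ N(theta, 1), prior W = N(0, rho^2)."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    # p_W(x^n) / p_theta(x^n) = sqrt(2*pi/n) * N(xbar; 0, rho^2 + 1/n) * exp(n*(theta - xbar)^2 / 2)
    marginal_sd = np.sqrt(rho ** 2 + 1.0 / n)
    log_c = 0.5 * np.log(2 * np.pi / n) + norm.logpdf(xbar, loc=0.0, scale=marginal_sd)
    # condition p_theta / p_W > alpha  <=>  n * (theta - xbar)^2 / 2 < -log(alpha) - log_c
    half_width = np.sqrt(max(0.0, 2.0 * (-np.log(alpha) - log_c) / n))
    return xbar - half_width, xbar + half_width

rng = np.random.default_rng(6)
x = rng.normal(0.3, 1.0, size=100)
lo, hi = av_ci_normal(x)
std = 1.96 / np.sqrt(x.size)
print("AV CI:      ", round(lo, 3), round(hi, 3))
print("standard CI:", round(x.mean() - std, 3), round(x.mean() + std, 3))
```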
AV CI’s vs. Bayesian Credible Sets

• Standard CI = Bayesian 95% credible interval (noninformative prior)

• AV confidence interval based on the BF with a prior whose variance → ∞, approximately:
Anytime-Valid Confidence Interval

Red is the standard confidence interval; green is the anytime-valid confidence interval that I just gave.
Ubiquitous (or not) $\log n$

• By the Laplace approximation, we now see that for any 1-dimensional exponential family, we can get AV CIs of width $O\!\left(\sqrt{\frac{\log n}{n}}\right)$

• Precise constants at small $n$ vary from family to family

• By taking “strange” priors, though, we can do better “in some regimes”
Priors optimal for anticipated 𝒏∗

• If we anticipate that we will observe some sample size $n^*$, and we are interested in $(1-\alpha)$-AVCIs for a given $\alpha = \alpha^*$, we can associate each $\theta$ with an $S_\theta$ such that the $(1-\alpha)$-AVCI is as narrow as possible at time $n^*$.

• If the actual $n$ is then very different from $n^*$, or $\alpha$ very different from $\alpha^*$, our AVCI will get wide and not very useful, but it will still be valid.

• The prior $W_{\leftarrow\theta}$ which minimizes the width at time $n^*$ is ‘degenerate’, putting mass $\frac12$ on a $\theta_L < \theta$ and $\frac12$ on a $\theta_R > \theta$ s.t. $D(\theta_L \,\|\, \theta) = D(\theta_R \,\|\, \theta) = \frac{\log(1/\alpha)}{n^*}$
(we will derive this intriguing fact next week!)
Priors optimal for anticipated 𝒏∗
• The prior $W_{\leftarrow\theta}$ which minimizes the width at time $n^*$ is ‘degenerate’, putting mass $\frac12$ on $\theta_L$ and $\frac12$ on $\theta_R > \theta_L$ such that $D(\theta_L \,\|\, \theta) = D(\theta_R \,\|\, \theta) = \frac{\log(1/\alpha)}{n^*}$
(with $D(\theta' \,\|\, \theta)$ the KL divergence between $P_{\theta'}$ and $P_\theta$, as measured on a single outcome)

• In the normal location family, $D(\theta' \,\|\, \theta) = \frac12 I(\theta)(\theta' - \theta)^2$, so

$\theta_L = \theta - \sqrt{\frac{2}{I(\theta)}\cdot\frac{\log(1/\alpha)}{n^*}} = \theta - \sqrt{\frac{2\log(1/\alpha)}{n^*}}$

…with the second equality holding if $\sigma^2 = 1$.

For other exponential families, by a second-order Taylor approximation of the KL divergence, $\theta_L$ will be of the same order for large $n^*$
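A tiny numeric sketch of this two-point prior (my own illustration, using the condition as reconstructed above): for $\sigma^2 = 1$ (so $I(\theta) = 1$) the support points sit at $\theta \mp \sqrt{2\log(1/\alpha)/n^*}$.

```python
import numpy as np

def two_point_prior_support(theta, alpha=0.05, n_star=100, fisher=1.0):
    """Support points theta_L < theta < theta_R of the n*-optimized two-point prior,
    using D(theta' || theta) ~ 0.5 * I(theta) * (theta' - theta)^2 = log(1/alpha) / n*."""
    delta = np.sqrt(2.0 * np.log(1.0 / alpha) / (fisher * n_star))
    return theta - delta, theta + delta

print(two_point_prior_support(theta=0.0, alpha=0.05, n_star=100))   # roughly (-0.245, 0.245)
```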
• standard CI: $\bar X \pm 1.96/\sqrt{n}$

• AV CI, “non-informative” prior: $\bar X \pm \sqrt{\frac{\mathrm{const} + \log n}{n}}$

• AV CI, prior optimized for a specific $n^*$: around $n = n^*$, $\bar X \pm 2.72/\sqrt{n^*}$
If 𝒏 ≠ 𝒏∗

• With this $n^*$-optimized prior, in the normal location family with $\sigma^2 = 1$, we get

$\Theta_{n,1-\alpha} \approx \left\{\theta : |\theta - \hat\theta(x^n)| \lesssim \sqrt{\frac{\log(1/\alpha)}{2n}}\cdot\left(c + c^{-1}\right)\right\}$ with $c = \sqrt{\frac{n}{n^*}}$
AVCI widths
So we have:
• with a prior dependent on $\theta$ and optimized for $n^*$ (making the prior degenerate), AVCIs are only a small constant wider than ordinary CIs at $n = n^*$, but become polynomially, that is

$O\!\left(\max\!\left(\sqrt{\tfrac{n^*}{n}}, \sqrt{\tfrac{n}{n^*}}\right)\right)$,

wider otherwise

• with a prior independent of $\theta$ and with a continuous prior density, they are logarithmically, i.e. order $O(\sqrt{\log n})$, wider

• It turns out: with a clever prior that does depend on $\theta$, we can make them wider by only a factor $C \cdot \sqrt{\log\log n}$, for all $n$, for some constant $C$: take a prior $W_{\leftarrow\theta}$ with $\lim_{\theta' \to \theta} w_{\leftarrow\theta}(\theta') = \infty$
Upcoming Weeks

• Derive the results for “strange” priors

• Finding growth-optimal e-variables for composite null

• Homework: something to program!

