
Foundations of Statistics and Machine Learning:
testing and uncertainty quantification with e-values
(and their link to likelihood, betting)
Today, Lecture 3

1. Organizational matters:
• today stop 10 minutes early
• next week in lecture room C2
2. Basic Probability Theory
3. Kelly Betting
4. Revisiting OS/Counterfactuals in terms of Kelly Betting
5. (Basic Probability Theory Continued; Basic Bayesian Statistics)

Probability

• Ω : sample space
• either for just one or for a fixed nr., say n, of outcomes
• P : probability distribution on Ω, identified by its mass function p (in case Ω countable) or density function f (in case Ω = ℝ^n)

Example:
• Ω = {1, 2, …, 6} ; P(A) = ∑_{a∈A} p(a)
• P({2,4,6}) = P(even) makes sense, p({2,4,6}) does not
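Not on the original slide: a minimal Python sketch of the die example (the helper P and the exact event sets are my own illustration):

```python
# A fair die as a mass function p on Omega = {1,...,6};
# P(A) is computed as the sum of p(a) over a in the event A.
from fractions import Fraction

p = {a: Fraction(1, 6) for a in range(1, 7)}  # mass function

def P(A):
    """Probability of an event A, a subset of Omega."""
    return sum(p[a] for a in A)

print(P({2, 4, 6}))  # P(even) = 1/2; note that p({2,4,6}) would make no sense
```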
Random Variables

• Random variable: function from Ω to ℝ (or a subset)
• If X is a random variable we write X instead of X(ω)
• Example: Ω = {1, 2, …, 6}^n ; n throws of a die
• Outcome is ω ∈ Ω , an n-dim. vector
• X_i(ω) := i-th component of ω
• We also write 𝒳_i for the domain of RV X_i
• P(X_i = 5) is shorthand for P({ω ∈ Ω : X_i(ω) = 5})
• For any distribution P on Ω, any random variable X has itself a probability mass/density function: p_X(x) := P(X = x). This is called the marginal distribution of X
• e.g. p_{X_1}(5) = p_{X_2}(5) = 1/6
• If RV Y is clear from context we write p rather than p_Y
Conditioning

With events:
• P(A | B) := P(A ∩ B) / P(B) if P(B) > 0
e.g. P(4 | even) = P({4} | {2,4,6}) = (1/6) / (3/6) = 1/3

With random variables:
• For all x ∈ 𝒳, y ∈ 𝒴 with P(Y = y) > 0:
• P(X = x | Y = y) := P(X = x, Y = y) / P(Y = y)
• Abbreviated to p(X | Y) := p(X, Y) / p(Y), written with capitals to denote that the statement really holds for all ‘instantiations’ of X and Y.
Joint, Marginal, Independence

• Joint probability: p(X, Y) = p(X) p(Y | X)
• Random variables X and Y are independent (under P) if p(Y | X) = p(Y) or, equivalently, p(X, Y) = p(X) p(Y)
• Marginal probability satisfies p(X) = ∑_{y∈𝒴} p(X, y) = ∑_{y∈𝒴} p(X | y) p(y)
• …the first formula explains the phrase ‘marginal’ probability
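Not on the original slide: a small Python check of the product rule and marginalization on a toy joint table (the joint probabilities are my own made-up numbers):

```python
# Verify p(x, y) = p(x) * p(y | x) and the marginal p_Y on a 2x2 joint.
from fractions import Fraction as F

# joint p(X, Y) on X in {0,1}, Y in {0,1}; the four entries sum to 1
joint = {(0, 0): F(1, 8), (0, 1): F(3, 8), (1, 0): F(1, 4), (1, 1): F(1, 4)}

def p_X(x):  # marginal: sum the joint over y
    return sum(joint[(x, y)] for y in (0, 1))

def p_Y_given_X(y, x):  # conditional, defined whenever p_X(x) > 0
    return joint[(x, y)] / p_X(x)

for x, y in joint:
    assert joint[(x, y)] == p_X(x) * p_Y_given_X(y, x)  # product rule holds

# independence would require p(y | x) = p(y) for all x, y; here it fails:
p_Y = lambda y: sum(joint[(x, y)] for x in (0, 1))
print(p_Y_given_X(1, 0), p_Y(1))  # 3/4 vs 5/8, so X and Y are dependent
```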
“Chain” or “Product” or “Telescoping” Rule

• Let X^n ≡ (X_1, …, X_n) be defined on a probability space Ω. For every distribution P on Ω, we have

p(X^n) = p(X_1) ⋅ p(X_2 | X_1) ⋅ … ⋅ p(X_n | X_1, …, X_{n−1}) = ∏_{i=1..n} p(X_i | X^{i−1})

• …proof: apply the definition of conditional probability repeatedly: p(X_1, …, X_n) = p(X_1, …, X_{n−1}) ⋅ p(X_n | X_1, …, X_{n−1}), then expand the first factor in the same way, and so on.

Extremely Important!
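Not on the original slide: a brute-force Python verification of the telescoping rule on an arbitrary distribution over {0,1}^3 (the random weights are my own construction):

```python
# Check p(x1,...,xn) = prod_i p(x_i | x_1..x_{i-1}) for every outcome.
import itertools, random

random.seed(0)
omega = list(itertools.product([0, 1], repeat=3))
w = [random.random() for _ in omega]
p = {o: wi / sum(w) for o, wi in zip(omega, w)}  # an arbitrary distribution P

def marg(prefix):  # P(X_1..X_k = prefix); marg(()) = 1
    return sum(q for o, q in p.items() if o[:len(prefix)] == prefix)

for x in omega:
    prod = 1.0
    for i in range(3):  # p(x_i | x^{i-1}) = marg(x[:i+1]) / marg(x[:i])
        prod *= marg(x[:i + 1]) / marg(x[:i])
    assert abs(prod - p[x]) < 1e-12  # telescoping holds exactly
print("chain rule verified on all 8 outcomes")
```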
Today, Lecture 3

1. Organizational matters:
• today stop 10 minutes early
• next week in lecture room C2
2. Basic Probability Theory
3. Kelly Betting
4. Revisiting OS/Counterfactuals in terms of Kelly Betting
5. (Basic Probability Theory Continued; Basic Bayesian Statistics)

"̅! # "
Interpreting in terms of betting
"# (# " )
E-Processes and Betting!
Kelly (1956)

• Let 𝒳 = {1, …, K}. At each time t = 1, 2, … there are K tickets available. Ticket k pays off 1/p_0(k) if the outcome is k, and 0 otherwise. You may buy multiple and fractional nrs of tickets.
• You start by investing 1$ at time 1.
• At each time t you put a fraction p̄_1(k) of your money on outcome k. Then your total capital M^(t) gets multiplied by M_t := p̄_1(X_t) / p_0(X_t)
• After 1 outcome you either stop with end-capital M_1 or you continue, so you put fraction p̄_1(k) of M_1 on outcome X_2 = k (“reinvest everything”).
• After the second round you stop with end capital M^(2) = M_1 ⋅ M_2 or you continue, and so on…
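Not on the original slide: a minimal simulation of this game in Python (p_0, the strategy p̄_1 and the data-generating distribution are my own illustrative choices):

```python
# Betting fraction p1bar(k) of your capital on each k multiplies the capital
# by M_t = p1bar(X_t)/p0(X_t) each round; start with $1, reinvest everything.
import random

random.seed(1)
K = 2
p0 = [0.5, 0.5]       # null: pay-off 1/p0(k) per $ on ticket k
p1bar = [0.8, 0.2]    # a fixed betting strategy (assumed alternative)
p_true = [0.8, 0.2]   # data-generating distribution (here the alternative)

capital = 1.0
for t in range(1000):
    x = random.choices(range(K), weights=p_true)[0]
    capital *= p1bar[x] / p0[x]   # M_t = p1bar(X_t) / p0(X_t)
print(f"capital after 1000 rounds: {capital:.3g}")  # huge if H1 is true
```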
E-Processes and Betting
Kelly (1956)

• When you finally stop at time τ, end-capital is M^(τ) = ∏_{i=1..τ} M_i
• This is the actual betting game as it is played in a casino (without the 0)
• Think of p_0(k) as the actual probability of k. Pay-off 1/p_0(k) for investing 1$ into outcome k is fair to both player and casino, since
E_{P_0}[ 1_{X_t = k} / p_0(X_t) ] = 1 (expected pay-off is equal to investment)
• Example: p_0(red) = p_0(black) = 1/2 and p̄_1(red) = 1: then M_1 = 2 if red comes up and M_1 = 0 otherwise, so E_{P_0}[M_1] = 1
E-Processes and Betting

• Think of p̄_1(k) as a betting/investment strategy rather than a distribution from which you sample data. For each “distribution” (strategy) P̄_1, we have
E_{P_0}[ p̄_1(X_t) / p_0(X_t) ] = ∑_k p̄_1(k) ⋅ E_{P_0}[ 1_{X_t = k} / p_0(X_t) ] = 1
• “In a real casino, it does not matter what strategy you use at time t, you do not expect to gain any money”
• The same holds for the stopped process: for each strategy P̄_1,
E_{P_0}[ p̄_1(X^(τ)) / p_0(X^(τ)) ] = 1
• “In a real casino, it does not matter what strategy you use and what rule you use for deciding when to stop, you do not expect to gain any money”
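Not on the original slide: an exact check of the fairness identity by enumeration (the two distributions are my own toy numbers; any choice works):

```python
# E_{P0}[ p1bar(X)/p0(X) ] = sum_k p0(k) * p1bar(k)/p0(k) = sum_k p1bar(k) = 1.
from fractions import Fraction as F

p0 = {1: F(1, 6), 2: F(1, 3), 3: F(1, 2)}
p1bar = {1: F(3, 4), 2: F(1, 8), 3: F(1, 8)}  # any strategy/distribution

expected_payoff = sum(p0[k] * p1bar[k] / p0[k] for k in p0)
print(expected_payoff)  # 1: no strategy has positive expected gain under P0
```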
E-Processes and Betting

New interpretation of likelihood ratio:
• p̄_1 is the betting strategy, p_0 determines the pay-offs; the likelihood ratio/e-process at time t gives the capital accumulated so far if you started with 1$.
• Obtaining evidence against the null means: getting rich in a game in which you would not expect to gain money if the null were true
• 1-to-1 correspondence between e-processes of the form p̄(X^τ) / p_0(X^τ) and sequential betting in a casino!
How to Choose the Strategy

• If the null is true, you do not expect to gain any money, under any stopping time, no matter what strategy p̄_1 you use
• So which one should you use?
• Intuitively, this should be a strategy that lets you get rich as fast as possible if the alternative is true.
• If you think the alternative is a specific p_1, then using p̄_1 = p_1 is a good idea
• Why exactly? Explained later in the course, but note for now: in a 1-round game, the optimal strategy, maximizing expected gain, is not p_1 but instead the p̄_1 with p̄_1(X = k*) = 1: put all your money on the k* maximizing p_1(X = k) over all k
• …but in a many-round game, this strategy is very silly: the first time k* fails to come up, you lose your entire capital!
Composite H_1

• If the null is true, you do not expect to gain any money, under any stopping time, no matter what strategy p̄_1 you use
• If you think the alternative is a specific p_1, then using p̄_1 = p_1 is a good idea
• If you think H_0 is wrong, but you do not know which alternative is true, then… you can try to learn p_1
• Use a p̄_1 that better and better mimics the true, or just “best”, fixed p_1
Composite H_1

• If you think H_0 is wrong, but you do not know which alternative is true, then… you can try to learn p_1
• Use a p̄_1 that better and better mimics the true, or just “best”, fixed p_1

Example, H_0: X_i ∼ Ber(1/2), set:
p̄_1(X_{n+1} = 1 | x^n) := (n_1 + 1) / (n + 2), where n_1 is the nr of 1s in x^n

…we use notation for conditional probabilities, but we should really think of p̄_1 as a sequential betting strategy, with the “conditional probabilities” indicating how to bet/invest in the next round, given the past data

…still, formally, using telescoping-in-reverse, we find that p̄_1 also uniquely defines a marginal probability distribution for X^n, for each n, and our accumulated capital at time n is again given by the likelihood ratio:
p̄_1(X^n) / p_0(X^n) = ∏_{i=1..n} p̄_1(X_i | X^{i−1}) / p_0(X_i | X^{i−1})
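Not on the original slide: a Python sketch of this learning strategy (the true parameter 0.7 and sample sizes are my own choices):

```python
# Bet with the Laplace rule p1bar(X_{n+1}=1 | x^n) = (n1+1)/(n+2)
# against H0: X_i ~ Ber(1/2); the capital is the e-process.
import random

def eprocess(xs):
    """Capital after betting the Laplace rule against p0 = Ber(1/2)."""
    capital, n1 = 1.0, 0
    for n, x in enumerate(xs):
        p1 = (n1 + 1) / (n + 2)              # prob. assigned to next outcome 1
        capital *= (p1 if x == 1 else 1 - p1) / 0.5
        n1 += x
    return capital

random.seed(2)
h1_data = [int(random.random() < 0.7) for _ in range(500)]  # true theta = 0.7
h0_data = [int(random.random() < 0.5) for _ in range(500)]  # null is true
print(eprocess(h1_data))  # typically astronomically large: we get rich
print(eprocess(h0_data))  # stays small: under H0 the expected capital is 1
```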
Extensions

• For the betting analogy to hold, the bets must not be strictly favourable to you if the null hypothesis were true. So the capital process in a real casino (which includes a 0 outcome, so that the actual probabilities are P_0(red) = P_0(black) = 18/37) is also an e-process.
• By considering a more general betting game, we can also handle testing with continuous-valued sample spaces and with composite nulls. We will see how to do this later on in the course
• If the null is composite then the e-process (= capital process) does not always look like a LR any more
p-value problem 1 goes away

Data arrives sequentially as X_1, X_2, …

Protocol 1:
• At each time n = 1, 2, 3, … you do a test as if n were fixed in advance and you reject if p < 0.05
• Then with probability 1 (not 0.05!) you will eventually reject even if H_0 is true

Protocol 2: At each time n = 1, 2, 3, … until arbitrary n_max you do a LR test based on X^n and you reject if LR p_1(X^n) / p_0(X^n) ≥ 1/0.05 = 20
• If the null is true, the probability that you will reject remains bounded by 0.05

This is simply the fact that if you start with B$ in a real casino, then no matter what strategy or rule for stopping you use, the probability that you go home with 20⋅B$ is bounded by 1/20
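Not on the original slide: a Monte Carlo sketch of Protocol 2, using the Laplace-rule strategy from before (monitoring horizon and repetition count are my own choices):

```python
# Under H0, P(the e-process ever reaches 20) <= 1/20 = 0.05 by the casino
# argument, no matter how long you keep monitoring.
import random

def ever_rejects(n_max=10_000):
    """Sample H0-data; return True if the e-process ever reaches 20."""
    capital, n1 = 1.0, 0
    for n in range(n_max):
        x = int(random.random() < 0.5)   # H0: X_i ~ Ber(1/2) is true
        p1 = (n1 + 1) / (n + 2)          # learned alternative (Laplace rule)
        capital *= (p1 if x == 1 else 1 - p1) / 0.5
        n1 += x
        if capital >= 20:
            return True                  # reject at the first crossing
    return False

random.seed(3)
print(sum(ever_rejects() for _ in range(400)) / 400)  # stays below 0.05
```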
p-value problem 2 (counterfactuals) goes away

• Suppose I plan to test a new medication on exactly 100 patients. I do this and obtain a (just) significant result (p = 0.03 based on fixed n = 100). But just to make sure I ask a statistician whether I did everything right.
• The statistician asks: what would you have done if your result had been ‘almost-but-not-quite’ significant?
• I say “Well I never thought about that. Well, perhaps, but I’m not sure, I would have asked my boss for money to test another 50 patients”.
• Now the statistician says: that means your result is invalid!

Now replay the same story, but with an e-value: I obtain a (just) significant result (S = 21 based on fixed n = 100).
• Now the statistician says: this is completely fine, since the validity of your conclusion does not depend on the actual stopping rule you have used
OR:
• This is completely fine, since evidence is measured in terms of the money you gained, and that amount does not depend on what you would have done in situations that never occurred
Today, Lecture 3

1. Organizational matters:
• today stop 10 minutes early
• next week in lecture room C2
2. Basic Probability Theory
3. Kelly Betting
4. Revisiting OS/Counterfactual Results in terms of Kelly Betting
5. (Basic Probability Theory Continued; Basic Bayesian Statistics)

Bayes’ Theorem

• We know
P(H | D) = P(D ∩ H) / P(D)
and P(D ∩ H) = P(H) ⋅ P(D | H)
…so combining both conditional probability statements we get
P(H | D) = P(D | H) ⋅ P(H) / P(D)
with P(D | H) the likelihood, P(H) the prior probability of H, and P(H | D) the posterior probability of H
Two Fundamentally Different Uses of Bayes’ Theorem

1. A priori probabilities can be meaningfully estimated (medical testing, for example!)
…then it is just an application of a mathematical theorem
2. A priori probabilities are a mere guess (and conceivably do not even exist) – we will now see an example of this:
“Bayesian learning/Bayesian statistics”

(…in reality it’s often ‘somewhere in the middle’)
Medical Example

• Every medical test has a certain sensitivity and specificity. The sensitivity
is the probability of a positive result, given that you have the disease.
The specificity is the probability of a negative result, given that you do
not have the disease.
• If you know the probability that an average person has the disease (i.e. the frequency in the population), you can take this as your ‘prior’ and then calculate the ‘posterior’ probability that you have the disease, given a positive test result, via Bayes’ theorem ⇒ homework
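Not on the original slide, and not the homework answer: a Python illustration of the calculation with made-up sensitivity, specificity and prevalence values:

```python
# Posterior probability of disease given a positive test, via Bayes' theorem.
sensitivity = 0.99   # assumed: P(positive | disease)
specificity = 0.95   # assumed: P(negative | no disease)
prior = 0.001        # assumed population frequency of the disease

p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)  # P(positive)
posterior = sensitivity * prior / p_pos                # P(disease | positive)
print(f"P(disease | positive test) = {posterior:.3f}")  # ~0.019: still small!
```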
Bayesian Inference, Toy Example

You get kidnapped, sedated and wake up in a foreign country. You only know that:
• You are either in Sweden or France
• Two thirds of all Swedes are blond
• One third of all French are blond
• You see three blond and one non-blond person

Where are you?
• Statistical model: observations X_i ∈ {blond, not blond}, with P(blond | Sweden) = 2/3 and P(blond | France) = 1/3
• prior probability: P(Sweden) = P(France) = 1/2
• likelihood: the probability of the observed data under each of the two hypotheses
• Bayes theorem gives posterior probability P(Sweden | data)
• and posterior odds P(Sweden | data) : P(France | data)
• Before you see anybody: odds for Sweden vs France are 1 : 1
• 1st person you see is blond: the likelihood ratio is (2/3)/(1/3) = 2, so

Posterior odds in favour of Sweden are 2 : 1
Posterior probability of being in Sweden is 2/3

• 2nd is not-blond: likelihood ratio (1/3)/(2/3) = 1/2, so the odds drop back to 1 : 1
• 3rd and 4th are blond: two more factors of 2, so the final odds are 4 : 1 and the posterior probability of Sweden is 4/5
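Not on the original slide: a Python check of these odds updates (the helper names are mine; the model numbers are from the example):

```python
# Multiply the prior odds by the likelihood ratio
# P(obs | Sweden) / P(obs | France) for each observation.
from fractions import Fraction as F

def lr(blond):
    return F(2, 3) / F(1, 3) if blond else F(1, 3) / F(2, 3)

odds = F(1)  # prior odds Sweden : France = 1 : 1
for obs in [True, False, True, True]:  # blond, not blond, blond, blond
    odds *= lr(obs)
print(odds, odds / (odds + 1))  # 4 and 4/5: odds 4:1, P(Sweden | data) = 4/5
```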


How does Bayesian inference behave?

• Question: what happens to the posterior odds if you are actually in Sweden and you see more and more people?
• What happens if you are in France?
• And what happens if you are really in Germany, where 50% of the people are blond?
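Not on the original slide: a simulation sketch for exploring these questions (sample size and seed are my own choices):

```python
# Long-run log posterior odds (Sweden : France, prior 1:1) when the data
# really come from Sweden (2/3 blond), France (1/3), or Germany (1/2).
import math, random

random.seed(4)
for country, theta in [("Sweden", 2 / 3), ("France", 1 / 3), ("Germany", 1 / 2)]:
    log_odds = 0.0
    for _ in range(10_000):
        blond = random.random() < theta
        log_odds += math.log(2) if blond else -math.log(2)  # LR is 2 or 1/2
    print(country, f"log-odds after 10000 people: {log_odds:+.1f}")
# Sweden: drifts to +infinity; France: to -infinity;
# Germany (model misspecified): a random walk with zero drift.
```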
Models

• Let Ω = 𝒳^n be a sample space and suppose we observe data x_1, …, x_n ∈ 𝒳^n
• We call a set of distributions ℳ = {P_θ : θ ∈ Θ} on Ω a statistical model (or often: hypothesis) for the data
• Simple example: 𝒳 = {0,1}, Θ = [0,1], ℳ is the Bernoulli model, defined by p_θ(X_i = 1) = θ with X_1, …, X_n independent, so that p_θ(x_1, …, x_n) = θ^{n_1} (1 − θ)^{n − n_1}, where n_1 is the nr of 1s in x_1, …, x_n
• Note: for all distributions on Ω, p(x_1, …, x_n) = ∏_{i=1..n} p(x_i | x^{i−1}); the Bernoulli model is the restriction to those distributions with p(x_i = 1 | x^{i−1}) = θ, not depending on the past

Maximum Likelihood

• The method of maximum likelihood (Fisher, 1922) tells us to pick, as a ‘best guess’ of the true θ, the value θ̂ maximizing the probability of the actually observed data: θ̂ := arg max_θ p_θ(x_1, …, x_n). For the Bernoulli model this gives θ̂ = n_1/n, the observed fraction of 1s.
The Likelihood Function

[Figure: p_θ(X^n) plotted as a function of θ]
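Not on the original slide: a Python sketch of the likelihood function for assumed data with n = 10 and n_1 = 7 ones (the data are my own choice):

```python
# The Bernoulli likelihood theta^n1 * (1-theta)^(n-n1) as a function of theta,
# maximized at the MLE theta_hat = n1/n.
n, n1 = 10, 7

def likelihood(theta):
    return theta ** n1 * (1 - theta) ** (n - n1)

grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=likelihood)
print(theta_hat)  # 0.7 = n1/n, Fisher's maximum likelihood estimate
```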
The Bayesian Posterior

• From the Bayesian perspective, you do not necessarily want to make a ‘single’ estimate of θ
• Rather, you want to report the full posterior – this encapsulates everything you have learned from the data
• Example – Bernoulli model with prior P on Θ = [0,1]
• We have already seen the example with P(θ = 1/3) = P(θ = 2/3) = 1/2; the posterior was P(θ | D), a probability distribution on 2 parameter values
• If we want to take a prior on the full Bernoulli model, we should take one with a continuous probability density p(θ)
• Everything works as before: the posterior is p(θ | x^n) = p_θ(x^n) p(θ) / ∫ p_{θ′}(x^n) p(θ′) dθ′
The Bayesian Posterior

• Posterior: p(θ | x^n) ∝ p_θ(x^n) p(θ)
• If we take the uniform prior p(θ) ≡ 1, this is proportional to the likelihood function!
• For more general models, a uniform prior is not always well-defined (and even for Bernoulli, perhaps not desirable!)
• Why not desirable? Not invariant to reparametrization:
• …we could just as well have defined p_θ(X_i = 1) = θ², and a prior that is uniform in this parametrization is not uniform in the original one
• For general parametric models and continuous priors, the posterior looks more and more like a normal distribution as n increases, centered around θ̂, with variance of order 1/n (standard deviation of order 1/√n)
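Not on the original slide: a Python sketch of this concentration for the Bernoulli model with a uniform prior, where the posterior is the Beta(n_1+1, n−n_1+1) density (the sample sizes and true fraction 0.7 are my own choices):

```python
# Posterior mean and standard deviation of the Beta(n1+1, n-n1+1) posterior;
# the sd shrinks like 1/sqrt(n), illustrating the normal approximation.
import math

def posterior_mean_sd(n, n1):
    a, b = n1 + 1, n - n1 + 1  # Beta posterior parameters under uniform prior
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

for n in [10, 100, 1000]:
    print(n, posterior_mean_sd(n, round(0.7 * n)))  # sd ~ 1/sqrt(n)
```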
A Note On Notation

• We will henceforth use w(θ) and w(θ | D) = w(θ | X^n) for prior and posterior (w stands for “weight”) and write p_θ(X^n) instead of p(X^n | θ), and p_Bayes(X^n) for p(X^n), the marginal probability of the data.

• So Bayes theorem becomes
w(θ | X^n) = p_θ(X^n) ⋅ w(θ) / p_Bayes(X^n)
…and
p_Bayes(X^n) = ∑_θ p_θ(X^n) w(θ) (or the corresponding integral for a continuous prior)