
Bayesian Statistics

Jose Storopoliⁱ

ⁱUniversidade Nove de Julho, Pumas-AI


Bayesian Statistics

License

The text and images from these slides are licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

All links are in blue. Feel free to click on them.

Bayesian Statistics, Jose Storopoli 3


Bayesian Statistics

Outline
1. Tools
2. Bayesian Statistics
3. Statistical Distributions
4. Priors
5. Bayesian Workflow
6. Linear Regression
7. Logistic Regression
8. Ordinal Regression
9. Poisson Regression
10. Robust Regression
11. Sparse Regression
12. Hierarchical Models
13. Markov Chain Monte Carlo (MCMC) and Model Metrics
14. Model Comparison
15. Backup Slides

Bayesian Statistics, Jose Storopoli 4


Tools
Bayesian Statistics
Tools

Recommended References
• Ge, Xu and Ghahramani (2018) - Turing paper
• Carpenter et al. (2017) - Stan paper
• Salvatier, Wiecki and Fonnesbeck (2016) - PyMC paper
• Bayesian Statistics with Julia and Turing - Why Julia?

Bayesian Statistics, Jose Storopoli 6


A man and his tools make a man and his trade
— Vita Sackville-West

We shape our tools and then the tools shape us


— Winston Churchill
Bayesian Statistics
Tools

Tools
• Stan (BSD-3 License)
• Turing (MIT License)
• PyMC (Apache License)
• JAGS (GPL License)
• BUGS (GPL License)

Bayesian Statistics, Jose Storopoli 9


Bayesian Statistics
Tools

Stanii
• High-performance platform for statistical modeling and statistical computation
• Financial support from NUMFocus:
‣ AWS Amazon
‣ Bloomberg
‣ Microsoft
‣ IBM
‣ RStudio
‣ Facebook
‣ NVIDIA
‣ Netflix
• Open-source language, similar to C++
• Markov Chain Monte Carlo (MCMC) parallel sampler

ii
Carpenter et al. (2017)

Bayesian Statistics, Jose Storopoli 10


Bayesian Statistics
Tools

Stan Code Example

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 20);
  beta ~ normal(0, 2);
  sigma ~ cauchy(0, 2.5);
  y ~ normal(alpha + beta * x, sigma);
}

Bayesian Statistics, Jose Storopoli 11


Bayesian Statistics
Tools

Turingiii
• Ecosystem of Julia packages for Bayesian Inference using probabilistic
programming
• Julia is a fast dynamically-typed language that just-in-time (JIT) compiles into
native code using LLVM: “runs like C but reads like Python”, meaning that it
is blazing fast and easy to prototype and to read/write code
• Julia has Financial support from NUMFocus
• Composability with other Julia packages
• Several other options of Markov Chain Monte Carlo (MCMC) samplers

iii
Ge, Xu and Ghahramani (2018)

Bayesian Statistics, Jose Storopoli 12


Bayesian Statistics
Tools

Turing Ecosystem
We have several Julia packages under Turing’s GitHub organization TuringLang, but I will focus on 6 of those:
• Turing: main package that we use to interface with all the Turing ecosystem of packages and the backbone of
everything
• MCMCChains: interface to summarizing MCMC simulations and has several utility functions for diagnostics
and visualizations
• DynamicPPL: specifies a domain-specific language for Turing, entirely written in Julia, and it is modular
• AdvancedHMC: modular and efficient implementation of advanced Hamiltonian Monte Carlo (HMC)
algorithms
• DistributionsAD: defines the necessary functions to enable automatic differentiation (AD) of the log PDF
functions from Distributions
• Bijectors: implements a set of functions for transforming constrained random variables (e.g. simplexes,
intervals) to Euclidean space

Bayesian Statistics, Jose Storopoli 13


Bayesian Statistics
Tools

Turingiv Code Example

@model function linreg(x, y)
    # priors
    α ~ Normal(0, 20)
    β ~ Normal(0, 2)
    σ ~ truncated(Cauchy(0, 2.5); lower=0)

    # likelihood
    y .~ Normal.(α .+ β * x, σ)
end

iv
I believe in Julia’s potential and wrote a whole set of Bayesian Statistics tutorials using Julia and Turing (Storopoli, 2021)

Bayesian Statistics, Jose Storopoli 14


Bayesian Statistics
Tools

PyMCv
• Python package for Bayesian statistics with a Markov Chain Monte
Carlo sampler
• Financial support from NUMFocus
• Backend was based on Theano
• Theano died, but PyMC developers created a fork named Aesara
• We have no idea what the backend will be in the future. PyMC
developers are still experimenting with other backends:
TensorFlow Probability, NumPyro, BlackJAX, and so on …

v
Salvatier, Wiecki and Fonnesbeck (2016)

Bayesian Statistics, Jose Storopoli 15


Bayesian Statistics
Tools

PyMC Code Example


import pymc as pm

with pm.Model() as model:
    alpha = pm.Normal("Intercept", mu=0, sigma=20)
    beta = pm.Normal("beta", mu=0, sigma=2)
    sigma = pm.HalfCauchy("sigma", beta=2.5)

    likelihood = pm.Normal("y",
                           mu=alpha + beta * x1,
                           sigma=sigma, observed=y)

Bayesian Statistics, Jose Storopoli 16


Bayesian Statistics
Tools

Which Tool Should You Use?

Turing Stan

Bayesian Statistics, Jose Storopoli 17


Bayesian Statistics
Tools

Why Turing
• Julia all the way down…
• Can interface/compose any Julia package
• Decoupling of modeling DSL, inference algorithms and data
• Not only HMC-NUTS, but a whole plethora of MCMC algorithms, e.g. Metropolis-
Hastings, Gibbs, SMC, IS etc.
• Easy to create/prototype/modify inference algorithms
• Transparent MCMC workflow, e.g. iterative sampling API allows step-wise execution
and debugging of the inference algorithm
• Very easy to do stuff in the GPU, e.g. NVIDIA’s CUDA.jl, AMD’s AMDGPU.jl, Intel’s
oneAPI.jl, and Apple’s Metal.jl
• Very easy to do distributed model inference and prediction.

Bayesian Statistics, Jose Storopoli 18


Bayesian Statistics
Tools

Why Not Turing


• Not as fast as Stan, but pretty close behind.
• Not enough learning materials, example models, and tutorials. Also, documentation is
somewhat lacking in certain areas, e.g. Bijectors.jl.
• Not as many citations as Stan, although not very far behind in GitHub stars.
• Not well-known in the academic community.

Bayesian Statistics, Jose Storopoli 19


Bayesian Statistics
Tools

Why Stan
• API for R, Python and Julia.
• Faster than Turing.jl in 95% of models.
• Well-known in the academic community.
• High citation count.
• More tutorials, example models, and learning materials available.

Bayesian Statistics, Jose Storopoli 20


Bayesian Statistics
Tools

Why Not Stan


• If you want to try something new, you’ll have to do it in C++.
• Constrained to HMC-NUTS as the only MCMC algorithm.
• Cannot decouple the model DSL from data (nor from the inference algorithm).
• Does not compose well with other packages. Anything you want to do has to
“exist” in the Stan world, e.g. bayesplot.
• A not-so-easy and intuitive ODE interface.
• GPU interface depends on OpenCL. Also not easy to interoperate.

Bayesian Statistics, Jose Storopoli 21


Bayesian Statistics
Bayesian Statistics
Bayesian Statistics

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 1: Probability and inference
• McElreath (2020) - Chapter 1: The Golem of Prague
• Gelman, Hill and Vehtari (2020) - Chapter 3: Some basic methods in mathematics and probability
• Khan and Rue (2021)
• Probability:
‣ A great textbook - Bertsekas and Tsitsiklis (2008)
‣ Also a great textbook (skip the frequentist part) - Dekking et al. (2010)
‣ Bayesian point-of-view and also a philosophical approach - Jaynes (2003)
‣ Bayesian point-of-view with a simple and playful approach - Kurt (2019)
‣ Philosophical approach not so focused on mathematical rigor - Diaconis and Skyrms (2019)

Bayesian Statistics, Jose Storopoli 23


Inside every non-Bayesian there is a Bayesian
struggling to get out
— Dennis Lindley
Bayesian Statistics
Bayesian Statistics

What is Bayesian Statistics?

Bayesian statistics is a data analysis approach based on Bayes’ theorem where available
knowledge about the parameters of a statistical model is updated with the information
of observed data. (Andrew Gelman, John B. Carlin, Stern, et al., 2013).

Previous knowledge is expressed as a prior distribution and combined with the


observed data in the form of a likelihood function to generate a posterior distribution.

The posterior can also be used to make predictions about future events.

Bayesian Statistics, Jose Storopoli 26


Bayesian Statistics
Bayesian Statistics

What changes from Frequentist Statistics?


• Flexibility - probabilistic building blocks to construct a modelvi:
‣ Probabilistic conjectures about parameters:
– Prior
– Likelihood
• Better uncertainty treatment:
‣ Coherence
‣ Propagation
‣ We don’t use “if we sampled infinite times from a population that we do not observe…”
• No 𝑝-values:
‣ All statistical intuitions make sense
‣ 95% certainty that the parameter 𝜃’s value is between 𝑥 and 𝑦
‣ Almost impossible to perform 𝑝-hacking

vi
like LEGO

Bayesian Statistics, Jose Storopoli 27


Bayesian Statistics
Bayesian Statistics

A little bit more formal


• Bayesian Statistics uses probabilistic statements:
‣ one or more parameters 𝜃
‣ unobserved data 𝑦̃
• These statements are conditioned on the observed values of 𝑦:
‣ 𝑃 (𝜃 | 𝑦)
‣ 𝑃 (𝑦̃ | 𝑦)
• We also, implicitly, condition on the observed data from any covariate 𝑥

Bayesian Statistics, Jose Storopoli 28


Bayesian Statistics
Bayesian Statistics

Definition of Bayesian Statistics

The use of Bayes theorem as the procedure to estimate parameters of interest 𝜃 or


unobserved data 𝑦̃. (Andrew Gelman, John B. Carlin, Stern, et al., 2013)

Bayesian Statistics, Jose Storopoli 29


Bayesian Statistics
Bayesian Statistics

PROBABILITY DOES NOT EXIST!vii

• Yes, probability does not exist …
• Or even better: probability, as a physical quantity or objective chance,
does NOT exist
• If we disregard objective chance, nothing is lost
• The math of inductive rationality remains exactly the same

vii
Finetti (1974)

Bayesian Statistics, Jose Storopoli 30


Bayesian Statistics
Bayesian Statistics

PROBABILITY DOES NOT EXIST!viii


• Consider flipping a biased coin
• The trials are considered independent and, as a result, have an important property: the order
does not matter
• The frequency is considered a sufficient statistic
• Saying that the order does not matter, or saying that the only thing that matters is the frequency,
are two ways of saying the same thing
• We say that the probability is invariant under permutations
(Figure: a binary tree of coin flips, each branch assigning probability 0.5 to H and T.)

viii
Finetti (1974)

Bayesian Statistics, Jose Storopoli 31


Bayesian Statistics
Bayesian Statistics

Probability Interpretations
• Objective - frequency in the long run for an event:
‣ 𝑃 (rain) = days that rained / total days
‣ 𝑃 (me being elected president) = 0 (never occurred)

• Subjective - degrees of belief in an event:
‣ 𝑃 (rain) = degree of belief that it will rain
‣ 𝑃 (me being elected president) = 10⁻¹⁰ (highly unlikely)

Bayesian Statistics, Jose Storopoli 32


Bayesian Statistics
Bayesian Statistics

What is Probability?
We define 𝐴 as an event and 𝑃 (𝐴) as the probability of event 𝐴.
𝑃 (𝐴) must be between 0 and 1, where higher values indicate a higher probability of 𝐴
happening.
𝑃 (𝐴) ∈ ℝ
𝑃 (𝐴) ∈ [0, 1]
0 ≤ 𝑃 (𝐴) ≤ 1

Bayesian Statistics, Jose Storopoli 33


Bayesian Statistics
Bayesian Statistics

Probability Axiomsix
• Non-negativity: for every 𝐴: 𝑃 (𝐴) ≥ 0

• Additivity: for every two mutually exclusive 𝐴 and 𝐵:
𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵)

• Normalization: the probabilities of all possible events 𝐴1 , 𝐴2 , …
must sum up to 1: ∑_{𝑛∈ℕ} 𝑃 (𝐴𝑛) = 1

ix
Kolmogorov (1933)

Bayesian Statistics, Jose Storopoli 34


Bayesian Statistics
Bayesian Statistics

Sample Space

• Discrete: Θ = {1, 2, …}

• Continuous: Θ = (−∞, ∞)

Bayesian Statistics, Jose Storopoli 35


Bayesian Statistics
Bayesian Statistics

Discrete Sample Space


8 planets in our solar system:
• Mercury: ☿
• Venus: ♀
• Earth: ♁
• Mars: ♂
• Jupiter: ♃
• Saturn: ♄
• Uranus: ♅
• Neptune: ♆

Bayesian Statistics, Jose Storopoli 36


Bayesian Statistics
Bayesian Statistics

Discrete Sample Space


• 𝜃 ∈ 𝐸1 : the planet has a magnetic field
• 𝜃 ∈ 𝐸2 : the planet has moon(s)
• 𝜃 ∈ 𝐸1 ∩ 𝐸2 : the planet has a magnetic field and moon(s)
• 𝜃 ∈ 𝐸1 ∪ 𝐸2 : the planet has a magnetic field or moon(s)
• 𝜃 ∈ ¬𝐸1 : the planet does not have a magnetic field
(In the slides, each event highlights the corresponding subset of ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆.)

Bayesian Statistics, Jose Storopoli 37


Bayesian Statistics
Bayesian Statistics

Continuous Sample Space


• 𝜃 ∈ 𝐸1 : the distance is less than five centimeters
• 𝜃 ∈ 𝐸2 : the distance is between three and seven centimeters
• 𝜃 ∈ 𝐸1 ∩ 𝐸2 : the distance is less than five centimeters and between three and seven centimeters
• 𝜃 ∈ 𝐸1 ∪ 𝐸2 : the distance is less than five centimeters or between three and seven centimeters
• 𝜃 ∈ ¬𝐸1 : the distance is not less than five centimeters

Bayesian Statistics, Jose Storopoli 38


Bayesian Statistics
Bayesian Statistics

Discrete versus Continuous Parameters


Everything that has been exposed here so far was under the assumption that the parameters are
discrete.
This was done with the intent of providing an intuition of what probability is.
We do not always work with discrete parameters.
Parameters can be continuous, such as age, height, weight, etc. But don't despair! All probability
rules and axioms are also valid for continuous parameters.
The only thing we have to do is to swap all sums ∑ for integrals ∫. For example, the third axiom of
Normalization for continuous random variables becomes:

∫_{𝑥 ∈ 𝑋} 𝑝(𝑥) d𝑥 = 1

Bayesian Statistics, Jose Storopoli 39


Bayesian Statistics
Bayesian Statistics

Conditional Probability
Probability of an event occurring given that another event has (or has not) occurred.
The notation we use is 𝑃 (𝐴 | 𝐵), which reads as “the probability of observing 𝐴 given that
we already observed 𝐵”.

𝑃 (𝐴 | 𝐵) = number of elements in A and B / number of elements in B

𝑃 (𝐴 | 𝐵) = 𝑃 (𝐴 ∩ 𝐵) / 𝑃 (𝐵)

assuming that 𝑃 (𝐵) > 0.

Bayesian Statistics, Jose Storopoli 40


Bayesian Statistics
Bayesian Statistics

Example of Conditional Probability – Poker Texas Hold’em


• Sample Space: 52 cards in a deck, 13 ranks and 4 suits.
• 𝑃 (𝐴): probability of being dealt an Ace (4/52 = 1/13)
• 𝑃 (𝐾): probability of being dealt a King (4/52 = 1/13)
• 𝑃 (𝐴 | 𝐾): probability of being dealt an Ace, given that you already have a King (4/51 ≈ 0.078)
• 𝑃 (𝐾 | 𝐴): probability of being dealt a King, given that you already have an Ace (4/51 ≈ 0.078)

Bayesian Statistics, Jose Storopoli 41


Bayesian Statistics
Bayesian Statistics

Caution! 𝑃 (𝐴 | 𝐵) is not always equal to 𝑃 (𝐵 | 𝐴)


In the previous example we have the symmetry 𝑃 (𝐴 | 𝐾) = 𝑃 (𝐾 | 𝐴), but this is not always
truex
The Pope is Catholic:

• 𝑃 (pope): probability of some random person being the Pope, something really small, 1 in 8 billion (1/(8 ⋅ 10⁹))

• 𝑃 (catholic): probability of some random person being Catholic, 1.34 billion in 8 billion (1.34/8 ≈ 0.17)

• 𝑃 (catholic | pope): probability of the Pope being Catholic (999/1000 = 0.999)

• 𝑃 (pope | catholic): probability of a Catholic person being the Pope (1/(1.34 ⋅ 10⁹) ⋅ 0.999 ≈ 7.46 ⋅ 10⁻¹⁰)

• Hence: 𝑃 (catholic | pope) ≠ 𝑃 (pope | catholic)

x
More specifically, if the base rates 𝑃 (𝐴) and 𝑃 (𝐵) aren't equal, the symmetry is broken: 𝑃 (𝐴 | 𝐵) ≠ 𝑃 (𝐵 | 𝐴)

Bayesian Statistics, Jose Storopoli 42


Bayesian Statistics
Bayesian Statistics

Joint Probability
Probability of two or more events occurring together.
The notation we use is 𝑃 (𝐴, 𝐵), which reads as “the probability of observing 𝐴 and also
observing 𝐵”.

𝑃 (𝐴, 𝐵) = 𝑃 (𝐴 ∩ 𝐵)
𝑃 (𝐴, 𝐵) = 𝑃 (𝐵, 𝐴)

Bayesian Statistics, Jose Storopoli 43


Bayesian Statistics
Bayesian Statistics

Example of Joint Probability – Revisiting Poker Texas Hold’em


• Sample Space: 52 cards in a deck, 13 ranks and 4 suits.
• 𝑃 (𝐴): probability of being dealt an Ace (4/52 = 1/13)
• 𝑃 (𝐾): probability of being dealt a King (4/52 = 1/13)
• 𝑃 (𝐴 | 𝐾): probability of being dealt an Ace, given that you already have a King (4/51 ≈ 0.078)
• 𝑃 (𝐾 | 𝐴): probability of being dealt a King, given that you already have an Ace (4/51 ≈ 0.078)
• 𝑃 (𝐴, 𝐾): probability of being dealt an Ace and being dealt a King:

𝑃 (𝐴, 𝐾) = 𝑃 (𝐾, 𝐴)
𝑃 (𝐴) ⋅ 𝑃 (𝐾 | 𝐴) = 𝑃 (𝐾) ⋅ 𝑃 (𝐴 | 𝐾)
(1/13) ⋅ (4/51) = (1/13) ⋅ (4/51) ≈ 0.006

Bayesian Statistics, Jose Storopoli 44


Bayesian Statistics
Bayesian Statistics

Visualization of Joint Probability versus Conditional Probability

Figure 1: 𝑃 (𝑋, 𝑌 ) versus 𝑃 (𝑋 | 𝑌 = −0.75)

Bayesian Statistics, Jose Storopoli 45


Bayesian Statistics
Bayesian Statistics

Product Rule of Probabilityxi

We can decompose a joint probability 𝑃 (𝐴, 𝐵) into the product of two probabilities:

𝑃 (𝐴, 𝐵) = 𝑃 (𝐵, 𝐴)
𝑃 (𝐴) ⋅ 𝑃 (𝐵 | 𝐴) = 𝑃 (𝐵) ⋅ 𝑃 (𝐴 | 𝐵)

xi
also known as the chain rule of probability.

Bayesian Statistics, Jose Storopoli 46


Bayesian Statistics
Bayesian Statistics

Who was Thomas Bayes?


• Thomas Bayes (1701 - 1761) was a statistician, philosopher and Presbyterian
minister who is known for formulating a specific case of the theorem that bears
his name: Bayes’ theorem.
• Bayes never published what would become his most famous accomplishment;
his notes were edited and published posthumously by his friend Richard Price.
• The theorem’s official name is Bayes-Price-Laplace, because Bayes was the first
to discover it, Price got his notes, transcribed them into mathematical notation and
read them to the Royal Society of London, and Laplace independently rediscovered
the theorem at the end of the 18th century in France, without any previous contact
with Bayes’ work, while using probability for statistical inference with census data
in the Napoleonic era.

Bayesian Statistics, Jose Storopoli 47


Bayesian Statistics
Bayesian Statistics

Bayes Theorem
Tells us how to “invert” conditional probability:

𝑃 (𝐴 | 𝐵) = 𝑃 (𝐴) ⋅ 𝑃 (𝐵 | 𝐴) / 𝑃 (𝐵)

Bayesian Statistics, Jose Storopoli 48


Bayesian Statistics
Bayesian Statistics

Bayes’ Theorem Proof


Remember the following probability identity:

𝑃 (𝐴, 𝐵) = 𝑃 (𝐵, 𝐴)
𝑃 (𝐴) ⋅ 𝑃 (𝐵 | 𝐴) = 𝑃 (𝐵) ⋅ 𝑃 (𝐴 | 𝐵)

OK, now divide everything by 𝑃 (𝐵):

𝑃 (𝐴) ⋅ 𝑃 (𝐵 | 𝐴) / 𝑃 (𝐵) = 𝑃 (𝐵) ⋅ 𝑃 (𝐴 | 𝐵) / 𝑃 (𝐵)
𝑃 (𝐴) ⋅ 𝑃 (𝐵 | 𝐴) / 𝑃 (𝐵) = 𝑃 (𝐴 | 𝐵)
𝑃 (𝐴 | 𝐵) = 𝑃 (𝐴) ⋅ 𝑃 (𝐵 | 𝐴) / 𝑃 (𝐵)

Bayesian Statistics, Jose Storopoli 49


Bayesian Statistics
Bayesian Statistics

A Probability Textbook Classicxii


How accurate is a breast cancer test?
• 1% of women have breast cancer (prevalence)
• 80% of mammograms detect breast cancer when it is present (true positive)
• 9.6% of mammograms detect breast cancer when there is no incidence (false positive)

𝑃 (𝐶 | +) = 𝑃 (+ | 𝐶) ⋅ 𝑃 (𝐶) / 𝑃 (+)
𝑃 (𝐶 | +) = 𝑃 (+ | 𝐶) ⋅ 𝑃 (𝐶) / (𝑃 (+ | 𝐶) ⋅ 𝑃 (𝐶) + 𝑃 (+ | ¬𝐶) ⋅ 𝑃 (¬𝐶))
𝑃 (𝐶 | +) = (0.8 ⋅ 0.01) / (0.8 ⋅ 0.01 + 0.096 ⋅ 0.99)
𝑃 (𝐶 | +) ≈ 0.0776

xii
Adapted from: Yudkowski - An Intuitive Explanation of Bayes’ Theorem

Bayesian Statistics, Jose Storopoli 50
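As a sanity check, the same numbers can be plugged into a few lines of Julia. This is just a sketch re-applying Bayes’ theorem to the prevalence, true positive and false positive rates above; the variable names are ours, not part of the slides.

# Bayes' theorem for the mammogram example: P(C | +) = P(+ | C) ⋅ P(C) / P(+)
p_c      = 0.01     # prevalence P(C)
p_pos_c  = 0.80     # true positive rate P(+ | C)
p_pos_nc = 0.096    # false positive rate P(+ | ¬C)

p_pos   = p_pos_c * p_c + p_pos_nc * (1 - p_c)   # total probability P(+)
p_c_pos = p_pos_c * p_c / p_pos                  # posterior P(C | +) ≈ 0.0776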


Bayesian Statistics
Bayesian Statistics

Why Bayes’ Theorem is Important?


We can invert the conditional probability:

𝑃 (hypothesis | data) = 𝑃 (data | hypothesis) ⋅ 𝑃 (hypothesis) / 𝑃 (data)

But isn’t this the 𝑝-value?

NO!

Bayesian Statistics, Jose Storopoli 51


Bayesian Statistics
Bayesian Statistics

What are 𝑝-values?

𝑝-value is the probability of obtaining results at least as extreme as those observed, given
that the null hypothesis 𝐻0 is true:

𝑃 (𝐷 | 𝐻0 )

Bayesian Statistics, Jose Storopoli 52


Bayesian Statistics
Bayesian Statistics

What 𝑝-value is not!

Bayesian Statistics, Jose Storopoli 53


Bayesian Statistics
Bayesian Statistics

What 𝑝-value is not!


• 𝑝-value is not the probability of the null hypothesis
‣ No!
‣ Infamous confusion between 𝑃 (𝐷 | 𝐻0 ) and 𝑃 (𝐻0 | 𝐷).
‣ To get 𝑃 (𝐻0 | 𝐷) you need Bayesian statistics.
• 𝑝-value is not the probability of the data being generated at random
‣ No again!
‣ We haven't stated anything about randomness.
• 𝑝-value does not measure the effect size of a statistical test
‣ Also no… the 𝑝-value does not say anything about effect sizes.
‣ It only says whether the observed data diverge from what is expected under the null hypothesis.
‣ Besides, 𝑝-values can be hacked in several ways (Head et al., 2015).

Bayesian Statistics, Jose Storopoli 54


Bayesian Statistics
Bayesian Statistics

The relationship between 𝑝-value and 𝐻0


To find out about any 𝑝-value, find out what 𝐻0 is behind it. Its definition will never
change, since it is always 𝑃 (𝐷 | 𝐻0 ):
• 𝑡-test: 𝑃 (𝐷 | the difference between the groups is zero)
• ANOVA: 𝑃 (𝐷 | there is no difference between groups)
• Regression: 𝑃 (𝐷 | the coefficient has a null value)
• Shapiro-Wilk: 𝑃 (𝐷 | the population is distributed as a normal distribution)

Bayesian Statistics, Jose Storopoli 55


Bayesian Statistics
Bayesian Statistics

What are Confidence Intervals?

A confidence interval of X% for a parameter is an interval (𝑎, 𝑏) generated by a
repeated sampling procedure that has probability X% of containing the true value
of the parameter, for all possible values of the parameter.
— Neyman (1937), the “father” of confidence intervals

Bayesian Statistics, Jose Storopoli 56


Bayesian Statistics
Bayesian Statistics

What are Confidence Intervals?


Say you performed a statistical analysis to compare the efficacy of a public policy between two
groups and you obtained a difference between the means of these groups. You can express this
difference as a confidence interval. Often we choose 95% confidence.
In other words, 95% is not the probability of obtaining data such that the estimate of the true
parameter is contained in the interval we obtained; it is the probability of obtaining data
such that, if we compute another confidence interval in the same way, it contains the true
parameter.
The interval that we got in this particular instance is irrelevant and might as well be thrown away.
It doesn't say anything about your target population, but about your sample, in an insane process
of infinite sampling …

Bayesian Statistics, Jose Storopoli 57


Bayesian Statistics
Bayesian Statistics

Confidence Intervals versus Posterior Intervals


(Figure: a frequentist confidence interval versus a Bayesian posterior interval for a parameter 𝜃.)

Bayesian Statistics, Jose Storopoli 58


Bayesian Statistics
Bayesian Statistics

But why I never see stats without 𝑝-values?


We cannot understand 𝑝-values if we do not comprehend their origins and
historical trajectory. The first mention of 𝑝-values was made by the statistician
Ronald Fisher in 1925:

𝑝-value is a measure of evidence against the null hypothesis


— Fisher (1925)

• To quantify the strength of the evidence against the null hypothesis, Fisher
defended “𝑝 < 0.05 as the standard level to conclude that there is evidence
against the tested hypothesis”
• “We should not be off-track if we draw a conventional line at 0.05”

Bayesian Statistics, Jose Storopoli 60


Bayesian Statistics
Bayesian Statistics

𝑝 = 0.06
• Since the 𝑝-value is a probability, it is also a continuous measure.

• There is no reason for us to differentiate 𝑝 = 0.049 from 𝑝 = 0.051.

• Robert Rosenthal, a psychologist, said “surely, God loves the .06 nearly as much as the
.05” (Rosnow and Rosenthal, 1989).

Bayesian Statistics, Jose Storopoli 61


Bayesian Statistics
Bayesian Statistics

But why I never heard about Bayesian statistics?xiii

… it will be sufficient … to reaffirm my personal conviction … that


the theory of inverse probability is founded upon an error, and
must be wholly rejected.
— Fisher (1925)

xiii
“inverse probability” was what Bayes’ theorem was called at the beginning of the 20th century.

Bayesian Statistics, Jose Storopoli 62


Bayesian Statistics
Bayesian Statistics

Inside every nonBayesian, there is a Bayesian struggling to get outxiv

• In his final year of life, Fisher published a paper (Fisher, 1962)


examining the possibilities of Bayesian methods, but with the prior
probabilities being determined experimentally.

• Some authors speculate (Jaynes, 2003) that if Fisher were alive


today, he would probably be a Bayesian.

xiv
quote from Dennis Lindley.

Bayesian Statistics, Jose Storopoli 63


Bayesian Statistics
Bayesian Statistics

Bayes’ Theorem as an Inference Engine


Now that you know what probability and Bayes’ theorem are, I will propose the following:

𝑃 (𝜃 | 𝑦) = 𝑃 (𝑦 | 𝜃) ⋅ 𝑃 (𝜃) / 𝑃 (𝑦)

where 𝑃 (𝜃 | 𝑦) is the Posterior, 𝑃 (𝑦 | 𝜃) is the Likelihood, 𝑃 (𝜃) is the Prior, and 𝑃 (𝑦) is the Normalizing Constant.

• 𝜃 – parameter(s) of interest
• 𝑦 – observed data
• Prior: prior probability of the parameter(s) value(s)
• Likelihood: probability of the observed data given the parameter(s) value(s)
• Posterior: posterior probability of the parameter(s) value(s) after we observe the data 𝑦
• Normalizing Constantxv: 𝑃 (𝑦) does not make any intuitive sense. This probability is transformed and can be interpreted as something
that only exists so that the result of 𝑃 (𝑦 | 𝜃) ⋅ 𝑃 (𝜃) is constrained between 0 and 1 – a valid probability.

xv
sometimes also called evidence.

Bayesian Statistics, Jose Storopoli 64
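To make the “inference engine” concrete, here is a minimal sketch (not from the slides) that applies Bayes’ theorem on a grid of parameter values for a coin-flip model, using Distributions.jl; the data (6 heads in 9 flips) are made up for illustration.

using Distributions

y, n  = 6, 9                                     # made-up data: 6 heads in 9 flips
θgrid = range(0, 1; length = 101)                # candidate values for θ
prior = fill(1 / length(θgrid), length(θgrid))   # flat prior P(θ)
like  = [pdf(Binomial(n, θ), y) for θ in θgrid]  # likelihood P(y | θ)

unnorm    = like .* prior                        # P(y | θ) ⋅ P(θ)
posterior = unnorm ./ sum(unnorm)                # divide by P(y) so it sums to 1
θgrid[argmax(posterior)]                         # posterior mode, close to 6/9 ≈ 0.67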


Bayesian Statistics
Bayesian Statistics

Bayes’ Theorem as an Inference Engine


Bayesian statistics allows us to quantify directly the uncertainty related to the value of
one or more parameters of our model given the observed data.

This is the main feature of Bayesian statistics, since we are estimating directly 𝑃 (𝜃 | 𝑦)
using Bayes’ theorem.

The resulting estimate is totally intuitive: it simply quantifies the uncertainty that we have
about the value of one or more parameters given the data, the model assumptions
(likelihood) and the prior probability of these parameters’ values.

Bayesian Statistics, Jose Storopoli 65


Bayesian Statistics
Bayesian Statistics

Bayesian vs Frequentist Stats

• Data: Bayesian – fixed (non-random); Frequentist – uncertain (random)
• Parameters: Bayesian – uncertain (random); Frequentist – fixed (non-random)
• Inference: Bayesian – uncertainty regarding the parameter value; Frequentist – uncertainty regarding the sampling process from an infinite population
• Probability: Bayesian – subjective; Frequentist – objective (but with several model assumptions)
• Uncertainty: Bayesian – posterior interval, 𝑃 (𝜃 | 𝑦); Frequentist – confidence interval, 𝑃 (𝑦 | 𝜃)

Bayesian Statistics, Jose Storopoli 66


Bayesian Statistics
Bayesian Statistics

Advantages of Bayesian Statistics


• Natural approach to express uncertainty
• Ability to incorporate previous information
• Higher model flexibility
• Full posterior distribution of the parameters
• Natural propagation of uncertainty

Main disadvantage: Slow model fitting procedure

Bayesian Statistics, Jose Storopoli 67


Bayesian Statistics
Bayesian Statistics

The beginning of the end of Frequentist Statistics


• Know that you are living in a very special moment in history, of great changes in statistics.
• I believe that frequentist statistics, especially the way we qualify evidence and hypotheses with 𝑝
-values, will change in a “significant”xvi way.
• 8 years ago, the American Statistical Association (ASA) published a statement about 𝑝-values
(Wasserstein and Lazar, 2016). It states exactly what we exposed here: the main concepts of
null hypothesis significance testing and, in particular, 𝑝-values, cannot provide what researchers
demand of them. Despite what several textbooks, learning materials and published content say,
𝑝-values below 0.05 don't “prove” anything. Nor, the other way around, do 𝑝-values higher
than 0.05 refute anything.
• The ASA statement has more than 4,700 citations with relevant impact.

xvi
pun intended …

Bayesian Statistics, Jose Storopoli 68


Bayesian Statistics
Bayesian Statistics

The beginning of the end of Frequentist Statistics


• An international symposium was promoted in 2017, which originated an open-access special edition of
The American Statistician dedicated to practical ways to abandon 𝑝 < 0.05 (Wasserstein, Schirm and
Lazar, 2019).
• Soon there were more attempts and claims. In September 2017, Nature Human Behaviour published an
editorial proposing that the 𝑝-value's significance level be decreased from 0.05 to 0.005 (Benjamin et
al., 2018). Several authors, including highly important and influential statisticians, argued that this
simple step would help to tackle the replication crisis problem in science, which many believe to be the
main consequence of the abusive use of 𝑝-values (Ioannidis, 2019).
• Furthermore, many went a step further and suggested that science banish 𝑝-values once and for all (Lakens
et al., 2018; “It’s Time to Talk about Ditching Statistical Significance,” 2019). Many suggest (including
myself) that the main tool of statistical inference should be Bayesian statistics (Goodman, 2016; Amrhein,
Greenland and McShane, 2019; Schoot et al., 2021).

Bayesian Statistics, Jose Storopoli 69


Statistical Distributions
Bayesian Statistics
Statistical Distributions

Recommended References
• Grimmett and Stirzaker (2020):
‣ Chapter 3: Discrete random variables
‣ Chapter 4: Continuous random variables
• Dekking et al. (2010):
‣ Chapter 4: Discrete random variables
‣ Chapter 5: Continuous random variables
• Betancourt (2019)

Bayesian Statistics, Jose Storopoli 72


Bayesian Statistics
Statistical Distributions

Probability Distributions
Bayesian statistics uses probability distributions as the inference engine of the
parameter and uncertainty estimates.

Imagine that probability distributions are small “Lego” pieces. We can construct
anything we want with these little pieces. We can make a castle, a house, a city; literally
anything.
The same is valid for Bayesian statistical models. We can construct models from the
simplest ones to the most complex using probability distributions and their
relationships.

Bayesian Statistics, Jose Storopoli 74


Bayesian Statistics
Statistical Distributions

Probability Distribution Function


A probability distribution function is a mathematical function that outputs the
probabilities for different results of an experiment. It is a mathematical description of a
random phenomenon in terms of its sample space and the probabilities of events (subsets of
the sample space).
𝑃 : 𝑋 → [0, 1] ⊂ ℝ

For discrete random variables we call this function the “mass”, and for continuous random
variables we call it the “density”.

Bayesian Statistics, Jose Storopoli 75


Bayesian Statistics
Statistical Distributions

Mathematical Notation
We use the notation
𝑋 ∼ Dist(𝜃1 , 𝜃2 , …)

where:
• 𝑋: random variable
• Dist: distribution name
• 𝜃1 , 𝜃2 , …: parameters that define how the distribution behaves
Every probability distribution can be “parameterized” by specifying parameters that
allow us to control certain aspects of the distribution for a specific goal.

Bayesian Statistics, Jose Storopoli 76


Bayesian Statistics
Statistical Distributions

Probability Distribution Function

(Figure: an example probability density function (PDF) over 𝑋 ∈ [−4, 4].)

Bayesian Statistics, Jose Storopoli 77


Bayesian Statistics
Statistical Distributions

Cumulative Distribution Function


The cumulative distribution function (CDF) of a random variable 𝑋 evaluated at 𝑥 is the
probability that 𝑋 will take values less than or equal to 𝑥:
CDF(𝑥) = 𝑃 (𝑋 ≤ 𝑥)

Bayesian Statistics, Jose Storopoli 78


Bayesian Statistics
Statistical Distributions

Cumulative Distribution Function


(Figure: an example cumulative distribution function (CDF) over 𝑋 ∈ [−4, 4], rising from 0 to 1.)

Bayesian Statistics, Jose Storopoli 79
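The PDF and CDF above map directly onto functions from the Distributions.jl package; a small sketch with a standard normal as the example (the package choice is an assumption, not part of the slides):

using Distributions

X = Normal(0, 1)     # X ~ Normal(μ = 0, σ = 1)

pdf(X, 0.0)          # density at x = 0, ≈ 0.3989
cdf(X, 0.0)          # P(X ≤ 0) = 0.5
cdf(X, 1.96)         # P(X ≤ 1.96) ≈ 0.975
quantile(X, 0.975)   # inverse CDF, ≈ 1.96
rand(X, 5)           # five random draws from X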


Bayesian Statistics
Statistical Distributions

Discrete Distributions
Discrete probability distributions are distributions whose results are a discrete
number: −𝑁 , …, −2, −1, 0, 1, 2, …, 𝑁 with 𝑁 ∈ ℤ.

In discrete probability distributions we call the probability of the distribution taking
certain values its “mass”. The probability mass function (PMF) is the function that
specifies the probability of a random variable 𝑋 taking value 𝑥:

PMF(𝑥) = 𝑃 (𝑋 = 𝑥)

Bayesian Statistics, Jose Storopoli 80


Bayesian Statistics
Statistical Distributions

Discrete Uniform
The discrete uniform is a symmetric probability distribution in which a finite number of
values are equally likely to be observed. Each of the 𝑛 values has probability 1/𝑛.

The discrete uniform distribution has two parameters and its notation is Uniform(𝑎, 𝑏):
• 𝑎 – lower bound
• 𝑏 – upper bound

Example: dice.

Bayesian Statistics, Jose Storopoli 81


Bayesian Statistics
Statistical Distributions

Discrete Uniform

Uniform(𝑎, 𝑏) = 𝑓(𝑥, 𝑎, 𝑏) = 1 / (𝑏 − 𝑎 + 1) for 𝑥 ∈ {𝑎, 𝑎 + 1, …, 𝑏 − 1, 𝑏}

Bayesian Statistics, Jose Storopoli 82
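The dice example can be checked with Distributions.jl; a brief sketch:

using Distributions

die = DiscreteUniform(1, 6)   # a fair die: Uniform(a = 1, b = 6)
pdf(die, 3)                   # PMF of any face: 1/(6 − 1 + 1) = 1/6
cdf(die, 4)                   # P(X ≤ 4) = 4/6
mean(die), var(die)           # 3.5 and ≈ 2.92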


Bayesian Statistics
Statistical Distributions

Discrete Uniform

(Figure: PMF of Uniform(𝑎 = 1, 𝑏 = 6); each value 1–6 has probability 1/6.)

Bayesian Statistics, Jose Storopoli 83


Bayesian Statistics
Statistical Distributions

Bernoulli
The Bernoulli distribution describes a binary event: the success or failure of an experiment. We
represent 0 as failure and 1 as success, hence the result of a Bernoulli distribution is a
binary variable 𝑌 ∈ {0, 1}.
The Bernoulli distribution is often used to model discrete binary outcomes where there are
only two possible results.
The Bernoulli distribution has a single parameter and its notation is Bernoulli(𝑝):
• 𝑝 – probability of success

Example: whether the patient survived or died, or whether the client made a purchase or not.

Bayesian Statistics, Jose Storopoli 84


Bayesian Statistics
Statistical Distributions

Bernoulli

Bernoulli(𝑝) = 𝑓(𝑥, 𝑝) = 𝑝𝑥 (1 − 𝑝)1−𝑥 for 𝑥 ∈ {0, 1}

Bayesian Statistics, Jose Storopoli 85


Bayesian Statistics
Statistical Distributions

Bernoulli

(Figure: PMF of Bernoulli(𝑝 = 1/3).)

Bayesian Statistics, Jose Storopoli 86


Bayesian Statistics
Statistical Distributions

Binomial
The binomial distribution describes the number of successes in a sequence of 𝑛
independent experiments, each one asking a yes–no question with probability of
success 𝑝. Notice that the Bernoulli distribution is a special case of the binomial
distribution where 𝑛 = 1.

The binomial distribution has two parameters and its notation is Binomial(𝑛, 𝑝) :
• 𝑛 – number of experiments
• 𝑝 – probability of success

Example: number of heads in five coin throws.

Bayesian Statistics, Jose Storopoli 87


Bayesian Statistics
Statistical Distributions

Binomial

Binomial(𝑛, 𝑝) = 𝑓(𝑥, 𝑛, 𝑝) = (𝑛 choose 𝑥) 𝑝^𝑥 (1 − 𝑝)^(𝑛−𝑥) for 𝑥 ∈ {0, 1, …, 𝑛}

Bayesian Statistics, Jose Storopoli 88
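A quick sketch with Distributions.jl for the coin-throw example, also checking that the Bernoulli is the 𝑛 = 1 special case:

using Distributions

coin = Binomial(10, 0.5)           # number of heads in 10 fair coin throws
pdf(coin, 5)                       # P(X = 5) ≈ 0.246
sum(pdf(coin, k) for k in 0:10)    # the PMF sums to 1

pdf(Bernoulli(0.5), 1) ≈ pdf(Binomial(1, 0.5), 1)   # true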


Bayesian Statistics
Statistical Distributions

Binomial

(Figure: PMFs of Binomial(𝑛 = 10, 𝑝 = 1/5) and Binomial(𝑛 = 10, 𝑝 = 1/2).)

Bayesian Statistics, Jose Storopoli 89


Bayesian Statistics
Statistical Distributions

Poisson
The Poisson distribution describes the probability of a certain number of events occurring in
a fixed time interval, if these events occur with a known constant mean rate and
independently of the time since the last occurrence. The Poisson distribution can also be
used for the number of events in other types of intervals, such as distance, area or volume.

The Poisson distribution has one parameter and its notation is Poisson(𝜆):

• 𝜆 – rate

Example: the number of e-mails that you receive daily, or the number of potholes you'll
find in your commute.

Bayesian Statistics, Jose Storopoli 90


Bayesian Statistics
Statistical Distributions

Poisson

Poisson(𝜆) = 𝑓(𝑥, 𝜆) = 𝜆^𝑥 𝑒^(−𝜆) / 𝑥! for 𝜆 > 0

Bayesian Statistics, Jose Storopoli 91


Bayesian Statistics
Statistical Distributions

Poisson

(Figure: PMFs of Poisson(𝜆 = 2) and Poisson(𝜆 = 3).)

Bayesian Statistics, Jose Storopoli 92


Bayesian Statistics
Statistical Distributions

Negative Binomialxvii
The negative binomial distribution describes the number of failures in a sequence of independent
experiments, each one asking a yes–no question with probability of success 𝑝, before 𝑘 successes occur.
Notice that it becomes the Poisson distribution in the limit as 𝑘 → ∞. This makes it a robust option to replace
a Poisson distribution to model phenomena with overdispersion (presence of greater variability in the data than
would be expected).
The negative binomial has two parameters and its notation is Negative Binomial(𝑘, 𝑝):
• 𝑘 – number of successes
• 𝑝 – probability of success
Example: annual occurrence of tropical cyclones.

xvii
any phenomenon that can be modeled with a Poisson distribution can also be modeled with a negative binomial distribution (Andrew Gelman, John B. Carlin, Stern, et al., 2013),
(Gelman, Hill and Vehtari, 2020).

Bayesian Statistics, Jose Storopoli 93


Bayesian Statistics
Statistical Distributions

Negative Binomial

Negative Binomial(𝑘, 𝑝) = 𝑓(𝑥, 𝑘, 𝑝) = ((𝑥 + 𝑘 − 1) choose 𝑥) 𝑝^𝑘 (1 − 𝑝)^𝑥 for 𝑥 ∈ {0, 1, 2, …}

Bayesian Statistics, Jose Storopoli 94
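The following sketch contrasts the Poisson's equal mean and variance with the negative binomial's overdispersion; Distributions.jl's NegativeBinomial(𝑘, 𝑝) also counts failures before the 𝑘-th success, matching the parameterization above.

using Distributions

pois = Poisson(4)
mean(pois), var(pois)           # (4.0, 4.0): mean equals variance

nb = NegativeBinomial(5, 0.5)   # k = 5 successes, p = 1/2
mean(nb), var(nb)               # (5.0, 10.0): variance greater than mean (overdispersion)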


Bayesian Statistics
Statistical Distributions

Negative Binomial

(Figure: PMFs of Negative Binomial(𝑘 = 1, 𝑝 = 1/2) and Negative Binomial(𝑘 = 5, 𝑝 = 1/2).)

Bayesian Statistics, Jose Storopoli 95


Bayesian Statistics
Statistical Distributions

Continuous Distributions
Continuous probability distributions are distributions whose results are values on a continuous real number
line: (−∞, +∞) ⊆ ℝ.
In continuous probability distributions we call the probability of the distribution taking values its “density”.
Since we are referring to real numbers, we cannot obtain the probability of a random variable 𝑋 taking exactly
the value 𝑥.
That probability is always 0, since we cannot specify the exact value of 𝑥. 𝑥 lies on the real number line, hence
we need to specify the probability of 𝑋 taking values in an interval [𝑎, 𝑏].
The probability density function (PDF) 𝑓 is defined such that:

𝑃 (𝑎 ≤ 𝑋 ≤ 𝑏) = ∫_𝑎^𝑏 𝑓(𝑥) d𝑥

Bayesian Statistics, Jose Storopoli 96
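In code, interval probabilities for continuous variables come from differences of the CDF; a brief sketch with Distributions.jl:

using Distributions

X = Normal(0, 1)
cdf(X, 1) - cdf(X, -1)   # P(−1 ≤ X ≤ 1) = ∫ f(x) dx over [−1, 1], ≈ 0.683
pdf(X, 1.0)              # a density value, not a probability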


Bayesian Statistics
Statistical Distributions

Continuous Uniform
The continuous uniform distribution is a symmetric probability distribution in which an
infinite number of equally sized value intervals are equally likely to be observed; each of
the 𝑛 such intervals has probability 1/𝑛.

The continuous uniform distribution has two parameters and its notation is
Uniform(𝑎, 𝑏):
• 𝑎 – lower bound
• 𝑏 – upper bound

Bayesian Statistics, Jose Storopoli 97


Bayesian Statistics
Statistical Distributions

Continuous Uniform

Uniform(𝑎, 𝑏) = 𝑓(𝑥, 𝑎, 𝑏) = 1 / (𝑏 − 𝑎) for 𝑥 ∈ [𝑎, 𝑏]

Bayesian Statistics, Jose Storopoli 98


Bayesian Statistics
Statistical Distributions

Continuous Uniform
(Figure: PDF of Uniform(𝑎 = 0, 𝑏 = 6).)

Bayesian Statistics, Jose Storopoli 99


Bayesian Statistics
Statistical Distributions

Normal
This distribution is generally used in the social and natural sciences to represent continuous
variables whose underlying distributions are unknown.

This assumption is due to the central limit theorem (CLT): under certain conditions,
the mean of many samples (observations) of a random variable with finite mean and
variance is itself a random variable whose distribution converges to a
normal distribution as the number of samples increases (as 𝑛 → ∞).

Hence, physical quantities that we assume to be the sum of many independent
processes (with measurement error) often have underlying distributions that are similar
to normal distributions.

Bayesian Statistics, Jose Storopoli 100


Bayesian Statistics
Statistical Distributions

Normal
The normal distribution has two parameters and its notation is Normal(𝜇, 𝜎) or 𝑁 (𝜇, 𝜎):
• 𝜇 – mean of the distribution, and also median and mode
• 𝜎 – standard deviationxviii, a dispersion measure of how observations occur in relation
to the mean

Example: height, weight etc.

xviii
sometimes is also parameterized as variance 𝜎2 .

Bayesian Statistics, Jose Storopoli 101


Bayesian Statistics
Statistical Distributions

Normalxix

Normal(𝜇, 𝜎) = 𝑓(𝑥, 𝜇, 𝜎) = (1 / (𝜎√(2𝜋))) 𝑒^(−½((𝑥−𝜇)/𝜎)²) for 𝜎 > 0

xix
see how the normal distribution was derived from the binomial distribution in the backup slides.

Bayesian Statistics, Jose Storopoli 102


Bayesian Statistics
Statistical Distributions

Normal
(Figure: PDFs of Normal(𝜇 = 0, 𝜎 = 1) and Normal(𝜇 = 1, 𝜎 = 2/3).)

Bayesian Statistics, Jose Storopoli 103


Bayesian Statistics
Statistical Distributions

Log-Normal
The log-normal distribution is a continuous probability distribution of a random variable
whose natural logarithm is distributed as a normal distribution. Thus, if the natural
logarithm of a random variable 𝑋, that is 𝑌 = ln(𝑋), is normally distributed, then 𝑋 is
log-normally distributed.
A log-normal random variable only takes positive real values. It is a convenient and
useful model for measurements in the exact and engineering sciences, as well as in the
biomedical, economic and other sciences; for example energy, concentrations, lengths,
financial returns and other measurements.
A log-normal process is the statistical realization of the multiplicative product of many
independent positive random variables.

Bayesian Statistics, Jose Storopoli 104


Bayesian Statistics
Statistical Distributions

Log-Normal
The log-normal distribution has two parameters and its notation is Log-Normal(𝜇, 𝜎²):

• 𝜇 – mean of the distribution’s natural logarithm


• 𝜎 – square root of the variance of the distribution’s natural logarithm

Bayesian Statistics, Jose Storopoli 105


Bayesian Statistics
Statistical Distributions

Log-Normal

Log-Normal(𝜇, 𝜎) = 𝑓(𝑥, 𝜇, 𝜎) = (1 / (𝑥𝜎√(2𝜋))) 𝑒^(−(ln(𝑥) − 𝜇)² / (2𝜎²)) for 𝜎 > 0

Bayesian Statistics, Jose Storopoli 106
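The defining relationship (the log of a log-normal variable is normal) can be checked numerically; a sketch:

using Distributions, Statistics

x = rand(LogNormal(0, 1), 100_000)   # strictly positive draws
mean(log.(x)), std(log.(x))          # ≈ (0, 1): log(X) is standard normally distributed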


Bayesian Statistics
Statistical Distributions

Log-Normal
(Figure: PDFs of Log-Normal(𝜇 = 0, 𝜎 = 1) and Log-Normal(𝜇 = 1, 𝜎 = 2/3).)

Bayesian Statistics, Jose Storopoli 107


Bayesian Statistics
Statistical Distributions

Exponential
The exponential distribution is the probability distribution of the time between events
that occur continuously, independently, and at a constant mean rate.

The exponential distribution has one parameter and its notation is Exponential(𝜆):
• 𝜆 – rate

Example: how long until the next earthquake, or how long until the next bus arrives.

Bayesian Statistics, Jose Storopoli 108


Bayesian Statistics
Statistical Distributions

Exponential

Exponential(𝜆) = 𝑓(𝑥, 𝜆) = 𝜆𝑒^(−𝜆𝑥) for 𝜆 > 0

Bayesian Statistics, Jose Storopoli 109


Bayesian Statistics
Statistical Distributions

Exponential

(Figure: PDFs of Exponential(𝜆 = 1) and Exponential(𝜆 = 1/2).)

Bayesian Statistics, Jose Storopoli 110


Bayesian Statistics
Statistical Distributions

Gamma
The gamma distribution is a long-tailed distribution with support only on the positive real
numbers.

The gamma distribution has two parameters and its notation is Gamma(𝛼, 𝜃):
• 𝛼 – shape parameter
• 𝜃 – scale parameter

Example: any waiting time can be modeled with a gamma distribution.

Bayesian Statistics, Jose Storopoli 111


Bayesian Statistics
Statistical Distributions

Gamma

Gamma(𝛼, 𝜃) = 𝑓(𝑥, 𝛼, 𝜃) = 𝑥^(𝛼−1) 𝑒^(−𝑥/𝜃) / (Γ(𝛼) 𝜃^𝛼) for 𝑥, 𝛼, 𝜃 > 0

Bayesian Statistics, Jose Storopoli 112


Bayesian Statistics
Statistical Distributions

Gamma

(Figure: PDFs of Gamma(𝛼 = 1, 𝜃 = 1) and Gamma(𝛼 = 2, 𝜃 = 1/2).)

Bayesian Statistics, Jose Storopoli 113


Bayesian Statistics
Statistical Distributions

Student’s 𝑡
Student’s 𝑡 distribution arises when estimating the mean of a normally-distributed population in
situations where the sample size is small and the population standard deviation is unknownxx.

If we take a sample of 𝑛 observations from a normal distribution, then Student’s 𝑡 distribution
with 𝜈 = 𝑛 − 1 degrees of freedom can be defined as the distribution of the location of the sample
mean relative to the true mean, divided by the sample standard deviation, after multiplying
by the scaling term √𝑛.

Student’s 𝑡 distribution is symmetric and bell-shaped, like the normal distribution, but with
longer tails, which means it has a greater chance of producing values far away from its mean.

xx
this is where the ubiquitous Student’s 𝑡 test comes from.

Bayesian Statistics, Jose Storopoli 114


Bayesian Statistics
Statistical Distributions

Student’s 𝑡
Student’s 𝑡 distribution has one parameter and its notation is Student(𝜈):
• 𝜈 – degrees of freedom, controls how much it resembles a normal distribution

Example: a dataset full of outliers.

Bayesian Statistics, Jose Storopoli 115


Bayesian Statistics
Statistical Distributions

Student’s 𝑡

Student(𝜈) = 𝑓(𝑥, 𝜈) = (Γ((𝜈+1)/2) / (√(𝜈𝜋) Γ(𝜈/2))) (1 + 𝑥²/𝜈)^(−(𝜈+1)/2) for 𝜈 ≥ 1

Bayesian Statistics, Jose Storopoli 116
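The heavier tails show up clearly in the quantiles; a quick sketch comparing the normal with Student's 𝑡 (TDist in Distributions.jl):

using Distributions

quantile(Normal(0, 1), 0.995)   # ≈ 2.58
quantile(TDist(3), 0.995)       # ≈ 5.84: ν = 3 already produces much larger extremes
quantile(TDist(1), 0.995)       # ≈ 63.7: ν = 1 is the standard Cauchy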


Bayesian Statistics
Statistical Distributions

Student’s 𝑡

(Figure: PDFs of Student’s 𝑡 with 𝜈 = 1 and 𝜈 = 3.)

Bayesian Statistics, Jose Storopoli 117


Bayesian Statistics
Statistical Distributions

Cauchy
The Cauchy distribution is a bell-shaped distribution and a special case of Student’s 𝑡
with 𝜈 = 1.

But, differently from Student’s 𝑡, the Cauchy distribution has two parameters and its
notation is Cauchy(𝜇, 𝜎):
• 𝜇 – location parameter
• 𝜎 – scale parameter

Example: a dataset full of outliers.

Bayesian Statistics, Jose Storopoli 118


Bayesian Statistics
Statistical Distributions

Cauchy

Cauchy(𝜇, 𝜎) = 𝑓(𝑥, 𝜇, 𝜎) = 1 / (𝜋𝜎(1 + ((𝑥 − 𝜇)/𝜎)²)) for 𝜎 > 0

Bayesian Statistics, Jose Storopoli 119


Bayesian Statistics
Statistical Distributions

Cauchy
(Figure: PDFs of Cauchy(𝜇 = 0, 𝜎 = 1) and Cauchy(𝜇 = 0, 𝜎 = 1/2).)

Bayesian Statistics, Jose Storopoli 120


Bayesian Statistics
Statistical Distributions

Beta
The beta distribution is a natural choice to model anything that is restricted to values
between 0 and 1. Hence, it is a good candidate to model probabilities and proportions.

The beta distribution has two parameters and its notation is Beta(𝛼, 𝛽):
• 𝛼 (or sometimes 𝑎) – shape parameter, controls how much the shape is shifted towards 1
• 𝛽 (or sometimes 𝑏) – shape parameter, controls how much the shape is shifted towards 0

Example: a basketball player who has already scored 5 free throws and missed 3 in a
total of 8 attempts – Beta(5, 3)

Bayesian Statistics, Jose Storopoli 121


Bayesian Statistics
Statistical Distributions

Beta

Beta(𝛼, 𝛽) = 𝑓(𝑥, 𝛼, 𝛽) = 𝑥^(𝛼−1) (1 − 𝑥)^(𝛽−1) / (Γ(𝛼)Γ(𝛽) / Γ(𝛼+𝛽)) for 𝛼, 𝛽 > 0 and 𝑥 ∈ [0, 1]

Bayesian Statistics, Jose Storopoli 122
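For the free-throw example above, a small sketch with Distributions.jl:

using Distributions

shots = Beta(5, 3)   # 5 scored and 3 missed free throws
mean(shots)          # α / (α + β) = 5/8 = 0.625
cdf(shots, 0.5)      # P(success proportion < 0.5) ≈ 0.23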


Bayesian Statistics
Statistical Distributions

Beta
(Figure: PDFs of Beta(𝛼 = 1, 𝛽 = 1) and Beta(𝛼 = 3, 𝛽 = 2).)

Bayesian Statistics, Jose Storopoli 123


Priors
Bayesian Statistics
Priors

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013):
‣ Chapter 2: Single-parameter models
‣ Chapter 3: Introduction to multiparameter models
• McElreath (2020) - Chapter 4: Geocentric Models
• Gelman, Hill and Vehtari (2020):
‣ Chapter 9, Section 9.3: Prior information and Bayesian synthesis
‣ Chapter 9, Section 9.5: Uniform, weakly informative, and informative priors in
regression
• Schoot et al. (2021)

Bayesian Statistics, Jose Storopoli 125


Bayesian Statistics
Priors

Prior Probability
Bayesian statistics is characterized by the use of prior information as the prior
probability 𝑃 (𝜃), often just prior:
𝑃 (𝜃 | 𝑦) = 𝑃 (𝑦 | 𝜃) ⋅ 𝑃 (𝜃) / 𝑃 (𝑦)

where 𝑃 (𝜃 | 𝑦) is the Posterior, 𝑃 (𝑦 | 𝜃) is the Likelihood, 𝑃 (𝜃) is the Prior, and 𝑃 (𝑦) is the Normalizing Constant.

Bayesian Statistics, Jose Storopoli 127


Bayesian Statistics
Priors

The Subjectivity of the Prior


• Many criticisms of Bayesian statistics are due to the subjectivity in eliciting prior probabilities
on certain hypotheses or model parameter values.
• Subjectivity is something unwanted in the idealized picture of the scientist and the
scientific method.
• Anything that involves human action will never be free from subjectivity. We have
subjectivity in everything, and science is no exception.
• The creative and deductive process of theory and hypothesis formulation is not
objective.
• Frequentist statistics, which bans the use of prior probabilities, is also subjective, since
there is A LOT of subjectivity in choosing which model and likelihood function to use (Jaynes,
2003; Schoot et al., 2021).

Bayesian Statistics, Jose Storopoli 128


Bayesian Statistics
Priors

How to Incorporate Subjectivity


• Bayesian statistics embraces subjectivity while frequentist statistics bans it.
• For Bayesian statistics, subjectivity guides our inferences and leads to more robust
and reliable models that can assist in decision making.
• Whereas, for frequentist statistics, subjectivity is a taboo and all inferences should be
objective, even if it resorts to hiding and omitting model assumptions.
• Bayesian statistics also has assumptions and subjectivity, but these are declared and
formalized.

Bayesian Statistics, Jose Storopoli 129


Bayesian Statistics
Priors

Types of Priors
In general, we can have 3 types of priors in a Bayesian approach (Andrew Gelman, John
B. Carlin, Stern, et al., 2013; McElreath, 2020; Schoot et al., 2021):
• uniform (flat): not recommended.
• weakly informative: small amounts of real-world information along with common
sense and low specific domain knowledge added.
• informative: introduction of medium to high domain knowledge.

Bayesian Statistics, Jose Storopoli 130


Bayesian Statistics
Priors

Uniform Prior (Flat)


Starts from the premise that “everything is possible”. There are no limits on the degree of
belief about what the distribution of certain values must be, nor any sort of restrictions.
Flat and super-vague priors are usually not recommended, and some thought should be put
into having at least weakly informative priors.
Formally, a uniform prior is a uniform distribution over all the possible support of the
possible values:
• model parameters: {𝜃 ∈ ℝ : −∞ < 𝜃 < ∞}
• model error or residuals: {𝜎 ∈ ℝ⁺ : 0 < 𝜎 < ∞}

Bayesian Statistics, Jose Storopoli 131


Bayesian Statistics
Priors

Weakly Informative Prior


Here we start to have an “educated” guess about our parameter values. Hence, we don't
start from the premise that “anything is possible”.

I recommend always transforming the priors of the problem at hand into something
centered at 0 with standard deviation of 1xxi:
• 𝜃 ∼ Normal(0, 1) (Andrew Gelman's preferred choicexxii)
• 𝜃 ∼ Student(𝜈 = 3, 0, 1) (Aki Vehtari's preferred choicexxii)

xxi
this is called standardization, transforming all variables into 𝜇 = 0 and 𝜎 = 1.
xxii
see more about prior choices in the Stan’s GitHub wiki.

Bayesian Statistics, Jose Storopoli 132
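A minimal sketch of what this looks like in practice: standardize the variable, then place one of the suggested priors on the (standardized) parameter. The vector x below is made up for illustration.

using Distributions, Statistics

x = [12.3, 15.1, 9.8, 20.4, 14.7]   # made-up variable
x_std = (x .- mean(x)) ./ std(x)    # standardization: μ = 0, σ = 1

prior_gelman  = Normal(0, 1)        # Andrew Gelman's preferred choice
prior_vehtari = TDist(3)            # Aki Vehtari's preferred choice (ν = 3, location 0, scale 1)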


Bayesian Statistics
Priors

An Example of a Robust Prior


A nice example comes from a lecture by Ben Goodrichxxiii (Columbia professor and member of Stan's
research group).
He discusses one of the biggest effect sizes observed in the social sciences. In the exit polls for
the 2008 USA presidential election (Obama vs McCain), there was, in general, around 40% of
support for Obama. If you changed the respondent's race from non-black to black, this was
associated with an increase of 60% in the probability of the respondent voting for Obama.
In log-odds scales, a 2.5× increase (from 40% to almost 100%) would be equivalent, in a Bernoulli/
logistic/binomial model, to a coefficient value of ≈ 0.92xxiv. This effect size would still be easily
accommodated by a Normal(0, 1) prior.

xxiii
https://youtu.be/p6cyRBWahRA, in case you want to see the full video, the section about priors related to the argument begins at minute 40
xxiv
log(odds ratio) = log(2.5) = 0.9163.

Bayesian Statistics, Jose Storopoli 133


Bayesian Statistics
Priors

Informative Prior
In some contexts, it is interesting to use an informative prior. Good candidates are when
data is scarce or expensive and prior knowledge about the phenomena is available.

Some examples:
• Normal(5, 20)
• Log-Normal(0, 5)
• Beta(100, 9803)xxv

xxv
this is used in COVID-19 models from the CoDatMo Stan research group.

Bayesian Statistics, Jose Storopoli 134


Bayesian Workflow
Bayesian Statistics
Bayesian Workflow

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 6: Model checking
• McElreath (2020) - Chapter 4: Geocentric Models
• Gelman, Hill and Vehtari (2020):
‣ Chapter 6: Background on regression modeling
‣ Chapter 11: Assumptions, diagnostics, and model evaluation
• Gelman et al. (2020) - “Workflow Paper”

Bayesian Statistics, Jose Storopoli 136


Bayesian Statistics
Bayesian Workflow

All Models Are Wrong

All models are wrong but some are useful


— George Box (Box, 1976)

Bayesian Statistics, Jose Storopoli 138


Bayesian Statistics
Bayesian Workflow

Bayesian Workflowxxvi

(Diagram: Prior Elicitation → Model Specification → Inference, with Prior Predictive Checks and Posterior Predictive Checks along the way.)

xxvi
based on Gelman et al. (2020)

Bayesian Statistics, Jose Storopoli 139


Bayesian Statistics
Bayesian Workflow

Bayesian Workflowxxvii
• Understand the domain and problem.
• Formulate the model mathematically.
• Implement model, test, and debug.
• Perform prior predictive checks.
• Fit the model.
• Assess convergence diagnostics.
• Perform posterior predictive checks.
• Improve the model iteratively: from baseline to complex and computationally efficient
models.

xxvii
adapted from Elizaveta Semenova.

Bayesian Statistics, Jose Storopoli 140


Bayesian Statistics
Bayesian Workflow

Actual Bayesian Workflow

Figure 2: Bayesian workflow by Gelman et al. (2020).

Bayesian Statistics, Jose Storopoli 141


Bayesian Statistics
Bayesian Workflow

Not a “new idea”

Figure 3: Box’s Loop from Box (1976) but taken from Blei (2014).

Bayesian Statistics, Jose Storopoli 142


Bayesian Statistics
Bayesian Workflow

Prior Predictive Check


Before we feed data into our model, we need to check all of our priors.

In a very simple way, it consists of simulating parameter values from the prior distributions,
without conditioning on any data or employing any likelihood function.

Independent of the level of information specified in the priors, it is always important to
perform a prior sensitivity analysis in order to have a deep understanding of the prior
influence on the posterior.

Bayesian Statistics, Jose Storopoli 143


Bayesian Statistics
Bayesian Workflow

Posterior Predictive Check


We need to make sure that the posterior distribution of 𝒚, namely 𝒚̃, can capture all the nuances
of the real density/mass of 𝒚.
This procedure is called a posterior predictive check, and it is generally carried out by a visual
inspectionxxviii of the real density/mass of 𝒚 against generated samples of 𝒚 from the Bayesian
model.
The purpose is to compare the histogram of the dependent variable 𝒚 against the histograms of
the dependent variables 𝒚ʳᵉᵖ simulated by the model after parameter inference.
The idea is that the real and simulated histograms blend together and we do not observe any
divergences.

xxviii
we also perform mathematical/exact inspections, see the section on Model Comparison.

Bayesian Statistics, Jose Storopoli 144
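A prior predictive check can be done “by hand” with Distributions.jl, with no probabilistic programming machinery: draw parameters from the priors, simulate the outcome, and inspect the result. The sketch below reuses the priors of the earlier linear regression examples and a made-up predictor; a posterior predictive check follows the same recipe, but drawing (α, β, σ) from the MCMC chain instead of from the priors.

using Distributions

x = randn(100)                                    # made-up predictor
n_sims = 500
y_sim = Matrix{Float64}(undef, n_sims, length(x))

for s in 1:n_sims
    α = rand(Normal(0, 20))                       # draws from the priors only:
    β = rand(Normal(0, 2))                        # no data, no likelihood
    σ = rand(truncated(Cauchy(0, 2.5); lower = 0))
    y_sim[s, :] = α .+ β .* x .+ rand(Normal(0, σ), length(x))
end

extrema(y_sim)   # are the simulated outcomes on a plausible scale for the problem?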


Bayesian Statistics
Bayesian Workflow

Examples of Posterior Predictive Checks

Figure 4: Real versus Simulated Densities
Figure 5: Real versus Simulated Empirical CDFs

Bayesian Statistics, Jose Storopoli 145


Linear Regression
Bayesian Statistics
Linear Regression

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013):
‣ Chapter 14: Introduction to regression models
‣ Chapter 16: Generalized linear models
• McElreath (2020) - Chapter 4: Geocentric Models
• Gelman, Hill and Vehtari (2020):
‣ Chapter 7: Linear regression with a single predictor
‣ Chapter 8: Fitting regression models
‣ Chapter 10: Linear regression with multiple predictors

Bayesian Statistics, Jose Storopoli 147


Bayesian Statistics
Linear Regression

What is Linear Regression?

(Figure: illustration of a linear regression fit to data.)

Bayesian Statistics, Jose Storopoli 149


Bayesian Statistics
Linear Regression

What is Linear Regression?


The idea here is to model a dependent variable as a linear combination of independent
variables.
𝒚 = 𝛼 + 𝑿𝜷 + 𝜀

where:
• 𝒚 – dependent variable
• 𝛼 – intercept (also called as constant)
• 𝜷 – coefficient vector
• 𝑿 – data matrix
• 𝜀 – model error

Bayesian Statistics, Jose Storopoli 150


Bayesian Statistics
Linear Regression

Linear Regression Assumptions

• model error 𝜀 is independent of 𝑿 and 𝒚.


• Dependent variable 𝒚 is continuous, unbounded, and, more importantly, “metric”-
scaled, i.e. equidistant.
‣ e.g. the increase from 1 to 2 is the same as from 3 to 4. This is generally violated when 𝒚 is ordinal-scaled.
• Observations are I.I.Dxxix.

xxix
independent and identically distributed.

Bayesian Statistics, Jose Storopoli 152


Bayesian Statistics
Linear Regression

Linear Regression Specification


To estimate the intercept 𝛼 and coefficients 𝜷 we use a Gaussian/normal likelihood
function. Mathematically speaking, Bayesian linear regression is:

𝒚 ∼ Normal(𝛼 + 𝑿𝜷, 𝜎)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜎 ∼ Exponential(𝜆𝜎 )

Bayesian Statistics, Jose Storopoli 153


Bayesian Statistics
Linear Regression

Linear Regression Specification


What we are missing is the prior probabilities for the model’s parameters:

• Prior Distribution for 𝛼 – Knowledge that we have about the model’s intercept.
• Prior Distribution for 𝜷 – Knowledge that we have about the model’s independent
variable coefficients.
• Prior Distribution for 𝜎 – Knowledge that we have about the model’s error.

Bayesian Statistics, Jose Storopoli 154


Bayesian Statistics
Linear Regression

Good Candidates for Prior Distributions


First, center (𝜇 = 0) and standardize (𝜎 = 1) the independent variables.

• 𝛼 – either a normal or student-𝑡 (𝜈 = 3), with mean as 𝜇𝒚 and standard deviation as 2.5 ⋅ 𝜎𝒚 (also you can use the median and median absolute deviation).
• 𝜷 – either a normal or student-𝑡 (𝜈 = 3), with mean 0 and standard deviation 2.5.
• 𝜎 – anything that is long-tailed (mass towards lower values) and restrained to positive
values only. Exponential is a good candidate.

Bayesian Statistics, Jose Storopoli 155


Bayesian Statistics
Linear Regression

Posterior Computation
Our aim is to find the posterior distribution of the model's parameters of interest (𝛼
and 𝜷) by computing the full posterior distribution of:

𝑃 (𝜽 | 𝒚) = 𝑃 (𝛼, 𝜷, 𝜎 | 𝒚)

Bayesian Statistics, Jose Storopoli 156


Logistic Regression
Bayesian Statistics
Logistic Regression

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 16: Generalized linear
models
• McElreath (2020)
‣ Chapter 10: Big Entropy and the Generalized Linear Model
‣ Chapter 11, Section 11.1: Binomial regression
• Gelman, Hill and Vehtari (2020):
‣ Chapter 13: Logistic regression
‣ Chapter 14: Working with logistic regression
‣ Chapter 15, Section 15.3: Logistic-binomial model
‣ Chapter 15, Section 15.4: Probit regression

Bayesian Statistics, Jose Storopoli 158


Bayesian Statistics
Logistic Regression

Welcome to the Magical World of the Linear Generalized Models


Leaving the realm of linear models, we begin our adventure into generalized linear models – GLM.

The first one is logistic regression (also called Bernoulli regression or binomial
regression).

Bayesian Statistics, Jose Storopoli 160


Bayesian Statistics
Logistic Regression

Binary Dataxxx

We use logistic regression when our dependent variable is binary. It only takes two
distinct values, usually coded as 0 and 1.

xxx
also known as dichotomous, dummy, indicator variable, etc.

Bayesian Statistics, Jose Storopoli 161


Bayesian Statistics
Logistic Regression

What is Logistic Regression


Logistic regression behaves exactly as a linear model: it makes a prediction by simply
computing a weighted sum of the independent variables 𝑿 using the estimated
coefficients 𝜷, along with a constant term 𝛼.

However, instead of outputting a continuous value 𝒚, it returns the logistic function of this value:
1
logistic(𝑥) =
1 + 𝑒−𝑥

Bayesian Statistics, Jose Storopoli 162


Bayesian Statistics
Logistic Regression

Logistic Function
Figure: the logistic function logistic(𝑥) for 𝑥 ∈ [−10, 10], an S-shaped curve increasing from 0 to 1.

Bayesian Statistics, Jose Storopoli 163


Bayesian Statistics
Logistic Regression

Probit Function
We can also opt to use the probit function (usually represented by the Greek letter Φ), which is the CDF of the standard normal distribution:

Φ(𝑥) = (1/√(2𝜋)) ∫_{−∞}^{𝑥} 𝑒^{−𝑡²/2} d𝑡

Bayesian Statistics, Jose Storopoli 164


Bayesian Statistics
Logistic Regression

Probit Function
Figure: the probit function Φ(𝑥) for 𝑥 ∈ [−10, 10], an S-shaped curve increasing from 0 to 1.

Bayesian Statistics, Jose Storopoli 165


Bayesian Statistics
Logistic Regression

Logistic Function versus Probit Function


Figure: the logistic function and the probit function Φ(𝑥) overlaid for 𝑥 ∈ [−10, 10]; both are S-shaped curves from 0 to 1.

Bayesian Statistics, Jose Storopoli 166


Bayesian Statistics
Logistic Regression

Comparison with Linear Regression


Linear regression follows the following mathematical expression:
linear = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + … + 𝛽𝑘 𝑥𝑘

• 𝛼 – intercept.
• 𝜷 = 𝛽1 , 𝛽2 , …, 𝛽𝑘 – independent variables’ 𝑥1 , 𝑥2 , …, 𝑥𝑘 coefficients.
• 𝑘 – number of independent variables.
If you implement a small mathematical transformation, you’ll have logistic regression:
• 𝑝̂ = logistic(linear) = 1/(1 + 𝑒^{−linear}) – probability of an observation taking value 1.
• 𝑦̂ = 0 if 𝑝̂ < 0.5, and 𝑦̂ = 1 if 𝑝̂ ≥ 0.5 – 𝒚's predicted binary value.

Bayesian Statistics, Jose Storopoli 167


Bayesian Statistics
Logistic Regression

Logistic Regression Specification


We can model logistic regression using two approaches:

• Bernoulli likelihood – binary dependent variable 𝒚 which results from a Bernoulli trial
with some probability 𝑝.
• binomial likelihood – discrete and positive dependent variable 𝒚 which results from 𝑘
successes in 𝑛 independent Bernoulli trials.

Bayesian Statistics, Jose Storopoli 168


Bayesian Statistics
Logistic Regression

Bernoulli Likelihood
𝒚 ∼ Bernoulli(𝑝)
𝑝 = logistic/probit(𝛼 + 𝑿𝜷)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )

where:
• 𝒚 - dependent binary variable.
• 𝑝 - probability of 𝒚 taking value of 1 – success in an independent Bernoulli trial.
• logistic/probit – logistic or probit function.
• 𝛼 – intercept (also called constant).
• 𝜷 – coefficient vector.
• 𝑿 – data matrix.
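A minimal Stan sketch of this specification (prior hyperparameters are placeholders, assuming standardized predictors):

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 2.5);
  beta ~ normal(0, 2.5);
  y ~ bernoulli_logit(alpha + X * beta);  // logistic link applied internally
}

For a probit link, the likelihood line could instead be y ~ bernoulli(Phi(alpha + X * beta)).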

Bayesian Statistics, Jose Storopoli 169


Bayesian Statistics
Logistic Regression

Binomial Likelihood
𝒚 ∼ Binomial(𝑛, 𝑝)
𝑝 = logistic/probit(𝛼 + 𝑿𝜷)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )

where:
• 𝒚 - dependent variable: number of successes in 𝑛 independent Bernoulli trials.
• 𝑛 - number of independent Bernoulli trials.
• 𝑝 - probability of 𝒚 taking value of 1 – success in an independent Bernoulli trial.
• logistic/probit – logistic or probit function.
• 𝛼 – intercept (also called constant).
• 𝜷 – coefficient vector.
• 𝑿 – data matrix.

Bayesian Statistics, Jose Storopoli 170


Bayesian Statistics
Logistic Regression

Posterior Computation
Our aim is to find the posterior distribution of the model's parameters of interest (𝛼
and 𝜷) by computing the full posterior distribution of:

𝑃 (𝜽 | 𝒚) = 𝑃 (𝛼, 𝜷 | 𝒚)

Bayesian Statistics, Jose Storopoli 171


Bayesian Statistics
Logistic Regression

How to Interpret Coefficients


If we revisit the logistic transformation's mathematical expression, we see that, in order to interpret the coefficients 𝜷, we need to perform a transformation.

Specifically, we need to undo the logistic transformation. We are looking for its inverse
function.

Bayesian Statistics, Jose Storopoli 172


Bayesian Statistics
Logistic Regression

Probability versus Odds


But before that, we need to discern between probability and oddsxxxi.
• Probability: a real number between 0 and 1 that represents the certainty that an event will occur,
either by long-term frequencies (frequentist approach) or degrees of belief (Bayesian approach).
• Odds: a positive real number (ℝ+ ) that also measures the certainty of an event happening. However
this measure is not expressed as a probability (between 0 and 1), but as the ratio between the number
of results that generate our desired event and the number of results that do not generate our desired
event:
𝑝
odds =
1−𝑝

where 𝑝 is the probability.

xxxi
mathematically speaking.

Bayesian Statistics, Jose Storopoli 173


Bayesian Statistics
Logistic Regression

Probability versus Odds


𝑝
odds =
1−𝑝

where 𝑝 is the probability.

• Odds with a value of 1 is a neutral odds, similar to a fair coin: 𝑝 = 1/2
• Odds below 1 decrease the probability of seeing a certain event.
• Odds over 1 increase the probability of seeing a certain event.

Bayesian Statistics, Jose Storopoli 174


Bayesian Statistics
Logistic Regression

Logodds
If you revisit the logistic function, you'll see that the intercept 𝛼 and coefficients 𝜷 are literally the log of the odds (logodds):
𝑝 = logistic(𝛼 + 𝑿𝜷)
logit(𝑝) = 𝛼 + 𝑿𝜷
ln(𝑝/(1 − 𝑝)) = 𝛼 + 𝑿𝜷
𝛼 + 𝑿𝜷 = log(odds)

Bayesian Statistics, Jose Storopoli 175


Bayesian Statistics
Logistic Regression

Logodds
Hence, the coefficients of a logistic regression are expressed in logodds, in which 0 is the neutral element, and any number above or below it increases or decreases, respectively, the chances of obtaining a “success” in 𝒚.
To have a more intuitive interpretation (similar to the betting houses), we need to convert the logodds into odds by undoing the log function, i.e. we exponentiate the 𝛼 and 𝜷 values:
odds(𝛼) = 𝑒𝛼
odds(𝜷) = 𝑒𝜷
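As a small worked example (with a made-up coefficient value): 𝛽 = 0.5 in logodds corresponds to odds of 𝑒^0.5 ≈ 1.65, i.e. each unit increase in that independent variable multiplies the odds of “success” by roughly 1.65.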

Bayesian Statistics, Jose Storopoli 176


Ordinal Regression
Bayesian Statistics
Ordinal Regression

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 16: Generalized linear
models, Section 16.2: Models for multivariate and multinomial responses
• McElreath (2020) - Chapter 12, Section 12.3: Ordered categorical outcomes
• Gelman, Hill and Vehtari (2020) - Chapter 15, Section 15.5: Ordered and unordered
categorical regression
• Bürkner and Vuorre (2019)
• Semenova (2019)

Bayesian Statistics, Jose Storopoli 178


Bayesian Statistics
Ordinal Regression

What is Ordinal Regression?

Ordinal regression is a regression model for discrete data and, more specifically, for when the values of the dependent variable have a “natural ordering”.

For example, opinion polls with ordered responses ranging from agree to disagree, or a patient's perception of a pain score.

Bayesian Statistics, Jose Storopoli 180


Bayesian Statistics
Ordinal Regression

Why not just use Linear Regression?


The main reason not to simply use linear regression with ordinal discrete outcomes is that the categories of the dependent variable may not be equidistant.

Equidistance is an assumption of linear regression (and of almost all models that use “metric” dependent variables): it requires, for example, that the distance between 2 and 3 be the same as the distance between 1 and 2.

This assumption is easily violated by ordinal data.

Bayesian Statistics, Jose Storopoli 181


Bayesian Statistics
Ordinal Regression

How to deal with an Ordinal Dependent Variable?

Surprise! Plot twist!

Another non-linear transformation.

Bayesian Statistics, Jose Storopoli 182


Bayesian Statistics
Ordinal Regression

Cumulative Distribution Function – CDF


In the case of ordinal regression, we first need to transform the dependent variable into a cumulative scale.

For this, we use the cumulative distribution function (CDF):


𝑃 (𝑌 ≤ 𝑦) = ∑_{𝑖=𝑦min}^{𝑦} 𝑃 (𝑌 = 𝑖)

The CDF is a monotonically increasing function that represents the probability of a random variable 𝑌 taking values less than or equal to a certain value 𝑦.

Bayesian Statistics, Jose Storopoli 183


Bayesian Statistics
Ordinal Regression

Log-cumulative-odds
Still, this is not enough. We need to apply the logit function onto the CDF:
logit(𝑥) = logistic⁻¹(𝑥) = ln(𝑥/(1 − 𝑥))
where ln is the natural log function.
The logit function is the inverse of the logistic function: it takes as input any value between 0 and
1 (e.g. a probability) and outputs an unconstrained real number which we call logoddsxxxii.
As the transformation is performed on the CDF, we call the result the CDF logodds or log-cumulative-odds.

xxxii
we already seen it in logistic regression.

Bayesian Statistics, Jose Storopoli 184


Bayesian Statistics
Ordinal Regression

𝐾 − 1 Intercepts
What do we do with this log-cumulative-odds?
It allows us to construct different intercepts for all possible values of the ordinal
dependent variable. We create an unique intercept for 𝑘 ∈ 𝐾.

Actually, it is 𝑘 ∈ 𝐾 − 1. Notice that the maximum value of the CDF of 𝑌 will always be 1, which translates to a log-cumulative-odds of ∞, since 𝑝 = 1:
ln(𝑝/(1 − 𝑝)) = ln(1/(1 − 1)) = ln(1/0) → ∞

Hence, we need only 𝐾 − 1 intercepts for all 𝐾 possible values that 𝑌 can take.

Bayesian Statistics, Jose Storopoli 185


Bayesian Statistics
Ordinal Regression

Violation of the Equidistant Assumption

Since each intercept implies a different CDF value for each 𝑘 ∈ 𝐾, we can safely violate
the equidistant assumption which is not valid in almost all ordinal variables.

Bayesian Statistics, Jose Storopoli 186


Bayesian Statistics
Ordinal Regression

Cut Points
Each intercept implies a log-cumulative-odds for each 𝑘 ∈ 𝐾. We also need to undo the cumulative nature of the 𝐾 − 1 intercepts. First, we convert the log-cumulative-odds back to a valid probability with the logistic function:

logit⁻¹(𝑥) = logistic(𝑥) = 1/(1 + 𝑒^{−𝑥})

Then, finally, we remove the cumulative nature of the CDF by subtracting, from each of the 𝑘 cut points, the 𝑘 − 1 cut point:
𝑃 (𝑌 = 𝑘) = 𝑃 (𝑌 ≤ 𝑘) − 𝑃 (𝑌 ≤ 𝑘 − 1)

Bayesian Statistics, Jose Storopoli 187


Bayesian Statistics
Ordinal Regression

Example - Probability Mass Function of an Ordinal Variable

Figure: probability mass function of an ordinal variable taking the values 1 to 6.

Bayesian Statistics, Jose Storopoli 188


Bayesian Statistics
Ordinal Regression

Example - CDF versus log-cumulative-odds

Figure: the CDF (left) and the corresponding log-cumulative-odds (right) for the ordinal variable's values 1 to 6.

Bayesian Statistics, Jose Storopoli 189


Bayesian Statistics
Ordinal Regression

Adding Coefficients 𝜷

With the equidistant assumption solved with 𝐾 − 1 intercepts, we can add coefficients to
represent the independent variable’s effects into our ordinal regression model.

Bayesian Statistics, Jose Storopoli 190


Bayesian Statistics
Ordinal Regression

More Log-cumulative-odds
We’ve transformed all intercepts into log-cumulative-odds so that we can add effects as
weighted sums of the independent variables to our basal rates (intercepts).
For every 𝑘 ∈ 𝐾 − 1, we calculate:
𝜑𝑘 = 𝛼𝑘 + 𝛽𝑖 𝑥𝑖

where 𝛼𝑘 is the log-cumulative-odds for the 𝑘 ∈ 𝐾 − 1 intercepts, and 𝛽𝑖 is the coefficient for the 𝑖th independent variable 𝑥𝑖 .
Lastly, 𝜑𝑘 represents the linear predictor for the 𝑘th intercept.

Bayesian Statistics, Jose Storopoli 191


Bayesian Statistics
Ordinal Regression

Matrix Notation
This can become more elegant and computationally efficient if we use matrix/vector
notation:
𝝋 = 𝜶 + 𝑿𝑐 ⋅ 𝜷

where 𝝋, 𝜶, and 𝜷xxxiii are vectors and 𝑿 is the data matrix, in which every row is an observation and every column an independent variable.

xxxiii
note that both the coefficients and intercepts will have to be interpret as odds, like we did in logistic regression.

Bayesian Statistics, Jose Storopoli 192


Bayesian Statistics
Ordinal Regression

Ordinal Regression Specification


𝒚 ∼ Categorical(𝒑)
𝒑 = logistic(𝝋)
𝝋 = 𝜶 + 𝒄 + 𝑿𝑐 ⋅ 𝜷
𝑐1 = logit(CDF(𝑦1 ))
𝑐𝑘 = logit(CDF(𝑦𝑘 ) − CDF(𝑦𝑘−1 )) for 2 ≤ 𝑘 ≤ 𝐾 − 1
𝑐𝐾 = logit(1 − CDF(𝑦𝐾−1 ))
𝜶 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
• 𝒚 – ordinal discrete dependent variable.
• 𝒑 – probability vector of size 𝐾.
• 𝐾: number of possible values that 𝒚 can take, i.e. number of ordered discrete values.
• 𝝋: log-cumulative-odds, i.e. the cut points considering the intercepts and the weighted sum of the independent variables.
• 𝑐𝑘 : cutpoint in log-cumulative-odds for every 𝑘 ∈ 𝐾 − 1.
• 𝛼𝑘 : intercept in log-cumulative-odds for every 𝑘 ∈ 𝐾 − 1.
• 𝑿: data matrix of the independent variables.
• 𝜷: coefficient vector with size the same as the number of columns of 𝑿.
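A minimal Stan sketch of a cumulative ordinal regression (a simplified version of the specification above: Stan's ordered_logistic absorbs the 𝐾 − 1 intercepts into an ordered vector of cutpoints, and the prior values are placeholders):

data {
  int<lower=0> N;
  int<lower=2> K;                      // number of ordered categories
  int<lower=0> D;                      // number of independent variables
  matrix[N, D] X;
  array[N] int<lower=1, upper=K> y;
}
parameters {
  vector[D] beta;
  ordered[K - 1] cutpoints;            // the K - 1 intercepts in log-cumulative-odds
}
model {
  beta ~ normal(0, 2.5);
  // a weak prior on the cutpoints could also be added here
  y ~ ordered_logistic(X * beta, cutpoints);
}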

Bayesian Statistics, Jose Storopoli 193


Poisson Regression
Bayesian Statistics
Poisson Regression

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 16: Generalized linear
models
• McElreath (2020):
‣ Chapter 10: Big Entropy and the Generalized Linear Model
‣ Chapter 11, Section 11.2: Poisson regression
• Gelman, Hill and Vehtari (2020) - Chapter 15, Section 15.2: Poisson and negative
binomial regression

Bayesian Statistics, Jose Storopoli 195


Bayesian Statistics
Poisson Regression

Count Data

Poisson regression is used when our dependent variable can only take non-negative integer values, usually in the context of count data.

Bayesian Statistics, Jose Storopoli 197


Bayesian Statistics
Poisson Regression

What is Poisson Regression?


Poisson regression behaves exactly like a linear model: it makes a prediction by simply computing a weighted sum of the independent variables 𝑿 with the estimated coefficients 𝜷, plus an intercept 𝛼.
But, different from linear regression, this weighted sum is the natural log of 𝒚:
log(𝒚) = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + … + 𝛽𝑘 𝑥𝑘

which is the same as:

𝒚 = 𝑒(𝛼+𝛽1 𝑥1 +𝛽2 𝑥2 +…+𝛽𝑘 𝑥𝑘 )

Bayesian Statistics, Jose Storopoli 198


Bayesian Statistics
Poisson Regression

Exponential Function
Figure: the exponential function 𝑒^𝑥 for 𝑥 ∈ [−1, 5], growing rapidly from near 0 to above 140.

Bayesian Statistics, Jose Storopoli 199


Bayesian Statistics
Poisson Regression

Comparison with Linear Regression


Linear regression has the following mathematical expression:
linear = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + … + 𝛽𝑘 𝑥𝑘

where:
• 𝛼 – intercept.
• 𝜷 = 𝛽1 , 𝛽2 , …, 𝛽𝑘 – independent variables’ 𝑥1 , 𝑥2 , …, 𝑥𝑘 coefficients.
• 𝑘 – number of independent variables.
If you implement a small mathematical transformation, you'll have Poisson regression:
• 𝑦 = 𝑒^{linear} = 𝑒^{𝛼+𝛽1 𝑥1 +𝛽2 𝑥2 +…+𝛽𝑘 𝑥𝑘}, i.e. log(𝑦) = linear

Bayesian Statistics, Jose Storopoli 200


Bayesian Statistics
Poisson Regression

Poisson Regression Specification


We can use Poisson regression if the dependent variable 𝒚 is count data, i.e., 𝒚 only takes non-negative integer values.
Poisson likelihood function uses an intercept 𝛼 and coefficients 𝜷, however these are
“exponentiated” (𝑒𝑥 ):

𝒚 ∼ Poisson(𝑒(𝛼+𝑿𝜷) )
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
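A minimal Stan sketch of this specification (prior hyperparameters are placeholders, assuming standardized predictors):

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  array[N] int<lower=0> y;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 2.5);
  beta ~ normal(0, 2.5);
  y ~ poisson_log(alpha + X * beta);  // exponentiates the linear predictor internally
}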

Bayesian Statistics, Jose Storopoli 201


Bayesian Statistics
Poisson Regression

Interpreting the Coefficients


When we see the Poisson regression specification, we realize that the coefficient
interpretation requires a transformation. What we need to do is undo the logarithm
transformation:

log−1 (𝑥) = 𝑒𝑥

So, we need to “exponentiate” the values of 𝛼 and 𝜷:

𝒚 = 𝑒(𝛼+𝑿𝜷)
= 𝑒𝛼 ⋅ 𝑒(𝑋(1) ⋅𝛽(1) ) ⋅ 𝑒(𝑋(2) ⋅𝛽(2) ) ⋅ … ⋅ 𝑒(𝑋(𝑘) ⋅𝛽(𝑘) )
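As a small worked example (with a made-up coefficient value): 𝛽 = 0.1 corresponds to a multiplicative factor of 𝑒^0.1 ≈ 1.105, i.e. each unit increase in that independent variable multiplies the expected count by roughly 1.105 (a ≈ 10.5% increase).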

Bayesian Statistics, Jose Storopoli 202


Bayesian Statistics
Poisson Regression

Interpreting the Coefficients


Finally, notice that, when transformed, our dependent variable is no longer a “weighted sum of an intercept and independent variables”:

𝒚 = 𝑒(𝛼+𝑿𝜷)
= 𝑒𝛼 ⋅ 𝑒(𝑋(1) ⋅𝛽(1) ) ⋅ 𝑒(𝑋(2) ⋅𝛽(2) ) ⋅ … ⋅ 𝑒(𝑋(𝑘) ⋅𝛽(𝑘) )

It becomes a “weighted product”.

Bayesian Statistics, Jose Storopoli 203


Robust Regression
Bayesian Statistics
Robust Regression

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 17: Models for robust
inference
• McElreath (2020) - Chapter 12: Monsters and Mixtures
• Gelman, Hill and Vehtari (2020):
‣ Chapter 15, Section 15.6: Robust regression using the t model
‣ Chapter 15, Section 15.8: Going beyond generalized linear models

Bayesian Statistics, Jose Storopoli 205


Bayesian Statistics
Robust Regression

Robust Models
Almost always, real-world data are strange.

For the sake of convenience, we use simple models. But always ask yourself: in how many ways might the posterior inference depend on the following:

• extreme observations (outliers)?


• unrealistic model assumptions?

Bayesian Statistics, Jose Storopoli 207


Bayesian Statistics
Robust Regression

Outliers

Models based on the normal distribution are notoriously “non-robust” against outliers, in the sense that a single observation can greatly affect the inference of all the model's parameters, even those that have only a weak relationship with it.

Bayesian Statistics, Jose Storopoli 208


Bayesian Statistics
Robust Regression

Overdispersion
Overdispersion and underdispersionxxxiv refer to data that have more or less variation than expected under a probability model (Gelman, Hill and Vehtari, 2020).

For each one of the models we covered, there is a natural extension in which a single
parameter is added to allow for overdispersion (Andrew Gelman, John B. Carlin, Stern, et
al., 2013).

xxxiv
rarer to find in the real world.

Bayesian Statistics, Jose Storopoli 209


Bayesian Statistics
Robust Regression

Overdispersion Example
Suppose you are analyzing data from car accidents. The model we generally use for this type of phenomenon is Poisson regression.

Poisson distribution has the same parameter for both the mean and variance: the rate
parameter 𝜆.

Hence, if you find a higher variability than expected under the Poisson likelihood
function allows, then probably you won’t be able to model properly the desired
phenomena.

Bayesian Statistics, Jose Storopoli 210


Bayesian Statistics
Robust Regression

Student’s 𝑡 instead of Normal


Student’s 𝑡 distribution has widerxxxv tails than the Normal distribution.

This makes it a good candidate for accommodating outliers without destabilizing the parameter inference.

From the Bayesian viewpoint, there is nothing special or magical in the Gaussian/
Normal likelihood.
It is just another distribution specified in a statistical model. We can make our model
robust by using the Student’s 𝑡 distribution as a likelihood function.

xxxv
or “fatter”.

Bayesian Statistics, Jose Storopoli 211


Bayesian Statistics
Robust Regression

Student’s 𝑡 instead of Normal


Figure: PDFs over the range −4 to 4, comparing the heavier-tailed Student's 𝑡 distribution with the Normal distribution.

Bayesian Statistics, Jose Storopoli 212


Bayesian Statistics
Robust Regression

Student’s 𝑡 instead of Normal


By using a Student’s 𝑡 distribution instead of the Normal distribution as likelihood functions, the model’s error
𝜎 does not follow a Normal distribution, but a Student’s 𝑡 distribution:
𝒚 ∼ Student(𝜈, 𝛼 + 𝑿𝜷, 𝜎)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜈 ∼ Log-Normal(2, 1)
𝜎 ∼ Exponential(𝜆𝜎 )

Note that we are including an extra parameter 𝜈, which represents the Student’s 𝑡 distribution degrees of
freedom, to be estimated by the model (Andrew Gelman, John B. Carlin, Stern, et al., 2013).
This controls how wide or narrow the “tails” of the distribution will be. A heavy-tailed, positive-only prior is
advised.
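A minimal Stan sketch of this robust regression (prior hyperparameters are placeholders):

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
  real<lower=1> nu;                     // degrees of freedom
}
model {
  alpha ~ normal(0, 2.5);
  beta ~ normal(0, 2.5);
  sigma ~ exponential(1);
  nu ~ lognormal(2, 1);                 // heavy-tailed, positive-only prior
  y ~ student_t(nu, alpha + X * beta, sigma);
}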

Bayesian Statistics, Jose Storopoli 213


Bayesian Statistics
Robust Regression

Beta-Binomial instead of the Binomial


The binomial distribution has a practical limitation: we only have one free parameter to estimatexxxvi (𝑝). This implies that the variance is determined by the mean. Hence, the binomial distribution cannot tolerate overdispersion.

A robust alternative is the beta-binomial distribution, which, as the name suggests, is a beta mixture of binomial distributions. Most importantly, it allows the variance to be independent of the mean, making it robust against overdispersion.

xxxvi
since 𝑛 already comes from data.

Bayesian Statistics, Jose Storopoli 214


Bayesian Statistics
Robust Regression

Beta-Binomial instead of Binomial


The beta-binomial distribution is a binomial distribution, where the probability of
success 𝑝 is parameterized as a Beta(𝛼, 𝛽).

Generally, we use 𝛼 as the binomial’s probability of the success 𝑝, and 𝛽 xxxvii is the
additional parameter to control and allow for overdispersion.

Values of 𝛽 ≥ 1 make the beta-binomial behave the same as a binomial.

xxxvii
sometimes specified as 𝜑

Bayesian Statistics, Jose Storopoli 215


Bayesian Statistics
Robust Regression

Beta-Binomial instead of Binomial


𝒚 ∼ Beta-Binomial(𝑛, 𝑝, 𝜑)
𝑝 = Logistic/Probit(𝛼 + 𝑿𝜷)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜑 ∼ Exponential(1)

It is also proper to include the overdispersion parameter 𝜑 as an additional parameter to be estimated by the model (Andrew Gelman, John B. Carlin, Stern, et al., 2013; McElreath, 2020). A heavy-tailed, positive-only prior is advised.

Bayesian Statistics, Jose Storopoli 216


Bayesian Statistics
Robust Regression

Student’s 𝑡 instead Binomial


Also known as Robitxxxviii (Andrew Gelman, John B. Carlin, Stern, et al., 2013; Gelman, Hill and Vehtari, 2020). The idea is
to make the logistic regression robust by using a latent variable 𝑧 as the linear predictor. 𝑧’s errors, 𝜀, are distributed
as a Student’s 𝑡 distribution:

𝑦𝑖 = 0 if 𝑧𝑖 < 0, and 𝑦𝑖 = 1 if 𝑧𝑖 > 0
𝑧𝑖 = 𝑋𝑖 𝜷 + 𝜀𝑖
𝜀𝑖 ∼ Student(𝜈, 0, √((𝜈 − 2)/𝜈))
𝜈 ∼ Gamma(2, 0.1) ∈ [2, ∞)

Here we are using a gamma distribution, truncated to 𝜈 ≥ 2, as the prior for the Student's 𝑡 degrees of freedom parameter. Another option would be to fix 𝜈 = 4.

Bayesian Statistics, Jose Storopoli 217


Bayesian Statistics
Robust Regression

xxxviii
there is a great discussion between Gelman, Vehtari and Kurz at Stan’s Discourse .

Bayesian Statistics, Jose Storopoli 217


Bayesian Statistics
Robust Regression

Negative Binomial instead of Poisson


This is the overdispersion example. The Poisson distribution uses a single parameter for
both its mean and variance.

Hence, if you find overdispersion, you'll probably need a robust alternative to Poisson. This is where the negative binomial comes in, with an extra parameter 𝜑 that makes it robust to overdispersion.

𝜑 controls the probability of success 𝑝, and we generally use a gamma distribution as its
prior. 𝜑 is also known as a “reciprocal dispersion” parameter.

Bayesian Statistics, Jose Storopoli 218


Bayesian Statistics
Robust Regression

Negative Binomial instead of Poisson


𝒚 ∼ Negative Binomial(𝑒(𝛼+𝑿𝜷) , 𝜑)
𝜑 ∼ Gamma(0.01, 0.01)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )

Here we also give a heavy-tailed, positive-only prior to 𝜑. Something like the Gamma(0.01, 0.01) works.
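A minimal Stan sketch, using Stan's mean/overdispersion parameterization neg_binomial_2_log (which takes the linear predictor on the log scale; prior hyperparameters are placeholders):

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  array[N] int<lower=0> y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> phi;                   // "reciprocal dispersion"
}
model {
  alpha ~ normal(0, 2.5);
  beta ~ normal(0, 2.5);
  phi ~ gamma(0.01, 0.01);
  y ~ neg_binomial_2_log(alpha + X * beta, phi);
}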

Bayesian Statistics, Jose Storopoli 219


Bayesian Statistics
Robust Regression

Negative Binomial Mixture instead of Poisson


Even using a negative binomial likelihood, if you encounter acute overdispersion, especially when there are a lot of zeros in your data (zero inflation), your model can still fit the data badly.

Another suggestion is to use a mixture of negative binomial (McElreath, 2020).

Bayesian Statistics, Jose Storopoli 220


Bayesian Statistics
Robust Regression

Negative Binomial Mixture instead of Poisson


Here, 𝑆𝑖 is a dummy variable, taking value 1 if the 𝑖th observation has a value ≠ 0. 𝑆𝑖 can
be modeled using logistic regression:

𝒚 = 0 if 𝑆𝑖 = 0
𝒚 ∼ Negative Binomial(𝑒(𝛼+𝑿𝜷) , 𝜑) if 𝑆𝑖 = 1
𝑃 (𝑆𝑖 = 1) = Logistic/Probit(𝑿𝜸)
𝛾 ∼ Beta(1, 1)

𝜸 is a new coefficient vector, to which we give a uniform Beta(1, 1) prior.

Bayesian Statistics, Jose Storopoli 221


Bayesian Statistics
Robust Regression

Why Use Non-Robust Models?


The central limit theorem tells us that the normal distribution is an appropriate model for data that arise as a sum of independent components.
Even when they are not naturally implied by the structure of the phenomenon, simpler non-robust models are computationally efficient.
Finally, there's Occam's razor, also known as the principle of parsimony, which states a preference for simplicity in the scientific method.
Of course, you must always guide the model choice in a principled manner, taking into account the underlying data generating process of the phenomenon. And make sure to perform posterior predictive checks.

Bayesian Statistics, Jose Storopoli 222


Sparse Regression
Bayesian Statistics
Sparse Regression

Recommended References
• Gelman, Hill and Vehtari (2020) - Chapter 12, Section 12.8: Models for regression
coefficients
• Horseshoe Prior: Carvalho, Polson and Scott (2009)
• Horseshoe+ Prior: Bhadra et al. (2015)
• Regularized Horseshoe Prior: Piironen and Vehtari (2017)
• R2-D2 Prior: Zhang et al. (2022)
• Betancourt’s Case study on Sparsity: Betancourt (2021)

Bayesian Statistics, Jose Storopoli 224


Bayesian Statistics
Sparse Regression

What is Sparsity?

Sparsity is a concept frequently encountered in statistics, signal processing, and machine learning, which refers to situations where the vast majority of elements in a dataset or a vector are zero or close to zero.

Bayesian Statistics, Jose Storopoli 226


Bayesian Statistics
Sparse Regression

How to Handle Sparsity?


Almost all techniques deal with some sort of variable selection, instead of altering data.

This makes sense from a Bayesian perspective, as data is information, and we don’t
want to throw information away.

Bayesian Statistics, Jose Storopoli 227


Bayesian Statistics
Sparse Regression

Frequentist Approach
The frequentist approach deals with sparse regression by staying in the “optimization”
context but adding Lagrangian constraintsxxxix:
min_𝜷 { ∑_{𝑖=1}^{𝑁} (𝑦𝑖 − 𝛼 − 𝒙𝑖ᵀ𝜷)² }

subject to ‖ 𝜷 ‖𝑝 ≤ 𝑡.

Here ‖ ⋅ ‖𝑝 is the 𝑝-norm.

xxxix
this is called LASSO (least absolute shrinkage and selection operator) from Tibshirani (1996); Zou and Hastie (2005).

Bayesian Statistics, Jose Storopoli 228


Bayesian Statistics
Sparse Regression

Variable Selection Techniques

• discrete mixtures: spike-and-slab prior


• shrinkage priors: Laplace prior and horseshoe prior (Carvalho, Polson and Scott, 2009)

Bayesian Statistics, Jose Storopoli 229


Bayesian Statistics
Sparse Regression

Discrete Mixtures – Spike-and-Slab Prior


Mixture of two distributions—one that is concentrated at zero (the “spike”) and one with a much wider spread
(the “slab”). This prior indicates that we believe most coefficients in our model are likely to be zero (or close to
zero), but we allow the possibility that some are not.
Here is the Gaussian case:

𝛽𝑖 | 𝜆𝑖 , 𝑐 ∼ Normal(0, √𝜆2𝑖 𝑐2 )

𝜆𝑖 ∼ Bernoulli(𝑝)

where:
• 𝑐: slab width
• 𝑝: prior inclusion probability; encodes the prior information about the sparsity of the coefficient vector 𝜷
• 𝜆𝑖 ∈ {0, 1}: whether the coefficient 𝛽𝑖 is close to zero (comes from the “spike”, 𝜆𝑖 = 0) or nonzero (comes
from the “slab”, 𝜆𝑖 = 1)

Bayesian Statistics, Jose Storopoli 230


Bayesian Statistics
Sparse Regression

Discrete Mixtures – Spike-and-Slab Prior


Figure: spike-and-slab densities with 𝑐 = 1: the “spike” component (𝜆 = 0, left) and the “slab” component (𝜆 = 1, right).

Bayesian Statistics, Jose Storopoli 231


Bayesian Statistics
Sparse Regression

Shrinkage Priors – Laplace Prior


The Laplace distribution is a continuous probability distribution named after Pierre-
Simon Laplace. It is also known as the double exponential distribution.
It has parameters:
• 𝜇: location parameter
• 𝑏: scale parameter
The PDF is:
Laplace(𝜇, 𝑏) = (1/(2𝑏)) 𝑒^{−|𝑥−𝜇|/𝑏}
It is a symmetrical exponential decay around 𝜇 with scale governed by 𝑏.

Bayesian Statistics, Jose Storopoli 232


Bayesian Statistics
Sparse Regression

Shrinkage Priors – Laplace Prior

Figure: the Laplace prior PDF with 𝜇 = 0 and 𝑏 = 1 over the range −4 to 4.

Bayesian Statistics, Jose Storopoli 233


Bayesian Statistics
Sparse Regression

Shrinkage Priors – Horseshoe Prior


The horseshoe prior (Carvalho, Polson and Scott, 2009) assumes that each coefficient 𝛽𝑖 is conditionally
independent with density 𝑃HS (𝛽𝑖 | 𝜏 ), where 𝑃HS can be represented as a scale mixture of Gaussians:

𝛽𝑖 | 𝜆𝑖 , 𝜏 ∼ Normal(0, √𝜆2𝑖 𝜏 2 )

𝜆𝑖 ∼ Cauchy+ (0, 1)

where:
• 𝜏 : global shrinkage parameter
• 𝜆𝑖 : local shrinkage parameter
• Cauchy+ is the half-Cauchy distribution for the standard deviation 𝜆𝑖
Note that it is similar to the spike-and-slab, but the discrete mixture becomes a “continuous” mixture
with the Cauchy+ .
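A minimal (non-centered) Stan sketch of a linear regression with a horseshoe prior on the coefficients (the intercept and error priors are placeholders):

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  real alpha;
  vector[K] z;                          // standardized coefficients
  vector<lower=0>[K] lambda;            // local shrinkage
  real<lower=0> tau;                    // global shrinkage
  real<lower=0> sigma;
}
transformed parameters {
  vector[K] beta = z .* lambda * tau;   // beta_i ~ Normal(0, lambda_i * tau)
}
model {
  z ~ std_normal();
  lambda ~ cauchy(0, 1);                // half-Cauchy via the lower bound
  tau ~ cauchy(0, 1);
  alpha ~ normal(0, 2.5);
  sigma ~ exponential(1);
  y ~ normal(alpha + X * beta, sigma);
}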

Bayesian Statistics, Jose Storopoli 234


Bayesian Statistics
Sparse Regression

Shrinkage Priors – Horseshoe Prior

Figure: densities implied by the horseshoe prior for 𝛽𝑖 with 𝜏 = 1 and local shrinkage 𝜆 = 1 (left) versus 𝜆 = 1/2 (right).

Bayesian Statistics, Jose Storopoli 235


Bayesian Statistics
Sparse Regression

Discrete Mixtures versus Shrinkage Priors


Discrete mixtures offer the correct representation of sparse problems (Carvalho, Polson
and Scott, 2009) by placing positive prior probability on 𝛽𝑖 = 0 (regression coefficient),
but pose several difficulties: mostly computational due to the non-continuous nature.

Shrinkage priors, despite not having the best representation of sparsity, can be very
attractive computationally: again due to the continuous property.

Bayesian Statistics, Jose Storopoli 236


Bayesian Statistics
Sparse Regression

Horseshoe versus Laplace


The advantages of the Horseshoe prior over the Laplace prior are primarily:
• shrinkage: The Horseshoe prior has infinitely heavy tails and an infinite spike at zero. Parameters
estimated under the Horseshoe prior can be shrunken towards zero more aggressively than under the
Laplace prior, promoting sparsity without sacrificing the ability to detect true non-zero signals.
• signal detection: Due to its heavy tails, the Horseshoe prior does not overly penalize large values,
which allows significant effects to stand out even in the presence of many small or zero effects.
• uncertainty quantification: With its heavy-tailed nature, the Horseshoe prior better captures
uncertainty in parameter estimates, especially when the truth is close to zero.
• regularization: In high-dimensional settings where the number of predictors can exceed the number
of observations, the Horseshoe prior acts as a strong regularizer, automatically adapting to the
underlying sparsity level without the need for external tuning parameters.

Bayesian Statistics, Jose Storopoli 237


Bayesian Statistics
Sparse Regression

Effective Shrinkage Comparison


It makes more sense to compare the shrinkage effects of the approaches proposed so far.
Assume for now that 𝜎2 = 𝜏 2 = 1, and define 𝜅𝑖 = 1/(1 + 𝜆2𝑖 ).

Then 𝜅𝑖 is a random shrinkage coefficient, and can be interpreted as the amount of weight that the posterior mean for 𝛽𝑖 places on 0 once the data 𝒚 have been observed:

𝐸(𝛽𝑖 | 𝑦𝑖 , 𝜆𝑖 ) = (𝜆2𝑖 /(1 + 𝜆2𝑖 )) 𝑦𝑖 + (1/(1 + 𝜆2𝑖 )) ⋅ 0 = (1 − 𝜅𝑖 )𝑦𝑖

Bayesian Statistics, Jose Storopoli 238


Bayesian Statistics
Sparse Regression

Effective Shrinkage Comparisonxl


Figure: the density of the shrinkage coefficient 𝜅 ∈ [0, 1] under the Laplace prior (left) versus the Horseshoe prior (right); the Horseshoe places most of its mass near 0 and 1, giving the “horseshoe” shape.

xl
spike-and-slab with 𝑝 = 1/2 would be very similar to Horseshoe but with discontinuities.

Bayesian Statistics, Jose Storopoli 239


Bayesian Statistics
Sparse Regression

Shrinkage Priors – Horseshoe+


Natural extension from the Horseshoe that has improved performance with highly sparse data (Bhadra
et al., 2015).
Just introduce a new half-Cauchy mixing variable 𝜂𝑖 in the Horseshoe:
𝛽𝑖 | 𝜆𝑖 , 𝜂𝑖 , 𝜏 ∼ Normal(0, 𝜆𝑖 )
𝜆𝑖 | 𝜂𝑖 , 𝜏 ∼ Cauchy+ (0, 𝜏 𝜂𝑖 )
𝜂𝑖 ∼ Cauchy+ (0, 1)

where:
• 𝜏 : global shrinkage parameter
• 𝜆𝑖 : local shrinkage parameter
• 𝜂𝑖 : additional local shrinkage parameter
• Cauchy+ is the half-Cauchy distribution for the standard deviation 𝜆𝑖 and 𝜂𝑖

Bayesian Statistics, Jose Storopoli 240


Bayesian Statistics
Sparse Regression

Shrinkage Priors – Regularized Horseshoe


The Horseshoe and Horseshoe+ guarantee that strong signals will not be overshrunk. However, this property can also be harmful, especially when the parameters are weakly identified.

The solution, Regularized Horseshoe (Piironen and Vehtari, 2017) (also known as the
“Finnish Horseshoe”), is able to control the amount of shrinkage for the largest
coefficient.

Bayesian Statistics, Jose Storopoli 241


Bayesian Statistics
Sparse Regression

Shrinkage Priors – Regularized Horseshoe

𝛽𝑖 | 𝜆𝑖 , 𝜏 , 𝑐 ∼ Normal(0, √(𝜏 2 𝜆̃𝑖 2 ))
𝜆̃𝑖 2 = 𝑐2 𝜆2𝑖 / (𝑐2 + 𝜏 2 𝜆2𝑖 )
𝜆𝑖 ∼ Cauchy+ (0, 1)

where:
• 𝜏 : global shrinkage parameter
• 𝜆𝑖 : local shrinkage parameter
• 𝑐 > 0: regularization constant
• Cauchy+ is the half-Cauchy distribution for the standard deviation 𝜆𝑖
Note that when 𝜏 2 𝜆2𝑖 ≪ 𝑐2 (coefficient 𝛽𝑖 ≈ 0), then 𝜆̃𝑖 2 → 𝜆2𝑖 ; and when 𝜏 2 𝜆2𝑖 ≫ 𝑐2 (coefficient 𝛽𝑖 far from 0), then 𝜆̃𝑖 2 → 𝑐2 /𝜏 2 and the 𝛽𝑖 prior approaches Normal(0, 𝑐).

Bayesian Statistics, Jose Storopoli 242


Bayesian Statistics
Sparse Regression

Shrinkage Priors – R2-D2


Still, we can do better. The R2-D2xli prior (Zhang et al., 2022) has heavier tails and higher concentration around zero than the previous approaches.

The idea is, instead of specifying a prior on 𝜷, to construct a prior on the coefficient of determination 𝑅2 (the square of the correlation coefficient between the dependent variable and its modeled expectation), and then use that prior to “distribute” shrinkage throughout 𝜷.

xli
𝑅2 -induced Dirichlet Decomposition

Bayesian Statistics, Jose Storopoli 243


Bayesian Statistics
Sparse Regression

Shrinkage Priors – R2-D2


𝑅2 ∼ Beta(𝜇𝑅2 𝜎𝑅2 , (1 − 𝜇𝑅2 )𝜎𝑅2 )
𝝋 ∼ Dirichlet(𝐽 , 1)

𝜏 2 = 𝑅2 /(1 − 𝑅2 )
𝜷 = 𝑍 ⋅ √𝝋𝜏 2

where:
• 𝜏 : global shrinkage parameter
• 𝝋: proportion of total variance allocated to each covariate, can be interpreted as the local shrinkage
parameter
• 𝜇𝑅2 is the mean of the 𝑅2 parameter, generally 1/2
• 𝜎𝑅2 is the precision of the 𝑅2 parameter, generally 2
• 𝑍 is the standard Gaussian, i.e. Normal(0, 1)

Bayesian Statistics, Jose Storopoli 244


Hierarchical Models
Bayesian Statistics
Hierarchical Models

Recommended References
• Gelman, Hill and Vehtari (2020):
‣ Chapter 5: Hierarchical models
‣ Chapter 15: Hierarchical linear models
• (McElreath, 2020):
‣ Chapter 13: Models With Memory
‣ Chapter 14: Adventures in Covariance
• Gelman and Hill (2007)
• Michael Betancourt’s case study on Hierarchical modeling
• Kruschke and Vanpaemel (2015)

Bayesian Statistics, Jose Storopoli 246


Bayesian Statistics
Hierarchical Models

I have many names…


Hierarchical models are also known by several namesxlii
• Hierarchical Models
• Random Effects Models
• Mixed Effects Models
• Cross-Sectional Models
• Nested Data Models

xlii
for the whole full list check here.

Bayesian Statistics, Jose Storopoli 248


Bayesian Statistics
Hierarchical Models

What are hierarchical models?


A statistical model specified in multiple levels that estimates parameters from the posterior distribution using a Bayesian approach.
The sub-models inside the model combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty.

Hierarchical models are mathematical descriptions that involve several parameters, where some parameters' estimates depend on other parameters' values.

Bayesian Statistics, Jose Storopoli 249


Bayesian Statistics
Hierarchical Models

What are hierarchical models?


Hyperparameter 𝜑 that parameterizes 𝜃1 , 𝜃2 , …, 𝜃𝐾 , that are used to infer the posterior density of some random
variable 𝒚 = 𝑦1 , 𝑦2 , …, 𝑦𝐾

𝑦1 … 𝑦𝑘 … 𝑦𝐾

𝜃1 … 𝜃𝑘 … 𝜃𝐾

Bayesian Statistics, Jose Storopoli 250


Bayesian Statistics
Hierarchical Models

What are hierarchical models?


Even though the observations directly inform only a single set of parameters, a hierarchical model couples the individual parameters and provides a “backdoor” for information flow.

𝑦1 … 𝑦𝑘 … 𝑦𝐾 𝑦1 … 𝑦𝑘 … 𝑦𝐾

𝜃1 … 𝜃𝑘 … 𝜃𝐾 𝜃1 … 𝜃𝑘 … 𝜃𝐾

𝜑 𝜑

For example, the observations from the 𝑘th group, 𝑦𝑘 , directly inform the parameters that quantify the 𝑘th group's behavior, 𝜃𝑘 . These parameters, however, directly inform the population-level parameters, 𝜑, which, in turn, inform the other group-level parameters. In the same manner, observations that directly inform other groups' parameters also provide indirect information to the population-level parameters, which then inform other group-level parameters, and so on…

Bayesian Statistics, Jose Storopoli 251


Bayesian Statistics
Hierarchical Models

When to Use Hierarchical Models?

Hierarchical models are used when information is available in several levels of units of
observation. The hierarchical structure of analysis and organization assists in the
understanding of multiparameter problems, while also performing a crucial role in the
development of computational strategies.

Bayesian Statistics, Jose Storopoli 252


Bayesian Statistics
Hierarchical Models

When to Use Hierarchical Models?


Hierarchical models are particularly appropriate for research projects where participant data can be organized
in more than one levelxliii.
The units of analysis are generally individuals that are nested inside contextual/aggregate units (groups).
An example is when we measure individual performance and we have additional information about distinct
group membership such as:
• sex
• age group
• income level
• education level
• state/province of residence

xliii
also known as nested data.

Bayesian Statistics, Jose Storopoli 253


Bayesian Statistics
Hierarchical Models

When to Use Hierarchical Models?


Another good use case is big data (Andrew Gelman, John B. Carlin, Stern, et al., 2013).
• simple nonhierarchical models are usually inappropriate for hierarchical data: with few
parameters, they generally cannot fit large datasets accurately.
• whereas with many parameters, they tend to overfit.
• hierarchical models can have enough parameters to fit the data well, while using a
population distribution to structure some dependence into the parameters, thereby
avoiding problems of overfitting.

Bayesian Statistics, Jose Storopoli 254


Bayesian Statistics
Hierarchical Models

When to Use Hierarchical Models?

Most important is not to violate the exchangeability assumption (Finetti, 1974).

This assumption stems from the principle that groups are exchangeable.

Bayesian Statistics, Jose Storopoli 255


Bayesian Statistics
Hierarchical Models

Hyperprior
In hierarchical models, we have a hyperprior, which is a prior’s prior:

𝒚 ∼ Normal(10, 𝜽)
𝜽 ∼ Normal(0, 𝜑)
𝜑 ∼ Exponential(1)

Here 𝒚 is a variable of interest that belongs to distinct groups. 𝜽, a prior for 𝒚, is a vector
of group-level parameters with their own prior (which becomes a hyperprior) 𝜑.

Bayesian Statistics, Jose Storopoli 256


Bayesian Statistics
Hierarchical Models

Frequentist versus Bayesian Approaches


There are also hierarchical models in frequentist statistics. They are mainly available in the lme4
package (Bates et al., 2015), and also in MixedModels.jl (Bates et al., 2022).
• optimization of the likelihood function versus posterior approximation via MCMC: optimization almost always leads to convergence failures for models that are not extremely simple.
• frequentist hierarchical models do not compute 𝑝-values for the group-level effectsxliv. This is due to the underlying assumptions of the approximations that frequentist statistics has to make in order to calculate the group-level effects' 𝑝-values. The main one is that the groups must be balanced, in other words, homogeneous in size. Hence, any unbalance in group composition results in pathological 𝑝-values that should not be trusted.

xliv
see https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html [Douglas Bates, creator of the lme4 package explanation].

Bayesian Statistics, Jose Storopoli 257


Bayesian Statistics
Hierarchical Models

Frequentist versus Bayesian Approaches

To sum up, the frequentist approach to hierarchical models is not robust in either the inference process (convergence failures during maximum likelihood estimation) or the results of the inference process (it does not provide 𝑝-values, due to strong assumptions that are almost always violated).

Bayesian Statistics, Jose Storopoli 258


Bayesian Statistics
Hierarchical Models

Approaches to Hierarchical Modeling

• Varying-intercept model: One group-level intercept besides the population-level


coefficients.

• Varying-slope model: One or more group-level coefficient(s) besides the population-


level intercept.

• Varying-intercept-slope model: One group-level intercept and one or more group-


level coefficient(s).

Bayesian Statistics, Jose Storopoli 259


Bayesian Statistics
Hierarchical Models

Mathematical Specification of Hierarchical Models

We have 𝑁 observations organized in 𝐽 groups with 𝐾 independent variables.

Bayesian Statistics, Jose Storopoli 260


Bayesian Statistics
Hierarchical Models

Mathematical Specification – Varying-Intercept Model


This example is for linear regression:

𝒚 ∼ Normal(𝛼𝑗 + 𝑿 ⋅ 𝜷, 𝜎)
𝛼𝑗 ∼ Normal(𝛼, 𝜏 )
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜏 ∼ Cauchy+ (0, 𝜓𝛼 )
𝜎 ∼ Exponential(𝜆𝜎 )
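A minimal Stan sketch of this varying-intercept model, written with group offsets around a population-level intercept 𝛼 (an equivalent reparameterization of the specification above; hyperparameter values are placeholders):

data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=1> J;                            // number of groups
  array[N] int<lower=1, upper=J> group;      // group index of each observation
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  real alpha;                                // population-level intercept
  vector[J] alpha_j;                         // group-level intercept offsets
  vector[K] beta;
  real<lower=0> tau;                         // between-group standard deviation
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 2.5);
  alpha_j ~ normal(0, tau);
  beta ~ normal(0, 2.5);
  tau ~ cauchy(0, 2.5);                      // half-Cauchy via the lower bound
  sigma ~ exponential(1);
  y ~ normal(alpha + alpha_j[group] + X * beta, sigma);
}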

Bayesian Statistics, Jose Storopoli 261


Bayesian Statistics
Hierarchical Models

Mathematical Specification – Varying-Intercept Model


If you need to extend to more than one group, such as 𝐽1 , 𝐽2 , …:

𝒚 ∼ Normal(𝛼𝑗1 + 𝛼𝑗2 + 𝑿𝜷, 𝜎)
𝛼𝑗1 ∼ Normal(𝛼1 , 𝜏𝛼𝑗1 )
𝛼𝑗2 ∼ Normal(𝛼2 , 𝜏𝛼𝑗2 )
𝛼1 ∼ Normal(𝜇𝛼1 , 𝜎𝛼1 )
𝛼2 ∼ Normal(𝜇𝛼2 , 𝜎𝛼2 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜏𝛼𝑗1 ∼ Cauchy+ (0, 𝜓𝛼𝑗1 )
𝜏𝛼𝑗2 ∼ Cauchy+ (0, 𝜓𝛼𝑗2 )
𝜎 ∼ Exponential(𝜆𝜎 )

Bayesian Statistics, Jose Storopoli 262


Bayesian Statistics
Hierarchical Models

Mathematical Specification – Varying-(Intercept-)Slope Model


If we want a varying intercept, we just insert a column filled with 1s in the data matrix 𝑿.

Mathematically, this makes the column behave like an “identity” variable (because the number 1 in the multiplication 1 ⋅ 𝛽 is the identity element: it maps 𝑥 → 𝑥, keeping the value of 𝑥 intact) and, consequently, we can interpret that column's coefficient as the model's intercept.

Bayesian Statistics, Jose Storopoli 263


Bayesian Statistics
Hierarchical Models

Mathematical Specification – Varying-(Intercept-)Slope Model


Hence, we have as a data matrix:

    ⎡1  𝑥11  𝑥12  …  𝑥1𝐾 ⎤
𝑿 = ⎢1  𝑥21  𝑥22  …  𝑥2𝐾 ⎥
    ⎢⋮   ⋮    ⋮    ⋱   ⋮  ⎥
    ⎣1  𝑥𝑁1  𝑥𝑁2  …  𝑥𝑁𝐾 ⎦

Bayesian Statistics, Jose Storopoli 264


Bayesian Statistics
Hierarchical Models

Mathematical Specification – Varying-(Intercept-)Slope Model


This example is for linear regression:

𝒚 ∼ Normal(𝑿𝜷𝑗 , 𝜎)
𝜷𝑗 ∼ Multivariate Normal(𝝁𝑗 , 𝚺) for 𝑗 ∈ {1, …, 𝐽 }
𝚺 ∼ LKJ(𝜂)
𝜎 ∼ Exponential(𝜆𝜎 )

Each coefficient vector 𝜷𝑗 represents the model columns 𝑿 coefficients for every group
𝑗 ∈ 𝐽 . Also the first column of 𝑿 could be a column filled with 1s (intercept).

Bayesian Statistics, Jose Storopoli 265


Bayesian Statistics
Hierarchical Models

Mathematical Specification – Varying-(Intercept-)Slope Model


If you need to extend to more than one group, such as 𝐽1 , 𝐽2 , …:

𝒚 ∼ Normal(𝛼 + 𝑿𝜷𝑗1 + 𝑿𝜷𝑗2 , 𝜎)

𝜷𝑗1 ∼ Multivariate Normal(𝝁𝑗1 , 𝚺1 ) for 𝑗1 ∈ {1, …, 𝐽1 }

𝜷𝑗2 ∼ Multivariate Normal(𝝁𝑗2 , 𝚺2 ) for 𝑗2 ∈ {1, …, 𝐽2 }


𝚺1 ∼ LKJ(𝜂1 )
𝚺2 ∼ LKJ(𝜂2 )
𝜎 ∼ Exponential(𝜆𝜎 )

Bayesian Statistics, Jose Storopoli 266


Bayesian Statistics
Hierarchical Models

Priors for Covariance Matrices


We can specify a prior for a covariance matrix 𝚺.

For computational efficiency, we can make the covariance matrix 𝚺 into a correlation
matrix. Every covariance matrix can be decomposed into:
𝚺 = diagmatrix (𝝉 ) ⋅ 𝛀 ⋅ diagmatrix (𝝉 )

where 𝛀 is a correlation matrix with 1s in the diagonal and off-diagonal elements between −1 and 1, i.e. 𝜌 ∈ (−1, 1).
𝝉 is a vector composed of the variables' standard deviations from 𝚺 (i.e. the square root of 𝚺's diagonal).

Bayesian Statistics, Jose Storopoli 267


Bayesian Statistics
Hierarchical Models

Priors for Covariance Matrices


Additionally, the correlation matrix 𝛀 can be decomposed once more for greater computational efficiency. Since all correlation matrices are symmetric and positive definite (all of their eigenvalues are real numbers ℝ and positive > 0), we can use the Cholesky Decomposition to decompose it into a triangular matrix (which is much more computationally efficient to handle):

𝛀 = 𝑳Ω 𝑳𝑇Ω

where 𝑳Ω is a lower-triangular matrix.


What we are missing is to define a prior for the correlation matrix 𝛀. Until not long ago, a Wishart distribution was used as a prior (Andrew Gelman, John B. Carlin, Stern, et al., 2013).
But this has been abandoned after the proposal of the LKJ distribution by Lewandowski, Kurowicka and Joe (2009)xlv as a prior for correlation matrices.
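A minimal Stan sketch of this decomposition using the Cholesky-factored LKJ prior (the LKJ shape 2.0 and the half-Cauchy scale are placeholders; 𝐾 here stands for the number of group-level coefficients):

data {
  int<lower=1> K;
}
parameters {
  cholesky_factor_corr[K] L_Omega;       // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;                // standard deviations
}
transformed parameters {
  // Cholesky factor of the covariance matrix: diag(tau) * L_Omega
  matrix[K, K] L_Sigma = diag_pre_multiply(tau, L_Omega);
}
model {
  L_Omega ~ lkj_corr_cholesky(2.0);
  tau ~ cauchy(0, 2.5);                  // half-Cauchy via the lower bound
}

Group-level coefficient vectors can then be given a multi_normal_cholesky(mu, L_Sigma) prior.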

xlv
LKJ are the authors’ last name initials – Lewandowski, Kurowicka and Joe.

Bayesian Statistics, Jose Storopoli 268


Markov Chain Monte Carlo (MCMC) and Model Metrics
Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013):
‣ Chapter 10: Introduction to Bayesian computation
‣ Chapter 11: Basics of Markov chain simulation
‣ Chapter 12: Computationally efficient Markov chain simulation
• McElreath (2020) - Chapter 9: Markov Chain Monte Carlo
• Neal (2011)
• Betancourt (2017)
• Gelman, Hill and Vehtari (2020) - Chapter 22, Section 22.8: Computational efficiency
• Chib and Greenberg (1995)
• Casella and George (1992)

Bayesian Statistics, Jose Storopoli 270


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Monte Carlo Methods


• Stan is named after the mathematician Stanislaw Ulam, who was
involved in the Manhattan project, and while trying to calculate the
neutron diffusion process for the hydrogen bomb ended up
creating a whole class of methods called Monte Carlo (Eckhardt,
1987).
• Monte Carlo methods employ randomness to solve problems that are in principle deterministic in nature. They are frequently used in physics and mathematical problems, and are very useful when it is difficult or impossible to use other approaches.

Bayesian Statistics, Jose Storopoli 272


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

History Behind the Monte Carlo Methodsxlvi


• The idea came when Ulam was playing Solitaire while recovering from surgery. Ulam was trying to calculate the deterministic, i.e. analytical, solution for the probability of being dealt an already-won game. The calculations were almost impossible. So, he thought that he could play hundreds of games to statistically estimate, i.e. obtain a numerical solution for, the probability of this result.
• Ulam described the idea to John von Neumann in 1946.
• Due to the secrecy, von Neumann and Ulam’s work demanded a code name.
Nicholas Metropolis suggested using “Monte Carlo”, a homage to the “Casino
Monte Carlo” in Monaco, where Ulam’s uncle would ask relatives for money to
play.

xlvi
those who are interested, should read Eckhardt (1987).

Bayesian Statistics, Jose Storopoli 273


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Why Do We Need MCMC?


The main computation barrier for Bayesian statistics is the denominator in Bayes’ theorem,
𝑃 (data):
𝑃 (𝜃) ⋅ 𝑃 (data | 𝜃)
𝑃 (𝜃 | data) =
𝑃 (data)

In discrete cases, we can turn the denominator into a sum over all parameters using the chain rule
of probability:
𝑃 (𝐴, 𝐵 | 𝐶) = 𝑃 (𝐴 | 𝐵, 𝐶) ⋅ 𝑃 (𝐵 | 𝐶)

This is also known as marginalization:

𝑃 (data) = ∑_𝜃 𝑃 (data | 𝜃) ⋅ 𝑃 (𝜃)

Bayesian Statistics, Jose Storopoli 274


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Why Do We Need MCMC?


However, in the case of continuous values, the denominator 𝑃 (data) turns into a very big
and nasty integral:

𝑃 (data) = ∫_𝜃 𝑃 (data | 𝜃) ⋅ 𝑃 (𝜃) d𝜃

In many cases the integral is intractable (not possible to evaluate analytically)
and, thus, we must find other ways to compute the posterior 𝑃 (𝜃 | data) without using
the denominator 𝑃 (data).

This is where Monte Carlo methods comes into play!

Bayesian Statistics, Jose Storopoli 275


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Why Do We Need the Denominator 𝑃 (data)?


To normalize the posterior with the intent of making it a valid probability. This means
that the probability for all possible parameters’ values must be 1:
• in the discrete case:

∑_𝜃 𝑃 (𝜃 | data) = 1

• in the continuous case:

∫_𝜃 𝑃 (𝜃 | data) d𝜃 = 1

Bayesian Statistics, Jose Storopoli 276


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

What If We Remove the Denominator 𝑃 (data)?


By removing the denominator 𝑃 (data), we conclude that the posterior 𝑃 (𝜃 | data) is
proportional to the product of the prior and the likelihood 𝑃 (𝜃) ⋅ 𝑃 (data | 𝜃):

𝑃 (𝜃 | data) ∝ 𝑃 (𝜃) ⋅ 𝑃 (data | 𝜃)

Bayesian Statistics, Jose Storopoli 277


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Markov Chain Monte Carlo (MCMC)


Here is where Markov Chain Monte Carlo comes in:
MCMC is an ample class of computational tools to approximate integrals and generate
samples from a posterior probability (Brooks et al., 2011).
MCMC is used when it is not possible to sample 𝜽 directly from the posterior probability
𝑃 (𝜽 | data).
Instead, we collect samples in an iterative manner, where every step of the process we
expect that the distribution which we are sampling from 𝑃 ∗ (𝜽(∗) | data) becomes more
similar in every iteration to the posterior 𝑃 (𝜽 | data).
All of this is to eliminate the evaluation (often impossible) of the denominator 𝑃 (data).

Bayesian Statistics, Jose Storopoli 278


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Markov Chains

• We proceed by defining an ergodic Markov chainxlvii in which the set of possible states is the sample space and the stationary distribution is the distribution to be approximated (or sampled).
• Let 𝑋0 , 𝑋1 , …, 𝑋𝑛 be a simulation of the chain. The Markov chain
converges to the stationary distribution from any initial state 𝑋0
after a sufficiently large number of iterations 𝑟. The distribution of
the state 𝑋𝑟 will be similar to the stationary distribution, hence we
can use it as a sample.

xlvii
meaning that there is an unique stationary distribution.

Bayesian Statistics, Jose Storopoli 279


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Markov Chains
• Markov chains have a property that the probability distribution of
the next state depends only on the current state and not in the
sequence of events that preceded:
𝑃 (𝑋𝑛+1 = 𝑥 | 𝑋0 , 𝑋1 , 𝑋2 , …, 𝑋𝑛 ) = 𝑃 (𝑋𝑛+1 = 𝑥 | 𝑋𝑛 )

This property is called Markovian


• Similarly, using this argument with 𝑋𝑟 as the initial state, we can
use 𝑋2𝑟 as a sample, and so on. We can use the sequence of states
𝑋𝑟 , 𝑋2𝑟 , 𝑋3𝑟 , … as almost independent samples of the Markov chain's stationary distribution.

Bayesian Statistics, Jose Storopoli 280


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Example of a Markov Chain


Figure: a two-state weather Markov chain with states Sun and Rain and transition probabilities 0.6, 0.4, 0.3, and 0.7 between and within the states.

Bayesian Statistics, Jose Storopoli 281


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Markov Chains
The efficacy of this approach depends on:
• how big 𝑟 must be to guarantee an adequate sample.
• computational power required for every Markov chain iteration.
Besides, it is customary to discard the first iterations of the algorithm because they are usually non-representative of the underlying stationary distribution to be approximated. In the initial iterations of MCMC algorithms, the Markov chain is often in a “warm-up”xlviii process, and its state is very far away from an ideal one to begin a trustworthy sampling.
Generally, it is recommended to discard the first half iterations (Andrew Gelman, John B. Carlin,
Stern, et al., 2013).

xlviii
some references call this “burnin”.

Bayesian Statistics, Jose Storopoli 282


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

MCMC Algorithms
We have TONS of MCMC algorithmsxlix. Here we are going to cover two classes of MCMC
algorithms:

• Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970).

• Hamiltonian Monte Carlol (Neal, 2011; Betancourt, 2017).

xlix
see the Wikipedia page for a full list.
l
sometimes called Hybrid Monte Carlo, specially in the physics literature.

Bayesian Statistics, Jose Storopoli 283


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

MCMC Algorithms – Metropolis-Hastings


These are the first MCMC algorithms. They use an acceptance/rejection rule for the proposals. They are characterized by proposals originating from a random walk in the parameter space. The Gibbs algorithm can be seen as a special case of MH, because all proposals are automatically accepted (Gelman, 1992).

Asymptotically, they have an acceptance rate of 23.4%, and the computational cost of every iteration is 𝒪(𝑑), where 𝑑 is the number of dimensions in the parameter space (Beskos et al., 2013).

Bayesian Statistics, Jose Storopoli 284


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

MCMC Algorithms – Hamiltonian Monte Carlo


The current most efficient MCMC algorithms. They try to avoid the random-walk behavior by introducing an auxiliary vector of momenta using Hamiltonian dynamics. The proposals are “guided” towards higher-density regions of the sample space. This makes HMC orders of magnitude more efficient than MH and Gibbs.

Asymptotically, they have an acceptance rate of 65.1%, and the computational cost of every iteration is 𝒪(𝑑^{1/4}), where 𝑑 is the number of dimensions in the parameter space (Beskos et al., 2013).

Bayesian Statistics, Jose Storopoli 285


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Metropolis Algorithm
The first broadly used MCMC algorithm to generate samples from a Markov chain originated in the physics literature in the 1950s and is called Metropolis (Metropolis et al., 1953), in honor of the first author, Nicholas Metropolis.
In sum, the Metropolis algorithm is an adaptation of a random walk coupled with an acceptance/rejection rule so that it converges to the target distribution.
The Metropolis algorithm uses a "proposal distribution" 𝐽𝑡(𝜽∗) to propose the next values of the target distribution 𝑃∗(𝜽∗ | data). The proposal distribution must be symmetric:

𝐽𝑡 (𝜽∗ | 𝜽𝑡−1 ) = 𝐽𝑡 (𝜽𝑡−1 | 𝜽∗ )

Bayesian Statistics, Jose Storopoli 286


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Metropolis Algorithm
Metropolis is a random walk through the parameter sample space, where the probability of the
Markov chain changing its state is defined as:

𝑃change = min(𝑃(𝜽proposed) / 𝑃(𝜽current), 1).

This means that the Markov chain will only change to a new state based on one of two conditions:
• when the probability of the random-walk proposed parameters 𝑃(𝜽proposed) is higher than the probability of the current state parameters 𝑃(𝜽current), we change with 100% probability.
• when the probability of the random-walk proposed parameters 𝑃(𝜽proposed) is lower than the probability of the current state parameters 𝑃(𝜽current), we change with probability equal to the ratio of the proposed to the current probability.

Bayesian Statistics, Jose Storopoli 287


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Metropolis Algorithm

Define an initial set 𝜽^0 ∈ ℝᵖ such that 𝑃(𝜽^0 | 𝒚) > 0
for 𝑡 = 1, 2, …
  Sample a proposal 𝜽∗ from a proposal distribution at time 𝑡, 𝐽𝑡(𝜽∗ | 𝜽^{𝑡−1})
  As an acceptance/rejection rule, compute the ratio of the probabilities:
    𝑟 = 𝑃(𝜽∗ | 𝒚) / 𝑃(𝜽^{𝑡−1} | 𝒚)
  Assign:
    𝜽^𝑡 = 𝜽∗ with probability min(𝑟, 1), otherwise 𝜽^𝑡 = 𝜽^{𝑡−1}

Bayesian Statistics, Jose Storopoli 288
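To make the recipe above concrete, here is a minimal sketch in Julia of a random-walk Metropolis sampler targeting a standard Normal "posterior" (all names are illustrative and not from any package):

# Log of the target density, up to an additive constant.
logtarget(θ) = -0.5 * θ^2

function metropolis(logtarget, θ0; n = 10_000, width = 1.0)
    θ = θ0
    draws = Vector{Float64}(undef, n)
    accepted = 0
    for t in 1:n
        θstar = θ + width * randn()              # symmetric random-walk proposal J_t
        logr = logtarget(θstar) - logtarget(θ)   # log of the ratio r
        if log(rand()) < min(logr, 0.0)          # accept with probability min(r, 1)
            θ = θstar
            accepted += 1
        end
        draws[t] = θ                             # if rejected, repeat the current state
    end
    return draws, accepted / n
end

draws, rate = metropolis(logtarget, 0.0)
println("acceptance rate ≈ ", round(rate; digits = 2))
println("posterior mean  ≈ ", sum(draws) / length(draws))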


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Visual Intuition – Metropolis

Figure: Visual intuition for Metropolis: a walker 🚶 proposing moves on the target's PDF. Uphill proposals are accepted with probability 𝑃 = 1; downhill proposals are accepted with probability equal to the density ratio (here 𝑃 ≈ 1/4).

Bayesian Statistics, Jose Storopoli 289


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Metropolis-Hastings Algorithm
In the 1970s emerged a generalization of the Metropolis algorithm, which does not require the proposal distributions to be symmetric:

𝐽𝑡(𝜽∗ | 𝜽^{𝑡−1}) ≠ 𝐽𝑡(𝜽^{𝑡−1} | 𝜽∗)

The generalization was proposed by Wilfred Keith Hastings (Hastings, 1970) and is called the Metropolis-Hastings algorithm.

Bayesian Statistics, Jose Storopoli 290


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Metropolis-Hastings Algorithm

Define an initial set 𝜽^0 ∈ ℝᵖ such that 𝑃(𝜽^0 | 𝒚) > 0
for 𝑡 = 1, 2, …
  Sample a proposal 𝜽∗ from a proposal distribution at time 𝑡, 𝐽𝑡(𝜽∗ | 𝜽^{𝑡−1})
  As an acceptance/rejection rule, compute the ratio of the probabilities, corrected by the proposal densities:
    𝑟 = [𝑃(𝜽∗ | 𝒚) / 𝐽𝑡(𝜽∗ | 𝜽^{𝑡−1})] / [𝑃(𝜽^{𝑡−1} | 𝒚) / 𝐽𝑡(𝜽^{𝑡−1} | 𝜽∗)]
  Assign:
    𝜽^𝑡 = 𝜽∗ with probability min(𝑟, 1), otherwise 𝜽^𝑡 = 𝜽^{𝑡−1}

Bayesian Statistics, Jose Storopoli 291


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Metropolis-Hastings Animation

See Metropolis-Hastings in action at chi-feng/mcmc-demo .

Bayesian Statistics, Jose Storopoli 292


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Limitations of the Metropolis Algorithms


The limitations of the Metropolis-Hastings algorithms are mainly computational:
• with randomly generated proposals, it can take a large number of iterations for the Markov chain to enter higher posterior-density spaces.
• even highly efficient MH algorithms sometimes accept less than 25% of the proposals (Roberts, Gelman and Gilks, 1997; Beskos et al., 2013).
• in lower-dimensional contexts, higher computational power can compensate for the low efficiency up to a point. But in higher-dimensional (and higher-complexity) modeling situations, higher computational power alone is rarely sufficient to overcome the low efficiency.

Bayesian Statistics, Jose Storopoli 293


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Gibbs Algorithm
To circumvent Metropolis’ low acceptance rate, the Gibbs algorithm
was conceived. Gibbs do not have an acceptance/rejection rule for
the Markov chain state change: all proposals are accepted!
Gibbs algorithm was originally conceived by the physicist Josiah
Willard Gibbs while referencing an analogy between a sampling
algorithm and statistical physics (a physics field that originates from
statistical mechanics).
The algorithm was described by the Geman brothers in 1984 (Geman
and Geman, 1984), about 8 decades after Gibbs death.

Bayesian Statistics, Jose Storopoli 294


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Gibbs Algorithm
The Gibbs algorithm is very useful in multidimensional sample spaces. It is also known as alternating conditional sampling, because we always sample a parameter conditioned on the other parameters of the model.
The Gibbs algorithm can be seen as a special case of the Metropolis-Hastings algorithm, because all proposals are accepted (Gelman, 1992).
The essence of the Gibbs algorithm is sampling parameters conditioned on the other parameters:

𝑃 (𝜃1 | 𝜃2 , …, 𝜃𝑝 )

Bayesian Statistics, Jose Storopoli 295


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Gibbs Algorithm

Define an initial set 𝜽^0 ∈ ℝᵖ such that 𝑃(𝜽^0 | 𝒚) > 0
for 𝑡 = 1, 2, …
  Assign:
    𝜃_1^𝑡 ∼ 𝑃(𝜃_1 | 𝜃_2^{𝑡−1}, …, 𝜃_𝑝^{𝑡−1})
    𝜃_2^𝑡 ∼ 𝑃(𝜃_2 | 𝜃_1^𝑡, 𝜃_3^{𝑡−1}, …, 𝜃_𝑝^{𝑡−1})
    ⋮
    𝜃_𝑝^𝑡 ∼ 𝑃(𝜃_𝑝 | 𝜃_1^𝑡, …, 𝜃_{𝑝−1}^𝑡)

Bayesian Statistics, Jose Storopoli 296
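As a minimal sketch (assuming a bivariate Normal target with correlation ρ, whose full conditionals are known in closed form), a Gibbs sampler in Julia looks like this:

using Statistics

function gibbs_bivariate_normal(ρ; n = 10_000)
    θ1, θ2 = 0.0, 0.0
    draws = Matrix{Float64}(undef, n, 2)
    s = sqrt(1 - ρ^2)                  # conditional standard deviation
    for t in 1:n
        θ1 = ρ * θ2 + s * randn()      # sample θ₁ | θ₂
        θ2 = ρ * θ1 + s * randn()      # sample θ₂ | θ₁ (using the updated θ₁)
        draws[t, 1], draws[t, 2] = θ1, θ2
    end
    return draws
end

draws = gibbs_bivariate_normal(0.8)
println("sample correlation ≈ ", cor(draws[:, 1], draws[:, 2]))   # should be close to 0.8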


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Gibbs Animation

See Gibbs in action at chi-feng/mcmc-demo .

Bayesian Statistics, Jose Storopoli 297


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Limitations of the Gibbs Algorithm


The main limitation of the Gibbs algorithm is related to its alternating conditional sampling:
• In Metropolis, the parameters' random proposals are sampled unconditionally, jointly, and simultaneously. The Markov chain state changes are executed in a multidimensional manner, which allows multidimensional diagonal movements.
• In the case of the Gibbs algorithm, this movement only happens one parameter at a time, because we sample parameters conditionally and sequentially with respect to the other parameters. This allows only unidimensional horizontal/vertical movements, and never multidimensional diagonal movements.

Bayesian Statistics, Jose Storopoli 298


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Hamiltonian Monte Carlo (HMC)


Metropolis’ low acceptance rate and Gibbs’ low performance in
multidimensional problems (where the posterior geometry is highly
complex) made a new class of MCMC algorithms to emerge.
These are called Hamiltonian Monte Carlo (HMC), because they
incorporate Hamiltonian dynamics (in honor of Irish physicist
William Rowan Hamilton).

Bayesian Statistics, Jose Storopoli 299


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

HMC Algorithm
The HMC algorithm is an adaptation of the MH algorithm that employs a guidance scheme for generating new proposals. This boosts the acceptance rate and, consequently, yields better efficiency.
More specifically, HMC uses the gradient of the posterior's log density to guide the Markov chain to higher-density regions of the sample space, where most of the samples are taken:
d log 𝑃(𝜽 | 𝒚) / d𝜽
As a result, a Markov chain using a well-adjusted HMC algorithm will accept proposals at a much higher rate than if it were using the MH algorithm (Roberts, Gelman and Gilks, 1997; Beskos et al., 2013).

Bayesian Statistics, Jose Storopoli 300


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

History of HMC Algorithm


HMC was originally described in the physics literatureli (Duane et al., 1987).

Soon after, HMC was applied to statistical problems by Neal (1994), who named it Hamiltonian Monte Carlo (HMC).

For a much more detailed and in-depth discussion of HMC (not our focus here), I recommend Neal (2011) and Betancourt (2017).

li
where it is called "Hybrid" Monte Carlo (HMC)

Bayesian Statistics, Jose Storopoli 301


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

What Changes With HMC?


HMC uses Hamiltonian dynamics applied to particles to efficiently explore the posterior probability geometry, while also being robust to complex posterior geometries.

Besides that, HMC is much more efficient than Metropolis and does not suffer from Gibbs' correlated-parameters issues.

Bayesian Statistics, Jose Storopoli 302


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Intuition Behind the HMC Algorithm


For every parameter 𝜃𝑗, HMC adds a momentum variable 𝜑𝑗. The posterior density 𝑃(𝜽 | 𝑦) is augmented by an independent momentum distribution 𝑃(𝝋), hence defining the following joint probability:
𝑃(𝜽, 𝝋 | 𝑦) = 𝑃(𝝋) ⋅ 𝑃(𝜽 | 𝑦)

HMC uses a proposal distribution that changes depending on the Markov chain's current state. HMC finds the direction in which the posterior density increases, the gradient, and alters the proposal distribution towards the gradient direction.
The probability of the Markov chain changing its state in HMC is defined as:

𝑃change = min([𝑃(𝜽proposed) ⋅ 𝑃(𝝋proposed)] / [𝑃(𝜽current) ⋅ 𝑃(𝝋current)], 1)

Bayesian Statistics, Jose Storopoli 303


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Momenta Distribution – 𝑃 (𝝋)


Generally we give 𝝋 a multivariate normal distribution with mean 0 and covariance 𝑴, a "mass matrix".

To keep things computationally simple, we use a diagonal mass matrix 𝑴. This makes the components of 𝝋 independent, each one having a normal distribution:

𝜑𝑗 ∼ Normal(0, 𝑀𝑗𝑗)

Bayesian Statistics, Jose Storopoli 304


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

HMC Algorithm
Define an initial set 𝜽^0 ∈ ℝᵖ such that 𝑃(𝜽^0 | 𝒚) > 0
Sample 𝝋 from a Multivariate Normal(𝟎, 𝑴)
Simultaneously sample 𝜽∗ and 𝝋 with 𝐿 leapfrog steps and step-size 𝜀
Define the proposed value 𝜽∗ as the current value 𝜽: 𝜽∗ ← 𝜽
for 1, 2, …, 𝐿
  Use the gradient of the log posterior at 𝜽∗ to produce a half-step of 𝝋: 𝝋 ← 𝝋 + ½ 𝜀 d log 𝑃(𝜽∗ | 𝒚) / d𝜽
  Use 𝝋 to update 𝜽∗: 𝜽∗ ← 𝜽∗ + 𝜀 𝑴⁻¹ 𝝋
  Use again the log posterior gradient at 𝜽∗ to produce another half-step of 𝝋: 𝝋 ← 𝝋 + ½ 𝜀 d log 𝑃(𝜽∗ | 𝒚) / d𝜽
As an acceptance/rejection rule, compute:
  𝑟 = [𝑃(𝜽∗ | 𝒚) 𝑃(𝝋∗)] / [𝑃(𝜽^{𝑡−1} | 𝒚) 𝑃(𝝋^{𝑡−1})]
Assign:
  𝜽^𝑡 = 𝜽∗ with probability min(𝑟, 1), otherwise 𝜽^𝑡 = 𝜽^{𝑡−1}

Bayesian Statistics, Jose Storopoli 305
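A minimal sketch in Julia of this recipe, with a leapfrog integrator, a unit mass matrix 𝑴 = 𝑰, and a standard Normal target (purely illustrative, not any library's implementation):

logp(θ)  = -0.5 * θ^2        # log target density (up to a constant)
∇logp(θ) = -θ                # its gradient

function hmc(logp, ∇logp, θ0; n = 5_000, L = 20, ε = 0.1)
    θ = θ0
    draws = Vector{Float64}(undef, n)
    for t in 1:n
        φ = randn()                           # momentum ~ Normal(0, 1)
        θstar, φstar = θ, φ
        for _ in 1:L                          # leapfrog steps
            φstar += 0.5 * ε * ∇logp(θstar)   # half-step for the momentum
            θstar += ε * φstar                # full step for the position
            φstar += 0.5 * ε * ∇logp(θstar)   # half-step for the momentum
        end
        # joint acceptance ratio r on the log scale: P(θ*, φ*) / P(θ, φ)
        logr = (logp(θstar) - 0.5 * φstar^2) - (logp(θ) - 0.5 * φ^2)
        if log(rand()) < min(logr, 0.0)
            θ = θstar
        end
        draws[t] = θ
    end
    return draws
end

draws = hmc(logp, ∇logp, 0.0)
println("posterior mean ≈ ", sum(draws) / length(draws))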


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

HMC Animation

See HMC in action at chi-feng/mcmc-demo .

Bayesian Statistics, Jose Storopoli 306


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

An Interlude into Numerical Integration


In the field of ordinary differential equations (ODEs), we have the idea of "discretizing" a system of ODEs by applying a small step-size 𝜀lii. Such approaches are called "numerical integrators" and comprise an ample class of tools.

The most famous and simplest of these numerical integrators is the Euler method, where we use a step-size 𝜀 to compute a numerical solution of the system at a future time 𝑡 from specific initial conditions.

lii
sometimes also called ℎ

Bayesian Statistics, Jose Storopoli 307


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

An Interlude into Numerical Integration


The problem is that the Euler method, when applied to Hamiltonian dynamics, does not preserve volume.

One of the fundamental properties of Hamiltonian dynamics is volume preservation.

This makes the Euler method a bad choice as HMC's numerical integrator.

Figure 6: HMC numerically integrated using Euler with 𝜀 = 0.3 and 𝐿 = 20

Bayesian Statistics, Jose Storopoli 308


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

An Interlude into Numerical Integrationliii


To preserve volume, we need a symplectic numerical integrator.

Symplectic integrators are at most second-order and demand a constant step-size 𝜀.

One of the main symplectic numerical integrators used in Hamiltonian dynamics is the Störmer–Verlet integrator, also known as the leapfrog integrator.

Figure 7: HMC numerically integrated using leapfrog with 𝜀 = 0.3 and 𝐿 = 20

liii
An excellent textbook on numerical and symplectic integrators is Iserles (2008).

Bayesian Statistics, Jose Storopoli 309


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Limitations of the HMC Algorithm


As you can see, the HMC algorithm is highly sensitive to the choice of leapfrog steps 𝐿 and step-size 𝜀.

More specifically, the leapfrog integrator allows only a constant 𝜀.

There is a delicate balance between 𝐿 and 𝜀, which are hyperparameters that need to be carefully adjusted.

Figure 8: HMC numerically integrated using leapfrog with 𝜀 = 1.2 and 𝐿 = 20

Bayesian Statistics, Jose Storopoli 310


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

No-U-Turn-Sampler (NUTS)
In HMC, we can adjust 𝜀 during the algorithm's runtime. But, for 𝐿, we need to "dry run" the HMC sampler to find a good candidate value for 𝐿.

Here is where the idea for the No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2011) enters: you don't need to adjust anything, just "press the button".

It will automatically find 𝜀 and 𝐿.

Bayesian Statistics, Jose Storopoli 311


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

No-U-Turn-Sampler (NUTS)
More specifically, we need a criterion that tells us when we have performed enough Hamiltonian dynamics simulation.
In other words, simulating further would not increase the distance between the proposal 𝜽∗ and the current value 𝜽.
NUTS uses a criterion based on the dot product between the current momentum vector 𝝋 and the difference between the proposal vector 𝜽∗ and the current vector 𝜽, which turns out to be the derivative with respect to time 𝑡 of half the squared distance between 𝜽 and 𝜽∗:

(𝜽∗ − 𝜽) ⋅ 𝝋 = (𝜽∗ − 𝜽) ⋅ d/d𝑡 (𝜽∗ − 𝜽) = d/d𝑡 [(𝜽∗ − 𝜽) ⋅ (𝜽∗ − 𝜽) / 2]

Bayesian Statistics, Jose Storopoli 312


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

No-U-Turn-Sampler (NUTS)

This suggests an algorithm that does not let proposals be guided indefinitely: the simulation stops once this derivative becomes negative, i.e. once continuing would start to decrease the distance between the proposal 𝜽∗ and the current 𝜽.

This means that such an algorithm will not allow u-turns.

Bayesian Statistics, Jose Storopoli 313


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

No-U-Turn-Sampler (NUTS)
NUTS uses the leapfrog integrator to create a binary tree where each leaf node is a proposal of the momentum vector 𝝋, tracing both a forward (𝑡 + 1) and a backward (𝑡 − 1) path in a fictitious time 𝑡.
The growing of the leaf nodes is interrupted when a u-turn is detected, either forward or backward.

Figure 9: NUTS growing leaf nodes forward

Bayesian Statistics, Jose Storopoli 314


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

No-U-Turn-Sampler (NUTS)

NUTS also uses a procedure called Dual Averaging (Nesterov, 2009) to simultaneously
adjust 𝜀 and 𝐿 by considering the product 𝜀 ⋅ 𝐿.

Such adjustment is done during the warmup phase and the defined values of 𝜀 and 𝐿
are kept fixed during the sampling phase.

Bayesian Statistics, Jose Storopoli 315


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

NUTS Algorithm

Bayesian Statistics, Jose Storopoli 316


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Define an initial set 𝜽^0 ∈ ℝᵖ such that 𝑃(𝜽^0 | 𝒚) > 0

Instantiate an empty binary tree with 2^𝐿 leaf nodes

Sample 𝝋 from a Multivariate Normal(𝟎, 𝑴)

Simultaneously sample 𝜽∗ and 𝝋 with 𝐿 leapfrog steps and step-size 𝜀

Define the proposed value 𝜽∗ as the current value 𝜽: 𝜽∗ ← 𝜽

for 1, 2, …, 𝐿

  Choose a direction 𝑣 ∼ Uniform({−1, 1})

  Use the gradient of the log posterior at 𝜽∗ to produce a half-step of 𝝋: 𝝋 ← 𝝋 + ½ 𝜀 d log 𝑃(𝜽∗ | 𝒚) / d𝜽

  Use 𝝋 to update 𝜽∗: 𝜽∗ ← 𝜽∗ + 𝜀 𝑴⁻¹ 𝝋

  Use again the log posterior gradient at 𝜽∗ to produce a half-step of 𝝋: 𝝋 ← 𝝋 + ½ 𝜀 d log 𝑃(𝜽∗ | 𝒚) / d𝜽

  Define the leaf node 𝐿_𝑣^𝑡 as the proposal 𝜽∗

  If the derivative of half the squared distance between the proposal 𝜽∗ and the current 𝜽 in the direction 𝑣 is lower than zero, 𝑣 ⋅ d/d𝑡 [(𝜽∗ − 𝜽) ⋅ (𝜽∗ − 𝜽) / 2] < 0, or 𝐿 steps have been reached:

    Stop sampling 𝜽∗ in the direction 𝑣 and continue sampling only in the direction −𝑣

  If the derivative of half the squared distance between the proposal 𝜽∗ and the current 𝜽 in the direction −𝑣 is lower than zero, −𝑣 ⋅ d/d𝑡 [(𝜽∗ − 𝜽) ⋅ (𝜽∗ − 𝜽) / 2] < 0, or 𝐿 steps have been reached:

    Stop sampling 𝜽∗

Choose a random node from the binary tree as the proposal

As an acceptance/rejection rule, compute:

  𝑟 = [𝑃(𝜽∗ | 𝒚) 𝑃(𝝋∗)] / [𝑃(𝜽^{𝑡−1} | 𝒚) 𝑃(𝝋^{𝑡−1})]

Assign:

  𝜽^𝑡 = 𝜽∗ with probability min(𝑟, 1), otherwise 𝜽^𝑡 = 𝜽^{𝑡−1}

Bayesian Statistics, Jose Storopoli 316


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

NUTS Animation

See NUTS in action at chi-feng/mcmc-demo .

Bayesian Statistics, Jose Storopoli 317


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Limitations of HMC and NUTS Algorithms – Neal (2003)’s Funnel


The famous "Devil's Funnel"liv.
Here we see that HMC and NUTS, during the exploration of the posterior, would often have to change the values of 𝐿 and 𝜀lv.

liv
very common in hierarchical models.
lv
remember that 𝐿 and 𝜀 are defined in the warmup phase and kept fixed during sampling.

Bayesian Statistics, Jose Storopoli 318


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Neal (2003)’s Funnel and Non-Centered Parameterization (NCP)


The funnel occurs when we have a variable whose variance depends on another variable on an exponential scale. A canonical example of a centered parameterization (CP) is:
𝑃(𝑦, 𝑥) = Normal(𝑦 | 0, 3) ⋅ Normal(𝑥 | 0, 𝑒^{𝑦/2})

This occurs often in hierarchical models, in the relationship between group-level priors and population-level hyperpriors. Hence, we reparameterize in a non-centered way (NCP), changing the posterior geometry to make life easier for our MCMC sampler:
𝑃(𝑦̃, 𝑥̃) = Normal(𝑦̃ | 0, 1) ⋅ Normal(𝑥̃ | 0, 1)
𝑦 = 𝑦̃ ⋅ 3 + 0
𝑥 = 𝑥̃ ⋅ 𝑒^{𝑦/2} + 0

Bayesian Statistics, Jose Storopoli 319
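A minimal sketch of the two parameterizations in Turing.jl (model and variable names are ours; the commented sample calls assume the standard Turing NUTS interface):

using Turing

@model function funnel_cp()
    y ~ Normal(0, 3)
    x ~ Normal(0, exp(y / 2))     # x's scale depends exponentially on y
end

@model function funnel_ncp()
    y_std ~ Normal(0, 1)          # auxiliary standard Normals
    x_std ~ Normal(0, 1)
    y = y_std * 3                 # recover the original scales deterministically
    x = x_std * exp(y / 2)
    return (y = y, x = x)
end

# chain_cp  = sample(funnel_cp(),  NUTS(), 2_000)   # expect divergent transitions
# chain_ncp = sample(funnel_ncp(), NUTS(), 2_000)   # friendlier geometry, few/no divergences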


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Non-Centered Parameterization – Varying-Intercept Model


This example is for linear regression:

𝒚 ∼ Normal(𝛼𝑗 + 𝑿 ⋅ 𝜷, 𝜎)
𝛼𝑗 = 𝑧𝑗 ⋅ 𝜏 + 𝛼
𝑧𝑗 ∼ Normal(0, 1)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜏 ∼ Cauchy+ (0, 𝜓𝛼 )
𝜎 ∼ Exponential(𝜆𝜎 )

Bayesian Statistics, Jose Storopoli 320
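A minimal Turing.jl sketch of this non-centered varying-intercept regression (hyperprior values and all names here are illustrative assumptions, not prescriptions):

using Turing, LinearAlgebra

@model function varying_intercept_ncp(X, idx, y; J = length(unique(idx)))
    α ~ Normal(0, 5)                             # population-level intercept
    β ~ filldist(Normal(0, 2), size(X, 2))       # population-level coefficients
    σ ~ Exponential(1)                           # residual scale
    τ ~ truncated(Cauchy(0, 2); lower = 0)       # group-level scale (Cauchy⁺)
    zⱼ ~ filldist(Normal(0, 1), J)               # non-centered group effects
    αⱼ = zⱼ .* τ                                 # αⱼ = zⱼ ⋅ τ, shifted by α in the linear predictor
    y ~ MvNormal(α .+ αⱼ[idx] .+ X * β, σ^2 * I)
end

# model = varying_intercept_ncp(X, idx, y)
# chain = sample(model, NUTS(), 2_000)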


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Non-Centered Parameterization – Varying-(Intercept-)Slope Model


This example is for linear regression:

𝒚 ∼ Normal(𝑿𝜷𝑗 , 𝜎)
𝜷𝑗 = 𝜸𝑗 ⋅ 𝚺 ⋅ 𝜸𝑗
𝜸𝑗 ∼ Multivariate Normal(𝟎, 𝑰) for 𝑗 ∈ {1, …, 𝐽 }
𝚺 ∼ LKJ(𝜂)
𝜎 ∼ Exponential(𝜆𝜎 )

Each coefficient vector 𝜷𝑗 represents the model columns 𝑿 coefficients for every group
𝑗 ∈ 𝐽 . Also the first column of 𝑿 could be a column filled with 1s (intercept).

Bayesian Statistics, Jose Storopoli 321


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Stan and NUTS


Stan was the first MCMC sampler to implement NUTS.
Besides that, it has an automatic optimized adjustment routine for values of 𝐿 and 𝜀
during warmup.
It has the following default NUTS hyperparameters’ valueslvi:

• target acceptance rate of Metropolis proposals: 0.8


• max tree depth (in powers of 2): 10 (which means 2¹⁰ = 1024)

lvi
for more information about how to change those values, see Section 15.2 of the Stan Reference Manual .

Bayesian Statistics, Jose Storopoli 322


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Turing and NUTS


Turing also implements NUTS, which lives, along with other MCMC samplers, inside the package AdvancedHMC.jl.
It also has an automatic optimized adjustment routine for the values of 𝐿 and 𝜀 during warmup.
It has the following default NUTS hyperparameters' valueslvii:

• target acceptance rate of Metropolis proposals: 0.65

• max tree depth (in powers of 2): 10 (which means 2¹⁰ = 1024)

lvii
for more information about how to change those values, see Turing Documentation .

Bayesian Statistics, Jose Storopoli 323


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Markov Chain Convergence


MCMC has the interesting property that it will asymptotically converge to the target distributionlviii.
That means that, if we have all the time in the world, it is guaranteed, irrespective of the target posterior's geometry, that MCMC will give you the right answer.
However, we don't have all the time in the world. Different MCMC algorithms, like HMC and NUTS, can reduce the sampling (and warmup) time necessary for convergence to the target distribution.

lviii
this property is not present on neural networks.

Bayesian Statistics, Jose Storopoli 324


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Convergence Metrics

We have some options on how to measure whether the Markov chains have converged to the target distribution, i.e. whether they are "reliable":

• Effective Sample Size (ESS): an approximation of the "number of independent samples" generated by a Markov chain.

• 𝑅̂ (Rhat): potential scale reduction factor, a metric to measure whether the Markov chains have mixed, and, potentially, converged.

Bayesian Statistics, Jose Storopoli 325


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Convergence Metrics – Effective Sample Size (Andrew Gelman, John B.


Carlin, Stern, et al., 2013)
𝑛̂eff = 𝑚𝑛 / (1 + 2 ∑_{𝑡=1}^{𝑇} 𝜌̂𝑡)

where:
• 𝑚: number of Markov chains.
• 𝑛: total samples per Markov chain (discarding warmup).
• 𝜌̂𝑡: an autocorrelation estimate at lag 𝑡.

Bayesian Statistics, Jose Storopoli 326


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Convergence Metrics – Rhat (Andrew Gelman, John B. Carlin, Stern, et al.,


2013)
𝑅̂ = √(var̂⁺(𝜓 | 𝑦) / 𝑊)

where var̂⁺(𝜓 | 𝑦) is the Markov chains' pooled variance estimate for a certain parameter 𝜓.
We calculate it by using a weighted sum of the within-chain 𝑊 and between-chain 𝐵 variances:

var̂⁺(𝜓 | 𝑦) = ((𝑛 − 1)/𝑛) 𝑊 + (1/𝑛) 𝐵

Intuitively, the value is 1.0 if all chains are fully converged.
As a heuristic, if 𝑅̂ > 1.1, you need to worry, because the chains have probably not converged adequately.

Bayesian Statistics, Jose Storopoli 327
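A minimal sketch in Julia of this (non-split) 𝑅̂ computation for a single parameter, with draws as an 𝑛 × 𝑚 matrix of 𝑛 post-warmup iterations from 𝑚 chains (names are ours; packages such as MCMCChains.jl compute a more refined split-𝑅̂ for you):

using Statistics

function rhat(draws::AbstractMatrix)
    n, m = size(draws)                       # n iterations, m chains
    chain_means = vec(mean(draws; dims = 1))
    B = n * var(chain_means)                 # between-chain variance
    W = mean(var(draws; dims = 1))           # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # weighted variance estimate from the slide
    return sqrt(var_plus / W)
end

# Four chains of 1,000 draws each from the same target should give R̂ ≈ 1.0:
println(rhat(randn(1_000, 4)))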


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Traceplot – Convergent Markov Chains

Figure 10: A convergent Markov chains traceplot

Bayesian Statistics, Jose Storopoli 328


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Traceplot – Divergent Markov Chains

Figure 11: A divergent Markov chains traceplot

Bayesian Statistics, Jose Storopoli 329


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Stan’s Warning Messageslix

Warning messages:
1: There were 275 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
2: Examine the pairs() plot to diagnose sampling problems
3: The largest R-hat is 1.12, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior
means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
5: Tail Effective Samples Size (ESS) is too low, indicating posterior
variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

lix
also see Stan’s warnings guide.

Bayesian Statistics, Jose Storopoli 330


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

Turing’s Warning Messages

Turing does not give warning messages! But you can check divergent transitions with summarize(chn; sections=[:internals]):

Summary Statistics
       parameters      mean       std  naive_se     mcse       ess     rhat  ess_per_sec
           Symbol   Float64   Float64   Float64  Float64   Float64  Float64      Float64

               lp   -3.9649    1.7887    0.0200   0.1062  179.1235   1.0224       6.4133
          n_steps    9.1275   11.1065    0.1242   0.7899   38.3507   1.3012       1.3731
  acceptance_rate    0.5944    0.4219    0.0047   0.0322   40.5016   1.2173       1.4501
       tree_depth    2.2444    1.3428    0.0150   0.1049   32.8514   1.3544       1.1762
  numerical_error    0.1975    0.3981    0.0045   0.0273   59.8853   1.1117       2.1441
Bayesian Statistics, Jose Storopoli 331


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

What To Do If the Markov Chains Do Not Converge?


First: before making any fine adjustments in the number of Markov chains, the number of iterations per chain, etc.:

Acknowledge that both Stan's and Turing's NUTS samplers are very efficient and effective in exploring the craziest and most diverse target posterior densities.

And the standard settings, 2,000 iterations and 4 chains, work perfectly 99% of the time.

Bayesian Statistics, Jose Storopoli 332


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

What To Do If the Markov Chains Do Not Converge?

When you have computational problems, often there’s a problem with your model.
— Gelman (2008)

Bayesian Statistics, Jose Storopoli 333


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

What To Do If the Markov Chains Do Not Converge?


If you are experiencing convergence issues, and you have ruled out that something is wrong with your model, here are a few steps to trylx.
Listed here in increasing complexity:
1. Increase the number of iterations and chains: first try increasing the number of iterations, then try increasing the number of chains (remember the default is 2,000 iterations and 4 chains).

lx
besides that, it may be worth doing a QR decomposition of the data matrix 𝑿, thus giving the sampler an orthogonal (uncorrelated) basis to explore. This makes the target distribution's geometry much friendlier, in the topological/geometrical sense, for the MCMC sampler to explore. Check the backup slides.

Bayesian Statistics, Jose Storopoli 334


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

What To Do If the Markov Chains Do Not Converge?


2. Change the HMC warmup adaptation routine: make the HMC sampler more conservative in its proposals. This can be done by increasing the hyperparameter target acceptance rate of Metropolis proposalslxi. The maximum value is 1.0 (not recommended); any value between 0.8 and 1.0 is more conservative.
3. Model reparameterization: there are two approaches, centered parameterization (CP) and non-centered parameterization (NCP).

lxi
Stan’s default is 0.8 and Turing’s default is 0.65.

Bayesian Statistics, Jose Storopoli 335


Bayesian Statistics
Markov Chain Monte Carlo (MCMC) and Model Metrics

What To Do If the Markov Chains Do Not Converge?


4. Collect more data: sometimes the model is too complex and we need a larger sample size for stable estimates.
5. Rethink the model: convergence issues with an adequate sample size might be due to incompatibility between the priors and the likelihood function(s). In this case you need to rethink the whole data-generating process underlying the model, from which the model assumptions stem.

Bayesian Statistics, Jose Storopoli 336


Model Comparison
Bayesian Statistics
Model Comparison

Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 7: Evaluating, comparing,
and expanding models
• Gelman, Hill and Vehtari (2020) - Chapter 11, Section 11.8: Cross validation
• McElreath (2020) - Chapter 7, Section 7.5: Model comparison
• Vehtari, Gelman and Gabry (2015)
• Spiegelhalter et al. (2002)
• Van Der Linde (2005)
• Watanabe and Opper (2010)
• Gelfand (1996)
• Geisser and Eddy (1979)

Bayesian Statistics, Jose Storopoli 338


Bayesian Statistics
Model Comparison

Why Compare Models?

After model parameters estimation, many times we want to measure its predictive
accuracy by itself, or for model comparison, model selection, or computing a model
performance metric (Geisser and Eddy, 1979).

Bayesian Statistics, Jose Storopoli 340


Bayesian Statistics
Model Comparison

But What About Visual Posterior Predictive Checks?


To analyze and compare models using visual posterior predictive checks is a subjective
and arbitrary approach.

There is an objective approach to compare Bayesian models which uses a robust metric
that helps us select the best model in a set of candidate models.

Having an objective way of comparing and choosing the best model is very important. In
the Bayesian workflow, we generally have several iterations between priors and
likelihood functions resulting in several different models (Gelman et al., 2020).

Bayesian Statistics, Jose Storopoli 341


Bayesian Statistics
Model Comparison

Model Comparison Techniques


We have several model comparison techniques that use predictive accuracy, but the
main ones are:
• Leave-one-out cross-validation (LOO) (Vehtari, Gelman and Gabry, 2015).
• Deviance Information Criterion (DIC) (Spiegelhalter et al., 2002), but it is known to have some issues, due to not being fully Bayesian, because it is only based on point estimates (Van Der Linde, 2005).
• Widely Applicable Information Criteria (WAIC) (Watanabe and Opper, 2010), fully Bayesian, in the sense that it uses the full posterior distribution density, and asymptotically equal to LOO (Vehtari, Gelman and Gabry, 2015).

Bayesian Statistics, Jose Storopoli 342


Bayesian Statistics
Model Comparison

Historical Interlude
In the past, we did not have computational power and data abundance. Model comparison was done based on a theoretical divergence metric originated from information theory's entropy:
𝐻(𝑝) = − E[log(𝑝𝑖)] = − ∑_{𝑖=1}^{𝑁} 𝑝𝑖 log(𝑝𝑖)

We compute the divergence by multiplying the entropy by −2lxii, so lower values are preferable:
𝐷(𝑦, 𝜽) = −2 ⋅ ∑_{𝑖=1}^{𝑁} log (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑃(𝑦𝑖 | 𝜽^𝑠)

where the term ∑_{𝑖=1}^{𝑁} log (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑃(𝑦𝑖 | 𝜽^𝑠) is the log pointwise predictive density (lppd), 𝑁 is the sample size, and 𝑆 is the number of posterior draws.

lxii
historical reasons.

Bayesian Statistics, Jose Storopoli 343


Bayesian Statistics
Model Comparison

Historical Interlude – AIC (Akaike, 1973)


AIC = 𝐷(𝑦, 𝜽) + 2𝑘 = −2lppdmle + 2𝑘

where 𝑘 is the number of the model’s free parameters and lppdmle is the maximum
likelihood estimate of the log pointwise predictive density.

AIC is an approximation and is only reliable when:

• The priors are uniform (flat priors) or totally dominated by the likelihood function.
• The posterior is approximately a multivariate normal distribution.
• The sample size 𝑁 is much larger than the number of the model's free parameters 𝑘: 𝑁 ≫ 𝑘

Bayesian Statistics, Jose Storopoli 344


Bayesian Statistics
Model Comparison

Historical Interlude – DIC (Spiegelhalter et al., 2002)


A generalization of the AIC, where we replace the maximum likelihood estimate with the posterior mean and 𝑘 with a data-based bias correction:

DIC = 𝐷(𝑦, 𝜽) + 𝑘DIC = −2 lppdBayes + 2 (lppdBayes − (1/𝑆) ∑_{𝑠=1}^{𝑆} log 𝑃(𝑦 | 𝜽^𝑠))

where the term in parentheses is the bias-corrected 𝑘.

DIC removes the restriction of uniform AIC priors, but still keeps the assumptions that the posterior is a multivariate Gaussian/normal distribution and that 𝑁 ≫ 𝑘.

Bayesian Statistics, Jose Storopoli 345


Bayesian Statistics
Model Comparison

Predictive Accuracy
With current computational power, we do not need approximationslxiii.

We can discuss objective metrics of predictive accuracy.

But, first, let's define what predictive accuracy is.

lxiii
AIC, DIC etc.

Bayesian Statistics, Jose Storopoli 346


Bayesian Statistics
Model Comparison

Predictive Accuracy
Bayesian approaches measure predictive accuracy using posterior draws 𝑦̃ from the model. For that we have the predictive posterior distribution:

𝑝(𝑦̃ | 𝑦) = ∫ 𝑝(𝑦̃𝑖 | 𝜃) 𝑝(𝜃 | 𝑦) d𝜃

where 𝑝(𝜃 | 𝑦) is the model's posterior distribution. The above equation means that we evaluate the integral with respect to the whole joint probability of the model's predictive posterior distribution and posterior distribution.

The higher the predictive posterior distribution 𝑝(𝑦̃ | 𝑦), the better the model's predictive accuracy.

Bayesian Statistics, Jose Storopoli 347


Bayesian Statistics
Model Comparison

Predictive Accuracy
To make samples comparable, we calculate the expectation of this measure for each one of the 𝑁 sample observations:

elpd = ∑_{𝑖=1}^{𝑁} ∫ 𝑝𝑡(𝑦̃𝑖) log 𝑝(𝑦̃𝑖 | 𝑦) d𝑦̃

where elpd is the expected log pointwise predictive density, and 𝑝𝑡(𝑦̃𝑖) is the distribution that represents the true underlying data-generating process of 𝑦̃𝑖.
The 𝑝𝑡(𝑦̃𝑖) are unknown, and we generally use cross-validation or approximations to estimate the elpd.

Bayesian Statistics, Jose Storopoli 348


Bayesian Statistics
Model Comparison

Leave-One-Out Cross-Validation (LOO)


We can compute the elpd using LOO (Vehtari, Gelman and Gabry, 2015):

elpdloo = ∑_{𝑖=1}^{𝑁} log 𝑝(𝑦𝑖 | 𝑦₋ᵢ)

where

𝑝(𝑦𝑖 | 𝑦₋ᵢ) = ∫ 𝑝(𝑦𝑖 | 𝜃) 𝑝(𝜃 | 𝑦₋ᵢ) d𝜃

which is the predictive density conditioned on the data without a single observation 𝑖 (𝑦₋ᵢ). Almost always we use the PSIS-LOOlxiv approximation due to its robustness and low computational cost.

lxiv
upcoming…

Bayesian Statistics, Jose Storopoli 349


Bayesian Statistics
Model Comparison

Widely Applicable Information Criteria (WAIC)


WAIC (Watanabe and Opper, 2010), like LOO, is also an alternative approach to compute the elpd, and is defined as:

elpd̂waic = lppd̂ − 𝑝̂waic

where 𝑝̂waic is the effective number of parameters, based on:

𝑝̂waic = ∑_{𝑖=1}^{𝑁} varpost(log 𝑝(𝑦𝑖 | 𝜃))

which we can compute using the posterior variance of the log predictive density for each observation 𝑦𝑖:

𝑝̂waic = ∑_{𝑖=1}^{𝑁} 𝑉_{𝑠=1}^{𝑆}(log 𝑝(𝑦𝑖 | 𝜃^𝑠))

where 𝑉_{𝑠=1}^{𝑆} is the sample variance:

𝑉_{𝑠=1}^{𝑆} 𝑎𝑠 = (1/(𝑆 − 1)) ∑_{𝑠=1}^{𝑆} (𝑎𝑠 − 𝑎̄)²

Bayesian Statistics, Jose Storopoli 350
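A minimal sketch in Julia of these formulas, starting from a matrix loglik of pointwise log-likelihoods with 𝑆 rows (posterior draws) and 𝑁 columns (observations), which you would extract from your fitted model (names and the stand-in data are ours):

using Statistics

# log-sum-exp, to average densities on the log scale without underflow
logsumexp(x) = (m = maximum(x); m + log(sum(exp.(x .- m))))

function waic(loglik::AbstractMatrix)
    S, N = size(loglik)
    # lppd_i = log[(1/S) Σ_s p(y_i | θ^s)]
    lppd = [logsumexp(loglik[:, i]) - log(S) for i in 1:N]
    # p̂_waic,i = posterior (sample) variance of log p(y_i | θ^s)
    p_waic = [var(loglik[:, i]) for i in 1:N]
    elpd = sum(lppd) - sum(p_waic)
    return (elpd_waic = elpd, p_waic = sum(p_waic), waic = -2 * elpd)
end

loglik = randn(4_000, 100) .- 1.0   # stand-in for real pointwise log-likelihoods
println(waic(loglik))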


Bayesian Statistics
Model Comparison

𝐾-fold Cross-Validation (𝐾-fold CV)


In the same manner that we can compute the elpd using LOO with 𝑁 − 1 sample partitions, we can also compute it with any desired number of partitions.

Such an approach is called 𝐾-fold cross-validation (𝐾-fold CV).

Contrary to LOO, we cannot approximate the actual elpd using 𝐾-fold CV; we need to compute the actual elpd over 𝐾 partitions, which almost always involves a high computational cost.

Bayesian Statistics, Jose Storopoli 351


Bayesian Statistics
Model Comparison

Pareto Smoothed Importance Sampling LOO (PSIS-LOO)


PSIS uses importance samplinglxv, which means an importance weighting scheme approach.

The Pareto smoothing is a technique to increase the reliability of the importance weights.

lxv
another class of MCMC algorithm that we did not cover yet.

Bayesian Statistics, Jose Storopoli 352


Bayesian Statistics
Model Comparison

Importance Sampling
If the 𝑁 samples are conditionally independentlxvi (Gelfand, Dey and Chang, 1992), we can compute LOO with the posterior samples 𝜽^𝑠 from 𝑃(𝜃 | 𝑦) using importance weights:

𝑟𝑖^𝑠 = 1 / 𝑃(𝑦𝑖 | 𝜃^𝑠) ∝ 𝑃(𝜃^𝑠 | 𝑦₋ᵢ) / 𝑃(𝜃^𝑠 | 𝑦)

Hence, to get Importance Sampling Leave-One-Out (IS-LOO):

𝑃(𝑦̃𝑖 | 𝑦₋ᵢ) ≈ ∑_{𝑠=1}^{𝑆} 𝑟𝑖^𝑠 𝑃(𝑦̃𝑖 | 𝜃^𝑠) / ∑_{𝑠=1}^{𝑆} 𝑟𝑖^𝑠

lxvi
that is, they are independent if conditioned on the model’s parameters, which is a basic assumption in any Bayesian (and frequentist) model

Bayesian Statistics, Jose Storopoli 353


Bayesian Statistics
Model Comparison

Importance Sampling
However, the posterior 𝑃(𝜃 | 𝑦) often has lower variance and shorter tails than the LOO distributions 𝑃(𝜃 | 𝑦₋ᵢ). Hence, if we use:

𝑃(𝑦̃𝑖 | 𝑦₋ᵢ) ≈ ∑_{𝑠=1}^{𝑆} 𝑟𝑖^𝑠 𝑃(𝑦̃𝑖 | 𝜃^𝑠) / ∑_{𝑠=1}^{𝑆} 𝑟𝑖^𝑠

we will have instabilities, because the 𝑟𝑖 can have high, or even infinite, variance.

Bayesian Statistics, Jose Storopoli 354


Bayesian Statistics
Model Comparison

Pareto Smoothed Importance Sampling


We can enhance the IS-LOO estimate using Pareto Smoothed Importance Sampling (Vehtari, Gelman and Gabry, 2015).

When the tails of the importance weights' distribution are long, a direct use of the importance weights is sensitive to one or more large values. By fitting a generalized Pareto distribution to the importance weights' upper tail, we smooth out these values.

Bayesian Statistics, Jose Storopoli 355


Bayesian Statistics
Model Comparison

Pareto Smoothed Importance Sampling LOO (PSIS-LOO)


Finally, we have PSIS-LOO:

elpd̂psis-loo = ∑_{𝑖=1}^{𝑛} log(∑_{𝑠=1}^{𝑆} 𝑤𝑖^𝑠 𝑃(𝑦𝑖 | 𝜃^𝑠) / ∑_{𝑠=1}^{𝑆} 𝑤𝑖^𝑠)

where the 𝑤𝑖^𝑠 are the truncated (smoothed) importance weights.

Bayesian Statistics, Jose Storopoli 356


Bayesian Statistics
Model Comparison

Pareto Smoothed Importance Sampling LOO (PSIS-LOO)


We use the estimated shape parameter 𝑘̂ of the generalized Pareto distribution fitted to the importance weights to assess their reliability:
• 𝑘 < ½: the variance of the importance weights is finite, the central limit theorem holds, and the estimate converges rapidly.
• ½ < 𝑘 < 1: the variance of the importance weights is infinite, but the mean exists (is finite); the generalized central limit theorem for stable distributions holds, and the estimate converges, but more slowly. The PSIS variance estimate is finite, but could be large.
• 𝑘 > 1: both the variance and the mean of the importance weights do not exist (they are infinite). The PSIS variance estimate is finite, but could be large.

Any 𝑘̂ > 0.5 is a warning sign, but empirically there is still good performance up to 𝑘̂ < 0.7.

Bayesian Statistics, Jose Storopoli 357


Backup Slides
Bayesian Statistics
Backup Slides

How the Normal distribution aroselxvii


Binomial(𝑛, 𝑘) = (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘}
𝑛! ≈ √(2𝜋𝑛) (𝑛/𝑒)^𝑛
lim_{𝑛→∞} (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘} = (1/√(2𝜋𝑛𝑝𝑞)) 𝑒^{−(𝑘−𝑛𝑝)²/(2𝑛𝑝𝑞)}

We know that in the binomial: E = 𝑛𝑝 and Var = 𝑛𝑝𝑞; hence, replacing E by 𝜇 and Var by 𝜎²:
lim_{𝑛→∞} (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘} = (1/(𝜎√(2𝜋))) 𝑒^{−(𝑘−𝜇)²/(2𝜎²)}
𝑛→∞ 𝑘 𝜎 2𝜋

lxvii
Origins can be traced back to Abraham de Moivre in 1738. A better explanation can be found by clicking here.

Bayesian Statistics, Jose Storopoli 359


Bayesian Statistics
Backup Slides

QR Decomposition
In Linear Algebra 101, we learn that any matrix (even non-square ones) can be decomposed into a product of two matrices:
• 𝑸: an orthogonal matrix (its columns are orthogonal unit vectors, i.e. 𝑸ᵀ = 𝑸⁻¹)
• 𝑹: an upper-triangular matrix
Now, we incorporate the QR decomposition into the linear regression model. Here, I am going to use the "thin" QR instead of the "fat" one, which scales the 𝑸 and 𝑹 matrices by a factor of √(𝑛 − 1), where 𝑛 is the number of rows in 𝑿. In practice, it is better to implement the thin QR decomposition than the fat one: it is more numerically stable. Mathematically speaking, the thin QR decomposition is:
𝑿 = 𝑸∗ 𝑹∗
𝑸∗ = 𝑸 ⋅ √(𝑛 − 1)
𝑹∗ = (1/√(𝑛 − 1)) ⋅ 𝑹
𝝁 = 𝛼 + 𝑿 ⋅ 𝜷 + 𝜎
  = 𝛼 + 𝑸∗ ⋅ 𝑹∗ ⋅ 𝜷 + 𝜎
  = 𝛼 + 𝑸∗ ⋅ (𝑹∗ ⋅ 𝜷) + 𝜎
  = 𝛼 + 𝑸∗ ⋅ 𝜷̃ + 𝜎

Bayesian Statistics, Jose Storopoli 360
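A minimal sketch in Julia of the thin QR reparameterization (variable names are ours; β_tilde stands in for the coefficients you would actually sample on the 𝑸∗ scale):

using LinearAlgebra

n, k = 100, 3
X = randn(n, k)                 # stand-in design matrix

F = qr(X)
Q = Matrix(F.Q)                 # "thin" Q: n × k with orthonormal columns
R = F.R                         # k × k upper-triangular

Qstar = Q .* sqrt(n - 1)
Rstar = R ./ sqrt(n - 1)

println(isapprox(Qstar * Rstar, X; atol = 1e-8))   # X = Q* R* holds

# After sampling β̃ on the Q* scale, recover the original coefficients:
β_tilde = randn(k)              # stand-in for posterior draws of β̃
β = Rstar \ β_tilde             # β = (R*)⁻¹ β̃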


Bayesian Statistics
Backup Slides

Bibliography
Akaike, H. (1973) “Information theory and an extension of the maximum likelihood
principle,” Second International Symposium on Information Theory. Edited by B. N.
Petrov and F. Csaki
Amrhein, V., Greenland, S. and McShane, B. (2019) “Scientists Rise up against Statistical
Significance,” Nature, 567(7748), pp. 305–307. Available at: https://doi.org/10.1038/d41586-
019-00857-9
Bates, D. et al. (2022) JuliaStats/MixedModels.jl. Zenodo. Available at: https://doi.org/10.
5281/ZENODO.6925652
Bates, D. et al. (2015) “Fitting Linear Mixed-Effects Models Using lme4,” Journal of
Statistical Software, 67(1), pp. 1–48. Available at: https://doi.org/10.18637/jss.v067.i01

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Benjamin, D. J. et al. (2018) “Redefine Statistical Significance,” Nature Human Behaviour,


2(1), pp. 6–10. Available at: https://doi.org/10.1038/s41562-017-0189-z
Bertsekas, D. P. and Tsitsiklis, J. N. (2008) Introduction to Probability, 2nd Edition. 2nd
edition. Belmont, Massachusetts: Athena Scientific
Beskos, A. et al. (2013) “Optimal Tuning of the Hybrid Monte Carlo Algorithm,” Bernoulli,
19(5A), pp. 1501–1534. Available at: https://doi.org/10.3150/12-BEJ414
Betancourt, M. (2017) A Conceptual Introduction to Hamiltonian Monte Carlo. Available at:
http://arxiv.org/abs/1701.02434 (Accessed: November 6, 2019)
Betancourt, M. (2019) Probabilistic Building Blocks. Available at: https://betanalpha.
github.io/assets/case_studies/probability_densities.html (Accessed: May 27, 2021)

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Betancourt, M. (2021) Sparsity Blues. Available at: https://betanalpha.github.io/assets/


case_studies/modeling_sparsity.html (Accessed: December 9, 2023)
Bhadra, A. et al. (2015) “The Horseshoe+ Estimator of Ultra-Sparse Signals”
Blei, D. M. (2014) “Build, Compute, Critique, Repeat: Data Analysis with Latent Variable
Models,” Annual Review of Statistics and Its Application, 1(1), pp. 203–232. Available at:
https://doi.org/10.1146/annurev-statistics-022513-115657
Box, G. E. P. (1976) “Science and Statistics,” Journal of the American Statistical
Association, 71(356), pp. 791–799. Available at: https://doi.org/10.2307/2286841
Brooks, S. et al. (2011) Handbook of Markov Chain Monte Carlo. CRC Press

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Bürkner, P.-C. and Vuorre, M. (2019) “Ordinal Regression Models in Psychology: A Tutorial,”
Advances in Methods and Practices in Psychological Science, 2(1), p. 77–78. Available at:
https://doi.org/10.1177/2515245918823199
Carpenter, B. et al. (2017) “Stan : A Probabilistic Programming Language,” Journal of
Statistical Software, 76(1). Available at: https://doi.org/10.18637/jss.v076.i01
Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009) “Handling sparsity via the horseshoe,”
in Artificial intelligence and statistics, pp. 73–80
Casella, G. and George, E. I. (1992) “Explaining the Gibbs Sampler,” The American
Statistician, 46(3), pp. 167–174. Available at: https://doi.org/10.1080/00031305.1992.
10475878

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Chib, S. and Greenberg, E. (1995) “Understanding the Metropolis-Hastings Algorithm,” The


American Statistician, 49(4), pp. 327–335. Available at: https://doi.org/10.1080/00031305.
1995.10476177
Dekking, F. M. et al. (2010) A Modern Introduction to Probability and Statistics:
Understanding Why and How. Springer
Diaconis, P. and Skyrms, B. (2019) Ten Great Ideas about Chance. Princeton University
Press
Duane, S. et al. (1987) “Hybrid Monte Carlo,” Physics Letters B, 195(2), pp. 216–222.
Available at: https://doi.org/10.1016/0370-2693(87)91197-X

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Eckhardt, R. (1987) “Stan Ulam, John von Neumann, and the Monte Carlo Method,” Los
Alamos Science, 15(30), pp. 131–136
Finetti, B. de (1974) Theory of Probability. Volume1 ed. New York: John Wiley & Sons
Fisher, R. A. (1925) Statistical methods for research workers. Oliver, Boyd
Fisher, R. A. (1962) “Some Examples of Bayes' Method of the Experimental Determination
of Probabilities A Priori,” Journal of the Royal Statistical Society. Series B
(Methodological), 24(1), pp. 118–124
Ge, H., Xu, K. and Ghahramani, Z. (2018) “Turing: A Language for Flexible Probabilistic
Inference,” in International Conference on Artificial Intelligence and Statistics. PMLR, pp.
1682–1690

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Geisser, S. and Eddy, W. F. (1979) “A predictive approach to model selection,” Journal of


the American Statistical Association, 74(365), pp. 153–160
Gelfand, A. E., Dey, D. K. and Chang, H. (1992) “Model determination using predictive
distributions with implementation via sampling-based methods,” Bayesian Statistics.
Edited by J. M. Bernardo et al. Oxford University Press
Gelfand, A. E. (1996) “Model determination using sampling-based methods,” Markov
chain Monte Carlo in practice, pp. 145–161
Gelman, A. (1992) “Iterative and Non-Iterative Simulation Algorithms,” in Computing
Science and Statistics (Interface Proceedings). PROCEEDINGS PUBLISHED BY VARIOUS
PUBLISHERS, pp. 457–511

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Gelman, A. (2008) The Folk Theorem of Statistical Computing. Available at: https://
statmodeling.stat.columbia.edu/2008/05/13/the_folk_theore/
Gelman, A. and Hill, J. (2007) Data Analysis Using Regression and Multilevel/Hierarchical
Models. Cambridge university press
Gelman, Andrew, Carlin, John B., Stern, et al. (2013) “Basics of Markov Chain Simulation,”
Bayesian Data Analysis. Chapman and Hall/CRC
Gelman, Andrew, Carlin, John B., Stern, et al. (2013) Bayesian Data Analysis. Chapman and
Hall/CRC
Gelman, A., Hill, J. and Vehtari, A. (2020) Regression and Other Stories. Cambridge
University Press

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Gelman, A. et al. (2020) Bayesian Workflow. Available at: http://arxiv.org/abs/2011.01808


(Accessed: February 4, 2021)
Geman, S. and Geman, D. (1984) “Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, (6), pp. 721–741. Available at: https://doi.org/10.1109/TPAMI.1984.4767596
Goodman, S. N. (2016) “Aligning Statistical and Scientific Reasoning,” Science, 352(6290),
pp. 1180–1181. Available at: https://doi.org/10.1126/science.aaf5406
Grimmett, G. and Stirzaker, D. (2020) Probability and Random Processes: Fourth Edition.
Fourth Edition, New to this Edition:. Oxford, New York: Oxford University Press

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Hastings, W. K. (1970) “Monte Carlo Sampling Methods Using Markov Chains and Their
Applications,” Biometrika, 57(1), pp. 97–109. Available at: https://doi.org/10.1093/biomet/
57.1.97
Head, M. L. et al. (2015) “The extent and consequences of p-hacking in science,” PLoS
Biol, 13(3), p. e1002106
Hoffman, M. D. and Gelman, A. (2011) “The No-U-Turn Sampler: Adaptively Setting Path
Lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15(1), pp.
1593–1623. Available at: http://arxiv.org/abs/1111.4246

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Ioannidis, J. P. A. (2019) "What Have We (Not) Learnt from Millions of Scientific Papers with P Values?," The American Statistician, 73(sup1), pp. 20–25. Available at:
https://doi.org/10.1080/00031305.2018.1447512
Iserles, A. (2008) A First Course in the Numerical Analysis of Differential Equations. 2nd
ed. USA: Cambridge University Press
Jaynes, E. T. (2003) Probability Theory: The Logic of Science. Cambridge university press
Khan, M. E. and Rue, H. (2021) The Bayesian Learning Rule. Available at: http://arxiv.org/
abs/2107.04562 (Accessed: July 13, 2021)
Kolmogorov, A. N. (1933) Foundations of the Theory of Probability. Berlin: Julius Springer

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Kruschke, J. K. and Vanpaemel, W. (2015) “Bayesian Estimation in Hierarchical Models,”


The Oxford Handbook of Computational and Mathematical Psychology. Edited by J. R.
Busemeyer et al. Oxford University Press Oxford, UK
Kurt, W. (2019) Bayesian Statistics the Fun Way: Understanding Statistics and Probability
with Star Wars, LEGO, and Rubber Ducks. Illustrated edition. San Francisco: No Starch
Press
Lakens, D. et al. (2018) “Justify Your Alpha,” Nature Human Behaviour, 2(3), pp. 168–171.
Available at: https://doi.org/10.1038/s41562-018-0311-x

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Lewandowski, D., Kurowicka, D. and Joe, H. (2009) “Generating random correlation


matrices based on vines and extended onion method,” Journal of multivariate analysis,
100(9), pp. 1989–2001
Van Der Linde, A. (2005) “DIC in variable selection,” Statistica Neerlandica, 59(1), pp. 45–
56
McElreath, R. (2020) Statistical Rethinking: A Bayesian Course with Examples in R and
Stan. CRC press
Metropolis, N. et al. (1953) “Equation of State Calculations by Fast Computing Machines,”
The Journal of Chemical Physics, 21(6), pp. 1087–1092. Available at: https://doi.org/10.
1063/1.1699114

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Neal, R. M. (1994) “An Improved Acceptance Procedure for the Hybrid Monte Carlo
Algorithm,” Journal of Computational Physics, 111(1), pp. 194–203. Available at: https://doi.
org/10.1006/jcph.1994.1054
Neal, R. M. (2003) “Slice Sampling,” The Annals of Statistics, 31(3), pp. 705–741
Neal, R. M. (2011) “MCMC Using Hamiltonian Dynamics,” Handbook of Markov Chain Monte
Carlo. Edited by S. Brooks et al.
Nesterov, Y. (2009) “Primal-dual subgradient methods for convex problems,”
Mathematical programming, 120(1), pp. 221–259

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Neyman, J. (1937) “Outline of a theory of statistical estimation based on the classical


theory of probability,” Philosophical Transactions of the Royal Society of London. Series
A, Mathematical and Physical Sciences, 236(767), pp. 333–380
Piironen, J. and Vehtari, A. (2017) “Sparsity information and regularization in the
horseshoe and other shrinkage priors,” Electronic Journal of Statistics, 11(2), pp. 5018–
5051. Available at: https://doi.org/10.1214/17-EJS1337SI
Roberts, G. O., Gelman, A. and Gilks, W. R. (1997) “Weak Convergence and Optimal Scaling
of Random Walk Metropolis Algorithms,” Annals of Applied Probability, 7(1), pp. 110–120.
Available at: https://doi.org/10.1214/aoap/1034625254

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Rosnow, R. L. and Rosenthal, R. (1989) “Statistical procedures and the justification of


knowledge in psychological science,” American Psychologist, 44, pp. 1276–1284
Salvatier, J., Wiecki, T. V. and Fonnesbeck, C. (2016) “Probabilistic programming in Python
using PyMC3,” PeerJ Computer Science, 2, p. e55
Schoot, R. van de et al. (2021) “Bayesian Statistics and Modelling,” Nature Reviews
Methods Primers, 1(1), pp. 1–26. Available at: https://doi.org/10.1038/s43586-020-00001-2
Semenova, E. (2019) Ordered Logistic regression and Probabilistic Programming:with
examples in Stan, PyMC3 and Turing. Available at: https://medium.com/@liza_p_
semenova/ordered-logistic-regression-and-probabilistic-programming-502d8235ad3f

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Spiegelhalter, D. J. et al. (2002) “Bayesian measures of model complexity and fit,” Journal
of the royal statistical society: Series b (statistical methodology), 64(4), pp. 583–639
Storopoli, J. (2021) Bayesian Statistics with Julia and Turing. Available at: https://
storopoli.io/Bayesian-Julia
Tibshirani, R. (1996) “Regression shrinkage and selection via the lasso,” Journal of the
Royal Statistical Society Series B: Statistical Methodology, 58(1), pp. 267–288
Vehtari, A., Gelman, A. and Gabry, J. (2015) Practical Bayesian Model Evaluation Using
Leave-One-out Cross-Validation and WAIC. Available at: https://doi.org/10.1007/s11222-
016-9696-4

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Wasserstein, R. L. and Lazar, N. A. (2016) “The ASA's Statement on p-Values: Context,


Process, and Purpose,” American Statistician, 70(2), pp. 129–133. Available at: https://doi.
org/10.1080/00031305.2016.1154108
Wasserstein, R. L., Schirm, A. L. and Lazar, N. A. (2019) “Moving to a World Beyond “p <
0.05”, American Statistician, 73(sup1), pp. 1–19. Available at: https://doi.org/10.1080/
00031305.2019.1583913
Watanabe, S. and Opper, M. (2010) “Asymptotic equivalence of Bayes cross validation and
widely applicable information criterion in singular learning theory.,” Journal of machine
learning research, 11(12)

Bayesian Statistics, Jose Storopoli 361


Bayesian Statistics
Backup Slides

Zhang, Y. D. et al. (2022) “Bayesian regression using a prior on the model fit: The r2-d2
shrinkage prior,” Journal of the American Statistical Association, 117(538), pp. 862–874
Zou, H. and Hastie, T. (2005) “Regularization and variable selection via the elastic net,”
Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2), pp. 301–
320
“It’s Time to Talk about Ditching Statistical Significance” (2019) Nature, 567(7748), p. 283–
284. Available at: https://doi.org/10.1038/d41586-019-00874-8

Bayesian Statistics, Jose Storopoli 361
