slides
Jose Storopoli
License
Outline
1. Tools
2. Bayesian Statistics
3. Statistical Distributions
4. Priors
5. Bayesian Workflow
6. Linear Regression
7. Logistic Regression
8. Ordinal Regression
9. Poisson Regression
10. Robust Regression
11. Sparse Regression
12. Hierarchical Models
13. Markov Chain Monte Carlo (MCMC) and Model Metrics
14. Model Comparison
15. Backup Slides
Recommended References
• Ge, Xu and Ghahramani (2018) - Turing paper
• Carpenter et al. (2017) - Stan paper
• Salvatier, Wiecki and Fonnesbeck (2016) - PyMC paper
• Bayesian Statistics with Julia and Turing - Why Julia?
Tools
• Stan (BSD-3 License)
• Turing (MIT License)
• PyMC (Apache License)
• JAGS (GPL License)
• BUGS (GPL License)
Stanii
• High-performance platform for statistical modeling and statistical computation
• Financial support from NUMFocus:
‣ AWS Amazon
‣ Bloomberg
‣ Microsoft
‣ IBM
‣ RStudio
‣ Facebook
‣ NVIDIA
‣ Netflix
• Open-source language, similar to C++
• Markov Chain Monte Carlo (MCMC) parallel sampler
ii
Carpenter et al. (2017)
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
alpha ~ normal(0, 20);
beta ~ normal(0, 2);
sigma ~ cauchy(0, 2.5);
y ~ normal(alpha + beta * x, sigma);
}
Turingiii
• Ecosystem of Julia packages for Bayesian Inference using probabilistic
programming
• Julia is a fast dynamically-typed language that just-in-time (JIT) compiles into
native code using LLVM: “runs like C but reads like Python”; meaning that it
is blazing fast and easy to prototype and to read/write code
• Julia has financial support from NUMFocus
• Composability with other Julia packages
• Several other options of Markov Chain Monte Carlo (MCMC) samplers
iii
Ge, Xu and Ghahramani (2018)
Turing Ecosystem
We have several Julia packages under Turing’s GitHub organization TuringLang, but I will focus on 6 of those:
• Turing: main package that we use to interface with all the Turing ecosystem of packages and the backbone of
everything
• MCMCChains: interface to summarizing MCMC simulations and has several utility functions for diagnostics
and visualizations
• DynamicPPL: specifies a domain-specific language for Turing, entirely written in Julia, and it is modular
• AdvancedHMC: modular and efficient implementation of advanced Hamiltonian Monte Carlo (HMC)
algorithms
• DistributionsAD: defines the necessary functions to enable automatic differentiation (AD) of the log PDF
functions from Distributions
• Bijectors: implements a set of functions for transforming constrained random variables (e.g. simplexes,
intervals) to Euclidean space
@model function linear_regression(x, y)
    α ~ Normal(0, 20)
    β ~ Normal(0, 2)
    σ ~ truncated(Cauchy(0, 2.5), 0, Inf)
    y .~ Normal.(α .+ β * x, σ)
end
iv
I believe in Julia’s potential and wrote a whole set of Bayesian Statistics tutorials using Julia and Turing (Storopoli, 2021)
PyMCv
• Python package for Bayesian statistics with a Markov Chain Monte
Carlo sampler
• Financial support from NUMFocus
• Backend was based on Theano
• Theano died, but PyMC developers created a fork named Aesara
• We have no idea what the backend will be in the future. PyMC
developers are still experimenting with other backends:
TensorFlow Probability, NumPyro, BlackJAX, and so on …
v
Salvatier, Wiecki and Fonnesbeck (2016)
likelihood = pm.Normal("y",
mu=alpha + beta * x1,
sigma=sigma, observed=y)
Turing vs. Stan
Why Turing
• Julia all the way down…
• Can interface/compose any Julia package
• Decoupling of modeling DSL, inference algorithms and data
• Not only HMC-NUTS, but a whole plethora of MCMC algorithms, e.g. Metropolis-
Hastings, Gibbs, SMC, IS etc.
• Easy to create/prototype/modify inference algorithms
• Transparent MCMC workflow, e.g. iterative sampling API allows step-wise execution
and debugging of the inference algorithm
• Very easy to do stuff in the GPU, e.g. NVIDIA’s CUDA.jl, AMD’s AMDGPU.jl, Intel’s
oneAPI.jl, and Apple’s Metal.jl
• Very easy to do distributed model inference and prediction.
Why Stan
• API for R, Python and Julia.
• Faster than Turing.jl in 95% of models.
• Well-known in the academic community.
• High citation count.
• More tutorials, example models, and learning materials available.
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 1: Probability and inference
• McElreath (2020) - Chapter 1: The Golem of Prague
• Gelman, Hill and Vehtari (2020) - Chapter 3: Some basic methods in mathematics and probability
• Khan and Rue (2021)
• Probability:
‣ A great textbook - Bertsekas and Tsitsiklis (2008)
‣ Also a great textbook (skip the frequentist part) - Dekking et al. (2010)
‣ Bayesian point-of-view and also a philosophical approach - Jaynes (2003)
‣ Bayesian point-of-view with a simple and playful approach - Kurt (2019)
‣ Philosophical approach not so focused on mathematical rigor - Diaconis and Skyrms (2019)
Bayesian statistics is a data analysis approach based on Bayes’ theorem where available
knowledge about the parameters of a statistical model is updated with the information
of observed data. (Andrew Gelman, John B. Carlin, Stern, et al., 2013).
The posterior can also be used to make predictions about future events.
vi
like LEGO
vii
Finetti (1974)
viii
Finetti (1974)
Probability Interpretations
• Objective - frequency in the long run for an event:
‣ 𝑃 (rain) = days that rained / total days
What is Probability?
We define 𝐴 as an event and 𝑃 (𝐴) as the probability of event 𝐴.
𝑃 (𝐴) has to be between 0 and 1, where higher values denote a higher probability of 𝐴
happening.
𝑃 (𝐴) ∈ ℝ
𝑃 (𝐴) ∈ [0, 1]
0 ≤ 𝑃 (𝐴) ≤ 1
Probability Axiomsix
• Non-negativity: For every 𝐴: 𝑃 (𝐴) ≥ 0
• Normalization: The probability of the whole sample space is 1: 𝑃 (Θ) = 1
• Additivity: For disjoint 𝐴 and 𝐵: 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵)
ix
Kolmogorov (1933)
Sample Space
• Discrete: Θ = {1, 2, …}
• Continuous: Θ ∈ (−∞, ∞)
∫_{𝑥 ∈ 𝑋} 𝑝(𝑥) d𝑥 = 1
Conditional Probability
Probability of an event occurring given that another event has (or has not) occurred.
The notation we use is 𝑃 (𝐴 | 𝐵), which reads as “the probability of observing 𝐴 given that
we already observed 𝐵”.
𝑃 (𝐴 | 𝐵) = (number of elements in A and B) / (number of elements in B)

𝑃 (𝐴 | 𝐵) = 𝑃 (𝐴 ∩ 𝐵) / 𝑃 (𝐵)
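The ratio-of-counts definition above can be sketched in a few lines of Python (an illustrative example not from the slides, using two rolls of a fair die):

```python
# Conditional probability as a ratio of counts, on two rolls of a fair die.
# A = "the two rolls sum to 8", B = "the first roll is 3".
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]  # sample space
A = {s for s in omega if s[0] + s[1] == 8}
B = {s for s in omega if s[0] == 3}

# P(A | B) = P(A ∩ B) / P(B) = |A ∩ B| / |B|
p_A_given_B = Fraction(len(A & B), len(B))
print(p_A_given_B)  # 1/6 — only (3, 5) sums to 8 when the first roll is 3
```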
• 𝑃 (pope): Probability of some random person being the Pope, something really small, 1 in 8 billion (1/(8 ⋅ 10⁹))
• 𝑃 (catholic): Probability of some random person being catholic, 1.34 billion in 8 billion (1.34/8 ≈ 0.17)
• 𝑃 (catholic | pope): Probability of the Pope being catholic (999/1000 = 0.999)
• 𝑃 (pope | catholic): Probability of a catholic person being the Pope (1/(1.34 ⋅ 10⁹) ⋅ 0.999 ≈ 7.46 ⋅ 10⁻¹⁰)
x
More specifically, if the base rates 𝑃 (𝐴) and 𝑃 (𝐵) aren’t equal, the symmetry is broken: 𝑃 (𝐴 | 𝐵) ≠ 𝑃 (𝐵 | 𝐴)
Joint Probability
Probability of two or more events occurring.
The notation we use is 𝑃 (𝐴, 𝐵), which reads as “the probability of observing 𝐴 and also
observing 𝐵”.
𝑃 (𝐴, 𝐵) = number of elements in A and B
𝑃 (𝐴, 𝐵) = 𝑃 (𝐴 ∩ 𝐵)
𝑃 (𝐴, 𝐵) = 𝑃 (𝐵, 𝐴)
We can decompose a joint probability 𝑃 (𝐴, 𝐵) into the product of two probabilities:
𝑃 (𝐴, 𝐵) = 𝑃 (𝐵, 𝐴)
𝑃 (𝐴) ⋅ 𝑃 (𝐵 | 𝐴) = 𝑃 (𝐵) ⋅ 𝑃 (𝐴 | 𝐵)
xi
also called the Product Rule of Probability.
Bayes Theorem
Tells us how to “invert” conditional probability:
𝑃 (𝐴 | 𝐵) = (𝑃 (𝐴) ⋅ 𝑃 (𝐵 | 𝐴)) / 𝑃 (𝐵)
xii
Adapted from: Yudkowsky - An Intuitive Explanation of Bayes’ Theorem
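A small Python sketch (not from the slides) that reruns the Pope/catholic numbers from the earlier slide through Bayes’ theorem:

```python
# Bayes' theorem "inverts" a conditional probability:
# P(A | B) = P(A) * P(B | A) / P(B), using the Pope/catholic example.
p_pope = 1 / 8e9                 # P(pope): 1 person in 8 billion
p_catholic = 1.34e9 / 8e9        # P(catholic): 1.34 billion in 8 billion
p_catholic_given_pope = 0.999    # P(catholic | pope)

p_pope_given_catholic = p_catholic_given_pope * p_pope / p_catholic
print(f"{p_pope_given_catholic:.2e}")  # 7.46e-10, matching the slide
```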
NO!
𝑝-value is the probability of obtaining results at least as extreme as the observed, given
that the null hypothesis 𝐻0 is true:
𝑃 (𝐷 | 𝐻0 )
• To quantify the strength of the evidence against the null hypothesis, Fisher
defended “𝑝 < 0.05 as the standard level to conclude that there is evidence
against the tested hypothesis”
• “We should not be off-track if we draw a conventional line at 0.05”
𝑝 = 0.06
• Since 𝑝-value is a probability, it is also a continuous measure.
• Robert Rosenthal, a psychologist said “surely, God loves the .06 nearly as much as the
.05” (Rosnow and Rosenthal, 1989).
xiii
inverse probability is what Bayes’ theorem was called at the beginning of the 20th century.
xiv
quote from Dennis Lindley.
• 𝜃 – parameter(s) of interest
• 𝑦 – observed data
• Prior: prior probability of the parameter(s) value(s)
• Likelihood: probability of the observed data given the parameter(s) value(s)
• Posterior: posterior probability of the parameter(s) value(s) after we observed data 𝑦
• Normalizing Constantxv: 𝑃 (𝑦) does not make much intuitive sense. It only exists so that the product
𝑃 (𝑦 | 𝜃)𝑃 (𝜃) is constrained between 0 and 1, i.e. a valid probability.
xv
sometimes also called evidence.
This is the main feature of Bayesian statistics, since we are directly estimating 𝑃 (𝜃 | 𝑦)
using Bayes’ theorem.
The resulting estimate is totally intuitive: it simply quantifies the uncertainty that we have
about the value of one or more parameters given the data, the model assumptions
(likelihood), and the prior probability of these parameters’ values.
xvi
pun intended …
Recommended References
• Grimmett and Stirzaker (2020):
‣ Chapter 3: Discrete random variables
‣ Chapter 4: Continuous random variables
• Dekking et al. (2010):
‣ Chapter 4: Discrete random variables
‣ Chapter 5: Continuous random variables
• Betancourt (2019)
Probability Distributions
Bayesian statistics uses probability distributions as the inference engine of the
parameter and uncertainty estimates.
Imagine that probability distributions are small “Lego” pieces. We can construct
anything we want with these little pieces. We can make a castle, a house, a city; literally
anything.
The same is valid for Bayesian statistical models. We can construct models from the
simplest ones to the most complex using probability distributions and their
relationships.
For discrete random variables we speak of probability “mass”, and for continuous
random variables of probability “density”.
Mathematical Notation
We use the notation
𝑋 ∼ Dist(𝜃1 , 𝜃2 , …)
where:
• 𝑋: random variable
• Dist: distribution name
• 𝜃1 , 𝜃2 , …: parameters that define how the distribution behaves
Every probability distribution can be “parameterized” by specifying parameters that
control certain aspects of the distribution for a specific goal.
[Figure: two PDF curves over 𝑋 illustrating different parameterizations of a distribution]
Discrete Distributions
Discrete probability distributions are distributions whose outcomes are discrete
numbers: −𝑁, …, −2, −1, 0, 1, 2, …, 𝑁 with 𝑁 ∈ ℤ.
PMF(𝑥) = 𝑃 (𝑋 = 𝑥)
Discrete Uniform
The discrete uniform is a symmetric probability distribution in which a finite number of
values are equally likely to be observed. Each one of the 𝑛 values has probability 1/𝑛.
The uniform discrete distribution has two parameters and its notation is Uniform(𝑎, 𝑏):
• 𝑎 – lower bound
• 𝑏 – upper bound
Example: dice.
Discrete Uniform
Uniform(𝑎, 𝑏) = 𝑓(𝑥, 𝑎, 𝑏) = 1/(𝑏 − 𝑎 + 1) for 𝑎 ≤ 𝑥 ≤ 𝑏 and 𝑥 ∈ {𝑎, 𝑎 + 1, …, 𝑏 − 1, 𝑏}
Discrete Uniform
[Figure: PMF of Uniform(𝑎 = 1, 𝑏 = 6)]
Bernoulli
Bernoulli distribution describes a binary event of the success of an experiment. We
represent 0 as failure and 1 as success, hence the result of a Bernoulli distribution is a
binary variable 𝑌 ∈ {0, 1}.
Bernoulli distribution is often used to model binary discrete outcomes, where there are
only two possible results.
Bernoulli distribution has only a single parameter and its notation is Bernoulli(𝑝):
• 𝑝 – probability of success
Bernoulli

Bernoulli(𝑝) = 𝑓(𝑥, 𝑝) = 𝑝^𝑥 (1 − 𝑝)^(1−𝑥) for 𝑥 ∈ {0, 1}

Bernoulli

[Figure: PMF of Bernoulli(𝑝 = 1/3)]
Binomial
The binomial distribution describes the number of successes in a sequence of 𝑛
independent experiments, each asking a yes–no question with probability of success 𝑝.
Notice that the Bernoulli distribution is a special case of the binomial distribution where
𝑛 = 1.
The binomial distribution has two parameters and its notation is Binomial(𝑛, 𝑝) :
• 𝑛 – number of experiments
• 𝑝 – probability of success
Binomial
Binomial(𝑛, 𝑝) = 𝑓(𝑥, 𝑛, 𝑝) = (𝑛 choose 𝑥) 𝑝^𝑥 (1 − 𝑝)^(𝑛−𝑥) for 𝑥 ∈ {0, 1, …, 𝑛}
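A short Python sketch (illustrative, not from the slides) of the binomial PMF, checked against the fact that a valid PMF sums to 1 over its support:

```python
# Binomial PMF: f(x, n, p) = C(n, x) * p^x * (1 - p)^(n - x)
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of x successes in n independent trials with success prob p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.2
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
print(round(sum(pmf), 10))  # 1.0: probabilities over x in {0, ..., n} sum to 1
```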
Binomial
[Figure: PMFs of Binomial(𝑛 = 10, 𝑝 = 1/5) and Binomial(𝑛 = 10, 𝑝 = 1/2)]
Poisson
Poisson distribution describes the probability of a certain number of events occurring in
a fixed time interval if these events occur with a known constant mean rate,
independently of the time since the last occurrence. Poisson distribution can also be
used for the number of events in other types of intervals, such as distance, area or volume.
Example: number of e-mails that you receive daily or the number of the potholes you’ll
find in your commute.
Poisson
Poisson(𝜆) = 𝑓(𝑥, 𝜆) = (𝜆^𝑥 𝑒^(−𝜆)) / 𝑥! for 𝜆 > 0
Poisson
[Figure: PMFs of Poisson(𝜆 = 2) and Poisson(𝜆 = 3)]
Negative Binomialxvii
The negative binomial distribution describes the number of failures in a sequence of
independent experiments, each asking a yes–no question with probability of success 𝑝,
before 𝑘 successes occur.
Notice that it becomes the Poisson distribution in the limit as 𝑘 → ∞. This makes it a robust option to replace
a Poisson distribution to model phenomena with overdispersion (presence of greater variability in data than
would be expected).
The negative binomial has two parameters and its notation is Negative Binomial(𝑘, 𝑝):
• 𝑘 – number of successes
• 𝑝 – probability of success
Example: annual occurrence of tropical cyclones.
xvii
any phenomenon that can be modeled as a Poisson distribution can also be modeled as a negative binomial distribution (Andrew Gelman, John B. Carlin, Stern, et al., 2013).
Negative Binomial
Negative Binomial(𝑘, 𝑝) = 𝑓(𝑥, 𝑘, 𝑝) = (𝑥 + 𝑘 − 1 choose 𝑘 − 1) 𝑝^𝑥 (1 − 𝑝)^𝑘
for 𝑥 ∈ {0, 1, 2, …}
Negative Binomial
[Figure: PMFs of Negative Binomial(𝑘 = 1, 𝑝 = 1/2) and Negative Binomial(𝑘 = 5, 𝑝 = 1/2)]
Continuous Distributions
Continuous probability distributions are distributions whose outcomes are values on a continuous real number
line: 𝑥 ∈ ℝ = (−∞, +∞).
In continuous probability distributions we call the probability of the distribution taking certain values “density”.
Since we are referring to real numbers, we cannot obtain the probability of a random variable 𝑋 taking exactly
the value 𝑥: this probability is always 0, since we cannot specify the exact value of 𝑥 on the real number
line. Hence, we specify the probability of 𝑋 taking values in an interval [𝑎, 𝑏].
The probability density function (PDF) is defined such that the area under the curve gives
probability:

𝑃 (𝑎 ≤ 𝑋 ≤ 𝑏) = ∫_𝑎^𝑏 𝑓(𝑥) d𝑥
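The integral above can be approximated numerically. A Python sketch (illustrative, not from the slides) using the trapezoid rule on the standard normal density:

```python
# P(a <= X <= b) as the integral of a density: standard normal over [-1, 1],
# approximated with the trapezoid rule.
from math import exp, pi, sqrt

def normal_pdf(x: float) -> float:
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

a, b, steps = -1.0, 1.0, 10_000
h = (b - a) / steps
xs = [a + i * h for i in range(steps + 1)]
# Trapezoid rule: full weight on interior points, half weight on endpoints.
area = h * (sum(normal_pdf(x) for x in xs) - (normal_pdf(a) + normal_pdf(b)) / 2)
print(round(area, 4))  # 0.6827: the familiar "68% within one sigma"
```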
Continuous Uniform
The continuous uniform distribution is a symmetric probability distribution in which all
intervals of the same length within the support are equally likely to be observed.
The continuous uniform distribution has two parameters and its notation is
Uniform(𝑎, 𝑏):
• 𝑎 – lower bound
• 𝑏 – upper bound
Continuous Uniform
Uniform(𝑎, 𝑏) = 𝑓(𝑥, 𝑎, 𝑏) = 1/(𝑏 − 𝑎) for 𝑎 ≤ 𝑥 ≤ 𝑏 and 𝑥 ∈ [𝑎, 𝑏]
Continuous Uniform
[Figure: PDF of Uniform(𝑎 = 0, 𝑏 = 6)]
Normal
This distribution is generally used in the social and natural sciences to represent continuous
variables whose underlying distributions are unknown.
This assumption is due to the central limit theorem (CLT): under certain conditions,
the mean of many samples (observations) of a random variable with finite mean and
variance is itself a random variable whose distribution converges to a
normal distribution as the number of samples increases (as 𝑛 → ∞).
Hence, physical quantities that we assume to be the sum of many independent
processes (with measurement error) often have underlying distributions that are similar
to normal distributions.
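The CLT claim above can be checked by simulation. A Python sketch (illustrative choices of distribution and sample sizes, not from the slides):

```python
# CLT in action: sample means of a very non-normal distribution
# (Exponential(1)) concentrate around the true mean with spread sigma/sqrt(n).
import random
from statistics import mean, stdev

random.seed(42)
n, reps = 50, 2_000
sample_means = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

print(round(mean(sample_means), 2))   # close to 1.0, the true mean of Exponential(1)
print(round(stdev(sample_means), 2))  # close to sigma/sqrt(n) = 1/sqrt(50) ~ 0.14
```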
Normal
The normal distribution has two parameters and its notation is Normal(𝜇, 𝜎) or 𝑁 (𝜇, 𝜎):
• 𝜇 – mean of the distribution, and also median and mode
• 𝜎 – standard deviationxviii, a dispersion measure of how observations occur in relation
to the mean
xviii
it is sometimes parameterized with the variance 𝜎².
Normalxix
Normal(𝜇, 𝜎) = 𝑓(𝑥, 𝜇, 𝜎) = (1/(𝜎√(2𝜋))) 𝑒^(−(1/2)((𝑥−𝜇)/𝜎)²) for 𝜎 > 0
xix
see how the normal distribution was derived from the binomial distribution in the backup slides.
Normal
[Figure: PDFs of Normal(𝜇 = 0, 𝜎 = 1) and Normal(𝜇 = 1, 𝜎 = 2/3)]
Log-Normal
The log-normal distribution is a continuous probability distribution of a random variable
whose natural logarithm is normally distributed. Thus, if the natural logarithm of a
random variable 𝑋, ln(𝑋), is distributed as a normal distribution, then 𝑌 = ln(𝑋) is
normally distributed and 𝑋 is log-normally distributed.
A log-normal random variable only takes positive real values. It is a convenient and
useful model for measurements in exact and engineering sciences, as well as in
biomedical, economical and other sciences. For example, energy, concentrations, length,
financial returns and other measurements.
A log-normal process is the statistical realization of a multiplicative product of many
independent positive random variables.
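The multiplicative claim above follows because the log of a product is a sum of logs, to which the CLT applies. A Python sketch (illustrative factor distribution, not from the slides):

```python
# A product of many independent positive random variables is ~log-normal:
# its log is a sum of logs, so the CLT makes the log approximately normal.
import math
import random
from statistics import mean

random.seed(0)
products = [
    math.prod(random.uniform(0.5, 1.5) for _ in range(200))
    for _ in range(1_000)
]

print(all(p > 0 for p in products))  # True: log-normal support is positive reals
log_products = [math.log(p) for p in products]
print(round(mean(log_products)))     # about -9 = 200 * E[ln U(0.5, 1.5)]
```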
Log-Normal
The log-normal distribution has two parameters and its notation is Log-Normal(𝜇, 𝜎²):
• 𝜇 – mean of the variable’s natural logarithm
• 𝜎² – variance of the variable’s natural logarithm
Log-Normal
Log-Normal(𝜇, 𝜎) = 𝑓(𝑥, 𝜇, 𝜎) = (1/(𝑥𝜎√(2𝜋))) 𝑒^(−(ln(𝑥)−𝜇)²/(2𝜎²)) for 𝜎 > 0
Log-Normal
[Figure: PDFs of Log-Normal(𝜇 = 0, 𝜎 = 1) and Log-Normal(𝜇 = 1, 𝜎 = 2/3)]
Exponential
The exponential distribution is the probability distribution of the time between events
that occur continuously, independently, and with a constant mean rate of occurrence.
The exponential distribution has one parameter and its notation is Exponential(𝜆):
• 𝜆 – rate
Example: How long until the next earthquake or how long until the next bus arrives.
Exponential

Exponential(𝜆) = 𝑓(𝑥, 𝜆) = 𝜆𝑒^(−𝜆𝑥) for 𝜆 > 0 and 𝑥 ≥ 0
Exponential
[Figure: PDFs of Exponential(𝜆 = 1) and Exponential(𝜆 = 1/2)]
Gamma
The gamma distribution is a long-tailed distribution with support only for positive real
numbers.
The gamma distribution has two parameters and its notation is Gamma(𝛼, 𝜃):
• 𝛼 – shape parameter
• 𝜃 – scale parameter
Gamma
Gamma(𝛼, 𝜃) = 𝑓(𝑥, 𝛼, 𝜃) = (𝑥^(𝛼−1) 𝑒^(−𝑥/𝜃)) / (Γ(𝛼)𝜃^𝛼) for 𝑥, 𝛼, 𝜃 > 0
Gamma
[Figure: PDFs of Gamma(𝛼 = 1, 𝜃 = 1) and Gamma(𝛼 = 2, 𝜃 = 1/2)]
Student’s 𝑡
Student’s 𝑡 distribution arises when estimating the mean of a normally-distributed population in
situations where the sample size is small and the population standard deviation is unknownxx.
Student’s 𝑡 distribution is symmetric and bell-shaped, like the normal distribution, but with
longer tails, which means that it is more likely to produce values far away from its mean.
xx
this is where the ubiquitous Student’s 𝑡 test comes from.
Student’s 𝑡
Student’s 𝑡 distribution has one parameter and its notation is Student(𝜈):
• 𝜈 – degrees of freedom, controls how much it resembles a normal distribution
Student’s 𝑡
Student(𝜈) = 𝑓(𝑥, 𝜈) = (Γ((𝜈+1)/2) / (√(𝜈𝜋) Γ(𝜈/2))) (1 + 𝑥²/𝜈)^(−(𝜈+1)/2)
Student’s 𝑡
[Figure: PDFs of Student(𝜈 = 1) and Student(𝜈 = 3)]
Cauchy
The Cauchy distribution is a bell-shaped distribution and a special case of Student’s 𝑡
with 𝜈 = 1.
But, differently from Student’s 𝑡, the Cauchy distribution has two parameters and its
notation is Cauchy(𝜇, 𝜎):
• 𝜇 – location parameter
• 𝜎 – scale parameter
Cauchy
Cauchy(𝜇, 𝜎) = 𝑓(𝑥, 𝜇, 𝜎) = 1 / (𝜋𝜎(1 + ((𝑥−𝜇)/𝜎)²)) for 𝜎 > 0
Cauchy
[Figure: PDFs of Cauchy(𝜇 = 0, 𝜎 = 1) and Cauchy(𝜇 = 0, 𝜎 = 1/2)]
Beta
The beta distribution is a natural choice to model anything that is restricted to values
between 0 and 1. Hence, it is a good candidate to model probabilities and proportions.
The beta distribution has two parameters and its notation is Beta(𝛼, 𝛽):
• 𝛼 (or sometimes 𝑎) – shape parameter, controls how much the shape is shifted towards
1
• 𝛽 (or sometimes 𝑏) – shape parameter, controls how much the shape is shifted towards
0
Example: A basketball player that has already scored 5 free throws and missed 3 in a
total of 8 attempts – Beta(5, 3)
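The free-throw example hints at the standard Beta–Binomial conjugate update, a textbook result not derived in these slides: with a Beta(α, β) prior on a success probability and s observed successes and f failures, the posterior is Beta(α + s, β + f). A Python sketch (the helper name is illustrative):

```python
# Beta-Binomial conjugate update: Beta(a, b) prior + (s successes, f failures)
# observed in Bernoulli trials -> Beta(a + s, b + f) posterior.
def beta_update(alpha: float, beta: float, s: int, f: int) -> tuple[float, float]:
    return alpha + s, beta + f

# Free-throw example: flat Beta(1, 1) prior, then observe 5 makes and 3 misses.
a, b = beta_update(1, 1, s=5, f=3)
posterior_mean = a / (a + b)
print(a, b, round(posterior_mean, 2))  # 6 4 0.6
```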
Beta
Beta(𝛼, 𝛽) = 𝑓(𝑥, 𝛼, 𝛽) = 𝑥^(𝛼−1)(1 − 𝑥)^(𝛽−1) / (Γ(𝛼)Γ(𝛽)/Γ(𝛼+𝛽))
for 𝛼, 𝛽 > 0 and 𝑥 ∈ [0, 1]
Beta
[Figure: PDFs of Beta(𝛼 = 1, 𝛽 = 1) and Beta(𝛼 = 3, 𝛽 = 2)]
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013):
‣ Chapter 2: Single-parameter models
‣ Chapter 3: Introduction to multiparameter models
• McElreath (2020) - Chapter 4: Geocentric Models
• Gelman, Hill and Vehtari (2020):
‣ Chapter 9, Section 9.3: Prior information and Bayesian synthesis
‣ Chapter 9, Section 9.5: Uniform, weakly informative, and informative priors in
regression
• Schoot et al. (2021)
Prior Probability
Bayesian statistics is characterized by the use of prior information as the prior
probability 𝑃 (𝜃), often just prior:
𝑃 (𝜃 | 𝑦) = (𝑃 (𝑦 | 𝜃) ⋅ 𝑃 (𝜃)) / 𝑃 (𝑦)

where 𝑃 (𝑦 | 𝜃) is the likelihood, 𝑃 (𝜃) the prior, 𝑃 (𝜃 | 𝑦) the posterior, and 𝑃 (𝑦) the
normalizing constant.
Types of Priors
In general, we can have 3 types of priors in a Bayesian approach (Andrew Gelman, John
B. Carlin, Stern, et al., 2013; McElreath, 2020; Schoot et al., 2021):
• uniform (flat): not recommended.
• weakly informative: small amounts of real-world information along with common
sense and low specific domain knowledge added.
• informative: introduction of medium to high domain knowledge.
I recommend always transforming the priors of the problem at hand into something
centered at 0 with standard deviation of 1xxi:
• 𝜃 ∼ Normal(0, 1) (Andrew Gelman’s preferred choicexxii )
• 𝜃 ∼ Student(𝜈 = 3, 0, 1) (Aki Vehtari’s preferred choicexxii)
xxi
this is called standardization, transforming all variables into 𝜇 = 0 and 𝜎 = 1.
xxii
see more about prior choices in the Stan’s GitHub wiki.
xxiii
https://youtu.be/p6cyRBWahRA, in case you want to see the full video, the section about priors related to the argument begins at minute 40
xxiv
log(odds ratio) = log(2.5) = 0.9163.
Informative Prior
In some contexts, it is interesting to use an informative prior. Good candidates are when
data is scarce or expensive and when prior knowledge about the phenomenon is available.
Some examples:
• Normal(5, 20)
• Log-Normal(0, 5)
• Beta(100, 9803)xxv
xxv
this is used in COVID-19 models from the CoDatMo Stan research group.
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 6: Model checking
• McElreath (2020) - Chapter 4: Geocentric Models
• Gelman, Hill and Vehtari (2020):
‣ Chapter 6: Background on regression modeling
‣ Chapter 11: Assumptions, diagnostics, and model evaluation
• Gelman et al. (2020) - “Workflow Paper”
Bayesian Workflowxxvi
[Figure: Bayesian workflow: Prior Elicitation → Model Specification → Inference, with
Prior Predictive Checks and Posterior Predictive Checks]
xxvi
based on Gelman et al. (2020)
Bayesian Workflowxxvii
• Understand the domain and problem.
• Formulate the model mathematically.
• Implement model, test, and debug.
• Perform prior predictive checks.
• Fit the model.
• Assess convergence diagnostics.
• Perform posterior predictive checks.
• Improve the model iteratively: from baseline to complex and computationally efficient
models.
xxvii
adapted from Elizaveta Semenova.
Figure 3: Box’s Loop from Box (1976) but taken from Blei (2014).
Prior Predictive Check
In a very simple way, it consists of simulating parameter values from the prior
distribution, without conditioning on any data or employing any likelihood function.
xxviii
we also perform mathematical/exact inspections, see the section on Model Comparison.
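A prior predictive check can be sketched in plain Python (the priors below are illustrative choices for a simple linear model, not prescribed by the slides):

```python
# Prior predictive check: draw parameters from the priors, then simulate
# data -- no real observations and no likelihood conditioning involved.
import random

random.seed(1)

def prior_predictive(x):
    """Draw one simulated dataset from the prior predictive distribution."""
    alpha = random.gauss(0, 20)        # alpha ~ Normal(0, 20)
    beta = random.gauss(0, 2)          # beta  ~ Normal(0, 2)
    sigma = abs(random.gauss(0, 2.5))  # half-normal stand-in for a positive prior
    return [random.gauss(alpha + beta * xi, sigma) for xi in x]

x = [0.0, 1.0, 2.0, 3.0]
simulated = [prior_predictive(x) for _ in range(5)]  # 5 fake datasets
print(len(simulated), len(simulated[0]))  # 5 4
```

Plotting these simulated datasets against the real data is exactly the visual comparison the prior predictive check is after.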
Figure 4: Real versus Simulated Densities Figure 5: Real versus Simulated Empirical CDFs
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013):
‣ Chapter 14: Introduction to regression models
‣ Chapter 16: Generalized linear models
• McElreath (2020) - Chapter 4: Geocentric Models
• Gelman, Hill and Vehtari (2020):
‣ Chapter 7: Linear regression with a single predictor
‣ Chapter 8: Fitting regression models
‣ Chapter 10: Linear regression with multiple predictors
𝒚 = 𝛼 + 𝑿𝜷 + 𝜀, with 𝜀 ∼ Normal(0, 𝜎) i.i.d.xxix
where:
• 𝒚 – dependent variable
• 𝛼 – intercept (also called as constant)
• 𝜷 – coefficient vector
• 𝑿 – data matrix
• 𝜀 – model error
xxix
independent and identically distributed.
𝒚 ∼ Normal(𝛼 + 𝑿𝜷, 𝜎)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜎 ∼ Exponential(𝜆𝜎 )
• Prior Distribution for 𝛼 – Knowledge that we have about the model’s intercept.
• Prior Distribution for 𝜷 – Knowledge that we have about the model’s independent
variable coefficients.
• Prior Distribution for 𝜎 – Knowledge that we have about the model’s error.
Posterior Computation
Our aim to is to find the posterior distribution of the model’s parameters of interest (𝛼
and 𝜷) by computing the full posterior distribution of:
𝑃 (𝜽 | 𝒚) = 𝑃 (𝛼, 𝜷, 𝜎 | 𝒚)
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 16: Generalized linear
models
• McElreath (2020)
‣ Chapter 10: Big Entropy and the Generalized Linear Model
‣ Chapter 11, Section 11.1: Binomial regression
• Gelman, Hill and Vehtari (2020):
‣ Chapter 13: Logistic regression
‣ Chapter 14: Working with logistic regression
‣ Chapter 15, Section 15.3: Logistic-binomial model
‣ Chapter 15, Section 15.4: Probit regression
The first one is logistic regression (also called Bernoulli regression or binomial
regression).
Binary Dataxxx
We use logistic regression when our dependent variable is binary. It only takes two
distinct values, usually coded as 0 and 1.
xxx
also known as dichotomous, dummy, indicator variable, etc.
Logistic Function
1
logistic(𝑥) 0.75
0.5
0.25
0
−10 −5 0 5 10
𝑥
Probit Function
We can also choose to use the probit function (usually represented by the Greek
letter Φ), which is the CDF of a normal distribution:
Φ(𝑥) = (1/√(2𝜋)) ∫_(−∞)^𝑥 𝑒^(−𝑡²/2) d𝑡
Probit Function
[Figure: the probit function Φ(𝑥)]
• 𝛼 – intercept.
• 𝜷 = 𝛽1 , 𝛽2 , …, 𝛽𝑘 – independent variables’ 𝑥1 , 𝑥2 , …, 𝑥𝑘 coefficients.
• 𝑘 – number of independent variables.
If you implement a small mathematical transformation, you’ll have logistic regression:
• 𝑝̂ = logistic(linear) = 1/(1 + 𝑒^(−linear)) – probability of an observation taking value 1.
• 𝑦̂ = 0 if 𝑝̂ < 0.5, 1 if 𝑝̂ ≥ 0.5 – 𝒚’s predicted binary value.
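The transformation above in a Python sketch (the intercept and coefficient values are illustrative, not fitted):

```python
# Logistic regression prediction: linear predictor -> probability via the
# logistic function -> binary prediction with a 0.5 threshold.
from math import exp

def logistic(z: float) -> float:
    return 1 / (1 + exp(-z))

alpha, beta = -1.0, 2.0  # illustrative intercept and coefficient
for x in (0.0, 1.0):
    p_hat = logistic(alpha + beta * x)  # probability of y = 1
    y_hat = 1 if p_hat >= 0.5 else 0    # predicted binary value
    print(round(p_hat, 2), y_hat)       # 0.27 0, then 0.73 1
```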
• Bernoulli likelihood – binary dependent variable 𝒚 which results from a Bernoulli trial
with some probability 𝑝.
• binomial likelihood – discrete and positive dependent variable 𝒚 which results from 𝑘
successes in 𝑛 independent Bernoulli trials.
Bernoulli Likelihood
𝒚 ∼ Bernoulli(𝑝)
𝑝 = logistic/probit(𝛼 + 𝑿𝜷)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
where:
• 𝒚 - dependent binary variable.
• 𝑝 - probability of 𝒚 taking value of 1 – success in an independent Bernoulli trial.
• logistic/probit – logistic or probit function.
• 𝛼 – intercept (also called constant).
• 𝜷 – coefficient vector.
• 𝑿 – data matrix.
Binomial Likelihood
𝒚 ∼ Binomial(𝑛, 𝑝)
𝑝 = logistic/probit(𝛼 + 𝑿𝜷)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
where:
• 𝒚 - dependent binary variable.
• 𝑛 - number of independent Bernoulli trials.
• 𝑝 - probability of 𝒚 taking value of 1 – success in an independent Bernoulli trial.
• logistic/probit – logistic or probit function.
• 𝛼 – intercept (also called constant).
• 𝜷 – coefficient vector.
• 𝑿 – data matrix.
Posterior Computation
Our aim to is to find the posterior distribution of the model’s parameters of interest (𝛼
and 𝜷) by computing the full posterior distribution of:
𝑃 (𝜽 | 𝒚) = 𝑃 (𝛼, 𝜷 | 𝒚)
Specifically, we need to undo the logistic transformation. We are looking for its inverse
function.
xxxi
mathematically speaking.
1
• Odds with a value of 1 is a neutral odds, similar to a fair coin: 𝑝 = 2
• Odds below 1 decrease the probability of seeing a certain event.
• Odds over 1 increase the probability of seeing a certain event.
Logodds
If you revisit the logistic function, you’ll see that the intercept 𝛼 and coefficients 𝜷 are
literally the log of the odds (logodds):

𝑝 = logistic(𝛼 + 𝑿𝜷)
logit(𝑝) = ln(𝑝/(1 − 𝑝)) = 𝛼 + 𝑿𝜷
𝛼 + 𝑿𝜷 = log(odds)
Logodds
Hence, the coefficients of a logistic regression are expressed in logodds, in which 0 is the
neutral element, and any number above or below it increases or decreases, respectively,
the chances of obtaining a “success” in 𝒚. To have a more intuitive interpretation
(similar to the betting houses), we need to convert the logodds into odds by undoing
the log function. We need to perform an exponentiation of the 𝛼 and 𝜷 values:

odds(𝛼) = 𝑒^𝛼
odds(𝜷) = 𝑒^𝜷
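A Python sketch of this exponentiation, using the log(2.5) value from the earlier footnote (the coefficient is illustrative, not fitted):

```python
# Turning a logodds coefficient back into an odds ratio.
from math import exp, log

beta = log(2.5)             # a coefficient expressed in logodds
print(round(beta, 4))       # 0.9163, matching the footnote's log(2.5)
print(round(exp(beta), 1))  # 2.5: each unit of x multiplies the odds by 2.5
```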
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 16: Generalized linear
models, Section 16.2: Models for multivariate and multinomial responses
• McElreath (2020) - Chapter 12, Section 12.3: Ordered categorical outcomes
• Gelman, Hill and Vehtari (2020) - Chapter 15, Section 15.5: Ordered and unordered
categorical regression
• Bürkner and Vuorre (2019)
• Semenova (2019)
Ordinal regression is a regression model for discrete data, more specifically, when the
values of the dependent variable have a “natural ordering”.
For example, opinion polls with their plausible ordered values from agree to disagree, or a
patient’s perception of pain score.
The problem is equidistance: linear regression (and almost all models that use “metric”
dependent variables) assumes that the distance between, for example, 1 and 2 is the same
as the distance between 2 and 3, an assumption that ordinal data generally violates.
Log-cumulative-odds
Still, this is not enough. We need to apply the logit function onto the CDF:
logit(𝑥) = logistic⁻¹(𝑥) = ln(𝑥 / (1 − 𝑥))
where ln is the natural log function.
The logit function is the inverse of the logistic function: it takes as input any value between 0 and
1 (e.g. a probability) and outputs an unconstrained real number, which we call logoddsxxxii.
As the transformation is performed onto the CDF, we call the result the CDF logodds or
log-cumulative-odds.
xxxii
we have already seen it in logistic regression.
𝐾 − 1 Intercepts
What do we do with this log-cumulative-odds?
It allows us to construct different intercepts for all possible values of the ordinal
dependent variable. We create an unique intercept for 𝑘 ∈ 𝐾.
Actually is 𝑘 ∈ 𝐾 − 1. Notice that the maximum value of the CDF of 𝑌 will always be 1.
Which translates to a log-cumulative-odds of ∞, since 𝑝 = 1:
𝑝 1
ln( ) = ln( ) = ln(0) = ∞
1−𝑝 1−1
Hence, we need only 𝐾 − 1 intercepts for all 𝐾 possible values that 𝑌 can take.
Since each intercept implies a different CDF value for each 𝑘 ∈ 𝐾, we can safely violate
the equidistant assumption which is not valid in almost all ordinal variables.
Cut Points
Each intercept implies a log-cumulative-odds for each 𝑘 ∈ 𝐾. We also need to undo
the cumulative nature of the 𝐾 − 1 intercepts. First, we convert the log-cumulative-
odds back to a valid probability with the logistic function:

logit⁻¹(𝑥) = logistic(𝑥) = 1/(1 + 𝑒^(−𝑥))

Then, finally, we remove the cumulative nature of the CDF by subtracting from the CDF
at each cut point 𝑘 the CDF at the previous cut point 𝑘 − 1:

𝑃 (𝑌 = 𝑘) = 𝑃 (𝑌 ≤ 𝑘) − 𝑃 (𝑌 ≤ 𝑘 − 1)
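The whole construction in a Python sketch (the cut-point values are illustrative, not estimated):

```python
# Ordinal cut points: K - 1 intercepts on the log-cumulative-odds scale
# -> cumulative probabilities -> per-category probabilities by differencing.
from math import exp

def logistic(z: float) -> float:
    return 1 / (1 + exp(-z))

cutpoints = [-1.0, 0.0, 1.5]                    # K - 1 = 3 intercepts, so K = 4
cdf = [logistic(c) for c in cutpoints] + [1.0]  # P(Y <= k); the last is always 1

# P(Y = k) = P(Y <= k) - P(Y <= k - 1)
probs = [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]
print([round(p, 3) for p in probs])
print(round(sum(probs), 10))  # 1.0: a valid distribution over the K categories
```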
[Figure: PMF, CDF, and log-cumulative-odds of an ordinal variable with values 1–6]
Adding Coefficients 𝜷
With the equidistant assumption solved with 𝐾 − 1 intercepts, we can add coefficients to
represent the independent variable’s effects into our ordinal regression model.
More Log-cumulative-odds
We’ve transformed all intercepts into log-cumulative-odds so that we can add effects as
weighted sums of the independent variables to our basal rates (intercepts).
For every 𝑘 ∈ 𝐾 − 1, we calculate:
𝜑 = 𝛼𝑘 + 𝛽𝑖 𝑥𝑖
Matrix Notation
This can become more elegant and computationally efficient if we use matrix/vector
notation:
𝝋 = 𝜶 + 𝑿𝑐 ⋅ 𝜷
where 𝝋, 𝜶 and 𝜷xxxiii are vectors and 𝑿 is the data matrix, in which every row is an
observation and every column an independent variable.
xxxiii
note that both the coefficients and intercepts will have to be interpreted as odds, like we did in logistic regression.
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 16: Generalized linear
models
• McElreath (2020):
‣ Chapter 10: Big Entropy and the Generalized Linear Model
‣ Chapter 11, Section 11.2: Poisson regression
• Gelman, Hill and Vehtari (2020) - Chapter 15, Section 15.2: Poisson and negative
binomial regression
Count Data
Poisson regression is used when our dependent variable can only take positive values,
usually in the context of count data.
Exponential Function
[Figure: the exponential function 𝑒ˣ for 𝑥 ∈ [−1, 5].]
where:
• 𝛼 – intercept.
• 𝜷 = 𝛽1 , 𝛽2 , …, 𝛽𝑘 – independent variables’ 𝑥1 , 𝑥2 , …, 𝑥𝑘 coefficients.
• 𝑘 – number of independent variables.
If you apply a small mathematical transformation, you'll have Poisson regression:
• log(𝑦) = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + … + 𝛽𝑘 𝑥𝑘, i.e. 𝑦 = 𝑒^(𝛼+𝛽1 𝑥1 +𝛽2 𝑥2 +…+𝛽𝑘 𝑥𝑘)
𝒚 ∼ Poisson(𝑒(𝛼+𝑿𝜷) )
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
log−1 (𝑥) = 𝑒𝑥
𝒚 = 𝑒^(𝛼+𝑿𝜷) = 𝑒^𝛼 ⋅ 𝑒^(𝑋(1) ⋅𝛽(1)) ⋅ 𝑒^(𝑋(2) ⋅𝛽(2)) ⋅ … ⋅ 𝑒^(𝑋(𝑘) ⋅𝛽(𝑘))
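This identity is easy to check numerically (the parameter values below are purely illustrative): exponentiating the linear predictor factors it into a product of multiplicative effects, one per coefficient, which is why Poisson coefficients are interpreted multiplicatively.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5
beta = np.array([0.2, -0.1, 0.4])
x = rng.normal(size=3)

# Rate on the natural scale via the inverse link (exponentiation)
lam_joint = np.exp(alpha + x @ beta)

# The same rate as a product of one multiplicative effect per coefficient
lam_product = np.exp(alpha) * np.prod(np.exp(x * beta))
```

Both expressions give the same rate, so each 𝑒^𝛽 acts as a multiplicative factor on the expected count.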
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 17: Models for robust
inference
• McElreath (2020) - Chapter 12: Monsters and Mixtures
• Gelman, Hill and Vehtari (2020):
‣ Chapter 15, Section 15.6: Robust regression using the t model
‣ Chapter 15, Section 15.8: Going beyond generalized linear models
Robust Models
Real-world data are almost always strange.
For the sake of convenience, we use simple models. But always ask yourself: in how
many ways might the posterior inference depend on the following?
Outliers
Models based on the normal distribution are notoriously “non-robust” against outliers,
in the sense that a single observation can greatly affect the inference of all of the model's
parameters, even those that have only a shallow relationship with it.
Overdispersion
Overdispersion and underdispersionxxxiv refer to data that have more or less variation
than expected under a probability model (Gelman, Hill and Vehtari, 2020).
For each one of the models we covered, there is a natural extension in which a single
parameter is added to allow for overdispersion (Andrew Gelman, John B. Carlin, Stern, et
al., 2013).
xxxiv
rarer to find in the real world.
Overdispersion Example
Suppose you are analyzing data from car accidents. The model we generally use for this
type of phenomenon is Poisson regression.
Poisson distribution has the same parameter for both the mean and variance: the rate
parameter 𝜆.
Hence, if you find more variability than the Poisson likelihood
function allows, you probably won't be able to properly model the desired
phenomenon.
From the Bayesian viewpoint, there is nothing special or magical in the Gaussian/
Normal likelihood.
It is just another distribution specified in a statistical model. We can make our model
robust by using the Student’s 𝑡 distribution as a likelihood function.
xxxv
or “fatter”.
[Figure: probability density over [−4, 4], illustrating the Student's 𝑡 distribution's heavier tails.]
Note that we are including an extra parameter 𝜈, which represents the Student’s 𝑡 distribution degrees of
freedom, to be estimated by the model (Andrew Gelman, John B. Carlin, Stern, et al., 2013).
This controls how wide or narrow the “tails” of the distribution will be. A heavy-tailed, positive-only prior is
advised.
xxxvi
since 𝑛 already comes from data.
Generally, we use 𝛼 as the binomial’s probability of the success 𝑝, and 𝛽 xxxvii is the
additional parameter to control and allow for overdispersion.
xxxvii
sometimes specified as 𝜑
𝑦𝑖 = { 0 if 𝑧𝑖 < 0
       1 if 𝑧𝑖 > 0
𝑧𝑖 = 𝑋𝑖 𝜷 + 𝜀𝑖
𝜀𝑖 ∼ Student(𝜈, 0, √((𝜈 − 2)/𝜈))
𝜈 ∼ Gamma(2, 0.1) ∈ [2, ∞)
Here we are using a gamma distribution as a truncated prior for the Student's 𝑡 degrees of freedom
parameter 𝜈 ≥ 2. Another option would be to fix 𝜈 = 4.
xxxviii
there is a great discussion between Gelman, Vehtari and Kurz at Stan’s Discourse .
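To see the robustness in action, here is a minimal sketch (toy data with the scale fixed at 1 and 𝜈 fixed at 4; not the slides' full Bayesian model, just maximum likelihood location estimates): under a normal likelihood a single outlier drags the location estimate far from the bulk of the data, while under a Student's 𝑡 likelihood it barely moves.

```python
import numpy as np
from scipy import optimize, stats

# Five well-behaved observations plus one gross outlier
y = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 50.0])

def neg_loglik(mu, dist):
    # Negative log-likelihood of a location-only model with unit scale
    return -dist.logpdf(y - mu).sum()

# Normal likelihood: the MLE is the sample mean, pulled hard by the outlier
mu_normal = optimize.minimize_scalar(neg_loglik, args=(stats.norm,)).x

# Student's t likelihood with nu = 4: heavy tails discount the outlier
mu_student = optimize.minimize_scalar(neg_loglik, args=(stats.t(df=4),)).x
```

The normal estimate lands near 8 (the mean of all six points), while the Student's 𝑡 estimate stays near the center of the five well-behaved points.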
Hence, if you find overdispersion, you'll probably need a robust alternative to Poisson.
This is where the negative binomial comes in: its extra parameter 𝜑 makes it robust to
overdispersion.
𝜑 controls the probability of success 𝑝, and we generally use a gamma distribution as its
prior. 𝜑 is also known as a “reciprocal dispersion” parameter.
𝒚𝑖 = { 0 if 𝑆𝑖 = 0
       ∼ Negative Binomial(𝑒^(𝛼+𝑿𝜷), 𝜑) if 𝑆𝑖 = 1
𝑃 (𝑆𝑖 = 1) = Logistic/Probit(𝑿𝜸)
𝛾 ∼ Beta(1, 1)
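A generative sketch of this kind of zero-augmented count model in Python (all parameter values are assumed for illustration): the negative binomial part is simulated through its gamma–Poisson mixture representation, and a Bernoulli switch produces the structural zeros.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
p_zero = 0.3        # assumed P(S_i = 0): probability of a structural zero
mu, phi = 5.0, 2.0  # assumed negative binomial mean and reciprocal dispersion

# Negative binomial via its gamma-Poisson mixture representation:
# lambda_i ~ Gamma(phi, mu / phi), y_i | lambda_i ~ Poisson(lambda_i)
lam = rng.gamma(shape=phi, scale=mu / phi, size=n)
counts = rng.poisson(lam)

# Zero augmentation: S_i = 0 forces y_i = 0
s = rng.random(n) >= p_zero
y = np.where(s, counts, 0)
```

The simulated counts are overdispersed: their variance far exceeds their mean, which a pure Poisson (mean = variance) could not capture.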
Recommended References
• Gelman, Hill and Vehtari (2020) - Chapter 12, Section 12.8: Models for regression
coefficients
• Horseshoe Prior: Carvalho, Polson and Scott (2009)
• Horseshoe+ Prior: Bhadra et al. (2015)
• Regularized Horseshoe Prior: Piironen and Vehtari (2017)
• R2-D2 Prior: Zhang et al. (2022)
• Betancourt’s Case study on Sparsity: Betancourt (2021)
What is Sparsity?
This makes sense from a Bayesian perspective, as data is information, and we don’t
want to throw information away.
Frequentist Approach
The frequentist approach deals with sparse regression by staying in the “optimization”
context but adding Lagrangian constraintsxxxix:
min_𝜷 { ∑_{𝑖=1}^{𝑁} (𝑦𝑖 − 𝛼 − 𝑥𝑖^𝑇 𝜷)² }
subject to ‖𝜷‖_𝑝 ≤ 𝑡.
xxxix
this is called LASSO (least absolute shrinkage and selection operator) from Tibshirani (1996); Zou and Hastie (2005).
𝛽𝑖 | 𝜆𝑖, 𝑐 ∼ Normal(0, √(𝜆𝑖² 𝑐²))
𝜆𝑖 ∼ Bernoulli(𝑝)
where:
• 𝑐: slab width
• 𝑝: prior inclusion probability; encodes the prior information about the sparsity of the coefficient vector 𝜷
• 𝜆𝑖 ∈ {0, 1}: whether the coefficient 𝛽𝑖 is close to zero (comes from the “spike”, 𝜆𝑖 = 0) or nonzero (comes
from the “slab”, 𝜆𝑖 = 1)
[Figure: spike-and-slab prior densities over [−4, 4] for 𝑐 = 1, 𝜆 = 0 (the “spike”, left) and 𝑐 = 1, 𝜆 = 1 (the “slab”, right).]
PDF(𝑥) = (1/(2𝑏)) 𝑒^(−|𝑥−𝜇|/𝑏)
It is a symmetric exponential decay around 𝜇 with scale governed by 𝑏.
[Figure: PDF of the Laplace distribution with 𝜇 = 0, 𝑏 = 1 over [−4, 4].]
𝛽𝑖 | 𝜆𝑖, 𝜏 ∼ Normal(0, √(𝜆𝑖² 𝜏²))
𝜆𝑖 ∼ Cauchy+ (0, 1)
where:
• 𝜏 : global shrinkage parameter
• 𝜆𝑖 : local shrinkage parameter
• Cauchy+ is the half-Cauchy distribution for the standard deviation 𝜆𝑖
Note that it is similar to the spike-and-slab, but the discrete mixture becomes a “continuous” mixture
with the Cauchy+ .
[Figure: horseshoe prior densities over [−4, 4] for 𝜏 = 1, 𝜆 = 1 (left) and 𝜏 = 1, 𝜆 = ½ (right).]
Shrinkage priors, despite not having the best representation of sparsity, can be very
attractive computationally: again due to the continuous property.
𝐸(𝛽𝑖 | 𝑦𝑖, 𝜆𝑖) = (𝜆𝑖² / (1 + 𝜆𝑖²)) 𝑦𝑖 + (1 / (1 + 𝜆𝑖²)) ⋅ 0 = (1 − 𝜅𝑖) 𝑦𝑖
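The shrinkage weight 𝜅𝑖 = 1/(1 + 𝜆𝑖²) can be illustrated directly (a toy sketch with 𝜏 = 𝜎 = 1, not tied to any particular shrinkage prior):

```python
import numpy as np

def posterior_mean(y, lam):
    # Shrinkage weight kappa in [0, 1]:
    # kappa = 1 shrinks the estimate fully to zero, kappa = 0 not at all
    kappa = 1.0 / (1.0 + lam**2)
    return (1.0 - kappa) * y

# A small local scale shrinks the estimate almost entirely to zero...
near_zero = posterior_mean(2.0, lam=0.01)
# ...while a large local scale leaves the observation almost untouched
near_y = posterior_mean(2.0, lam=100.0)
```

This is exactly the continuous analogue of the spike (𝜅 ≈ 1) and slab (𝜅 ≈ 0) behavior.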
[Figure: density of the shrinkage weight 𝜅 ∈ [0, 1] under the Laplace (left) and horseshoexl (right) priors.]
xl
spike-and-slab with 𝑝 = ½ would be very similar to the horseshoe, but with discontinuities.
where:
• 𝜏 : global shrinkage parameter
• 𝜆𝑖 : local shrinkage parameter
• 𝜂𝑖 : additional local shrinkage parameter
• Cauchy+ is the half-Cauchy distribution for the standard deviation 𝜆𝑖 and 𝜂𝑖
The solution, Regularized Horseshoe (Piironen and Vehtari, 2017) (also known as the
“Finnish Horseshoe”), is able to control the amount of shrinkage for the largest
coefficient.
𝜆̃𝑖² = 𝑐²𝜆𝑖² / (𝑐² + 𝜏²𝜆𝑖²)
𝜆𝑖 ∼ Cauchy+ (0, 1)
where:
• 𝜏 : global shrinkage parameter
• 𝜆𝑖 : local shrinkage parameter
• 𝑐 > 0: regularization constant
• Cauchy+ is the half-Cauchy distribution for the standard deviation 𝜆𝑖
Note that when 𝜏²𝜆𝑖² ≪ 𝑐² (coefficient 𝛽𝑖 ≈ 0), then 𝜆̃𝑖² → 𝜆𝑖²; and when 𝜏²𝜆𝑖² ≫ 𝑐² (coefficient 𝛽𝑖 far from 0),
then 𝜆̃𝑖² → 𝑐²/𝜏² and the 𝛽𝑖 prior approaches Normal(0, 𝑐).
The idea is that, instead of specifying a prior on 𝜷, we construct a prior on the coefficient
of determination 𝑅²xli, and then use that prior to “distribute” shrinkage throughout the 𝜷.
xli
the square of the correlation coefficient between the dependent variable and its modeled expectation.
𝑅²-induced Dirichlet Decomposition
𝜏² = 𝑅² / (1 − 𝑅²)
𝜷 = 𝑍 ⋅ √(𝝋𝜏²)
where:
• 𝜏 : global shrinkage parameter
• 𝝋: proportion of total variance allocated to each covariate, can be interpreted as the local shrinkage
parameter
• 𝜇𝑅² is the mean of the 𝑅² parameter, generally ½
• 𝜎𝑅² is the precision of the 𝑅² parameter, generally 2
• 𝑍 is the standard Gaussian, i.e. Normal(0, 1)
Recommended References
• Gelman, Hill and Vehtari (2020):
‣ Chapter 5: Hierarchical models
‣ Chapter 15: Hierarchical linear models
• McElreath (2020):
‣ Chapter 13: Models With Memory
‣ Chapter 14: Adventures in Covariance
• Gelman and Hill (2007)
• Michael Betancourt’s case study on Hierarchical modeling
• Kruschke and Vanpaemel (2015)
xlii
for the whole full list check here.
[Diagram: graphical models relating observations 𝑦1, …, 𝑦𝐾 to group-level parameters 𝜃1, …, 𝜃𝐾, without and with a population-level parameter 𝜑 tying the groups together.]
For example, the observations from the 𝑘th group, 𝑦𝑘, directly inform the parameters that quantify the 𝑘th group's behavior,
𝜃𝑘. These parameters, in turn, directly inform the population-level parameters, 𝜑, which then inform the other group-level
parameters. In the same manner, observations that directly inform one group's parameters also provide indirect
information to the population-level parameters, which then inform the other group-level parameters, and so on…
Hierarchical models are used when information is available in several levels of units of
observation. The hierarchical structure of analysis and organization assists in the
understanding of multiparameter problems, while also performing a crucial role in the
development of computational strategies.
xliii
also known as nested data.
This assumption stems from the principle that groups are exchangeable.
Hyperprior
In hierarchical models, we have a hyperprior, which is a prior’s prior:
𝒚 ∼ Normal(10, 𝜽)
𝜽 ∼ Normal(0, 𝜑)
𝜑 ∼ Exponential(1)
Here 𝒚 is a variable of interest that belongs to distinct groups. 𝜽, a prior for 𝒚, is a vector
of group-level parameters with their own prior (which becomes a hyperprior) 𝜑.
xliv
see https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html [Douglas Bates, creator of the lme4 package explanation].
To sum up, the frequentist approach to hierarchical models is not robust in either the
inference process (convergence failures during maximum likelihood estimation) or
the results of that process (it does not provide 𝑝-values, due to strong
assumptions that are almost always violated).
𝒚 ∼ Normal(𝛼𝑗 + 𝑿 ⋅ 𝜷, 𝜎)
𝛼𝑗 ∼ Normal(𝛼, 𝜏 )
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜏 ∼ Cauchy+ (0, 𝜓𝛼 )
𝜎 ∼ Exponential(𝜆𝜎 )
𝛼1 ∼ Normal(𝜇𝛼1 , 𝜎𝛼1 )
𝛼2 ∼ Normal(𝜇𝛼2 , 𝜎𝛼2 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜎 ∼ Exponential(𝜆𝜎 )
Mathematically, this makes the column behave like an “identity” variable (because the
number 1 in the multiplication 1 ⋅ 𝛽 is the identity element: it maps 𝑥 → 𝑥,
keeping the value of 𝑥 intact) and, consequently, we can interpret that column's
coefficient as the model's intercept.
𝒚 ∼ Normal(𝑿𝜷𝑗 , 𝜎)
𝜷𝑗 ∼ Multivariate Normal(𝝁𝑗 , 𝚺) for 𝑗 ∈ {1, …, 𝐽 }
𝚺 ∼ LKJ(𝜂)
𝜎 ∼ Exponential(𝜆𝜎 )
Each coefficient vector 𝜷𝑗 represents the model columns 𝑿 coefficients for every group
𝑗 ∈ 𝐽 . Also the first column of 𝑿 could be a column filled with 1s (intercept).
For computational efficiency, we can make the covariance matrix 𝚺 into a correlation
matrix. Every covariance matrix can be decomposed into:
𝚺 = diagmatrix (𝝉 ) ⋅ 𝛀 ⋅ diagmatrix (𝝉 )
where 𝛀 is a correlation matrix, with 1s in the diagonal and off-diagonal elements
between −1 and 1, i.e. 𝜌 ∈ (−1, 1).
𝝉 is a vector composed of the variables' standard deviations from 𝚺 (it is 𝚺's
diagonal).
𝛀 = 𝑳Ω 𝑳𝑇Ω
xlv
LKJ are the authors’ last name initials – Lewandowski, Kurowicka and Joe.
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013):
‣ Chapter 10: Introduction to Bayesian computation
‣ Chapter 11: Basics of Markov chain simulation
‣ Chapter 12: Computationally efficient Markov chain simulation
• McElreath (2020) - Chapter 9: Markov Chain Monte Carlo
• Neal (2011)
• Betancourt (2017)
• Gelman, Hill and Vehtari (2020) - Chapter 22, Section 22.8: Computational efficiency
• Chib and Greenberg (1995)
• Casella and George (1992)
xlvi
those who are interested, should read Eckhardt (1987).
In discrete cases, we can turn the denominator into a sum over all parameters using the chain rule
of probability:
𝑃 (𝐴, 𝐵 | 𝐶) = 𝑃 (𝐴 | 𝐵, 𝐶) ⋅ 𝑃 (𝐵 | 𝐶)
In many cases the integral is intractable (not possible to evaluate deterministically)
and, thus, we must find other ways to compute the posterior 𝑃 (𝜃 | data) without using
the denominator 𝑃 (data).
∑_𝜃 𝑃 (𝜃 | data) = 1
∫_𝜃 𝑃 (𝜃 | data) d𝜃 = 1
Markov Chains
xlvii
meaning that there is an unique stationary distribution.
Markov Chains
• Markov chains have the property that the probability distribution of
the next state depends only on the current state and not on the
sequence of events that preceded it:
𝑃 (𝑋𝑛+1 = 𝑥 | 𝑋0 , 𝑋1 , 𝑋2 , …, 𝑋𝑛 ) = 𝑃 (𝑋𝑛+1 = 𝑥 | 𝑋𝑛 )
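A toy two-state Markov chain (the transition matrix is an assumed example, not from the slides) makes the stationary distribution concrete: 𝜋 is the left eigenvector of the transition matrix 𝑃 with eigenvalue 1, i.e. 𝜋 = 𝜋𝑃.

```python
import numpy as np

# Assumed transition matrix of a two-state Markov chain (rows sum to 1)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# The stationary distribution pi satisfies pi = pi @ P,
# i.e. it is the left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()  # normalize to a probability vector
```

Running the chain long enough, the fraction of time spent in each state approaches 𝜋, which is exactly the property MCMC exploits.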
Markov Chains
The efficacy of this approach depends on:
• how big 𝑟 must be to guarantee an adequate sample.
• computational power required for every Markov chain iteration.
Besides, it is customary to discard the first iterations of the algorithm because they are usually non-
representative of the underlying stationary distribution to be approximated. In the initial iterations
of MCMC algorithms, the Markov chain is often in a “warm-up”xlviii process, and its state is very far
away from an ideal one to begin a trustworthy sampling.
Generally, it is recommended to discard the first half iterations (Andrew Gelman, John B. Carlin,
Stern, et al., 2013).
xlviii
some references call this “burnin”.
MCMC Algorithms
We have TONS of MCMC algorithmsxlix. Here we are going to cover two classes of MCMC
algorithms:
xlix
see the Wikipedia page for a full list.
l
sometimes called Hybrid Monte Carlo, specially in the physics literature.
Asymptotically, they have an acceptance rate of 23.4%, and the computational cost of
every iteration is 𝒪(𝑑), where 𝑑 is the number of dimensions in the parameter space
(Beskos et al., 2013).
Asymptotically, they have an acceptance rate of 65.1%, and the computational cost of
every iteration is 𝒪(𝑑^(1/4)), where 𝑑 is the number of dimensions in the parameter space
(Beskos et al., 2013).
Metropolis Algorithm
The first broadly used MCMC algorithm to generate samples from a Markov
chain was originated in the physics literature in the 1950s and is called
Metropolis (Metropolis et al., 1953), in honor of the first author Nicholas
Metropolis.
In sum, the Metropolis algorithm is an adaptation of a random walk coupled
with an acceptance/rejection rule to converge to the target distribution.
Metropolis algorithm uses a “proposal distribution” 𝐽𝑡 (𝜽∗ ) to define the next
values of the distribution 𝑃 ∗ (𝜽∗ | data). This distribution must be symmetric:
Metropolis Algorithm
Metropolis is a random walk through the parameter sample space, where the probability of the
Markov chain changing its state is defined as:
𝑃change = min(𝑃 (𝜽proposed) / 𝑃 (𝜽current), 1).
This means that the Markov chain will only change to a new state based in one of two conditions:
• when the probability of the random walk proposed parameters 𝑃 (𝜽proposed ) is higher than the
probability of the current state parameters 𝑃 (𝜽current ), we change with 100% probability.
• when the probability of the random walk proposed parameters 𝑃 (𝜽proposed ) is lower than the
probability of the current state parameters 𝑃 (𝜽current ), we change with probability equal to the
proportion of this probability difference.
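A minimal random-walk Metropolis sampler in Python implementing exactly this rule (the target, step size, and sample counts are illustrative choices):

```python
import numpy as np
from scipy import stats

def metropolis(logpdf, n_samples, step=1.0, init=0.0, seed=123):
    # Random-walk Metropolis with a symmetric normal proposal
    rng = np.random.default_rng(seed)
    theta = init
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = theta + rng.normal(scale=step)
        # Accept with probability min(P(proposed) / P(current), 1),
        # computed in log space for numerical stability
        if np.log(rng.random()) < logpdf(proposal) - logpdf(theta):
            theta = proposal
        samples[i] = theta
    return samples

# Target: a standard normal density; discard the first half as warm-up
draws = metropolis(stats.norm.logpdf, 20_000)[10_000:]
```

After discarding the warm-up, the retained draws should have mean ≈ 0 and standard deviation ≈ 1, matching the standard normal target.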
Metropolis Algorithm
Assign 𝜽ᵗ = 𝜽proposed with probability 𝑃change, otherwise 𝜽ᵗ = 𝜽current.
[Figure: random-walk Metropolis on a normal target density; an uphill proposal is accepted with probability 𝑃 = 1, while a downhill proposal is accepted with probability 𝑃 ≈ ¼.]
Metropolis-Hastings Algorithm
In the 1970s emerged a generalization of the Metropolis algorithm,
which does not need that the proposal distributions be symmetric:
Metropolis-Hastings Algorithm
Assign 𝜽ᵗ = 𝜽proposed with probability 𝑃change, otherwise 𝜽ᵗ = 𝜽current.
Metropolis-Hastings Animation
Gibbs Algorithm
To circumvent Metropolis' low acceptance rate, the Gibbs algorithm
was conceived. Gibbs does not have an acceptance/rejection rule for
the Markov chain state change: all proposals are accepted!
Gibbs algorithm was originally conceived by the physicist Josiah
Willard Gibbs while referencing an analogy between a sampling
algorithm and statistical physics (a physics field that originates from
statistical mechanics).
The algorithm was described by the Geman brothers in 1984 (Geman
and Geman, 1984), about 8 decades after Gibbs' death.
Gibbs Algorithm
The Gibbs algorithm is very useful in multidimensional sample spaces. It is also known
as alternating conditional sampling, because we always sample a parameter
conditioned on the probability of the other model’s parameters.
The Gibbs algorithm can be seen as a special case of the Metropolis-Hastings algorithm,
because all proposals are accepted (Gelman, 1992).
The essence of the Gibbs algorithm is the sampling of parameters conditioned in other
parameters:
𝑃 (𝜃1 | 𝜃2 , …, 𝜃𝑝 )
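A classic illustration (a toy bivariate standard normal, not an example from the slides): with correlation 𝜌, each full conditional is itself a normal distribution, so we can alternate exact conditional draws and every one is accepted.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=1):
    # Gibbs sampler for a standard bivariate normal with correlation rho.
    # Each full conditional is Normal(rho * other, sqrt(1 - rho^2)),
    # so every "proposal" is accepted.
    rng = np.random.default_rng(seed)
    x = y = 0.0
    sd = np.sqrt(1.0 - rho**2)
    out = np.empty((n_samples, 2))
    for i in range(n_samples):
        x = rng.normal(rho * y, sd)  # sample x conditioned on y
        y = rng.normal(rho * x, sd)  # sample y conditioned on x
        out[i] = x, y
    return out

draws = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)[1_000:]
```

The sample correlation recovers 𝜌; note also how a high 𝜌 makes the conditional steps small, which is the parameter-correlation weakness of Gibbs mentioned later.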
Gibbs Algorithm
Gibbs Animation
HMC Algorithm
The HMC algorithm is an adaptation of the MH algorithm that employs a guidance scheme for
generating new proposals. It boosts the acceptance rate, and, consequently, has better
efficiency.
More specifically, HMC uses the gradient of the posterior’s log density to guide the Markov chain to
higher density regions of the sample space, where most of the samples are sampled:
d log 𝑃 (𝜽 | 𝒚)
d𝜃
As a result, a Markov chain that uses a well-adjusted HMC algorithm will accept proposals with a
much higher rate than if using the MH algorithm (Roberts, Gelman and Gilks, 1997; Beskos et al.,
2013).
Soon after, HMC was applied to statistical problems by Neal (1994) who named it as
Hamiltonian Monte Carlo (HMC).
For a much more detailed and in-depth discussion (not our focus here) of HMC, I
recommend Neal (2011) and Betancourt (2017).
li
where it is called “Hybrid” Monte Carlo (HMC)
Besides that, HMC is much more efficient than Metropolis and does not suffer from Gibbs'
parameter correlation issues.
HMC uses a proposal distribution that changes depending on the Markov chain current state. HMC
finds the direction where the posterior density increases, the gradient, and alters the proposal
distribution towards the gradient direction.
The probability of the Markov chain to change its state in HMC is defined as:
𝑃change = min((𝑃 (𝜽proposed) ⋅ 𝑃 (𝝋proposed)) / (𝑃 (𝜽current) ⋅ 𝑃 (𝝋current)), 1)
To keep things computationally simple, we use a diagonal mass matrix 𝑴. This makes
the diagonal elements (components) of 𝝋 independent, each one having a normal
distribution:
𝜑𝑗 ∼ Normal(0, 𝑀𝑗𝑗)
HMC Algorithm
Define an initial set 𝜽⁰ ∈ ℝᵖ such that 𝑃 (𝜽⁰ | 𝒚) > 0
For each iteration 𝑡:
Sample 𝝋 from a Multivariate Normal(𝟎, 𝑴 )
Simultaneously sample 𝜽∗ and 𝝋 with 𝐿 leapfrog steps and step-size 𝜀, starting from the current value: 𝜽∗ ← 𝜽
for 1, 2, …, 𝐿:
• Use the posterior's log gradient at 𝜽∗ to produce a half-step of 𝝋: 𝝋 ← 𝝋 + (1/2) 𝜀 (d log 𝑃 (𝜽∗ | 𝒚) / d𝜃)
• Use 𝝋 to update 𝜽∗: 𝜽∗ ← 𝜽∗ + 𝜀 𝑴⁻¹ 𝝋
• Use the posterior's log gradient at 𝜽∗ again to produce another half-step of 𝝋: 𝝋 ← 𝝋 + (1/2) 𝜀 (d log 𝑃 (𝜽∗ | 𝒚) / d𝜃)
As an acceptance/rejection rule, compute:
𝑟 = (𝑃 (𝜽∗ | 𝒚) 𝑃 (𝝋∗)) / (𝑃 (𝜽𝑡−1 | 𝒚) 𝑃 (𝝋𝑡−1))
Assign 𝜽𝑡 = 𝜽∗ with probability min(𝑟, 1), otherwise 𝜽𝑡 = 𝜽𝑡−1
HMC Animation
The most famous and simple of these numerical integrators is the Euler method, where
we use a step-size 𝜀 to compute a numerical solution of system in a future time 𝑡 from
specific initial conditions.
lii
sometimes also called ℎ
liii
An excellent textbook for numerical and symplectic integrator is Iserles (2008).
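A leapfrog sketch for a standard normal target (unit mass matrix; step size and path length are illustrative choices): the Hamiltonian — potential plus kinetic energy — is nearly conserved along the trajectory, which is what keeps HMC's acceptance rate high.

```python
import numpy as np

def leapfrog(theta, phi, grad_logp, eps, L):
    # L leapfrog steps of size eps for Hamiltonian dynamics with M = I
    phi = phi + 0.5 * eps * grad_logp(theta)  # initial half-step of momentum
    for _ in range(L - 1):
        theta = theta + eps * phi             # full position step
        phi = phi + eps * grad_logp(theta)    # full momentum step
    theta = theta + eps * phi                 # last full position step
    phi = phi + 0.5 * eps * grad_logp(theta)  # final half-step of momentum
    return theta, phi

# Standard normal target: log P(theta) = -theta^2 / 2, gradient = -theta
grad = lambda t: -t
H = lambda t, p: 0.5 * t**2 + 0.5 * p**2  # potential + kinetic energy

theta0, phi0 = 1.0, 0.5
theta1, phi1 = leapfrog(theta0, phi0, grad, eps=0.1, L=20)
```

The state moves a long way through the sample space, yet the energy error stays of order 𝜀², so the acceptance ratio 𝑟 stays close to 1.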
No-U-Turn-Sampler (NUTS)
In HMC, we can adjust 𝜀 during the algorithm runtime. But, for 𝐿, we need to “dry run”
the HMC sampler to find a good candidate value for 𝐿.
Here is where the idea for No-U-Turn-Sampler (NUTS) (Hoffman and Gelman, 2011)
enters: you don’t need to adjust anything, just “press the button”.
No-U-Turn-Sampler (NUTS)
More specifically, we need a criterion that informs that we performed enough
Hamiltonian dynamics simulation.
In other words, simulating any further would not increase the distance between the
proposal 𝜽∗ and the current value 𝜽.
NUTS uses a criterion based on the dot product between the current momentum vector 𝝋
and the difference between the proposal vector 𝜽∗ and the current vector 𝜽, which turns
out to be the derivative with respect to time 𝑡 of half of the squared distance between 𝜽 and 𝜽∗:
(𝜽∗ − 𝜽) ⋅ 𝝋 = (𝜽∗ − 𝜽) ⋅ (d/d𝑡)(𝜽∗ − 𝜽) = (d/d𝑡) ((𝜽∗ − 𝜽) ⋅ (𝜽∗ − 𝜽)) / 2
No-U-Turn-Sampler (NUTS)
This suggests an algorithm that does not let proposals be guided indefinitely: the simulation
stops once the time derivative of the squared distance between the proposal 𝜽∗ and the
current 𝜽 becomes less than zero, i.e. once the trajectory starts to make a U-turn.
No-U-Turn-Sampler (NUTS)
NUTS uses the leapfrog integrator to create a binary tree where each leaf node is a proposal of the
momentum vector 𝝋, tracing both a forward (𝑡 + 1) and a backward (𝑡 − 1) path in a determined
fictitious time 𝑡.
The growing of the leaf nodes is interrupted when a U-turn is detected, either forward or
backward.
No-U-Turn-Sampler (NUTS)
NUTS also uses a procedure called Dual Averaging (Nesterov, 2009) to simultaneously
adjust 𝜀 and 𝐿 by considering the product 𝜀 ⋅ 𝐿.
Such adjustment is done during the warmup phase and the defined values of 𝜀 and 𝐿
are kept fixed during the sampling phase.
NUTS Algorithm
for 1, 2, …, 𝐿:
• Use the posterior's log gradient at 𝜽∗ to produce a half-step of 𝝋: 𝝋 ← 𝝋 + (1/2) 𝜀 (d log 𝑃 (𝜽∗ | 𝒚) / d𝜃)
• Use 𝝋 to update 𝜽∗: 𝜽∗ ← 𝜽∗ + 𝜀 𝑴⁻¹ 𝝋
• Use the posterior's log gradient at 𝜽∗ again to produce another half-step of 𝝋: 𝝋 ← 𝝋 + (1/2) 𝜀 (d log 𝑃 (𝜽∗ | 𝒚) / d𝜃)
Stop sampling 𝜽∗ in the direction 𝑣 and continue sampling only in the direction −𝑣 when the squared distance between the proposal vector 𝜽∗ and the current vector 𝜽 starts to shrink in the direction 𝑣: 𝑣 (d/d𝑡) ((𝜽∗ − 𝜽) ⋅ (𝜽∗ − 𝜽)) / 2 < 0
Stop sampling 𝜽∗ altogether when a U-turn is also detected in the direction −𝑣, or when 𝐿 steps have been reached
Assign 𝜽𝑡 = 𝜽∗ with probability min(𝑟, 1), otherwise 𝜽𝑡 = 𝜽𝑡−1
NUTS Animation
liv
very common in hierarchical models.
lv
remember that 𝐿 and 𝜀 are defined in the warmup phase and kept fixed during sampling.
This occurs often in hierarchical models, in the relationship between group-level priors and
population-level hyperpriors. Hence, we reparameterize in a non-centered way, changing the
posterior geometry to make life easier for our MCMC sampler:
𝑃 (𝑦̃, 𝑥̃) = Normal(𝑦̃ | 0, 1) ⋅ Normal(𝑥̃ | 0, 1)
𝑦 = 𝑦̃ ⋅ 3 + 0
𝑥 = 𝑥̃ ⋅ 𝑒^(𝑦/2) + 0
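A numerical sketch of this non-centered transformation (pure simulation, no MCMC): sampling the auxiliary standard normals and transforming them deterministically reproduces Neal's funnel, whose scales the sampler never has to explore directly.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Sample the auxiliary variables on the "nice" standard normal scale
y_tilde = rng.normal(size=n)
x_tilde = rng.normal(size=n)

# Deterministic transformation back to the funnel's original scale:
# y ~ Normal(0, 3) and x | y ~ Normal(0, e^(y / 2))
y = y_tilde * 3.0 + 0.0
x = x_tilde * np.exp(y / 2.0) + 0.0
```

The MCMC sampler only ever sees the well-behaved (𝑦̃, 𝑥̃) geometry; the funnel is recovered afterwards by the deterministic transform.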
𝒚 ∼ Normal(𝛼𝑗 + 𝑿 ⋅ 𝜷, 𝜎)
𝛼𝑗 = 𝑧𝑗 ⋅ 𝜏 + 𝛼
𝑧𝑗 ∼ Normal(0, 1)
𝛼 ∼ Normal(𝜇𝛼 , 𝜎𝛼 )
𝜷 ∼ Normal(𝜇𝜷 , 𝜎𝜷 )
𝜏 ∼ Cauchy+ (0, 𝜓𝛼 )
𝜎 ∼ Exponential(𝜆𝜎 )
𝒚 ∼ Normal(𝑿𝜷𝑗 , 𝜎)
𝜷𝑗 = 𝜸𝑗 ⋅ 𝚺 ⋅ 𝜸𝑗
𝜸𝑗 ∼ Multivariate Normal(𝟎, 𝑰) for 𝑗 ∈ {1, …, 𝐽 }
𝚺 ∼ LKJ(𝜂)
𝜎 ∼ Exponential(𝜆𝜎 )
Each coefficient vector 𝜷𝑗 represents the model columns 𝑿 coefficients for every group
𝑗 ∈ 𝐽 . Also the first column of 𝑿 could be a column filled with 1s (intercept).
lvi
for more information about how to change those values, see Section 15.2 of the Stan Reference Manual .
lvii
for more information about how to change those values, see Turing Documentation .
lviii
this property is not present on neural networks.
Convergence Metrics
We have some options on how to measure if the Markov chains converged to the target
distribution, i.e. if they are “reliable”:
• 𝑅̂ (Rhat): potential scale reduction factor, a metric to measure if the Markov chains
have mixed, and, potentially, converged.
where:
• 𝑚: number of Markov chains.
• 𝑛: total samples per Markov chain (discarding warmup).
• 𝜌̂𝑡 : an autocorrelation estimate.
var̂⁺(𝜓 | 𝑦) = ((𝑛 − 1)/𝑛) 𝑊 + (1/𝑛) 𝐵
Intuitively, the value is 1.0 if all chains are totally convergent.
̂ > 1.1, you need to worry because probably the chains have not converged
As a heuristic, if 𝑅
adequate.
Warning messages:
1: There were 275 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-
warmup
to find out why this is a problem and how to eliminate them.
2: Examine the pairs() plot to diagnose sampling problems
lix
also see Stan’s warnings guide.
Turing does not give warning messages! But you can check divergent transitions with
summarize(chn;
sections=[:internals]):
Summary Statistics
  parameters     mean      std  naive_se     mcse      ess     rhat  ess_per_sec
      Symbol  Float64  Float64   Float64  Float64  Float64  Float64      Float64
Acknowledge that both Stan's and Turing's NUTS samplers are very efficient and effective
in exploring even the craziest and most diverse target posterior densities.
And the standard settings, 2,000 iterations and 4 chains, work perfectly 99% of the
time.
— Gelman (2008)
lx
besides that, it may be worth doing a QR decomposition of the data matrix 𝑿, thus having an orthogonal (non-correlated) basis for the sampler to explore. This
makes the target distribution's geometry much friendlier, in the topological/geometrical sense, for the MCMC sampler to explore. Check the backup slides.
lxi
Stan’s default is 0.8 and Turing’s default is 0.65.
Recommended References
• Andrew Gelman, John B. Carlin, Stern, et al. (2013) - Chapter 7: Evaluating, comparing,
and expanding models
• Gelman, Hill and Vehtari (2020) - Chapter 11, Section 11.8: Cross validation
• McElreath (2020) - Chapter 7, Section 7.5: Model comparison
• Vehtari, Gelman and Gabry (2015)
• Spiegelhalter et al. (2002)
• Van Der Linde (2005)
• Watanabe and Opper (2010)
• Gelfand (1996)
• Watanabe and Opper (2010)
• Geisser and Eddy (1979)
After model parameters estimation, many times we want to measure its predictive
accuracy by itself, or for model comparison, model selection, or computing a model
performance metric (Geisser and Eddy, 1979).
There is an objective approach to compare Bayesian models which uses a robust metric
that helps us select the best model in a set of candidate models.
Having an objective way of comparing and choosing the best model is very important. In
the Bayesian workflow, we generally have several iterations between priors and
likelihood functions resulting in several different models (Gelman et al., 2020).
Historical Interlude
In the past, we did not have computational power and data abundance. Model comparison was done based on
a theoretical divergence metric originated from information theory’s entropy:
𝐻(𝑝) = −E[log(𝑝𝑖)] = − ∑_{𝑖=1}^{𝑁} 𝑝𝑖 log(𝑝𝑖)
We compute the divergence by multiplying the entropy by −2lxii, so lower values are preferable:

𝐷(𝑦, 𝜽) = −2 ∑_{𝑖=1}^{𝑁} log [(1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑃 (𝑦𝑖 | 𝜽^𝑠)]

where the summed term is the log pointwise predictive density (lppd).
lxii
historical reasons.
where 𝑘 is the number of the model’s free parameters and lppdmle is the maximum
likelihood estimate of the log pointwise predictive density.
DIC = 𝐷(𝑦, 𝜽) + 𝑘DIC = −2 lppdBayes + 2 (lppdBayes − (1/𝑆) ∑_{𝑠=1}^{𝑆} log 𝑃 (𝑦 | 𝜽^𝑠))

where the term in parentheses is the bias-corrected 𝑘.
DIC removes the restriction on uniform AIC priors, but still keeps the assumptions of the
posterior being a multivariate Gaussian/normal distribution and that 𝑁 ≫ 𝑘.
Predictive Accuracy
With current computational power, we do not need approximationslxiii.
lxiii
AIC, DIC etc.
Predictive Accuracy
Bayesian approaches measure predictive accuracy using posterior draws 𝑦̃ from the
model. For that we have the predictive posterior distribution:
𝑝(𝑦̃ | 𝑦) = ∫ 𝑝(𝑦̃𝑖 | 𝜃) 𝑝(𝜃 | 𝑦) d𝜃
Where 𝑝(𝜃 | 𝑦) is the model’s posterior distribution. The above equation means that we
evaluate the integral with respect to the whole joint probability of the model’s
predictive posterior distribution and posterior distribution.
Predictive Accuracy
To make samples comparable, we calculate the expectation of this measure for each one
of the 𝑁 sample observations:
elpd = ∑_{𝑖=1}^{𝑁} ∫ 𝑝𝑡(𝑦̃𝑖) log 𝑝(𝑦̃𝑖 | 𝑦) d𝑦̃
where elpd is the expected log pointwise predictive density, and 𝑝𝑡(̃𝑦𝑖 ) is the distribution
that represents the 𝑦̃𝑖 ’s true underlying data generating process.
The 𝑝𝑡(̃𝑦𝑖 ) are unknown and we generally use cross-validation or approximations to
estimate elpd.
which is the predictive density conditioned on the data without a single observation 𝑖 (𝑦₋𝑖): 𝑃 (𝑦̃𝑖 | 𝑦₋𝑖). Almost
always we use the PSIS-LOOlxiv approximation due to its robustness and low computational cost.
lxiv
upcoming…
which we can compute using the posterior variance of the log predictive density for each observation 𝑦𝑖 :
𝑝̂waic = ∑_{𝑖=1}^{𝑁} 𝑉_{𝑠=1}^{𝑆} (log 𝑝(𝑦𝑖 | 𝜃^𝑠))

where 𝑉_{𝑠=1}^{𝑆} is the sample variance:

𝑉_{𝑠=1}^{𝑆} 𝑎𝑠 = (1/(𝑆 − 1)) ∑_{𝑠=1}^{𝑆} (𝑎𝑠 − 𝑎̄)²
Contrary to LOO, we cannot approximate the actual elpd using 𝐾-fold CV; we need to
compute the actual elpd over 𝐾 partitions, which almost always involves a high computational
cost.
lxv
another class of MCMC algorithm that we did not cover yet.
Importance Sampling
If the 𝑁 samples are conditionally independentlxvi (Gelfand, Dey and Chang, 1992), we
can compute LOO with the posterior samples 𝜽^𝑠 from 𝑃 (𝜃 | 𝑦) using importance weights:
𝑟𝑖^𝑠 = 1 / 𝑃 (𝑦𝑖 | 𝜃^𝑠) ∝ 𝑃 (𝜃^𝑠 | 𝑦₋𝑖) / 𝑃 (𝜃^𝑠 | 𝑦)

𝑃 (𝑦̃𝑖 | 𝑦₋𝑖) ≈ (∑_{𝑠=1}^{𝑆} 𝑟𝑖^𝑠 𝑃 (𝑦̃𝑖 | 𝜃^𝑠)) / (∑_{𝑠=1}^{𝑆} 𝑟𝑖^𝑠)
lxvi
that is, they are independent if conditioned on the model’s parameters, which is a basic assumption in any Bayesian (and frequentist) model
Importance Sampling
However, the posterior 𝑃 (𝜃 | 𝑦) often has lower variance and shorter tails than the LOO
distributions 𝑃 (𝜃 | 𝑦₋𝑖). Hence, if we use:

𝑃 (𝑦̃𝑖 | 𝑦₋𝑖) ≈ (∑_{𝑠=1}^{𝑆} 𝑟𝑖^𝑠 𝑃 (𝑦̃𝑖 | 𝜃^𝑠)) / (∑_{𝑠=1}^{𝑆} 𝑟𝑖^𝑠)

we will have instabilities, because the 𝑟𝑖 can have high, or even infinite, variance.
When the tails of the importance weights' distribution are long, a direct usage of the
importance weights is sensitive to one or more large values. By fitting a generalized Pareto
distribution to the importance weights' upper tail, we smooth out these values.
Any ̂𝑘 > 0.5 is a warning sign, but empirically there is still a good performance up to ̂𝑘 < 0.7.
We know that in the binomial: 𝐸 = 𝑛𝑝 and Var = 𝑛𝑝𝑞; hence replacing 𝐸 by 𝜇 and Var by 𝜎²:

lim_{𝑛→∞} (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^(𝑛−𝑘) = (1/(𝜎√(2𝜋))) 𝑒^(−(𝑘−𝜇)²/(2𝜎²))
lxvii
Origins can be traced back to Abraham de Moivre in 1738. A better explanation can be found by clicking here.
QR Decomposition
In Linear Algebra 101, we learn that any matrix (even non-square ones) can be decomposed into a product of two matrices:
• 𝑸: an orthogonal matrix (its columns are orthogonal unit vectors, i.e. 𝑸𝑇 = 𝑸−1 )
• 𝑹: an upper-triangular matrix
Now, we incorporate the QR decomposition into the linear regression model. Here, I am going to use the “thin” QR instead of the “fat” one,
which scales the 𝑸 and 𝑹 matrices by a factor of √(𝑛 − 1), where 𝑛 is the number of rows in 𝑿. In practice, it is better to implement the thin
QR decomposition than the fat one: it is more numerically stable. Mathematically speaking, the thin QR decomposition is:

𝑿 = 𝑸∗ 𝑹∗
𝑸∗ = 𝑸 ⋅ √(𝑛 − 1)
𝑹∗ = (1/√(𝑛 − 1)) ⋅ 𝑹
𝝁=𝛼+𝑿⋅𝜷+𝜎
= 𝛼 + 𝑸∗ ⋅ 𝑹 ∗ ⋅ 𝜷 + 𝜎
= 𝛼 + 𝑸∗ ⋅ (𝑹 ∗ ⋅ 𝜷) + 𝜎
= 𝛼 + 𝑸∗ ⋅ 𝜷̃ + 𝜎
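A numerical check of the thin QR rescaling (random data with assumed shapes): the √(𝑛 − 1) factors cancel, so 𝑿 is recovered exactly, and the columns of 𝑸∗ are orthogonal, which is what gives the sampler an uncorrelated basis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))

# Thin (reduced) QR decomposition: Q is n x k, R is k x k upper-triangular
Q, R = np.linalg.qr(X, mode="reduced")

# Rescale by sqrt(n - 1) as in the slides
Q_star = Q * np.sqrt(n - 1)
R_star = R / np.sqrt(n - 1)

# The scaling factors cancel, so X = Q_star @ R_star exactly
X_rec = Q_star @ R_star
```

After sampling on 𝜷̃, the original coefficients are recovered as 𝜷 = 𝑹∗⁻¹ 𝜷̃.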
Bibliography
Akaike, H. (1973) “Information theory and an extension of the maximum likelihood
principle,” Second International Symposium on Information Theory. Edited by B. N.
Petrov and F. Csaki
Amrhein, V., Greenland, S. and McShane, B. (2019) “Scientists Rise up against Statistical
Significance,” Nature, 567(7748), pp. 305–307. Available at: https://doi.org/10.1038/d41586-
019-00857-9
Bates, D. et al. (2022) JuliaStats/MixedModels.jl. Zenodo. Available at: https://doi.org/10.
5281/ZENODO.6925652
Bates, D. et al. (2015) “Fitting Linear Mixed-Effects Models Using lme4,” Journal of
Statistical Software, 67(1), pp. 1–48. Available at: https://doi.org/10.18637/jss.v067.i01
Bürkner, P.-C. and Vuorre, M. (2019) “Ordinal Regression Models in Psychology: A Tutorial,”
Advances in Methods and Practices in Psychological Science, 2(1), p. 77–78. Available at:
https://doi.org/10.1177/2515245918823199
Carpenter, B. et al. (2017) “Stan : A Probabilistic Programming Language,” Journal of
Statistical Software, 76(1). Available at: https://doi.org/10.18637/jss.v076.i01
Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009) “Handling sparsity via the horseshoe,”
in Artificial intelligence and statistics, pp. 73–80
Casella, G. and George, E. I. (1992) “Explaining the Gibbs Sampler,” The American
Statistician, 46(3), pp. 167–174. Available at: https://doi.org/10.1080/00031305.1992.
10475878
Eckhardt, R. (1987) “Stan Ulam, John von Neumann, and the Monte Carlo Method,” Los
Alamos Science, 15(30), pp. 131–136
Finetti, B. de (1974) Theory of Probability. Volume 1 ed. New York: John Wiley & Sons
Fisher, R. A. (1925) Statistical methods for research workers. Oliver, Boyd
Fisher, R. A. (1962) “Some Examples of Bayes' Method of the Experimental Determination
of Probabilities A Priori,” Journal of the Royal Statistical Society. Series B
(Methodological), 24(1), pp. 118–124
Ge, H., Xu, K. and Ghahramani, Z. (2018) “Turing: A Language for Flexible Probabilistic
Inference,” in International Conference on Artificial Intelligence and Statistics. PMLR, pp.
1682–1690
Gelman, A. (2008) The Folk Theorem of Statistical Computing. Available at: https://statmodeling.stat.columbia.edu/2008/05/13/the_folk_theore/
Gelman, A. and Hill, J. (2007) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press
Gelman, A., Carlin, J. B., Stern, H. S., et al. (2013) “Basics of Markov Chain Simulation,” in Bayesian Data Analysis. Chapman and Hall/CRC
Gelman, A., Carlin, J. B., Stern, H. S., et al. (2013) Bayesian Data Analysis. Chapman and Hall/CRC
Gelman, A., Hill, J. and Vehtari, A. (2020) Regression and Other Stories. Cambridge
University Press
Hastings, W. K. (1970) “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika, 57(1), pp. 97–109. Available at: https://doi.org/10.1093/biomet/57.1.97
Head, M. L. et al. (2015) “The extent and consequences of p-hacking in science,” PLoS Biology, 13(3), p. e1002106
Hoffman, M. D. and Gelman, A. (2011) “The No-U-Turn Sampler: Adaptively Setting Path
Lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15(1), pp.
1593–1623. Available at: http://arxiv.org/abs/1111.4246
Ioannidis, J. P. A. (2019) “What Have We (Not) Learnt from Millions of Scientific Papers with P Values?,” The American Statistician, 73(sup1), pp. 20–25. Available at: https://doi.org/10.1080/00031305.2018.1447512
Iserles, A. (2008) A First Course in the Numerical Analysis of Differential Equations. 2nd
ed. USA: Cambridge University Press
Jaynes, E. T. (2003) Probability Theory: The Logic of Science. Cambridge University Press
Khan, M. E. and Rue, H. (2021) The Bayesian Learning Rule. Available at: http://arxiv.org/abs/2107.04562 (Accessed: July 13, 2021)
Kolmogorov, A. N. (1933) Foundations of the Theory of Probability. Berlin: Julius Springer
Neal, R. M. (1994) “An Improved Acceptance Procedure for the Hybrid Monte Carlo Algorithm,” Journal of Computational Physics, 111(1), pp. 194–203. Available at: https://doi.org/10.1006/jcph.1994.1054
Neal, R. M. (2003) “Slice Sampling,” The Annals of Statistics, 31(3), pp. 705–741
Neal, R. M. (2011) “MCMC Using Hamiltonian Dynamics,” Handbook of Markov Chain Monte
Carlo. Edited by S. Brooks et al.
Nesterov, Y. (2009) “Primal-dual subgradient methods for convex problems,” Mathematical Programming, 120(1), pp. 221–259
Spiegelhalter, D. J. et al. (2002) “Bayesian measures of model complexity and fit,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), pp. 583–639
Storopoli, J. (2021) Bayesian Statistics with Julia and Turing. Available at: https://storopoli.io/Bayesian-Julia
Tibshirani, R. (1996) “Regression shrinkage and selection via the lasso,” Journal of the
Royal Statistical Society Series B: Statistical Methodology, 58(1), pp. 267–288
Vehtari, A., Gelman, A. and Gabry, J. (2015) Practical Bayesian Model Evaluation Using Leave-One-out Cross-Validation and WAIC. Available at: https://doi.org/10.1007/s11222-016-9696-4
Zhang, Y. D. et al. (2022) “Bayesian regression using a prior on the model fit: The R2-D2 shrinkage prior,” Journal of the American Statistical Association, 117(538), pp. 862–874
Zou, H. and Hastie, T. (2005) “Regularization and variable selection via the elastic net,”
Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2), pp. 301–
320
“It’s Time to Talk about Ditching Statistical Significance” (2019) Nature, 567(7748), pp. 283–284. Available at: https://doi.org/10.1038/d41586-019-00874-8