0% found this document useful (0 votes)

10 views31 pages

Stat Cookbook

Uploaded by

DANIELA ORELLANA

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views31 pages

Stat Cookbook

Uploaded by

DANIELA ORELLANA

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

Probability and Statistics

Cookbook

Version 0.1.2
30th October, 2015
http://statistics.zone/
Copyright c Matthias Vallentin, 2015
Contents 14 Exponential Family 16 21.5 Spectral Analysis . . . . . . . . . . . . . 28

1 Distribution Overview 3 15 Bayesian Inference 16 22 Math 29

1.1 Discrete Distributions . . . . . . . . . . 3 15.1 Credible Intervals . . . . . . . . . . . . . 16 22.1 Gamma Function . . . . . . . . . . . . . 29
1.2 Continuous Distributions . . . . . . . . 5 15.2 Function of parameters . . . . . . . . . . 17 22.2 Beta Function . . . . . . . . . . . . . . . 29
15.3 Priors . . . . . . . . . . . . . . . . . . . 17 22.3 Series . . . . . . . . . . . . . . . . . . . 29
2 Probability Theory 8 15.3.1 Conjugate Priors . . . . . . . . . 17 22.4 Combinatorics . . . . . . . . . . . . . . 30
15.4 Bayesian Testing . . . . . . . . . . . . . 18
3 Random Variables 8
3.1 Transformations . . . . . . . . . . . . . 9 16 Sampling Methods 18
16.1 Inverse Transform Sampling . . . . . . . 18
4 Expectation 9 16.2 The Bootstrap . . . . . . . . . . . . . . 18
16.2.1 Bootstrap Confidence Intervals . 18
5 Variance 9 16.3 Rejection Sampling . . . . . . . . . . . . 19
16.4 Importance Sampling . . . . . . . . . . . 19
6 Inequalities 10
17 Decision Theory 19
7 Distribution Relationships 10 17.1 Risk . . . . . . . . . . . . . . . . . . . . 19
17.2 Admissibility . . . . . . . . . . . . . . . 20
8 Probability and Moment Generating
17.3 Bayes Rule . . . . . . . . . . . . . . . . 20
Functions 11
17.4 Minimax Rules . . . . . . . . . . . . . . 20
9 Multivariate Distributions 11 18 Linear Regression 20
9.1 Standard Bivariate Normal . . . . . . . 11 18.1 Simple Linear Regression . . . . . . . . 20
9.2 Bivariate Normal . . . . . . . . . . . . . 11 18.2 Prediction . . . . . . . . . . . . . . . . . 21
9.3 Multivariate Normal . . . . . . . . . . . 11 18.3 Multiple Regression . . . . . . . . . . . 21
18.4 Model Selection . . . . . . . . . . . . . . 22
10 Convergence 11
10.1 Law of Large Numbers (LLN) . . . . . . 12 19 Non-parametric Function Estimation 22
10.2 Central Limit Theorem (CLT) . . . . . 12 19.1 Density Estimation . . . . . . . . . . . . 22
19.1.1 Histograms . . . . . . . . . . . . 23
11 Statistical Inference 12 19.1.2 Kernel Density Estimator (KDE) 23
11.1 Point Estimation . . . . . . . . . . . . . 12 19.2 Non-parametric Regression . . . . . . . 23
11.2 Normal-Based Confidence Interval . . . 13 19.3 Smoothing Using Orthogonal Functions 24
11.3 Empirical distribution . . . . . . . . . . 13
11.4 Statistical Functionals . . . . . . . . . . 13 20 Stochastic Processes 24
20.1 Markov Chains . . . . . . . . . . . . . . 24
12 Parametric Inference 13 20.2 Poisson Processes . . . . . . . . . . . . . 25
12.1 Method of Moments . . . . . . . . . . . 13
12.2 Maximum Likelihood . . . . . . . . . . . 14 21 Time Series 25
12.2.1 Delta Method . . . . . . . . . . . 14 21.1 Stationary Time Series . . . . . . . . . . 26 This cookbook integrates various topics in probability theory
12.3 Multiparameter Models . . . . . . . . . 14 21.2 Estimation of Correlation . . . . . . . . 26 and statistics, based on literature [1, 6, 3] and in-class material
12.3.1 Multiparameter delta method . . 15 21.3 Non-Stationary Time Series . . . . . . . 26 from courses of the statistics department at the University of
12.4 Parametric Bootstrap . . . . . . . . . . 15 21.3.1 Detrending . . . . . . . . . . . . 27 California in Berkeley but also influenced by others [4, 5]. If you
21.4 ARIMA models . . . . . . . . . . . . . . 27 find errors or have suggestions for improvements, please get in
13 Hypothesis Testing 15 21.4.1 Causality and Invertibility . . . . 28 touch at http://statistics.zone/.
1 Distribution Overview
1.1 Discrete Distributions
Notation1 FX (x) fX (x) E [X] V [X] MX (s)

0 x<a
(b − a + 1)2 − 1 eas − e−(b+1)s

bxc−a+1 I(a ≤ x ≤ b) a+b
Uniform Unif {a, . . . , b} a≤x≤b
 b−a b−a+1 2 12 s(b − a)
1 x>b

Bernoulli Bern (p) (1 − p)1−x px (1 − p)1−x p p(1 − p) 1 − p + pes
!
n x
Binomial Bin (n, p) I1−p (n − x, x + 1) p (1 − p)n−x np np(1 − p) (1 − p + pes )n
x
k k
!n
n! x
X X
Multinomial Mult (n, p) px1 1 · · · pkk xi = n npi npi (1 − pi ) pi e si
x1 ! . . . xk ! i=1 i=0
m N −m
!
x − np x n−x nm nm(N − n)(N − m)
Hypergeometric Hyp (N, m, n) ≈Φ N N 2 (N − 1)
p
np(1 − p) n
N
! r
x+r−1 r 1−p 1−p p
Negative Binomial NBin (r, p) Ip (r, x + 1) p (1 − p)x r r
r−1 p p2 1 − (1 − p)es
1 1−p pes
Geometric Geo (p) 1 − (1 − p)x x ∈ N+ p(1 − p)x−1 x ∈ N+
p p2 1 − (1 − p)es
x
X λi λx e−λ s
Poisson Po (λ) e−λ λ λ eλ(e −1)

i=0
i! x!

1 We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and Ix to refer to the Beta functions (see §22.2).

3
Uniform (discrete) Binomial Geometric Poisson
●
● n = 40, p = 0.3 0.8 ●
● p = 0.2 ● ●
● λ=1
● n = 30, p = 0.6 ● p = 0.5 ● λ=4
● n = 25, p = 0.9 ● p = 0.8 ● λ = 10
●
0.3
0.2 ● 0.6

●
0.2
PMF

PMF

PMF
1 ● ● ● ●
● ●
● ● ● ● ● ● ● 0.4 ●
n ●
● ● ●
●
● ●
0.1
● ●
● ● ● ● ●
● ●
● 0.1 ●
● 0.2 ● ●
● ● ● ● ●
● ●
● ● ●
● ●
● ● ●
● ● ● ● ● ●
● ● ●
● ● ● ● ● ●
● ● ● ● ● ● ● ●
● ●●● ● ● ● ●
0.0 ●●●● ●●
●●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●● 0.0 ● ●
●
● ●
● ● ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

x x x x
Uniform (discrete) Binomial Geometric Poisson
1 ● 1.00 ●●●●●●●●●●●●●●●●
● ● ●●●●●●●●●●●●●●●●●●●●●
●● 1.0 ● ● ● ● ●
● ● ● ● ● 1.00 ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ●
● ● ●
● ●
● ● ● ●
● ● ● ●
● ● ● ● ●
● ●
● ●
● ●
● ●
●
●
● ● ●
0.75 ●
0.8 ● ● 0.75 ●
● ● ●
i ●
●
● ●
n ●
● ● ●
●
CDF

CDF

CDF
0.50 0.6 ● 0.50
● ●
● ●
●
● ●
i ●
● ● ●
n ●
●
0.25 ● 0.4 0.25 ●
●
● ●
●
●
● ● ● ● n = 40, p = 0.3 ● p = 0.2 ● ● λ=1
● ● ●
● n = 30, p = 0.6 ● p = 0.5 ●
● λ=4
●
0 ● 0.00 ●●●● ●
●
●●
●●●●●●●●●●●●●●●●●
●●●●●●●●●● ● ● n = 25, p = 0.9 0.2 ● ● p = 0.8 0.00
●
● ● ● ● ● λ = 10

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

x x x x

4
1.2 Continuous Distributions
Notation FX (x) fX (x) E [X] V [X] MX (s)

0 x<a
(b − a)2 esb − esa

x−a I(a < x < b) a+b
Uniform Unif (a, b) a<x<b
 b−a b−a 2 12 s(b − a)
1 x>b

(x − µ)2
Z x
σ 2 s2

1
N µ, σ 2 σ2

Normal Φ(x) = φ(t) dt φ(x) = √ exp − µ exp µs +
−∞ σ 2π 2σ 2 2
(ln x − µ)2

1 1 ln x − µ 1 2 2 2
ln N µ, σ 2 eµ+σ /2
(eσ − 1)e2µ+σ

Log-Normal + erf √ √ exp −
2 2 2σ 2 x 2πσ 2 2σ 2

1 T
Σ−1 (x−µ) 1
Multivariate Normal MVN (µ, Σ) (2π)−k/2 |Σ|−1/2 e− 2 (x−µ) µ Σ exp µT s + sT Σs
2
−(ν+1)/2 ( ν
Γ ν+1

ν ν
2 x2 ν−2
ν>2
Student’s t Student(ν) Ix , √ 1 + 0
νπΓ ν2

2 2 ν ∞ 1<ν≤2

1 k x 1
Chi-square χ2k γ , xk/2−1 e−x/2 k 2k (1 − 2s)−k/2 s < 1/2
Γ(k/2) 2 2 2k/2 Γ(k/2)
r
d
(d1 x)d1 d2 2
2d22 (d1 + d2 − 2)

d1 d1 (d1 x+d2 )d1 +d2 d2
F F(d1 , d2 ) I d1 x , d1 d1 d2 − 2 d1 (d2 − 2)2 (d2 − 4)

d1 x+d2 2 2 xB 2
, 2
1 −x/β 1
Exponential Exp (β) 1 − e−x/β e β β2 (s < 1/β)
β 1 − βs
α
γ(α, x/β) 1 1
Gamma Gamma (α, β) xα−1 e−x/β αβ αβ 2 (s < 1/β)
Γ(α) Γ (α) β α 1 − βs
Γ α, βx

β α −α−1 −β/x β β2 2(−βs)α/2 p
Inverse Gamma InvGamma (α, β) x e α>1 α>2 Kα −4βs
Γ (α) Γ (α) α−1 (α − 1)2 (α − 2) Γ(α)
P
k
Γ i=1 αi Y α −1
k
αi E [Xi ] (1 − E [Xi ])
Dirichlet Dir (α) Qk xi i Pk Pk
i=1 Γ (αi ) i=1 i=1 αi i=1 αi + 1
∞ k−1
!
Γ (α + β) α−1 α αβ X Y α+r sk
Beta Beta (α, β) Ix (α, β) x (1 − x)β−1 1+
Γ (α) Γ (β) α+β (α + β)2 (α + β + 1) r=0
α+β+r k!
k=1
∞ n n
k k x k−1 −(x/λ)k 1 2 X s λ n
Weibull Weibull(λ, k) 1 − e−(x/λ) e λΓ 1 + λ2 Γ 1 + − µ2 Γ 1+
λ λ k k n=0
n! k
x α
m xα αxm xα
Pareto Pareto(xm , α) 1− x ≥ xm m
α α+1 x ≥ xm α>1 m
α>2 α(−xm s)α Γ(−α, −xm s) s < 0
x x α−1 (α − 1)2 (α − 2)

5
Uniform (continuous) Normal Log−Normal Student's t
2.0 1.00 0.4
µ = 0, σ2 = 0.2 µ = 0, σ2 = 3 ν=1
µ = 0, σ2 = 1 µ = 2, σ2 = 2 ν=2
µ = 0, σ2 = 5 µ = 0, σ2 = 1 ν=5
ν=∞
µ = −2, σ2 = 0.5 µ = 0.5, σ2 = 1
µ = 0.25, σ2 = 1
1.5 0.75 µ = 0.125, σ2 = 1 0.3
PDF

PDF

PDF
1
● ● 1.0 0.50 0.2
b−a

0.5 0.25 0.1

● ● 0.0 0.00 0.0

a b −5.0 −2.5 0.0 2.5 5.0 0 1 2 3 −5.0 −2.5 0.0 2.5 5.0
x x x x
χ 2 F Exponential Gamma
d1 = 1, d2 = 1 2.0 β=2 2.0 α = 1, β = 2
1.00 k=1 3 d1 = 2, d2 = 1 β=1 α = 2, β = 2
k=2 d1 = 5, d2 = 2 β = 0.4 α = 3, β = 2
k=3 d1 = 100, d2 = 1 α = 5, β = 1
k=4 d1 = 100, d2 = 100 α = 9, β = 0.5
k=5
1.5 1.5
0.75

2
PDF

PDF

PDF
PDF

0.50 1.0 1.0

1
0.25 0.5 0.5

0.00 0 0.0 0.0

0 2 4 6 8 0 1 2 3 4 5 0 1 2 3 4 5 0 5 10 15 20
x x x x
Inverse Gamma Beta Weibull Pareto
α = 1, β = 1 5 α = 0.5, β = 0.5 2.0 λ = 1, k = 0.5 4 xm = 1, k = 1
α = 2, β = 1 α = 5, β = 1 λ = 1, k = 1 xm = 1, k = 2
α = 3, β = 1 α = 1, β = 3 λ = 1, k = 1.5 xm = 1, k = 4
4 α = 3, β = 0.5 α = 2, β = 2 λ = 1, k = 5
4 α = 2, β = 5
1.5 3

3
3
PDF

PDF

PDF
1.0 2
2
2

0.5 1
1 1

0 0 0.0 0

0 1 2 3 4 5 0.00 0.25 0.50 0.75 1.00 0.0 0.5 1.0 1.5 2.0 2.5 1.0 1.5 2.0 2.5
x x x x
6
Uniform (continuous) Normal Log−Normal Student's t
1 1.00 1.00
µ = 0, σ2 = 3
µ = 2, σ2 = 2
0.75 µ = 0, σ2 = 1
µ = 0.5, σ2 = 1
µ = 0.25, σ2 = 1
0.75 µ = 0.125, σ2 = 1 0.75

0.50
CDF

CDF

CDF
0.50 0.50

0.25
0.25 0.25

µ = 0, σ = 0.2
2
ν=1
µ = 0, σ2 = 1 ν=2
µ = 0, σ2 = 5 ν=5
0 0.00 µ = −2, σ2 = 0.5 0.00 0.00 ν=∞

a b −5.0 −2.5 0.0 2.5 5.0 0 1 2 3 −5.0 −2.5 0.0 2.5 5.0
x x x x
χ 2 F Exponential Gamma
1.00 1.00 1.00
1.00

0.75 0.75 0.75 0.75

CDF

CDF
CDF

0.50 0.50 0.50 0.50

0.25 0.25 0.25 0.25

k=1 d1 = 1, d2 = 1 α = 1, β = 2
k=2 d1 = 2, d2 = 1 α = 2, β = 2
k=3 d1 = 5, d2 = 2 β=2 α = 3, β = 2
k=4 d1 = 100, d2 = 1 β=1 α = 5, β = 1
0.00 k=5 0.00 d1 = 100, d2 = 100 0.00 β = 0.4 0.00 α = 9, β = 0.5

0 2 4 6 8 0 1 2 3 4 5 0 1 2 3 4 5 0 5 10 15 20
x x x x
Inverse Gamma Beta Weibull Pareto
1.00 1.00
1.00 1.00 α = 0.5, β = 0.5
α = 5, β = 1
α = 1, β = 3
α = 2, β = 2
α = 2, β = 5
0.75 0.75 0.75 0.75
CDF

CDF

CDF
0.50 0.50 0.50 0.50

0.25 0.25 0.25 0.25

α = 1, β = 1 λ = 1, k = 0.5
α = 2, β = 1 λ = 1, k = 1 xm = 1, k = 1
α = 3, β = 1 λ = 1, k = 1.5 xm = 1, k = 2
0.00 α = 3, β = 0.5 0.00 0.00 λ = 1, k = 5 0.00 xm = 1, k = 4

0 1 2 3 4 5 0.00 0.25 0.50 0.75 1.00 0.0 0.5 1.0 1.5 2.0 2.5 1.0 1.5 2.0 2.5
x x x x
7
2 Probability Theory Law of Total Probability
n n
Definitions X G
P [B] = P [B|Ai ] P [Ai ] Ω= Ai
• Sample space Ω i=1 i=1

• Outcome (point or element) ω ∈ Ω Bayes’ Theorem

• Event A ⊆ Ω n
• σ-algebra A P [B | Ai ] P [Ai ] G
P [Ai | B] = Pn Ω= Ai
1. ∅ ∈ A j=1 P [B | Aj ] P [Aj ] i=1
S∞
2. A1 , A2 , . . . , ∈ A =⇒ i=1 Ai ∈ A Inclusion-Exclusion Principle
3. A ∈ A =⇒ ¬A ∈ A
n
[ n
X X r
\
• Probability Distribution P Ai = (−1)r−1 Ai j
1. P [A] ≥ 0 ∀A i=1 r=1 i≤i1 <···<ir ≤n j=1

2. P [Ω] = 1
"∞ #
G ∞
X 3 Random Variables
3. P Ai = P [Ai ]
i=1 i=1 Random Variable (RV)
• Probability space (Ω, A, P) X:Ω→R
Probability Mass Function (PMF)
Properties
• P [∅] = 0 fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}]
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B) Probability Density Function (PDF)
• P [¬A] = 1 − P [A]
Z b
• P [B] = P [A ∩ B] + P [¬A ∩ B]
P [a ≤ X ≤ b] = f (x) dx
• P [Ω] = 1 P [∅] = 0 a
S T T S
• ¬( n An ) = n ¬An ¬( n An ) = n ¬An DeMorgan
S T Cumulative Distribution Function (CDF)
• P [ n An ] = 1 − P [ n ¬An ]
• P [A ∪ B] = P [A] + P [B] − P [A ∩ B] FX : R → [0, 1] FX (x) = P [X ≤ x]
=⇒ P [A ∪ B] ≤ P [A] + P [B] 1. Nondecreasing: x1 < x2 =⇒ F (x1 ) ≤ F (x2 )
• P [A ∪ B] = P [A ∩ ¬B] + P [¬A ∩ B] + P [A ∩ B] 2. Normalized: limx→−∞ = 0 and limx→∞ = 1
• P [A ∩ ¬B] = P [A] − P [A ∩ B] 3. Right-Continuous: limy↓x F (y) = F (x)
Continuity of Probabilities
S∞ Z b
• A1 ⊂ A2 ⊂ . . . =⇒ limn→∞ P [An ] = P [A] where A = i=1 Ai P [a ≤ Y ≤ b | X = x] = fY |X (y | x)dy a≤b
T∞
• A1 ⊃ A2 ⊃ . . . =⇒ limn→∞ P [An ] = P [A] where A = i=1 Ai a

f (x, y)
Independence ⊥
⊥ fY |X (y | x) =
A⊥
⊥ B ⇐⇒ P [A ∩ B] = P [A] P [B] fX (x)
Independence
Conditional Probability
1. P [X ≤ x, Y ≤ y] = P [X ≤ x] P [Y ≤ y]
P [A ∩ B]
P [A | B] = P [B] > 0 2. fX,Y (x, y) = fX (x)fY (y)
P [B] 8
Z
3.1 Transformations • E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y)
X,Y
Transformation function
• E [ϕ(Y )] 6= ϕ(E [X]) (cf. Jensen inequality)
Z = ϕ(X)
• P [X ≥ Y ] = 1 =⇒ E [X] ≥ E [Y ]
Discrete • P [X = Y ] = 1 ⇐⇒ E [X] = E [Y ]
X ∞
fZ (z) = P [ϕ(X) = z] = P [{x : ϕ(x) = z}] = P X ∈ ϕ−1 (z) =

fX (x)
X
• E [X] = P [X ≥ x]
x∈ϕ−1 (z) x=1

Continuous Sample mean

n
Z 1X
X̄n = Xi
FZ (z) = P [ϕ(X) ≤ z] = f (x) dx with Az = {x : ϕ(x) ≤ z} n i=1
Az
Conditional expectation
Special case if ϕ strictly monotone Z
d −1 dx 1 • E [Y | X = x] = yf (y | x) dy
fZ (z) = fX (ϕ−1 (z)) ϕ (z) = fX (x) = fX (x)
dz dz |J| • E [X] = E [E [X | Y ]]
Z ∞
The Rule of the Lazy Statistician • Eϕ(X,Y ) | X=x [=] ϕ(x, y)fY |X (y | x) dx
−∞
Z Z ∞
E [Z] = ϕ(x) dFX (x) • E [ϕ(Y, Z) | X = x] = ϕ(y, z)f(Y,Z)|X (y, z | x) dy dz
−∞
Z Z • E [Y + Z | X] = E [Y | X] + E [Z | X]
E [IA (x)] = IA (x) dFX (x) = dFX (x) = P [X ∈ A] • E [ϕ(X)Y | X] = ϕ(X)E [Y | X]
A
• EY | X [=] c =⇒ Cov [X, Y ] = 0
Convolution
Z ∞ Z z
X,Y ≥0
• Z := X + Y fZ (z) = fX,Y (x, z − x) dx = fX,Y (x, z − x) dx
−∞ 0 5 Variance
Z ∞
• Z := |X − Y | fZ (z) = 2 fX,Y (x, z + x) dx Definition and properties
0
Z ∞ Z ∞ 2
2
X ⊥⊥ • V [X] = σX = E (X − E [X])2 = E X 2 − E [X]
• Z := fZ (z) = |x|fX,Y (x, xz) dx = xfx (x)fX (x)fY (xz) dx " n # n
Y −∞ −∞ X X X
• V Xi = V [Xi ] + 2 Cov [Xi , Yj ]
i=1 i=1 i6=j
4 Expectation " n
X
# n
X
• V Xi = V [Xi ] if Xi ⊥
⊥ Xj
Definition and properties
i=1 i=1
X

 xfX (x) X discrete Standard deviation p
sd[X] = V [X] = σX

Z  x

• E [X] = µX = x dFX (x) =

 Z Covariance
 xfX (x) dx X continuous


• Cov [X, Y ] = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ]
• P [X = c] = 1 =⇒ E [X] = c • Cov [X, a] = 0
• E [cX] = c E [X] • Cov [X, X] = V [X]
• E [X + Y ] = E [X] + E [Y ] • Cov [X, Y ] = Cov [Y, X]
9
• Cov [aX, bY ] = abCov [X, Y ] 7 Distribution Relationships
• Cov [X + a, Y + b] = Cov [X, Y ]

n m

n X m
Binomial
X X X
n
• Cov  Xi , Yj  = Cov [Xi , Yj ] X
i=1 j=1 i=1 j=1
• Xi ∼ Bern (p) =⇒ Xi ∼ Bin (n, p)
i=1
Correlation • X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p)
Cov [X, Y ]
ρ [X, Y ] = p • limn→∞ Bin (n, p) = Po (np) (n large, p small)
V [X] V [Y ] • limn→∞ Bin (n, p) = N (np, np(1 − p)) (n large, p far from 0 and 1)
Independence
Negative Binomial
X⊥
⊥ Y =⇒ ρ [X, Y ] = 0 ⇐⇒ Cov [X, Y ] = 0 ⇐⇒ E [XY ] = E [X] E [Y ]
• X ∼ NBin (1, p) = Geo (p)
Pr
Sample variance • X ∼ NBin (r, p) = i=1 Geo (p)
n P P
1 X • Xi ∼ NBin (ri , p) =⇒ Xi ∼ NBin ( ri , p)
S2 = (Xi − X̄n )2
n − 1 i=1 • X ∼ NBin (r, p) . Y ∼ Bin (s + r, p) =⇒ P [X ≤ s] = P [Y ≥ r]
Conditional variance Poisson
2 n n
!
• V [Y | X] = E (Y − E [Y | X])2 | X = E Y 2 | X − E [Y | X] X X
• Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi ∼ Po λi
• V [Y ] = E [V [Y | X]] + V [E [Y | X]]
i=1 i=1
 
n n
X X λi
6 Inequalities • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi Xj ∼ Bin  Xj , Pn 
j=1 j=1 j=1 λj
Cauchy-Schwarz
2 Exponential
E [XY ] ≤ E X 2 E Y 2

n
X
Markov • Xi ∼ Exp (β) ∧ Xi ⊥
⊥ Xj =⇒ Xi ∼ Gamma (n, β)
E [ϕ(X)]
P [ϕ(X) ≥ t] ≤ i=1
t • Memoryless property: P [X > x + y | X > y] = P [X > x]
Chebyshev
V [X] Normal
P [|X − E [X]| ≥ t] ≤
t2
X−µ

Chernoff • X ∼ N µ, σ 2 =⇒ σ ∼ N (0, 1)
δ

e
• X ∼ N µ, σ ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2
2

P [X ≥ (1 + δ)µ] ≤ δ > −1
(1 + δ)1+δ
• Xi ∼ N µi , σi2 ∧ Xi ⊥⊥ Xj =⇒
P
Xi ∼ N
P
µi , i σi2
P
i i
Hoeffding
• P [a < X ≤ b] = Φ b−µ − Φ a−µ

σ σ
X1 , . . . , Xn independent ∧ P [Xi ∈ [ai , bi ]] = 1 ∧ 1 ≤ i ≤ n • Φ(−x) = 1 − Φ(x) φ0 (x) = −xφ(x) φ00 (x) = (x2 − 1)φ(x)
−1
2 • Upper quantile of N (0, 1): zα = Φ (1 − α)
P X̄ − E X̄ ≥ t ≤ e−2nt t > 0

Gamma
2n2 t2

P |X̄ − E X̄ | ≥ t ≤ 2 exp − Pn 2
t>0
i=1 (bi − ai ) • X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1)
Pα
Jensen • Gamma (α, β) ∼ i=1 Exp (β)
P P
E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex • Xi ∼ Gamma (αi , β) ∧ Xi ⊥
⊥ Xj =⇒ i Xi ∼ Gamma ( i αi , β)
10
Z ∞
Γ(α) 9.2 Bivariate Normal
• = xα−1 e−λx dx
λα 0
Let X ∼ N µx , σx2 and Y ∼ N µy , σy2 .
Beta
1 Γ(α + β) α−1 1 z
• xα−1 (1 − x)β−1 = x (1 − x)β−1 f (x, y) = exp −
2(1 − ρ2 )
p
B(α, β) Γ(α)Γ(β) 2πσx σy 1 − ρ2
B(α + k, β) α+k−1
E X k−1

• E Xk =
" 2 2 #
= x − µx

y − µy

x − µx

y − µy
B(α, β) α+β+k−1 z= + − 2ρ
• Beta (1, 1) ∼ Unif (0, 1) σx σy σx σy
Conditional mean and variance
8 Probability and Moment Generating Functions E [X | Y ] = E [X] + ρ
σX
(Y − E [Y ])
σY
• GX (t) = E tX |t| < 1 p
V [X | Y ] = σX 1 − ρ2
"∞ # ∞
X (Xt)i X E Xi
· ti

• MX (t) = GX (et ) = E eXt = E =
i=0
i! i=0
i!
• P [X = 0] = GX (0)
9.3 Multivariate Normal
• P [X = 1] = G0X (0) Covariance matrix Σ (Precision matrix Σ−1 )
(i)
GX (0)  
• P [X = i] = V [X1 ] · · · Cov [X1 , Xk ]
i! .. .. ..
• E [X] = G0X (1− ) Σ=
 
. . . 
(k)
• E X k = MX (0) Cov [Xk , X1 ] · · · V [Xk ]

X! (k)
• E = GX (1− ) If X ∼ N (µ, Σ),
(X − k)!
2

• V [X] = G00X (1− ) + G0X (1− ) − (G0X (1− )) −1/2 1
fX (x) = (2π)−n/2 |Σ| exp − (x − µ)T Σ−1 (x − µ)
d 2
• GX (t) = GY (t) =⇒ X = Y
Properties
9 Multivariate Distributions • Z ∼ N (0, 1) ∧ X = µ + Σ1/2 Z =⇒ X ∼ N (µ, Σ)
• X ∼ N (µ, Σ) =⇒ Σ−1/2 (X − µ) ∼ N (0, 1)
9.1 Standard Bivariate Normal • X ∼ N (µ, Σ) =⇒ AX ∼ N Aµ, AΣAT

p
Let X, Y ∼ N (0, 1) ∧ X ⊥
⊥ Z where Y = ρX + 1 − ρ2 Z

• X ∼ N (µ, Σ) ∧ kak = k =⇒ aT X ∼ N aT µ, aT Σa

Joint density
1 x2 + y 2 − 2ρxy

10 Convergence
f (x, y) = exp −
2(1 − ρ2 )
p
2π 1 − ρ2
Let {X1 , X2 , . . .} be a sequence of rv’s and let X be another rv. Let Fn denote
Conditionals the cdf of Xn and let F denote the cdf of X.
Types of Convergence
(Y | X = x) ∼ N ρx, 1 − ρ2 (X | Y = y) ∼ N ρy, 1 − ρ2

and
D
1. In distribution (weakly, in law): Xn → X
Independence
X⊥
⊥ Y ⇐⇒ ρ = 0 lim Fn (t) = F (t) ∀t where F continuous
n→∞ 11
P
2. In probability: Xn → X √
X̄n − µ n(X̄n − µ) D
Zn := q = →Z where Z ∼ N (0, 1)
(∀ε > 0) lim P [|Xn − X| > ε] = 0 σ
n→∞ V X̄n
as
3. Almost surely (strongly): Xn → X lim P [Zn ≤ z] = Φ(z) z∈R
n→∞
h i h i
P lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 CLT notations
n→∞ n→∞

qm
Zn ≈ N (0, 1)
4. In quadratic mean (L2 ): Xn → X
σ2

X̄n ≈ N µ,
lim E (Xn − X)2 = 0

n
n→∞
σ2

X̄n − µ ≈ N 0,
Relationships n
√ 2

qm P D n(X̄n − µ) ≈ N 0, σ
• Xn → X =⇒ Xn → X =⇒ Xn → X √
as
• Xn → X =⇒ Xn → X
P
n(X̄n − µ)
≈ N (0, 1)
D P
• Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X σ
P P P
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y
qm qm qm
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y Continuity correction
P P P
• Xn →X ∧ Yn → Y =⇒ Xn Yn → XY
P P
x + 12 − µ

• Xn →X =⇒ ϕ(Xn ) → ϕ(X)
D D
P X̄n ≤ x ≈ Φ √
• Xn → X =⇒ ϕ(Xn ) → ϕ(X) σ/ n
qm
• Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0
x − 12 − µ

qm
• X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X̄n → µ

P X̄n ≥ x ≈ 1 − Φ √
σ/ n
Slutzky’s Theorem
Delta method
D P D
• Xn → X and Yn → c =⇒ Xn + Yn → X + c
σ2 σ2

D P D 0 2
• Xn → X and Yn → c =⇒ Xn Yn → cX Yn ≈ N µ, =⇒ ϕ(Yn ) ≈ N ϕ(µ), (ϕ (µ))
D D D n n
• In general: Xn → X and Yn → Y =⇒
6 Xn + Yn → X + Y

10.1 Law of Large Numbers (LLN)

11 Statistical Inference
iid
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ. Let X1 , · · · , Xn ∼ F if not otherwise noted.
Weak (WLLN)
P
X̄n → µ n→∞ 11.1 Point Estimation
Strong (SLLN) • Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn )
as
X̄n → µ n→∞
h i
• bias(θbn ) = E θbn − θ
P
10.2 Central Limit Theorem (CLT) • Consistency: θbn → θ
• Sampling distribution: F (θbn )
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 . r h i
• Standard error: se(θn ) = V θbn
b
12
h i h i
• Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn 11.4 Statistical Functionals
• limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent • Statistical functional: T (F )
θbn − θ D • Plug-in estimator of θ = (F ): θbn = T (Fbn )
• Asymptotic normality: → N (0, 1) R
se • Linear functional: T (F ) = ϕ(x) dFX (x)
• Slutzky’s Theorem often lets us replace se(θbn ) by some (weakly) consis- • Plug-in estimator for linear functional:
tent estimator σ
bn . Z n
1X
T (Fbn ) = ϕ(x) dFbn (x) = ϕ(Xi )
11.2 Normal-Based Confidence Interval n i=1

b 2 . Let zα/2 = Φ−1 (1 − (α/2)), i.e., P Z > zα/2 = α/2

Suppose θbn ≈ N θ, se

b 2 =⇒ T (Fbn ) ± zα/2 se
• Often: T (Fbn ) ≈ N T (F ), se b
and P −zα/2 < Z < zα/2 = 1 − α where Z ∼ N (0, 1). Then
• pth quantile: F −1 (p) = inf{x : F (x) ≥ p}
Cn = θbn ± zα/2 se
b • µb = X̄n
n
1 X
11.3 Empirical distribution b2 =
• σ (Xi − X̄n )2
n − 1 i=1
1
Pn
Empirical Distribution Function (ECDF) n i=1 (Xi − µb)3
Pn • κ
b=
I(Xi ≤ x) b3
Pσ
Fn (x) = i=1
b n
i=1 (Xi − X̄n )(Yi − Ȳn )
n • ρb = qP qP
n 2 n 2
− i=1 (Yi − Ȳn )
(
1 Xi ≤ x i=1 (X i X̄n )
I(Xi ≤ x) =
0 Xi > x
Properties (for any fixed x) 12 Parametric Inference
h i
• E Fbn = F (x)

Let F = f (x; θ) : θ ∈ Θ be a parametric model with parameter space Θ ⊂ Rk
h i F (x)(1 − F (x)) and parameter θ = (θ1 , . . . , θk ).
• V Fbn =
n
F (x)(1 − F (x)) D 12.1 Method of Moments
• mse = →0
n
th
P
• Fbn → F (x) j moment Z
αj (θ) = E X j = xj dFX (x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn ∼ F )

2
P sup F (x) − Fn (x) > ε = 2e−2nε
b j th sample moment
x n
1X j
Nonparametric 1 − α confidence band for F α
bj = X
n i=1 i
L(x) = max{Fbn − n , 0}
Method of Moments estimator (MoM)
U (x) = min{Fbn + n , 1}
s
1

2 α1 (θ) = α
b1
= log α2 (θ) = α
2n α b2
.. ..
.=.
P [L(x) ≤ F (x) ≤ U (x) ∀x] ≥ 1 − α αk (θ) = α
bk
13
Properties of the MoM estimator • Equivariance: θbn is the mle =⇒ ϕ(θbn ) is the mle of ϕ(θ)
• θbn exists with probability tending to 1 • Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
P
• Consistency: θbn → θ ples. If θen is any other estimator, the asymptotic relative efficiency is:
p
• Asymptotic normality: 1. se ≈ 1/In (θ)
√ (θbn − θ) D
D
n(θb − θ) → N (0, Σ) → N (0, 1)
se
q
where Σ = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , b ≈ 1/In (θbn )
2. se
∂ −1
g = (g1 , . . . , gk ) and gj = ∂θ αj (θ)
(θbn − θ) D
→ N (0, 1)
se
b
12.2 Maximum Likelihood • Asymptotic optimality
Likelihood: Ln : Θ → [0, ∞) h i
V θbn
n
Y are(θen , θbn ) = h i ≤ 1
Ln (θ) = f (Xi ; θ) V θen
i=1
• Approximately the Bayes estimator
Log-likelihood
n
X 12.2.1 Delta Method
`n (θ) = log Ln (θ) = log f (Xi ; θ)
i=1 b where ϕ is differentiable and ϕ0 (θ) 6= 0:
If τ = ϕ(θ)
Maximum likelihood estimator (mle)
τn − τ ) D
(b
→ N (0, 1)
Ln (θbn ) = sup Ln (θ) se(b
b τ)
θ
where τb = ϕ(θ)
b is the mle of τ and
Score function
∂
s(X; θ) = log f (X; θ) b = ϕ0 (θ)
se b se(
b θbn )
∂θ
Fisher information
I(θ) = Vθ [s(X; θ)] 12.3 Multiparameter Models
In (θ) = nI(θ) Let θ = (θ1 , . . . , θk ) and θb = (θb1 , . . . , θbk ) be the mle.
Fisher information (exponential family)
∂ 2 `n ∂ 2 `n
Hjj = Hjk =
∂ ∂θ2 ∂θj ∂θk
I(θ) = Eθ − s(X; θ)
∂θ Fisher information matrix
Observed Fisher information 
Eθ [H11 ] ··· Eθ [H1k ]

n
In (θ) = −  .. .. ..
∂2 X
 
. . .
Inobs (θ) = −

log f (Xi ; θ)
∂θ2 i=1 Eθ [Hk1 ] · · · Eθ [Hkk ]

Properties of the mle Under appropriate regularity conditions

P
• Consistency: θbn → θ (θb − θ) ≈ N (0, Jn )
14
with Jn (θ) = In−1 . Further, if θbj is the j th component of θ, then • Critical value c
• Test statistic T
(θbj − θj ) D • Rejection region R = {x : T (x) > c}
→ N (0, 1)
se
bj • Power function β(θ) = P [X ∈ R]
h i • Power of a test: 1 − P [Type II error] = 1 − β = inf β(θ)
b 2j = Jn (j, j) and Cov θbj , θbk = Jn (j, k)
where se θ∈Θ1
• Test size: α = P [Type I error] = sup β(θ)
θ∈Θ0
12.3.1 Multiparameter delta method
Let τ = ϕ(θ1 , . . . , θk ) and let the gradient of ϕ be Retain H0 Reject H0
√
∂ϕ
  H0 true Type
√ I Error (α)
 ∂θ1  H1 true Type II Error (β) (power)
 . 
p-value
 .. 
∇ϕ =  
 ∂ϕ 
∂θk

• p-value = supθ∈Θ0 Pθ [T (X) ≥ T (x)] = inf α : T (x) ∈ Rα
Pθ [T (X ? ) ≥ T (X)]

• p-value = supθ∈Θ0 = inf α : T (X) ∈ Rα
Suppose ∇ϕ θ=θb
6= 0 and τb = ϕ(θ).
b Then, | {z }
1−Fθ (T (X)) since T (X ? )∼Fθ
τ − τ) D
(b
→ N (0, 1)
se(b
b τ)
p-value evidence
where r < 0.01 very strong evidence against H0
T
0.01 − 0.05 strong evidence against H0

se(b
b τ) = ∇ϕ
b Jbn ∇ϕ
b
0.05 − 0.1 weak evidence against H0
b and ∇ϕ
b = ∇ϕ > 0.1 little or no evidence against H0
and Jbn = Jn (θ) θ=θb
.
Wald test
12.4 Parametric Bootstrap
• Two-sided test
Sample from f (x; θbn ) instead of from Fbn , where θbn could be the mle or method
of moments estimator. θb − θ0
• Reject H0 when |W | > zα/2 where W =
se
b
• P |W | > zα/2 → α
13 Hypothesis Testing • p-value = Pθ0 [|W | > |w|] ≈ P [|Z| > |w|] = 2Φ(−|w|)

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1
Likelihood ratio test
Definitions

• Null hypothesis H0 supθ∈Θ Ln (θ) Ln (θbn )

• Alternative hypothesis H1 • T (X) = =
supθ∈Θ0 Ln (θ) Ln (θbn,0 )
• Simple hypothesis θ = θ0 k
• Composite hypothesis θ > θ0 or θ < θ0 iid
D
X
• λ(X) = 2 log T (X) → χ2r−q where Zi2 ∼ χ2k and Z1 , . . . , Zk ∼ N (0, 1)
• Two-sided test: H0 : θ = θ0 versus H1 : θ 6= θ0
i=1
• One-sided test: H0 : θ ≤ θ0 versus H1 : θ > θ0 • p-value = Pθ0 [λ(X) > λ(x)] ≈ P χ2r−q > λ(x)
15
Multinomial LRT Natural form

X1 Xk
• mle: pbn = ,..., fX (x | η) = h(x) exp {η · T(x) − A(η)}
n n
k
Y pbj Xj = h(x)g(η) exp {η · T(x)}
Ln (b
pn )
• T (X) = = = h(x)g(η) exp η T T(x)

Ln (p0 ) j=1
p0j
k
X pbj D
• λ(X) = 2 Xj log → χ2k−1 15 Bayesian Inference
j=1
p 0j

• The approximate size α LRT rejects H0 when λ(X) ≥ χ2k−1,α Bayes’ Theorem
Pearson Chi-square Test f (x | θ)f (θ) f (x | θ)f (θ)
f (θ | x) = =R ∝ Ln (θ)f (θ)
k f (xn ) f (x | θ)f (θ) dθ
X (Xj − E [Xj ])2
• T = where E [Xj ] = np0j under H0
j=1
E [Xj ] Definitions
D
• T → χ2k−1 • X n = (X1 , . . . , Xn )

• p-value = P χ2k−1 > T (x) • xn = (x1 , . . . , xn )
D
2
• Faster → Xk−1 than LRT, hence preferable for small n • Prior density f (θ)
• Likelihood f (xn | θ): joint density of the data
Independence testing Yn
In particular, X n iid =⇒ f (xn | θ) = f (xi | θ) = Ln (θ)
• I rows, J columns, X multinomial sample of size n = I ∗ J i=1
X
• mles unconstrained: pbij = nij • Posterior density f (θ | xn )
• Normalizing constant cn = f (xn ) = f (x | θ)f (θ) dθ
R
X
• mles under H0 : pb0ij = pbi· pb·j = Xni· n·j
PI PJ
nX
• Kernel: part of a density that depends Ron θ
• LRT: λ = 2 i=1 j=1 Xij log Xi· Xij·j θLn (θ)f (θ)
• Posterior mean θ̄n = θf (θ | xn ) dθ = R Ln (θ)f
R
PI PJ (X −E[X ])2 (θ) dθ
• PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij
D
• LRT and Pearson → χ2k ν, where ν = (I − 1)(J − 1) 15.1 Credible Intervals
Posterior interval
14 Exponential Family Z b
P [θ ∈ (a, b) | xn ] = f (θ | xn ) dθ = 1 − α
Scalar parameter a

fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} Equal-tail credible interval

= h(x)g(θ) exp {η(θ)T (x)} Z a Z ∞
f (θ | xn ) dθ = f (θ | xn ) dθ = α/2
−∞ b
Vector parameter
( s
) Highest posterior density (HPD) region Rn
X
fX (x | θ) = h(x) exp ηi (θ)Ti (x) − A(θ)
1. P [θ ∈ Rn ] = 1 − α
i=1
2. Rn = {θ : f (θ | xn ) > k} for some k
= h(x) exp {η(θ) · T (x) − A(θ)}
= h(x)g(θ) exp {η(θ) · T (x)} Rn is unimodal =⇒ Rn is an interval
16
15.2 Function of parameters 15.3.1 Conjugate Priors
Continuous likelihood (subscript c denotes constant)
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ }.
Likelihood Conjugate prior Posterior hyperparameters
Posterior CDF for τ
Unif (0, θ) Pareto(xm , k) max x(n) , xm , k + n
Z Xn
n n n
H(r | x ) = P [ϕ(θ) ≤ τ | x ] = f (θ | x ) dθ Exp (λ) Gamma (α, β) α + n, β + xi
A
i=1
Pn
µ0 i=1 xi 1 n
2
2

Posterior density N µ, σc N µ0 , σ0 + / + 2 ,
σ2 σ2 σ02 σc
0 c−1
1 n
h(τ | xn ) = H 0 (τ | xn ) + 2
σ02 σc
Pn
νσ02 + i=1 (xi − µ)2
Bayesian delta method N µc , σ 2 Scaled Inverse Chi- ν + n,
ν+n
square(ν, σ02 )
νλ + nx̄ n

τ | X n ≈ N ϕ(θ),
b seb ϕ0 (θ)

b N µ, σ 2 Normal- , ν + n, α + ,
ν+n 2
scaled Inverse n 2
1X γ(x̄ − λ)
Gamma(λ, ν, α, β) β+ (xi − x̄)2 +
2 i=1 2(n + γ)
15.3 Priors −1
Σ−1 −1
Σ−1 −1

MVN(µ, Σc ) MVN(µ0 , Σ0 ) 0 + nΣc 0 µ0 + nΣ x̄ ,
−1 −1
Σ−1

Choice 0 + nΣc
Xn
MVN(µc , Σ) Inverse- n + κ, Ψ + (xi − µc )(xi − µc )T
• Subjective Bayesianism: prior should incorporate as much detail as possible Wishart(κ, Ψ) i=1
the research’s a priori knowledge—via prior elicitation n
X xi
• Objective Bayesianism: prior should incorporate as little detail as possible Pareto(xmc , k) Gamma (α, β) α + n, β + log
x mc
(non-informative prior) i=1
Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 − kn where k0 > kn
• Robust Bayesianism: consider various priors and determine sensitivity of Xn
our inferences to changes in the prior Gamma (αc , β) Gamma (α0 , β0 ) α0 + nαc , β0 + xi
i=1

Types

• Flat: f (θ) ∝ constant

R∞
• Proper: −∞ f (θ) dθ = 1
R∞
• Improper: −∞ f (θ) dθ = ∞
• Jeffrey’s Prior (transformation-invariant):

p p
f (θ) ∝ I(θ) f (θ) ∝ det(I(θ))

• Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family

17
Discrete likelihood log10 BF10 BF10 evidence
Likelihood Conjugate prior Posterior hyperparameters 0 − 0.5 1 − 1.5 Weak
n n
0.5 − 1 1.5 − 10 Moderate
Bern (p) Beta (α, β) α+
X
xi , β + n −
X
xi 1−2 10 − 100 Strong
i=1 i=1
>2 > 100 Decisive
n n n p
1−p BF 10
p∗ = where p = P [H1 ] and p∗ = P [H1 | xn ]
X X X
Bin (p) Beta (α, β) α+ xi , β + Ni − xi p
1+ 1−p BF10
i=1 i=1 i=1
n
X
NBin (p) Beta (α, β) α + rn, β + xi
i=1
16 Sampling Methods
n
X
Po (λ) Gamma (α, β) α+ xi , β + n 16.1 Inverse Transform Sampling
i=1
Xn Setup
Multinomial(p) Dir (α) α+ x(i)
i=1 • U ∼ Unif (0, 1)
n
X • X∼F
Geo (p) Beta (α, β) α + n, β + xi
i=1
• F −1 (u) = inf{x | F (x) ≥ u}
Algorithm
15.4 Bayesian Testing
1. Generate u ∼ Unif (0, 1)
If H0 : θ ∈ Θ0 : 2. Compute x = F −1 (u)
Z
Prior probability P [H0 ] = f (θ) dθ 16.2 The Bootstrap
Θ0
Z
Let Tn = g(X1 , . . . , Xn ) be a statistic.
Posterior probability P [H0 | xn ] = f (θ | xn ) dθ
Θ0
1. Estimate VF [Tn ] with VFbn [Tn ].
2. Approximate VFbn [Tn ] using simulation:
∗ ∗
Let H0 . . .Hk−1 be k hypotheses. Suppose θ ∼ f (θ | Hk ), (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
the sampling distribution implied by Fn b
f (xn | Hk )P [Hk ]
P [Hk | xn ] = PK , i. Sample uniformly X1∗ , . . . , Xn∗ ∼ Fbn .
n
k=1 f (x | Hk )P [Hk ] ii. Compute Tn∗ = g(X1∗ , . . . , Xn∗ ).
Marginal likelihood (b) Then
B B
!2
Z 1 X
∗ 1 X
f (xn | Hi ) = f (xn | θ, Hi )f (θ | Hi ) dθ vboot = V
bb = Tn,b − T∗
Θ
Fn B B r=1 n,r
b=1

Posterior odds (of Hi relative to Hj ) 16.2.1 Bootstrap Confidence Intervals

n n
P [Hi | x ] f (x | Hi ) P [Hi ] Normal-based interval
= ×
P [Hj | xn ] f (xn | Hj ) P [Hj ] Tn ± zα/2 se
b boot
| {z } | {z }
Bayes Factor BFij prior odds Pivotal interval
Bayes factor 1. Location parameter θ = T (F )
18
2. Pivot Rn = θbn − θ 2. Generate u ∼ Unif (0, 1)
3. Let H(r) = P [Rn ≤ r] be the cdf of Rn Ln (θcand )
∗ ∗
3. Accept θcand if u ≤
4. Let Rn,b = θbn,b − θbn . Approximate H using bootstrap: Ln (θbn )
B
1 X ∗ 16.4 Importance Sampling
H(r)
b = I(Rn,b ≤ r)
B Sample from an importance function g rather than target density h.
b=1
Algorithm to obtain an approximation to E [q(θ) | xn ]:
5. θβ∗ = β sample quantile of (θbn,1
∗ ∗
, . . . , θbn,B ) iid
1. Sample from the prior θ1 , . . . , θn ∼ f (θ)
6. rβ∗ = beta sample quantile of (Rn,1
∗ ∗
, . . . , Rn,B ), i.e., rβ∗ = θβ∗ − θbn
Ln (θi )
2. wi = PB ∀i = 1, . . . , B

7. Approximate 1 − α confidence interval Cn = â, b̂ where
i=1 Ln (θi )
PB
3. E [q(θ) | xn ] ≈ i=1 q(θi )wi
b −1 1 − α =

∗ ∗
â = θbn − H θbn − r1−α/2 = 2θbn − θ1−α/2
2
α
−1 ∗ ∗
b̂ = θbn − Hb
2
= θbn − rα/2 = 2θbn − θα/2 17 Decision Theory
Percentile interval Definitions
∗ ∗
Cn = θα/2 , θ1−α/2 • Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous for an estimator θb
16.3 Rejection Sampling • Action a ∈ A: possible value of the decision rule. In the estimation
context, the action is just an estimate of θ, θ(x).
b
Setup
• Loss function L: consequences of taking action a when true state is θ or
• We can easily sample from g(θ) discrepancy between θ and θ, b L : Θ × A → [−k, ∞).
• We want to sample from h(θ), but it is difficult
Loss functions
k(θ)
• We know h(θ) up to a proportional constant: h(θ) = R • Squared error loss: L(θ, a) = (θ − a)2
k(θ) dθ (
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ K1 (θ − a) a − θ < 0
• Linear loss: L(θ, a) =
K2 (a − θ) a − θ ≥ 0
Algorithm
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K1 = K2 )
1. Draw θcand ∼ g(θ) • Lp loss: L(θ, a) = |θ − a|p
2. Generate u ∼ Unif (0, 1) (
0 a=θ
k(θcand ) • Zero-one loss: L(θ, a) =
3. Accept θcand if u ≤ 1 a 6= θ
M g(θcand )
4. Repeat until B values of θcand have been accepted
17.1 Risk
Example
Posterior risk
• We can easily sample from the prior g(θ) = f (θ) Z h i
• Target is the posterior h(θ) ∝ k(θ) = f (xn | θ)f (θ) r(θb | x) = L(θ, θ(x))f
b (θ | x) dθ = Eθ|X L(θ, θ(x))
b

• Envelope condition: f (xn | θ) ≤ f (xn | θbn ) = Ln (θbn ) ≡ M

(Frequentist) risk
• Algorithm Z h i
1. Draw θcand ∼ f (θ) R(θ, θ)
b = L(θ, θ(x))f
b (x | θ) dx = EX|θ L(θ, θ(X))
b
19
Bayes risk 18 Linear Regression
ZZ
Definitions
h i
r(f, θ)
b = L(θ, θ(x))f
b (x, θ) dx dθ = Eθ,X L(θ, θ(X))
b
• Response variable Y
• Covariate X (aka predictor variable or feature)
h h ii h i
r(f, θ)
b = Eθ EX|θ L(θ, θ(X)
b = Eθ R(θ, θ)
b

18.1 Simple Linear Regression

h h ii h i
r(f, θ)
b = EX Eθ|X L(θ, θ(X)
b = EX r(θb | X)
Model
17.2 Admissibility Yi = β0 + β1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = σ 2
Fitted line
• θb0 dominates θb if
b0 rb(x) = βb0 + βb1 x
∀θ : R(θ, θ ) ≤ R(θ, θ)
b
Predicted (fitted) values
∃θ : R(θ, θb0 ) < R(θ, θ)
b Ybi = rb(Xi )
• θb is inadmissible if there is at least one other estimator θb0 that dominates Residuals
it. Otherwise it is called admissible. ˆi = Yi − Ybi = Yi − βb0 + βb1 Xi

Residual sums of squares (rss)

17.3 Bayes Rule
n
X
Bayes rule (or Bayes estimator) rss(βb0 , βb1 ) = ˆ2i
i=1
• r(f, θ)
b = inf e r(f, θ)
θ
e
R Least square estimates
• θ(x)
b = inf r(θb | x) ∀x =⇒ r(f, θ)
b = r(θb | x)f (x) dx
βbT = (βb0 , βb1 )T : min rss
β
b0 ,β
b1
Theorems

• Squared error loss: posterior mean βb0 = Ȳn − βb1 X̄n

Pn Pn
• Absolute error loss: posterior median i=1 (Xi − X̄n )(Yi − Ȳn ) i=1 Xi Yi − nX̄Y
β1 =
b Pn = P n
• Zero-one loss: posterior mode i=1 (Xi − X̄n )
2 2 2
i=1 Xi − nX

β0
h i
E βb | X n =
17.4 Minimax Rules β1
σ 2 n−1 ni=1 Xi2 −X n
h i P
Maximum risk V βb | X n = 2
R̄(θ)
b = sup R(θ, θ)
b R̄(a) = sup R(θ, a) nsX −X n 1
θ θ r Pn
2
σ i=1 Xi
√
b
Minimax rule se(
b βb0 ) =
sX n n
sup R(θ, θ)
b = inf R̄(θ)
e = inf sup R(θ, θ)
e
θ θe θe θ σ
√
b
se(
b βb1 ) =
sX n
θb = Bayes rule ∧ ∃c : R(θ, θ)
b =c Pn Pn 2
where s2X = n−1 i=1 (Xi − X n )2 and σ b2 = n−21
i=1
ˆi (unbiased estimate).
Least favorable prior Further properties:
P P
θbf = Bayes rule ∧ R(θ, θbf ) ≤ r(f, θbf ) ∀θ • Consistency: βb0 → β0 and βb1 → β1
20
• Asymptotic normality: 18.3 Multiple Regression
βb0 − β0 D βb1 − β1 D Y = Xβ +
→ N (0, 1) and → N (0, 1)
se(
b βb0 ) se(
b βb1 )
where
• Approximate 1 − α confidence intervals for β0 and β1 :      
X11 ··· X1k β1 1
 .. ..  β =  ... 
..  .. 
βb0 ± zα/2 se( and βb1 ± zα/2 se( X= . =.
 
b βb0 ) b βb1 ) . . 
Xn1 ··· Xnk βk n
• Wald test for H0 : β1 = 0 vs. H1 : β1 6= 0: reject H0 if |W | > zα/2 where
W = βb1 /se(
b βb1 ). Likelihood

1
R2 L(µ, Σ) = (2πσ 2 )−n/2 exp − 2 rss
Pn b 2
Pn 2 2σ
i=1 (Yi − Y ) ˆ rss
2
R = Pn 2
= 1 − Pn i=1 i 2 = 1 −
i=1 (Yi − Y ) i=1 (Yi − Y )
tss
N
X
Likelihood rss = (y − Xβ)T (y − Xβ) = kY − Xβk2 = (Yi − xTi β)2
n n n i=1
Y Y Y
L= f (Xi , Yi ) = fX (Xi ) × fY |X (Yi | Xi ) = L1 × L2
i=1 i=1 i=1 If the (k × k) matrix X T X is invertible,
Yn
L1 = fX (Xi ) βb = (X T X)−1 X T Y
i=1 h i
V βb | X n = σ 2 (X T X)−1
n
( )
Y 1 X 2
−n
L2 = fY |X (Yi | Xi ) ∝ σ exp − 2 Yi − (β0 − β1 Xi )
2σ i βb ≈ N β, σ 2 (X T X)−1

i=1

Under the assumption of Normality, the least squares estimator is also the mle
Estimate regression function
but the least squares variance estimator is not the mle.
n k
1X 2 X
b2 =
σ ˆ rb(x) = βbj xj
n i=1 i j=1

18.2 Prediction Unbiased estimate for σ 2

Observe X = x∗ of the covariate and want to predict their outcome Y∗ . n
1 X 2
b2 =
σ ˆ ˆ = X βb − Y
Yb∗ = βb0 + βb1 x∗ n − k i=1 i
h i h i h i h i
V Yb∗ = V βb0 + x2∗ V βb1 + 2x∗ Cov βb0 , βb1 mle
n−k 2
Prediction interval µ
b = X̄ b2 =
σ σ
Pn 2
n
2 2 i=1 (Xi − X∗ )
ξn = σ
b P +1
n i (Xi − X̄)2 j
b
1 − α Confidence interval
Yb∗ ± zα/2 ξbn βbj ± zα/2 se(
b βbj )
21
18.4 Model Selection Akaike Information Criterion (AIC)
Consider predicting a new observation Y ∗ for covariates X ∗ and let S ⊂ J
denote a subset of the covariates in the model, where |S| = k and |J| = n. bS2 ) − k
AIC(S) = `n (βbS , σ
Issues
Bayesian Information Criterion (BIC)
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance k
bS2 ) − log n
BIC(S) = `n (βbS , σ
Procedure 2

1. Assign a score to each model Validation and training

2. Search through all models to find the one with the highest score
m
X n n
Hypothesis testing R
bV (S) = (Ybi∗ (S) − Yi∗ )2 m = |{validation data}|, often or
i=1
4 2
H0 : βj = 0 vs. H1 : βj 6= 0 ∀j ∈ J
Leave-one-out cross-validation
Mean squared prediction error (mspe)
n n
!2
h i X X Yi − Ybi (S)
mspe = E (Yb (S) − Y ∗ )2 R
bCV (S) = (Yi − Yb(i) )2 =
i=1 i=1
1 − Uii (S)
Prediction risk
n n h i
U (S) = XS (XST XS )−1 XS (“hat matrix”)
X X
R(S) = mspei = E (Ybi (S) − Yi∗ )2
i=1 i=1

Training error
n
R
btr (S) =
X
(Ybi (S) − Yi )2 19 Non-parametric Function Estimation
i=1

R 2 19.1 Density Estimation

Pn b 2
R i=1 (Yi (S) − Y )
rss(S) btr (S) R
R2 (S) = 1 − =1− =1− P n
Estimate f (x), where f (x) = P [X ∈ A] = A
f (x) dx.
2
i=1 (Yi − Y )
tss tss Integrated square error (ise)
The training error is a downward-biased estimate of the prediction risk. Z Z
2
h i L(f, fbn ) = f (x) − fbn (x) dx = J(h) + f 2 (x) dx
E R btr (S) < R(S)

h i n
X h i Frequentist risk
bias(Rtr (S)) = E Rtr (S) − R(S) = −2
b b Cov Ybi , Yi
i=1 h i Z Z
R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
Adjusted R2
n − 1 rss
R2 (S) = 1 −
n − k tss
h i
Mallow’s Cp statistic b(x) = E fbn (x) − f (x)
h i
R(S)
b =R σ 2 = lack of fit + complexity penalty
btr (S) + 2kb v(x) = V fbn (x)
22
19.1.1 Histograms KDE
n
Definitions

1X1 x − Xi
fbn (x) = K
n i=1 h h
• Number of bins m 1 4
Z
00 2 1
Z
• Binwidth h = m 1 R(f, fn ) ≈ (hσK )
b (f (x)) dx + K 2 (x) dx
4 nh
• Bin Bj has νj observations c
−2/5 −1/5 −1/5
c2 c3
Z Z
h∗ = 1 c = σ 2
, c = K 2
(x) dx, c = (f 00 (x))2 dx
R
• Define pbj = νj /n and pj = Bj f (u) du 1 K 2 3
n1/5
Z 4/5 Z 1/5
∗ c4 5 2 2/5 2 00 2
Histogram estimator R (f, fn ) = 4/5
b c4 = (σK ) K (x) dx (f ) dx
n 4
| {z }
m C(K)
X pbj
fbn (x) = I(x ∈ Bj )
h
j=1 Epanechnikov Kernel
h i pj
E fbn (x) = ( √
h √ 3
|x| < 5
h i p (1 − p ) K(x) = 4 5(1−x2 /5)
j j
V fbn (x) = 0 otherwise
nh2
h2
Z
2 1
R(fbn , f ) ≈ (f 0 (u)) du + Cross-validation estimate of E [J(h)]
12 nh
!1/3
∗ 1 6 n n n
1 X X ∗ Xi − Xj
Z
h = 1/3 R 2 du 2Xb 2
n (f 0 (u)) JbCV (h) = fbn2 (x) dx − f(−i) (Xi ) ≈ 2
K + K(0)
n i=1 hn i=1 j=1 h nh
2/3 Z 1/3
∗ b C 3 0 2
R (fn , f ) ≈ 2/3 C= (f (u)) du
n 4 Z
∗ (2) (2)
K (x) = K (x) − 2K(x) K (x) = K(x − y)K(y) dy
Cross-validation estimate of E [J(h)]

Z n m
JbCV (h) = fbn2 (x) dx −
2Xb
f(−i) (Xi ) =
2
−
n+1 X 2
pb
19.2 Non-parametric Regression
n i=1 (n − 1)h (n − 1)h j=1 j
Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by

19.1.2 Kernel Density Estimator (KDE) Yi = r(xi ) + i

E [i ] = 0
Kernel K
V [i ] = σ 2
• K(x) ≥ 0
•
R
K(x) dx = 1 k-nearest Neighbor Estimator
R
• xK(x) dx = 0
R 2 2 1 X
• x K(x) dx ≡ σK >0 rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}
k 23
i:xi ∈Nk (x)
Nadaraya-Watson Kernel Estimator 20 Stochastic Processes
n
X
rb(x) = wi (x)Yi Stochastic Process
i=1 (
x−xi {0, ±1, . . . } = Z discrete

K {Xt : t ∈ T } T =
wi (x) = h ∈ [0, 1]
[0, ∞) continuous

Pn x−xj
j=1 K h
4 Z 2
h4 f 0 (x)
Z
2 2 00 0 • Notations Xt , X(t)
rn , r) ≈
R(b x K (x) dx r (x) + 2r (x) dx
4 f (x) • State space X
Z 2
σ K 2 (x) dx
R
• Index set T
+ dx
nhf (x)
c1
h∗ ≈ 1/5 20.1 Markov Chains
n
∗ c2
rn , r) ≈ 4/5
R (b Markov chain
n
P [Xn = x | X0 , . . . , Xn−1 ] = P [Xn = x | Xn−1 ] ∀n ∈ T, x ∈ X
Cross-validation estimate of E [J(h)]
n n Transition probabilities
X X (Yi − rb(xi ))2
JbCV (h) = (Yi − rb(−i) (xi ))2 = !2
i=1 i=1 K(0)
pij ≡ P [Xn+1 = j | Xn = i]
1− x−x
Pn
j=1 K h
j
pij (n) ≡ P [Xm+n = j | Xm = i] n-step

19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )

Approximation • (i, j) element is pij

∞ J
X X • pij > 0
r(x) = βj φj (x) ≈ βj φj (x) P
j=1 j=1
• i pij = 1

Multivariate regression Chapman-Kolmogorov

Y = Φβ + η
  X
φ0 (x1 ) ··· φJ (x1 ) pij (m + n) = pij (m)pkj (n)
 .. .. .. 
where ηi = i and Φ =  . . . 
k

φ0 (xn ) · · · φJ (xn )
Pm+n = Pm Pn
Least squares estimator
βb = (ΦT Φ)−1 ΦT Y Pn = P × · · · × P = Pn
1
≈ ΦT Y (for equally spaced observations only) Marginal probability
n
Cross-validation estimate of E [J(h)] µn = (µn (1), . . . , µn (N )) where µi (i) = P [Xn = i]
2
µ0 , initial distribution

Xn J
X
R
bCV (J) = Yi − φj (xi )βbj,(−i)  µn = µ0 Pn
24
i=1 j=1
20.2 Poisson Processes Autocorrelation function (ACF)
Poisson process
Cov [xs , xt ] γ(s, t)
ρ(s, t) = p =p
• {Xt : t ∈ [0, ∞)} = number of events up to and including time t V [xs ] V [xt ] γ(s, s)γ(t, t)
• X0 = 0
• Independent increments: Cross-covariance function (CCV)
∀t0 < · · · < tn : Xt1 − Xt0 ⊥
⊥ · · · ⊥⊥ Xtn − Xtn−1
γxy (s, t) = E [(xs − µxs )(yt − µyt )]
• Intensity function λ(t)
– P [Xt+h − Xt = 1] = λ(t)h + o(h) Cross-correlation function (CCF)
– P [Xt+h − Xt = 2] = o(h)
Rt γxy (s, t)
• Xs+t − Xs ∼ Po (m(s + t) − m(s)) where m(t) = 0
λ(s) ds ρxy (s, t) = p
γx (s, s)γy (t, t)
Homogeneous Poisson process
Backshift operator
λ(t) ≡ λ =⇒ Xt ∼ Po (λt) λ>0
B k (xt ) = xt−k
Waiting times
Wt := time at which Xt occurs
Difference operator
1
Wt ∼ Gamma t, ∇d = (1 − B)d
λ
Interarrival times
White noise
St = Wt+1 − Wt

1 2
• wt ∼ wn(0, σw )
St ∼ Exp
λ iid 2

• Gaussian: wt ∼ N 0, σw
St
• E [wt ] = 0 t ∈ T
• V [wt ] = σ 2 t ∈ T
Wt−1 Wt t • γw (s, t) = 0 s 6= t ∧ s, t ∈ T

Random walk
21 Time Series
• Drift δ
Mean function ∞
Pt
• xt = δt + j=1 wj
Z
µxt = E [xt ] = xft (x) dx
−∞ • E [xt ] = δt
Autocovariance function
Symmetric moving average
γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt
k
X k
X
γx (t, t) = E (xt − µt )2 = V [xt ]

mt = aj xt−j where aj = a−j ≥ 0 and aj = 1
25
j=−k j=−k
21.1 Stationary Time Series 21.2 Estimation of Correlation
Strictly stationary Sample mean
n
1X
x̄ = xt
P [xt1 ≤ c1 , . . . , xtk ≤ ck ] = P [xt1 +h ≤ c1 , . . . , xtk +h ≤ ck ] n t=1

Sample variance
n
∀k ∈ N, tk , ck , h ∈ Z

1 X |h|
V [x̄] = 1− γx (h)
n n
h=−n
Weakly stationary
Sample autocovariance function
• E x2t < ∞ ∀t ∈ Z
n−h
1 X
2
• E xt = m ∀t ∈ Z γ
b(h) = (xt+h − x̄)(xt − x̄)
• γx (s, t) = γx (s + r, t + r) ∀r, s, t ∈ Z n t=1

Autocovariance function Sample autocorrelation function

γ
b(h)
• γ(h) = E [(xt+h − µ)(xt − µ)] ∀h ∈ Z ρb(h) =
γ
b(0)
• γ(0) = E (xt − µ)2
• γ(0) ≥ 0 Sample cross-variance function
• γ(0) ≥ |γ(h)|
n−h
• γ(h) = γ(−h) 1 X
γ
bxy (h) = (xt+h − x̄)(yt − y)
n t=1
Autocorrelation function (ACF)
Sample cross-correlation function
Cov [xt+h , xt ] γ(t + h, t) γ(h)
ρx (h) = p =p = γ
bxy (h)
V [xt+h ] V [xt ] γ(t + h, t + h)γ(t, t) γ(0) ρbxy (h) = p
γbx (0)b
γy (0)
Jointly stationary time series Properties

γxy (h) = E [(xt+h − µx )(yt − µy )] 1

• σρbx (h) = √ if xt is white noise
n
1
γxy (h) • σρbxy (h) = √ if xt or yt is white noise
ρxy (h) = p n
γx (0)γy (h)
21.3 Non-Stationary Time Series
Linear process
Classical decomposition model
∞
X ∞
X
xt = µ + ψj wt−j where |ψj | < ∞ xt = µt + st + wt
j=−∞ j=−∞
• µt = trend
∞
X • st = seasonal component
2
γ(h) = σw ψj+h ψj • wt = random noise term
26
j=−∞
21.3.1 Detrending Moving average polynomial
Least squares θ(z) = 1 + θ1 z + · · · + θq zq z ∈ C ∧ θq 6= 0
2
1. Choose trend model, e.g., µt = β0 + β1 t + β2 t
Moving average operator
2. Minimize rss to obtain trend estimate µ bt = βb0 + βb1 t + βb2 t2
3. Residuals , noise wt θ(B) = 1 + θ1 B + · · · + θp B p
Moving average MA (q) (moving average model order q)
1
• The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 : xt = wt + θ1 wt−1 + · · · + θq wt−q ⇐⇒ xt = θ(B)wt
k q
1 X X
vt = xt−1 E [xt ] = θj E [wt−j ] = 0
2k + 1
i=−k j=0
Pk ( Pq−h
1 2
• If 2k+1 i=−k wt−j ≈ 0, a linear trend function µt = β0 + β1 t passes
σw j=0 θj θj+h 0≤h≤q
γ(h) = Cov [xt+h , xt ] =
without distortion 0 h>q
Differencing MA (1)
xt = wt + θwt−1
• µt = β0 + β1 t =⇒ ∇xt = β1 
2 2
(1 + θ )σw h = 0

2
21.4 ARIMA models γ(h) = θσw h=1

0 h>1

Autoregressive polynomial
(
θ
φ(z) = 1 − φ1 z − · · · − φp zp z ∈ C ∧ φp 6= 0 2 h=1
ρ(h) = (1+θ )
0 h>1
Autoregressive operator
ARMA (p, q)
φ(B) = 1 − φ1 B − · · · − φp B p
xt = φ1 xt−1 + · · · + φp xt−p + wt + θ1 wt−1 + · · · + θq wt−q
Autoregressive model order p, AR (p)
φ(B)xt = θ(B)wt
xt = φ1 xt−1 + · · · + φp xt−p + wt ⇐⇒ φ(B)xt = wt
Partial autocorrelation function (PACF)
AR (1) • xih−1 , regression of xi on {xh−1 , xh−2 , . . . , x1 }
k−1 ∞ • φhh = corr(xh − xh−1
h , x0 − xh−1
0 ) h≥2
X k→∞,|φ|<1 X
• xt = φk (xt−k ) + φj (wt−j ) = φj (wt−j ) • E.g., φ11 = corr(x1 , x0 ) = ρ(1)
j=0 j=0
| {z } ARIMA (p, d, q)
linear process
P∞ j
∇d xt = (1 − B)d xt is ARMA (p, q)
• E [xt ] = j=0 φ (E [wt−j ]) = 0
2 h
σw φ φ(B)(1 − B)d xt = θ(B)wt
• γ(h) = Cov [xt+h , xt ] = 1−φ2
γ(h) Exponentially Weighted Moving Average (EWMA)
• ρ(h) = γ(0) = φh
• ρ(h) = φρ(h − 1) h = 1, 2, . . . xt = xt−1 + wt − λwt−1
27
∞
X • Frequency index ω (cycles per unit time), period 1/ω
xt = (1 − λ)λj−1 xt−j + wt when |λ| < 1
j=1
• Amplitude A
• Phase φ
x̃n+1 = (1 − λ)xn + λx̃n
• U1 = A cos φ and U2 = A sin φ often normally distributed rv’s
Seasonal ARIMA
Periodic mixture
• Denoted by ARIMA (p, d, q) × (P, D, Q)s
q
• ΦP (B s )φ(B)∇D d s
s ∇ xt = δ + ΘQ (B )θ(B)wt X
xt = (Uk1 cos(2πωk t) + Uk2 sin(2πωk t))
k=1
21.4.1 Causality and Invertibility
P∞ • Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rv’s with variances σk2
ARMA (p, q) is causal (future-independent) ⇐⇒ ∃{ψj } : j=0 ψj < ∞ such that Pq
• γ(h) = k=1 σk2 cos(2πωk h)
Pq
∞
X • γ(0) = E x2t = k=1 σk2
xt = wt−j = ψ(B)wt
j=0 Spectral representation of a periodic process
P∞
ARMA (p, q) is invertible ⇐⇒ ∃{πj } : j=0 πj < ∞ such that γ(h) = σ 2 cos(2πω0 h)
∞ σ 2 −2πiω0 h σ 2 2πiω0 h
X = e + e
π(B)xt = Xt−j = wt 2 2
Z 1/2
j=0
= e2πiωh dF (ω)
Properties −1/2

• ARMA (p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle Spectral distribution function
∞

X θ(z)
j 0
 ω < −ω0
ψ(z) = ψj z = |z| ≤ 1
φ(z) F (ω) = σ 2 /2 −ω ≤ ω < ω0
j=0 
 2
σ ω ≥ ω0
• ARMA (p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle
• F (−∞) = F (−1/2) = 0
∞
X φ(z) • F (∞) = F (1/2) = γ(0)
π(z) = πj z j = |z| ≤ 1
j=0
θ(z)
Spectral density
Behavior of the ACF and PACF for causal and invertible ARMA models ∞
X 1 1
AR (p) MA (q) ARMA (p, q) f (ω) = γ(h)e−2πiωh − ≤ω≤
2 2
h=−∞
ACF tails off cuts off after lag q tails off
PACF cuts off after lag p tails off q tails off P∞ R 1/2
• Needs h=−∞ |γ(h)| < ∞ =⇒ γ(h) = −1/2
e2πiωh f (ω) dω h = 0, ±1, . . .
21.5 Spectral Analysis • f (ω) ≥ 0
• f (ω) = f (−ω)
Periodic process • f (ω) = f (1 − ω)
R 1/2
xt = A cos(2πωt + φ) • γ(0) = V [xt ] = −1/2 f (ω) dω
2
= U1 cos(2πωt) + U2 sin(2πωt) • White noise: fw (ω) = σw
28
• ARMA (p, q) , φ(B)xt = θ(B)wt : 22.2 Beta Function
Z 1
Γ(x)Γ(y)
2 |θ(e
−2πiω 2
)| • Ordinary: B(x, y) = B(y, x) = tx−1 (1 − t)y−1 dt =
fx (ω) = σw 0 Γ(x + y)
|φ(e−2πiω )|2 Z x
a−1 b−1
Pp Pq • Incomplete: B(x; a, b) = t (1 − t) dt
where φ(z) = 1 − k=1 φk z k and θ(z) = 1 + k=1 θk z
k 0
• Regularized incomplete:
a+b−1
Discrete Fourier Transform (DFT) B(x; a, b) a,b∈N X (a + b − 1)!
Ix (a, b) = = xj (1 − x)a+b−1−j
B(a, b) j=a
j!(a + b − 1 − j)!
n
X
−1/2
d(ωj ) = n xt e−2πiωj t • I0 (a, b) = 0 I1 (a, b) = 1
i=1 • Ix (a, b) = 1 − I1−x (b, a)

Fourier/Fundamental frequencies
22.3 Series
ωj = j/n Finite Binomial
n n
Inverse DFT X n(n + 1) X n
n−1 • k= • = 2n
X 2 k
xt = n−1/2 d(ωj )e2πiωj t k=1
n
k=0
n
j=0
X X r+k r+n+1
• (2k − 1) = n2 • =
k n
Periodogram k=1 k=0
n n
I(j/n) = |d(j/n)| 2 X n(n + 1)(2n + 1) X k n+1
• k2 = • =
6 m m+1
k=1 k=0
Scaled Periodogram n
X
n(n + 1)
2 • Vandermonde’s Identity:
• k3 = r
m n

m+n

4 2
X
P (j/n) = I(j/n) k=1 =
n n k r−k r
!2 !2 X cn+1 − 1 k=0

2X
n
2X
n • ck = c 6= 1 • Binomial Theorem:
= xt cos(2πtj/n + xt sin(2πtj/n c−1 n
n n−k k
k=0
X
n t=1 n t=1 a b = (a + b)n
k
k=0

22 Math Infinite
∞ ∞
X 1 X p
22.1 Gamma Function • pk = , pk = |p| < 1
∞
1−p 1−p
k=0 k=1
Z
• Ordinary: Γ(s) = ts−1 e−t dt ∞ ∞
!
X d X d 1 1
0 Z ∞ • kpk−1 = pk
= = |p| < 1
dp dp 1 − p (1 − p)2
• Upper incomplete: Γ(s, x) = ts−1 e−t dt k=0
∞
k=0
x X r+k−1 k
x = (1 − x)−r r ∈ N+
Z x
•
• Lower incomplete: γ(s, x) = ts−1 e−t dt k=0
k
0 ∞
• Γ(α + 1) = αΓ(α) α>1
X α k
• p = (1 + p)α |p| < 1 , α ∈ C
• Γ(n) = (n − 1)! n∈N k
√ k=0
• Γ(1/2) = π
29
22.4 Combinatorics [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R
Examples. Springer, 2006.
Sampling [4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra.
Springer, 2001.
k out of n w/o replacement w/ replacement [5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik.
k−1 Springer, 2002.
Y n!
ordered nk = (n − i) = nk [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
i=0
(n − k)!
nk

n n! n−1+r n−1+r
unordered = = =
k k! k!(n − k)! r n−1

Stirling numbers, 2nd kind

(
n n−1 n−1 n 1 n=0
=k + 1≤k≤n =
k k k−1 0 0 else

Partitions
n
X
Pn+k,k = Pn,i k > n : Pn,k = 0 n ≥ 1 : Pn,0 = 0, P0,0 = 1
i=1

Balls and Urns f :B→U D = distinguishable, ¬D = indistinguishable.

|B| = n, |U | = m f arbitrary f injective f surjective f bijective

( (
mn m ≥ n

n n! m = n
B : D, U : D mn m!
0 else m 0 else
(
m+n−1 m n−1 1 m=n
B : ¬D, U : D
n n m−1 0 else
m
( (
X n 1 m≥n n 1 m=n
B : D, U : ¬D
k 0 else m 0 else
k=1
m
( (
X 1 m≥n 1 m=n
B : ¬D, U : ¬D Pn,k Pn,m
k=1
0 else 0 else

References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole,
1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American
Statistician, 62(1):45–53, 2008.
30
Univariate distribution relationships, courtesy Leemis and McQueston [2].
31

Probability and Statistics Explorations With Maple
No ratings yet
Probability and Statistics Explorations With Maple
287 pages
Probability and Statistics With Examples Using R Siva Athreya, Deepayan Sarkar, and Steve Tanner
No ratings yet
Probability and Statistics With Examples Using R Siva Athreya, Deepayan Sarkar, and Steve Tanner
258 pages
Probability and Statistical Inference 9t PDF
100% (1)
Probability and Statistical Inference 9t PDF
30 pages
AP Workbook - Circ Motion & Grav
No ratings yet
AP Workbook - Circ Motion & Grav
28 pages
I ' S M P S I: Nstructor S Olutions Anual Robability AND Tatistical Nference
No ratings yet
I ' S M P S I: Nstructor S Olutions Anual Robability AND Tatistical Nference
16 pages
A Probability and Statistics Cheatsheet
No ratings yet
A Probability and Statistics Cheatsheet
28 pages
Introduction To Statistical Thought
100% (2)
Introduction To Statistical Thought
393 pages
Stat Cookbook
No ratings yet
Stat Cookbook
31 pages
Probability and Statistics - Cookbook
No ratings yet
Probability and Statistics - Cookbook
28 pages
Stat Cookbook
No ratings yet
Stat Cookbook
31 pages
Maple Manual
No ratings yet
Maple Manual
285 pages
Equipment Design: Mechanical Aspects Week 1 Assignment - 1 Solution
No ratings yet
Equipment Design: Mechanical Aspects Week 1 Assignment - 1 Solution
4 pages
Stat Cookbook
No ratings yet
Stat Cookbook
31 pages
Statistics 333
100% (1)
Statistics 333
84 pages
Probability and Statistics: Cookbook
No ratings yet
Probability and Statistics: Cookbook
28 pages
Probability and Statistics: Cookbook
No ratings yet
Probability and Statistics: Cookbook
31 pages
Probability and Statistics: Cookbook
No ratings yet
Probability and Statistics: Cookbook
31 pages
Probability and Statistics Cheat Sheet
100% (3)
Probability and Statistics Cheat Sheet
28 pages
Stats Cheat Sheet
No ratings yet
Stats Cheat Sheet
28 pages
Fdocuments - in Centrifugal Slurry Pumps
No ratings yet
Fdocuments - in Centrifugal Slurry Pumps
50 pages
Probability and Statistics With Examples Using R: Siva Athreya, Deepayan Sarkar, and Steve Tanner April 25, 2016
No ratings yet
Probability and Statistics With Examples Using R: Siva Athreya, Deepayan Sarkar, and Steve Tanner April 25, 2016
4 pages
Probability and Statistics: Cookbook
No ratings yet
Probability and Statistics: Cookbook
31 pages
Mathematical Statistics
No ratings yet
Mathematical Statistics
271 pages
Cimentaciones Maquinas
100% (1)
Cimentaciones Maquinas
235 pages
Statistic Book
100% (1)
Statistic Book
328 pages
Tocv 2
No ratings yet
Tocv 2
10 pages
Probability and Statistics Cookbook
No ratings yet
Probability and Statistics Cookbook
28 pages
ECE 313 Course Notes: Probability With Engineering Applications
No ratings yet
ECE 313 Course Notes: Probability With Engineering Applications
188 pages
Probabilityjan13 PDF
No ratings yet
Probabilityjan13 PDF
281 pages
MSC Notes
No ratings yet
MSC Notes
145 pages
A First Course in Mathematical Statistics - Nusbaum
No ratings yet
A First Course in Mathematical Statistics - Nusbaum
195 pages
Book Solutions
No ratings yet
Book Solutions
17 pages
(VCE Methods) 2012-16 TSSM Unit 34 Exam 2 Solutions
No ratings yet
(VCE Methods) 2012-16 TSSM Unit 34 Exam 2 Solutions
11 pages
Lecture Notes
No ratings yet
Lecture Notes
80 pages
ISO - 80369 - Luers Testing
No ratings yet
ISO - 80369 - Luers Testing
4 pages
Physics Lab Manual Diploma PDF
No ratings yet
Physics Lab Manual Diploma PDF
39 pages
EC400Stats Lecturenotes2021
No ratings yet
EC400Stats Lecturenotes2021
101 pages
Formulario Ep Probability and Statistics
No ratings yet
Formulario Ep Probability and Statistics
28 pages
Probabilistic Models in The Study of Language
No ratings yet
Probabilistic Models in The Study of Language
274 pages
Gas Turbine Flow Meter
No ratings yet
Gas Turbine Flow Meter
33 pages
Institute of Engineering & Technology, Devi Ahilya University, Indore, (M.P.), India. (Scheme Effective From July 2015)
No ratings yet
Institute of Engineering & Technology, Devi Ahilya University, Indore, (M.P.), India. (Scheme Effective From July 2015)
2 pages
Technical Report On System of Linear Equations
No ratings yet
Technical Report On System of Linear Equations
9 pages
Introduction To Statistical Thought - Michael Levine
No ratings yet
Introduction To Statistical Thought - Michael Levine
344 pages
Effect of Soil Structure Interaction On The Dynamic Response of Reinforced Concrete Structures
No ratings yet
Effect of Soil Structure Interaction On The Dynamic Response of Reinforced Concrete Structures
12 pages
2020-2021 EDA 101 Handout
No ratings yet
2020-2021 EDA 101 Handout
192 pages
Chapter One PPT of Geodesy
100% (3)
Chapter One PPT of Geodesy
31 pages
2434-Article Text-13246-2-10-20201130
No ratings yet
2434-Article Text-13246-2-10-20201130
22 pages
Math Stats Lecture 2020F
No ratings yet
Math Stats Lecture 2020F
122 pages
CO1 - Problems - Magnetic Properties
No ratings yet
CO1 - Problems - Magnetic Properties
8 pages
Book
No ratings yet
Book
475 pages
Moving Charges and Magnetism DPP 01 (Of Lecture 04) Yake
No ratings yet
Moving Charges and Magnetism DPP 01 (Of Lecture 04) Yake
4 pages
2.1 Measurement Techniques-Cie Ial Physics-Theory QP
No ratings yet
2.1 Measurement Techniques-Cie Ial Physics-Theory QP
16 pages
Statistical Models
No ratings yet
Statistical Models
248 pages
CompleteLectureNotes STAT 261
No ratings yet
CompleteLectureNotes STAT 261
158 pages
MA-202: Probability & Statistics: Class Notes
No ratings yet
MA-202: Probability & Statistics: Class Notes
221 pages
Main 1
No ratings yet
Main 1
36 pages
22amh32 - Data Analytics and Data Science Unit I & Mathematics Foundations For Data Science 1. Mathematics Foundations For Data Science
No ratings yet
22amh32 - Data Analytics and Data Science Unit I & Mathematics Foundations For Data Science 1. Mathematics Foundations For Data Science
5 pages
Course Outline
No ratings yet
Course Outline
9 pages
Statistical Inference
No ratings yet
Statistical Inference
158 pages
Example Pipe Flow Problems
No ratings yet
Example Pipe Flow Problems
20 pages
Antigravidade
No ratings yet
Antigravidade
21 pages
Probability and Statistics: Cookbook
No ratings yet
Probability and Statistics: Cookbook
28 pages
Eigensystems by Naresh
No ratings yet
Eigensystems by Naresh
35 pages
Probability and Statistics: Cookbook
No ratings yet
Probability and Statistics: Cookbook
31 pages
JSA Excavation
No ratings yet
JSA Excavation
6 pages
A Comparative Study of MHD Flow Analysis in A Porous Medium by Using Differential Transformation Method and Variational Iteration Method
No ratings yet
A Comparative Study of MHD Flow Analysis in A Porous Medium by Using Differential Transformation Method and Variational Iteration Method
15 pages
Extension of A Semi-Analytical Approach To Determine Natural Frequency and Mode Shape of Orthotropic Steel Bridge
No ratings yet
Extension of A Semi-Analytical Approach To Determine Natural Frequency and Mode Shape of Orthotropic Steel Bridge
17 pages
Aval and Ghabdian 2019 - Time-Dependent Reliability Analysis of RC Deep Beams-2
No ratings yet
Aval and Ghabdian 2019 - Time-Dependent Reliability Analysis of RC Deep Beams-2
16 pages
STP 559-1974
No ratings yet
STP 559-1974
319 pages
GPR Exploive Filler + Crosscut
No ratings yet
GPR Exploive Filler + Crosscut
21 pages
CALPUFF v7 UserGuide Addendum
No ratings yet
CALPUFF v7 UserGuide Addendum
67 pages
Analysis of Drugs Manual - March 2021
No ratings yet
Analysis of Drugs Manual - March 2021
287 pages
Aviation Info Bank
No ratings yet
Aviation Info Bank
1,025 pages

Stat Cookbook

Uploaded by

Stat Cookbook

Uploaded by

Probability and Statistics

1 Distribution Overview 3 15 Bayesian Inference 16 22 Math 29

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

0.5 0.25 0.1

● ● 0.0 0.00 0.0

0.50 1.0 1.0

0.00 0 0.0 0.0

0.75 0.75 0.75 0.75

0.50 0.50 0.50 0.50

0.25 0.25 0.25 0.25

0.25 0.25 0.25 0.25

• Outcome (point or element) ω ∈ Ω Bayes’ Theorem

Continuous Sample mean

10.1 Law of Large Numbers (LLN)

Properties of the mle Under appropriate regularity conditions

• Null hypothesis H0 supθ∈Θ Ln (θ) Ln (θbn )

fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} Equal-tail credible interval

• Flat: f (θ) ∝ constant

• Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family

Posterior odds (of Hi relative to Hj ) 16.2.1 Bootstrap Confidence Intervals

• Envelope condition: f (xn | θ) ≤ f (xn | θbn ) = Ln (θbn ) ≡ M

18.1 Simple Linear Regression

Residual sums of squares (rss)

• Squared error loss: posterior mean βb0 = Ȳn − βb1 X̄n

18.2 Prediction Unbiased estimate for σ 2

1. Assign a score to each model Validation and training

R 2 19.1 Density Estimation

19.1.2 Kernel Density Estimator (KDE) Yi = r(xi ) + i

19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )

Approximation • (i, j) element is pij

Multivariate regression Chapman-Kolmogorov

Autocovariance function Sample autocorrelation function

γxy (h) = E [(xt+h − µx )(yt − µy )] 1

Stirling numbers, 2nd kind

Balls and Urns f :B→U D = distinguishable, ¬D = indistinguishable.

|B| = n, |U | = m f arbitrary f injective f surjective f bijective

You might also like

19.1.2 Kernel Density Estimator (KDE) Yi = r(xi ) + i