
Common Distributions

Paolo Zacchia

Probability and Statistics

Lecture 2
General overview

This lecture analyzes the most common univariate distributions that
are encountered in statistical analysis and are referenced in this course.

The analysis proceeds as follows.

1. Discrete distributions.

2. Continuous distributions: location-scale families.

3. Continuous distributions: other common families.

4. Continuous distributions: generalized extreme value.


Parametric families
• The expressions ‘distribution’ and ‘parametric distribution
family’ are often conflated.

• A “family” of distributions is a set of distributions that are
identical up to some numeric parameters.

• Example: all experiments about “tossing a coin” belong to
the Bernoulli family, but the probability p of observing a
‘Head’ might differ between coins with different “balance.”

• Thus p is a parameter of a Bernoulli distribution and:

X ∼ Be (p)

here means: the random variable X follows the Bernoulli (Be)
distribution with parameter p.
Overviewing a distribution (1/2)
The analysis of distributions is systematic. For each, the following are reported:

• the support, e.g. X ∈ {0, 1};


• the parameters and their admissible values, e.g. p ∈ [0, 1];
• the notation by which they are indicated, e.g. X ∼ Be (p);
• the p.m.f. or p.d.f., e.g. $f_X(x; p) = p^x (1 - p)^{1 - x}$;
• the c.d.f., e.g.

$$F_X(x; p) = (1 - p) \cdot 1[x \in [0, 1)] + 1[x \in [1, \infty)];$$

• the m.g.f., e.g. $M_X(t; p) = p \exp(t) + (1 - p)$;


• the key moments (usually only mean and variance).

These examples refer to the Bernoulli distribution (family).


Overviewing a distribution (2/2)

In addition, it is often useful to report:

• some key derivations (of the m.g.f., of key moments, etc.);

• the graph of density functions (continuous distributions only);

• the relationships between a certain distribution and other
distributions which result from specific transformations or
from equating parameters across distributions.

Relationships between distributions are especially useful to link
each distribution to the real-world phenomena that it is meant
to describe, and to the underlying intuition.
The Bernoulli distribution

• The Bernoulli distribution describes all dichotomous events.

• The Bernoulli distribution is elementary, and it forms the


basis for other discrete distributions.

• Note: with p = 0 or p = 1, the entire probability mass is on


one realization (“degenerate distribution”).

• The key moments are:

E [X] = p
Var [X] = p (1 − p)

where Var [X] is maximized at p = 0.5 (“balanced” case).


The binomial distribution (1/3)
• The binomial distribution corresponds to the repetition of
n ∈ N Bernoulli “experiments” (also called trials).

• A random variable X that follows the binomial distribution
counts the number of 0 ≤ x ≤ n successes of the Bernoulli
experiment that defines it (every Bernoulli x = 1 counts).

• The underlying sample space is the set $S = \{0, 1\}^n$. All the
underlying Bernoulli trials are independent.

• The support of X is X = {0, 1, . . . , n}.

• The binomial distribution has two parameters, p and n.

X ∼ Bn (p, n)
The binomial distribution (2/3)
• The binomial distribution owes its name to its extensive use
of the binomial coefficient and formula. In fact:

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$

are the ways to obtain x successes out of n trials.

• The p.m.f. is thus:

$$f_X(x; p, n) = \binom{n}{x} p^x (1 - p)^{n - x}$$

• . . . and the c.d.f. is as follows (it equals 1 for x = n).

$$F_X(x; p, n) = \sum_{i=0}^{\lfloor x \rfloor} \binom{n}{i} p^i (1 - p)^{n - i}$$
The binomial distribution (3/3)

• By the binomial formula, the distribution’s m.g.f. is:

$$M_X(t; p, n) = \sum_{x=0}^{n} \exp(tx) \binom{n}{x} p^x (1 - p)^{n - x} = \sum_{x=0}^{n} \binom{n}{x} \left[\exp(t)\,p\right]^x (1 - p)^{n - x} = \left[p \exp(t) + (1 - p)\right]^n$$

• . . . hence, the mean and variance are:

E [X] = np
Var [X] = np (1 − p)

which is intuitive.
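As a quick numerical cross-check (not part of the original slides), the sketch below evaluates the binomial p.m.f. from the formula above and compares the implied mean and variance with np and np(1 − p); it assumes Python with numpy and scipy, and the values n = 10, p = 0.3 are only illustrative.

```python
# Sketch (not from the slides): checking the binomial p.m.f. and moments
# numerically; numpy and scipy are assumed to be available.
import math
import numpy as np
from scipy import stats

n, p = 10, 0.3                      # illustrative values
xs = np.arange(n + 1)

# p.m.f. from the formula above: C(n, x) p^x (1 - p)^(n - x)
pmf_formula = np.array([math.comb(n, x) * p**x * (1 - p)**(n - x) for x in xs])
pmf_scipy = stats.binom(n, p).pmf(xs)
assert np.allclose(pmf_formula, pmf_scipy)

# Mean and variance implied by the m.g.f.: np and np(1 - p)
print(np.dot(xs, pmf_formula), n * p)                          # ~ 3.0
print(np.dot((xs - n * p) ** 2, pmf_formula), n * p * (1 - p)) # ~ 2.1
```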
The geometric distribution (1/3)
• The geometric distribution is also based on a possibly
infinite sequence of Bernoulli trials. Implicitly, trials are ordered.

• Here X indexes the trial that delivers the first success.

• The support is X = N and there is one parameter: p.

• Since x − 1 failures must occur before a success occurs at x,
the p.m.f. (that gives the distribution its name) is:

$$f_X(x; p) = p (1 - p)^{x - 1}$$

• . . . and the c.d.f. is as follows (it tends to 1 as x → ∞).

$$F_X(x; p) = \sum_{i=0}^{\lfloor x \rfloor - 1} p (1 - p)^i = 1 - (1 - p)^{\lfloor x \rfloor}$$
The geometric distribution (2/3)
• The geometric m.g.f. exists for t < − log (1 − p):

$$M_X(t; p) = \lim_{M \to \infty} \sum_{x=1}^{M} \exp(tx) \cdot p (1 - p)^{x - 1} = p \exp(t) \cdot \lim_{M \to \infty} \sum_{x=1}^{M} \left[(1 - p) \exp(t)\right]^{x - 1} = p \exp(t) \cdot \lim_{M \to \infty} \frac{1 - \left[(1 - p) \exp(t)\right]^{M}}{1 - (1 - p) \exp(t)} = \frac{p \exp(t)}{1 - (1 - p) \exp(t)}$$

• . . . hence, the mean and variance are as follows.

$$E[X] = \frac{1}{p} \qquad Var[X] = \frac{1 - p}{p^2}$$
The geometric distribution (3/3)

• The geometric distribution has the memoryless property: for
integers s > t,

$$P(X > s \mid X > t) = \frac{P(X > s \cap X > t)}{P(X > t)} = \frac{P(X > s)}{P(X > t)} = (1 - p)^{s - t} = P(X > s - t)$$

• The probability that the first success arrives after the s-th
trial, conditional on t initial failures, equals the ex ante
probability that it arrives after s − t trials.

• Every failure “resets” the probability of a success!
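A minimal Monte Carlo sketch of the memoryless property (not from the slides; numpy assumed, with illustrative values p = 0.3, s = 7, t = 4):

```python
# Sketch (not from the slides): simulating the memoryless property of the
# geometric distribution; numpy is assumed to be available.
import numpy as np

rng = np.random.default_rng(0)
p, s, t = 0.3, 7, 4                         # illustrative values with s > t
x = rng.geometric(p, size=1_000_000)        # support {1, 2, ...}, as above

lhs = np.mean(x[x > t] > s)                 # P(X > s | X > t)
rhs = np.mean(x > s - t)                    # P(X > s - t)
print(lhs, rhs, (1 - p) ** (s - t))         # all three should be close
```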


The negative binomial distribution (1/2)
• The negative binomial models the occurrence of the first
r ∈ N successes out of a series of Bernoulli trials.
• Here X indexes the trial that delivers the r-th success.

• The support is X = {r, r + 1, . . . } and there are two
parameters: p and r. The distribution is denoted as follows.

X ∼ NB (p, r)

• Before the r-th success at x, one needs r − 1 successes in the
first x − 1 trials: the p.m.f. must account for all the ways these
successes can occur.

$$f_X(x; p, r) = \binom{x - 1}{r - 1} p^r (1 - p)^{x - r}$$

Observation 1
The geometric distribution is a special case of the negative binomial
distribution, with r = 1; thus it is denoted as X ∼ NB (p, 1)
The negative binomial distribution (2/2)
• One could also look at the number of failures Y = X − r:

$$f_Y(y; p, r) = \binom{r + y - 1}{y} p^r (1 - p)^y = (-1)^y \binom{-r}{y} p^r (1 - p)^y$$

whence the name “negative” binomial.

• The m.g.f. is defined for t < − log (1 − p):

$$M_X(t; p, r) = \left[\frac{p \exp(t)}{1 - (1 - p) \exp(t)}\right]^r$$

• . . . and one can obtain the following mean and variance.

$$E[X] = \frac{r}{p} \qquad Var[X] = \frac{r (1 - p)}{p^2}$$
The Poisson distribution (1/6)
• The Poisson distribution is another important distribution
also connected to Bernoulli trials, albeit indirectly.

• Its support is X = N0 = {0, 1, 2, . . . }.

• It has one parameter λ ≥ 0, called intensity.

• The Poisson distribution is commonly denoted as:

X ∼ Pois (λ)

• Its p.m.f., which is as follows, is not too intuitive.

$$f_X(x; \lambda) = \frac{\exp(-\lambda) \cdot \lambda^x}{x!}$$
The Poisson distribution (2/6)
• The Poisson’s c.d.f. is:

$$F_X(x; \lambda) = \exp(-\lambda) \sum_{i=0}^{\lfloor x \rfloor} \frac{\lambda^i}{i!}$$

and note that the p.m.f. sums to 1 over the support:

$$\lim_{M \to \infty} F_X(M; \lambda) = \exp(-\lambda) \cdot \lim_{M \to \infty} \sum_{x=0}^{M} \frac{\lambda^x}{x!} = \exp(-\lambda) \cdot \exp(\lambda) = 1$$

(the Taylor expansion of exponential functions is used).

• An important property of Poisson distributions is that they
approximate a binomial distribution when p is small and
λ = pn (a demonstration follows).
The Poisson distribution (3/6)
Start from a binomial, fix λ = pn and let n → ∞ (p → 0).

$$\lim_{n \to \infty} f_X(x; p, n) = \lim_{n \to \infty} \binom{n}{x} p^x (1 - p)^{n - x} = \lim_{n \to \infty} \frac{n!}{x!\,(n - x)!} \left(\frac{\lambda}{n}\right)^{x} \left(1 - \frac{\lambda}{n}\right)^{n - x}$$

$$= \frac{\lambda^x}{x!} \cdot \lim_{n \to \infty} \underbrace{\frac{\prod_{k=1}^{x} (n - k + 1)}{n^x}}_{\to\,1} \cdot \underbrace{\left(1 - \frac{\lambda}{n}\right)^{-x}}_{\to\,1} \cdot \underbrace{\left(1 - \frac{\lambda}{n}\right)^{n}}_{\to\,\exp(-\lambda)} = \exp(-\lambda) \frac{\lambda^x}{x!} = f_X(x; \lambda)$$
The Poisson distribution (4/6)

Binomial vs. Poisson comparison with n = 20, p = 0.2, λ = 4.

[Plot of the two p.m.f.s over x = 0, 1, . . . , 10.]

• Binomial probabilities are denoted with solid thin lines, smaller
full points; Poisson probabilities are denoted with dashed thicker
lines, larger hollow points. Probabilities for x > 10 are negligible.
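The comparison in the figure can be reproduced numerically; the sketch below (not from the slides) assumes scipy and uses the same n = 20, p = 0.2, λ = 4.

```python
# Sketch (not from the slides): binomial vs. Poisson p.m.f.s with lambda = np,
# matching the parameters stated for the figure; scipy is assumed.
import numpy as np
from scipy import stats

n, p = 20, 0.2
lam = n * p
xs = np.arange(11)
binom_pmf = stats.binom(n, p).pmf(xs)
pois_pmf = stats.poisson(lam).pmf(xs)
for x, b, q in zip(xs, binom_pmf, pois_pmf):
    print(f"x={x:2d}  binomial={b:.4f}  poisson={q:.4f}")
# The two columns are close, and get closer as n grows with lambda = np fixed.
```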
The Poisson distribution (5/6)

It is now easier to interpret the Poisson distribution and its use.

• A model for the number of “occurrences” of some kind of
event in a well-defined interval/space.

• Examples: phone calls, emails, holes in a piece of fabric.

• The occurrences of interest happen independently. . .

• . . . and they are all equally likely.

• The larger the interval under examination, the higher the
number of occurrences one can expect.
The Poisson distribution (6/6)
• The Poisson’s m.g.f. obtains by another Taylor expansion:

$$M_X(t; \lambda) = \lim_{M \to \infty} \sum_{x=0}^{M} \exp(tx) \cdot \frac{\exp(-\lambda) \cdot \lambda^x}{x!} = \exp(-\lambda) \cdot \lim_{M \to \infty} \sum_{x=0}^{M} \frac{[\lambda \exp(t)]^x}{x!} = \exp(-\lambda) \cdot \exp(\lambda \exp(t)) = \exp(\lambda(\exp(t) - 1))$$

• . . . thus the key moments are both equal to λ:

$$E[X] = \lambda \qquad Var[X] = \lambda$$

motivating the name “intensity” for λ.


The uniform discrete distribution (1/2)
• Not all discrete distributions are based on Bernoulli trials!
Suppose the probability is equal over the entire support:

$$f_X(x; N) = \frac{1}{N}$$

for any X with |X| = N: a uniform discrete distribution.

• If X are the integers between a and b (with b − a = N − 1):

X ∼ U {a, b}

• . . . and the c.d.f. can be expressed as follows.

$$F_X(x; a, b) = \frac{\lfloor x \rfloor - a + 1}{b - a + 1} \cdot 1[a \le x \le b] + 1[b < x]$$
The uniform discrete distribution (2/2)

• If X ∼ U {a, b}, the m.g.f. of X is:

$$M_X(t; a, b) = \frac{\exp(at) - \exp((b + 1)t)}{N (1 - \exp(t))}$$

• . . . and thus the mean and variance of X are as follows.

$$E[X] = \frac{a + b}{2} \qquad Var[X] = \frac{N^2 - 1}{12}$$

It is perhaps easier to calculate these moments without the
use of the m.g.f. here.
Hypergeometric distribution (1/3)
• A Bernoulli trial, when repeated, is equivalent to the urn
experiment with replacement (an urn contains balls of two
kinds; when extracted, a ball is re-inserted in the urn).

• What if there is no replacement at every repetition? As in
the binomial, interest falls on the number of successes X.

• Note: successes become less likely the more are obtained.
For example, if K balls can potentially return “success” at
the first trial, these “good” balls decrease at every trial.

• The resulting hypergeometric distribution has parameters:


N (total possible occurrences/balls), K (occurrences/balls
that can deliver a success) and n (number of trials).

• The distribution is denoted as follows.

X ∼ H (N, K, n)
Hypergeometric distribution (2/3)
• The support of the hypergeometric distribution is:

X = {max (0, n + K − N ) , . . . , min (n, K)}

since it must be x ≥ 0, x ≤ n, x ≤ K and n − x ≤ N − K.

• The p.m.f. (from which the c.d.f. can be derived) is:

$$f_X(x; N, K, n) = \frac{\binom{K}{x} \binom{N - K}{n - x}}{\binom{N}{n}}$$

since there are $\binom{N}{n}$ ways to arrange all n occurrences, $\binom{K}{x}$
ways to arrange the occurrences that deliver success, and
$\binom{N - K}{n - x}$ to arrange the occurrences that deliver failure; all
occurrences are equally likely.
Hypergeometric distribution (3/3)
• Manipulating this distribution (e.g. to obtain the m.g.f.) is
difficult due to the combinatorics involved.

• For example, the mean is calculated (for Y = X − 1) as:

$$E[X] = \sum_{x \in \mathcal{X}} x \, \frac{\binom{K}{x} \binom{N - K}{n - x}}{\binom{N}{n}} = \sum_{x \in \mathcal{X}} \frac{K \binom{K - 1}{x - 1} \binom{N - K}{n - x}}{\frac{N}{n} \binom{N - 1}{n - 1}} = n \frac{K}{N} \sum_{y \in \mathcal{Y}} \frac{\binom{K - 1}{y} \binom{(N - 1) - (K - 1)}{n - 1 - y}}{\binom{N - 1}{n - 1}} = n \frac{K}{N}$$

• . . . while the variance is $Var[X] = n \frac{K}{N} \frac{N - K}{N} \frac{N - n}{N - 1}$.
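A simulation sketch (not from the slides) of draws without replacement, checking the mean and variance formulas; numpy is assumed and N = 50, K = 20, n = 10 are illustrative.

```python
# Sketch (not from the slides): simulating hypergeometric draws and comparing
# the sample mean and variance with the formulas above; numpy is assumed.
import numpy as np

rng = np.random.default_rng(0)
N, K, n = 50, 20, 10                       # population, "good" balls, draws
x = rng.hypergeometric(ngood=K, nbad=N - K, nsample=n, size=500_000)

print(x.mean(), n * K / N)                                          # ~ 4.0
print(x.var(), n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1)))   # ~ 1.96
```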
Moving to continuous distributions
• The analysis now moves to continuous distributions.

• It helps to begin from location-scale distribution families:


these are characterized by two particular parameters.

• The location parameter determines the overall position of
the distribution on R (usually associated with the mean). It
is usually denoted by µ.

• The scale parameter determines the overall “spread” of the
distribution on R (usually associated with the variance). It
is usually denoted by σ (or σ²), with σ > 0.

• Other distributions overviewed later may have other kinds


of parameters that affect their overall shape.
Location and scale

Definition 1
Location and scale families. Let fZ (z) be a probability density
function associated with some random variable Z. For any µ ∈ R and
any σ ∈ R++ , the family of probability density functions of the form

$$f_X(x) = \frac{1}{\sigma} f_Z\!\left(\frac{x - \mu}{\sigma}\right)$$

for a generic random variable X is called the location-scale family
with standard probability density function fZ (z); µ is called the
location parameter while σ is called the scale parameter.

• Note: this implies Z = (X − µ)/σ.

• Conversely, it is X = σZ + µ.
Standardization of densities

Theorem 1
Standardization of densities. Let f (·) be any probability density
function, µ ∈ R and σ ∈ R++ . Then, a random variable X follows a
probability distribution with density function:

$$f_X(x) = \frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$$

if and only if there exists a continuous random variable Z whose prob-
ability density function is fZ (z) = f (z) and X = σZ + µ.

Proof.
Necessity is shown through a density transformation with X = g(Z):
g(Z) = σZ + µ (a monotone transformation), g⁻¹(x) = (x − µ)/σ and
$\frac{d}{dx} g^{-1}(x) = 1/\sigma$. Sufficiency is shown by the converse exercise:
define Z = g(X) = (X − µ)/σ – again a monotone transformation –
with g⁻¹(z) = σz + µ, $\frac{d}{dz} g^{-1}(z) = \sigma$, and again one can apply the
theorem for monotone transformations of density functions.
Implications of standardization
• A standard density function (with associated distribution)
exists for every location-scale family.

• Given that the distributions in a location-scale family are
all linked through a linear transformation, their mean,
variance, other moments and moment generating functions
are related via simple functions.

• All probabilities that are specific to a distribution from a
location-scale family can be expressed with reference to the
standard distribution.

$$P(a \le X \le b) = P\!\left(\frac{a - \mu}{\sigma} \le \frac{X - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right) = P\!\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right)$$
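A small illustration (not from the slides) of the standardization identity, using the normal family and scipy's loc/scale convention; the values of µ, σ, a, b are arbitrary.

```python
# Sketch (not from the slides): a probability computed directly and via
# standardization; scipy is assumed.
from scipy import stats

mu, sigma, a, b = 2.0, 3.0, 1.0, 5.0    # illustrative values
direct = (stats.norm(loc=mu, scale=sigma).cdf(b)
          - stats.norm(loc=mu, scale=sigma).cdf(a))
via_z = stats.norm().cdf((b - mu) / sigma) - stats.norm().cdf((a - mu) / sigma)
print(direct, via_z)                    # identical up to floating point
```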
The normal distribution (1/5)
• The queen of continuous distributions, a.k.a. “Gaussian.”

• Given parameters µ and σ², it is indicated as:

$$X \sim N(\mu, \sigma^2)$$

where its standard version is Z ∼ N (0, 1).

• Its support is X = R; the p.d.f. is:

$$f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

• . . . while its c.d.f. obtains by integrating the density.

$$F_X(x; \mu, \sigma^2) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(t - \mu)^2}{2\sigma^2}\right) dt$$
The normal distribution (2/5)

[Plot: standard normal density vs. the normal density with µ = 2, σ² = 4.]
The normal distribution (3/5)
• To show that the density integrates to 1, one can focus on
the standard density, and specifically on half its support.

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right) dz = 1 \iff \int_{0}^{\infty} \exp\!\left(-\frac{z^2}{2}\right) dz = \sqrt{\frac{\pi}{2}}$$

• The derivation is somewhat tedious.

$$\left[\int_{0}^{\infty} \exp\!\left(-\frac{z^2}{2}\right) dz\right]^2 = \left[\int_{0}^{\infty} \exp\!\left(-\frac{t^2}{2}\right) dt\right] \left[\int_{0}^{\infty} \exp\!\left(-\frac{u^2}{2}\right) du\right] = \int_{0}^{\infty}\!\!\int_{0}^{\infty} \exp\!\left(-\frac{t^2 + u^2}{2}\right) dt\,du$$

$$= \int_{0}^{\infty}\!\!\int_{0}^{\pi/2} r \cdot \exp\!\left(-\frac{r^2}{2}\right) d\theta\,dr = \frac{\pi}{2} \int_{0}^{\infty} r \cdot \exp\!\left(-\frac{r^2}{2}\right) dr = \frac{\pi}{2} \left[-\exp\!\left(-\frac{r^2}{2}\right)\right]_{0}^{\infty} = \frac{\pi}{2}$$
The normal distribution (4/5)
• Obtaining the standard’s m.g.f. is easier:

$$M_Z(t) = \int_{-\infty}^{+\infty} \exp(tz) \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right) dz = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2 - 2zt + t^2 - t^2}{2}\right) dz$$

$$= \exp\!\left(\frac{t^2}{2}\right) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(z - t)^2}{2}\right) dz = \exp\!\left(\frac{t^2}{2}\right)$$

• hence, in the general case it is as follows.

$$M_X(t; \mu, \sigma^2) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$
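A Monte Carlo sketch (not from the slides) that checks the general m.g.f. against simulated draws; numpy assumed, parameter values illustrative.

```python
# Sketch (not from the slides): empirical E[exp(tX)] vs. the closed-form
# normal m.g.f.; numpy is assumed.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, t = 1.0, 2.0, 0.3               # illustrative values
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

print(np.mean(np.exp(t * x)))                   # empirical E[exp(tX)]
print(np.exp(mu * t + 0.5 * sigma**2 * t**2))   # closed form, ~ 1.617
```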
The normal distribution (5/5)
• The key moments of the normal distribution are:

$$E[X] = \mu \qquad Var[X] = \sigma^2 \qquad Skew[X] = 0 \qquad Kurt[X] = 3$$

mean and variance coincide with µ, σ²; the constant Kurt is
a reference point (Kurt[X] − 3 is called “excess kurtosis”).

• There is an alternative parametrization of the distribution:

$$f_X(x; \mu, \phi^2) = \sqrt{\frac{\phi^2}{2\pi}} \exp\!\left(-\frac{\phi^2 (x - \mu)^2}{2}\right)$$

where φ² = σ⁻² is called the precision parameter.


The lognormal distribution (1/3)
• The lognormal distribution obtains from the transformation
Y = exp (X) where X ∼ N (µ, σ²). Thus, the support of Y
is Y = R++ but the parameters are still the normal’s.

• From X = log (Y ), the name and notation follow.

log (Y ) ∼ N (µ, σ²)

• Its p.d.f. is:

$$f_Y(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \frac{1}{y} \exp\!\left(-\frac{(\log y - \mu)^2}{2\sigma^2}\right)$$

• . . . while its c.d.f. obtains by integrating the density.

$$F_Y(y; \mu, \sigma^2) = \int_{0}^{y} \frac{1}{\sqrt{2\pi\sigma^2}} \frac{1}{t} \exp\!\left(-\frac{(\log t - \mu)^2}{2\sigma^2}\right) dt$$
The lognormal distribution (2/3)

[Plot: standard lognormal density vs. the lognormal density with µ = 2, σ² = 4.]
The lognormal distribution (3/3)
• The distribution lacks a m.g.f. but:

$$E[Y^r] = E[(\exp(X))^r] = E[\exp(Xr)] = \exp\!\left(\mu r + \frac{\sigma^2 r^2}{2}\right)$$

• . . . which makes calculating moments easy:

$$E[Y] = \exp\!\left(\mu + \frac{\sigma^2}{2}\right) \qquad Var[Y] = \left[\exp(\sigma^2) - 1\right] \exp\!\left(2\mu + \sigma^2\right)$$

• . . . while the skewness depends on σ² and is always positive.

$$Skew[Y] = \left[\exp(\sigma^2) + 2\right] \cdot \sqrt{\exp(\sigma^2) - 1} > 0$$
The logistic distribution (1/4)
• The logistic distribution has support X = R, parameters µ
and σ, and a “bell shape” similar to the normal case.

• It can be denoted as follows.

X ∼ Logistic (µ, σ)

• Its p.d.f. is:

$$f_X(x; \mu, \sigma) = \frac{1}{\sigma} \exp\!\left(-\frac{x - \mu}{\sigma}\right) \left[1 + \exp\!\left(-\frac{x - \mu}{\sigma}\right)\right]^{-2}$$

• . . . while its c.d.f. is simpler to read.

$$F_X(x; \mu, \sigma) = \left[1 + \exp\!\left(-\frac{x - \mu}{\sigma}\right)\right]^{-1}$$
The logistic distribution (2/4)

[Plot: standard logistic density vs. the logistic density with µ = 2, σ = 2.]
The logistic distribution (3/4)
• The m.g.f. of the standard logistic is obtained as:

$$M_Z(t) = \int_{-\infty}^{\infty} \exp(tz) \frac{\exp(-z)}{(1 + \exp(-z))^2} dz = \int_{0}^{1} u^{t} (1 - u)^{-t} du = B(1 + t, 1 - t)$$

where $u = \frac{1}{1 + \exp(-z)}$; observe that here $\frac{du}{dz} = \frac{\exp(-z)}{(1 + \exp(-z))^2}$.

• Here B (a, b) for a, b > 0 denotes the Beta function:

$$B(a, b) \equiv \int_{0}^{1} u^{a - 1} (1 - u)^{b - 1} du$$

• In the general case the m.g.f. is as follows.

$$M_X(t; \mu, \sigma) = \exp(\mu t) \cdot B(1 - \sigma t, 1 + \sigma t)$$

The logistic distribution (4/4)
• By the properties of the Beta function one can show that:

$$E[X] = \mu \qquad Var[X] = \frac{\sigma^2 \pi^2}{3} \qquad Skew[X] = 0 \qquad Kurt[X] = \frac{21}{5}$$

observe the excess kurtosis!

• An obvious reparametrization of the logistic is σ* = (√3/π) σ.

• The logistic has an important practical advantage – among
others: a closed form quantile function!

$$Q_X(p; \mu, \sigma) = \mu + \sigma \log\!\left(\frac{p}{1 - p}\right) \quad \text{for } p \in (0, 1)$$
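The closed-form quantile function makes inverse-transform sampling immediate; a sketch follows (not from the slides), assuming numpy and illustrative µ, σ.

```python
# Sketch (not from the slides): inverse-transform sampling from the logistic
# via its quantile function; numpy is assumed.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5                        # illustrative values
p = rng.random(1_000_000)                   # Uniform(0, 1) draws
x = mu + sigma * np.log(p / (1 - p))        # Q_X(p; mu, sigma)

print(x.mean(), mu)                         # ~ 2.0
print(x.var(), sigma**2 * np.pi**2 / 3)     # ~ 7.4
```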
The Cauchy distribution (1/3)
• The Cauchy distribution has support X = R, parameters µ
and σ, and a “bell shape” similar to the normal case.
• It can be denoted as follows.

X ∼ Cauchy (µ, σ)

• Its p.d.f. is:

$$f_X(x; \mu, \sigma) = \frac{1}{\pi\sigma} \left[1 + \left(\frac{x - \mu}{\sigma}\right)^2\right]^{-1}$$

• . . . while its c.d.f. also has a closed form expression. . .

$$F_X(x; \mu, \sigma) = \frac{1}{\pi} \arctan\!\left(\frac{x - \mu}{\sigma}\right) + \frac{1}{2}$$

• . . . which is invertible: $Q_X(p; \mu, \sigma) = \mu + \sigma \tan\!\left(\pi \left(p - \frac{1}{2}\right)\right)$.
The Cauchy distribution (2/3)

[Plot: standard Cauchy density vs. the Cauchy density with µ = 2, σ = 2.]
The Cauchy distribution (3/3)
• The Cauchy distribution is notorious for lacking defined
moments. Consider its standard version’s mean:

$$E[Z] = \int_{-\infty}^{0} \frac{1}{\pi} \frac{z}{1 + z^2} dz + \int_{0}^{+\infty} \frac{1}{\pi} \frac{z}{1 + z^2} dz$$

these two halves are symmetric. But, take the latter:

$$\int_{0}^{+\infty} \frac{z}{1 + z^2} dz = \lim_{M \to \infty} \left[\frac{\log(1 + z^2)}{2}\right]_{0}^{M} = \infty$$

the integrals diverge! The mean cannot be defined.

• The Cauchy lacks a m.g.f. but, like all distributions, has a
characteristic function; this is not differentiable at t = 0.

$$\varphi_X(t; \mu, \sigma) = \exp(i\mu t - \sigma|t|)$$


The Laplace distribution (1/4)
• The Laplace distribution has support X = R, parameters µ
and σ, and a characteristic “tent shape.”

• It is also known as “double exponential” (for reasons to be


clarified later) and can be denoted as follows.

X ∼ Laplace (µ, σ)

• Its p.d.f. features an absolute value:

$$f_X(x; \mu, \sigma) = \frac{1}{2\sigma} \exp\!\left(-\frac{|x - \mu|}{\sigma}\right)$$

• . . . thus, its c.d.f. depends on the value of x.

$$F_X(x; \mu, \sigma) = \begin{cases} \frac{1}{2} \exp\!\left(\frac{x - \mu}{\sigma}\right) & \text{if } x < \mu \\[4pt] 1 - \frac{1}{2} \exp\!\left(-\frac{x - \mu}{\sigma}\right) & \text{if } x \ge \mu \end{cases}$$
The Laplace distribution (2/4)

[Plot: standard Laplace density vs. the Laplace density with µ = 2, σ = 2.]
The Laplace distribution (3/4)
• As usual, it is easier to calculate the standard m.g.f. first:

$$M_Z(t) = \int_{-\infty}^{+\infty} \frac{1}{2} \exp(tz - |z|) dz = \frac{1}{2} \int_{-\infty}^{0} \exp((1 + t)z) dz + \frac{1}{2} \int_{0}^{+\infty} \exp(-(1 - t)z) dz = \frac{1}{2}\left(\frac{1}{1 + t} + \frac{1}{1 - t}\right) = \frac{1}{1 - t^2}$$

• . . . so as to generalize it easily (note: only for |t| < σ⁻¹).

$$M_X(t; \mu, \sigma) = \frac{\exp(\mu t)}{1 - \sigma^2 t^2}$$
The Laplace distribution (4/4)

• The mean and variance are as follows.

$$E[X] = \mu \qquad Var[X] = 2\sigma^2$$

• Like the logistic and Cauchy, this distribution possesses an
explicit quantile function.

$$Q_X(p; \mu, \sigma) = \begin{cases} \mu + \sigma \log(2p) & \text{if } p \in \left(0, \frac{1}{2}\right] \\[4pt] \mu - \sigma \log(2 - 2p) & \text{if } p \in \left(\frac{1}{2}, 1\right) \end{cases}$$

• The Laplace distribution has some limited applications in


the social sciences; these include modeling growth rates of
certain populations (e.g. firms).
Beyond location-scale families
• The rest of this Lecture covers continuous distributions that
do not strictly relate to some location-scale family.

• The parameters of these distributions may be pure so-called


shape parameters, and/or determine the support.

• It is useful to always specify the range of admissible values


for these parameters.

• There is plenty of relationships within and between these


distributions.

• The analysis ends with extreme value distributions, that


feature three parameters: for location, scale and shape.
The uniform distribution (1/2)
• Uniform distributions have bounded support: X = [a, b]
with a ≤ b, but the interval may as well be open.

• Here a and b are effectively parameters. The notation is:

X ∼ U (a, b)

(parentheses and not brackets, unlike the discrete uniform).

• The p.d.f. makes use of indicator functions:

$$f_X(x; a, b) = \frac{1}{b - a} \cdot 1[x \in (a, b)]$$

• . . . and the c.d.f. too.

$$F_X(x; a, b) = \frac{x - a}{b - a} \cdot 1[x \in (a, b)] + 1[x \in [b, \infty)]$$
The uniform distribution (2/2)
• If X ∼ U (a, b), the m.g.f. of X is:

$$M_X(t; a, b) = \begin{cases} \frac{\exp(bt) - \exp(at)}{t(b - a)} & \text{if } t \neq 0 \\[4pt] 1 & \text{if } t = 0 \end{cases}$$

• . . . and thus the mean and variance of X are as follows.

$$E[X] = \frac{a + b}{2} \qquad Var[X] = \frac{(b - a)^2}{12}$$

It is perhaps easier to calculate these moments without the
use of the m.g.f. here.

• The analysis does not change if the support is open.


The Beta distribution (1/5)
• Beta distributions are general distributions with bounded
support. Focus for now on X = [0, 1], or X = (0, 1).

• Here α > 0 and β > 0 are the parameters; the notation is:

X ∼ Beta (α, β)

• which is motivated by a p.d.f. expressed through the Beta
function (a normalizing factor):

$$f_X(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{\int_0^1 u^{\alpha - 1} (1 - u)^{\beta - 1} du} = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$$

• . . . just like the c.d.f.!

$$F_X(x; \alpha, \beta) = \frac{\int_0^x t^{\alpha - 1} (1 - t)^{\beta - 1} dt}{\int_0^1 u^{\alpha - 1} (1 - u)^{\beta - 1} du} = \frac{B(x; \alpha, \beta)}{B(\alpha, \beta)}$$
The Beta distribution (2/5)
• In the expression for the c.d.f., B (x; α, β) is the so-called
lower incomplete Beta function; for any positive a, b:

$$B(x; a, b) \equiv \int_0^x u^{a - 1} (1 - u)^{b - 1} du.$$

• The so-called Gamma function Γ (c), for c > 0:

$$\Gamma(c) = \int_0^{\infty} u^{c - 1} \exp(-u)\,du$$

• . . . is related to the Beta function:

$$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$$

• . . . hence the p.d.f. can be alternatively written as follows.

$$f_X(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot x^{\alpha - 1} (1 - x)^{\beta - 1}$$
The Beta distribution (3/5)

[Plot of Beta densities for (α, β) = (2, 2), (.5, .5), (2, 5), (5, 2).]

Observation 2
X ∼ Beta (1, 1) is equivalent to X ∼ U (0, 1).
The Beta distribution (4/5)
• The Beta’s m.g.f. is difficult to obtain:

$$M_X(t; \alpha, \beta) = 1 + \sum_{q=1}^{\infty} \left(\prod_{k=0}^{q-1} \frac{\alpha + k}{\alpha + \beta + k}\right) \frac{t^q}{q!}$$

• . . . and uncentered moments are best calculated directly!

$$E[X^r] = \frac{1}{B(\alpha, \beta)} \int_0^1 x^{r + \alpha - 1} (1 - x)^{\beta - 1} dx = \frac{B(r + \alpha, \beta)}{B(\alpha, \beta)} = \frac{\Gamma(r + \alpha)\Gamma(\beta)}{\Gamma(r + \alpha + \beta)} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{\Gamma(r + \alpha)\Gamma(\alpha + \beta)}{\Gamma(r + \alpha + \beta)\Gamma(\alpha)}$$
The Beta distribution (5/5)
• A Gamma function’s property greatly helps calculations:

$$\Gamma(c) = (c - 1) \cdot \Gamma(c - 1)$$

and the key moments are obtained as follows.

$$E[X] = \frac{\alpha}{\alpha + \beta} \qquad Var[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$

• This analysis generalizes to any connected segment of R
for support. For X = [a, b] or X = (a, b):

$$f_X(x; \alpha, \beta, a, b) = \frac{(x - a)^{\alpha - 1} (b - x)^{\beta - 1}}{B(\alpha, \beta) \cdot (b - a)^{\alpha + \beta - 1}}$$

is a nonstandard Beta (where if α = β = 1, X ∼ U (a, b)).
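A quick check (not from the slides) of the mean and variance formulas against scipy's Beta implementation, with illustrative α = 2, β = 5.

```python
# Sketch (not from the slides): Beta mean and variance formulas vs. scipy.
from scipy import stats

alpha, beta_ = 2.0, 5.0                     # illustrative values
d = stats.beta(alpha, beta_)
print(d.mean(), alpha / (alpha + beta_))                                       # ~ 0.2857
print(d.var(), alpha * beta_ / ((alpha + beta_) ** 2 * (alpha + beta_ + 1)))   # ~ 0.0255
```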


The exponential distribution (1/4)
• The exponential distributions are simple distributions with
support on the set of nonnegative real numbers, X = R+ .

• There is one parameter λ > 0 (often reparametrized as
β = λ⁻¹) and the notation is the following.

X ∼ Exp (λ)

• The name comes from the functional form of the p.d.f.:

$$f_X(x; \lambda) = \frac{1}{\lambda} \exp\!\left(-\frac{x}{\lambda}\right)$$

• . . . as well as that of the c.d.f. (unsurprisingly).

$$F_X(x; \lambda) = 1 - \exp\!\left(-\frac{x}{\lambda}\right)$$
The exponential distribution (2/4)

[Plot of exponential densities for λ = .5, 1, 2.]
The exponential distribution (3/4)
• It is easy to obtain the m.g.f. (recall the case with λ = 1);
however, it only exists for t < λ⁻¹:

$$M_X(t; \lambda) = \frac{1}{1 - \lambda t}$$

• . . . and the key moments are as follows.

$$E[X] = \lambda \qquad Var[X] = \lambda^2$$

• This distribution is the continuous analog of the geometric


distribution. They both share the memoryless property:

P ( X > s| X > t) = P (X > s − t)

(the derivation is also similar). The exponential distribution


is used to model (continuous) waiting times.
The exponential distribution (4/4)
Observation 3
If X ∼ U (0, 1) and Y = −λ log (X) it is Y ∼ Exp (λ).

Observation 4
If X ∼ Exp (λ) and Y = exp (−X) it is Y ∼ Beta (1/λ, 1).

Observation 5
If X ∼ Laplace (µ, σ) and Y = |X − µ| it is Y ∼ Exp (σ), whence the
name “double exponential” for the Laplace.

Observation 6
If X ∼ Exp (1) and

$$Y = \mu - \sigma \log\!\left(\frac{\exp(-X)}{1 - \exp(-X)}\right)$$

it is Y ∼ Logistic (µ, σ). The standard logistic models the odds ratio of
exponential events.
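Observation 3 can be verified by simulation; the sketch below (not from the slides) assumes numpy/scipy and notes that scipy's expon takes (loc, scale) arguments.

```python
# Sketch (not from the slides): Y = -lambda * log(X) with X ~ U(0,1)
# compared against Exp(lambda); numpy and scipy are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam = 2.0                                   # illustrative value
x = rng.random(200_000)
y = -lam * np.log(x)

# Kolmogorov-Smirnov test against Exp(lambda); expon uses (loc, scale).
print(stats.kstest(y, "expon", args=(0, lam)))   # large p-value expected
print(y.mean(), lam, y.var(), lam**2)
```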
The Gamma distribution (1/4)
• Gamma distributions have support upon the set of positive
real numbers X = R++ (X = 0 may be included at will).
• There are two parameters α > 0 and β > 0 (the latter can
be reparametrized as θ = β⁻¹); two notations coexist.

X ∼ Γ (α, β) & X ∼ Gamma (α, β)

• As expected, the Gamma function normalizes the p.d.f.:

$$f_X(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\beta x)$$

• . . . so the c.d.f. can be expressed via the lower incomplete
Gamma function, $\gamma(a, b) = \int_0^b u^{a - 1} \exp(-u)\,du$:

$$F_X(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^x t^{\alpha - 1} \exp(-\beta t)\,dt = \frac{\gamma(\alpha, \beta x)}{\Gamma(\alpha)}$$
The Gamma distribution (2/4)

[Plot of Gamma densities for (α, β) = (2, 2), (4, 2), (2, 8).]

Observation 7
X ∼ Gamma (1, 1/λ) is equivalent to X ∼ Exp (λ), that is, exponential
distributions are all special cases of the Gamma family.
The Gamma distribution (3/4)
• Uncentered moments are better calculated directly:

$$E[X^r] = \frac{1}{\Gamma(\alpha)\beta^r} \int_0^{\infty} \beta^{r + \alpha} x^{r + \alpha - 1} \exp(-\beta x)\,dx = \frac{\Gamma(r + \alpha)}{\Gamma(\alpha)\beta^r}$$

the integrand, divided by Γ (r + α), is the p.d.f. of a Gamma
distribution with parameters r + α and β, so the integral
equals Γ (r + α).

• Using the property Γ (c) = (c − 1) · Γ (c − 1) again, the key
moments are as follows.

$$E[X] = \frac{\alpha}{\beta} \qquad Var[X] = \frac{\alpha}{\beta^2}$$
The Gamma distribution (4/4)
• Alternatively, one could have calculated the m.g.f. as:

$$M_X(t; \alpha, \beta) = \int_0^{\infty} \exp(tx) \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\beta x)\,dx = \frac{\beta^\alpha}{(\beta - t)^\alpha} \int_0^{\infty} \frac{(\beta - t)^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-(\beta - t)x)\,dx = \left(\frac{\beta}{\beta - t}\right)^{\!\alpha} = \left(1 - \frac{t}{\beta}\right)^{\!-\alpha}$$

within the integral is a Gamma p.d.f. with parameters α and
β − t. The m.g.f. is only defined for t < β!

• Gamma distributions have a wide range of applications for


flexibly modeling phenomena with support on X = R+ .
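A short check (not from the slides) of E[X] = α/β and Var[X] = α/β²; note that scipy parametrizes the Gamma by a scale equal to 1/β, so the conversion below is an assumption of that convention.

```python
# Sketch (not from the slides): Gamma moments vs. scipy, with scale = 1/beta.
from scipy import stats

alpha, beta_ = 3.0, 2.0                     # illustrative values
d = stats.gamma(a=alpha, scale=1.0 / beta_)
print(d.mean(), alpha / beta_)              # 1.5
print(d.var(), alpha / beta_**2)            # 0.75
```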
The Chi-squared distribution (1/3)
• Chi-squared distributions also have support upon the set of
positive real numbers X = R++ (with possibly X = 0).

• There is one parameter κ > 0; when κ ∈ N (integer), this is


known as degrees of freedom. The notation is as follows.

X ∼ χ2 (κ) or X ∼ χ2κ

• The p.d.f. is normalized by the Gamma function:

$$f_X(x; \kappa) = \frac{1}{\Gamma\!\left(\frac{\kappa}{2}\right) \cdot 2^{\kappa/2}} \, x^{\frac{\kappa}{2} - 1} \exp\!\left(-\frac{x}{2}\right)$$

and so does the c.d.f. (not reported for brevity).

• It is obvious that this is a subfamily of the Gamma family.


It is singled out because of its role in statistical inference.
The Chi-squared distribution (2/3)

[Plot of chi-squared densities for κ = 3, 5, 7.]

Observation 8
X ∼ Gamma (κ/2, 1/2) is equivalent to X ∼ χ² (κ), that is, chi-squared
distributions are all special cases of the Gamma family.
The Chi-squared distribution (3/3)

• From the Gamma’s analysis, the m.g.f. is (for t < 0.5):

$$M_X(t; \kappa) = (1 - 2t)^{-\kappa/2}$$

• . . . while the key moments are as follows.

$$E[X] = \kappa \qquad Var[X] = 2\kappa$$

Observation 9
X ∼ χ² (2) is equivalent to X ∼ Exp (2).

Observation 10
If X ∼ N (0, 1) and Y = X 2 , it is Y ∼ χ2 (1).
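Observation 10 lends itself to a simulation check; the sketch below (not from the slides) assumes numpy and scipy.

```python
# Sketch (not from the slides): the square of a standard normal compared
# against a chi-squared with one degree of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)
y = z**2

print(stats.kstest(y, "chi2", args=(1,)))   # large p-value expected
print(y.mean(), y.var())                    # ~ 1 and ~ 2, matching kappa = 1
```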
Snedecor’s F-distribution (1/3)
• Another family of distributions with support upon the set of
positive real numbers X = R++ (with possibly X = 0).

• There are two parameters ν1 > 0 and ν2 > 0, called paired


degrees of freedom if integers. The notation is as follows.

X ∼ F (ν1 , ν2 ) or X ∼ Fν1 ,ν2

• The p.d.f. is normalized by the Beta function:

$$f_X(x; \nu_1, \nu_2) = \frac{1}{B\!\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\!\frac{\nu_1}{2}} x^{\frac{\nu_1}{2} - 1} \left(1 + \frac{\nu_1}{\nu_2} x\right)^{\!-\frac{\nu_1 + \nu_2}{2}}$$

• . . . thus the c.d.f. is best expressed via the incomplete Beta
function.

$$F_X(x; \nu_1, \nu_2) = \frac{B\!\left(\frac{\nu_1 x}{\nu_1 x + \nu_2}; \frac{\nu_1}{2}, \frac{\nu_2}{2}\right)}{B\!\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)}$$

Snedecor’s F-distribution (2/3)

[Plot of F densities for (ν₁, ν₂) = (2, 2), (2, 6), (12, 12).]

Observation 11
If X ∼ F (ν1 , ν2 ) and Y = X −1 , it is Y ∼ F (ν2 , ν1 ).
Snedecor’s F-distribution (3/3)
• The F-distribution lacks a m.g.f. and also its characteristic
function is involved. Key moments are better obtained via
direct integration.

$$E[X] = \frac{\nu_2}{\nu_2 - 2} \ \ (\text{for } \nu_2 > 2) \qquad Var[X] = \frac{2\nu_2^2 (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)} \ \ (\text{for } \nu_2 > 4)$$

• The F-distribution also plays a role in statistical inference.

Observation 12
If X ∼ F (ν₁, ν₂) and Y ∼ Beta (ν₁/2, ν₂/2), the random variables X and
Y are related through the following reciprocal transformations.

$$Y = \frac{\nu_1 X / \nu_2}{1 + \nu_1 X / \nu_2} \qquad X = \frac{\nu_2 Y}{\nu_1 (1 - Y)}$$
Student’s t-distribution (1/4)
• Back to a bell-shaped family with “full” support X = R!
• There is one parameter ν > 0; when ν ∈ N (integer), this is
known as degrees of freedom. The notation is as follows.
X ∼ T (ν) or X ∼ Tν
• The p.d.f. is normalized by the Beta function:

$$f_X(x; \nu) = \frac{1}{\sqrt{\nu}\, B\!\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{\!-\frac{\nu + 1}{2}} = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sqrt{\pi\nu}\, \Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{\!-\frac{\nu + 1}{2}}$$

or even by the Gamma function, since $B\!\left(\frac{1}{2}, \frac{\nu}{2}\right) = \Gamma\!\left(\frac{1}{2}\right) \Gamma\!\left(\frac{\nu}{2}\right) \big/ \Gamma\!\left(\frac{\nu + 1}{2}\right)$ and $\Gamma\!\left(\frac{1}{2}\right) = \sqrt{\pi}$.

• The t-distribution’s c.d.f. is perhaps best expressed through
the incomplete Beta function.

$$F_X(x; \nu) = \begin{cases} \dfrac{1}{2}\,\dfrac{B\!\left(\frac{\nu}{x^2 + \nu}; \frac{\nu}{2}, \frac{1}{2}\right)}{B\!\left(\frac{\nu}{2}, \frac{1}{2}\right)} & \text{if } x \le 0 \\[12pt] 1 - \dfrac{1}{2}\,\dfrac{B\!\left(\frac{\nu}{x^2 + \nu}; \frac{\nu}{2}, \frac{1}{2}\right)}{B\!\left(\frac{\nu}{2}, \frac{1}{2}\right)} & \text{if } x > 0 \end{cases}$$
Student’s t-distribution (2/4)

[Plot: Student’s t density with ν = 3 vs. the standard Cauchy and standard normal densities.]

Observation 13
X ∼ T (1) is equivalent to X ∼ Cauchy (0, 1).
Student’s t-distribution (3/4)
• The t-distribution lacks a m.g.f., its characteristic function
is involved, and moments of order r ≥ ν are not defined.

$$E[X^r] = \begin{cases} \dfrac{\Gamma\!\left(\frac{r + 1}{2}\right) \Gamma\!\left(\frac{\nu - r}{2}\right)}{\sqrt{\pi}\, \Gamma\!\left(\frac{\nu}{2}\right)} \cdot \sqrt{\nu^r} & \text{if } r \text{ is even, } 0 < r < \nu \\[12pt] 0 & \text{if } r \text{ is odd, } 0 < r < \nu \end{cases}$$

• Hence, key moments exist only for some values of ν.

$$E[X] = 0 \ \ \text{for } \nu > 1 \qquad Var[X] = \frac{\nu}{\nu - 2} \ \ \text{for } \nu > 2$$

$$Skew[X] = 0 \ \ \text{for } \nu > 3 \qquad Kurt[X] = \frac{3\nu - 6}{\nu - 4} \ \ \text{for } \nu > 4$$
ν−4
Student’s t-distribution (4/4)

Observation 14
If X ∼ T (ν) and Y = X 2 , it is Y ∼ F (1, ν).

Observation 15
If X ∼ T (ν) and Y = X −2 , it is Y ∼ F (ν, 1).

• The t-distribution is central in statistical inference, largely
because of some results that relate it to both the standard
normal and the chi-squared distribution (Lectures 3, 4).

• But this is also due to its asymptotic relationship with the
standard normal, which the t-distribution approximates as
ν → ∞ (Lecture 6).
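Observation 14 can be checked by comparing quantiles of X² (for X ∼ T(ν)) with those of F(1, ν); a sketch follows (not from the slides), assuming scipy, with ν = 5 chosen arbitrarily.

```python
# Sketch (not from the slides): quantiles of the square of a t variable vs.
# quantiles of F(1, nu); scipy is assumed.
import numpy as np
from scipy import stats

nu = 5                                                  # illustrative value
p_grid = np.linspace(0.05, 0.95, 10)
q_t_squared = stats.t(nu).ppf((1 + p_grid) / 2) ** 2    # quantiles of X^2, X ~ T(nu)
q_f = stats.f(1, nu).ppf(p_grid)                        # quantiles of F(1, nu)
print(np.allclose(q_t_squared, q_f))                    # True
```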
The Pareto distribution (1/4)
• This distribution has a support which depends on one of its
parameters: X = [α, ∞), where α > 0.

• There is also a second parameter: β > 0. The notation for


a Pareto distribution is unsurprisingly as follows.

X ∼ Pareto (α, β)

• The p.d.f. is (note the specification of the support):

$$f_X(x; \alpha, \beta) = \frac{\beta \alpha^\beta}{x^{\beta + 1}} \quad \text{for } x \ge \alpha$$

• . . . while the c.d.f. is as follows.

$$F_X(x; \alpha, \beta) = 1 - \left(\frac{\alpha}{x}\right)^{\!\beta} \quad \text{for } x \ge \alpha$$
The Pareto distribution (2/4)

[Plot of Pareto densities with α = 1 and β = 1, 2, 3.]

Observation 16
If X ∼ Pareto (α, β) and Y ∼ Exp (β⁻¹), the two random variables
are related through the two symmetric transformations X = α exp (Y )
and Y = log (X/α).
The Pareto distribution (3/4)
• The m.g.f. is expressed here through the upper incomplete
Gamma function, $\Gamma(a, b) = \int_b^{\infty} u^{a - 1} \exp(-u)\,du$:

$$M_X(t; \alpha, \beta) = \beta (-\alpha t)^\beta \cdot \Gamma(-\beta, -\alpha t)$$

• Key moments are best obtained via direct integration, but
they exist only for some values of β. The mean is:

$$E[X] = \begin{cases} \infty & \text{for } \beta \le 1 \\[4pt] \dfrac{\alpha\beta}{\beta - 1} & \text{for } \beta > 1 \end{cases}$$

while the variance is as follows.

$$Var[X] = \begin{cases} \infty & \text{for } \beta \le 2 \\[4pt] \dfrac{\alpha^2 \beta}{(\beta - 1)^2 (\beta - 2)} & \text{for } \beta > 2 \end{cases}$$

The Pareto distribution (4/4)
• Pareto distributions feature a so-called “fat tail:” extreme
realizations of X are relatively likely.

• They are also noteworthy for their Power Law: in logs the
p.d.f. is conveniently linear.

$$\log f_X(x; \alpha, \beta) = \log\!\left(\beta \alpha^\beta\right) - (\beta + 1) \log x \quad \text{for } x \ge \alpha$$

• Their quantile function is also a simple expression.

$$Q_X(p; \alpha, \beta) = \alpha (1 - p)^{-\frac{1}{\beta}}$$

• There is a “generalized” family of Pareto distributions:

$$F_X(x; \beta, \gamma, \mu, \sigma) = 1 - \left[1 + \left(\frac{x - \mu}{\sigma}\right)^{\!\frac{1}{\gamma}}\right]^{-\beta} \quad \text{for } x \ge \mu$$

with µ ∈ R, (β, γ, σ) ∈ R³₊₊ and support X = [µ, ∞).
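The simple quantile function makes Pareto sampling by inversion easy; the sketch below (not from the slides) assumes numpy, with α = 1 and β = 3 illustrative (β > 1, so the mean exists).

```python
# Sketch (not from the slides): Pareto draws via the quantile function and a
# check of the mean formula for beta > 1; numpy is assumed.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta_ = 1.0, 3.0                     # illustrative values
p = rng.random(1_000_000)
x = alpha * (1 - p) ** (-1.0 / beta_)       # Q_X(p; alpha, beta)

print(x.mean(), alpha * beta_ / (beta_ - 1))   # ~ 1.5
# log f_X(x) is linear in log x with slope -(beta + 1): the "power law".
```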


Generalized Extreme Value distributions (1/4)
• The family of Generalized Extreme Value distributions is a
large one, and includes several subfamilies.

• It gets its name from its connection with the Extreme Value
Theorem (Lecture 6). These distributions are fat-tailed.

• The family features three parameters: µ ∈ R (location),
σ ∈ R++ (scale), and ξ ∈ R (shape). A notation valid for
the whole family is as follows.

X ∼ GEV (µ, σ, ξ)

• The support depends on the value of the shape parameter.

$$\mathcal{X} = \begin{cases} \left[\mu - \frac{\sigma}{\xi}, \, \infty\right) & \text{if } \xi > 0 \\[4pt] (-\infty, \infty) & \text{if } \xi = 0 \\[4pt] \left(-\infty, \, \mu - \frac{\sigma}{\xi}\right] & \text{if } \xi < 0 \end{cases}$$
Generalized Extreme Value distributions (2/4)
• The p.d.f. also depends on the shape parameter ξ:

$$f_Z(z; \xi) = \begin{cases} \dfrac{\exp\!\left(-(1 + \xi z)^{-\frac{1}{\xi}}\right)}{(1 + \xi z)^{\frac{1}{\xi} + 1}} & \text{for } \xi \neq 0 \text{ and } \xi z > -1 \\[12pt] \dfrac{\exp(-\exp(-z))}{\exp(z)} & \text{for } \xi = 0 \end{cases}$$

• . . . and so does the c.d.f.:

$$F_Z(z; \xi) = \begin{cases} \exp\!\left(-(1 + \xi z)^{-\frac{1}{\xi}}\right) & \text{for } \xi \neq 0 \text{ and } \xi z > -1 \\[6pt] \exp(-\exp(-z)) & \text{for } \xi = 0 \end{cases}$$

• . . . and so does the quantile function.

$$Q_Z(p; \xi) = \begin{cases} \dfrac{(-\log(p))^{-\xi} - 1}{\xi} & \text{for } \xi \neq 0 \\[8pt] -\log(-\log(p)) & \text{for } \xi = 0 \end{cases}$$
Generalized Extreme Value distributions (3/4)

[Plot of GEV densities for ξ = .5, 0, −.5.]
• Type I Extreme Value: ξ = 0 (Gumbel)
• Type II Extreme Value: ξ > 0 (Fréchet)
• Type III Extreme Value: ξ < 0 (reverse Weibull)
Generalized Extreme Value distributions (4/4)
• The m.g.f. and characteristic functions are quite involved.
• Moments are better obtained via direct integration, but are
defined for some values of ξ only.
• The mean is given by:

$$E[X] = \begin{cases} \mu + \frac{\sigma}{\xi}\left[\Gamma(1 - \xi) - 1\right] & \text{if } \xi \neq 0, \ \xi < 1 \\[4pt] \mu + \sigma\gamma & \text{if } \xi = 0 \\[4pt] \infty & \text{if } \xi \ge 1 \end{cases}$$

where γ is the Euler–Mascheroni constant.

• . . . while the variance is as follows.

$$Var[X] = \begin{cases} \left(\frac{\sigma}{\xi}\right)^2 \left[\Gamma(1 - 2\xi) - (\Gamma(1 - \xi))^2\right] & \text{if } \xi \neq 0, \ \xi < \frac{1}{2} \\[4pt] \sigma^2 \frac{\pi^2}{6} & \text{if } \xi = 0 \\[4pt] \infty & \text{if } \xi \ge \frac{1}{2} \end{cases}$$
The Gumbel (Type I GEV) distribution (1/2)
• The simplest GEV distributions (Type I) have ξ = 0 and
only a location and scale parameter.

• Two alternative pieces of notation are used for them.

X ∼ EV1 (µ, σ) & X ∼ Gumbel (µ, σ)

• The p.d.f. is given by:

$$f_X(x; \mu, \sigma) = \frac{1}{\sigma} \exp\!\left(-\frac{x - \mu}{\sigma}\right) \exp\!\left(-\exp\!\left(-\frac{x - \mu}{\sigma}\right)\right)$$

• . . . the c.d.f. is:

$$F_X(x; \mu, \sigma) = \exp\!\left(-\exp\!\left(-\frac{x - \mu}{\sigma}\right)\right)$$

• . . . while the quantile function is as follows.

$$Q_X(p; \mu, \sigma) = \mu - \sigma \log(-\log(p))$$

The Gumbel (Type I GEV) distribution (2/2)

[Plot: standard Gumbel density vs. the Gumbel density with µ = 2, σ = 2.]
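Gumbel draws can likewise be generated from the quantile function; a sketch follows (not from the slides), assuming numpy/scipy, with illustrative µ = 2, σ = 2.

```python
# Sketch (not from the slides): Gumbel sampling via mu - sigma*log(-log(p)),
# compared with scipy's gumbel_r; numpy and scipy are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 2.0, 2.0                        # illustrative values
p = rng.random(200_000)
x = mu - sigma * np.log(-np.log(p))

print(stats.kstest(x, "gumbel_r", args=(mu, sigma)))   # large p-value expected
print(x.mean(), mu + sigma * np.euler_gamma)           # matches E[X] = mu + sigma*gamma
```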
The Fréchet (Type II GEV) distribution (1/2)
• The Type II GEV distributions are usually rephrased via
α ≡ ξ⁻¹ > 0 and the transformation Y = σ + µ (1 − ξ) + ξX.

• Two alternative pieces of notation are used for them.

Y ∼ EV2 (α, µ, σ) & Y ∼ Frechet (α, µ, σ)

• The p.d.f. is given by:

$$f_Y(y; \alpha, \mu, \sigma) = \frac{\alpha}{\sigma} \left(\frac{y - \mu}{\sigma}\right)^{\!-\alpha - 1} \exp\!\left(-\left(\frac{y - \mu}{\sigma}\right)^{\!-\alpha}\right)$$

• . . . the c.d.f. is:

$$F_Y(y; \alpha, \mu, \sigma) = \exp\!\left(-\left(\frac{y - \mu}{\sigma}\right)^{\!-\alpha}\right)$$

• . . . while the quantile function is as follows.

$$Q_Y(p; \alpha, \mu, \sigma) = \mu + \sigma (-\log(p))^{-\frac{1}{\alpha}}$$
The Fréchet (Type II GEV) distribution (2/2)

[Plot of Fréchet densities with α = 2: standard vs. µ = 2, σ = 2.]

• Recall that the support is Y = [µ, ∞).


The Weibull (Type III GEV) distribution (1/4)
• The Type III GEV distributions also feature α ≡ ξ⁻¹ < 0.

• Here, Y is said to follow the reverse Weibull distribution.
The symmetric W = −Y follows the “traditional” Weibull
distribution. Each has its own notation.

Y ∼ EV3 (α, µ, σ) & W ∼ Weibull (α, µ, σ)

• The p.d.f. of the traditional Weibull is:

$$f_W(w; \alpha, \mu, \sigma) = \frac{\alpha}{\sigma} \left(\frac{w - \mu}{\sigma}\right)^{\!\alpha - 1} \exp\!\left(-\left(\frac{w - \mu}{\sigma}\right)^{\!\alpha}\right)$$

• . . . its c.d.f. is:

$$F_W(w; \alpha, \mu, \sigma) = 1 - \exp\!\left(-\left(\frac{w - \mu}{\sigma}\right)^{\!\alpha}\right)$$

• . . . while its quantile function is as follows.

$$Q_W(p; \alpha, \mu, \sigma) = \mu + \sigma (-\log(1 - p))^{\frac{1}{\alpha}}$$
The Weibull (Type III GEV) distribution (2/4)

[Plot: standard reverse Weibull density vs. the reverse Weibull density with µ = 2, σ = 2.]

• Reverse Weibull: the support is Y = (−∞, µ].


The Weibull (Type III GEV) distribution (3/4)

[Plot: standard traditional Weibull density vs. the traditional Weibull density with µ = 2, σ = 2.]

• Traditional Weibull: the support is W = [µ, ∞).


The Weibull (Type III GEV) distribution (4/4)
Observation 17
If X ∼ Exp (1), Y = µ − σ log (X), and $W = \mu + \sigma X^{1/\alpha}$, it is as follows.

Y ∼ Gumbel (µ, σ) & W ∼ Weibull (α, µ, σ)

Observation 18
If X ∼ Exp (√α) and W ∼ Weibull (1/2, 0, α), it is as follows.

X = √W & W = X²

Observation 19
If Y ∼ Frechet (α, µY , σ), and W = (Y − µY )⁻¹ + µW , it is as follows.

W ∼ Weibull (α, µW , σ⁻¹)

A frequent application of the (traditional) Weibull distribution
is in survival analysis (the statistical study of waiting times).
