Chapter 2: Probability
Notes on MLAPP
Wu Ziqing
14/07/2018
2 Discrete Distributions
Binomial and Bernoulli Distributions
Multinomial and Multinoulli Distributions
Poisson Distribution
Empirical Distribution
3 Continuous Distributions
Normal Distribution
Degenerate pdf
Laplace Distribution
Gamma Distribution
Beta Distribution
Pareto Distribution
7 Information Theory
Entropy
KL Divergence
Mutual Information
Discrete Random Variable: a variable which can take on any value from a
finite or countably infinite set X. The notation p(X = x) denotes the
probability of the event X = x.
p(·) is called a Probability Mass Function (pmf); it satisfies 0 ≤ p(x) ≤ 1 and
Σ_{x∈X} p(x) = 1.
Thus, since Var[X] = E[X^2] − (E[X])^2, we have E[X^2] = µ^2 + σ^2.
Standard Deviation: the standard deviation has the same units as the data.
It is denoted by σ:
Std[X] = √Var[X]
Basic Concepts of Probability Theory
Fundamental Rules
p_emp(A) = (1/N) Σ_{i=1}^N δ_{x_i}(A), where δ_x(A) = 1 if x ∈ A and 0 otherwise.
p(x) = Σ_{i=1}^N w_i δ_{x_i}(x), where 0 ≤ w_i ≤ 1 and Σ_{i=1}^N w_i = 1.
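A minimal sketch (assuming Python with NumPy; not from the slides) of evaluating the empirical probability of an event from a sample:

```python
import numpy as np

def p_emp(samples, A):
    """Empirical probability of event A: (1/N) * sum_i delta_{x_i}(A),
    where A is given as a predicate on a single sample."""
    indicators = np.array([1.0 if A(x) else 0.0 for x in samples])
    return indicators.mean()

# Example: 6 draws from a die; estimate P(X <= 2)
draws = [1, 4, 2, 6, 2, 5]
print(p_emp(draws, lambda x: x <= 2))  # 3/6 = 0.5
```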
The Dirac Delta Function has a sifting property: it selects out a single term
from a sum or integral:
∫_{−∞}^{∞} f(x) δ(x − µ) dx = f(µ)
Inverse Gamma Distribution: if X ∼ Ga(a, b), then 1/X ∼ IG(a, b),
which is defined by:
IG(x|a, b) = (b^a / Γ(a)) x^{−(a+1)} e^{−b/x}
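A quick sanity check of this relationship (a sketch assuming Python with NumPy/SciPy; note that the rate b above corresponds to scale = 1/b for the Gamma and scale = b for the inverse Gamma in scipy.stats):

```python
import numpy as np
from scipy import stats

a, b = 3.0, 2.0  # shape a, rate b

# Draw X ~ Ga(a, b); scipy uses a scale parameter, so scale = 1/b
x = stats.gamma(a, scale=1.0 / b).rvs(size=200_000, random_state=0)

# 1/X should then follow IG(a, b), i.e. invgamma with scale = b
inv_x = 1.0 / x
print(np.mean(inv_x), stats.invgamma(a, scale=b).mean())  # both ~ b/(a-1) = 1.0
```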
A Joint Probability Distribution is defined over multiple variables and has
the form p(x_1, x_2, ..., x_D) for a set of D variables.
If all variables are discrete, we can represent Joint Probability in a
multi-dimensional array, with one variable in each dimension.
The size of the high-dimensional array can be reduced by making Conditional
Independence assumptions, or by restricting the pdf to certain
functional forms (for continuous distributions).
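For instance (a hypothetical sketch in Python/NumPy, not from the book), a joint pmf over two discrete variables can be stored as a 2-D array, and marginals and conditionals are obtained by summing and normalising along axes:

```python
import numpy as np

# Joint pmf p(x1, x2): rows index x1 (2 states), columns index x2 (3 states)
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
assert np.isclose(joint.sum(), 1.0)

p_x1 = joint.sum(axis=1)              # marginal p(x1)
p_x2 = joint.sum(axis=0)              # marginal p(x2)
p_x2_given_x1_0 = joint[0] / p_x1[0]  # conditional p(x2 | x1 = 0)

print(p_x1, p_x2, p_x2_given_x1_0)
```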
Covariance describes the degree to which two variables are linearly related:
cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
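A small illustration (assumed Python/NumPy, not from the slides) of estimating the covariance of two linearly related variables from samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(scale=0.5, size=100_000)  # y depends linearly on x

# Sample version of E[XY] - E[X]E[Y]; np.cov returns the 2x2 covariance matrix
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_xy, np.cov(x, y)[0, 1])  # both close to 2.0
```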
N(x|µ, Σ) = 1 / ((2π)^{D/2} |Σ|^{1/2}) exp[−(1/2)(x − µ)^T Σ^{−1} (x − µ)]
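This density can be evaluated directly (a sketch assuming Python with SciPy; the parameter values below are illustrative only):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

# pdf from scipy versus the formula above
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

D = len(mu)
diff = x - mu
norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
pdf_manual = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(pdf_scipy, pdf_manual)  # should agree
```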
Dir(x|α) = (1/B(α)) ∏_{k=1}^K x_k^{α_k − 1} 𝕀(x ∈ S_K),
where B(α) = ∏_{k=1}^K Γ(α_k) / Γ(α_0) and α_0 = Σ_{k=1}^K α_k
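A short check (sketch assuming Python/NumPy) that Dirichlet samples lie on the simplex S_K and have mean α_k/α_0:

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])
rng = np.random.default_rng(0)

samples = rng.dirichlet(alpha, size=100_000)      # each row sums to 1
print(np.allclose(samples.sum(axis=1), 1.0))      # True: samples lie on the simplex
print(samples.mean(axis=0), alpha / alpha.sum())  # both ~ [0.2, 0.3, 0.5]
```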
If x has mean µ and covariance Σ, and y = f(x) = Ax + b, then:
E[y] = E[Ax + b] = Aµ + b
cov[y] = cov[Ax + b] = AΣA^T
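These identities can be verified by simulation (a sketch assuming Python/NumPy; A, b, µ, Σ below are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b

print(y.mean(axis=0), A @ mu + b)                # E[y]  ~ A mu + b
print(np.cov(y, rowvar=False), A @ Sigma @ A.T)  # cov[y] ~ A Sigma A^T
```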
p_y(y) = p_x(x) |det J_{y→x}|, where J_{y→x} = ∂x/∂y is the Jacobian of the inverse mapping x = f^{−1}(y).
Consider N random variables with pdfs, each with the same mean µ and variance σ^2, i.e., the
variables are independent and identically distributed (iid).
Let S_N = Σ_{i=1}^N X_i be the sum of all the variables. As N increases, the distribution of
the sum approaches a Gaussian (the Central Limit Theorem):
p(S_N = s) = 1/√(2πNσ^2) exp(−(s − Nµ)^2 / (2Nσ^2))
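A minimal demonstration (sketch, Python/NumPy) with iid uniform variables: the sum S_N has mean Nµ and variance Nσ^2, and its histogram approaches this Gaussian as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                        # number of iid variables per sum
mu, sigma2 = 0.5, 1.0 / 12.0  # mean and variance of Uniform(0, 1)

# 100k replicates of S_N = sum of N iid Uniform(0, 1) variables
S = rng.uniform(size=(100_000, N)).sum(axis=1)

print(S.mean(), N * mu)     # ~ 25
print(S.var(), N * sigma2)  # ~ 50/12 ≈ 4.17
```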
H(X) = − Σ_{k=1}^K p(X = k) log_2 p(X = k)
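For example (a sketch in Python/NumPy, not from the slides), the entropy of a discrete distribution, in bits:

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum_k p_k log2 p_k, treating 0 log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

print(entropy_bits([0.5, 0.5]))  # 1.0 bit (fair coin)
print(entropy_bits([1.0, 0.0]))  # 0.0 bits (deterministic)
print(entropy_bits([0.25] * 4))  # 2.0 bits (uniform over 4 states)
```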
Mutual Information (MI) measures how much knowing one variable X tells
us about another variable Y. It is defined as the KL Divergence between the
Joint Probability p(X, Y) and the factored probability p(X)p(Y):
I(X; Y) = KL(p(X, Y) || p(X)p(Y)) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x)p(y)) ]
From this equation, we can see that MI is the expected value of the PMI (pointwise mutual information).
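A small sketch (Python/NumPy, hypothetical joint tables) computing MI from a joint pmf via the definition above:

```python
import numpy as np

def mutual_information_bits(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # p(x) as a column
    py = joint.sum(axis=0, keepdims=True)  # p(y) as a row
    nz = joint > 0                         # skip zero-probability cells
    return np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz]))

# Independent variables -> MI = 0; perfectly correlated -> MI = 1 bit
print(mutual_information_bits([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information_bits([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```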