Chapter 2: Probability
Notes on MLAPP
Wu Ziqing
14/07/2018
2 Discrete Distributions
Binomial and Bernoulli Distributions
Multinomial and Multinoulli Distributions
Poisson Distribution
Empirical Distribution
3 Continuous Distributions
Normal Distribution
Degenerate pdf
Laplace Distribution
Gamma Distribution
Beta Distribution
Pareto Distribution
7 Information Theory
Entropy
KL Divergence
Mutual Information
Discrete Random Variable: a variable which can take on any value from a
finite or countably infinite set X. The notation p(X = x) denotes the
probability of the event X = x.
p(·) is called a Probability Mass Function (pmf); it satisfies 0 ≤ p(x) ≤ 1 and
Σ_{x∈X} p(x) = 1.
Thus, since Var[X] = E[X^2] − (E[X])^2, we have E[X^2] = µ^2 + σ^2.
Standard Deviation: the standard deviation has the same units as the data.
It is denoted by σ:
Std[X] = √Var[X]
Basic Concepts of Probability Theory
Fundamental Rules
p_emp(A) = (1/N) Σ_{i=1}^N δ_{x_i}(A), where δ_x(A) = 1 if x ∈ A and 0 otherwise.
p(x) = Σ_{i=1}^N w_i δ_{x_i}(x), where 0 ≤ w_i ≤ 1 and Σ_{i=1}^N w_i = 1.
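A minimal sketch (assuming Python with NumPy; not from the slides) of evaluating the empirical probability of an event from a sample:

```python
import numpy as np

def p_emp(samples, A):
    """Empirical probability of event A: (1/N) * sum_i delta_{x_i}(A),
    where A is given as a predicate on a single sample."""
    indicators = np.array([1.0 if A(x) else 0.0 for x in samples])
    return indicators.mean()

# Example: 6 draws from a die; estimate P(X <= 2)
draws = [1, 4, 2, 6, 2, 5]
print(p_emp(draws, lambda x: x <= 2))  # 3/6 = 0.5
```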
The Dirac Delta Function has a sifting property: it selects out a single term
from a sum or integral:
∫_{−∞}^{∞} f(x) δ(x − µ) dx = f(µ)
Inverse Gamma Distribution: if X ∼ Ga(a, b), then 1/X ∼ IG(a, b),
which is defined by:
IG(x|a, b) = (b^a / Γ(a)) x^{−(a+1)} e^{−b/x}
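A quick sanity check of this relationship (a sketch assuming Python with NumPy/SciPy; note that the rate b above corresponds to scale = 1/b for the Gamma and scale = b for the inverse Gamma in scipy.stats):

```python
import numpy as np
from scipy import stats

a, b = 3.0, 2.0  # shape a, rate b

# Draw X ~ Ga(a, b); scipy uses a scale parameter, so scale = 1/b
x = stats.gamma(a, scale=1.0 / b).rvs(size=200_000, random_state=0)

# 1/X should then follow IG(a, b), i.e. invgamma with scale = b
inv_x = 1.0 / x
print(np.mean(inv_x), stats.invgamma(a, scale=b).mean())  # both ~ b/(a-1) = 1.0
```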
A Joint Probability Distribution is defined over multiple variables and has
the form p(x_1, x_2, ..., x_D) for a set of D variables.
If all variables are discrete, we can represent Joint Probability in a
multi-dimensional array, with one variable in each dimension.
The size of the high-dimensional array can be reduced by making Conditional
Independence assumptions, or by restricting the pdf to certain
functional forms (for continuous distributions).
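For instance (a hypothetical sketch in Python/NumPy, not from the book), a joint pmf over two discrete variables can be stored as a 2-D array, and marginals and conditionals are obtained by summing and normalising along axes:

```python
import numpy as np

# Joint pmf p(x1, x2): rows index x1 (2 states), columns index x2 (3 states)
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
assert np.isclose(joint.sum(), 1.0)

p_x1 = joint.sum(axis=1)              # marginal p(x1)
p_x2 = joint.sum(axis=0)              # marginal p(x2)
p_x2_given_x1_0 = joint[0] / p_x1[0]  # conditional p(x2 | x1 = 0)

print(p_x1, p_x2, p_x2_given_x1_0)
```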
Covariance describes the degree to which two variables are linearly related:
cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
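A small illustration (assumed Python/NumPy, not from the slides) of estimating the covariance of two linearly related variables from samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(scale=0.5, size=100_000)  # y depends linearly on x

# Sample version of E[XY] - E[X]E[Y]; np.cov returns the 2x2 covariance matrix
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_xy, np.cov(x, y)[0, 1])  # both close to 2.0
```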
N(x|µ, Σ) = 1 / ((2π)^{D/2} |Σ|^{1/2}) exp[−(1/2)(x − µ)^T Σ^{−1} (x − µ)]
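This density can be evaluated directly (a sketch assuming Python with SciPy; the parameter values below are illustrative only):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

# pdf from scipy versus the formula above
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

D = len(mu)
diff = x - mu
norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
pdf_manual = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(pdf_scipy, pdf_manual)  # should agree
```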
Dir(x|α) = (1/B(α)) ∏_{k=1}^K x_k^{α_k − 1} 𝕀(x ∈ S_K),
where B(α) = ∏_{k=1}^K Γ(α_k) / Γ(α_0) and α_0 = Σ_{k=1}^K α_k
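A short check (sketch assuming Python/NumPy) that Dirichlet samples lie on the simplex S_K and have mean α_k/α_0:

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])
rng = np.random.default_rng(0)

samples = rng.dirichlet(alpha, size=100_000)      # each row sums to 1
print(np.allclose(samples.sum(axis=1), 1.0))      # True: samples lie on the simplex
print(samples.mean(axis=0), alpha / alpha.sum())  # both ~ [0.2, 0.3, 0.5]
```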
If x has mean µ and covariance Σ, and y = f(x) = Ax + b, then:
E[y] = E[Ax + b] = Aµ + b
cov[y] = cov[Ax + b] = AΣA^T
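These identities can be verified by simulation (a sketch assuming Python/NumPy; A, b, µ, Σ below are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b

print(y.mean(axis=0), A @ mu + b)                # E[y]  ~ A mu + b
print(np.cov(y, rowvar=False), A @ Sigma @ A.T)  # cov[y] ~ A Sigma A^T
```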
p_y(y) = p_x(x) |det J_{y→x}|, where J_{y→x} = ∂x/∂y is the Jacobian of the inverse mapping x = f^{−1}(y).
Consider N random variables with pdfs, each with the same mean µ and variance σ^2, i.e., the
variables are independent and identically distributed (iid).
Let S_N = Σ_{i=1}^N X_i be the sum of all the variables. As N increases, the distribution of
the sum approaches a Gaussian (the Central Limit Theorem):
p(S_N = s) = 1/√(2πNσ^2) exp(−(s − Nµ)^2 / (2Nσ^2))
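A minimal demonstration (sketch, Python/NumPy) with iid uniform variables: the sum S_N has mean Nµ and variance Nσ^2, and its histogram approaches this Gaussian as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                        # number of iid variables per sum
mu, sigma2 = 0.5, 1.0 / 12.0  # mean and variance of Uniform(0, 1)

# 100k replicates of S_N = sum of N iid Uniform(0, 1) variables
S = rng.uniform(size=(100_000, N)).sum(axis=1)

print(S.mean(), N * mu)     # ~ 25
print(S.var(), N * sigma2)  # ~ 50/12 ≈ 4.17
```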
H(X) = − Σ_{k=1}^K p(X = k) log_2 p(X = k)
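For example (a sketch in Python/NumPy, not from the slides), the entropy of a discrete distribution, in bits:

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum_k p_k log2 p_k, treating 0 log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

print(entropy_bits([0.5, 0.5]))  # 1.0 bit (fair coin)
print(entropy_bits([1.0, 0.0]))  # 0.0 bits (deterministic)
print(entropy_bits([0.25] * 4))  # 2.0 bits (uniform over 4 states)
```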
Mutual Information (MI) measures how much knowing one variable X tells
us about another variable Y. It is defined as the KL Divergence between the
Joint Probability p(X, Y) and the factored probability p(X)p(Y):
I(X; Y) = KL(p(X, Y) || p(X)p(Y)) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x)p(y)) ]
From this equation, we can see that MI is the expected value of the PMI (pointwise mutual information).
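A small sketch (Python/NumPy, hypothetical joint tables) computing MI from a joint pmf via the definition above:

```python
import numpy as np

def mutual_information_bits(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # p(x) as a column
    py = joint.sum(axis=0, keepdims=True)  # p(y) as a row
    nz = joint > 0                         # skip zero-probability cells
    return np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz]))

# Independent variables -> MI = 0; perfectly correlated -> MI = 1 bit
print(mutual_information_bits([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information_bits([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```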