
Bayes' Rule and Law of Total Probability

This document provides a summary of key probability concepts including: 1) Simpson's Paradox, expected value, linearity, and symmetry of expected value. 2) Bayes' rule, the law of total probability, and how indicator random variables relate to probabilities. 3) Key probability formulas and definitions like the multiplication rule, sampling tables, naive definition of probability, and independence.

Probability Cheatsheet v1.

Compiled by William Chen (http://wzchen.com) with contributions from Sebastian Chiu, Yuan Jiang, Yuqi Hou, and Jessy Hwang. Material based off of Joe Blitzstein's (@stat110) lectures (http://stat110.net) and Blitzstein/Hwang's Intro to Probability textbook (http://bit.ly/introprobability). Licensed under CC BY-NC-SA 4.0. Please share comments, suggestions, and errors at http://github.com/wzchen/probability_cheatsheet.

Last Updated February 28, 2015

Counting

Multiplication Rule - Let's say we have a compound experiment (an experiment with multiple components). If the 1st component has $n_1$ possible outcomes, the 2nd component has $n_2$ possible outcomes, and the $r$th component has $n_r$ possible outcomes, then overall there are $n_1 n_2 \dots n_r$ possibilities for the whole experiment.

Sampling Table - The sampling table describes the different ways to take a sample of size $k$ out of a population of size $n$. The column names denote whether order matters or not.

                       Order Matters           Order Does Not Matter
With Replacement       $n^k$                   $\binom{n+k-1}{k}$
Without Replacement    $\frac{n!}{(n-k)!}$     $\binom{n}{k}$

Naïve Definition of Probability - If the likelihood of each outcome is equal, the probability of any event happening is:
$$P(\text{Event}) = \frac{\text{number of favorable outcomes}}{\text{number of outcomes}}$$

Probability and Thinking Conditionally

Independence

Independent Events - A and B are independent if knowing one gives you no information about the other. A and B are independent if and only if one of the following equivalent statements holds:
$$P(A \cap B) = P(A)P(B)$$
$$P(A|B) = P(A)$$

Conditional Independence - A and B are conditionally independent given C if: $P(A \cap B|C) = P(A|C)P(B|C)$. Conditional independence does not imply independence, and independence does not imply conditional independence.

Unions, Intersections, and Complements

De Morgan's Laws - Gives a useful relation that can make calculating probabilities of unions easier by relating them to intersections, and vice versa. De Morgan's Law says that the complement is distributive as long as you flip the sign in the middle.
$$(A \cup B)^c \equiv A^c \cap B^c$$
$$(A \cap B)^c \equiv A^c \cup B^c$$

Joint, Marginal, and Conditional Probabilities

Joint Probability - $P(A \cap B)$ or $P(A, B)$ - Probability of A and B.
Marginal (Unconditional) Probability - $P(A)$ - Probability of A.
Conditional Probability - $P(A|B)$ - Probability of A given B occurred.
Conditional Probability is Probability - $P(A|B)$ is a probability as well, restricting the sample space to B instead of $\Omega$. Any theorem that holds for probability also holds for conditional probability.

Simpson's Paradox

$$P(A \mid B, C) < P(A \mid B^c, C) \text{ and } P(A \mid B, C^c) < P(A \mid B^c, C^c)$$
$$\text{yet still, } P(A \mid B) > P(A \mid B^c)$$

Bayes' Rule and Law of Total Probability

Law of Total Probability with partitioning set $B_1, B_2, B_3, \dots, B_n$ and with extra conditioning (just add C!)
$$P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \dots + P(A|B_n)P(B_n)$$
$$P(A) = P(A \cap B_1) + P(A \cap B_2) + \dots + P(A \cap B_n)$$
$$P(A|C) = P(A|B_1, C)P(B_1|C) + \dots + P(A|B_n, C)P(B_n|C)$$
$$P(A|C) = P(A \cap B_1|C) + P(A \cap B_2|C) + \dots + P(A \cap B_n|C)$$

Law of Total Probability with $B$ and $B^c$ (special case of a partitioning set), and with extra conditioning (just add C!)
$$P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)$$
$$P(A) = P(A \cap B) + P(A \cap B^c)$$
$$P(A|C) = P(A|B, C)P(B|C) + P(A|B^c, C)P(B^c|C)$$
$$P(A|C) = P(A \cap B|C) + P(A \cap B^c|C)$$

Bayes' Rule, and with extra conditioning (just add C!)
$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}$$
$$P(A|B, C) = \frac{P(A \cap B|C)}{P(B|C)} = \frac{P(B|A, C)P(A|C)}{P(B|C)}$$

Odds Form of Bayes' Rule, and with extra conditioning (just add C!)
$$\frac{P(A|B)}{P(A^c|B)} = \frac{P(B|A)}{P(B|A^c)} \cdot \frac{P(A)}{P(A^c)}$$
$$\frac{P(A|B, C)}{P(A^c|B, C)} = \frac{P(B|A, C)}{P(B|A^c, C)} \cdot \frac{P(A|C)}{P(A^c|C)}$$

Random Variables and their Distributions

PMF, CDF, and Independence

Probability Mass Function (PMF) (Discrete Only) gives the probability that a random variable takes on the value $x$.
$$P_X(x) = P(X = x)$$

Cumulative Distribution Function (CDF) gives the probability that a random variable takes on the value $x$ or less.
$$F_X(x_0) = P(X \le x_0)$$

Independence - Intuitively, two random variables are independent if knowing one gives you no information about the other. X and Y are independent if for ALL values of $x$ and $y$:
$$P(X = x, Y = y) = P(X = x)P(Y = y)$$

Expected Value and Indicators

Distributions

Probability Mass Function (PMF) (Discrete Only) is a function that takes in the value $x$, and gives the probability that a random variable takes on the value $x$. The PMF is a nonnegative function, and $\sum_x P(X = x) = 1$.
$$P_X(x) = P(X = x)$$

Cumulative Distribution Function (CDF) is a function that takes in the value $x$, and gives the probability that a random variable takes on the value at most $x$.
$$F(x) = P(X \le x)$$

Expected Value, Linearity, and Symmetry

Expected Value (aka mean, expectation, or average) can be thought of as the "weighted average" of the possible outcomes of our random variable. Mathematically, if $x_1, x_2, x_3, \dots$ are all of the possible values that X can take, the expected value of X can be calculated as follows:
$$E(X) = \sum_i x_i P(X = x_i)$$

Note that for any random variables X and Y, scaling coefficients $a$ and $b$, and constant $c$, the following property of Linearity of Expectation holds:
$$E(aX + bY + c) = aE(X) + bE(Y) + c$$

If two Random Variables have the same distribution, even when they are dependent, by the property of Symmetry their expected values are equal.

Conditional Expected Value is calculated like expectation, only conditioned on any event A.
$$E(X|A) = \sum_x x P(X = x|A)$$

Indicator Random Variables

An Indicator Random Variable is a random variable that takes on either 1 or 0. The indicator is always an indicator of some event. If the event occurs, the indicator is 1, otherwise it is 0. Indicators are useful for many problems that involve counting and expected value.

Distribution - $I_A \sim \text{Bern}(p)$ where $p = P(A)$

Fundamental Bridge - The expectation of an indicator for A is the probability of the event: $E(I_A) = P(A)$. Notation:
$$I_A = \begin{cases} 1 & A \text{ occurs} \\ 0 & A \text{ does not occur} \end{cases}$$

Poisson, Continuous RVs, LotUS, UoU

Continuous Random Variables

What's the prob that a CRV is in an interval? Use the CDF (or the PDF, see below). To find the probability that a CRV takes on a value in the interval $[a, b]$, subtract the respective CDFs.
$$P(a \le X \le b) = P(X \le b) - P(X \le a) = F(b) - F(a)$$

Note that for an r.v. with a normal distribution $N(\mu, \sigma^2)$,
$$P(a \le X \le b) = P(X \le b) - P(X \le a) = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right)$$

What is the Cumulative Distribution Function (CDF)? It is the following function of $x$.
$$F(x) = P(X \le x)$$

What is the Probability Density Function (PDF)? The PDF, $f(x)$, is the derivative of the CDF.
$$F'(x) = f(x)$$
Or alternatively,
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
Note that by the fundamental theorem of calculus,
$$F(b) - F(a) = \int_a^b f(x)\,dx$$
Thus to find the probability that a CRV takes on a value in an interval, you can integrate the PDF, thus finding the area under the density curve.

How do I find the expected value of a CRV? Where in discrete cases you sum over the probabilities, in continuous cases you integrate over the densities.
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$
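As a rough numerical check of the continuous-RV rules in this section, the sketch below computes an interval probability for a normal random variable two ways: as a difference of CDFs, $\Phi((b-\mu)/\sigma) - \Phi((a-\mu)/\sigma)$, and as the area under the PDF. The function names and the example values ($\mu = 10$, $\sigma = 2$, interval $[8, 12]$) are illustrative assumptions, not part of the cheatsheet.

```python
import math


def phi(z: float) -> float:
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def interval_prob_cdf(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a <= X <= b) = F(b) - F(a); standardize with sigma, not sigma^2."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)


def interval_prob_pdf(a: float, b: float, mu: float, sigma: float,
                      steps: int = 10_000) -> float:
    """The same probability as the area under the PDF (midpoint rule)."""
    width = (b - a) / steps
    return width * sum(
        normal_pdf(a + (i + 0.5) * width, mu, sigma) for i in range(steps)
    )


# Hypothetical example: X ~ N(10, 2^2), interval [8, 12] (one sd each side).
p_cdf = interval_prob_cdf(8.0, 12.0, mu=10.0, sigma=2.0)
p_pdf = interval_prob_pdf(8.0, 12.0, mu=10.0, sigma=2.0)
print(round(p_cdf, 4), round(p_pdf, 4))  # both are about 0.6827
```

The two numbers agree because $F(b) - F(a) = \int_a^b f(x)\,dx$; the midpoint rule is just one simple way to approximate the integral.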
Law of the Unconscious Statistician (LotUS)

Expected Value of Function of RV - Normally, you would find the expected value of X this way:
$$E(X) = \sum_x x P(X = x)$$
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

LotUS states that you can find the expected value of a function of a random variable $g(X)$ this way:
$$E(g(X)) = \sum_x g(x) P(X = x)$$
$$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$

What's a function of a random variable? A function of a random variable is also a random variable. For example, if X is the number of bikes you see in an hour, then $g(X) = 2X$ could be the number of bike wheels you see in an hour. Both are random variables.

What's the point? You don't need to know the PDF/PMF of $g(X)$ to find its expected value. All you need is the PDF/PMF of X.

Variance, Expectation and Independence, and $e^x$ Taylor Series

$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$$
$$\text{Var}(X) = E(X^2) - [E(X)]^2$$
If X and Y are independent, then
$$E(XY) = E(X)E(Y)$$

Universality of Uniform

When you plug any continuous random variable into its own CDF, you get a Uniform[0,1] random variable. When you put a Uniform[0,1] into an inverse CDF, you get the corresponding random variable. For example, let's say that a random variable X has a CDF
$$F(x) = 1 - e^{-x}$$
By the Universality of the Uniform, if we plug X into this function then we get a uniformly distributed random variable.
$$F(X) = 1 - e^{-X} \sim U$$
Similarly, since $F(X) \sim U$ then $X \sim F^{-1}(U)$. The key point is that for any continuous random variable X, we can transform it into a uniform random variable and back by using its CDF.

Exponential Distribution and MGFs

Can I Have a Moment?

Moment - Moments describe the shape of a distribution. The first three moments are related to the Mean, Variance, and Skewness of a distribution. The $k$th moment of a random variable X is
$$\mu'_k = E(X^k)$$

What's a moment? Note that
Mean: $\mu'_1 = E(X)$
Variance: $\mu'_2 = E(X^2) = \text{Var}(X) + (\mu'_1)^2$
Mean, Variance, and other moments (Skewness) can be expressed in terms of the moments of a random variable!

Moment Generating Functions

MGF - For any random variable X, the function
$$M_X(t) = E(e^{tX})$$
of the dummy variable $t$ is the moment generating function (MGF) of X, if it exists on some open interval centered around 0. Note that the MGF is just a function of a dummy variable $t$.

Why is it called the Moment Generating Function? Because the $k$th derivative of the moment generating function evaluated at 0 is the $k$th moment of X!
$$\mu'_k = E(X^k) = M_X^{(k)}(0)$$
This is true by Taylor Expansion of $e^{tX}$:
$$M_X(t) = E(e^{tX}) = \sum_{k=0}^{\infty} \frac{E(X^k) t^k}{k!} = \sum_{k=0}^{\infty} \frac{\mu'_k t^k}{k!}$$
Or by differentiation under the integral sign and then plugging in $t = 0$:
$$M_X^{(k)}(t) = \frac{d^k}{dt^k} E(e^{tX}) = E\left(\frac{d^k}{dt^k} e^{tX}\right) = E(X^k e^{tX})$$
$$M_X^{(k)}(0) = E(X^k e^{0X}) = E(X^k) = \mu'_k$$

MGF of linear combinations - If we have $Y = aX + c$, then
$$M_Y(t) = E(e^{t(aX + c)}) = e^{ct} E(e^{(at)X}) = e^{ct} M_X(at)$$

Uniqueness of the MGF - If it exists, the MGF uniquely defines the distribution. This means that for any two random variables X and Y, they are distributed the same (their CDFs/PDFs are equal) if and only if their MGFs are equal. You can't have different PDFs when you have two random variables that have the same MGF.

Summing Independent R.V.s by Multiplying MGFs - If X and Y are independent, then
$$M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX})E(e^{tY}) = M_X(t) \cdot M_Y(t)$$
The MGF of the sum of two independent random variables is the product of the MGFs of those two random variables.

Joint PDFs and CDFs

Joint Distributions

Review: Joint Probability of events A and B: $P(A \cap B)$

Both the Joint PMF and Joint PDF must be non-negative and sum/integrate to 1: $\sum_x \sum_y P(X = x, Y = y) = 1$ and $\int_x \int_y f_{X,Y}(x, y)\,dy\,dx = 1$. Like in the univariate case, you sum/integrate the PMF/PDF to get the CDF.

Conditional Distributions

Review: By Bayes' Rule, $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$. Similar conditions apply to conditional distributions of random variables.

For discrete random variables:
$$P(Y = y|X = x) = \frac{P(X = x, Y = y)}{P(X = x)} = \frac{P(X = x|Y = y)P(Y = y)}{P(X = x)}$$
For continuous random variables:
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{f_{X|Y}(x|y) f_Y(y)}{f_X(x)}$$
Hybrid Bayes' Rule
$$f(x|A) = \frac{P(A|X = x) f(x)}{P(A)}$$

Marginal Distributions

Review: The Law of Total Probability says that for an event A and partition $B_1, B_2, \dots, B_n$: $P(A) = \sum_i P(A \cap B_i)$

To find the distribution of one (or more) random variables from a joint distribution, sum or integrate over the irrelevant random variables.

Getting the Marginal PMF from the Joint PMF
$$P(X = x) = \sum_y P(X = x, Y = y)$$
Getting the Marginal PDF from the Joint PDF
$$f_X(x) = \int_y f_{X,Y}(x, y)\,dy$$

Independence of Random Variables

Review: A and B are independent if and only if either $P(A \cap B) = P(A)P(B)$ or $P(A|B) = P(A)$.

Similar conditions apply to determine whether random variables are independent - two random variables are independent if their joint distribution function is simply the product of their marginal distributions, or if the conditional distribution of one given the other is the same as its marginal distribution.

In words, random variables X and Y are independent if and only if, for all $x, y$, one of the following holds:
• Joint PMF/PDF/CDFs are the product of the marginal PMF/PDF/CDFs
• Conditional distribution of X given Y is the same as the marginal distribution of X

Multivariate LotUS

Review: $E(g(X)) = \sum_x g(x)P(X = x)$, or $E(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$

For discrete random variables:
$$E(g(X, Y)) = \sum_x \sum_y g(x, y) P(X = x, Y = y)$$
For continuous random variables:
$$E(g(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{X,Y}(x, y)\,dx\,dy$$

Covariance and Transformations

Covariance and Correlation

Covariance is the two-random-variable equivalent of Variance, defined by the following:
$$\text{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y)$$
Note that
$$\text{Cov}(X, X) = E(XX) - E(X)E(X) = \text{Var}(X)$$

Correlation is a rescaled variant of Covariance that is always between -1 and 1.
$$\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

Covariance and Independence - If two random variables are independent, then they are uncorrelated. The converse is not necessarily true.
$$X \perp\!\!\!\perp Y \longrightarrow \text{Cov}(X, Y) = 0$$
$$X \perp\!\!\!\perp Y \longrightarrow E(XY) = E(X)E(Y)$$

Covariance and Variance - Note that
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$
$$\text{Var}(X_1 + X_2 + \dots + X_n) = \sum_{i=1}^{n} \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j)$$
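The covariance identities of this section can be checked exactly on a small joint PMF. The sketch below uses a made-up joint PMF of two dependent 0/1 random variables (an assumption for illustration only) and the discrete multivariate LotUS to confirm that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$.

```python
import math

# Hypothetical joint PMF of two dependent 0/1 random variables (X, Y).
joint_pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def expect(g) -> float:
    """Multivariate LotUS: E(g(X, Y)) = sum of g(x, y) P(X=x, Y=y)."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())


def var(g) -> float:
    """Var(g(X, Y)) = E(g^2) - (E(g))^2."""
    return expect(lambda x, y: g(x, y) ** 2) - expect(g) ** 2


# Cov(X, Y) = E(XY) - E(X)E(Y)
cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
var_sum = var(lambda x, y: x + y)          # left-hand side of the identity
rhs = var_x + var_y + 2 * cov_xy           # right-hand side of the identity
corr_xy = cov_xy / math.sqrt(var_x * var_y)

print(cov_xy, var_sum, rhs, corr_xy)
```

Here X and Y are positively dependent, so the covariance term is positive and the variance of the sum exceeds the sum of the variances.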
In particular, if X and Y are independent then they have covariance 0, thus
$$X \perp\!\!\!\perp Y \Longrightarrow \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$
In particular, if $X_1, X_2, \dots, X_n$ are identically distributed and have the same covariance relationships, then
$$\text{Var}(X_1 + X_2 + \dots + X_n) = n\text{Var}(X_1) + 2\binom{n}{2}\text{Cov}(X_1, X_2)$$

Covariance and Linearity - For random variables W, X, Y, Z and constants $a, b$:
$$\text{Cov}(X, Y) = \text{Cov}(Y, X)$$
$$\text{Cov}(X + a, Y + b) = \text{Cov}(X, Y)$$
$$\text{Cov}(aX, bY) = ab\,\text{Cov}(X, Y)$$
$$\text{Cov}(W + X, Y + Z) = \text{Cov}(W, Y) + \text{Cov}(W, Z) + \text{Cov}(X, Y) + \text{Cov}(X, Z)$$

Covariance and Invariance - Correlation, Covariance, and Variance are addition-invariant, which means that adding a constant to the term(s) does not change the value. Let $b$ and $c$ be constants.
$$\text{Var}(X + c) = \text{Var}(X)$$
$$\text{Cov}(X + b, Y + c) = \text{Cov}(X, Y)$$
$$\text{Corr}(X + b, Y + c) = \text{Corr}(X, Y)$$
In addition to addition-invariance, Correlation is scale-invariant, which means that multiplying the terms by any positive constant does not affect the value. Covariance and Variance are not scale-invariant.
$$\text{Corr}(2X, 3Y) = \text{Corr}(X, Y)$$

Continuous Transformations

Why do we need the Jacobian? We need the Jacobian to rescale our PDF so that it integrates to 1.

One Variable Transformations - Let's say that we have a random variable X with PDF $f_X(x)$, but we are also interested in some function of X. We call this function $Y = g(X)$. Note that Y is a random variable as well. If $g$ is differentiable and one-to-one (every value of X gets mapped to a unique value of Y), then the following is true:
$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right| \qquad f_Y(y)\left|\frac{dy}{dx}\right| = f_X(x)$$
To find $f_Y(y)$ as a function of $y$, plug in $x = g^{-1}(y)$.
$$f_Y(y) = f_X(g^{-1}(y))\left|\frac{d}{dy} g^{-1}(y)\right|$$
The derivative of the inverse transformation is referred to as the Jacobian, denoted as $J$.
$$J = \frac{d}{dy} g^{-1}(y)$$

Convolutions

Definition - If you want to find the PDF of a sum of two independent random variables, you take the convolution of their individual distributions.
$$f_{X+Y}(t) = \int_{-\infty}^{\infty} f_X(x) f_Y(t - x)\,dx$$
Example - Let $X, Y \sim$ i.i.d. $N(0, 1)$. Treat $t$ as a constant. Integrate as usual.
$$f_{X+Y}(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \frac{1}{\sqrt{2\pi}} e^{-(t-x)^2/2}\,dx$$

Poisson Process

Definition - We have a Poisson Process if we have
1. Arrivals at various times with an average of $\lambda$ per unit time.
2. The number of arrivals in a time interval of length $t$ is $\text{Pois}(\lambda t)$.
3. Number of arrivals in disjoint time intervals are independent.

Count-Time Duality - We wish to find the distribution of $T_1$, the first arrival time. We see that the event $T_1 > t$, the event that you have to wait more than $t$ to get the first email, is the same as the event $N_t = 0$, which is the event that the number of emails in the first time interval of length $t$ is 0. We can solve for the distribution of $T_1$.
$$P(T_1 > t) = P(N_t = 0) = e^{-\lambda t} \longrightarrow P(T_1 \le t) = 1 - e^{-\lambda t}$$
Thus we have $T_1 \sim \text{Expo}(\lambda)$. And similarly, the interarrival times between arrivals are all $\text{Expo}(\lambda)$ (e.g. $T_i - T_{i-1} \sim \text{Expo}(\lambda)$).

Beta, Gamma, Order Statistics

Order Statistics

Definition - Let's say you have $n$ i.i.d. random variables $X_1, X_2, X_3, \dots, X_n$. If you arrange them from smallest to largest, the $i$th element in that list is the $i$th order statistic, denoted $X_{(i)}$. $X_{(1)}$ is the smallest out of the set of random variables, and $X_{(n)}$ is the largest.

Properties - The order statistics are dependent random variables. The smallest value in a set of random variables will always vary and itself has a distribution. For any value of $X_{(i)}$, $X_{(i+1)} \ge X_{(i)}$.

Distribution - Taking $n$ i.i.d. random variables $X_1, X_2, X_3, \dots, X_n$ with CDF $F(x)$ and PDF $f(x)$, the CDF and PDF of $X_{(i)}$ are as follows:
$$F_{X_{(i)}}(x) = P(X_{(i)} \le x) = \sum_{k=i}^{n} \binom{n}{k} F(x)^k (1 - F(x))^{n-k}$$
$$f_{X_{(i)}}(x) = n\binom{n-1}{i-1} F(x)^{i-1} (1 - F(x))^{n-i} f(x)$$

Universality of the Uniform - We can also express the distribution of the order statistics of $n$ i.i.d. random variables $X_1, X_2, X_3, \dots, X_n$ in terms of the order statistics of $n$ uniforms. We have that
$$F(X_{(j)}) \sim U_{(j)}$$

Notable Uses of the Beta Distribution

. . . as the Order Statistics of the Uniform - The smallest of three Uniforms is distributed $U_{(1)} \sim \text{Beta}(1, 3)$. The middle of three Uniforms is distributed $U_{(2)} \sim \text{Beta}(2, 2)$, and the largest $U_{(3)} \sim \text{Beta}(3, 1)$. The distribution of the $j$th order statistic of $n$ i.i.d. Uniforms is:
$$U_{(j)} \sim \text{Beta}(j, n - j + 1)$$
$$f_{U_{(j)}}(u) = \frac{n!}{(j-1)!(n-j)!} u^{j-1} (1 - u)^{n-j}$$

. . . as the Conjugate Prior of the Binomial - A prior is the distribution of a parameter before you observe any data ($f(x)$). A posterior is the distribution of a parameter after you observe data $y$ ($f(x|y)$). Beta is the conjugate prior of the Binomial because if you have a Beta-distributed prior on $p$ (the parameter of the Binomial), then the posterior distribution on $p$ given observed data is also Beta-distributed. This means that in a two-level model:
$$X|p \sim \text{Bin}(n, p)$$
$$p \sim \text{Beta}(a, b)$$
Then after observing the value $X = x$, we get a posterior distribution $p|(X = x) \sim \text{Beta}(a + x, b + n - x)$.

Bank and Post Office Result

Let us say that we have $X \sim \text{Gamma}(a, \lambda)$ and $Y \sim \text{Gamma}(b, \lambda)$, and that $X \perp\!\!\!\perp Y$. By the Bank-Post Office result, we have that:
$$X + Y \sim \text{Gamma}(a + b, \lambda)$$
$$\frac{X}{X + Y} \sim \text{Beta}(a, b)$$
$$X + Y \perp\!\!\!\perp \frac{X}{X + Y}$$

Special Cases of Beta and Gamma
$$\text{Gamma}(1, \lambda) \sim \text{Expo}(\lambda) \qquad \text{Beta}(1, 1) \sim \text{Unif}(0, 1)$$

Conditional Expectation and Variance

Conditional Expectation

Conditioning on an Event - We can find the expected value of Y given that event A or $X = x$ has occurred. This would be finding the values of $E(Y|A)$ and $E(Y|X = x)$. Note that conditioning on an event results in a number. Note the similarities between regularly finding expectation and finding the conditional expectation. The expected value of a dice roll given that it is prime is $\frac{1}{3} \cdot 2 + \frac{1}{3} \cdot 3 + \frac{1}{3} \cdot 5 = 3\frac{1}{3}$. The expected amount of time that you have to wait until the shuttle comes (assuming that the waiting time is $\sim \text{Expo}(\frac{1}{10})$) given that you have already waited $n$ minutes, is 10 more minutes by the memoryless property.

Discrete Y                                        Continuous Y
$E(Y) = \sum_y y P(Y = y)$                        $E(Y) = \int_{-\infty}^{\infty} y f_Y(y)\,dy$
$E(Y|X = x) = \sum_y y P(Y = y|X = x)$            $E(Y|X = x) = \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\,dy$
$E(Y|A) = \sum_y y P(Y = y|A)$                    $E(Y|A) = \int_{-\infty}^{\infty} y f(y|A)\,dy$

Conditioning on a Random Variable - We can also find the expected value of Y given the random variable X. The resulting expectation, $E(Y|X)$, is not a number but a function of the random variable X. For an easy way to find $E(Y|X)$, find $E(Y|X = x)$ and then plug in X for all $x$. This changes the conditional expectation of Y from a function of a number $x$ to a function of the random variable X.

Properties of Conditioning on Random Variables
1. $E(Y|X) = E(Y)$ if $X \perp\!\!\!\perp Y$
2. $E(h(X)|X) = h(X)$ (taking out what's known). $E(h(X)W|X) = h(X)E(W|X)$
3. $E(E(Y|X)) = E(Y)$ (Adam's Law, aka Law of Iterated Expectation or Law of Total Expectation)

Law of Total Expectation (also Adam's law) - For any set of events that partition the sample space, $A_1, A_2, \dots, A_n$ or just simply $A, A^c$, the following holds:
$$E(Y) = E(Y|A)P(A) + E(Y|A^c)P(A^c)$$
$$E(Y) = E(Y|A_1)P(A_1) + \dots + E(Y|A_n)P(A_n)$$

Conditional Variance

Eve's Law (aka Law of Total Variance)
$$\text{Var}(Y) = E(\text{Var}(Y|X)) + \text{Var}(E(Y|X))$$

MVN, LLN, CLT

Law of Large Numbers (LLN)

Let $X_1, X_2, X_3, \dots$ be i.i.d. We define $\bar{X}_n = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$. The Law of Large Numbers states that as $n \longrightarrow \infty$, $\bar{X}_n \longrightarrow E(X)$.
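The Law of Large Numbers can be seen in a quick simulation. The sketch below tracks the running sample mean $\bar{X}_n$ of i.i.d. fair die rolls, whose expected value is 3.5. The function name, the seed, and the number of rolls are arbitrary choices made for illustration.

```python
import random


def running_sample_means(n: int, seed: int = 0) -> list:
    """Return [X-bar_1, ..., X-bar_n] for n i.i.d. fair die rolls."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0
    means = []
    for k in range(1, n + 1):
        total += rng.randint(1, 6)   # one fair die roll, values 1..6
        means.append(total / k)      # sample mean of the first k rolls
    return means


means = running_sample_means(200_000)
for k in (10, 1_000, 200_000):
    # The sample mean should settle near E(X) = 3.5 as k grows.
    print(k, means[k - 1])
```

With few rolls the sample mean can be far from 3.5; with many rolls it stays close, which is the behavior the LLN describes.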
Central Limit Theorem (CLT)

Approximation using CLT

We use $\dot{\sim}$ to denote "is approximately distributed". We can use the central limit theorem when we have a random variable Y that is a sum of $n$ i.i.d. random variables with $n$ large. Let us say that $E(Y) = \mu_Y$ and $\text{Var}(Y) = \sigma_Y^2$. We have that:
$$Y \;\dot{\sim}\; N(\mu_Y, \sigma_Y^2)$$
When we use the central limit theorem to estimate Y, we usually have $Y = X_1 + X_2 + \dots + X_n$ or $Y = \bar{X}_n = \frac{1}{n}(X_1 + X_2 + \dots + X_n)$. Specifically, if we say that each of the i.i.d. $X_i$ has mean $\mu_X$ and variance $\sigma_X^2$, then we have the following approximations.
$$X_1 + X_2 + \dots + X_n \;\dot{\sim}\; N(n\mu_X, n\sigma_X^2)$$
$$\bar{X}_n = \frac{1}{n}(X_1 + X_2 + \dots + X_n) \;\dot{\sim}\; N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$$

Asymptotic Distributions using CLT

We use $\xrightarrow{d}$ to denote "converges in distribution to" as $n \longrightarrow \infty$. These are the same results as the previous section, only letting $n \longrightarrow \infty$ and not letting our normal distribution have any $n$ terms.
$$\frac{1}{\sigma_X\sqrt{n}}(X_1 + \dots + X_n - n\mu_X) \xrightarrow{d} N(0, 1)$$
$$\frac{\bar{X}_n - \mu_X}{\sigma_X/\sqrt{n}} \xrightarrow{d} N(0, 1)$$

Markov Chains

Definition

A Markov Chain is a walk along a (finite or infinite, but for this class usually finite) discrete state space $\{1, 2, \dots, M\}$. We let $X_t$ denote which element of the state space the walk is on at time $t$. The Markov Chain is the set of random variables denoting where the walk is at all points in time, $\{X_0, X_1, X_2, \dots\}$, as long as, if you want to predict where the chain is at a future time, you only need to use the present state, and not any past information. In other words, given the present, the future and past are conditionally independent. Formal Definition:
$$P(X_{n+1} = j|X_0 = i_0, X_1 = i_1, \dots, X_n = i) = P(X_{n+1} = j|X_n = i)$$

State Properties

A state is either recurrent or transient.
• If you start at a Recurrent State, then you will always return back to that state at some point in the future. ♪You can check-out any time you like, but you can never leave. ♪
• Otherwise you are at a Transient State. There is some probability that once you leave you will never return. ♪You don't have to go home, but you can't stay here. ♪

A state is either periodic or aperiodic.
• If you start at a Periodic State of period $k$, then the GCD of all of the possible numbers of steps it would take to return back is $> 1$.
• Otherwise you are at an Aperiodic State. The GCD of all of the possible numbers of steps it would take to return back is 1.

Transition Matrix

Element $q_{ij}$ in square transition matrix $Q$ is the probability that the chain goes from state $i$ to state $j$, or more formally:
$$q_{ij} = P(X_{n+1} = j|X_n = i)$$
To find the probability that the chain goes from state $i$ to state $j$ in $m$ steps, take the $(i, j)$th element of $Q^m$.
$$q_{ij}^{(m)} = P(X_{n+m} = j|X_n = i)$$
If $X_0$ is distributed according to row-vector PMF $\vec{p}$ (e.g. $p_j = P(X_0 = i_j)$), then the PMF of $X_n$ is $\vec{p}Q^n$.

Chain Properties

A chain is irreducible if you can get from anywhere to anywhere. An irreducible chain must have all of its states recurrent. A chain is periodic if any of its states are periodic, and is aperiodic if none of its states are periodic. In an irreducible chain, all states have the same period.

A chain is reversible with respect to $\vec{s}$ if $s_i q_{ij} = s_j q_{ji}$ for all $i, j$. A reversible chain running on $\vec{s}$ is indistinguishable whether it is running forwards in time or backwards in time. Examples of reversible chains include random walks on undirected networks, or any chain with $q_{ij} = q_{ji}$, where the Markov chain would be stationary with respect to $\vec{s} = (\frac{1}{M}, \frac{1}{M}, \dots, \frac{1}{M})$.

Reversibility Condition Implies Stationarity - If you have a PMF $\vec{s}$ on a Markov chain with transition matrix $Q$, then $s_i q_{ij} = s_j q_{ji}$ for all $i, j$ implies that $\vec{s}$ is stationary.

Stationary Distribution

Let us say that the vector $\vec{p} = (p_1, p_2, \dots, p_M)$ is a possible and valid PMF of where the Markov Chain is at a certain time. We will call this vector the stationary distribution, $\vec{s}$, if it satisfies $\vec{s}Q = \vec{s}$. As a consequence, if $X_t$ has the stationary distribution, then all future $X_{t+1}, X_{t+2}, \dots$ also have the stationary distribution.

For irreducible, aperiodic chains, the stationary distribution exists, is unique, and $s_i$ is the long-run probability of a chain being at state $i$. The expected number of steps to return back to $i$ starting from $i$ is $1/s_i$. To solve for the stationary distribution, you can solve $(Q' - I)\vec{s}\,' = 0$, where $'$ denotes the transpose. The stationary distribution is uniform if the columns of $Q$ sum to 1.

Random Walk on Undirected Network

If you have a certain number of nodes with edges between them, and a chain can pick any edge randomly and move to another node, then this is a random walk on an undirected network. The stationary distribution of this chain is proportional to the degree sequence. The degree sequence is the vector of the degrees of each node, where the degree of a node is how many edges it has.

Continuous Distributions

Uniform

Let us say that U is distributed $\text{Unif}(a, b)$. We know the following:

Properties of the Uniform - For a uniform distribution, the probability of a draw from any interval on the uniform is proportional to the length of the interval. The PDF of a Uniform is just a constant, so when you integrate over the PDF, you will get an area proportional to the length of the interval.

Example - William throws darts really badly, so his darts are uniform over the whole room because they're equally likely to appear anywhere. William's darts have a uniform distribution on the surface of the room. The uniform is the only distribution where the probability of hitting in any specific region is proportional to the area/length/volume of that region, and where the density of occurrence in any one specific spot is constant throughout the whole support.

Normal

Let us say that X is distributed $N(\mu, \sigma^2)$. We know the following:

Central Limit Theorem - The Normal distribution is ubiquitous because of the central limit theorem, which states that averages of independent identically-distributed variables will approach a normal distribution regardless of the initial distribution.

Transformable - Every time we stretch or scale the normal distribution, we change it to another normal distribution. If we add $c$ to a normally distributed random variable, then its mean increases additively by $c$. If we multiply a normally distributed random variable by $c$, then its variance increases multiplicatively by $c^2$. Note that for every normally distributed random variable $X \sim N(\mu, \sigma^2)$, we can transform it to the standard $N(0, 1)$ by the following transformation:
$$\frac{X - \mu}{\sigma} \sim N(0, 1)$$

Example - Heights are normal. Measurement error is normal. By the central limit theorem, the sampling average from a population is also normal.

Standard Normal - The Standard Normal, denoted Z, is $Z \sim N(0, 1)$

CDF - It's too difficult to write this one out, so we express it as the function $\Phi(x)$

Exponential Distribution

Let us say that X is distributed $\text{Expo}(\lambda)$. We know the following:

Story - You're sitting on an open meadow right before the break of dawn, wishing that airplanes in the night sky were shooting stars, because you could really use a wish right now. You know that shooting stars come on average every 15 minutes, but it's never true that a shooting star is ever "due" to come because you've waited so long. Your waiting time is memoryless, which means that the time until the next shooting star comes does not depend on how long you've waited already.

Example - The waiting time until the next shooting star is distributed $\text{Expo}(4)$. The 4 here is $\lambda$, or the rate parameter, or how many shooting stars we expect to see in a unit of time. The expected time until the next shooting star is $\frac{1}{\lambda}$, or $\frac{1}{4}$ of an hour. You can expect to wait 15 minutes until the next shooting star.

Expos are rescaled Expos
$$Y \sim \text{Expo}(\lambda) \rightarrow X = \lambda Y \sim \text{Expo}(1)$$

Memorylessness - The Exponential Distribution is the sole continuous memoryless distribution. This means that it's always "as good as new", which means that the probability of it failing in the next infinitesimal time period is the same as any infinitesimal time period. This means that for an exponentially distributed X and any nonnegative real numbers $t$ and $s$,
$$P(X > s + t|X > s) = P(X > t)$$
Given that you've waited already at least $s$ minutes, the probability of having to wait an additional $t$ minutes is the same as the probability that you have to wait more than $t$ minutes to begin with. Here's another formulation.
$$X - a|X > a \sim \text{Expo}(\lambda)$$

Example - If waiting for the bus is distributed exponentially with $\lambda = 6$, no matter how long you've waited so far, the expected additional waiting time until the bus arrives is always $\frac{1}{6}$ of an hour, or 10 minutes. The distribution of time from now to the arrival is always the same, no matter how long you've waited.

Min of Expos - If we have independent $X_i \sim \text{Expo}(\lambda_i)$, then $\min(X_1, \dots, X_k) \sim \text{Expo}(\lambda_1 + \lambda_2 + \dots + \lambda_k)$.

Max of Expos - If we have i.i.d. $X_i \sim \text{Expo}(\lambda)$, then $\max(X_1, \dots, X_k) \sim \text{Expo}(k\lambda) + \text{Expo}((k-1)\lambda) + \dots + \text{Expo}(\lambda)$

Gamma Distribution

Let us say that X is distributed $\text{Gamma}(a, \lambda)$. We know the following:

Story - You sit waiting for shooting stars, and you know that the waiting time for a star is distributed $\text{Expo}(\lambda)$. You want to see "a" shooting stars before you go home. X is the total waiting time for the $a$th shooting star.

Example - You are at a bank, and there are 3 people ahead of you. The serving time for each person is distributed Exponentially with mean of 2 time units. The distribution of your waiting time until you begin service is $\text{Gamma}(3, \frac{1}{2})$
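The bank example can be explored with a short simulation. The sketch below adds up three independent Exponential service times (mean 2, so rate $\lambda = \frac{1}{2}$) many times over and compares the sample mean and variance with the $\text{Gamma}(3, \frac{1}{2})$ values $a/\lambda = 6$ and $a/\lambda^2 = 12$. The function name, seed, and number of trials are illustrative assumptions, and the independence of the service times is assumed as in the Gamma story.

```python
import random
import statistics


def simulate_bank_wait(num_people: int = 3, mean_service: float = 2.0,
                       trials: int = 100_000, seed: int = 0) -> list:
    """Simulate total waiting times as sums of independent Expo service times."""
    rng = random.Random(seed)      # fixed seed so the run is reproducible
    rate = 1.0 / mean_service      # lambda = 1 / mean for an Exponential
    return [
        sum(rng.expovariate(rate) for _ in range(num_people))
        for _ in range(trials)
    ]


waits = simulate_bank_wait()
sample_mean = statistics.fmean(waits)
sample_var = statistics.pvariance(waits)

# Gamma(a, lambda) has mean a / lambda and variance a / lambda^2.
print("sample mean:", sample_mean, "vs. a/lambda =", 3 / 0.5)
print("sample variance:", sample_var, "vs. a/lambda^2 =", 3 / 0.5 ** 2)
```

The simulated values should land close to 6 and 12, which is consistent with the total waiting time behaving like a $\text{Gamma}(3, \frac{1}{2})$ random variable.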
χ² Distribution
Let us say that X is distributed $\chi^2_n$. We know the following:

Story - A Chi-Squared(n) is a sum of n independent squared standard normals.

Example - The sum of squared errors is distributed $\chi^2_n$.

Properties and Representations
\[ E(\chi^2_n) = n, \quad Var(\chi^2_n) = 2n, \quad \chi^2_n \sim Gamma\left(\frac{n}{2}, \frac{1}{2}\right) \]
\[ \chi^2_n = Z_1^2 + Z_2^2 + \dots + Z_n^2, \quad Z_i \overset{i.i.d.}{\sim} N(0, 1) \]

Discrete Distributions

DWR = Draw w/ replacement, DWoR = Draw w/o replacement

| | DWR | DWoR |
|---|---|---|
| Fixed # trials (n) | Binom/Bern (Bern if n = 1) | HGeom |
| Draw 'til k success | NBin/Geom (Geom if k = 1) | NHGeom (see example probs) |

Bernoulli - The Bernoulli distribution is the simplest case of the Binomial distribution, where we only have one trial, or n = 1. Let us say that X is distributed Bern(p). We know the following:

Story - X “succeeds” (is 1) with probability p, and X “fails” (is 0) with probability 1 − p.

Example - A fair coin flip is distributed Bern(1/2).

Binomial - Let us say that X is distributed Bin(n, p). We know the following:

Story - X is the number of “successes” that we will achieve in n independent trials, where each trial can be either a success or a failure, each with the same probability p of success. We can also say that X is a sum of multiple independent Bern(p) random variables. Let X ∼ Bin(n, p) and $X_j$ ∼ Bern(p), where all of the Bernoullis are independent. We can express the following:
\[ X = X_1 + X_2 + X_3 + \dots + X_n \]

Example - If Jeremy Lin shoots 10 free throws and each one independently has a 3/4 chance of getting in, then the number of free throws he makes is distributed Bin(10, 3/4), or, letting X be the number of free throws that he makes, X is a Binomial Random Variable distributed Bin(10, 3/4).

Binomial Coefficient - $\binom{n}{k}$ is a function of n and k and is read n choose k, and means out of n possible distinguishable objects, how many ways can I possibly choose k of them? The formula for the binomial coefficient is:
\[ \binom{n}{k} = \frac{n!}{k!(n - k)!} \]

Geometric - Let us say that X is distributed Geom(p). We know the following:

Story - X is the number of “failures” that we will achieve before we achieve our first success. Our successes have probability p.

Example - If each pokeball we throw has a 1/10 probability to catch Mew, the number of failed pokeballs will be distributed Geom(1/10).

First Success - Equivalent to the geometric distribution, except it counts the total number of “draws” until the first success. This is 1 more than the number of failures. If X ∼ FS(p) then E(X) = 1/p.

Negative Binomial - Let us say that X is distributed NBin(r, p). We know the following:

Story - X is the number of “failures” that we will achieve before we achieve our rth success. Our successes have probability p.

Example - Thundershock has 60% accuracy and can faint a wild Raticate in 3 hits. The number of misses before Pikachu faints Raticate with Thundershock is distributed NBin(3, .6).

Hypergeometric - Let us say that X is distributed HGeom(w, b, n). We know the following:

Story - In a population of b undesired objects and w desired objects, X is the number of “successes” we will have in a draw of n objects, without replacement.

Example - 1) Let's say that we have only b Weedles (failure) and w Pikachus (success) in Viridian Forest. We encounter n Pokemon in the forest, and X is the number of Pikachus in our encounters. 2) The number of aces that you draw in 5 cards (without replacement). 3) You have w white balls and b black balls, and you draw n balls. You will draw X white balls. 4) Elk Problem - You have N elk, you capture n of them, tag them, and release them. Then you recollect a new sample of size m. How many tagged elk are now in the new sample?

PMF - The probability mass function of a Hypergeometric:
\[ P(X = k) = \frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}} \]

Poisson - Let us say that X is distributed Pois(λ). We know the following:

Story - There are rare events (low probability events) that occur many different ways (high possibilities of occurrences) at an average rate of λ occurrences per unit space or time. The number of events that occur in that unit of space or time is X.

Example - A certain busy intersection has an average of 2 accidents per month. Since an accident is a low probability event that can happen many different ways, the number of accidents in a month at that intersection is distributed Pois(2). The number of accidents that happen in two months at that intersection is distributed Pois(4).

Multivariate Distributions

Multinomial - Let us say that the vector $\vec{X} = (X_1, X_2, X_3, \dots, X_k) \sim Mult_k(n, \vec{p})$ where $\vec{p} = (p_1, p_2, \dots, p_k)$.

Story - We have n items, which can fall into any one of the k buckets independently with the probabilities $\vec{p} = (p_1, p_2, \dots, p_k)$.

Example - Let us assume that every year, 100 students in the Harry Potter Universe are randomly and independently sorted into one of four houses with equal probability. The number of people in each of the houses is distributed $Mult_4(100, \vec{p})$, where $\vec{p} = (.25, .25, .25, .25)$. Note that $X_1 + X_2 + X_3 + X_4 = 100$, and they are dependent.

Multinomial Coefficient - The number of permutations of n objects where you have $n_1, n_2, n_3, \dots, n_k$ of each of the different variants is the multinomial coefficient.
\[ \binom{n}{n_1 n_2 \dots n_k} = \frac{n!}{n_1! n_2! \dots n_k!} \]

Joint PMF - For $n = n_1 + n_2 + \dots + n_k$
\[ P(\vec{X} = \vec{n}) = \binom{n}{n_1 n_2 \dots n_k} p_1^{n_1} p_2^{n_2} \dots p_k^{n_k} \]

Lumping - If you lump together multiple categories in a multinomial, then it is still multinomial. A multinomial with two dimensions (success, failure) is a binomial distribution.

Variances and Covariances - For $(X_1, X_2, \dots, X_k) \sim Mult_k(n, (p_1, p_2, \dots, p_k))$, we have that marginally $X_i \sim Bin(n, p_i)$ and hence $Var(X_i) = np_i(1 - p_i)$. Also, for $i \neq j$, $Cov(X_i, X_j) = -np_ip_j$, which is a result from class.

Marginal PMF and Lumping
\[ X_i \sim Bin(n, p_i) \]
\[ X_i + X_j \sim Bin(n, p_i + p_j) \]
\[ X_1, X_2, X_3 \sim Mult_3(n, (p_1, p_2, p_3)) \rightarrow X_1, X_2 + X_3 \sim Mult_2(n, (p_1, p_2 + p_3)) \]
\[ X_1, \dots, X_{k-1} \mid X_k = n_k \sim Mult_{k-1}\left(n - n_k, \left(\frac{p_1}{1 - p_k}, \dots, \frac{p_{k-1}}{1 - p_k}\right)\right) \]

Multivariate Uniform - See the univariate uniform for stories and examples. For multivariate uniforms, all you need to know is that probability is proportional to volume. More formally, probability is the volume of the region of interest divided by the total volume of the support. Every point in the support has equal density of value 1/(Total Area).

Multivariate Normal (MVN) - A vector $\vec{X} = (X_1, X_2, X_3, \dots, X_k)$ is declared Multivariate Normal if every linear combination is normally distributed (e.g. $t_1X_1 + t_2X_2 + \dots + t_kX_k$ is Normal for any constants $t_1, t_2, \dots, t_k$). The parameters of the Multivariate normal are the mean vector $\vec{\mu} = (\mu_1, \mu_2, \dots, \mu_k)$ and the covariance matrix where the (i, j)th entry is $Cov(X_i, X_j)$. For any MVN distribution: 1) Any sub-vector is also MVN. 2) If any two elements of a multivariate normal distribution are uncorrelated, then they are independent. Note that 2) does not apply to most random variables.

Distribution Properties

Important CDFs
Exponential: $F(x) = 1 - e^{-\lambda x}$, x ∈ (0, ∞)
Uniform(0, 1): $F(x) = x$, x ∈ (0, 1)

Poisson Properties (Chicken and Egg Results)
We have $X \sim Pois(\lambda_1)$ and $Y \sim Pois(\lambda_2)$ and X ⊥⊥ Y.
1. $X + Y \sim Pois(\lambda_1 + \lambda_2)$
2. $X \mid (X + Y = k) \sim Bin\left(k, \frac{\lambda_1}{\lambda_1 + \lambda_2}\right)$
3. If we have that Z ∼ Pois(λ), and we randomly and independently “accept” every item in Z with probability p, then the number of accepted items $Z_1 \sim Pois(\lambda p)$, and the number of rejected items $Z_2 \sim Pois(\lambda q)$ with q = 1 − p, and $Z_1$ ⊥⊥ $Z_2$.

Convolutions of Random Variables
A convolution of n random variables is simply their sum.
1. $X \sim Pois(\lambda_1)$, $Y \sim Pois(\lambda_2)$, X ⊥⊥ Y −→ $X + Y \sim Pois(\lambda_1 + \lambda_2)$
2. $X \sim Bin(n_1, p)$, $Y \sim Bin(n_2, p)$, X ⊥⊥ Y −→ $X + Y \sim Bin(n_1 + n_2, p)$. Note that Binomial can thus be thought of as a sum of iid Bernoullis.
3. $X \sim Gamma(n_1, \lambda)$, $Y \sim Gamma(n_2, \lambda)$, X ⊥⊥ Y −→ $X + Y \sim Gamma(n_1 + n_2, \lambda)$. Note that Gamma can thus be thought of as a sum of iid Expos.
4. $X \sim NBin(r_1, p)$, $Y \sim NBin(r_2, p)$, X ⊥⊥ Y −→ $X + Y \sim NBin(r_1 + r_2, p)$
5. All of the above are approximately normal when λ, n, r are large by the Central Limit Theorem.
6. $Z_1 \sim N(\mu_1, \sigma_1^2)$, $Z_2 \sim N(\mu_2, \sigma_2^2)$, $Z_1$ ⊥⊥ $Z_2$ −→ $Z_1 + Z_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
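The Binomial convolution result can be verified exactly for small parameters. The sketch below is an illustration, not something from the original sheet: it convolves the PMFs of two independent Binomials with the same p and compares the result with the Bin(n1 + n2, p) PMF. The helper names `binom_pmf` and `convolve` and the parameter values are arbitrary choices.

```python
from math import comb

def binom_pmf(n, p):
    """PMF of Bin(n, p) as a list indexed by k = 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(pmf_x, pmf_y):
    """PMF of X + Y for independent X and Y supported on 0, 1, 2, ..."""
    out = [0.0] * (len(pmf_x) + len(pmf_y) - 1)
    for i, px in enumerate(pmf_x):
        for j, py in enumerate(pmf_y):
            out[i + j] += px * py  # P(X = i) P(Y = j) contributes to X + Y = i + j
    return out

n1, n2, p = 4, 6, 0.3
pmf_sum = convolve(binom_pmf(n1, p), binom_pmf(n2, p))
pmf_direct = binom_pmf(n1 + n2, p)

# Largest pointwise difference between the two PMFs (floating-point noise only)
max_gap = max(abs(a - b) for a, b in zip(pmf_sum, pmf_direct))
print(max_gap)
```

The same `convolve` helper works for any pair of independent discrete random variables on the nonnegative integers once their PMFs are truncated to finite lists.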
Special Cases of Random Variables
1. Bin(1, p) ∼ Bern(p)
2. Beta(1, 1) ∼ Unif(0, 1)
3. Gamma(1, λ) ∼ Expo(λ)
4. $\chi^2_n \sim Gamma\left(\frac{n}{2}, \frac{1}{2}\right)$
5. NBin(1, p) ∼ Geom(p)

Reasoning by Representation
1. X ∼ Gamma(a, λ), Y ∼ Gamma(b, λ), X ⊥⊥ Y −→ $\frac{X}{X+Y} \sim Beta(a, b)$
2. Bin(n, p) → Pois(λ) as n → ∞, p → 0, np = λ.
3. $U_{(j)} \sim Beta(j, n - j + 1)$
4. For any continuous X with CDF F(x), F(X) ∼ U

Formulas
In general, remember that PDFs integrated (and PMFs summed) over support equal 1.

Geometric Series
\[ a + ar + ar^2 + \dots + ar^{n-1} = \sum_{k=0}^{n-1} ar^k = a\,\frac{1 - r^n}{1 - r} \]

Exponential Function ($e^x$)
\[ e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots = \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n \]

Gamma and Beta Distributions
You can often solve integrals with the following
\[ \int_0^{\infty} x^{t-1} e^{-x}\,dx = \Gamma(t) \qquad \int_0^1 x^{a-1}(1 - x)^{b-1}\,dx = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} \]
Where Γ(n) = (n − 1)! if n is a positive integer

Bayes' Billiards (special case of Beta)
\[ \int_0^1 x^k (1 - x)^{n-k}\,dx = \frac{1}{(n + 1)\binom{n}{k}} \]

Euler's Approximation for Harmonic Sums
\[ 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} \approx \log n + 0.57721\ldots \]

Stirling's Approximation
\[ n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n \]

Miscellaneous Definitions
Medians - A continuous random variable X has median m if P(X ≤ m) = 50%. A discrete random variable X has median m if P(X ≤ m) ≥ 50% and P(X ≥ m) ≥ 50%.
Log - Statisticians generally use log to refer to ln.
i.i.d random variables - Independent, identically-distributed random variables.

Classic Problems

Birthday Matches
In a group of n people, what is the expected number of distinct birthdays (month and day)? What is the expected number of birthday matches? Answer - Let X be the number of distinct birthdays, and let $I_j$ be the indicator for whether the jth day is represented.
\[ E(I_j) = 1 - P(\text{no one born day } j) = 1 - (364/365)^n \]
By linearity, $E(X) = 365\left(1 - (364/365)^n\right)$. Now let Y be the number of birthday matches and let $J_i$ be the indicator that the ith pair of people have the same birthday. The probability that any two specific people share a birthday is 1/365 so $E(Y) = \binom{n}{2}/365$.

Coupon Collector
There are n total coupons, and each draw, you get a random coupon. What is the expected number of coupons needed until you have a complete set? Answer - Let N be the number of coupons needed; we want E(N). Let $N = N_1 + \dots + N_n$, where $N_1$ is the number of draws needed to get our first distinct coupon, $N_2$ is the additional draws needed to draw our second distinct coupon and so on. By the story of First Success, $N_2 \sim FS((n - 1)/n)$ (after collecting the first coupon type, there's an (n − 1)/n chance you'll get something new). Similarly, $N_3 \sim FS((n - 2)/n)$, and $N_j \sim FS((n - j + 1)/n)$. By linearity,
\[ E(N) = E(N_1) + \dots + E(N_n) = \frac{n}{n} + \frac{n}{n-1} + \dots + \frac{n}{1} = n\sum_{j=1}^{n}\frac{1}{j} \]
Which is approximately n log(n) by Euler's approximation for harmonic sums.

Example Problems
Contributions from Sebastian Chiu

First Step Conditioning
In every time period, Bobo the amoeba can die, live, or split into two amoebas with probabilities 0.25, 0.25, and 0.5, respectively. All of Bobo's offspring have the same probabilities. Find P(D), the probability that Bobo's lineage eventually dies out. Answer - We use the law of total probability, and define the events $B_0$, $B_1$, and $B_2$ where $B_i$ means that Bobo has split into i amoebas. We note that $P(D|B_0) = 1$ since his lineage has died, $P(D|B_1) = P(D)$, and $P(D|B_2) = P(D)^2$ since both lines of his lineage must die out in order for Bobo's lineage to die out.
\[ P(D) = 0.25P(D|B_0) + 0.25P(D|B_1) + 0.5P(D|B_2) = 0.25 + 0.25P(D) + 0.5P(D)^2 \]
Solving the quadratic equation, we get that P(D) = 0.5 or 1. We dismiss 1 as an extraneous solution since the expected number of Bobos increases every generation. Thus our answer is P(D) = 0.5.

Calculating Probability
A textbook has n typos, which are randomly scattered amongst its n pages. You pick a random page, what is the probability that it has no typos? Answer - There is a $1 - \frac{1}{n}$ probability that any specific typo isn't on your page, and thus a $\left(1 - \frac{1}{n}\right)^n$ probability that there are no typos on your page. For n large, this is approximately $e^{-1} = 1/e$ by a definition of $e^x$.

Orderings of i.i.d. random variables
I call 2 UberX's and 3 Lyfts at the same time. If the time it takes for the rides to reach me is i.i.d., what is the probability that all the Lyfts will arrive first? Answer - since the arrival times of the five cars are i.i.d., all 5! orderings of the arrivals are equally likely. There are 3!2! orderings that involve the Lyfts arriving first, so the probability that the Lyfts arrive first is $\frac{3!2!}{5!} = 1/10$. Alternatively, there are $\binom{5}{3}$ ways to choose 3 of the 5 slots for the Lyfts to occupy, where each of the choices is equally likely. Exactly 1 of those choices has all 3 of the Lyfts arriving first, thus the probability is $1/\binom{5}{3} = 1/10$.

Expectation of Negative Hypergeometric
What is the expected number of cards that you draw before you pick your first Ace in a shuffled deck? Answer - Consider a non-Ace. Denote this to be card j. Let $I_j$ be the indicator that card j will be drawn before the first Ace. Note that if j is before all 4 of the Aces in the deck, then $I_j = 1$. The probability that this occurs is 1/5, because out of 5 cards (the 4 Aces and the non-Ace), the probability that the non-Ace comes first is 1/5. 1/5 here is the probability that any specific non-Ace will appear before all of the Aces in the deck (e.g. the probability that the Jack of Spades appears before all of the Aces). Thus let X be the number of cards that are drawn before the first Ace. Then $X = I_1 + I_2 + \dots + I_{48}$, where each indicator corresponds to one of the 48 non-Aces. Thus,
\[ E(X) = E(I_1) + E(I_2) + \dots + E(I_{48}) = 48/5 = 9.6 \]

Minimum and Maximum of Random Variables
What is the CDF of the maximum of n independent Uniformly-distributed random variables? Answer - Note that
\[ P(\min(X_1, X_2, \dots, X_n) \geq a) = P(X_1 \geq a, X_2 \geq a, \dots, X_n \geq a) \]
Similarly,
\[ P(\max(X_1, X_2, \dots, X_n) \leq a) = P(X_1 \leq a, X_2 \leq a, \dots, X_n \leq a) \]
We will use that principle to find the CDF of $U_{(n)}$, where $U_{(n)} = \max(U_1, U_2, \dots, U_n)$ where $U_i \sim Unif(0, 1)$ (iid).
\[ P(\max(U_1, U_2, \dots, U_n) \leq a) = P(U_1 \leq a, U_2 \leq a, \dots, U_n \leq a) = P(U_1 \leq a)P(U_2 \leq a) \dots P(U_n \leq a) = a^n \]

Pattern Matching with $e^x$ Taylor Series
For X ∼ Pois(λ), find $E\left(\frac{1}{X+1}\right)$. Answer - By LOTUS,
\[ E\left(\frac{1}{X+1}\right) = \sum_{k=0}^{\infty} \frac{1}{k+1}\,\frac{e^{-\lambda}\lambda^k}{k!} = \frac{e^{-\lambda}}{\lambda}\sum_{k=0}^{\infty}\frac{\lambda^{k+1}}{(k+1)!} = \frac{e^{-\lambda}}{\lambda}(e^{\lambda} - 1) \]

Adam and Eve's Laws
William really likes speedsolving Rubik's Cubes. But he's pretty bad at it, so sometimes he fails. On any given day, William will attempt N ∼ Geom(s) Rubik's Cubes. Suppose each time, he has an independent probability p of solving the cube. Let T be the number of Rubik's Cubes he solves during a day. Find the mean and variance of T. Answer - Note that T|N ∼ Bin(N, p). As a result, we have by Adam's Law that
\[ E(T) = E(E(T|N)) = E(Np) = \frac{p(1 - s)}{s} \]
Similarly, by Eve's Law, we have that
\[ Var(T) = E(Var(T|N)) + Var(E(T|N)) = E(Np(1 - p)) + Var(Np) = \frac{p(1 - p)(1 - s)}{s} + \frac{p^2(1 - s)}{s^2} = \frac{p(1 - s)(p + s(1 - p))}{s^2} \]
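The Adam's Law and Eve's Law answers for the Rubik's Cube problem can be checked numerically. The sketch below is an added illustration under stated assumptions: it conditions on N, sums over a truncated Geom(s) support (failures before the first success, as in this cheatsheet), and compares against the closed forms. The values s = 0.2 and p = 0.6 and the cutoff `n_max` are arbitrary.

```python
def geom_pmf(n, s):
    """P(N = n) for N ~ Geom(s), counting failures before the first success."""
    return s * (1 - s) ** n

def moments_of_T(s, p, n_max=2000):
    """Mean and variance of T, where T | N ~ Bin(N, p), via a truncated sum over N."""
    mean = 0.0
    second_moment = 0.0
    for n in range(n_max + 1):
        w = geom_pmf(n, s)
        cond_mean = n * p            # E(T | N = n)
        cond_var = n * p * (1 - p)   # Var(T | N = n)
        mean += w * cond_mean
        second_moment += w * (cond_var + cond_mean ** 2)  # E(T^2 | N = n)
    return mean, second_moment - mean ** 2

s, p = 0.2, 0.6
mean_T, var_T = moments_of_T(s, p)

# Closed forms from Adam's Law and Eve's Law
mean_formula = p * (1 - s) / s
var_formula = p * (1 - s) * (p + s * (1 - p)) / s ** 2
print(mean_T, mean_formula)
print(var_T, var_formula)
```

With these values both approaches give a mean of 2.4 and a variance of 8.16, up to floating-point error.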
MGF - Distribution Matching
(Referring to the Rubik's Cube question above) Find the MGF of T. What is the name of this distribution and its parameter(s)? Answer - By Adam's Law, we have that
\[ E(e^{tT}) = E(E(e^{tT}|N)) = E((pe^t + q)^N) = s\sum_{n=0}^{\infty}(pe^t + 1 - p)^n(1 - s)^n = \frac{s}{1 - (1 - s)(pe^t + 1 - p)} = \frac{s}{s + (1 - s)p - (1 - s)pe^t} \]
Intuitively, we would expect that T is distributed Geometrically because T is just a filtered version of N, which itself is Geometrically distributed. The MGF of a Geometric random variable X ∼ Geom(θ) is
\[ E(e^{tX}) = \frac{\theta}{1 - (1 - \theta)e^t} \]
So, we would want to try to get our MGF into this form to identify what θ is. Taking our original MGF, it would appear that dividing the numerator and denominator by s + (1 − s)p would allow us to do this. Therefore, we have that
\[ E(e^{tT}) = \frac{s}{s + (1 - s)p - (1 - s)pe^t} = \frac{\frac{s}{s + (1 - s)p}}{1 - \frac{(1 - s)p}{s + (1 - s)p}e^t} \]
By pattern-matching, it thus follows that T ∼ Geom(θ) where
\[ \theta = \frac{s}{s + (1 - s)p} \]

MGF - Finding Moments
Find $E(X^3)$ for X ∼ Expo(λ) using the MGF of X. Answer - The MGF of an Expo(λ) is $M(t) = \frac{\lambda}{\lambda - t}$. To get the third moment, we can take the third derivative of the MGF and evaluate at t = 0:
\[ E(X^3) = \frac{6}{\lambda^3} \]
But a much nicer way to use the MGF here is via pattern recognition: note that M(t) looks like it came from a geometric series:
\[ \frac{1}{1 - \frac{t}{\lambda}} = \sum_{n=0}^{\infty}\left(\frac{t}{\lambda}\right)^n = \sum_{n=0}^{\infty}\frac{n!}{\lambda^n}\frac{t^n}{n!} \]
The coefficient of $\frac{t^n}{n!}$ here is the nth moment of X, so we have $E(X^n) = \frac{n!}{\lambda^n}$ for all nonnegative integers n. So again we get the same answer.

Markov Chains
Suppose $X_n$ is a two-state Markov chain with transition matrix (rows and columns indexed by states 0 and 1)
\[ Q = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix} \]
Find the stationary distribution $\vec{s} = (s_0, s_1)$ of $X_n$ by solving $\vec{s}Q = \vec{s}$, and show that the chain is reversible under this stationary distribution. Answer - By solving $\vec{s}Q = \vec{s}$, we have that
\[ s_0 = s_0(1 - \alpha) + s_1\beta \quad \text{and} \quad s_1 = s_0\alpha + s_1(1 - \beta) \]
And by solving this system of linear equations it follows that
\[ \vec{s} = \left(\frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta}\right) \]
To show that this chain is reversible under this stationary distribution, we must show $s_iq_{ij} = s_jq_{ji}$ for all i, j. This is done if we can show $s_0q_{01} = s_1q_{10}$. Indeed,
\[ s_0q_{01} = \frac{\alpha\beta}{\alpha + \beta} = s_1q_{10} \]
thus our chain is reversible under the stationary distribution.

Markov Chains, continued
William and Sebastian play a modified game of Settlers of Catan, where every turn they randomly move the robber (which will start on the center tile) on a game board to one of the adjacent hexagons. (refer to a picture of the game board if confused)
a) Is this Markov Chain irreducible? Is it aperiodic? Answer - Yes to both. The Markov Chain is irreducible because it can get from anywhere to anywhere else. The Markov Chain is also aperiodic because the robber can return to a hexagon in 2, 3, 4, 5, . . . moves. Those numbers have a GCD of 1, so the chain is aperiodic.
b) What is the stationary distribution of this Markov Chain? Answer - Since this is a random walk on an undirected graph, the stationary distribution is proportional to the degree sequence. The degree for the corner pieces is 3, the degree for the edge pieces is 4, and the degree for the center pieces is 6. To normalize this degree sequence, we divide by its sum. The sum of the degrees is 6(3) + 6(4) + 7(6) = 84. Thus the stationary probability of being on a corner is 3/84 = 1/28, on an edge is 4/84 = 1/21, and in the center is 6/84 = 1/14.
c) What fraction of the time will the robber be in the desert in this game? Answer - From above, 1/14.
d) Say the robber starts on the desert. What is the expected number of moves it will take for the robber to return? Answer - Since this chain is irreducible and aperiodic, to get the expected time to return we can just invert the stationary probability. Thus on average it will take 14 turns for the robber to return to the desert.

Problem Solving Strategies
Contributions from Jessy Hwang, Yuan Jiang, Yuqi Hou
1. Getting Started. Start by defining events and/or defining random variables. (“Let A be the event that I pick the fair coin”; “Let X be the number of successes.”) Clear notation = clear thinking! Then decide what it is that you're supposed to be finding, in terms of your notation (“I want to find P(X = 3|A)”). Try simple and extreme cases. To make an abstract experiment more concrete, try drawing a picture or making up numbers that could have happened. Pattern recognition: does the structure of the problem resemble something we've seen before?
2. Calculating Probability of an Event. Use combinatorics if the naive definition of probability applies. Look for symmetries or something to condition on, then apply Bayes' rule or LoTP. Is the probability of the complement easier to find?
3. Finding the distribution of a random variable. Check the support of the random variable: what values can it take on? Use this to rule out distributions that don't fit. - Is there a story for one of the named distributions that fits the problem at hand? - Can you write the random variable as a function of a r.v. with a known distribution, say Y = g(X)? Then work directly from the definition of PDF or PMF, expressing P(Y ≤ y) or P(Y = y) in terms of events involving X only. - For PDFs, find the CDF first and then differentiate. - If you're trying to find the joint distribution of two independent random variables, just multiply their marginal probabilities. - Do you need the distribution? If the question only asks for the expected value of X, you might be able to find this without knowing the entire distribution of X. See the next item.
4. Calculating Expectation. If it has a named distribution, check out the table of distributions. If it's a function of a r.v. with a named distribution, try LotUS. If it's a count of something, try breaking it up into indicator random variables. If you can condition on something, consider using Adam's law. Also consider the variance formula.
5. Calculating Variance. Consider independence, named distributions, and LotUS. If it's a count of something, break it up into a sum of indicator random variables. If you can condition on something, consider using Eve's Law.
6. Calculating $E(X^2)$ - Do you already know E(X) or Var(X)? Remember that $Var(X) = E(X^2) - E(X)^2$.
7. Calculating Covariance. If it's a count of something, break it up into a sum of indicator random variables. If you're trying to calculate the covariance between two components of a multinomial distribution, $X_i$, $X_j$, then the covariance is $-np_ip_j$.
8. If X and Y are i.i.d., have you considered using symmetry?
9. Calculating Probabilities of Orderings of Random Variables. Have you considered looking at order statistics? - Remember any ordering of i.i.d. random variables is equally likely.
10. Is this the birthday problem? Is this a multinomial problem?
11. Determining Independence. Use the definition of independence. Think of extreme cases to see if you can find a counterexample.
12. Does something look like Simpson's Paradox? Make sure you're looking at 3 events.
13. Find the PDF. If the question gives you two r.v., where you know the PDF of one r.v. and the other r.v. is a function of the first one, then the problem wants you to use a transformation of variables (Jacobian). You can also find the PDF by differentiating the CDF.
14. Do a painful integral. If your integral looks painful, see if you can write your integral in terms of a PDF (like Gamma or Beta), so that the integral equals 1.
15. Before moving on. Plug in some simple and extreme cases to make sure that your answer makes sense.

Biohazards
Section author: Jessy Hwang
1. Don't misuse the naive definition of probability - When answering “What is the probability that in a group of 3 people, no two have the same birth month?”, it is not correct to treat the people as indistinguishable balls being placed into 12 boxes, since that assumes the list of birth months {January, January, January} is just as likely as the list {January, April, June}, when the latter is six times more likely.
2. Don't confuse unconditional and conditional probabilities, or go in circles with Bayes' Rule - $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$. It is not correct to say “P(B) = 1 because we know that B happened.”; P(B) is the probability before we have information about whether B happened. It is not correct to use P(A|B) in place of P(A) on the right-hand side.
3. Don't assume independence without justification - In the matching problem, the probability that card 1 is a match and card 2 is a match is not $1/n^2$. - The Binomial and Hypergeometric are often confused; the trials are independent in the Binomial story and not independent in the Hypergeometric story due to the lack of replacement.
4. Don't confuse random variables, numbers, and events. - Let X be a r.v. Then f(X) is a r.v. for any function f. In particular, $X^2$, |X|, F(X), and $I_{X>3}$ are r.v.s. $P(X^2 < X | X \geq 0)$, E(X), Var(X), and f(E(X)) are numbers. X = 2 and F(X) ≥ −1 are events. It does not make sense to write $\int_{-\infty}^{\infty} F(X)\,dx$ because F(X) is a random variable. It does not make sense to write P(X) because X is not an event.
5. A random variable is not the same thing as its distribution - To get the PDF of $X^2$, you can't just square the PDF of X. The right way is to use one variable transformations. - To get the PDF of X + Y, you can't just add the PDF of X and the PDF of Y. The right way is to compute the convolution.
6. E(g(X)) does not equal g(E(X)) in general. - See the St. Petersburg paradox for an extreme example. - The right way to find E(g(X)) is with LotUS.

Recommended Resources
• Introduction to Probability (http://bit.ly/introprobability)
• Stat 110 Online (http://stat110.net)
• Stat 110 Quora Blog (https://stat110.quora.com/)
• Stat 110 Course Notes (mxawng.com/stuff/notes/stat110.pdf)
• Quora Probability FAQ (http://bit.ly/probabilityfaq)
• LaTeX File (github.com/wzchen/probability_cheatsheet)
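The warning in item 1 of the Biohazards section can be checked by brute force. The sketch below is an added illustration: it assumes each of the 12 birth months is equally likely and independent across the 3 people, enumerates all ordered lists of months, and counts how many ordered lists collapse to each unordered list. Months are encoded as 0 to 11 (January = 0, April = 3, June = 5).

```python
from collections import Counter
from itertools import product

MONTHS = range(12)  # 0 = January, ..., 11 = December

# All 12**3 ordered birth-month lists; these are the equally likely outcomes
outcomes = list(product(MONTHS, repeat=3))

# Probability that no two of the 3 people share a birth month
p_distinct = sum(len(set(o)) == 3 for o in outcomes) / len(outcomes)

# Number of ordered outcomes behind each unordered list of months
multiset_counts = Counter(tuple(sorted(o)) for o in outcomes)
all_january = multiset_counts[(0, 0, 0)]   # {January, January, January}
jan_apr_jun = multiset_counts[(0, 3, 5)]   # {January, April, June}

print(p_distinct)
print(all_january, jan_apr_jun)
```

The unordered lists are not equally likely because they correspond to different numbers of ordered outcomes, which is why the naive definition has to be applied to the ordered lists.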
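Distributions

| Distribution | PMF/PDF and Support | EV | Variance | MGF |
|---|---|---|---|---|
| Bernoulli, Bern(p) | $P(X = 1) = p$, $P(X = 0) = q$ | $p$ | $pq$ | $q + pe^t$ |
| Binomial, Bin(n, p) | $P(X = k) = \binom{n}{k}p^k(1 - p)^{n-k}$, $k \in \{0, 1, 2, \dots, n\}$ | $np$ | $npq$ | $(q + pe^t)^n$ |
| Geometric, Geom(p) | $P(X = k) = q^kp$, $k \in \{0, 1, 2, \dots\}$ | $q/p$ | $q/p^2$ | $\frac{p}{1 - qe^t}$, $qe^t < 1$ |
| Negative Binom., NBin(r, p) | $P(X = n) = \binom{r+n-1}{r-1}p^rq^n$, $n \in \{0, 1, 2, \dots\}$ | $rq/p$ | $rq/p^2$ | $\left(\frac{p}{1 - qe^t}\right)^r$, $qe^t < 1$ |
| Hypergeometric, HGeom(w, b, n) | $P(X = k) = \binom{w}{k}\binom{b}{n-k} / \binom{w+b}{n}$, $k \in \{0, 1, 2, \dots, n\}$ | $\mu = \frac{nw}{b+w}$ | $\left(\frac{w+b-n}{w+b-1}\right)n\frac{\mu}{n}\left(1 - \frac{\mu}{n}\right)$ | − |
| Poisson, Pois(λ) | $P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}$, $k \in \{0, 1, 2, \dots\}$ | $\lambda$ | $\lambda$ | $e^{\lambda(e^t - 1)}$ |
| Uniform, Unif(a, b) | $f(x) = \frac{1}{b-a}$, $x \in (a, b)$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{tb} - e^{ta}}{t(b-a)}$ |
| Normal, $N(\mu, \sigma^2)$ | $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$, $x \in (-\infty, \infty)$ | $\mu$ | $\sigma^2$ | $e^{t\mu + \frac{\sigma^2t^2}{2}}$ |
| Exponential, Expo(λ) | $f(x) = \lambda e^{-\lambda x}$, $x \in (0, \infty)$ | $1/\lambda$ | $1/\lambda^2$ | $\frac{\lambda}{\lambda - t}$, $t < \lambda$ |
| Gamma, Gamma(a, λ) | $f(x) = \frac{1}{\Gamma(a)}(\lambda x)^ae^{-\lambda x}\frac{1}{x}$, $x \in (0, \infty)$ | $a/\lambda$ | $a/\lambda^2$ | $\left(\frac{\lambda}{\lambda - t}\right)^a$, $t < \lambda$ |
| Beta, Beta(a, b) | $f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1 - x)^{b-1}$, $x \in (0, 1)$ | $\mu = \frac{a}{a+b}$ | $\frac{\mu(1-\mu)}{a+b+1}$ | − |
| Chi-Squared, $\chi^2_n$ | $f(x) = \frac{1}{2^{n/2}\Gamma(n/2)}x^{n/2-1}e^{-x/2}$, $x \in (0, \infty)$ | $n$ | $2n$ | $(1 - 2t)^{-n/2}$, $t < 1/2$ |
| Multivar Uniform, A is support | $f(x) = \frac{1}{\lvert A \rvert}$, $x \in A$ | − | − | − |
| Multinomial, $Mult_k(n, \vec{p})$ | $P(\vec{X} = \vec{n}) = \binom{n}{n_1 \dots n_k}p_1^{n_1} \dots p_k^{n_k}$, $n = n_1 + n_2 + \dots + n_k$ | $n\vec{p}$ | $Var(X_i) = np_i(1 - p_i)$, $Cov(X_i, X_j) = -np_ip_j$ | $\left(\sum_{i=1}^{k}p_ie^{t_i}\right)^n$ |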
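Any row of the table can be spot-checked by summing against the PMF. The sketch below is an added illustration for the Geometric row only: it truncates the infinite sums at an arbitrary cutoff `k_max` and compares the results with q/p, q/p², and p/(1 − qe^t). The values p = 0.3 and t = 0.1 are arbitrary, chosen so that qe^t < 1.

```python
from math import exp

def geom_pmf(k, p):
    """P(X = k) = q^k p for X ~ Geom(p), k = 0, 1, 2, ..."""
    return (1 - p) ** k * p

def geom_summary(p, t, k_max=2000):
    """Mean, variance, and MGF at t for Geom(p), from truncated sums over the PMF."""
    ks = range(k_max + 1)
    mean = sum(k * geom_pmf(k, p) for k in ks)
    second_moment = sum(k * k * geom_pmf(k, p) for k in ks)
    mgf = sum(exp(t * k) * geom_pmf(k, p) for k in ks)
    return mean, second_moment - mean ** 2, mgf

p, t = 0.3, 0.1
q = 1 - p
mean, var, mgf = geom_summary(p, t)

# Table entries for the Geometric row
table_mean = q / p
table_var = q / p ** 2
table_mgf = p / (1 - q * exp(t))   # valid because q * e^t < 1 here
print(mean, table_mean)
print(var, table_var)
print(mgf, table_mgf)
```

The same pattern applies to the other discrete rows by swapping in the corresponding PMF and support.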

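Inequalities

| Cauchy-Schwarz | Markov | Chebychev | Jensen |
|---|---|---|---|
| $\lvert E(XY) \rvert \leq \sqrt{E(X^2)E(Y^2)}$ | $P(X \geq a) \leq \frac{E\lvert X \rvert}{a}$ | $P(\lvert X - \mu_X \rvert \geq a) \leq \frac{\sigma_X^2}{a^2}$ | g convex: $E(g(X)) \geq g(E(X))$; g concave: $E(g(X)) \leq g(E(X))$ |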
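The four inequalities can be illustrated on a concrete distribution. The sketch below is an added example, not part of the original sheet: it takes X ∼ Bin(10, 0.3), computes both sides of each inequality exactly from the PMF, and uses Y = n − X for Cauchy-Schwarz and g(x) = x² for Jensen. The thresholds 5 and 2.5 are arbitrary positive choices.

```python
from math import comb, sqrt

n, p = 10, 0.3
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

def expect(g):
    """E(g(X)) for X ~ Bin(n, p), computed by LOTUS over the PMF."""
    return sum(g(k) * w for k, w in pmf.items())

mu = expect(lambda k: k)
var = expect(lambda k: (k - mu) ** 2)

# Markov: P(X >= a) <= E|X| / a
a = 5
markov_lhs = sum(w for k, w in pmf.items() if k >= a)
markov_rhs = expect(abs) / a

# Chebychev: P(|X - mu| >= a) <= Var(X) / a^2
cheb_a = 2.5
cheb_lhs = sum(w for k, w in pmf.items() if abs(k - mu) >= cheb_a)
cheb_rhs = var / cheb_a ** 2

# Jensen with the convex function g(x) = x^2: E(X^2) - (E X)^2 >= 0
jensen_gap = expect(lambda k: k ** 2) - mu ** 2

# Cauchy-Schwarz with Y = n - X: |E(XY)| <= sqrt(E(X^2) E(Y^2))
cs_lhs = abs(expect(lambda k: k * (n - k)))
cs_rhs = sqrt(expect(lambda k: k ** 2) * expect(lambda k: (n - k) ** 2))

print(markov_lhs, markov_rhs)
print(cheb_lhs, cheb_rhs)
print(jensen_gap)
print(cs_lhs, cs_rhs)
```

In this example the bounds hold with room to spare, which is typical: these inequalities are general-purpose and often far from tight.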
