Probability 101, The Intuition Behind Martingales and Solving Problems With Them _ Nor's Blog
I think a lot of people don’t get the intuition behind these topics and
why they’re so useful. So I hope the following intuitive introduction
helps people develop a deeper understanding. Note that in the spirit of
clarity, I will only be dealing with finite sets and finite time processes
(unless absolutely necessary, and in such cases, the interpretation
should be obvious). I will do this because people tend to get lost in the
measure-theoretic aspects of the rigor that is needed and skip the
intuition that they should be developing instead. Also, note that
sometimes the explanations will have more material than is strictly
necessary, but the idea is to have a more informed intuition rather than
presenting a terse-yet-complete exposition that leaves people without
any idea about how to play around with the setup.
People already familiar with the initial concepts can skip to the
interesting sections, but I would still recommend reading the
explanations as a whole in case your understanding is a bit rusty. The
concept of “minimal” sets is used for a lot of the mental modelling in
the post, so maybe you’d still want to read the relevant parts of the
blog where it is introduced.
I would like to thank Everule and rivalq for suggesting that I write this
post, and them, meme, and adamant for proofreading and discussing
the content to ensure completeness and clarity.
Table of contents
1. Probability, sigma algebras, and random variables
2. Expected value of a random variable and conditional probability
3. Conditional expectation of a random variable with respect to a sigma
algebra
4. Martingales
5. Stopping times
6. Some math problems
7. Some competitive programming problems
For this setup to have some nice properties and for it to make sense
as some measure of “chance”, we add some more constraints to F
and P :
1. Ω is an event (i.e., Ω ∈ F ).
2. If A ∈ F , then Ω ∖ A ∈ F . That is, if something happening is an
event, then it not happening is also an event.
3. If A ∈ F and B ∈ F , then A ∪ B ∈ F . That is, if X happening is
an event and Y happening is an event, then at least one of X and Y
happening is also an event. Note that this, combined with the
previous constraint, allows us to say that A ∩ B is an event and so on,
due to De Morgan’s laws.
4. P (Ω) = 1, that is, the probability of the whole sample space is 1.
5. For disjoint sets A and B , P (A ∪ B) = P (A) + P (B). This, along
with the previous constraint, is sufficient to derive identities like
P (∅) = 0, P (A) + P (Ω ∖ A) = 1 and so on.
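To make these axioms concrete, here is a small sketch in Python. The 4-element sample space and the 3-block partition are hypothetical choices; the point is just to check the closure and additivity conditions directly.

```python
from itertools import chain, combinations

# Hypothetical finite example: Omega has 4 outcomes and F is generated by
# the partition {{0, 1}, {2}, {3}} (these are the minimal sets).
Omega = frozenset({0, 1, 2, 3})
partition = [frozenset({0, 1}), frozenset({2}), frozenset({3})]

# Every event is a union of some subcollection of the minimal sets.
def events(parts):
    subsets = chain.from_iterable(
        combinations(parts, r) for r in range(len(parts) + 1)
    )
    return {frozenset().union(*blocks) for blocks in subsets}

F = events(partition)

# Axioms 1-3: Omega is an event, and F is closed under complement and union.
assert Omega in F
assert all(Omega - A in F for A in F)
assert all(A | B in F for A in F for B in F)

# Axioms 4-5: P is determined by its values on the minimal sets,
# and is additive across disjoint events.
p_min = {frozenset({0, 1}): 0.5, frozenset({2}): 0.25, frozenset({3}): 0.25}
def P(A):
    return sum(p for part, p in p_min.items() if part <= A)

assert P(Omega) == 1.0
assert P(frozenset({0, 1, 2})) == P(frozenset({0, 1})) + P(frozenset({2}))
```

Note that with 3 minimal sets there are exactly 2³ = 8 events, one per subcollection of the partition.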
Let’s now build some intuition about this definition (the following,
especially the part about minimal sets, applies only to finite sets, but
there are a lot of similarities in the conclusions when we try to
generalize to other sets using measure-theoretic concepts).
Note that since we are dealing with finite sets, we will use an
alternative (but equivalent for finite sets) definition where we replace
the interval (−∞, x] by {x}.
When written like this, it might be a bit intimidating, but the underlying
idea is quite simple. We have a sample space Ω, and we have another
set R. Now we want to map elements of Ω to R in some way so that we
can do some computations over elements of Ω. However, since in a
probability space, F tells us how much information we have (or are
allowed to look at), we shouldn’t be able to distinguish between
elements of Ω that are inside the same minimal event of F . Since there
are finitely many elements in Ω, and elements inside the same minimal
event should be mapped to the same real number, there can be only
finitely many intervals that X partitions R into, and the condition in the
definition is necessary and sufficient (due to closedness of F under
intersection and unions).
As a side-note, note that we can also define functions of a random
variable at this point by simply composing the function X with a
function from R to R. Similarly, if we have two random variables on the
same probability space, we can also define functions of them quite
naturally. After all, we just need to specify their values on the minimal
sets and we’re done.
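A minimal sketch of this view (the two-block partition and the values are hypothetical): a random variable is specified by one value per minimal set, and a function of it is just composition, which stays constant on each minimal set.

```python
# Hypothetical example: Omega = {0, 1, 2, 3}, with minimal sets {0, 1} and
# {2, 3}. An F-measurable random variable is constant on each minimal set.
minimal_sets = [frozenset({0, 1}), frozenset({2, 3})]
X_on_minimal = {frozenset({0, 1}): -1.0, frozenset({2, 3}): 2.0}

def X(omega):
    # Look up the unique minimal set containing omega.
    part = next(m for m in minimal_sets if omega in m)
    return X_on_minimal[part]

# A function of a random variable is composition with a map R -> R; it is
# automatically measurable, being still constant on each minimal set.
def X_squared(omega):
    return X(omega) ** 2

assert X(0) == X(1) == -1.0 and X(2) == X(3) == 2.0
assert X_squared(0) == 1.0 and X_squared(3) == 4.0
```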
Let’s now change gears and think a bit about how we can isolate an
event. Suppose we have an event E . Let’s say that we want to ignore
everything else, and only look at the elements of Ω (and consequently
the minimal events of F that form a partition of E ). Can we make a
probability space out of it? One easy way would be to just inherit the
structure of Ω and F and set the probabilities of everything that uses
elements other than the minimal elements constituting E to be 0, but
we run into trouble if we simply decide to inherit the probability
function: the sum of probabilities over all such minimal events is not 1.
One possible idea is to normalize it by P (E), which is possible iff it is
non-zero. It can easily be checked that this probability space is indeed a
valid probability space.
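Concretely, with a hypothetical fair die and the event E = “the roll is even”, the normalization step looks like this:

```python
# Conditioning on an event E by restriction and renormalization.
# Hypothetical example: a fair die, E = "the roll is even".
P = {w: 1 / 6 for w in range(1, 7)}
E = {2, 4, 6}

PE = sum(P[w] for w in E)   # P(E) must be non-zero for conditioning
assert PE > 0
P_given_E = {w: (P[w] / PE if w in E else 0.0) for w in P}

# The renormalized function is again a probability: it sums to 1.
assert abs(sum(P_given_E.values()) - 1.0) < 1e-12
# P({2} | E) = P({2}) / P(E) = (1/6) / (1/2) = 1/3.
assert abs(P_given_E[2] - 1 / 3) < 1e-12
```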
consider the minimal sets in A. In fact, this is used while solving a lot of
recurrences, like in Bayes’ theorem.
like finding the expected time before we toss two heads in a row with a
fair coin (though the sample space is infinite there, these identities still
hold).
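For the two-heads-in-a-row example just mentioned, the answer (6 tosses on average) can be computed from the one-step conditioning identities. A sketch, where the states track the current run of heads and the 2×2 linear system is solved by fixed-point iteration:

```python
# Expected number of fair-coin tosses until the first "HH".
# E0 = expected remaining tosses with a current run of 0 heads,
# E1 = with a current run of 1 head. Conditioning on the next toss:
#   E0 = 1 + (E1 + E0) / 2,    E1 = 1 + E0 / 2.
# The unique solution is E0 = 6, E1 = 4; the iteration below converges
# to it geometrically.
E0, E1 = 0.0, 0.0
for _ in range(200):
    E0, E1 = 1 + (E1 + E0) / 2, 1 + E0 / 2

assert abs(E0 - 6) < 1e-9 and abs(E1 - 4) < 1e-9
```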
E[X’] = E[X], where the first expectation is taken over the coarser
probability space, and the second is taken on the finer probability
space. In a case of unfortunate naming, X’ is called the conditional
expectation of X wrt the σ -algebra F2 .
Example
1. In a way, we are trying to replace the random variable with its “best
approximation” in some sense (for instance, in a mean-squared-error
sense: among random variables measurable with respect to the coarser
σ -algebra, this one minimizes the expected squared difference from X ).
Also note that the expectation of the random variable does not change
when we apply this coarsening to X .
2. Note that for the trivial σ -algebra F2 = {∅, Ω}, the random variable
E[X∣F2 ] is just the constant E[X].
However, this is the same as saying this: let FY be the σ -algebra whose
minimal sets are the preimages of the distinct values of Y ; then
conditioning on Y just means conditioning on FY .
Just for the sake of completeness, since the above definition captures
what conditional expectation with respect to a σ -algebra does, but
breaks down when you try to generalize it to infinite sample spaces,
here’s the “official definition” that people use (which is quite counter-
intuitive and takes some time to digest):
This essentially says the same thing when you think about the possible
H , but in a more non-constructive way. It turns out that this random
variable is not unique in the strictest sense, but P (Y ≠ Z) = 0 holds,
where Y and Z are two candidates for this random variable. Also, this
definition makes it clear that X − E[X∣H] is orthogonal to the
indicator function of every event in H (so we get a more precise
definition of projecting down).
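On a finite space the whole construction fits in a few lines. A sketch with a hypothetical fair die and F2 generated by parity: the conditional expectation averages X over each minimal set, and coarsening preserves the expectation.

```python
# Conditional expectation w.r.t. a coarser sigma-algebra on a finite space.
# Hypothetical example: a fair die, F2 generated by parity, so the minimal
# sets are the odd faces and the even faces.
P = {w: 1 / 6 for w in range(1, 7)}
X = {w: float(w) for w in range(1, 7)}          # X = face value
minimal_sets = [frozenset({1, 3, 5}), frozenset({2, 4, 6})]

def cond_exp(X, parts):
    # E[X | F2] is constant on each minimal set: the probability-weighted
    # average of X over that set.
    Xp = {}
    for part in parts:
        mass = sum(P[w] for w in part)
        avg = sum(P[w] * X[w] for w in part) / mass
        for w in part:
            Xp[w] = avg
    return Xp

def E(Y):
    return sum(P[w] * Y[w] for w in P)

Xp = cond_exp(X, minimal_sets)
assert abs(Xp[1] - 3.0) < 1e-9          # average of the odd faces
assert abs(Xp[2] - 4.0) < 1e-9          # average of the even faces
# Coarsening does not change the expectation: E[E[X|F2]] = E[X].
assert abs(E(Xp) - E(X)) < 1e-9
```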
Martingales
Let’s say there is a “process” of some sort, where at every step, we get
more information. Let’s say our sample space for the whole experiment
is Ω. Here, Ω is almost always infinite, but the intuition we built carries
forward here, even though the concepts like minimal sets might not.
Let’s take a very simple example: we toss a coin at each step. The
sample space for the whole experiment is the set of all infinite binary
strings. Let’s think about the kind of information that we have available
at various steps in the process. After step 0, all the information we have
(for instance, the complete information about a random variable that
has information about what has happened until now) corresponds to
the trivial σ -algebra {∅, Ω}. After step i, we have information for the
first i coin tosses, so the information corresponds to the σ -algebra with
“minimal” sets being the sets of binary strings which are the same in the
first i positions.
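A finite stand-in for this picture (binary strings of length 4 instead of infinite ones, an assumption made purely to keep the sets finite): the minimal sets of Fi are exactly the prefix classes, and each step refines the partition.

```python
from itertools import product

# Omega = binary strings of length 4; F_i is generated by the first i
# tosses, so its minimal sets are classes of strings sharing a length-i
# prefix.
n = 4
Omega = [''.join(bits) for bits in product('01', repeat=n)]

def minimal_sets(i):
    classes = {}
    for s in Omega:
        classes.setdefault(s[:i], []).append(s)
    return list(classes.values())

# F_0 is the trivial sigma-algebra (one minimal set, all of Omega), and
# each toss doubles the number of minimal sets: 2^i after i tosses.
assert [len(minimal_sets(i)) for i in range(n + 1)] == [1, 2, 4, 8, 16]
```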
Now let’s get to what a process really is. At every step of a process, a
new random variable shows up. But what is the σ -algebra for each of
these random variables? To handle this uniformly, we take a σ -algebra F
that captures all possible information. So we define a stochastic process
as follows:
For a given probability space (Ω, F, P ), and an index set I with a total
order ≤, a stochastic process X is a collection of random variables
{X(i)∣i ∈ I}.
Let {Xn } be a stochastic process on a probability space (Ω, F, P ). Then
its natural filtration is defined by Fn := σ(Xk ∣ k ≤ n), the σ -algebra
generated by the first n random variables. In this language, a martingale
is a stochastic process satisfying E[Xi+1 − Xi ∣ σ(X1 , … , Xi )] = 0 for
every i.
There are many more examples of martingales, but I won’t go too much
in depth about that, and would leave you to refer to some other
resource for more examples.
Some more intuition about martingales: let’s see what happens to the
conditional expectation in the definition of a martingale when we apply
a convex function to a stochastic process (i.e., to all random variables in
the stochastic process). Since probabilities are all non-negative, whenever
we’re taking the conditional expectation, we are taking a convex
combination of the values of the convex function of a random variable.
By Jensen’s inequality, we get that rather than equality, ≥ holds in the
defining equation for martingales. Similarly, if we had a concave
function, ≤ would hold in the defining equation. These types of
processes (where ≥ and ≤ hold instead of =) are called sub-
martingales and super-martingales respectively. Note that not all
stochastic processes are one of sub-/super-martingales. Note that ≥ 0
and ≤ 0 means that the random variable on the LHS is almost surely
non-negative and non-positive, respectively.
An example of a sub-martingale
2. Azuma’s inequality: for a super-martingale X with ∣Xk − Xk−1 ∣ ≤ ck
for all k , we have P (XN − X0 ≥ ε) ≤ exp(−ε² / (2 ∑_{k=1}^{N} c_k²)).
Since the negative of a super-martingale is a sub-martingale, the
corresponding lower-tail bound holds for sub-martingales:
P (XN − X0 ≤ −ε) ≤ exp(−ε² / (2 ∑_{k=1}^{N} c_k²)). Since a martingale
is both a sub-martingale and a super-martingale, P (∣XN − X0 ∣ ≥ ε) ≤
2 exp(−ε² / (2 ∑_{k=1}^{N} c_k²)). There is a stronger version of this
inequality.
3. Doob’s martingale convergence theorem: this is a result that says
that a super-martingale that is bounded from below will converge
(almost surely) to a random variable with finite expectation. Note
that if we bar the case when the variable diverges to infinity in some
sense, the only other possible way for it to not converge is to oscillate.
This theorem roughly says that bounded martingales don’t oscillate.
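A quick Monte Carlo sanity check of the two-sided bound above, for a simple symmetric random walk (so each ck = 1; the values of N, ε, and the trial count are arbitrary choices):

```python
import math
import random

# Two-sided Azuma bound for a walk with increments in {-1, +1} (c_k = 1):
#   P(|X_N - X_0| >= eps) <= 2 * exp(-eps**2 / (2 * N)).
random.seed(0)
N, eps, trials = 100, 25, 20000

hits = sum(
    abs(sum(random.choice((-1, 1)) for _ in range(N))) >= eps
    for _ in range(trials)
)
empirical = hits / trials
bound = 2 * math.exp(-eps ** 2 / (2 * N))

# The empirical tail probability respects the bound, with a lot of room:
# the bound is ~0.088 while the observed frequency is around 0.01.
assert empirical <= bound
```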
Stopping times
Stopping times are quite interesting mathematical objects, and when
combined with martingales, they give a lot of interesting results and are
a very powerful tool.
To that end, let’s consider a random variable τ (that takes values in the
index set, that is the set of non-negative integers here) on a probability
space (Ω, F, P ), and let’s say we have a stochastic process X on the
same probability space with an associated filtration {Fi }. We say that τ
is a stopping time with respect to this filtration if the event {τ ≤ t} is in
Ft for every t; that is, at every time t we have enough information to
tell two different stopping times apart, when not both are > t. In our
coin flip example, this is why you can’t have a stopping time that is 4 for
000010110… and 6 for 000011000…: the two strings agree in the first
four positions, so no event in F4 can separate them.
Another way of formalizing the same thing is to say that the stochastic
process (denoting the stopping of a process) defined by Yt = 0 if τ < t
and 1 otherwise is an adapted process wrt the filtration. That is,
σ(Yt ) ⊆ Ft for all t.
Since the stopping time is an index, the most natural thing at this point
is to think of how to index the stochastic process with some function of
stopping time. If we go back to the definition of a random variable, it
should be clear that intuitively we need to assign a value to it at every
element in Ω. In simple cases, this is trivial, since we can just compute
the value of τ at the minimal set which has this element (let’s say this
value is t), and compute the value of Xt at the minimal set again. This
gives us the random variable Xτ . A closely related construction is the
stopped process Yt := X_{min(τ, t)} , where we just arbitrarily stop a
process once the stopping time has passed. These are very important
types of processes, and have been extensively studied.
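A sketch of a stopped process for the coin-toss walk, with the hypothetical stopping time “first time the walk hits ±2”:

```python
import random

# Stopped process Y_t = X_{min(tau, t)} for a symmetric random walk, with
# the hypothetical stopping time tau = first hitting time of {-2, +2}.
random.seed(1)
T = 20
path = [0]
for _ in range(T):
    path.append(path[-1] + random.choice((-1, 1)))

# tau is determined by the path seen so far, hence a legitimate stopping
# time (default to T if the level is never hit within the horizon).
tau = next((t for t, x in enumerate(path) if abs(x) >= 2), T)

stopped = [path[min(t, tau)] for t in range(T + 1)]

# Before tau the stopped process follows the walk; afterwards it is frozen.
assert stopped[:tau + 1] == path[:tau + 1]
assert all(v == path[tau] for v in stopped[tau:])
```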
The cool part about martingales and stopping times together is that this
setup has been studied a lot, and there are quite a few important
results that I’ll list without proof, which can help us solve a lot of
problems as well as build an understanding around stopping times and
martingales:
1. Doob’s optional stopping theorem: we would like to claim that
E[Xτ ] = E[X0 ], and it turns out that this is true if any of the
following three conditions holds:
The stopping time τ is bounded above.
The martingale is bounded and the stopping time τ is finite almost
surely (i.e., P (τ = ∞) = 0).
The expected value of τ is finite, and the differences ∣Xi+1 − Xi ∣
are uniformly bounded.
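These conditions are exactly what makes the classic gambler’s-ruin computation work: for a symmetric walk started at a and stopped on hitting 0 or b, the stopped walk is a bounded martingale and τ is almost surely finite, so E[Xτ ] = E[X0 ] = a, which forces P(hit b) = a/b. A Monte Carlo sketch (a = 3, b = 10 are arbitrary choices):

```python
import random

# Gambler's ruin: symmetric walk from a, absorbed at 0 or b. Optional
# stopping gives p * b + (1 - p) * 0 = a, i.e. p = P(hit b) = a / b.
random.seed(0)
a, b, trials = 3, 10, 20000

wins = 0
for _ in range(trials):
    x = a
    while 0 < x < b:
        x += random.choice((-1, 1))
    wins += (x == b)

# The empirical hitting frequency should be close to a / b = 0.3.
assert abs(wins / trials - a / b) < 0.02
```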
Solution sketch
Problem 2 (folklore): Suppose there is an infinite grid, with a non-
negative real number in each cell. Suppose that for every cell, the
number in it is the mean of the numbers in its 4 neighbouring cells.
Show that all the numbers must be equal.
Solution sketch
Problem 3 (USA IMO 2018 Winter TST 1, P2): Find all functions
f : Z2 → [0, 1] such that for any integers x and y ,
f (x, y) = (f (x − 1, y) + f (x, y − 1)) / 2
Solution sketch
expected value of S .
Problem 5 (AGMC 2021 P1): In a dance party, initially there are 20 girls
and 22 boys in the pool and infinitely many more girls and boys waiting
outside. In each round, a participant from the pool is picked uniformly at
random; if a girl is picked, she invites a boy from the pool to dance, and
both of them leave the party after the dance; if a boy is picked, he
invites a girl and a boy from the waiting line, and the three of them
dance together and stay in the pool after the dance. The party is over
when there are only (two) boys left in the pool.
Solution sketch
1. 1575F
2. 1479E
3. 1392H
4. 1349D
5. 1025G
6. 850F
7. USACO 2018 — Balance Beam