Probability 101, The Intuition Behind Martingales and Solving Problems With Them _ Nor's Blog
I think a lot of people don’t get the intuition behind these topics and
why they’re so useful. So I hope the following intuitive introduction
helps people develop a deeper understanding. Note that in the spirit of
clarity, I will only be dealing with finite sets and finite time processes
(unless absolutely necessary, and in such cases, the interpretation
should be obvious). I will do this because people tend to get lost in the
measure-theoretic aspects of the rigor that is needed and skip the
intuition that they should be developing instead. Also, note that
sometimes the explanations will have more material than is strictly
necessary, but the idea is to have a more informed intuition rather than
presenting a terse-yet-complete exposition that leaves people without
any idea about how to play around with the setup.
People already familiar with the initial concepts can skip to the
interesting sections, but I would still recommend reading the
explanations as a whole in case your understanding is a bit rusty. The
concept of “minimal” sets is used for a lot of the mental modelling in
the post, so maybe you’d still want to read the relevant parts of the
blog where it is introduced.
I would like to thank Everule and rivalq for suggesting that I write this
post, and them, meme, and adamant for proofreading and discussing
the content to ensure completeness and clarity.
Table of contents
1. Probability, sigma algebras, and random variables
2. Expected value of a random variable and conditional probability
3. Conditional expectation of a random variable with respect to a sigma
algebra
4. Martingales
5. Stopping times
6. Some math problems
7. Some competitive programming problems
For this setup to have some nice properties and for it to make sense
as some measure of “chance”, we add some more constraints to F
and P :
1. Ω is an event (i.e., Ω ∈ F ).
2. If A ∈ F , then Ω ∖ A ∈ F . That is, if something happening is an
event, then it not happening is also an event.
3. If A ∈ F and B ∈ F , then A ∪ B ∈ F . That is, if X happening is
an event and Y happening is an event, then at least one of X and Y
happening is also an event. Note that this, combined with the
previous constraint, allows us to say that A ∩ B is an event and so on,
due to De Morgan’s laws.
4. P (Ω) = 1, that is, the probability of the whole sample space is 1.
5. For disjoint sets A and B , P (A ∪ B) = P (A) + P (B). This, along
with the previous constraint, is sufficient to derive identities like
P (∅) = 0, P (A) + P (Ω ∖ A) = 1 and so on.
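To make these axioms concrete, here is a small sketch in Python. The 4-element sample space and the 3-block partition are hypothetical choices; the point is just to check the closure and additivity conditions directly.

```python
from itertools import chain, combinations

# Hypothetical finite example: Omega has 4 outcomes and F is generated by
# the partition {{0, 1}, {2}, {3}} (these are the minimal sets).
Omega = frozenset({0, 1, 2, 3})
partition = [frozenset({0, 1}), frozenset({2}), frozenset({3})]

# Every event is a union of some subcollection of the minimal sets.
def events(parts):
    subsets = chain.from_iterable(
        combinations(parts, r) for r in range(len(parts) + 1)
    )
    return {frozenset().union(*blocks) for blocks in subsets}

F = events(partition)

# Axioms 1-3: Omega is an event, and F is closed under complement and union.
assert Omega in F
assert all(Omega - A in F for A in F)
assert all(A | B in F for A in F for B in F)

# Axioms 4-5: P is determined by its values on the minimal sets,
# and is additive across disjoint events.
p_min = {frozenset({0, 1}): 0.5, frozenset({2}): 0.25, frozenset({3}): 0.25}
def P(A):
    return sum(p for part, p in p_min.items() if part <= A)

assert P(Omega) == 1.0
assert P(frozenset({0, 1, 2})) == P(frozenset({0, 1})) + P(frozenset({2}))
```

Note that with 3 minimal sets there are exactly 2³ = 8 events, one per subcollection of the partition.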
Let’s now build some intuition about this definition (the following,
especially the part about minimal sets, applies only to finite sets, but
there are a lot of similarities in the conclusions when we try to
generalize to other sets using measure-theoretic concepts).
Note that since we are dealing with finite sets, we will use an
alternative (but equivalent for finite sets) definition where we replace
the interval (−∞, x] by {x}.
When written like this, it might be a bit intimidating, but the underlying
idea is quite simple. We have a sample space Ω, and we have another
set R. Now we want to map elements of Ω to R in some way so that we
can do some computations over elements of Ω. However, since in a
probability space, F tells us how much information we have (or are
allowed to look at), we shouldn’t be able to distinguish between
elements of Ω that are inside the same minimal event of F . Since there
are finitely many elements in Ω, and elements inside the same minimal
event should be mapped to the same real number, there can be only
finitely many intervals that X partitions R into, and the condition in the
definition is necessary and sufficient (due to closedness of F under
intersection and unions).
As a side-note, note that we can also define functions of a random
variable at this point by simply composing the function X with a
function from R to R. Similarly, if we have two random variables on the
same probability space, we can also define functions of them quite
naturally. After all, we just need to specify their values on the minimal
sets and we’re done.
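A minimal sketch of this view (the two-block partition and the values are hypothetical): a random variable is specified by one value per minimal set, and a function of it is just composition, which stays constant on each minimal set.

```python
# Hypothetical example: Omega = {0, 1, 2, 3}, with minimal sets {0, 1} and
# {2, 3}. An F-measurable random variable is constant on each minimal set.
minimal_sets = [frozenset({0, 1}), frozenset({2, 3})]
X_on_minimal = {frozenset({0, 1}): -1.0, frozenset({2, 3}): 2.0}

def X(omega):
    # Look up the unique minimal set containing omega.
    part = next(m for m in minimal_sets if omega in m)
    return X_on_minimal[part]

# A function of a random variable is composition with a map R -> R; it is
# automatically measurable, being still constant on each minimal set.
def X_squared(omega):
    return X(omega) ** 2

assert X(0) == X(1) == -1.0 and X(2) == X(3) == 2.0
assert X_squared(0) == 1.0 and X_squared(3) == 4.0
```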
Let’s now change gears and think a bit about how we can isolate an
event. Suppose we have an event E . Let’s say that we want to ignore
everything else, and only look at the elements of Ω (and consequently
the minimal events of F that form a partition of E ). Can we make a
probability space out of it? One easy way would be to just inherit the
structure of Ω and F and set the probabilities of everything that uses
elements other than the minimal elements constituting E to be 0, but
we run into trouble if we simply decide to inherit the probability
function: the sum of probabilities over all such minimal events is not 1.
One possible idea is to normalize it by P (E), which is possible iff it is
non-zero. It can easily be checked that this probability space is indeed a
valid probability space.
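Concretely, with a hypothetical fair die and the event E = “the roll is even”, the normalization step looks like this:

```python
# Conditioning on an event E by restriction and renormalization.
# Hypothetical example: a fair die, E = "the roll is even".
P = {w: 1 / 6 for w in range(1, 7)}
E = {2, 4, 6}

PE = sum(P[w] for w in E)   # P(E) must be non-zero for conditioning
assert PE > 0
P_given_E = {w: (P[w] / PE if w in E else 0.0) for w in P}

# The renormalized function is again a probability: it sums to 1.
assert abs(sum(P_given_E.values()) - 1.0) < 1e-12
# P({2} | E) = P({2}) / P(E) = (1/6) / (1/2) = 1/3.
assert abs(P_given_E[2] - 1 / 3) < 1e-12
```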
consider the minimal sets in A. In fact, this is used while solving a lot of
recurrences, like in Bayes’ theorem.
like finding the expected time before we toss two heads in a row with a
fair coin (though the sample space is infinite there, these identities still
hold).
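For the two-heads-in-a-row example just mentioned, the answer (6 tosses on average) can be computed from the one-step conditioning identities. A sketch, where the states track the current run of heads and the 2×2 linear system is solved by fixed-point iteration:

```python
# Expected number of fair-coin tosses until the first "HH".
# E0 = expected remaining tosses with a current run of 0 heads,
# E1 = with a current run of 1 head. Conditioning on the next toss:
#   E0 = 1 + (E1 + E0) / 2,    E1 = 1 + E0 / 2.
# The unique solution is E0 = 6, E1 = 4; the iteration below converges
# to it geometrically.
E0, E1 = 0.0, 0.0
for _ in range(200):
    E0, E1 = 1 + (E1 + E0) / 2, 1 + E0 / 2

assert abs(E0 - 6) < 1e-9 and abs(E1 - 4) < 1e-9
```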
E[X’] = E[X], where the first expectation is taken over the coarser
probability space, and the second is taken on the finer probability
space. In a case of unfortunate naming, X’ is called the conditional
expectation of X wrt the σ -algebra F2 .
Example
1. In a way, we are trying to replace the random variable with its “best
approximation” in some sense (for instance, in a mean-squared-error
sense: among random variables measurable with respect to the coarser
σ -algebra, this one minimizes the expected squared difference from X ).
Also note that the expectation of the random variable does not change
when we apply this coarsening to X .
2. Note that for the trivial σ -algebra F2 = {∅, Ω}, the random variable
E[X∣F2 ] is just the constant E[X].
However, this is the same as saying this: let FY be the σ -algebra whose
minimal sets are the preimages of the distinct values of Y ; then
conditioning on Y just means conditioning on FY .
Just for the sake of completeness, since the above definition captures
what conditional expectation with respect to a σ -algebra does, but
breaks down when you try to generalize it to infinite sample spaces,
here’s the “official definition” that people use (which is quite counter-
intuitive and takes some time to digest):
This essentially says the same thing when you think about the possible
H , but in a more non-constructive way. It turns out that this random
variable is not unique in the strictest sense, but P (Y ≠ Z) = 0 holds,
where Y and Z are two candidates for this random variable. Also, this
definition makes it clear that X − E[X∣H] is orthogonal to the
indicator function of every event in H (so we get a more precise
definition of projecting down).
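On a finite space the whole construction fits in a few lines. A sketch with a hypothetical fair die and F2 generated by parity: the conditional expectation averages X over each minimal set, and coarsening preserves the expectation.

```python
# Conditional expectation w.r.t. a coarser sigma-algebra on a finite space.
# Hypothetical example: a fair die, F2 generated by parity, so the minimal
# sets are the odd faces and the even faces.
P = {w: 1 / 6 for w in range(1, 7)}
X = {w: float(w) for w in range(1, 7)}          # X = face value
minimal_sets = [frozenset({1, 3, 5}), frozenset({2, 4, 6})]

def cond_exp(X, parts):
    # E[X | F2] is constant on each minimal set: the probability-weighted
    # average of X over that set.
    Xp = {}
    for part in parts:
        mass = sum(P[w] for w in part)
        avg = sum(P[w] * X[w] for w in part) / mass
        for w in part:
            Xp[w] = avg
    return Xp

def E(Y):
    return sum(P[w] * Y[w] for w in P)

Xp = cond_exp(X, minimal_sets)
assert abs(Xp[1] - 3.0) < 1e-9          # average of the odd faces
assert abs(Xp[2] - 4.0) < 1e-9          # average of the even faces
# Coarsening does not change the expectation: E[E[X|F2]] = E[X].
assert abs(E(Xp) - E(X)) < 1e-9
```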
Martingales
Let’s say there is a “process” of some sort, where at every step, we get
more information. Let’s say our sample space for the whole experiment
is Ω. Here, Ω is almost always infinite, but the intuition we built carries
forward here, even though the concepts like minimal sets might not.
Let’s take a very simple example: we toss a coin at each step. The
sample space for the whole experiment is the set of all infinite binary
strings. Let’s think about the kind of information that we have available
at various steps in the process. After step 0, all the information we have
(for instance, the complete information about a random variable that
has information about what has happened until now) corresponds to
the trivial σ -algebra {∅, Ω}. After step i, we have information for the
first i coin tosses, so the information corresponds to the σ -algebra with
“minimal” sets being the sets of binary strings which are the same in the
first i positions.
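A finite stand-in for this picture (binary strings of length 4 instead of infinite ones, an assumption made purely to keep the sets finite): the minimal sets of Fi are exactly the prefix classes, and each step refines the partition.

```python
from itertools import product

# Omega = binary strings of length 4; F_i is generated by the first i
# tosses, so its minimal sets are classes of strings sharing a length-i
# prefix.
n = 4
Omega = [''.join(bits) for bits in product('01', repeat=n)]

def minimal_sets(i):
    classes = {}
    for s in Omega:
        classes.setdefault(s[:i], []).append(s)
    return list(classes.values())

# F_0 is the trivial sigma-algebra (one minimal set, all of Omega), and
# each toss doubles the number of minimal sets: 2^i after i tosses.
assert [len(minimal_sets(i)) for i in range(n + 1)] == [1, 2, 4, 8, 16]
```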
Now let’s get to what a process really is. At every step of a process, a
new random variable shows up. But what is the σ -algebra for each of
these random variables? To handle this uniformly, we take a σ -algebra F
that captures all possible information. So we define a stochastic process
as follows:
For a given probability space (Ω, F, P ), and an index set I with a total
order ≤, a stochastic process X is a collection of random variables
{X(i)∣i ∈ I}.
Let {Xn } be a stochastic process on a probability space (Ω, F, P ). Then
its natural filtration is defined by Fn := σ(Xk ∣ k ≤ n), the σ -algebra
generated by the first n random variables. In this language, a martingale
is a stochastic process satisfying E[Xi+1 − Xi ∣ σ(X1 , … , Xi )] = 0 for
every i.
There are many more examples of martingales, but I won’t go too much
in depth about that, and would leave you to refer to some other
resource for more examples.
Some more intuition about martingales: let’s see what happens to the
conditional expectation in the definition of a martingale when we apply
a convex function to a stochastic process (i.e., to all random variables in
the stochastic process). Since probabilities are all non-negative, whenever
we’re taking the conditional expectation, we are taking a convex
combination of the values of the convex function of a random variable.
By Jensen’s inequality, we get that rather than equality, ≥ holds in the
defining equation for martingales. Similarly, if we had a concave
function, ≤ would hold in the defining equation. These types of
processes (where ≥ and ≤ hold instead of =) are called sub-
martingales and super-martingales respectively. Note that not all
stochastic processes are one of sub-/super-martingales. Note that ≥ 0
and ≤ 0 means that the random variable on the LHS is almost surely
non-negative and non-positive, respectively.
An example of a sub-martingale
2. Azuma’s inequality: for a super-martingale X with ∣Xk − Xk−1 ∣ ≤ ck
for all k , we have P (XN − X0 ≥ ε) ≤ exp(−ε² / (2 ∑_{k=1}^{N} c_k²)).
Since the negative of a super-martingale is a sub-martingale, the
corresponding lower-tail bound holds for sub-martingales:
P (XN − X0 ≤ −ε) ≤ exp(−ε² / (2 ∑_{k=1}^{N} c_k²)). Since a martingale
is both a sub-martingale and a super-martingale, P (∣XN − X0 ∣ ≥ ε) ≤
2 exp(−ε² / (2 ∑_{k=1}^{N} c_k²)). There is a stronger version of this
inequality.
3. Doob’s martingale convergence theorem: this is a result that says
that a super-martingale that is bounded from below will converge
(almost surely) to a random variable with finite expectation. Note
that if we bar the case when the variable diverges to infinity in some
sense, the only other possible way for it to not converge is to oscillate.
This theorem roughly says that bounded martingales don’t oscillate.
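A quick Monte Carlo sanity check of the two-sided bound above, for a simple symmetric random walk (so each ck = 1; the values of N, ε, and the trial count are arbitrary choices):

```python
import math
import random

# Two-sided Azuma bound for a walk with increments in {-1, +1} (c_k = 1):
#   P(|X_N - X_0| >= eps) <= 2 * exp(-eps**2 / (2 * N)).
random.seed(0)
N, eps, trials = 100, 25, 20000

hits = sum(
    abs(sum(random.choice((-1, 1)) for _ in range(N))) >= eps
    for _ in range(trials)
)
empirical = hits / trials
bound = 2 * math.exp(-eps ** 2 / (2 * N))

# The empirical tail probability respects the bound, with a lot of room:
# the bound is ~0.088 while the observed frequency is around 0.01.
assert empirical <= bound
```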
Stopping times
Stopping times are quite interesting mathematical objects, and when
combined with martingales, they give a lot of interesting results and are
a very powerful tool.
To that end, let’s consider a random variable τ (that takes values in the
index set, that is the set of non-negative integers here) on a probability
space (Ω, F, P ), and let’s say we have a stochastic process X on the
same probability space with an associated filtration {Fi }. We say that τ
is a stopping time with respect to this filtration if the event {τ ≤ t} is in
Ft for every t; that is, at every time t we have enough information to
tell two different stopping times apart, when not both are > t. In our
coin flip example, this is why you can’t have a stopping time that is 4 for
000010110… and 6 for 000011000…: the two strings agree in the first
four positions, so no event in F4 can separate them.
Another way of formalizing the same thing is to say that the stochastic
process (denoting the stopping of a process) defined by Yt = 0 if τ < t
and 1 otherwise is an adapted process wrt the filtration. That is,
σ(Yt ) ⊆ Ft for all t.
Since the stopping time is an index, the most natural thing at this point
is to think of how to index the stochastic process with some function of
stopping time. If we go back to the definition of a random variable, it
should be clear that intuitively we need to assign a value to it at every
element in Ω. In simple cases, this is trivial, since we can just compute
the value of τ at the minimal set which has this element (let’s say this
value is t), and compute the value of Xt at the minimal set again. This
gives us the random variable Xτ . A closely related construction is the
stopped process Yt := X_{min(τ, t)} , where we just arbitrarily stop a
process once the stopping time has passed. These are very important
types of processes, and have been extensively studied.
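A sketch of a stopped process for the coin-toss walk, with the hypothetical stopping time “first time the walk hits ±2”:

```python
import random

# Stopped process Y_t = X_{min(tau, t)} for a symmetric random walk, with
# the hypothetical stopping time tau = first hitting time of {-2, +2}.
random.seed(1)
T = 20
path = [0]
for _ in range(T):
    path.append(path[-1] + random.choice((-1, 1)))

# tau is determined by the path seen so far, hence a legitimate stopping
# time (default to T if the level is never hit within the horizon).
tau = next((t for t, x in enumerate(path) if abs(x) >= 2), T)

stopped = [path[min(t, tau)] for t in range(T + 1)]

# Before tau the stopped process follows the walk; afterwards it is frozen.
assert stopped[:tau + 1] == path[:tau + 1]
assert all(v == path[tau] for v in stopped[tau:])
```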
The cool part about martingales and stopping times together is that this
setup has been studied a lot, and there are quite a few important
results that I’ll list without proof, which can help us solve a lot of
problems as well as build an understanding around stopping times and
martingales:
1. Doob’s optional stopping theorem: we would like to claim that
E[Xτ ] = E[X0 ], and it turns out that this is true if any of the
following three conditions holds:
The stopping time τ is bounded above.
The martingale is bounded and the stopping time τ is finite almost
surely (i.e., P (τ = ∞) = 0).
The expected value of τ is finite, and the differences ∣Xi+1 − Xi ∣
are uniformly bounded.
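These conditions are exactly what makes the classic gambler’s-ruin computation work: for a symmetric walk started at a and stopped on hitting 0 or b, the stopped walk is a bounded martingale and τ is almost surely finite, so E[Xτ ] = E[X0 ] = a, which forces P(hit b) = a/b. A Monte Carlo sketch (a = 3, b = 10 are arbitrary choices):

```python
import random

# Gambler's ruin: symmetric walk from a, absorbed at 0 or b. Optional
# stopping gives p * b + (1 - p) * 0 = a, i.e. p = P(hit b) = a / b.
random.seed(0)
a, b, trials = 3, 10, 20000

wins = 0
for _ in range(trials):
    x = a
    while 0 < x < b:
        x += random.choice((-1, 1))
    wins += (x == b)

# The empirical hitting frequency should be close to a / b = 0.3.
assert abs(wins / trials - a / b) < 0.02
```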
Solution sketch
Problem 2 (folklore): Suppose there is an infinite grid, with a non-
negative real number in each cell. Suppose that for every cell, the
number in it is the mean of the numbers in its 4 neighbouring cells.
Show that all the numbers must be equal.
Solution sketch
Problem 3 (USA IMO 2018 Winter TST 1, P2): Find all functions
f : Z2 → [0, 1] such that for any integers x and y ,
f (x, y) = (f (x − 1, y) + f (x, y − 1)) / 2
Solution sketch
expected value of S .
Problem 5 (AGMC 2021 P1): In a dance party, initially there are 20 girls
and 22 boys in the pool and infinitely many more girls and boys waiting
outside. In each round, a participant from the pool is picked uniformly at
random; if a girl is picked, she invites a boy from the pool to dance, and
both of them leave the party after the dance; if a boy is picked, he
invites a girl and a boy from the waiting line, and the three of them
dance together and stay in the pool after the dance. The party is over
when there are only (two) boys left in the pool.
Solution sketch
1. 1575F
2. 1479E
3. 1392H
4. 1349D
5. 1025G
6. 850F
7. USACO 2018 — Balance Beam