

Probability 101, the intuition behind martingales and solving problems with them

December 31, 2022 · 34 min · 7042 words · nor

This post was originally written on Codeforces; relevant discussion can be found here.

Recently someone asked me to explain how to solve a couple of problems which went like this: “Find the expected time before XYZ happens”. Note that here the time to completion is a random variable, and in probability theory, such random variables are called “stopping times” for obvious reasons. It turned out that these problems were solvable using something called martingales, which are random processes with nice invariants and a ton of properties.

I think a lot of people don’t get the intuition behind these topics and why they’re so useful. So I hope the following intuitive introduction helps people develop a deeper understanding. Note that in the spirit of clarity, I will only be dealing with finite sets and finite-time processes (unless absolutely necessary, and in such cases, the interpretation should be obvious). I will do this because people tend to get lost in the measure-theoretic aspects of the rigor that is needed and skip the intuition that they should be developing instead. Also, note that sometimes the explanations will have more material than is strictly necessary, but the idea is to have a more informed intuition rather than presenting a terse-yet-complete exposition that leaves people without any idea about how to play around with the setup.

People already familiar with the initial concepts can skip to the
interesting sections, but I would still recommend reading the
explanations as a whole in case your understanding is a bit rusty. The
concept of “minimal” sets is used for a lot of the mental modelling in
the post, so maybe you’d still want to read the relevant parts of the
blog where it is introduced.

I would like to thank Everule and rivalq for suggesting that I write this post, and them as well as meme and adamant for proofreading and discussing the content to ensure completeness and clarity.

BONUS: Here’s an interesting paper that uses martingales to solve stuff.

Table of contents
1. Probability, sigma algebras, and random variables
2. Expected value of a random variable and conditional probability
3. Conditional expectation of a random variable with respect to a sigma
algebra
4. Martingales
5. Stopping times
6. Some math problems
7. Some competitive programming problems

Probability, sigma algebras, and random variables
Let’s look at the simplified definition of a probability space first. We say
simplified here because there are some technicalities that need to be
sorted out for infinite sets, which we’re not considering here anyway.

A probability space is a triple (Ω, F, P) consisting of

1. Ω: the (non-empty) ground set, also known as the sample space. It is helpful to think of it as the set of all possible outcomes of an experiment.
2. F: the σ-algebra. This is a collection of subsets of Ω. Note that this is not always the power set of Ω. It is helpful to think of its elements (the chosen subsets) as “events”. For example, when rolling a die, one possible event can be “getting an even number”. For this experiment, we have Ω = {1, 2, 3, 4, 5, 6} and this event is {2, 4, 6}.
3. P: the probability function, a function from F to [0, 1]. This is just a weight that we assign to every event that we have chosen.

For this function to have some nice properties and for it to make sense
as some measure of “chance”, we add some more constraints to these
functions:

1. Ω is an event (i.e., Ω ∈ F).
2. If A ∈ F, then Ω ∖ A ∈ F. That is, if something happening is an event, then it not happening is also an event.
3. If A ∈ F and B ∈ F, then A ∪ B ∈ F. That is, if X happening is an event and Y happening is an event, then at least one of X and Y happening is also an event. Note that this combined with the previous constraint allows us to say that A ∩ B is an event and so on, due to De Morgan’s law.
4. P(Ω) = 1, that is, the probability of the whole sample space is 1.
5. For disjoint sets A and B, P(A ∪ B) = P(A) + P(B). This, along with the previous constraint, is sufficient to derive identities like P(∅) = 0, P(A) + P(Ω ∖ A) = 1 and so on.
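To make the definition concrete, here’s a minimal Python sketch of the die example above (using the coarse “parity” σ-algebra rather than the full power set; the representation and names are mine, chosen for illustration) that mechanically checks all five constraints:

```python
# Die example: Omega = {1, ..., 6}, with the coarse sigma-algebra whose
# events are "nothing", "odd", "even", and "anything".
Omega = frozenset({1, 2, 3, 4, 5, 6})
odd, even = frozenset({1, 3, 5}), frozenset({2, 4, 6})
F = {frozenset(), odd, even, Omega}

# P assigns a weight in [0, 1] to every event; here, the fair die.
P = {A: len(A) / 6 for A in F}

assert Omega in F and P[Omega] == 1            # constraints 1 and 4
for A in F:
    assert Omega - A in F                      # closed under complements
    for B in F:
        assert A | B in F                      # closed under unions
        if not (A & B):                        # disjoint events...
            assert abs(P[A | B] - (P[A] + P[B])) < 1e-9   # ...are additive
```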

Let’s now build some intuition about this definition (the following,
especially the part about minimal sets, applies only to finite sets, but
there are a lot of similarities in the conclusions when we try to
generalize to other sets using measure-theoretic concepts).

Note that since F is closed under intersections, if we do the following procedure multiple times: take two sets and replace them with their intersection, we will end up with some “minimal” non-empty sets that we can’t break further (this terminology is not standard at all, but we will use it a lot in what follows). Consider the set of all such possible minimal sets. Clearly, these sets form a partition of Ω, since they are disjoint (else we get a contradiction to minimality) and since every element is in a minimal set (using the fact that F is closed under complements and constructing a minimal set if needed, or by just intersecting all sets containing that element; there is at least one such set, namely Ω). Now since F is also closed under unions, it also equals the set of all possible unions of these minimal subsets. In other words, we have a partition of the elements of Ω into some sets, and F is the set formed by all possible unions of sets in that partition. We can now think of σ-algebras in terms of partitions! Also, note that the probabilities for each of these minimal events of the σ-algebra add up while taking unions.
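This “minimal set” construction is easy to mechanize for finite sets. Here’s a sketch (the function name is mine) that recovers the minimal sets by intersecting, for each element, all events containing it, which is exactly the second construction mentioned above:

```python
def minimal_sets(Omega, F):
    """Partition Omega into the minimal non-empty events of F."""
    atoms = set()
    for x in Omega:
        atom = Omega
        for A in F:
            if x in A:
                atom = atom & A   # closure under intersection keeps this in F
        atoms.add(atom)
    return atoms

# For the die example with the parity sigma-algebra:
Omega = frozenset({1, 2, 3, 4, 5, 6})
odd, even = frozenset({1, 3, 5}), frozenset({2, 4, 6})
F = {frozenset(), odd, even, Omega}
assert minimal_sets(Omega, F) == {odd, even}   # the partition of Omega
```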

In some sense, we can now construct an equivalent probability space (Ω’, F’, P’) where Ω’ is the set of minimal events of F, F’ is the power set of Ω’, and P’ is defined naturally from P. This definition is not standard but will be useful later on. Note that if F was the power set of Ω, then this equivalent probability space would have been the same as the original probability space. In this sense, all other σ-algebras are “coarser” than the power set, whose minimal sets are all singletons.

At this point, we should realize that F is a measure of how much “information” we have about Ω (we can’t distinguish between two elements in the same minimal set). This idea, along with the partition intuition, is a good way to deal with how you gain information over time in random processes, but we’ll come to that a bit later.

Let’s now turn to the formal definition of a random variable.

A random variable with respect to a probability space (Ω, F, P) is a function X from Ω to ℝ such that X⁻¹((−∞, x]) ∈ F for all x ∈ ℝ. That is, the set of pre-images of real numbers ≤ x should be an element of F.

Note that since we are dealing with finite sets, we will use an alternative (but equivalent for finite sets) definition where we replace the interval (−∞, x] by {x}.

When written like this, it might be a bit intimidating, but the underlying idea is quite simple. We have a sample space Ω, and we have another set ℝ. Now we want to map elements of Ω to ℝ in some way so that we can do some computations over elements of Ω. However, since in a probability space, F tells us how much information we have (or are allowed to look at), we shouldn’t be able to distinguish between elements of Ω that are inside the same minimal event of F. Since there are finitely many elements in Ω, and elements inside the same minimal event should be mapped to the same real number, there can be only finitely many intervals that X partitions ℝ into, and the condition in the definition is necessary and sufficient (due to the closure of F under intersections and unions).
As a side-note, note that we can also define functions of a random
variable at this point by simply composing the function X with a
function from R to R. Similarly, if we have two random variables on the
same probability space, we can also define functions of them quite
naturally. After all, we just need to specify their values on the minimal
sets and we’re done.

Expectation of a random variable and conditional probability
Let’s try to investigate random variables a bit further. Let’s say we have
a probability space (Ω, F, P ) and a random variable X on this
probability space. How do we capture some sort of a mean value of X
over the entire probability space? One possible answer is the following:
for each minimal event, we consider the common value of the random
variable at each element of that event, and add it to a weighted sum
with weight equal to the probability of that minimal event. Note that
minimal events form a partition so the sum of these weights is 1. In
some sense, this is an average of the random variable over all minimal
events, weighted by their importance.

But how do we write this up mathematically without referring to minimal events? Let’s take advantage of the condition in the definition of a random variable. Instead of summing over minimal events, we can sum over distinct values x of the random variable, and weight them by the probability of the random variable being equal to x. This probability is well-defined due to the condition in the definition of a random variable (because the event that the random variable equals x is in fact a disjoint union of minimal events, and their probabilities add up). This definition is also consistent with our earlier intuition.

So we can define the expected value of a random variable X to be E[X] := Σ_{x ∈ range(X)} x · P(X = x).
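As a quick sanity check that the two viewpoints agree, here’s a sketch (continuing the die example; the choice of X is mine) that computes the expectation both as a weighted sum over minimal events and as a sum over distinct values:

```python
Omega = frozenset({1, 2, 3, 4, 5, 6})
odd, even = frozenset({1, 3, 5}), frozenset({2, 4, 6})

# X must be constant on minimal sets: here X = 1 on odds, 0 on evens.
X = {w: w % 2 for w in Omega}

# Weighted sum over minimal events (each atom has probability 1/2).
E1 = sum(X[min(atom)] * 0.5 for atom in (odd, even))

# Sum over distinct values x of x * P(X = x).
E2 = sum(x * sum(1 / 6 for w in Omega if X[w] == x)
         for x in set(X.values()))

assert abs(E1 - E2) < 1e-9   # both equal 1/2
```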

Note that by our earlier definition related to minimal events, we can also clearly see that for any random variables X, Y on the same probability space, we have E[X + Y] = E[X] + E[Y].

Let’s now change gears and think a bit about how we can isolate an event. Suppose we have an event E. Let’s say that we want to ignore everything else, and only look at the elements of E (and consequently the minimal events of F that form a partition of E). Can we make a probability space out of it? One easy way would be to just inherit the structure of Ω and F, and set the probabilities of everything that uses elements other than the minimal events constituting E to 0. But we run into trouble if we simply decide to inherit the probability function: the sum of probabilities over all such minimal events is not 1. One possible idea is to normalize by P(E), which is possible iff it is non-zero. It can easily be checked that the resulting probability space is indeed a valid probability space.

This is how we define the conditional probability space wrt an event E with P(E) > 0. The sample space and the σ-algebra remain the same, but the probability function becomes P(A|E) := P’(A) = P(A ∩ E)/P(E). This can be verified to be consistent with our earlier reasoning by considering minimal events.

A cleaner way of looking at it is that we’re really only concerned with some subset of Ω and F that correspond to elements in E, and we define a probability space with E as the ground set. But just to make sure that we can interoperate with the other events in F, we take the event, trim down the parts that are not in E, and then use the probability (or relative weight) of that set wrt the probability (or weight) of E instead.

One corollary of this is the following: Σ_i P(A | E_i) P(E_i) = P(A) for a partition of Ω into events E_i. This can be seen to be true if we consider the minimal sets in A. In fact, this is used while solving a lot of recurrences, like in Bayes’ theorem.

This also has an expected value variant: if we naturally restrict a random variable to each conditional probability space, then by summing over minimal sets, we get that if E[X | E_i] is the expected value of X wrt the conditional probability space with respect to E_i, then Σ_i E[X | E_i] P(E_i) = E[X]. This variant is used in recurrences too, like finding the expected time before we toss two heads in a row with a fair coin (though the sample space is infinite there, these identities still hold).
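For that two-heads-in-a-row example, conditioning on the next toss gives a small linear system; here’s a sketch (the state names are mine) that records the solution and checks it by simulation:

```python
import random

# Let E0 = expected remaining tosses with no trailing head,
#     E1 = expected remaining tosses with one trailing head. Then
#   E0 = 1 + (E1 + E0) / 2    and    E1 = 1 + (0 + E0) / 2,
# which solves to E0 = 6 and E1 = 4.

def tosses_until_two_heads():
    tosses, streak = 0, 0
    while streak < 2:
        tosses += 1
        streak = streak + 1 if random.random() < 0.5 else 0
    return tosses

N = 100_000
print(sum(tosses_until_two_heads() for _ in range(N)) / N)  # close to 6
```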

Conditional expectation with respect to a sigma algebra
Let’s continue with the partition and information analogy even further.
Recall how we had defined an equivalent probability space. It
corresponds to collapsing equivalence classes (determined by the
partition) into a single element, and making a probability space on that
as the sample space. What if we do that again, after defining another
sparser/coarser probability space on the resulting probability space?

Of course, this leads to a coarser partition of the sample space in some sense, and if F_1 is the original σ-algebra, and F_2 is the new σ-algebra (both of them corresponding to the same Ω, after “unwrapping/unrolling” the second coarsening), then F_2 ⊆ F_1.

Now suppose we had a random variable X on the original probability
space. How do we reduce/adapt the random variable to this coarser
probability space? This is where expected value comes into the picture.
We will choose a definition for conditional expectation, and see that it
satisfies a lot of good properties and is consistent with our conditional
probability identities that we figured out in the previous section.

Note that X is constant on all minimal sets of F_1. Let’s define the conditional expectation of X with respect to the σ-algebra F_2 as the random variable E[X | F_2] := X’ such that on every element of a minimal set E_i of F_2, X’ assumes the value E[X | E_i]. This essentially means that we are trying to take a “conditional average” of the information about the values of X on the set E_i. It is easy to verify that E[X’] = E[X], where the first expectation is taken over the coarser probability space, and the second is taken on the finer probability space. In a case of unfortunate naming, X’ is called the conditional expectation of X wrt the σ-algebra F_2.

To make this a bit clearer, let’s take an example.

Example

We note a couple of points here:

1. In a way, we are trying to replace the random variable with its “best approximation” in some sense (for instance, in a variance-minimization sense, but that is totally unrelated). Also note that the expectation of the random variable does not change when we apply this coarsening to X.
2. Note that for the trivial σ-algebra F_2 = {∅, Ω}, the random variable E[X | F_2] is identically the constant function that equals E[X].
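Here’s a small sketch of this coarsening in code (the die example again; the helper name is mine): projecting the face value of a fair die down to the parity σ-algebra, and checking that the expectation survives the coarsening:

```python
Omega = frozenset({1, 2, 3, 4, 5, 6})
odd, even = frozenset({1, 3, 5}), frozenset({2, 4, 6})
P_point = {w: 1 / 6 for w in Omega}   # fair die, finest sigma-algebra
X = {w: w for w in Omega}             # X = face value

def coarsen(X, atoms, P_point):
    """E[X | F2]: on each minimal set of F2, take the conditional average."""
    Xp = {}
    for atom in atoms:
        mass = sum(P_point[w] for w in atom)
        avg = sum(X[w] * P_point[w] for w in atom) / mass
        for w in atom:
            Xp[w] = avg
    return Xp

Xp = coarsen(X, (odd, even), P_point)   # Xp = 3 on odds, 4 on evens

E = lambda Z: sum(Z[w] * P_point[w] for w in Omega)
assert abs(E(Xp) - E(X)) < 1e-9         # coarsening preserves the expectation
```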



We can also chain a couple of coarsenings. Suppose F_3 ⊆ F_2 ⊆ F_1 and X is a random variable wrt the probability space (Ω, F_1, P). Then we have E[E[X | F_2] | F_3] = E[X | F_3], as can be verified easily by chaining conditional expectations. In fact, the fact that the expectation remains the same even after coarsening is a direct consequence of this and the last point.

A more commonly used term is conditional expectation with respect to another random variable. It is usually defined as this:

Suppose X and Y are random variables on the same probability space. For any element ω of the sample space where Y(ω) = y, we define E[X | Y](ω) = E[X | E] where E is the event that Y = y. That is, for every element where Y assumes a value of y, we take the conditional expectation wrt the event that Y = y and set the value of the answer to that conditional expectation.

However, this is the same as saying this: let F_Y be the σ-algebra where the minimal sets correspond to distinct values of Y. Then E[X | Y] is identically equal to the random variable E[X | F_Y]. As a side-note, this constructed σ-algebra is called the σ-algebra generated by the random variable Y.

In this sense, conditional expectation with respect to a σ-algebra is a cleaner and more sophisticated concept than conditional expectation with respect to a random variable, though both concepts are equivalent (you can always induce a σ-algebra by choosing where to put equal values for the random variable you’re conditioning on).

Another way of thinking about this concept is to think of it as projecting down from a set of minimal events to a coarser set of minimal events. For each chunk of minimal events we’re collapsing, we replace it with a single minimal event at the “center of mass” of those events, where the “position” is the value of the random variable, and the “mass” equals the probability of those minimal events. This works out mathematically too, and is precise for this setup.

Just for the sake of completeness (since the above definition captures what conditional expectation with respect to a σ-algebra does, but breaks down when you try to generalize it to infinite sample spaces), here’s the “official definition” that people use (which is quite counter-intuitive and takes some time to digest):

A conditional expectation of X given H, denoted as E[X | H], is any random variable on (Ω, H, P) which satisfies ∫_A E[X | H] dP = ∫_A X dP for each A ∈ H.

This essentially says the same thing when you think about the possible A, but in a more non-constructive way. It turns out that this random variable is not unique in the strictest sense, but P(Y ≠ Z) = 0 holds, where Y and Z are two candidates for this random variable. Also, this definition makes it clear that X − E[X | H] is orthogonal to the indicator function of each A ∈ H (so we get a more precise definition of projecting down).

Martingales
Let’s say there is a “process” of some sort, where at every step, we get
more information. Let’s say our sample space for the whole experiment
is Ω. Here, Ω is almost always infinite, but the intuition we built carries
forward here, even though the concepts like minimal sets might not.
Let’s take a very simple example: we toss a coin at each step. The
sample space for the whole experiment is the set of all infinite binary
strings. Let’s think about the kind of information that we have available
at various steps in the process. After step 0, all the information we have
(for instance, the complete information about a random variable that
has information about what has happened until now) corresponds to
the trivial σ -algebra {∅, Ω}. After step i, we have information for the
first i coin tosses, so the information corresponds to the σ -algebra with
“minimal” sets being the sets of binary strings which are the same in the
first i positions.

To understand why σ-algebras are necessary to capture information, perhaps the following analogy helps. Think in terms of minimal sets. Each time we get some more information, we might be able to distinguish between elements in minimal sets. To be able to do operations like finding the expectation of some random variable in this scenario, we need to be able to distinguish between these elements too. In some sense, you’re inspecting multiple parallel universes at the same time, and in each universe, you have a certain prefix of things that happened, for each of which you want to be able to analyze what happens later.

Note that we always get more information at every stage of such a stochastic process. So we define a filtration as follows (without any reference to a process, for the sake of generality): a sequence of σ-algebras F_i such that F_i ⊆ F_j whenever i ≤ j.

Now let’s get to what a process really is. At every step of a process, we have a random variable that shows up. But what is the σ-algebra for that random variable? To handle this, we take a σ-algebra F that captures all possible information. So we define a stochastic process as follows:

For a given probability space (Ω, F, P), and an index set I with a total order ≤, a stochastic process X is a collection of random variables {X(i) | i ∈ I}.

For our purposes, I will mostly be the set of non-negative integers, though I = ℝ is used extensively in a lot of contexts. I having a total order means that there is some concept of time associated with I.

How do we now construct a “natural filtration” that captures the information that the first i steps of the process (i.e., {X(j) | j ≤ i}) give us? Remember how we said we can construct a σ-algebra from a random variable? We can also use the same construction to construct it from a finite set of random variables, by refining two partitions together while it is possible (this corresponds to the join of the two partitions in the refinement lattice, if you’re familiar with posets and lattices). So the filtration can be set up this way too, by refining the current σ-algebra with a new random variable each time. More formally, paraphrasing from Wikipedia,

Let (X_n)_{n∈ℕ} be a stochastic process on the probability space (Ω, F, P). Then

F_n := σ(X_k | k ≤ n)

is a σ-algebra, and (F_n)_{n∈ℕ} is a filtration. Here σ(X_k | k ≤ n) denotes the σ-algebra generated by the random variables X_1, X_2, …, X_n. This is a filtration, since by definition all F_n are σ-algebras, and σ(X_k | k ≤ n) ⊆ σ(X_k | k ≤ n + 1).

This is known as the natural filtration of F with respect to X.
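The “refine the current partition with each new random variable” construction can be sketched directly. Here’s a minimal example with two coin tosses (all names are mine):

```python
def refine(partition, X):
    """Join of a partition with the level sets of X: the result is the
    partition into minimal sets generated by the old information plus X."""
    parts = set()
    for part in partition:
        for v in {X[w] for w in part}:
            parts.add(frozenset(w for w in part if X[w] == v))
    return parts

# Two coin tosses; X1 = first toss, X2 = second toss.
Omega = {"HH", "HT", "TH", "TT"}
X1 = {w: w[0] for w in Omega}
X2 = {w: w[1] for w in Omega}

F0 = {frozenset(Omega)}   # trivial sigma-algebra: no information yet
F1 = refine(F0, X1)       # minimal sets {HH, HT} and {TH, TT}
F2 = refine(F1, X2)       # all singletons: full information
assert len(F0) == 1 and len(F1) == 2 and len(F2) == 4
```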

Finally, we are able to define what a martingale is. Note that if σ(X_i) ⊆ H for some σ-algebra H, then E[X_i | H] = X_i (think about the minimal sets and the definition). A martingale essentially says that conditional on the information we have until time i, a random variable X_j for j ≥ i looks like X_i. More formally,

A martingale is a stochastic process X_1, X_2, … such that E[X_{n+1} | σ(X_1, X_2, …, X_n)] = X_n, and E[|X_n|] < ∞.

Equivalently, the first condition can be written as E[X_{n+1} − X_n | σ(X_1, X_2, …, X_n)] = 0. Note that we talk about “almost sure” equality of random variables when dealing with more complicated probability spaces, but for our purposes, we will not concern ourselves with this detail.

Note that this is necessary (trivial) and sufficient, because X_j − X_i = (X_j − X_{j−1}) + (X_{j−1} − X_{j−2}) + ⋯ + (X_{i+1} − X_i), and by applying the nested coarsening property, it reduces to E[X_j − X_i | σ(X_1, …, X_i)] = 0.

Why is a martingale so useful? Firstly, it gives us an invariant, and as is well-known, invariants make life easier when we try to prove results in math. Secondly, it originated in the theory of gambling, and for fair gambling games, the expected value of the profit at a later point in time, given everything we know now, should be equal to the current profit, which is precisely the martingale condition!

Let’s look at some constructions of martingales.

1. Consider a random walk, where the initial position is X_0 = 0, and X_i = X_{i−1} ± 1 with equal probability p ≤ 1/2, and X_i = X_{i−1} with probability 1 − 2p. Then this is a martingale (from the definition).
2. Consider a random walk denoting the total assets of a gambler who keeps reinvesting all his capital into the game. The game gives a payout of r percent if he wins (with probability p), a loss of r percent if he loses (with probability p), and no payout in case it’s a draw (with probability 1 − 2p). Then his wealth is a martingale.

There are many more examples of martingales, but I won’t go into too much depth about that, and will leave you to refer to some other resource for more examples.

Some more intuition about martingales: let’s see what happens to the
conditional expectation in the definition of a martingale when we apply
a convex function to a stochastic process (i.e., to all random variables in
the stochastic process). Since probabilities are all positive, whenever
we’re taking the conditional expectation, we are taking a convex
combination of the values of the convex function of a random variable.
By Jensen’s inequality, we get that rather than equality, ≥ holds in the
defining equation for martingales. Similarly, if we had a concave
function, ≤ would hold in the defining equation. These types of
processes (where ≥ and ≤ hold instead of =) are called sub-
martingales and super-martingales respectively. Note that not all
stochastic processes are one of sub-/super-martingales. Note that ≥ 0
and ≤ 0 means that the random variable on the LHS is almost surely
non-negative and non-positive, respectively.
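As a quick concrete instance (the example and parameters are mine): the absolute value of a simple random walk is a convex function of a martingale, hence a sub-martingale, and a simulation shows its mean is non-decreasing in time:

```python
import random

def mean_abs_walk(t, runs=50_000):
    """Estimate E[|X_t|] for a simple random walk started at 0.
    E[|X_{t+1}| | F_t] = |X_t| when X_t != 0, and = 1 when X_t = 0,
    so |X_t| is a sub-martingale and E[|X_t|] is non-decreasing."""
    total = 0
    for _ in range(runs):
        x = 0
        for _ in range(t):
            x += random.choice((-1, 1))
        total += abs(x)
    return total / runs

print([round(mean_abs_walk(t), 3) for t in (1, 5, 10, 20)])  # non-decreasing
```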

An example of a sub-martingale

I won’t go into a lot of detail about martingales (since they deserve a field of study of their own), but will just point out some nice results that help to get a feel of how martingales behave:
1. Azuma’s inequality: this provides a bound on how “close” we will be to the initial value of the random variable, given bounds on the movement of the martingale (or on how much in the wrong direction we can go for a sub-/super-martingale). Let’s say {X_k} is a super-martingale (which a martingale also is), and let’s suppose |X_k − X_{k−1}| ≤ c_k almost surely. Then we have P(X_N − X_0 ≥ ε) ≤ exp(−ε^2 / (2 Σ_{k=1}^N c_k^2)). Since the negative of a super-martingale is a sub-martingale, we get for a sub-martingale X, P(X_N − X_0 ≤ −ε) ≤ exp(−ε^2 / (2 Σ_{k=1}^N c_k^2)). Since a martingale is both a sub-martingale and a super-martingale, we have a probabilistic bound on how close X_N will be to X_0, that is, P(|X_N − X_0| ≥ ε) ≤ 2 exp(−ε^2 / (2 Σ_{k=1}^N c_k^2)). There is a stronger version of this inequality, but I think this is quite sufficient to see how well-behaved martingales are (there is a quick empirical check after this list).
2. Doob’s martingale inequality: this is a Markov-style inequality on the probability that a sub-martingale will exceed a value x. More formally, for a sub-martingale X, we have P(max_{1≤i≤n} X_i ≥ x) ≤ E[max(X_n, 0)] / x. This can also be used to show Markov’s inequality.
3. Doob’s martingale convergence theorem: this is a result that says that a super-martingale that is bounded from below will converge (almost surely) to a random variable with finite expectation. Note that if we bar the case when the variable diverges to infinity in some sense, the other possible way for it to not converge is to oscillate. This theorem roughly says that bounded martingales don’t oscillate.
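To see Azuma’s inequality in action, here’s a quick empirical check (parameters mine) on a simple random walk, where every step satisfies |X_k − X_{k−1}| ≤ c_k = 1:

```python
import math
import random

def walk_end(n):
    """Endpoint X_n of a simple random walk, a martingale with c_k = 1."""
    return sum(random.choice((-1, 1)) for _ in range(n))

n, eps, runs = 100, 25, 20_000
empirical = sum(abs(walk_end(n)) >= eps for _ in range(runs)) / runs
azuma = 2 * math.exp(-eps ** 2 / (2 * n))  # 2 exp(-eps^2 / (2 sum c_k^2))
print(empirical, "<=", azuma)              # e.g. ~0.01 <= ~0.088
```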

Stopping times
Stopping times are quite interesting mathematical objects, and when
combined with martingales, they give a lot of interesting results and are
a very powerful tool.

Suppose we are in the gambler’s shoes, and we want to write an algorithm to finish the game, depending on certain outcomes in the process. Stopping times are a way to formalize the analysis of when such exit conditions are reached. The best we can do right now, given the current theory we have, is to define it as a random variable (that may or may not depend on the stochastic process). We also have the constraint that we can’t see the future, so the decision about when to stop must be taken with all the information at the current time.

To that end, let’s consider a random variable τ (that takes values in the index set, which is the set of non-negative integers here) on a probability space (Ω, F, P), and let’s say we have a stochastic process X on the same probability space with an associated filtration {F_i}. We say that τ is a stopping time if {τ = t} is an event in F_t. That is, the decision to stop at time t must be taken with information no more than what we have at time t. To also keep this consistent with real-number indexing, we modify the condition to {τ ≤ t} being an event in F_t.

To understand why this ensures no forward-looking, consider the following. We bunch together elements of Ω that have stopping time ≤ t, and the condition is that we shouldn’t be able to distinguish them without information from after time t. Similarly, it must not be the case that two elements of Ω in the same minimal set of F_t have different stopping times, unless both are > t. In our coin flip example, the latter makes sense if you think about why you can’t have a stopping time of 4 for 000010110… and a stopping time of 6 for 000011000….

Another way of formalizing the same thing is to say that the stochastic process (denoting the stopping of a process) defined by Y_t = 0 if τ < t and Y_t = 1 otherwise is an adapted process wrt the filtration. That is, σ(Y_t) ⊆ F_t for all t.

Since the stopping time is an index, the most natural thing at this point is to think of how to index the stochastic process with some function of the stopping time. If we go back to the definition of a random variable, it should be clear that intuitively we need to assign a value to it at every element of Ω. In simple cases, this is trivial, since we can just compute the value of τ at the minimal set which has this element (let’s say this value is t), and compute the value of X_t at the minimal set again. From here, it should be obvious that X_τ is indeed a random variable on the original probability space, and it denotes the value of the stochastic process when the process is exited. Also note that by the same reasoning, X_{min(τ,t)} is a random variable whose induced σ-algebra is a subset of F_t. X_{min(τ,t)} corresponds to a stopped process, which means that we just arbitrarily stopped a process after a time limit. These are very important types of processes, and have been extensively studied.
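Here’s a sketch of a stopped process (the walk and the bound are my choices): a simple random walk stopped at τ, the first time it hits ±3. Since min(τ, t) is a bounded stopping time, E[X_{min(τ,t)}] stays at X_0:

```python
import random

def stopped_walk(t, a=3):
    """Return X_{min(tau, t)}, where tau is the first time |X| hits a."""
    x = 0
    for _ in range(t):
        if abs(x) == a:      # tau has passed: the process stays frozen
            break
        x += random.choice((-1, 1))
    return x

N = 100_000
print(sum(stopped_walk(10) for _ in range(N)) / N)   # close to X_0 = 0
```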

The cool part about martingales and stopping times together is that this
setup has been studied a lot, and there are quite a few important
results that I’ll list without proof, which can help us solve a lot of
problems as well as build an understanding around stopping times and
martingales:

1. Martingale central limit theorem: This is a generalization of the usual central limit theorem to martingales. The associated process is the usual martingale, but with differences between consecutive values bounded by a fixed constant almost surely. We choose a stopping time as follows: at each step i + 1, compute the conditional variance of X_{i+1} wrt F_i, and keep adding it to a counter. Stop once this counter is more than a certain value σ^2. (In a sense, we are waiting till we have collected a large enough variance/move.) If the stopping time here is τ_σ, then the random variable X_{τ_σ}/σ converges to a normally distributed variable with mean 0 and variance 1, as σ tends to ∞. This also assumes that the sum of variances diverges.
2. Doob’s optional stopping theorem: This is a very important theorem, and will be quite useful to compute the expected values of stopping times later on. It talks about the fairness of a martingale when we stop after a random stopping time, in the sense that we developed above. The metric for fairness is that E[X_τ] = E[X_0], and it turns out that this is true if any of the following three conditions holds:
   - The stopping time τ is bounded above.
   - The martingale is bounded and the stopping time τ is finite almost surely (i.e., P(τ = ∞) = 0).
   - The expected value of τ is finite, and the difference between X_{i+1} and X_i is bounded. Note that this can be used to compute the expected value of the stopping time if we modify the martingale a bit, as we shall see later (and in the sketch below).
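As a sketch of that last trick (the classic gambler’s-ruin computation; the setup and names are mine, not one of the post’s problems): for a simple symmetric random walk started at a and stopped at τ, the first time it hits 0 or b, optional stopping on the martingale X_t gives P(hit b) = a/b, and optional stopping on the modified martingale X_t^2 − t gives E[τ] = a(b − a):

```python
import random

def ruin_run(a, b):
    """Walk from a until hitting 0 or b; return (hit_b, tau)."""
    x, tau = a, 0
    while 0 < x < b:
        x += random.choice((-1, 1))
        tau += 1
    return x == b, tau

a, b, N = 3, 8, 50_000
hits, taus = zip(*(ruin_run(a, b) for _ in range(N)))
print(sum(hits) / N, "vs", a / b)        # E[X_tau] = a   =>  P(hit b) = a/b
print(sum(taus) / N, "vs", a * (b - a))  # X_t^2 - t      =>  E[tau] = a(b-a)
```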

Some math problems

We’ll start out with some math problems that I have seen over the years (apologies, since I don’t remember the sources for some of them).

Problem 1: Suppose there is a gambler who has n dollars and gambles infinitely many times. The payoff of the t-th round is X_t, which is an integer bounded above and below by some fixed constants. We know that E[X_t | X_{t−1}, …, X_1] ≥ c where c > 0. Suppose the probability of the gambler going bankrupt is p(n). Do we have lim_{n→∞} p(n) = 0?
Solution sketch
Problem 2 (folklore): Suppose there is an infinite grid, with a non-negative real number in each cell. Suppose that for every cell, the number in it is the mean of the numbers in its 4 neighbouring cells. Show that all the numbers must be equal.

Solution sketch

Problem 3 (USA IMO 2018 Winter TST 1, P2): Find all functions f : ℤ^2 → [0, 1] such that for any integers x and y,

f(x, y) = (f(x − 1, y) + f(x, y − 1)) / 2.

Solution sketch

Problem 4 (2022 Putnam A4): Suppose that X_1, X_2, … are real numbers between 0 and 1 that are chosen independently and uniformly at random. Let S = Σ_{i=1}^k X_i / 2^i, where k is the least positive integer such that X_k < X_{k+1}, or k = ∞ if there is no such integer. Find the expected value of S.

Solution sketch (due to ABCDE)

Problem 5 (AGMC 2021 P1): In a dance party, initially there are 20 girls and 22 boys in the pool and infinitely many more girls and boys waiting outside. In each round, a participant from the pool is picked uniformly at random; if a girl is picked, then she invites a boy from the pool to dance, and both of them leave the party after the dance; while if a boy is picked, then he invites a girl and a boy from the waiting line to dance together, and all three of them stay after the dance. The party is over when there are only (two) boys left in the pool.

1. What is the probability that the party never ends?


2. Now the organizer of this party decides to reverse the rule, namely that if a girl is picked, then she invites a boy and a girl from the waiting line to dance, and the three of them stay after the dance; while if a boy is picked, he invites a girl from the pool to dance, and both of them leave after the dance. Still, the party is over when there are only (two) boys left in the pool. What is the expected number of rounds until the party ends?

Solution sketch

Problem 6 (folklore): Let a monkey type randomly on a typewriter with alphabet Σ. For a given string w ∈ Σ*, let w_1, …, w_n be the nonempty strings that are simultaneously a prefix and a suffix of w. Then show that the expected number of keystrokes the monkey needs to make to type w is Σ_{i=1}^n |Σ|^{|w_i|}.

Solution sketch

A different approach for the above problem is discussed here.

Problem 7 (Alon, Spencer): Let G = (V, E) be a graph with chromatic number χ(G) = 1000. Let U ⊆ V be a random subset of V chosen uniformly from among all 2^{|V|} subsets of V. Let H = G[U] be the induced subgraph of G on U. Prove that the probability that H has a chromatic number of ≤ 400 is less than 0.01.

Solution sketch

Some competitive programming problems
Almost all the problems that I know about from a competitive programming perspective are discussed in this Chinese post and this editorial to 1479E, which explains the main idea in a slightly different manner. I won’t repeat their discussion (the equations should be understandable if you understand the main approach in the math problems above), but I will just list out the problems that they link, as well as some problems that appeared later (in reverse chronological order).

1. 1575F
2. 1479E
3. 1392H
4. 1349D
5. 1025G
6. 850F
7. USACO 2018 — Balance Beam

Math Probability Martingales Tutorial
