Chapter 2: Math
Just as you can follow a recipe from a cookbook without really learning to cook, it is
certainly possible to study algorithms while avoiding mathematics and proofs. This
may, in fact, be preferable for the practitioner who simply wants help implementing
fundamental algorithms. However, our goal in this text is to learn how to cook,
so to speak. A sound mathematical background is a fundamental prerequisite for
success in the design and analysis of novel algorithms for novel problems.
Algorithmic study is well-suited for those who appreciate mathematics, and can help
develop such an appreciation for those with limited prior mathematical exposure,
since one can learn much about mathematics as a side benefit of studying algorithms.
We assume the reader has some background in discrete mathematics, but we stress
that most of the techniques we use should be quite accessible even for those without
extensive training. Algorithmic analysis mostly entails clever application of basic
mathematical concepts, rather than deep results from advanced mathematics.
In this chapter, we review mathematical techniques used throughout the rest of the
book. We begin with basic notation and terminology, proof techniques, and helpful
tricks. Next, we cover three topics in greater detail that we will find particularly
useful: methods for solving recursively-defined expressions called recurrences (to
analyze recursive algorithms), techniques from probability theory (to analyze ran-
domized algorithms), and important concepts from linear algebra (to analyze some
of the more advanced mathematical algorithms in this book). Finally, we highlight
a few remaining bits of prerequisite knowledge that may benefit the reader from
other areas of mathematical study. The reader with a relatively strong mathemat-
ical background is still advised to peruse this chapter, since it includes discussion
and problems covering a number of interesting algorithmic applications.
Our notation is fairly standard and should not differ appreciably from any other
mathematical text.
Sets. We expect the reader to be familiar with standard notation for manipulating
sets. For example, S ∩ T denotes the intersection of sets S and T, and S ∪ T denotes
their union. There are several formulaic ways to prove certain types of statements
regarding sets. We can prove that S = T by arguing that S ⊆ T and also T ⊆ S.
To show that x ∈ S ∩ T, we could prove separately that x ∈ S and x ∈ T. To argue
that S ⊆ T, we might take an arbitrary element x ∈ S and show that x ∈ T as well.
As a shorthand for summing over a set, we sometimes use a set-valued argument to
a function:
$$f(S) = \sum_{x \in S} f(x) \qquad \text{and} \qquad g(S, T) = \sum_{x \in S,\, y \in T} g(x, y).$$
We often use interval notation to describe a contiguous set of numbers. For example,
[0, 1] denotes the closed set of all real numbers between 0 and 1, with both endpoints
included in the set, and [5, 10) denotes the set of all real numbers between 5 and
10, including the endpoint 5 but not 10. The Cartesian product of two sets, A × B,
is the set of all pairs (a, b) where a ∈ A and b ∈ B. For example, [0, 1] × [0, 1] is the
unit square in the 2D plane with corners (0, 0) and (1, 1).
Logarithms. Logs show up everywhere in the study of algorithms. As with binary
search, many algorithms successively reduce a problem of size n to a smaller
equivalent problem of size at most n/k (for binary search, k = 2). It takes log_k n
such phases to reduce a problem of size n down to a trivial problem of size 1. By
log n we mean log₂ n by default; we write ln n for logarithms in base e. The base of
a logarithm usually doesn't matter in an asymptotic running time expression, since
logs of different bases differ only by constant factors. The reader should be comfortable
manipulating expressions involving logarithms. For example, log n^k = k log n,
2^{log n} = n, and a^{log b} = b^{log a}.
Sequences. We use subscripts to write a sequence of elements as A_1, A_2, …, A_n.
If the elements represent characters drawn from some alphabet, we sometimes call
the sequence a string. The words list and array can also refer to a sequence, but we
typically use these only for sequences that are specifically implemented as linked lists
or arrays. We use the notation A[1], A[2], …, A[n] for the elements of a sequence
stored in an array in memory. Our sequences typically start at index 1 instead of
index 0. Finally, a substring or subarray refers to a contiguous block of elements
A_i … A_j within a sequence/string/array, while a subsequence refers to a subset
of the elements of the sequence (taken in the same order as they appear in the
sequence), which may not be a single contiguous block.
Sequence S is lexicographically smaller than another sequence S′ if S_j < S′_j at the
first index j where they disagree, or if there is no disagreement but S is shorter
than S′. Lexicographic ordering is a natural way to order objects that have multiple
components, like sequences, strings, and vectors. For text strings, lexicographic
ordering corresponds to our familiar notion of alphabetic ordering.
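As a quick sanity check of this definition, Python's built-in comparison operators already order its sequence types lexicographically:

```python
# Python compares strings, lists, and tuples lexicographically, matching the
# definition above: order by the first index of disagreement, and order a
# proper prefix before any sequence that extends it.
assert "apple" < "apply"      # first disagreement at index 4: 'e' < 'y'
assert "app" < "apple"        # no disagreement, but "app" is shorter
assert (1, 2, 3) < (1, 3)     # first disagreement at index 1: 2 < 3
```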
Number Bases. We expect the reader to be familiar with the notion of different
number bases. In particular, since digital computers naturally represent all numbers
in binary (base 2), the reader should be comfortable with expressing numbers in
binary. For example, we would write the base-10 number 19 in binary as 10011,
since 19 = 16(1) + 8(0) + 4(0) + 2(1) + 1(1). In any number base, we call the
leftmost digit of a number the most significant digit and the rightmost digit the
least significant digit.
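We can check the example above directly in Python:

```python
# Converting between base 10 and base 2 in Python; the leftmost 1 in the
# binary string is the most significant digit.
assert bin(19) == "0b10011"     # 19 = 16 + 2 + 1
assert int("10011", 2) == 19    # parse a binary string back into an integer
```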
Factorials, Permutations, and Combinations. The number of permutations
(different orderings) of n distinct elements is n! = 1 · 2 · 3 ⋯ n. The number
of combinations (ways to choose an unordered k-element subset from n distinct
elements) is the binomial coefficient $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$.
Back in Section 1.4, we introduced asymptotic notation and motivated its importance
in describing algorithmic running times. We learned that T(n) = O(f(n)) gives
an asymptotic upper bound on T, that T(n) = Ω(f(n)) gives an asymptotic lower
bound, and that T(n) = Θ(f(n)) asserts both bounds at once. We can round out
this list with the following additional definitions: T(n) = o(f(n)) means that T
grows strictly more slowly than f, and T(n) = ω(f(n)) means that T grows strictly
faster than f.
On occasion, we will encounter two special functions that grow extremely slowly,
much slower even than log n. These functions are generally excellent news in terms
of efficiency, since they give us very small running times even for huge values of n.
The log* n Function. Imagine that you have a calculator with an f(n) key, so for
example pressing the key three times would transform n into f(f(f(n))). Assuming
f is decreasing, we define f*(n) as the number of key presses needed to decrease n
down to a value that is at most 1. For example, if f(n) = n − 2, then f*(n) = ⌊n/2⌋,
since n/2 subtractions of 2 are needed. If f(n) = n/2, then f*(n) = ⌈log n⌉. This
"star" operator is important when dealing with recursively-defined algorithms, since
if an algorithm solves a problem of size n by recursively solving subproblems of size
f(n), then the algorithm will recurse for f*(n) levels. For example, each step
of binary search reduces a problem of size n to a recursive subproblem of size
f(n) = n/2, so binary search recurses to a depth of f*(n) = O(log n).

The star operator applied to f(n) = log n gives the function log* n, one of the
slowest-growing functions we occasionally encounter in the analysis of algorithms.
To give you a sense of just how slowly this function grows, we have log* n = 5 if
n = 2^65536, a number much larger than the number of atoms in the observable
universe. For all practical purposes, we can think of log* n as being constant, although
from a theoretical standpoint we cannot simply ignore such terms since they do
grow as functions of n.
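As a concrete illustration, here is a minimal Python sketch of the star operator (the function names are our own, not from the text):

```python
import math

def star(f, n, stop=1):
    """Return f*(n): how many times f must be applied to n
    before the value drops to at most `stop`."""
    count = 0
    while n > stop:
        n = f(n)
        count += 1
    return count

# f(n) = n/2 gives f*(n) = ceil(log2 n); applying the star operator to
# log itself gives the extremely slow-growing log* function.
assert star(lambda x: x / 2, 1024) == 10
log_star = lambda n: star(math.log2, n)
assert log_star(2 ** 16) == 4          # 65536 -> 16 -> 4 -> 2 -> 1
```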
Problem 8 (Practice Applying the Star Operator). If f(n) = √n, give a
simple proof that f*(n) = Θ(log log n). Repeated square roots never decrease below one,
so please stop at 2 instead of 1 in the definition of f* (typically any constant is fine, as
long as we only care about the asymptotic form of the answer). [Solution]
Problem 9 (The Logsum Function, Binary Codes, and Unbounded Search).
The function logsum(n) = log n + log log n + log log log n + … (with log* n terms) plays an
important role in several algorithmic results. For example, suppose we want to encode in
a single binary stream a sequence of positive integers, with no bounds on their sizes. We
could of course write each number in binary, but we also need to encode information on
the number of bits we will use, since otherwise we won't know where one number stops and
another starts. The clever Elias omega code uses logsum(n) + O(log* n) bits to encode
n by writing n in binary after recursively encoding ⌈log n⌉, the number of bits required for
n (so we encode ⌈log n⌉ in binary using ⌈log log n⌉ bits, before which we write its length
⌈log log n⌉ using ⌈log log log n⌉ bits, and so on). Inspired by this approach, consider now
the related problem of unbounded search for a positive integer n, where we are given no
upper bound on n. A simple approach taking at most 2⌈log n⌉ comparisons to find n is
to start with x = 1 and successively double x until x ≥ n (taking ⌈log n⌉ steps), then to
binary search the range [1, x] for n (taking another ⌈log n⌉ steps). For a challenge, can
you show how to improve on the first half of the algorithm in order to perform unbounded
search in only logsum(n) + O(log* n) comparisons? As a hint, x is just a power of two, so
recursively search for log x. How is the unbounded search problem essentially equivalent
to the binary encoding problem above? [Solution]
The Inverse Ackermann Function. Designing a function that grows even more
slowly than log* n is as simple as just adding more stars. For example, log** n is
the number of times we need to iteratively apply the log* n function to n before we
reach a value no larger than 1. For n > 4, the value of log*⋯* n is at least 3 no
matter how many stars we use, and if we add enough stars, we will eventually reach
a point where log*⋯* n = 3. We define the inverse Ackermann function, α(n), as
one plus the minimum number of stars required to make log*⋯* n ≤ 3 (we add one
to ensure that α(n) is always at least one). On occasion, we will use a stronger
two-term variant α(m, n) ≤ α(n), giving one plus the minimum number of stars
required to make log*⋯* n ≤ 3 + m/n. One can show that α(n) = o(log*⋯* n) for
any constant number of stars, so inverse Ackermann running time bounds are even
faster than those involving multiply-iterated logs. Almost every appearance of an
inverse Ackermann running time originates from just one place: the analysis of a
prominent data structure for maintaining disjoint sets, discussed in Section 4.6.
The Ackermann function is a famous recursively-defined function designed to grow
extremely quickly. Many variants of the Ackermann function appear in the literature,
and consequently one can also find different ways of defining inverse Ackermann
functions, all of which lead to essentially the same asymptotic behavior. We
don't use the Ackermann function directly in this book, so we won't say any more
about it here in the main text. However, we have included further detail in the
endnotes, for the interested reader.
Graphs are particularly prominent objects in computer science, due to their flexibility
in being able to model so many different things.

[Figure 2.1: Examples of: (a) a connected graph with a spanning tree highlighted,
(b) a graph with two connected components, (c) a directed graph where there is no
directed path from i to j, but there is a directed path from j to i, (d) a rooted tree
of height 2.]

Every graph consists of a set of nodes (sometimes called vertices or points) and
edges (sometimes called links or
lines), where each edge connects a pair of nodes. Depending on our physical inter-
pretation of nodes and edges, a graph can represent virtually any kind of network
(transportation, communication, social, etc.), as well as a more abstract object like
a mathematical relation.
Graphs can be undirected or directed. The word graph technically implies an
undirected graph, while digraph specifies a directed graph; we sometimes use the
term graph generically to refer to both. An edge directed from node i to node j in
a directed graph (sometimes called an arc) is written as an ordered pair (i, j), while
an undirected edge between i and j in a graph is written as a set {i, j}, since this has
no directional implications. The notation ij is also sometimes used for describing
either a directed or undirected edge from i to j. We usually assume no edge is a self
loop connecting a node to itself. We generally also assume that our graphs are
simple, meaning there is at most one edge ij for any pair of nodes (i, j). Otherwise,
if multiple parallel edges can run between the same pair of nodes, we have a
multigraph. The degree of a node is the number of edges incident to the node. In a
directed graph, we distinguish indegree (number of incoming edges) and outdegree
(number of outgoing edges). We often consider weighted graphs by associating
numeric weights with edges and/or nodes. Depending on the application, these
may have physical meanings as costs, values, capacities, lengths, and so on.
By convention, n represents the number of nodes in a graph, and m represents
the number of edges. A graph with few edges (where m is closer to n) is said
to be sparse, while a graph with many edges (where m is closer to n²) is said to
be dense. Some algorithms are better suited to sparse graphs, and others to dense
graphs. In practice, most large graphs (e.g., the directed graph depicting the link
structure of the World Wide Web) tend to be quite sparse. The running time of a
graph algorithm will usually depend on both n and m. Since simple graphs satisfy
m ≤ n², we can technically replace m with O(n²) to write these running times solely
in terms of n. However, this is generally not a good idea, since in a sparse graph, an
O(m) running time is more like O(n) than O(n²), and we might discourage someone
from using a fast O(m) algorithm if we describe its running time solely as O(n²).
Paths and Cycles. A path is a series of nodes connected by edges; we can regard
this either as a sequence of nodes or as a sequence of edges. A path that starts and
ends at the same node is called a cycle. In a directed graph, paths and cycles contain
edges that are consistently directed. A path is simple if it contains no cycles, and a
cycle is simple if it contains no cycles shorter than itself. When we say path and
cycle in this book, we mean a simple path or cycle. We use the term walk for a
path that need not be simple. The length of a path or cycle is the number of edges
it contains (or in a weighted graph, the sum of its edge weights). A Hamiltonian
path or cycle visits every node exactly once, and an Eulerian path or cycle follows
every edge exactly once.
Connectivity. Nodes i and j are adjacent if they are directly connected by an
edge, and connected if they are connected by a path. In a directed graph, there
can possibly be an edge or path from i to j but not in the reverse direction. A
graph is connected if every pair of nodes is connected; for example, Figure 2.1(a)
is connected while Figure 2.1(b) consists of two separate connected components.
In Chapter ?? we extend the notion of connectivity and connected components to
directed graphs, as well as higher orders of connectivity (e.g., we can say that two
nodes are highly connected if they are connected by a large number of edge-disjoint
paths, or alternatively if we need to remove a large number of edges to separate
them into distinct connected components).
Trees. A tree is a graph that is connected and acyclic (having no cycles). Often
we designate a specific node in a tree as a root, allowing us to orient the tree so
that every node has a well-defined parent (except the root) and set of children, as
shown in Figure 2.1(d). A tree with no designated root is called a free tree; the
word tree by itself can mean either a free tree or a rooted tree. We will sometimes
take a free tree and root it, designating one of its nodes as a root, after which we
can then operate on it like a rooted tree. Rooted trees play a key role in most data
structures, as we shall see in Chapters 4 through 9. As opposed to reality, rooted
trees in computer science are usually drawn growing downward from a root at the
top. Accordingly, the depth of a node in a rooted tree is its distance downward from
the root, and the height of a rooted tree is the maximum depth over all nodes. Node
i is an ancestor of node j in a rooted tree if j appears inside the subtree rooted at i
(so i lies on the path from j up to the root). A node of degree 1 (in a rooted tree,
a node with no children) is called a leaf. Other nodes are called internal nodes.
Subgraphs and Spanning Trees. A subset of the edges of a graph is called
a subgraph. A subset of nodes induces a subgraph consisting of all edges with
both endpoints in the subset. A particularly common subgraph in algorithmic
computing is a spanning tree a subset of edges that connects together all nodes
and contains no cycles, as shown in Figure 2.1(a). For example, a spanning tree in
a communication network gives us a minimal subset of edges within which we can
still route information between all pairs of nodes. This simplifies the routing task
as well, since within any tree there is a unique path joining any given pair of nodes.
In the case where we want to broadcast information from one node r to all other
nodes, we end up with a tree rooted at r. Such trees are often thought of as being
directed outward from the root r. In Chapter ?? we will more formally study
directed notions of trees, known as branchings and arborescences.

[Figure 2.2: Representing graphs and trees in memory: (a) an adjacency matrix,
(b) adjacency lists, (c) a rooted tree stored with parent, first-child, and
previous/next-sibling pointers, (d) a binary tree stored with parent, left-child,
and right-child pointers.]
Representing a Graph in Memory. Two common ways to represent a graph are
using an adjacency matrix or with adjacency lists. As shown in Figure 2.2(a), the
adjacency matrix is an n × n matrix in which the (i, j) entry is 1 if there is an edge
from i to j, and 0 otherwise. For an undirected graph, the matrix is symmetric since
the (i, j) and (j, i) entries are equal. Adjacency matrices are well-suited for dense
graphs; otherwise, they can be a liability both in terms of space and running time. A
graph algorithm utilizing adjacency matrices typically cannot have a better running
time than Θ(n²) due to the need to examine the entire input matrix. Furthermore,
although adjacency matrices allow us to query whether an edge (i, j) is present
in only O(1) time, they require Θ(n) time to enumerate the neighbors of a node,
regardless of how many neighbors there are.
Figure 2.2(b) shows a graph represented by adjacency lists, where we maintain for
every node i an array or linked list of the neighbors of i. In a directed graph, we
usually only maintain edges directed out of i, although it is sometimes convenient
to maintain a list of incoming edges as well. Adjacency lists are ideal for sparse
graphs, since they require only Θ(m + n) space (linear in the size of the graph).
Sometimes we store adjacency lists in a fancier data structure such as a hash table
(Chapter 7) in order to allow for fast queries to see if an edge (i, j) is present, as
well as fast insertion and deletion of edges.
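To make the trade-offs concrete, here is a small Python sketch building both representations for a tiny undirected graph (the variable names are our own):

```python
# Build both representations for the undirected graph with edges
# {1,2}, {1,4}, {3,4} on n = 4 nodes (1-indexed, as in the text).
n = 4
edges = [(1, 2), (1, 4), (3, 4)]

# Adjacency matrix: Theta(n^2) space, O(1) edge-presence queries.
matrix = [[0] * (n + 1) for _ in range(n + 1)]
for i, j in edges:
    matrix[i][j] = matrix[j][i] = 1

# Adjacency lists: Theta(m + n) space, neighbor enumeration in time
# proportional to the degree.
adj = {v: [] for v in range(1, n + 1)}
for i, j in edges:
    adj[i].append(j)
    adj[j].append(i)

assert matrix[1][4] == 1 and matrix[2][3] == 0
assert sorted(adj[4]) == [1, 3]
```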
Representing a Tree in Memory. Since a free tree on n nodes contains only n − 1
edges, it is typically represented just like any other sparse graph, using adjacency
lists. In the event that our tree is rooted, we can augment our adjacency lists
to indicate which neighbor is the parent and which are the children. Another
common way to store a rooted tree in memory is shown in Figure 2.2(c), where
each node maintains a pointer to its parent, first child, and previous and next
sibling. Although the children of a node are stored in a particular order, whether
this ordering is significant depends on the application. Binary
trees are particularly common among rooted trees, where each node has at most
two children, each designated as a left or right child. A node can have no children,
only a left child, only a right child, or both. We usually store a binary tree as shown
in Figure 2.2(d), where each node points to its parent, left child, and right child.
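A minimal Python sketch of the pointer structure from Figure 2.2(c) (class and field names are our own):

```python
# One node of a rooted tree stored with parent, first-child, and sibling
# pointers. For a binary tree (Figure 2.2(d)) we would instead keep
# parent, left, and right pointers.
class TreeNode:
    def __init__(self, label):
        self.label = label
        self.parent = None        # None at the root
        self.first_child = None
        self.prev_sibling = None
        self.next_sibling = None

    def children(self):
        """Enumerate this node's children by walking the sibling chain."""
        child = self.first_child
        while child is not None:
            yield child
            child = child.next_sibling
```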
We will eventually need quite a bit more graph terminology, but we will postpone
introducing it until it is needed, when we study graphs in much greater detail.
Problem 10 (Proof Practice). This problem provides some fun and challenging
warm up exercises to re-familiarize the reader with proofs by induction and by contra-
diction.
(a) Prove by induction that any integer of the form 4^k − 1 must be a multiple of 3.
[Solution]
(b) Prove by induction Kraft's inequality, which states that $\sum_i 2^{-d_i} \le 1$ in any binary
tree, where the sum is over every leaf i, and d_i is the depth of leaf i (the root has depth
zero). [Solution]
(c) Given a set of n points in the 2D plane, a line is odd if it contains an odd number
of points from the set, and even otherwise. Argue by contradiction that among all
vertical and horizontal lines, there cannot be a single unique odd line. [Solution]
(d) A separator in a connected graph is a set of nodes whose removal breaks the graph
into two or more different connected components. Prove by contradiction that every
connected graph on n nodes must have either a separator containing at least √n nodes
or a simple path containing at least √n nodes. [Solution]
(e) In a group of n people, suppose each person has at least n/2 friends. Please prove
by contradiction that it is possible to seat everyone around a circular table so that all
adjacent pairs are friends. [Solution]
In the rest of this section, we highlight several techniques, tricks, and insights that
can be helpful for the mathematical analysis of an algorithm.
[Figure: two panels, (a) and (b); panel (a) shows quantities a, b, and c, and panel (b)
compares the slopes a1/b1 and a2/b2 with the combined slope (a1 + a2)/(b1 + b2).]
A frequently useful fact is that log(n!) = Θ(n log n). This is easy to show, since n!
is between (n/2)^{n/2} and n^n, the log of both being Θ(n log n). Stirling's
approximation says that n! behaves asymptotically like √(2πn) · (n/e)^n, and also
provides useful bounds on the binomial coefficient (n choose k):
$$\left(\frac{n}{k}\right)^k \;\le\; \binom{n}{k} \;\le\; \left(\frac{ne}{k}\right)^k.$$
Many variants of Stirling's approximation exist that can provide stronger bounds
on n!. However, these are rarely necessary for algorithmic analysis.
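As a quick numeric illustration (ours, not from the text), we can check how closely Stirling's approximation tracks n!:

```python
import math

# Compare n! against Stirling's approximation sqrt(2*pi*n) * (n/e)^n.
# The ratio tends to 1 as n grows (the relative error is roughly 1/(12n)).
for n in (5, 10, 20):
    stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
    print(n, math.factorial(n) / stirling)   # 1.0167..., 1.0083..., 1.0041...
```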
Another inequality we use constantly is
$$1 + x \le e^x.$$
Looking at the graphs of 1 + x and e^x, we see that this inequality is only tight
near x = 0, where equality is attained. For a stronger bound, we could take more
terms from the Taylor expansion e^x = 1 + x + x²/2! + x³/3! + ⋯ for a higher-order
polynomial bound like 1 + x + x²/2 ≤ e^x (for x ≥ 0). However, the simpler linear
bound is usually all we need.

A useful generalization of the bound above states that for x ≥ −1 and y ≥ 1,
$$1 + xy \;\le\; (1 + x)^y \;\le\; e^{xy}.$$
The right-hand bound follows directly from 1 + x ≤ e^x, and we will prove the
left-hand bound two different ways over the next few pages.
To illustrate a prototypical use of this bound, many algorithms follow an iterative
refinement strategy where they start with a sub-optimal solution and repeatedly
improve it over a series of iterations until it finally becomes optimal. If we start with
a solution of value 1 and make only additive improvements in constant increments,
then our algorithm could take Θ(V) total iterations, where V is the value of an
optimal solution. However, if we make geometric improvements (increasing the
value by some constant multiplicative factor, say at least 1%), then our solution
doubles every 100 iterations, since (1 + 1/100)^100 ≥ 2. Therefore, it only takes
at most 100 log V = O(log V) iterations to reach an optimal solution. You may
have seen similar uses of this bound when computing the compound interest on an
investment. For example, if you earn 5% interest per year, then your investment
will double at least every 20 years, since 2 ≤ (1 + 0.05)^20 ≤ e (the actual amount,
approximately 2.6533, is much closer to e ≈ 2.718 than 2, since (1 + 1/n)^n converges
to e as n grows large).

As another example, if each of n iterations of an algorithm fails independently with
probability at most 1/(2n), then the probability the entire algorithm succeeds is at
least (1 − 1/(2n))^n ≥ 1/2 (we will see another way to reach this conclusion later
when we study the union bound in probability theory).
The Harmonic Series. The harmonic series,
$$H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n},$$
appears in many of our analyses. It is useful to know that the sum of the first n
terms of the harmonic series behaves like ln n:
$$H_n \in [\ln n,\; 1 + \ln n].$$
The standard [proof] of this fact uses the clever trick of bounding a discrete summation
using a continuous function. You will find many algorithms with running
times containing log terms on account of the fact that H_n = Θ(log n).
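A quick Python check of this interval (illustrative only):

```python
import math

# H_n lands in the interval [ln n, 1 + ln n], as claimed above.
for n in (10, 1000, 10 ** 6):
    h = sum(1 / k for k in range(1, n + 1))
    assert math.log(n) <= h <= 1 + math.log(n)
```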
Proving Equalities Using a Pair of Inequalities. To show that A = B, we
often prove separately that A ≤ B and that A ≥ B. Although simple, this approach
is used countless times in mathematical proofs. On a related note, we often show
that two statements A and B are equivalent by proving separately that A implies
B, and that B implies A, and we can show that two sets A and B are equal by
proving separately that A ⊆ B and that B ⊆ A.
Proving Inequalities by Minimization or Maximization. Suppose we want
to show that f(n) ≥ g(n), otherwise written f(n) − g(n) ≥ 0. If f and g happen to
be easy to minimize (say, if they are continuous functions with easily computable
derivatives), we can prove that f(n) − g(n) ≥ 0 by showing that the minimum
possible value of f(n) − g(n) is no smaller than 0. For example, recall our earlier
inequality that 1 + xy ≤ (1 + x)^y for x ≥ −1 and y ≥ 1. A reasonably easy way to
prove this (after having taken a multi-variable calculus course) is by showing that
the function f(x, y) = (1 + x)^y − (1 + xy) attains a minimum value of 0 (at y = 1),
if we minimize over all possible choices for x ≥ −1 and y ≥ 1.
Minimum, Average, and Maximum. For any set of numbers a_1, …, a_n, the
average value lies between the minimum and the maximum:
$$\min_{i=1\ldots n} a_i \;\le\; \frac{1}{n}\sum_{i=1}^{n} a_i \;\le\; \max_{i=1\ldots n} a_i.$$
This obvious yet important fact is used again and again in algorithmic analyses.
Creative Ways to Add. Many of our analyses add things up in clever ways,
such as by reversing the order of a double summation (e.g., adding a table column-
by-column instead of row-by-row). For example, suppose f_i denotes the number
of distinct integer factors of the integer i, and that we want to know the sum
f_1 + … + f_n. Letting the indicator function [x|i] represent the value 1 if x divides
i, and 0 otherwise, an effective way to estimate the sum is
$$\sum_{i=1}^{n} f_i = \sum_{i=1}^{n} \sum_{x=1}^{n} [x|i] = \sum_{x=1}^{n} \sum_{i=1}^{n} [x|i] \;\le\; \sum_{x=1}^{n} n/x = nH_n \approx n \ln n.$$
This is related to a technique called double counting, where we add up the same
quantity two dierent ways to show that two expressions are equal. For example,
if we sum the degrees of all nodes in a graph, we count each endpoint of each edge
once. If we add up all the edges and then multiply by two, we also count each
endpoint of each edge once (since each edge has two endpoints). Hence, the sum of
degrees in any graph is 2m.
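Here is a small Python sketch (our own) that adds the factor-count table both row-by-row and column-by-column, confirming that the two orders of summation agree:

```python
import math

# Add the same table two ways: row-by-row (count the divisors of each i) and
# column-by-column (for each x, count the multiples of x that are at most n).
n = 1000
row_sum = sum(sum(1 for x in range(1, i + 1) if i % x == 0)
              for i in range(1, n + 1))
col_sum = sum(n // x for x in range(1, n + 1))
assert row_sum == col_sum
print(row_sum, n * math.log(n))   # the total is close to n*ln(n) = 6907.75...
```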
[Figure 2.4: In (a), a convex set: any line segment connecting two points p and
q in the set lies completely within the set, and any supporting hyperplane
tangent to the set contains the entire set on one of its sides. A non-convex
set is shown in (b). As shown in (c), if f is a convex function, then the space
above f is a convex set, so the line segment connecting the points (x1, f(x1)) and
(x2, f(x2)) lies above f, and any tangent hyperplane (drawn here tangent at the
point (x3, f(x3))) lies below f.]
Later in this chapter, we will learn about the expected value E[X] of a random
variable X. Since expected value is nothing more than a weighted sum of values
X can take (weighted by the corresponding probabilities of X taking those values),
Jensen's inequality tells us the useful property that f(E[X]) ≤ E[f(X)] if f is
convex, with the reverse being true if f is concave.
2.3 Recurrences
Many algorithms break a large problem into smaller subproblems of the same form,
recursively solve these, and then somehow recombine their solutions to obtain a so-
lution for the original problem. It is often convenient to express the running time of
these recursive algorithms using recursively-defined expressions called recurrences.
Here is a simple example. Given a numeric array A[1 … n], the maximum value
subarray problem asks us to find a contiguous subarray A[i … j] having maximum
sum (we assume the array contains some negative numbers, since otherwise the
problem is trivial). The obvious Θ(n³) time brute force solution is to loop over
all (n choose 2) = Θ(n²) possible subarrays A[i … j], and to sum each one in O(n) time.
We can easily improve this to Θ(n²) total time using prefix sums: first precompute
an array P[1 … n] where P[j] = A[1] + … + A[j]. This is easy to do in Θ(n) time
by scanning through A while maintaining a running sum. After computing P, the
sum of A[i … j] is now given in constant time by P[j] − P[i − 1]. As it turns out,
one can do much better still. Here, we describe a recursive algorithm running in
Θ(n log n) time, and in Chapter 11 we will learn an even simpler Θ(n) algorithm
based on the technique of dynamic programming!
For our recursive solution, think of A[1 … n] as two half-sized arrays L = A[1 … n/2]
and R = A[n/2 + 1 … n]. The answer is then given by the best of three solutions
(see the sketch after this list):

- The best solution entirely within L, which we can find by recursively applying
  our algorithm to L,
- The best solution entirely within R, which we can find by recursively applying
  our algorithm to R, and
- The best solution spanning both L and R, which we find by taking the best
  suffix of L added to the best prefix of R. These are both easy to compute
  in Θ(n) time by computing suffix sums of L (scanning backward through L
  keeping a running sum), and also prefix sums of R (scanning forward through
  R keeping a running sum).
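Here is one possible Python rendering of this divide-and-conquer algorithm (a sketch, not the book's code; for simplicity it returns only the best sum, not the subarray itself):

```python
def max_subarray_sum(A):
    """Maximum sum of a non-empty contiguous subarray, in Theta(n log n)."""
    n = len(A)
    if n == 1:
        return A[0]
    mid = n // 2
    L, R = A[:mid], A[mid:]

    # Spanning solution: best suffix sum of L plus best prefix sum of R.
    best_suffix, running = float("-inf"), 0
    for x in reversed(L):
        running += x
        best_suffix = max(best_suffix, running)
    best_prefix, running = float("-inf"), 0
    for x in R:
        running += x
        best_prefix = max(best_prefix, running)

    return max(max_subarray_sum(L), max_subarray_sum(R),
               best_suffix + best_prefix)

assert max_subarray_sum([2, -4, 3, -1, 5, -9, 4]) == 7   # 3 + (-1) + 5
```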
In total, our algorithm makes two recursive calls to subproblems of size n/2 and
spends Θ(n) additional time outside these calls. If T(n) denotes the running time
of our algorithm on an input of size n, we can therefore write the following recursive
formula for T(n):
$$T(n) = 2T(n/2) + \Theta(n).$$
As a base case, since a problem of constant size can be solved in constant time, we
have T(n) = O(1) for n = O(1).
In this section, we learn how to solve common types of recurrences to produce an
explicit (non-recursive) formula for T(n). For instance, the solution to the example
above is T(n) = Θ(n log n).
so that it takes the form above. First, we can replace the non-recursive part
(17n³ log n − 3n² log⁴ n + 5) with an asymptotic placeholder (Θ(n³ log n)), since
leading constants and lower-order terms in this part won't affect the asymptotic
solution. In fact, we can even drop the Θ(·) entirely:
$$T(n) = 2T(n/6 + \sqrt{n} + 1) + 3T(\lfloor n/8 \rfloor) + n^3 \log n.$$
Small additive offsets like +√n + 1 inside recursive calls also have no impact on
the final asymptotic solution, and can safely be ignored. In fact, this is surprisingly
true even for larger additive offsets of magnitude up to O(n/log² n). Accordingly,
we can disregard floor and ceiling functions, which are just additive offsets of at
most one. After applying these changes, we have the much simpler recurrence
T(n) = 2T(n/6) + 3T(n/8) + n³ log n.

Algebraic Expansion. As a simple example of solving a recurrence by repeatedly
expanding it until a pattern emerges, consider T(n) = 2T(n/2) + n:
$$\begin{aligned}
T(n) &= 2T(n/2) + n \\
&= 2[2T(n/4) + n/2] + n \quad\text{(expanding } T(n/2)) \\
&= 4T(n/4) + n + n \\
&= 4[2T(n/8) + n/4] + n + n \quad\text{(expanding } T(n/4)) \\
&= 8T(n/8) + n + n + n \\
&\;\;\vdots \\
&= n \cdot \underbrace{T(1)}_{O(1)} + \underbrace{n + n + \cdots + n}_{\log n \text{ terms}} \\
&= \Theta(n \log n).
\end{aligned}$$
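We can sanity-check this solution numerically; the following sketch (ours) evaluates the recurrence exactly and compares it against n log n:

```python
import math
from functools import lru_cache

# Evaluate T(n) = 2*T(n // 2) + n with T(1) = 1 exactly, and compare against
# n*log2(n). At n = 2^k the exact value is (k + 1)*2^k, so the ratio is
# (k + 1)/k, which tends to 1.
@lru_cache(maxsize=None)
def T(n):
    return 1 if n <= 1 else 2 * T(n // 2) + n

for k in (10, 15, 20):
    n = 2 ** k
    print(n, T(n) / (n * math.log2(n)))   # 1.1, 1.0666..., 1.05
```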
Tree Expansions. Suppose we arrange the terms resulting from algebraic expansion
in their natural hierarchical layout as a tree, then add everything up level by
level, as shown in Figure 2.5. It turns out that the level sums always give a geometric
series, which we can sum with minimal effort, owing to the following insight:

- A decreasing geometric series (even an infinite one), where each term decays
  by a constant factor, behaves asymptotically like its first term. For example,
  n² + (3/4)n² + (3/4)²n² + ⋯ = Θ(n²).
- By symmetry, an increasing geometric series behaves asymptotically just like
  its last term.

We can tell if the geometric series obtained from our recursion tree is decreasing,
increasing, or unchanging after expanding the tree out by only a single level (which
is often possible to do mentally, after some practice). Knowing the nature of the
series is all we need to solve the recurrence. For example, consider a recurrence
with the simple form
$$T(n) = aT(n/b) + n^\alpha.$$
[Figure 2.5: Expanding the recurrence T(n) = 3T(n/2) + n² into a tree and summing
level by level. The per-level contributions form a decreasing geometric series:
n², (3/4)n², (3/4)²n², and so on.]
If our geometric series decreases as we scan down the tree, then the root contribution
n^α is dominant, so the recurrence solves to T(n) = Θ(n^α). If the series remains
unchanging, then each of the log_b n levels contributes n^α, so the answer is T(n) =
Θ(n^α log n). Finally, if the series increases, then the contribution from the leaves
will be dominant. Each leaf contributes O(1), and the number of leaves in a tree
with depth log_b n and branching factor a is a^{log_b n} = n^{log_b a}. The solution is therefore
T(n) = Θ(n^p), where p = log_b a.

Stated more concisely, if T(n) = aT(n/b) + n^α (with T(n) = O(1) for n = O(1) as
a base case), then
$$T(n) = \begin{cases} \Theta(n^\alpha) & \text{if } \alpha > p \text{ (decreasing series)} \\ \Theta(n^\alpha \log n) & \text{if } \alpha = p \text{ (unchanging series)} \\ \Theta(n^p) & \text{if } \alpha < p \text{ (increasing series)} \end{cases}$$
where p = log_b a. We have simplified this formula a bit by noting that we can tell
the nature of the series (decreasing, unchanging, or increasing) simply by comparing
α with p.

The same tree-based approach extends to a recurrence with multiple recursive
terms. We can solve this exactly as before, by expanding it
into a tree and summing a geometric series. The only difference is that the tree is
no longer level at the bottom, since n decreases at different rates down different
branches. Accordingly, the level-by-level contribution starts out as a geometric
series but then behaves slightly differently toward the bottom of the tree once some
branches start ending. However, this only ends up changing the answer in the
third case above (an increasing series), where the contribution from the leaves is
dominant:
$$T(n) = \begin{cases} \Theta(n^\alpha) & \text{if } \alpha > p \text{ (decreasing series)} \\ \Theta(n^\alpha \log n) & \text{if } \alpha = p \text{ (unchanging series)} \\ \Theta(n^p) & \text{if } \alpha < p \text{ (increasing series)} \end{cases}$$
A final variant adds a logarithmic factor to the non-recursive term, giving
T(n) = aT(n/b) + n^α log^β n, where β ≥ 0. Here, we can initially ignore the
extra log^β n term while resolving the nature of our geometric series. The term
then re-appears in the solution unless we are in the increasing case where the
leaves dominate:
$$T(n) = \begin{cases} \Theta(n^\alpha \log^\beta n) & \text{if } \alpha > p \text{ (decreasing series)} \\ \Theta(n^\alpha \log^{\beta+1} n) & \text{if } \alpha = p \text{ (unchanging series)} \\ \Theta(n^p) & \text{if } \alpha < p \text{ (increasing series)} \end{cases}$$
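The following Python sketch (function name and interface are our own) mechanically applies these three cases:

```python
import math

def solve(a, b, alpha, beta=0):
    """Asymptotic solution of T(n) = a*T(n/b) + n**alpha * log(n)**beta,
    following the three cases above (assumes beta >= 0)."""
    p = math.log(a, b)                     # p = log_b a
    if math.isclose(alpha, p):             # unchanging series
        return f"Theta(n^{alpha} log^{beta + 1} n)"
    if alpha > p:                          # decreasing series: root dominates
        return f"Theta(n^{alpha} log^{beta} n)" if beta else f"Theta(n^{alpha})"
    return f"Theta(n^{round(p, 3)})"       # increasing series: leaves dominate

print(solve(2, 2, 1))     # Theta(n^1 log^1 n): mergesort-style recurrences
print(solve(3, 2, 2))     # Theta(n^2): the recurrence from Figure 2.5
print(solve(4, 2, 1))     # Theta(n^2.0): leaves dominate
```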
[Figure 2.6: A Venn diagram showing two events E1 and E2 after rolling a
6-sided die.]
2.4 Probability
Randomization often provides a way to design algorithms that are fast, simple, and
elegant. To analyze them, however, we need to equip ourselves with tools from
probability theory. This section summarizes most of the tools we will need.
For any event E, we have
$$\Pr[\overline{E}] = 1 - \Pr[E],$$
where Ē, the complement of E, is the set of all outcomes not in E (in other words,
the event that E does not occur). It is often easier to directly compute Pr[Ē] rather
than Pr[E]; for instance, the probability we see heads at least once in 10 coin flips
is 1 − 1/2^10, where 1/2^10 is the probability of the complement, that we see all tails.
Since events are nothing more than sets of outcomes, we can use set union and
intersection to express relationships between them. If E1 and E2 are two events,
then E1 ∪ E2 is the event that either one or both events occur, and E1 ∩ E2 is the
event that both occur.
The Union Bound. The probability that either E1 or E2 (or both) occurs is
$$\Pr[E_1 \cup E_2] = \Pr[E_1] + \Pr[E_2] - \Pr[E_1 \cap E_2].$$
This is obvious from the Venn diagram pictured in Figure 2.6, since Pr[E1] + Pr[E2]
counts all the outcomes in E1 and E2, but counts the outcomes in E1 ∩ E2 twice,
so we compensate by subtracting out Pr[E1 ∩ E2]. A more general form of this idea is
the inclusion-exclusion principle, where we compute Pr[E1 ∪ … ∪ Ek] for a union
of k events by adding their individual probabilities, subtracting the probabilities of
all intersecting pairs of events, adding the probabilities of all intersecting triples of
events, and so on in an alternating fashion. If we only want a rough upper bound,
however, we can say that
$$\Pr[E_1 \cup \cdots \cup E_k] \;\le\; \Pr[E_1] + \cdots + \Pr[E_k],$$
where equality holds only if all events are disjoint (Eᵢ ∩ Eⱼ = ∅ for all pairs of
events). This is known either as the union bound or as Boole's inequality, and it
is an extremely useful tool in many of our analyses. It follows easily from the fact
that every outcome in E1 ∪ … ∪ Ek has its probability counted exactly once on the
left-hand side, but potentially several times on the right-hand side, depending on
the number of events in which the outcome is contained.
The union bound tells us that the failure probability of a complex system is at most
the sum of the failure probabilities of its individual parts. A machine with 200 parts,
each failing with probability at most 10^-6, has an overall probability of failure (i.e.,
of one or more parts failing) at most 200 × 10^-6. In the context of a randomized
algorithm, suppose an algorithm spends O(f(n)) time on a single generic input
element with probability at least 1 − 1/(2n), so it fails to process this element
quickly enough with probability at most 1/(2n). Taking a union bound over all n
elements, the probability of failure on any element is at most 1/2, so consequently
the algorithm runs in O(nf(n)) time with probability at least 1/2.
Conditional Probability. The probability of event A given that event B occurs
is written as
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}.$$
Otherwise written, Pr[A ∩ B] = Pr[A | B] · Pr[B].¹ Since by symmetry we also have
Pr[A ∩ B] = Pr[B | A] · Pr[A], we can easily derive Bayes' rule,
$$\Pr[A \mid B] = \frac{\Pr[B \mid A] \cdot \Pr[A]}{\Pr[B]},$$
which is useful for relating Pr[A | B] and Pr[B | A]. We will use Bayes' rule quite
often when we study machine learning later in Chapter ??.
Independence. Two events A and B are independent if knowledge about the
occurrence of one event does not change the probability of occurrence of the other.
¹ For those familiar with calculus, conditional probabilities behave in much the same way as the
chain rule for derivatives. For example, the derivative of f(g(h(x))) is f′(g(h(x))) · g′(h(x)) · h′(x).
Similarly, we can write Pr[A ∩ B ∩ C] = Pr[A | B, C] · Pr[B | C] · Pr[C].
Formally, A and B are independent if and only if any of the following equivalent
conditions are true:

- Pr[A | B] = Pr[A],
- Pr[B | A] = Pr[B], or
- Pr[A ∩ B] = Pr[A] · Pr[B].
Monte Carlo randomized algorithms can give incorrect answers if they are unlucky.
In order to persuade anyone to use them, we need to ensure that the probability
of an incorrect answer is extremely small. We would like our algorithms to succeed
with high probability, which usually means the following in the computing literature:
We say a Monte Carlo algorithm with input size n is correct with high
probability if it fails with probability at most 1/n^c for any constant
c > 0 of our choosing. More formally, given any constant c > 0, we can
set the hidden constant in our running time appropriately such that
$$\Pr[\text{algorithm fails}] \le 1/n^c.$$
The interesting (and to some, confusing) aspect of this definition is the part where
c is allowed to be any constant of our choosing. In order to claim a high probability
bound, we must be able to reduce the failure probability to 1/n^2, 1/n^10, 1/n^1000,
or 1/n^c for any other constant c. The impact of our choice of c shows up only in
the hidden constant in the running time of our algorithm, so it does not change the
overall asymptotic running time.
To see that the definition above is a reasonable way to define "with high probability",
observe that you might be happy with an algorithm that only fails at most 1% of
the time, but other situations may call for an even more robust algorithm that
fails at most 0.001% of the time. It isn't really satisfactory for us to pick any
fixed constant threshold, since there is no obvious choice that would be universally
acceptable. Instead, we use a threshold based on the input size, n. For example,
we could say that a Monte Carlo algorithm is correct with high probability if it
fails with probability at most 1/2^n on inputs of size n. This way, we are virtually
guaranteed that large inputs are handled correctly. It is true that the bound is
weaker for small inputs, but this is reasonable since with small inputs, there are
only a limited number of random outcomes that can occur, and if even one of these
is bad, then our overall probability of failure may be somewhat large. Just as the
most important feature of running time is how it scales with problem size, the most
meaningful definition of "with high probability" also scales with problem size. It
turns out that an exponentially-small error bound like 1/2^n is a bit too difficult
to achieve with most problems, so the standard definition above requires a slightly
weaker polynomially-small error bound. Of course, nobody will complain if your
algorithm manages to satisfy an even stronger bound.
Boosting Success Probability Through Repetition. Suppose we have a Monte
Carlo randomized algorithm that produces a correct answer with probability at least
1/2. As long as the algorithm has the ability to detect when it makes a mistake, we
can boost its success probability from a constant guarantee like 1/2 to a high
probability guarantee by simply running O(log n) independent trials of the algorithm.
If we perform, say, c log n trials, then our algorithm only fails if every individual
trial fails, and this happens with probability at most (1/2)^{c log n} = 1/n^c. Since we
can select c to be any constant we like, this gives us a high probability guarantee of
success. If we want a failure probability of at most 1/n^10, we choose c = 10. If we
want a failure probability of at most 1/n^100, we choose c = 100. In any case, we
still perform only O(log n) iterations of our algorithm, with c disappearing as the
hidden constant.
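A quick simulation sketch of this boosting argument, for a hypothetical Monte Carlo step that succeeds with probability 1/2 and can detect its own failures (the function names are ours):

```python
import math
import random

def boosted_success(n, c, rng):
    """Run c*log2(n) independent trials of a step that succeeds with
    probability 1/2 (failures assumed detectable); True if any succeeded."""
    trials = int(c * math.log2(n))
    return any(rng.random() < 0.5 for _ in range(trials))

rng = random.Random(1)
n, c = 1024, 3
print(1 / n ** c)                                              # bound: ~9.3e-10
print(all(boosted_success(n, c, rng) for _ in range(10_000)))  # almost surely True
```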
(a) Suppose that we have n individuals, each with a birthday chosen independently at
random from m possible days. Let D be the event that all n individuals have different
birthdays. Please write an exact mathematical expression for Pr[D]. Using bounds
from earlier in this chapter, please show that (i) Pr[D] ≥ 3/4 if n ≤ √m/2, and
(ii) Pr[D] ≤ 1/e if n ≥ 2√m. For simplicity, please assume m is a perfect square.
[Solution]

(b) Using a union bound over all (n choose 2) pairs of individuals, please show that Pr[D] ≥ 1/2
if n ≤ √m. This result will play an important role in the analysis of hash table data
structures in Chapter 7, where we exploit the fact that if n elements are randomly
mapped to a table of size n², there is at least a 1/2 probability that no two elements
collide at the same location. [Solution]
(c) In a simple peer-to-peer network made up of m computers, it may be the case that
the only way to find a particular object (e.g., a file) is to query computers one-by-one
until we find a computer hosting the object we seek. In order to speed this process up,
suppose we take each object of interest and replicate it on √m computers throughout
the network, chosen arbitrarily. To search for an object, we probe O(√m log m)
randomly-chosen computers and succeed if one of them has the object in question.
Please argue that this procedure guarantees a high probability of success. [Solution]
(d) In a lake with m fish, show how to estimate m to within some constant factor with
high probability by catching, marking, and releasing only O(√m log m) random fish.
The result of problem 32 may help. [Solution]
High Probability Guarantees for Las Vegas Algorithms. With Las Vegas
randomized algorithms, the output is always correct but the running time can vary
depending on luck. It may be hard to convince someone to use the algorithm unless
we can persuade them that it is unlikely that the running time will be significantly
larger than some target running time. In other words, we would like an algorithm
with a running time guarantee that holds with high probability.
We say a Las Vegas algorithm with input size n runs in O(f(n)) time
with high probability if it fails to run in O(f(n)) time with probability at
most 1/n^c, where c > 0 is any constant of our choosing. More formally,
given any constant c > 0, we can find a constant k such that
$$\Pr[\text{running time exceeds } k \cdot f(n)] \le 1/n^c.$$

We define a high probability running time bound for a Las Vegas algorithm in a
very similar fashion to that of a Monte Carlo algorithm, in terms of a polynomially-
small failure probability.
The randomized reduction lemma³ is quite natural since it builds upon a very
familiar principle: if we start with a problem of size n and apply an algorithm that,
in each iteration, reduces the effective problem size to at most some constant fraction
q ∈ [0, 1) of its original size, then the algorithm will perform O(log n) iterations.
The prototypical example of this principle is binary search, with q = 1/2. All the
lemma above says is that we achieve the same logarithmic performance (with high
probability) as long as there is a good chance of problem size reduction in each
iteration.
Remarkably, this lemma (which we prove later) is the only tool we need in order to
argue almost all of the high probability bounds in this book! For example, with
randomized binary search we can easily show that each iteration reduces our problem

² In fact, it is actually meaningless to define "with high probability" in this case using a fixed
constant threshold as a failure probability. If an algorithm has an expected running time of O(n²)
(we will say in a minute what "expected" means), then we can adjust the hidden constant in
the O(·) notation using Markov's Inequality (also discussed shortly) to change the probability that
it fails to run in O(n²) time to any constant of our choosing: 1%, 0.001%, and so on.

³ Note that this is a name you will only find here in this book, since the lemma does not seem to
have a standard name in the literature.
to at most q = 2/3 of its original size with probability at least p = 1/3: let A denote
the n/3 smallest elements in our array, let B denote the n/3 middle elements, and
let C denote the n/3 largest elements. With probability 1/3, we choose a pivot
element from B and reduce our problem to at most 2/3 of its original size, since
this eliminates either A or C from consideration (depending on the comparison with
the pivot). Therefore, the randomized reduction lemma immediately tells us that
randomized binary search runs in O(log n) time with high probability.
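For concreteness, here is a sketch of randomized binary search with a uniformly random pivot (our own rendering):

```python
import random

def randomized_binary_search(A, target, rng=random):
    """Search a sorted array using a uniformly random pivot. Each iteration
    lands in the middle third B with probability 1/3, in which case the
    remaining range shrinks to at most 2/3 of its size, so the depth is
    O(log n) with high probability by the randomized reduction lemma."""
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        pivot = rng.randint(lo, hi)
        if A[pivot] == target:
            return pivot
        if A[pivot] < target:
            lo = pivot + 1
        else:
            hi = pivot - 1
    return -1

A = list(range(0, 200, 2))
assert randomized_binary_search(A, 58) == 29
assert randomized_binary_search(A, 59) == -1
```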
Note that the randomized reduction lemma can be applied even in situations when
we aren't explicitly reducing the size of a problem. For example, we could start
with a problem of unit size and in each iteration expand by a constant factor with
some constant probability, stopping when we reach n. We could also identify some
other parameter associated with our algorithm's state that satisfies the properties of
the lemma: either shrinking or expanding by some constant factor with constant
probability in each iteration.
Union Bounds and High Probability Results. Our definition of "with high
probability" meshes particularly well with the union bound. In particular, if we
can prove that some property holds with high probability for a generic element in
our input (failure probability at most 1/n^c), then the union bound tells us that
this property also holds with high probability for all n elements in our input (with
failure probability at most n · 1/n^c = 1/n^{c−1}, which we can still set to any level
of our choosing by selecting c appropriately). This simplifies many of our proofs by
allowing us to focus on proving a high probability result for only a single element or
smaller subproblem, rather than for the entire input taken as a whole. For example,
in the next chapter we will show that the randomized quicksort algorithm runs in
O(n log n) time with high probability by first observing that it spends O(log n)
work on any single element with high probability (using the randomized reduction
lemma), and then by applying a union bound to conclude that it spends O(log n)
work per element on all of the n input elements, also with high probability.
Non-Uniform Subproblem Sizes. If we perform a randomized binary search
over only a length-k subarray of a larger length-n array, the randomized reduction
lemma tells us this will run in O(log k) time, but only with high probability with
respect to k (i.e., with a failure bound of the form 1/k^c). This can make it difficult
to use the union bound to aggregate n high probability results for differently-sized
subproblems to obtain a global high probability bound; to do this, we would need
a failure bound of the form 1/n^c for each subproblem. Fortunately, when we later
prove the randomized reduction lemma, we will show that it actually guarantees
an O(log n) running time with failure probability 1/n^c for any n ≥ k. We can
therefore aggregate high probability results over non-uniform subproblem sizes with
little trouble: a subtle, but important point.
The following problems provide good examples of the randomized reduction lemma
in action (often in conjunction with the union bound).
(a) Consider a random permutation of {1, 2, …, n}. Please argue that the length of
its longest sequential subsequence is O(log n) with high probability. A sequential
subsequence is a subsequence whose elements are increasing by 1 as we move from left
to right. [Solution]
(a) Please show that the maximum node degree in our random tree is O(log n) with high
probability. [Solution]
(b) Please argue that the diameter (length of the longest path) of our random tree is
O(log n) with high probability. [Solution]
⁴ We discuss some related results to this problem later in the book: in problem 210 we develop
an O(n log n) algorithm for finding the longest increasing subsequence, and in problem ?? we
prove the Erdős-Szekeres theorem, which states that any length-n sequence must have either an
increasing or decreasing subsequence consisting of at least √n elements.
⁵ Note that with a more detailed Chernoff bound analysis, this bound can be improved slightly to
O(log n / log log n). There are also some interesting related results achievable by probing multiple
bins; for example, if you place each ball in whichever of two randomly chosen bins is least full, then
a much more sophisticated analysis shows that the expected maximum fullness is only O(log log n).
⁶ One elegant method is using the so-called Prüfer code: a method for encoding any n-node
tree (with nodes labeled 1 … n) as an integer sequence of length n − 2 (with each element in the
range 1 … n). The mapping is one-to-one (thereby providing a nice proof that there are exactly
n^{n−2} labeled trees on n nodes), and can be performed in either direction (tree to sequence, or
sequence to tree) in Θ(n) time. Since the number of occurrences of x in the sequence is one
less than the degree of node x in the tree, problem 19 tells us that if we build a random n-node
tree from a randomly-generated Prüfer code (equivalent to throwing n − 2 balls into n bins),
its maximum degree will be O(log n) with high probability. See also problem ?? on generating
a random spanning tree of a graph, and see the endnotes for a brief discussion of properties of
random graphs in general.
Problem 21 (The Coupon Collector Problem). Suppose that you see a special
advertisement on your box of breakfast cereal claiming that inside the box you will find
one of n different types of special coupons, each equally likely. Please show that you will
find at least one of each coupon type with high probability if you open O(n log n) boxes
of cereal. As a hint, try to prove the slightly stronger result that by the time you have
accumulated all n coupon types, you will have found only O(log n) copies of any single
particular coupon type with high probability. [Solution]
Problem 22 (Randomly Spreading Information in a Distributed Net-
work). Suppose we have a distributed network of n processors where one processor
wants to broadcast a message to the others. Suppose further that we are operating under
a synchronous model of distributed computation where time proceeds in steps according
to some global clock, and in each time step a processor can exchange a single message with
another processor.
(a) In each time step, suppose each processor that has heard the message contacts a
random processor and sends it the message (note that if we are unlucky, the other
processor might have already heard the message). This is called a gossip protocol.
Using the tools from this chapter, see if you can prove that all n processors will hear
the message after only O(log n) steps with high probability (with probability at least
1 − 1/n^c for any constant c > 0 of our choosing). As a hint, consider separately the
number of steps until some constant fraction of the processors has initially heard the
message, and then the number of steps required to spread to the remaining processors.
[Solution]
(b) Suppose we implement a gossip protocol by "pulling" rather than "pushing" the
message. That is, in each time step, each processor without the message contacts a random
other processor and asks for a copy of the message (which it may or may not have,
and even if it does have the message, it may not be able to communicate if it is
already talking to a different processor, since each processor can only participate in
a single communication session per step). Please try to prove that this variant also
distributes a message from a single processor to all n processors in O(log n) steps with
high probability. [Solution]
We now turn to the second major concept in our study of probability theory: random
variables. In contrast to a standard (non-random) variable that represents a single
value, a random variable is associated with a probability distribution over values.
For example, if X represents the smaller of the face values when we roll two 6-sided
dice, its distribution is
$$1: \tfrac{11}{36} \qquad 2: \tfrac{9}{36} \qquad 3: \tfrac{7}{36} \qquad 4: \tfrac{5}{36} \qquad 5: \tfrac{3}{36} \qquad 6: \tfrac{1}{36}.$$
If we actually perform a random trial by rolling two dice, then we can think of
sampling a value according to this distribution to instantiate the value of X. In
other words, X serves as a placeholder for a value between 1 and 6 that would
only "materialize" after we perform a random trial (although we never actually replace
X with a concrete value in this fashion). A random variable assigns a numeric value
to every possible outcome associated with some random experiment; in the example
above, the outcome (3, 5) maps to the value X = 3.
Two random variables X and Y are independent if knowledge of the value of X
does not change the probability distribution associated with Y, and vice versa. The
expected value of a random variable X is defined as
$$E[X] = \sum_{v} v \cdot \Pr[X = v].$$
That is, E[X] is the sum over all possible values v that X can take, of v weighted
by the probability of X taking the value v.
Here are some examples:

- If X denotes the minimum of the values obtained by two rolls of a 6-sided die,
  then E[X] = 1·(11/36) + 2·(9/36) + 3·(7/36) + 4·(5/36) + 5·(3/36) + 6·(1/36) = 91/36.
- If X denotes the number of heads we see when flipping 100 fair coins, then
  E[X] = 50. If the coins are biased and show heads with probability 3/4 and
  tails with probability 1/4, then E[X] = 75.
- If we repeatedly flip a biased coin (showing heads with probability 1/10) and
  T denotes the number of flips up to and including the first time we see heads,
  then E[T] = 10.
- If T denotes the amount of time we spend performing a linear search for a
  randomly-chosen element in an n-element array, then E[T] = Θ(n).
- If T denotes the amount of time we spend performing a binary search for a
  randomly-chosen element in a sorted n-element array, then E[T] = Θ(log n).
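As a quick empirical check of the first example (illustrative, not from the text):

```python
import random

# Estimate E[X] for the first example: X is the smaller value of two dice.
rng = random.Random(0)
trials = 1_000_000
total = sum(min(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(trials))
print(total / trials, "vs. exact", 91 / 36)   # both about 2.528
```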
The running time of a Las Vegas algorithm is a random variable. It is small if the
algorithm is lucky and large if the algorithm is unlucky in its random choices. The
simplest question you can generally ask about a Las Vegas algorithm is what is its
expected running time. If our expected running time is, say, O(n log n), this gives
a sense of what we can expect on average when we run the algorithm, but it still
leaves open the possibility that the running time may have high variability, and
may take much more than O(n log n) time with some non-negligible probability.
In order to convince even the most stubborn critic that our algorithm is reliable
enough to use, we may want to attempt to prove the much stronger result that it
runs in O(n log n) time with high probability.
to any interval. In this book, essentially all of our random variables will be discrete (and mostly
also integer-valued). Continuous random variables are not particularly problematic; they just
require more familiarity with calculus and continuous mathematics to handle. For example, the
expected value of a continuous random variable taking a value uniformly selected from the interval
[0, 10] would be written as the integral $\int_0^{10} \frac{x}{10}\,dx = 5$.
stronger than an expected running time bound of O(f(n)), neither bound technically im-
plies the other. To illustrate this fact, please describe the probability distribution for the
running time of a hypothetical randomized algorithm that is O(f(n)) in expectation but
not O(f(n)) with high probability. Next, show how we could have a running time that is
O(f(n)) with high probability, but not O(f(n)) in expectation. [Solution]
which, while feasible, is perhaps more complicated than we might like. Instead, let
us decompose X into a sum of much simpler random variables:
X = X1 + X2 + . . . + X100 ,
where Xi takes the value 1 if the ith flip is heads, 0 otherwise. A 0/1-valued random
variable like Xi is called an indicator random variable, since its value indicates
whether or not a certain event has occurred. The expected value of an indicator
random variable is easy to compute:
$$E[X_i] = \sum_v v \Pr[X_i = v] = 0 \cdot \Pr[X_i = 0] + 1 \cdot \Pr[X_i = 1] = \Pr[X_i = 1] = 1/2.$$
8 In Chapter ??, we will see that the distribution of a sum of independent random variables is given by the convolution of their individual distributions.
Note that E[Xi ] is just the probability of the event for which Xi serves as an
indicator. This is true for any indicator random variable: its expected value is
just the probability of its associated event. By linearity of expectation, we now have
$$E[X] = E[X_1] + E[X_2] + \ldots + E[X_{100}] = 100 \cdot \tfrac{1}{2} = 50,$$
which is the result we intuitively expect. Linearity of expectation is one of the most
valuable tools we have for analyzing randomized algorithms, and we will use it on
many occasions. The reader is encouraged to learn it well.
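As a small concrete check (our own sketch, not the book's), we can compute the expectation exactly via the indicator decomposition and confirm it empirically:

```python
import random

# Exact computation via linearity of expectation:
# E[X] = E[X_1] + ... + E[X_100], where each indicator has E[X_i] = 1/2.
n_flips = 100
expected = n_flips * 0.5

# Empirical check: average the number of heads over many simulated trials.
trials = 10_000
average = sum(sum(random.random() < 0.5 for _ in range(n_flips))
              for _ in range(trials)) / trials

print(expected, average)  # both should be close to 50
```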
Expectations of Products. Since expectations of sums behave so nicely, what
happens with products? Unfortunately, E[XY ] = E[X] E[Y ] is not true in general.
For example, if X is +1 or −1 with equal probability and Y = X, then E[XY ] = E[X²] = 1,
while E[X] E[Y ] = 0. If this property does hold, then we say X and Y are uncorrelated. Since independent
random variables are always uncorrelated, we can always decompose the expectation
of a product of independent random variables into a product of expectations. Please
take care not to confuse this with linearity of expectation for sums, which applies
to any sum of random variables regardless of independence.
Expected Trials Until Success. If we roll a 6-sided die, then Pr[E] = 1/6 if E
is the event that we roll a 2. If we keep rolling until the event E occurs, how many
trials do we expect to perform, including the final successful trial? The answer is
what we might intuitively expect: 6. This is an example of a very useful principle:
if we perform a sequence of independent trials, where each trial succeeds with prob-
ability p, then the expected number of trials up to and including the first success is
1/p. Similarly, if each trial succeeds with probability at least p, then we expect at
most 1/p trials. [Very short proof]
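For the curious reader, here is one standard derivation (our own sketch, not the text's linked proof): if T counts trials up to and including the first success, then
$$E[T] = \sum_{k \ge 1} k\,p(1-p)^{k-1} = p \sum_{k \ge 1} k(1-p)^{k-1} = p \cdot \frac{1}{p^2} = \frac{1}{p},$$
using the identity $\sum_{k \ge 1} k x^{k-1} = 1/(1-x)^2$ for $|x| < 1$.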
(a) Exchanging Hats. Suppose n people in a room are each wearing different hats. If
they all exchange hats according to a random permutation, how many people do you
expect to receive their original hat? [Solution]
(b) Balls in Bins. If we randomly throw n balls into m bins, what is the expected
number of balls that end up in each bin? Among all $\binom{n}{2}$ pairs of balls, how many pairs
do we expect to collide by landing in the same bin? What is the expected number
of empty bins? For the last question, an approximate answer is fine. [Solution]
(c) The Coupon Collector Problem. Recall problem 21: inside your box of breakfast
cereal you will find one of n different types of special coupons, each equally likely.
Please show that you expect to open Θ(n log n) boxes of cereal before you have col-
lected at least one of each coupon type. As a hint, decompose the sequence of boxes
we check into a series of phases, where during one phase we have collected exactly
k distinct coupons and we are opening boxes in hopes of finding a coupon of any of
the remaining n − k types. [Solution]
(d) Revisiting the Randomized Reduction Lemma. Consider a randomized algo-
rithm whose input consists of n elements, where in each iteration of the algorithm
there is at least some constant probability of reducing our problem to at most some
constant fraction of its original size. In this setting, we learned how the randomized
reduction lemma guarantees that we spend only O(log n) iterations with high prob-
ability. Just for fun, use linearity of expectation to show an analogous (and slightly
weaker) result that we spend O(log n) iterations in expectation. [Solution]
(e) Prefix Maxima. Suppose in an array A[1 . . . n] that you want to compute for each
index j the maximum of A[1 . . . j]. This is easy to do in Θ(n) time, of course, by scan-
ning through A and keeping a running maximum. As an exercise, however, consider
the following alternative method: process the elements of A in random order. For each
element A[j], scan backward down to A[1] keeping a running maximum. During this
scan, we stop at the first element A[i] previously processed, since we will have already
computed and stored the maximum of A[1 . . . i], making it unnecessary to scan these
elements again. What is the expected running time of this algorithm? [Solution]
(a) If the processors all know the value of n, then let each one independently attempt
transmission during each time step with probability p = 1/n. Please show that the
expected number of time steps until a successful transmission in this case is approxi-
mately e, and (if you know calculus) please argue 1/n is in fact the optimal value to
choose for p in order to maximize throughput. [Solution]
(b) If our processors don't know the value of n, let each processor attempt to transmit with
probability 1/2. If it decides to transmit and fails due to a collision, then it becomes
dormant and waits until a time step occurs when it hears no other transmissions
before it wakes up and starts attempting to transmit again (again with probability
1/2 per time step). Please argue that we expect O(log n) time steps to elapse between
successful transmissions. [Solution]
by first computing the per-element expected running times E[X1 ] . . . E[Xn ]. For
example, if our algorithm spends O(log n) expected time on a generic input element,
then linearity of expectation tells us that it spends O(n log n) total expected time.
Common Types of Distributions. Most random variables we will encounter
have probability distributions belonging to a handful of well-known classes. The
simplest of these is the Bernoulli distribution, which takes the value 1 with prob-
ability p and 0 with probability 1 − p. Bernoulli random variables are also called
indicator random variables when they are used to indicate whether or not a partic-
ular event happens. For example, if we flip a biased coin that comes up heads with
probability p and tails with probability 1 − p, then we could say that our indicator
variable takes the value 1 to indicate that the coin comes up heads. The expected
value of this indicator variable is exactly p, the probability of its associated event.
A random variable has a geometric distribution if its value indicates the number of
trials until a particular Bernoulli event is successful. For example, if X denotes the
number of flips of our biased coin up to and including the first trial where it comes
up heads, then X has a geometric distribution. It is called a geometric distribu-
tion since the probability of exactly k trials, $\Pr[X = k] = p(1-p)^{k-1}$, decays in a
geometric fashion with k. As we just mentioned, a geometric distribution derived
from a Bernoulli event occurring with probability p has expected value 1/p.
Suppose now that we flip n biased coins, each heads with probability p. We can
denote whether each individual coin comes up heads by using indicator variables
X1 . . . Xn , and we can write the total number of heads as X = X1 + . . . + Xn . A
random variable like X that is a sum of Bernoulli variables has a binomial distri-
bution, so named because Pr[X = k] (the probability we see k heads in n flips) is
given by the binomial coefficient formula $\binom{n}{k}\,p^k (1-p)^{n-k}$. Using linearity of expec-
tation, we find that E[X] = E[X1 ] + . . . + E[Xn ] = np. Binomial distributions tend
to be tightly concentrated around their means, a property we will exploit when we
introduce Chernoff bounds momentarily.
The binomial distribution is often approximated by the Poisson distribution, whose
probabilities $\Pr[X = k] = e^{-np}(np)^k/k!$ follow the terms of the Taylor series
expansion of $e^{np}$ (scaled by $e^{-np}$); it is well-studied in probability theory
due to its numerous properties and applications.
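To see the approximation at work, here is a brief numerical sketch (ours; the parameter choices are illustrative) comparing binomial probabilities against their Poisson counterparts when n is large and p is small:

```python
from math import comb, exp, factorial

n, p = 100, 0.03   # Poisson approximates binomial well for large n, small p
lam = n * p        # the Poisson parameter is the binomial mean np

for k in range(8):
    binomial = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = exp(-lam) * lam**k / factorial(k)
    print(k, round(binomial, 4), round(poisson, 4))  # the columns nearly agree
```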
9 This problem is closely related to the topic of arithmetic coding, which we briefly mention in
Section 10.2.1.
candidate later in the ordering (in which case the current candidate can never be recalled).
The goal is to maximize your chances of hiring the best candidate, and it turns out we can
achieve this by interviewing and dismissing the first 1/e fraction of all candidates, and
then hiring any candidate from that point on who is the best seen so far.
A remarkably simple algorithm, known as the odds algorithm, solves not only this problem
but a generalization that has several other applications. In a sequence of n independent
random trials, let pi be the probability that trial i succeeds, let qi = 1 − pi , and let
ri = pi /qi be the odds of success (the ratio of probability for versus against). Our goal is
to observe the outcome of each successive trial (success or not), choosing a point at which
to stop that maximizes our probability of stopping on the last successful event. For the
secretary problem, pi = 1/i, since success corresponds to interviewing a candidate who
is the best seen so far. However, this scenario applies to many other problems, such as
betting on the last time a stock price will jump up, or guessing when to stop at an open
parking space in hopes of parking at the closest available space to your place of work.
Letting $R_i = r_i + r_{i+1} + \ldots + r_n$ and $Q_i = q_i q_{i+1} \cdots q_n$, let t be the largest index10 at which
$R_t \ge 1$ (or we set t = 1 if no such index exists). Please show that an optimal strategy is
to stop at the first successful event at or beyond index t, and that this gives a probability
of $R_t Q_t$ of stopping on the last successful event. [Solution]
expect 1/p trials until the first success. Equivalently, we keep a running sum of success probabilities
and stop when this reaches 1. Note that the odds algorithm has a pleasantly symmetric form for
guessing the last success, where we sum the odds in reverse order until we reach a sum of 1.
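The stopping threshold itself is simple to compute. Here is a minimal sketch (our own code; the function name odds_threshold is invented for illustration) that finds t, applied to the secretary problem where pi = 1/i:

```python
def odds_threshold(p):
    """Compute the odds-algorithm threshold t: the largest (1-indexed) index
    at which the reverse sum of odds r_t + ... + r_n still reaches 1.
    p[i] is the success probability of trial i+1 (p is 0-indexed)."""
    total = 0.0
    for i in range(len(p) - 1, -1, -1):  # sum the odds in reverse order
        if p[i] >= 1:
            return i + 1                 # certain success: infinite odds
        total += p[i] / (1 - p[i])       # r_i = p_i / q_i
        if total >= 1:
            return i + 1                 # 1-indexed threshold t
    return 1                             # sum never reaches 1: take t = 1

# Secretary problem with n = 100 candidates: p_i = 1/i.
n = 100
print(odds_threshold([1 / i for i in range(1, n + 1)]))  # about n/e, i.e. ~38
```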
what fraction of the population wants to vote for candidate A in an election. According
to Chebyshev's inequality, approximately how many people must we sample at random
before we have an estimate that is accurate to within an additive error of ε with 99%
probability? [Solution]
Chernoff Bounds. Some of the most powerful bounds in our arsenal are the
Chernoff bounds, which tell us that a binomial random variable tends to be very
tightly concentrated near its expected value. For example, we can write the number
of heads in n coin flips as X = X1 + . . . + Xn where each Xi is an independent
indicator random variable that takes the value 1 if coin flip i is heads, 0 otherwise.
Here, we anticipate that the number of heads will almost always end up very close
to E[X] = n/2.
There are many different forms of Chernoff bounds, depending on whether we want
a bound on deviation below or above E[X], and whether we want a bound on the
probability of absolute or relative deviation from E[X]. Letting X = X1 + . . . + Xn
be a sum of independent indicator random variables, we have
2"2 /n
1. Pr[X E[X] "] e
2"2 /n
2. Pr[X E[X] + "] e
2
3. Pr[X (1 ")E[X]] e " E[x]/2
2
e " E[x]/4 if " 2e 1
4. Pr[X (1 + ")E[X]]
2 ("+1)E[x] if " 2e 1
As an example, what is the probability we see more than 75 heads in 100 coin flips?
According to the second bound above, $\Pr[X \ge 75] \le e^{-25^2/50} = e^{-12.5} < 0.000004$,
which is quite unlikely! [For the interested reader, a proof of our Chernoff bounds]
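To see how conservative this bound is in practice, here is a small simulation sketch (ours) comparing the empirical tail against the bound; the true tail is so small that the simulation will most likely report zero:

```python
import random
from math import exp

n, trials, threshold = 100, 50_000, 75
tail = sum(sum(random.random() < 0.5 for _ in range(n)) >= threshold
           for _ in range(trials)) / trials

eps = threshold - n / 2        # deviation above the mean E[X] = 50
bound = exp(-2 * eps**2 / n)   # form 2 of the Chernoff bounds above

print(tail, bound)             # empirical tail should be (well) below the bound
```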
Since the full-fledged Chernoff bounds above can sometimes be cumbersome to use,
we earlier developed the randomized reduction lemma as a simpler interface to
them. Using Chernoff bounds, we can now give a short [proof] of the randomized
reduction lemma.
invocations of the algorithm, we get an answer that is accurate to within a 1 ± ε factor with
high probability. How is this result related to that of the preceding problem? [Solution]
Problem 33 (Interpolation Search). This problem serves as a challenging grand
finale for our review of probability theory. Suppose we have a sorted array A[1 . . . n]
whose elements were independently drawn from the uniform distribution over [0, 1] prior
to sorting11 , and that we would like to find the element in A closest in value to a target
value v. An obvious solution is binary search, running in O(log n) time. However, since we
know more here about the structure of A, we can actually achieve an O(log log n) expected
running time using a biased version of binary search known as interpolation search. The
general idea is illustrated with an example: in an array A[1 . . . 1000], we would expect to
find v = 0.95 at roughly index 950, so we choose this as our next guess, as opposed to the
middle index binary search would have chosen.
At each step of the search, we are considering some subarray A[i . . . j], where we have
observed A[i] and A[j] but not yet any of the elements in between. Due to the uniform
distribution of our array elements, we expect the contents of A[i . . . j] to vary from A[i] up
to A[j] in a linear fashion12. Let $\alpha = \frac{v - A[i]}{A[j] - A[i]} \in [0, 1]$ be the relative distance at which we
expect to find v between A[i] and A[j]. Interpolating by this amount between i + 1 and
j − 1, we choose the next index to visit as
$$k = \text{Round}((i + 1)(1 - \alpha) + (j - 1)\alpha),$$
where the function Round(x) rounds x up to ⌈x⌉ with probability equal to the
fractional part of x, or down to ⌊x⌋ otherwise (this provides a clean, unbiased way to
round x to an integer value). Normally, we would compare v with A[k] and then recurse
left or right accordingly, but just to help simplify our analysis, let us consider looking at
A[k − 1], A[k], and A[k + 1], stopping if v lies between A[k − 1] and A[k + 1], or recursing left
on A[i . . . k − 1] or right on A[k + 1 . . . j] otherwise. This lets us argue that the probabilities
of recursing left or right are both at most 1/2 [proof] (whereas in the standard approach
where we only look at A[k], one of these probabilities would be larger than 1/2 and harder
to bound, making the rest of our particular style of analysis more complicated).
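To make the procedure concrete, here is a minimal sketch of the search loop (our own code, not the book's; it uses the simpler compare-only-A[k] recursion that the text mentions, and the names interpolation_search and probabilistic_round are invented for illustration):

```python
import math
import random

def probabilistic_round(x):
    """Round x up with probability equal to its fractional part (unbiased)."""
    frac = x - math.floor(x)
    return math.ceil(x) if random.random() < frac else math.floor(x)

def interpolation_search(A, v):
    """Return an index of sorted array A whose value is near v (illustrative sketch)."""
    i, j = 0, len(A) - 1
    while j - i > 1:
        if A[j] == A[i]:
            break                                  # degenerate subarray
        alpha = (v - A[i]) / (A[j] - A[i])
        alpha = min(max(alpha, 0.0), 1.0)          # clamp if v lies outside [A[i], A[j]]
        k = probabilistic_round((i + 1) * (1 - alpha) + (j - 1) * alpha)
        k = min(max(k, i + 1), j - 1)              # keep k strictly interior
        if v < A[k]:
            j = k                                  # recurse left
        elif v > A[k]:
            i = k                                  # recurse right
        else:
            return k
    return i if abs(A[i] - v) <= abs(A[j] - v) else j

A = sorted(random.random() for _ in range(1000))
print(interpolation_search(A, 0.95))  # should land near index 950
```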
In keeping with the way we have explained many other randomized algorithms, we show
that interpolation search runs in O(log log n) expected time using the randomized re-
duction lemma. Let m = j − i − 1 be the number of interior elements in our current
subarray, and let $B = \log[m^2 \alpha(1-\alpha)]$ be the log of the product of the number of these
elements we anticipate on the left and right of our target element. We regard B as our
effective problem size, and claim that this quantity reduces by some constant fraction in
each iteration with some constant probability. Since $B \le 2 \log n$ initially, the randomized
reduction lemma tells us that we expect O(log B) = O(log log n) total iterations.
Assume $\alpha \le 1/2$ (a symmetric argument works if $\alpha \ge 1/2$), and let $\varepsilon = \frac{A[j] - A[i]}{\sqrt{m}}$. Please
use a Chernoff bound to show that $\Pr[A[k-1] > v + \varepsilon]$ is at most some constant strictly
less than 1/2. Then, show that B decreases to at most some constant fraction of its current
value unless (i) we recurse to the right, or (ii) $A[k-1] > v + \varepsilon$. Since $\Pr[(i)] \le 1/2$ and
you have just shown that Pr[(ii)] is at most a constant less than 1/2, each iteration will
indeed reduce our effective problem size with at least some constant probability, thereby
fulfilling the conditions of the randomized reduction lemma. [Solution]
11 We can apply this technique to any probability distribution as long as we know the inverse of
its cumulative distribution function, which effectively allows us to map samples from the complicated
distribution to equivalent samples over the uniform distribution on [0, 1].
12 Conditioning on the fact that we have observed several array elements including A[i] and A[j]
but not yet A[i+1 . . . j−1], the subarray A[i+1 . . . j−1] still behaves as if it was obtained
by sorting an array of numbers independently chosen uniformly from the range [A[i], A[j]]. This
subtle point is important, since it allows us to assume every subarray we encounter during our
search behaves probabilistically just like the initial array.
where I is the n × n identity matrix (all zeros, with ones down the diagonal). Every
n × n matrix has an inverse unless it is singular. In Chapter ??, we will learn how
to define an object called the pseudo-inverse, which plays a similar role to the inverse
for singular or non-square matrices.
[Figure: (a) a rooted tree on nodes 1–7; (b) its encoding as the balanced parenthesis string (()((()())()))(); (c) the corresponding bit string 1101110100100010, equivalently the ±1 sequence +1 +1 −1 +1 +1 +1 −1 +1 −1 −1 +1 −1 −1 −1 +1 −1.]
14 There is one slight nuisance we should bear in mind for counting problems, particularly those
in which the answer can be exponentially large. The RAM model of computation usually only
permits us O(log n) bits in a word, allowing us to count only polynomially high, up to $n^c$ for
some constant c. If we need to count to some exponentially high number like $2^n$, this requires at
least n bits. We can resolve this issue either by relaxing the model of computation or by penalizing
our running times to account for arithmetic on large numbers.
(a) Suppose the objects we want to index are the n! different permutations of an array
containing n distinct elements. Give a Θ(n) algorithm that constructs the ith such
permutation in lexicographic order given any $i \in \{0, 1, \ldots, n! - 1\}$, and give a Θ(n)
algorithm that performs the inverse mapping as well, taking an ordering and producing
i as output. [Solution]
(b) Consider the same problem for combinations, rather than permutations. That is,
we want to index all k-element subsets of an array containing n distinct elements.
Show how to map between a subset (stored in an array of size k) and an index $i \in
\{0, 1, \ldots, \binom{n}{k} - 1\}$ in O(k) time. [Solution]
(c) To generate a random k-element subset of an n-element array, consider the following
approach: for each i from n − k + 1 up to n, choose a random index $j \in \{1, \ldots, i\}$.
If the jth element is not in our subset already, add it; otherwise add the ith element
instead. We can implement this method in O(k) expected time with hashing (Chapter
7). Please show that it does indeed produce a k-element subset chosen uniformly at
random. (A runnable sketch of this procedure appears below.) [Solution]
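Here is a minimal sketch (ours) of the sampling procedure just described, using a Python set in place of hashing; this method is often attributed to Floyd:

```python
import random

def random_k_subset(n, k):
    """Sample a k-element subset of {1, ..., n} using the procedure from (c)."""
    subset = set()
    for i in range(n - k + 1, n + 1):
        j = random.randint(1, i)   # random index in {1, ..., i}
        if j not in subset:
            subset.add(j)          # add the jth element
        else:
            subset.add(i)          # collision: add the ith element instead
    return subset

print(random_k_subset(10, 4))  # e.g. {2, 5, 7, 10}
```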
Problem 37 (Secret Sharing). This problem illustrates a simple yet powerful result
that makes use of polynomial root bounds.
(a) Show that the n coefficients of a degree-(n − 1) polynomial are uniquely determined
if we specify the value of the polynomial at n or more different points, even if we are
performing arithmetic modulo a prime p. [Solution]
(b) Suppose n people want to share a secret (some integer a in the range 0 . . . p − 1, where
p is prime). For extra security, they want to distribute information about the secret
among themselves so that it takes at least k people cooperating together to determine
the secret. That is, any subset of k or more people should be able to determine the
secret, and any subset of k − 1 or fewer people should not be able to learn anything
about the secret. How can we use the preceding fact to accomplish this? [Solution]