Lecture #19: Self-Improving Algorithms∗
Tim Roughgarden†
1 Preliminaries
The last few lectures discussed several interpolations between worst-case and average-case
analysis designed to identify robust algorithms in the face of strong impossibility results for
worst-case guarantees. This lecture gives another analysis framework that blends aspects of
worst- and average-case analysis. In today’s model of self-improving algorithms, an adversary
picks an input distribution, and then nature picks a sequence of i.i.d. samples from this
distribution. This model is relatively close to traditional average-case analysis, but with the
twist that the algorithm has to learn the input distribution (or a sufficient summary of it)
from samples. It is not hard to think of real-world applications where there is enough data
to learn over time an accurate distribution of future inputs (e.g., click-throughs on a major
Internet platform). The model and results are by Ailon, Chazelle, Comandur, and Liu [1].
The Setup. For a given computational problem, we posit a distribution over instances.
The difference between today’s model and traditional average-case analysis, including our
studies of the planted clique and bisection problems, is that the distribution is initially un-
known. The goal is to design an algorithm that, given an online sequence of instances — each
an independent and identically distributed (i.i.d.) sample from the unknown distribution —
quickly converges to an algorithm that is optimal for the underlying distribution.1 Thus the
algorithm is “automatically self-tuning.” The challenge is to accomplish this goal with fewer
samples and less space than a brute-force approach.
∗© 2009–2017, Tim Roughgarden.
†Department of Computer Science, Stanford University, 474 Gates Building, 353 Serra Mall, Stanford, CA 94305. Email: [email protected].
¹Remember that when there is a distribution over inputs, there is also a well-defined notion of an optimal algorithm, namely one with the best-possible expected performance (where the expectation is over the input distribution).
Main Example: Sorting. The obvious first problem to apply the self-improving paradigm
to is sorting in the comparison model, and that’s what we do here.2 Each instance is an
array of n elements, where n is fixed and known. The ith element is drawn from an un-
known real-valued distribution Di . An algorithm can glean information about the xi ’s only
through comparisons—for example, the algorithm cannot access a given bit of the binary
representation of an element (as one would do in radix sort, for example).
A key assumption is that the Di ’s are independent distributions; Section 5.3 discusses
this assumption. The distributions need not be identical. Identical distributions are unin-
teresting in our context, since in this case the relative order of the elements is a uniformly
random permutation. As we’ll see later, every correct sorting algorithm requires Ω(n log n)
expected comparisons in this case, and a matching upper bound is achieved by MergeSort
(for example).
²Subsequent work extends these techniques and results to several problems in low-dimensional computational geometry, including computing maxima (see Lecture #2), convex hulls, and Delaunay triangulations.
³For sorting, random data is the worst case, and hence we propose a parameter to upper bound the amount of randomness in the data. This is an interesting contrast with our lectures on smoothed analysis, where the analysis framework imposes a lower bound on the amount of randomness in the input.

Definition 2.1 (Entropy of a Distribution) Let D = {px}x∈X be a distribution over the finite set X. The entropy H(D) of D is

H(D) = Σ_{x∈X} px log2(1/px),   (1)

where we interpret 0 · log2(1/0) as 0.
For example, H(D) = log2 |Y | if D is uniform over some subset Y ⊆ X. When Y is the set
Sn of all n! permutations of {1, 2, . . . , n}, this is Θ(n log n). On Homework #10 you will show
that: if D puts positive probability on at most 2^h different elements of X, then H(D) ≤ h.
That is, for a given support Y ⊆ X of a distribution, the most random distribution with
support in Y is the uniform distribution over Y .
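For instance, suppose D is uniform over the set Sn of permutations, so that H(D) = log2 n!. A crude estimate, spelled out here for concreteness, already gives the Θ(n log n) bound claimed above:

(n/2) · log2(n/2) ≤ log2 n! = Σ_{k=1}^n log2 k ≤ n log2 n.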
Happily, we won’t have to work with the formula (1) directly. Instead, we use Shannon’s
characterization of entropy in terms of average coding length.
Theorem 2.2 (Shannon’s Theorem) For every distribution D over the set X, the en-
tropy H(D) characterizes (up to an additive +1 term) the minimum possible expected encod-
ing length of X, where a code is an injective function from X to {0, 1}∗ and the expectation
is with respect to D.
Proving this theorem would take us too far afield, but it is accessible and you should look it
up (see also Homework #10). The upper bound is already achieved by the Huffman codes
that you may have studied in CS161. Intuitively, if entropy is “defined correctly,” then we
should expect the lower bound to hold—entropy is the average number of bits of information
provided by a sample, and it makes sense that encoding this information would require this
many bits.
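As a concrete illustration of Theorem 2.2 (a self-contained sketch, not taken from the lecture; the distribution below is made up for illustration), the following Python snippet computes the entropy (1) of a distribution and the expected codeword length of a Huffman code for it; the two quantities agree up to the additive +1 term.

import heapq
import math

def entropy(p):
    # H(D) = sum_x p_x * log2(1/p_x), with 0 * log2(1/0) read as 0; see (1).
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

def huffman_expected_length(p):
    # Expected codeword length of a Huffman code for the probabilities p.
    # Each merge of two subtrees pushes all leaves below it one level deeper,
    # so the total expected length is the sum of the merged probabilities.
    if len(p) <= 1:
        return 0.0
    heap = [(px, i) for i, px in enumerate(p)]  # second entry breaks ties
    heapq.heapify(heap)
    next_id = len(p)
    expected_length = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        expected_length += p1 + p2
        heapq.heappush(heap, (p1 + p2, next_id))
        next_id += 1
    return expected_length

D = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # hypothetical distribution over a 5-element set X
print(entropy(D))                    # 1.875
print(huffman_expected_length(D))    # 1.875 (dyadic probabilities, so the code is exactly optimal)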
A simple but important observation is that a correct comparison-based sorting algorithm
induces a binary encoding of the set Sn of permutations of {1, 2, . . . , n}. To see this, recall
that such an algorithm can be represented as a tree (Figure 1), with each comparison cor-
responding to a node with two children — the subsequent execution of the algorithm as a
function of the comparison outcome. At a leaf, where the algorithm terminates, correctness
implies that the input permutation has been uniquely identified. Label each left branch with
a “0” and each right branch with a “1.” By correctness, each permutation of Sn occurs in at
least one leaf; if it appears at multiple leaves, pick one closest to the root. The sequence of 0’s
and 1’s on a root-leaf path encodes the permutation. If the xi ’s are drawn from distributions
D = D1 , D2 , . . . , Dn , with induced distribution Π(D) over permutations (leaves), then the
expected number of comparisons of the sorting algorithm is at least the expected length of
the corresponding encoding, which by Shannon’s Theorem is at least H(Π(D)).
Figure 1: Every correct comparison-based sorting algorithm induces a decision tree, which induces a binary encoding of the set Sn of permutations on n elements.

Upshot: Our goal should be to design a correct sorting algorithm that, for every distribution D, quickly converges to the optimal per-instance expected running time of

O(n + H(Π(D))).   (2)

The second term is the necessary expected number of comparisons, and the first term is the time required to write the output. The rest of this lecture provides such an algorithm.
For example, suppose each xi takes on one of two possible values. Then the support size of the induced distribution Π(D) is at most 2^n, and (2) then insists that the algorithm should converge to a per-instance expected running time of O(n).
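To spell this out, using the Homework #10 fact that a distribution with support size at most 2^h has entropy at most h:

H(Π(D)) ≤ log2(2^n) = n,

so the bound (2) is O(n + n) = O(n).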
Figure 2: Construction of the V-list. After merging the elements of λ random instances, take every λth element.
comparison model.) One could of course use binary search to place each element in O(log n)
time into the right bucket, but we cannot afford to do this if H(Π(D)) = o(n log n).
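To make the construction in Figure 2 concrete, here is a minimal sketch of Phase I in Python (the function and variable names, and the constant c = 10, are placeholders rather than anything from the lecture): merge the elements of λ = ⌈c log n⌉ training instances and keep every λth element of the merged list as a bucket boundary.

import math

def build_v_list(training_instances, c=10):
    # training_instances: at least lambda_ = ceil(c * log n) instances, each an
    # array of n reals whose ith entry is drawn independently from Di.
    n = len(training_instances[0])
    lambda_ = max(1, math.ceil(c * math.log(n)))
    merged = sorted(x for inst in training_instances[:lambda_] for x in inst)
    # Keep every lambda_-th element of the lambda_*n merged elements, so that
    # lambda_ - 1 samples separate consecutive bucket boundaries (Figure 2).
    return merged[lambda_ - 1::lambda_]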
Lemma 3.1 With probability at least 1 − 1/n^2 over the choice of the V-list, the buckets it defines satisfy

ED[mi^2] ≤ 20

for every bucket i, where mi denotes the number of elements that land in bucket i and the expectation is over a (new) random sample from the input distribution D.
The point of the lemma is that, after we’ve assigned all of the input elements to the
correct buckets, we only need O(n) additional time to complete the job (sorting each bucket
using InsertionSort (say) and concatenating the results). We now turn to its proof.4
⁴We skipped over this proof in lecture.
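For intuition, here is a sketch in Python of how a single instance is handled once the bucket boundaries are in hand (the names are ours); bucket location is shown with plain binary search as a placeholder for the learned search trees Ti discussed below.

import bisect

def insertion_sort(a):
    # Quadratic in the worst case, but expected O(1) per bucket once
    # E[mi^2] = O(1), as guaranteed by Lemma 3.1.
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def sort_instance(x, boundaries):
    # Step 1: assign each element to its bucket (placeholder: binary search).
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for xi in x:
        buckets[bisect.bisect_left(boundaries, xi)].append(xi)
    # Steps 2 and 3: sort each bucket and concatenate the results.
    return [y for b in buckets for y in insertion_sort(b)]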
Proof of Lemma 3.1: The basic reason the lemma is true, and the reason for our choice of λ, is that the Chernoff bound has the form

Pr[X < (1 − δ)µ] ≈ e^{−µ·δ^2},   (3)
where we are ignoring constants in the exponent on the right-hand side, and where µ is
defined as E[X]. Thus: when the expected value of the sum of Bernoulli random variables is
at least logarithmic, even constant-factor deviations from the expectation are highly unlikely.
To make use of this fact, consider the λn elements S that belong to the first λ inputs,
from which we draw V. Consider fixed indices k, ℓ ∈ {1, 2, . . . , λn}, and define Ek,ℓ as the (bad) event (over the choice of S) that

(i) the kth and ℓth elements take on a pair of values a, b ∈ R for which ED[mab] ≥ 4λ, where mab denotes the number of elements of λ random inputs (i.i.d. samples from D) that lie between a and b; and

(ii) at most λ other elements of S lie between a and b.
That is, Ek,ℓ is the event that the kth and ℓth samples are pretty far apart and yet there is
an unusually sparse population of other samples between them.
We claim that for all k, ℓ, Pr[Ek,ℓ] is very small, at most 1/n^5. To see this, fix k, ℓ and condition on the event that (i) occurs, and on the corresponding values of a and b. Even then, the Chernoff bound (3) implies that the (conditional) probability that (ii) occurs is at most 1/n^5, provided the constant c in the definition of λ is sufficiently large. (We can take µ = 4c log n and δ = 1/2, for example.) Taking a Union Bound over the at most (λn)^2 choices for k, ℓ, we find that Pr[∨k,ℓ Ek,ℓ] ≤ 1/n^2. This implies that, with probability at least 1 − 1/n^2, for every pair a, b ∈ S of samples with fewer than λ other samples between them, the
expected number of elements of λ future random instances that lie between a and b is at
most 4λ. This guarantee applies in particular to consecutive bucket boundaries, since only
λ − 1 samples separate them.
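Spelling out the parenthetical calculation above (a sketch, with constants in the exponent ignored as in (3)): conditioned on (i), the number X of other elements of S that lie between a and b has µ = E[X] ≥ 4λ − 2, which is roughly 4c log n, so with δ = 1/2,

Pr[X ≤ λ] ≤ Pr[X < (1 − 1/2)µ] ≈ e^{−µ/4} = e^{−Θ(c log n)} = n^{−Θ(c)},

which is at most 1/n^5 once c is a sufficiently large constant.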
To recap, with probability at least 1 − 1/n^2 (over S), for every bucket B defined by two
consecutive boundaries a and b, the expected number of elements of λ independent random
inputs that lie in B is at most 4λ. By linearity, the expected number of elements of a single
random input that land in B is at most 4.
To finish the proof, consider an arbitrary bucket Bi and let Xj be the random variable
indicating whether or not the jth element of a random instance lies in Bi . We have the
following upper bound on the expected squared size mi^2 of bucket i:
E[(X1 + X2 + · · · + Xn)^2] = Σ_j E[Xj^2] + 2 Σ_{j<h} E[Xj] · E[Xh]
                           ≤ Σ_j E[Xj] + (Σ_j E[Xj])^2,
where the equality uses linearity of expectation and the independence of the Xj’s, and the inequality uses the fact that the Xj’s are 0-1 random variables. With probability at least 1 − 1/n^2, Σ_j E[Xj] ≤ 4 and hence this upper bound is at most 4 + 4^2 = 20, simultaneously for every bucket Bi.
The key recurrence is
E[Ti] = 1 + min_{j=1,2,...,n} { E[Ti^{1,2,...,j−1}] + E[Ti^{j+1,j+2,...,n}] },

where Ti^{a,...,b} denotes the optimal search tree for locating xi with respect to the bucket boundaries a, a + 1, . . . , b. In English, this recurrence says that if you knew the right comparison to ask first, then you would solve the two resulting subproblems optimally. The dynamic program
effectively tries all n possibilities for the first comparison. There are O(n^2) subproblems (one per contiguous subset of {1, 2, . . . , n}) and the overall running time is O(n^3). Details are left to Homework #10. For a harder exercise (optional), try to improve the running time to O(n^2) by exploiting extra structure in the dynamic program (which is an old trick of Knuth).
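For concreteness, here is one standard way to implement this O(n^3) dynamic program in Python (a sketch with made-up names, not code from the lecture; q[b] is an estimate of the probability that xi lands in bucket b, and each comparison against a boundary splits a contiguous range of buckets in two):

from functools import lru_cache

def optimal_tree_cost(q):
    # q[b] = estimated probability that xi falls in bucket b, for b = 0, 1, ..., n.
    n = len(q) - 1
    prefix = [0.0]
    for qb in q:
        prefix.append(prefix[-1] + qb)

    @lru_cache(maxsize=None)
    def e(lo, hi):
        # Minimum expected number of comparisons to pin down xi's bucket,
        # given that it lies in one of the buckets lo, lo+1, ..., hi.
        if lo == hi:
            return 0.0
        weight = prefix[hi + 1] - prefix[lo]   # every search reaching this range pays one comparison
        return weight + min(e(lo, r) + e(r + 1, hi) for r in range(lo, hi))

    return e(0, n)

# Example with a made-up, heavily skewed bucket distribution.
print(optimal_tree_cost([0.7, 0.1, 0.1, 0.05, 0.05]))

Knuth’s optimization restricts the range of r searched in the inner minimum; that is the extra structure alluded to above that brings the running time down to O(n^2).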
Given such a solution, the total time needed to construct all n trees — a different one for
each Di, of course — is O(n^3). We won’t worry about how to account for this work until
the end of the lecture.
Since binary encodings of the buckets {0, 1, 2, . . . , n} and decision trees for searching are
essentially in one-to-one correspondence (with bits corresponding to comparison outcomes),
Shannon’s Theorem implies that the expected number of comparisons used to classify xi via
the optimal search tree Ti is essentially the entropy H(Bi ) of the corresponding distribution
on buckets. We relate this to the entropy that we actually care about (that of the distribution
on permutations) in Section 4.
obvious that the third step can be implemented in O(n) time (with probability 1, for any
distribution). The expected running time of the second step of Phase II is O(n). This follows
easily from Lemma 3.1. With probability at least 1 − 1/n^2 over the choice of V in Phase I, the expectation of the squared size of every bucket is at most 20, so InsertionSort runs in expected O(1) time on each of the O(n) buckets. For the remaining probability of at most 1/n^2, we can use the O(n^2) worst-case running time bound for InsertionSort; even then, this exceptional
case contributes only O(1) to the expected running time of the self-improving sorter.
For the first step, we have already argued that the expected running time is proportional to Σ_{i=1}^n H(Bi), where Bi is the distribution on buckets induced by Di (because we use
optimal binary search trees). The next lemma shows that this quantity is no more than the
target running time bound of O(n + H(Π(D))).
5 Extensions
5.1 Optimizing the Space
We begin with a simple and clever optimization that also segues into the next extension,
which is about generalizing the basic algorithm to unknown distributions.
⁵Actually, this only provides an upper bound on the entropy of the joint distribution H(B1 × B2 × · · · × Bn) of the buckets, rather than on Σ_i H(Bi). But it is an intuitive and true fact that the entropies of independent random variables add (roughly equivalently, to encode all of them, you may as well encode each separately). So Σ_{i=1}^n H(Bi) = H(B1 × · · · × Bn), and it is enough to exhibit a single encoding of b1, . . . , bn.
The space required by our self-improving sorter is Θ(n^2), for the n optimal search trees (the Ti’s). Assuming still that the Di’s are known, suppose that we truncate each Ti after level ε log n, for some ε > 0. These truncated trees require only O(n^ε) space each, for a total of O(n^{1+ε}). What happens now when we search for xi’s bucket in Ti and fall off the bottom of the truncated tree? We just locate the correct bucket by standard binary search! This takes O(log n) time, and we would have had to spend at least ε log n time searching for it in the original tree Ti anyways. Thus the price we pay for maintaining only the truncated versions of the optimal search trees is a 1/ε blow-up in the expected bucket location time, which is a constant-factor loss for any fixed ε > 0. Conceptually, the big gains from using
an optimal search tree instead of standard binary search occur at leaves that are at very
shallow levels (and presumably are visited quite frequently).
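A minimal sketch of this trade-off in Python (the node representation and names are ours, not from the lecture): search the truncated tree as usual, and if the search falls off the truncated frontier, finish with plain binary search over all the boundaries.

import bisect

class Node:
    # Internal node of a (possibly truncated) search tree: compare x against
    # boundaries[index]; leaves that identify a bucket store it in 'bucket'.
    def __init__(self, index=None, left=None, right=None, bucket=None):
        self.index, self.left, self.right, self.bucket = index, left, right, bucket

def locate_bucket(x, root, boundaries):
    node = root
    while node is not None:
        if node.bucket is not None:
            return node.bucket              # identified within the truncated tree
        node = node.left if x <= boundaries[node.index] else node.right
    # Fell off the bottom of the truncated tree: fall back to binary search,
    # an extra O(log n) on a search that already cost at least epsilon*log n.
    return bisect.bisect_left(boundaries, x)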
Figure 5: A Delaunay triangulation in the plane with circumcircles shown.
provably requires exponential space (Homework #10). Intuitively, there are too many funda-
mentally distinct distributions that need to be distinguished. An interesting open question is
to find an assumption weaker than (or incomparable to) independence that is strong enough
to allow interesting positive results. The reader is encouraged to go back over the analysis
above and identify all of the (many) places where we used the independence of the Di ’s.
“buckets” into one for the entire point set. It can be done, however; see [2].
6 Final Scorecard
To recap, we designed a self-improving sorting algorithm, which is simultaneously competi-
tive (up to a constant factor) with the optimal sorting algorithm for every input distribution
with independent array elements. The algorithm’s requirements are (for a user-specified
constant ε > 0):
• Θ(n^ε log n) training samples (O(log n) to pick good bucket boundaries, the rest to estimate frequent buckets);
• Θ(n^{1+2ε}) time to construct the search trees T̂i following the training phase;7
• O(n + H(Π(D))) comparisons (expected) per input after the T̂i’s have been built [with high probability over the first Θ(log n) training samples].
We note that the requirements are much more reasonable than for an algorithm that attempts
to explicitly learn the distribution Π(D) over permutations (which could require space and a
number of samples exponential in n).
References
[1] N. Ailon, B. Chazelle, S. Comandur, and D. Liu. Self-improving algorithms. In Proceed-
ings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
261–270, 2006.
⁷This computation can either be absorbed as a one-time cost, or spread out over another O(n^2) training samples to keep the per-sample computation at O(n log n).
[2] K. L. Clarkson and C. Seshadhri. Self-improving algorithms for Delaunay triangulations.
In Proceedings of the 24th Annual ACM Symposium on Computational Geometry (SCG),
pages 148–155, 2008.