
CS264: Beyond Worst-Case Analysis

Lecture #19: Self-Improving Algorithms∗


Tim Roughgarden†
March 14, 2017

1 Preliminaries
The last few lectures discussed several interpolations between worst-case and average-case
analysis designed to identify robust algorithms in the face of strong impossibility results for
worst-case guarantees. This lecture gives another analysis framework that blends aspects of
worst- and average-case analysis. In today’s model of self-improving algorithms, an adversary
picks an input distribution, and then nature picks a sequence of i.i.d. samples from this
distribution. This model is relatively close to traditional average-case analysis, but with the
twist that the algorithm has to learn the input distribution (or a sufficient summary of it)
from samples. It is not hard to think of real-world applications where there is enough data
to learn over time an accurate distribution of future inputs (e.g., click-throughs on a major
Internet platform). The model and results are by Ailon, Chazelle, Comandur, and Liu [1].

The Setup. For a given computational problem, we posit a distribution over instances.
The difference between today's model and traditional average-case analysis, including our
studies of the planted clique and bisection problems, is that the distribution is initially
unknown. The goal is to design an algorithm that, given an online sequence of instances,
each an independent and identically distributed (i.i.d.) sample from the unknown
distribution, quickly converges to an algorithm that is optimal for the underlying
distribution.¹ Thus the algorithm is "automatically self-tuning." The challenge is to
accomplish this goal with fewer samples and less space than a brute-force approach.
* © 2009–2017, Tim Roughgarden.

† Department of Computer Science, Stanford University, 474 Gates Building, 353 Serra Mall, Stanford,
CA 94305. Email: [email protected].

¹ Remember that when there is a distribution over inputs, there is also a well-defined notion of an optimal
algorithm, namely one with the best-possible expected performance (where the expectation is over the input
distribution).

Main Example: Sorting. The obvious first problem to apply the self-improving paradigm
to is sorting in the comparison model, and that's what we do here.² Each instance is an
array of n elements, where n is fixed and known. The ith element x_i is drawn from an
unknown real-valued distribution D_i. An algorithm can glean information about the x_i's
only through comparisons; for example, the algorithm cannot access a given bit of the
binary representation of an element (as one would do in radix sort).

A key assumption is that the D_i's are independent distributions; Section 5.3 discusses
this assumption. The distributions need not be identical. Identical distributions are
uninteresting in our context, since in this case the relative order of the elements is a
uniformly random permutation. As we'll see later, every correct sorting algorithm requires
Ω(n log n) expected comparisons in this case, and a matching upper bound is achieved by
MergeSort (for example).

2 The Entropy Lower Bound


Since a self-improving algorithm is supposed to run eventually as fast as an optimal one
for the underlying distribution, we need to understand some things about optimal sorting
algorithms. In turn, this requires a lower bound on the expected running time of every
sorting algorithm with respect to a fixed distribution.
The distributions D_1, ..., D_n over x_1, ..., x_n induce a distribution Π(D) over
permutations of {1, 2, ..., n} via the ranks of the x_i's. (For simplicity, assume
throughout this lecture that there are no ties.) We'll see below that if Π(D) is (close to)
the uniform distribution over the set S_n of all permutations, then the worst-case
comparison-based sorting bound of Ω(n log n) also applies here in the average case. On the
other hand, sufficiently trivial distributions Π can obviously be sorted faster. For
example, if the support of Π involves only a constant number of permutations, these can be
distinguished in O(1) comparisons and then the appropriate permutation can be applied to
the input in linear time. More generally, the goal is to beat the Ω(n log n) sorting bound
when the distribution Π is "not too random"; there is, of course, an implicit hope that
"real data" can sometimes be well approximated by such a distribution.³ This will not
always be the case in practice, but sometimes it is a reasonable assumption (e.g., if the
input is usually already partially sorted).
The standard way to measure the “amount of randomness” of a distribution mathemat-
ically is by its entropy.

² Subsequent work extends these techniques and results to several problems in low-dimensional
computational geometry, including computing maxima (see Lecture #2), convex hulls, and Delaunay
triangulations.

³ For sorting, random data is the worst case and hence we propose a parameter to upper bound the amount
of randomness in the data. This is an interesting contrast with our lectures on smoothed analysis, where the
analysis framework imposes a lower bound on the amount of randomness in the input.

Definition 2.1 (Entropy of a Distribution) Let D = {p_x}_{x∈X} be a distribution over the
finite set X. The entropy H(D) of D is

    H(D) = Σ_{x∈X} p_x · log₂(1/p_x),    (1)

where we interpret 0 · log₂(1/0) as 0.
For example, H(D) = log₂|Y| if D is uniform over some subset Y ⊆ X. When Y is the set
S_n of all n! permutations of {1, 2, ..., n}, this is Θ(n log n). On Homework #10 you will show
that if D puts positive probability on at most 2^h different elements of X, then H(D) ≤ h.
That is, for a given support Y ⊆ X of a distribution, the most random distribution with
support in Y is the uniform distribution over Y.
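Definition 2.1 is easy to check numerically. The following is a small illustrative sketch (the function name is ours, not from the lecture):

```python
import math

def entropy(dist):
    """H(D) = sum over x of p_x * log2(1/p_x), reading 0 * log2(1/0) as 0."""
    return sum(p * math.log2(1.0 / p) for p in dist if p > 0)

# Uniform over an 8-element set Y: H = log2 |Y| = 3 bits.
assert abs(entropy([1.0 / 8] * 8) - 3.0) < 1e-9
# A point mass is the least random distribution: zero entropy.
assert entropy([1.0, 0.0, 0.0]) == 0.0
```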
Happily, we won’t have to work with the formula (1) directly. Instead, we use Shannon’s
characterization of entropy in terms of average coding length.

Theorem 2.2 (Shannon's Theorem) For every distribution D over the set X, the entropy H(D)
characterizes (up to an additive +1 term) the minimum possible expected encoding length
of X, where a code is an injective function from X to {0, 1}* and the expectation is with
respect to D.
Proving this theorem would take us too far afield, but it is accessible and you should look it
up (see also Homework #10). The upper bound is already achieved by the Huffman codes
that you may have studied in CS161. Intuitively, if entropy is “defined correctly,” then we
should expect the lower bound to hold—entropy is the average number of bits of information
provided by a sample, and it makes sense that encoding this information would require this
many bits.
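To make Shannon's characterization concrete, here is a small sketch that computes the average codeword length of a Huffman code and compares it with the entropy; the helpers are our own illustrations, not part of the lecture:

```python
import heapq
import math

def huffman_avg_length(probs):
    """Average codeword length of a Huffman code for the given distribution.

    Each heap entry is (total probability, probability-weighted depth so far);
    merging two subtrees lengthens every codeword inside them by one bit.
    """
    heap = [(p, 0.0) for p in probs]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, l1 = heapq.heappop(heap)
        p2, l2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, l1 + l2 + p1 + p2))
    return heap[0][1]

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Shannon's theorem: H(D) <= expected code length < H(D) + 1.
D = [0.4, 0.3, 0.2, 0.1]
assert entropy(D) <= huffman_avg_length(D) < entropy(D) + 1
```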
A simple but important observation is that a correct comparison-based sorting algorithm
induces a binary encoding of the set S_n of permutations of {1, 2, ..., n}. To see this,
recall that such an algorithm can be represented as a tree (Figure 1), with each comparison
corresponding to a node with two children, representing the subsequent execution of the
algorithm as a function of the comparison outcome. At a leaf, where the algorithm
terminates, correctness implies that the input permutation has been uniquely identified.
Label each left branch with a "0" and each right branch with a "1." By correctness, each
permutation of S_n occurs at at least one leaf; if it appears at multiple leaves, pick one
closest to the root. The sequence of 0's and 1's on a root-leaf path encodes the
permutation. If the x_i's are drawn from distributions D = D_1, D_2, ..., D_n, with induced
distribution Π(D) over permutations (leaves), then the expected number of comparisons of
the sorting algorithm is at least the expected length of the corresponding encoding, which
by Shannon's Theorem is at least H(Π(D)).

Upshot: Our goal should be to design a correct sorting algorithm that, for every distri-
bution D, quickly converges to the optimal per-instance expected running time of

O(n + H(Π(D))). (2)


Figure 1: Every correct comparison-based sorting algorithm induces a decision tree, which
induces a binary encoding of the set Sn of permutations on n elements.

The second term is the necessary expected number of comparisons, and the first term is the
time required to write the output. The rest of this lecture provides such an algorithm.
For example, suppose each x_i takes on one of two possible values. Then the support size of
the induced distribution Π(D) is at most 2^n, and (2) then insists that the algorithm should
converge to a per-instance expected running time of O(n).

3 The Basic Algorithm


3.1 High-Level Approach
Our self-improving algorithm is inspired by BucketSort, a standard method of sorting when
you have good statistics about the data. For example, if the input elements are i.i.d. samples
from the uniform distribution on [0, 1], then one can have an array of n buckets, with the
ith bucket meant for numbers between (i − 1)/n and i/n. A linear pass through the input
is enough to put elements in their rightful buckets; assuming each bucket contains O(1)
elements, one can quickly sort each bucket separately and concatenate the results. The total
running time in this case would be O(n). Note that this sorting algorithm is very much
not comparison-based, since the high-order bits of an element are used to place it in the
appropriate bucket in O(1) time.
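For intuition, the BucketSort idea described above can be sketched as follows, assuming inputs drawn uniformly from [0, 1) (the function name is illustrative):

```python
import random

def bucket_sort_uniform(xs):
    """Sort n numbers in [0, 1), assuming they are roughly uniform.

    Element x goes to bucket floor(n * x); with uniform inputs each bucket
    holds O(1) elements in expectation, so the total expected time is O(n).
    """
    n = len(xs)
    buckets = [[] for _ in range(n)]
    for x in xs:                      # one linear pass to place elements
        buckets[min(int(n * x), n - 1)].append(x)
    out = []
    for b in buckets:                 # sort each small bucket, then concatenate
        out.extend(sorted(b))
    return out

xs = [random.random() for _ in range(1000)]
assert bucket_sort_uniform(xs) == sorted(xs)
```

Note that placing an element uses its numeric value directly, which is exactly the non-comparison step unavailable to a self-improving sorter.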
There are a couple of challenges in making this idea work for a self-improving algorithm.
First, since we know nothing about the underlying distribution, what should the buckets be?
Second, even if we have the right buckets, how can we quickly place the input elements into
the correct buckets? (Recall that unlike traditional BucketSort, here we’re working in the

comparison model.) One could of course use binary search to place each element in O(log n)
time into the right bucket, but we cannot afford to do this if H(Π(D)) = o(n log n).

Figure 2: Construction of the V-list. After merging the elements of λ random instances,
take every λth element.

3.2 Phase I: Constructing the V -List


Our first order of business is identifying good buckets, where "good" means that on future
(random) instances the expected size of each bucket is O(1). (Actually, we'll need something
a little stronger than this; see Lemma 3.1.)
Set λ = c log n for a sufficiently large constant c. Our self-improving algorithm will, in
its ignorance, sort the first λ instances using (say) MergeSort to guarantee a run time of
O(n log n) per instance. At the same time, however, our algorithm will surreptitiously build
a sorted “master list” L of the λn = Θ(n log n) corresponding elements. [Easy exercise:
this can be done without affecting the O(n log n) per iteration time bound.] Our algorithm
defines the set V ⊂ L — the “V -list” — as every λth element of L (Figure 2), for a total of
n overall. The elements in V are our “bucket boundaries,” and they split the real line into
n + 1 buckets in all.
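The Phase I bookkeeping (sort the first λ instances, merge them into a master list L, take every λth element) can be sketched directly; this illustrates the construction, not the way the real algorithm amortizes the work across iterations:

```python
import heapq

def build_v_list(training_instances):
    """Build the V-list from lam = len(training_instances) training inputs.

    Merges the lam sorted instances into a master list L of lam * n elements
    and returns every lam-th element: n bucket boundaries, splitting the real
    line into n + 1 buckets.
    """
    lam = len(training_instances)
    # The real algorithm sorts these instances anyway (e.g., via MergeSort).
    master = list(heapq.merge(*(sorted(inst) for inst in training_instances)))
    return master[lam - 1 :: lam]
```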
The next lemma justifies our V -list construction by showing that, with high probability
over the choice of V , expected bucket sizes (squared, even) of future instances have constant
size. This ensures that we won’t waste any time sorting the elements within a single bucket.
For a fixed choice of V and a bucket i (between the ith and (i + 1)th elements of V ),
let mi denote the (random) number of elements of an instance that fall in the ith bucket.

Lemma 3.1 (Elements Are Evenly Distributed) Fix distributions D = D_1, ..., D_n.
With probability at least 1 − 1/n² over the choice of V,

    E_D[m_i²] ≤ 20

for every bucket i, where the expectation is over a (new) random sample from the input
distribution D.
The point of the lemma is that, after we’ve assigned all of the input elements to the
correct buckets, we only need O(n) additional time to complete the job (sorting each bucket
using InsertionSort (say) and concatenating the results). We now turn to its proof.⁴

⁴ We skipped over this proof in lecture.

Proof of Lemma 3.1: The basic reason the lemma is true, and the reason for our choice of λ,
is that the Chernoff bound has the form

    Pr[X < (1 − δ)µ] ≈ e^{−µδ²},    (3)

where we are ignoring constants in the exponent on the right-hand side, and where µ is
defined as E[X].
at least logarithmic, even constant-factor deviations from the expectation are highly unlikely.
To make use of this fact, consider the set S of λn elements that belong to the first λ
inputs, from which we draw V. Consider fixed indices k, ℓ ∈ {1, 2, ..., λn}, and define
E_{k,ℓ} as the (bad) event (over the choice of S) that

(i) the kth and ℓth elements take on a pair of values a, b ∈ R for which E_D[m_{ab}] ≥ 4λ,
where m_{ab} denotes the number of elements of λ random inputs (i.i.d. samples from D)
that lie between a and b; and

(ii) at most λ other elements of S lie between a and b.

That is, E_{k,ℓ} is the event that the kth and ℓth samples are pretty far apart and yet
there is an unusually sparse population of other samples between them.

We claim that for all k, ℓ, Pr[E_{k,ℓ}] is very small, at most 1/n⁵. To see this, fix k, ℓ
and condition on the event that (i) occurs, and on the corresponding values of a and b. Even
then, the Chernoff bound (3) implies that the (conditional) probability that (ii) occurs is
at most 1/n⁵, provided the constant c in the definition of λ is sufficiently large. (We can
take µ = 4c log n and δ = 1/2, for example.) Taking a union bound over the at most (λn)²
choices of k, ℓ, we find that Pr[∨_{k,ℓ} E_{k,ℓ}] ≤ 1/n². This implies that, with
probability at least 1 − 1/n², for every pair a, b ∈ S of samples with fewer than λ other
samples between them, the expected number of elements of λ future random instances that lie
between a and b is at most 4λ. This guarantee applies in particular to consecutive bucket
boundaries, since only λ − 1 samples separate them.

To recap, with probability at least 1 − 1/n² (over S), for every bucket B defined by two
consecutive boundaries a and b, the expected number of elements of λ independent random
inputs that lie in B is at most 4λ. By linearity, the expected number of elements of a
single random input that lands in B is at most 4.
To finish the proof, consider an arbitrary bucket Bi and let Xj be the random variable
indicating whether or not the jth element of a random instance lies in Bi . We have the
following upper bound on the expected squared size m_i² of bucket i:

    E[(X_1 + X_2 + ··· + X_n)²] = Σ_j E[X_j²] + 2 Σ_{j<h} E[X_j] · E[X_h]
                                ≤ Σ_j E[X_j] + ( Σ_j E[X_j] )²,

where the equality uses linearity of expectation and the independence of the X_j's, and the
inequality uses the fact that the X_j's are 0-1 random variables. With probability at least
1 − 1/n², Σ_j E[X_j] ≤ 4, and hence this upper bound is at most 4 + 16 = 20, simultaneously
for every bucket B_i. □

Figure 3: Every correct comparison-based searching algorithm induces a decision tree.

3.3 Interlude: Optimal Bucket Classification


The next challenge is, having identified good buckets, how do we quickly classify which
element of a random instance belongs to which bucket? Remember that this problem is
non-trivial because we need to compete with an entropy bound, which can be small even for
quite non-trivial distributions over permutations.
To understand the final algorithm, it is useful to first cheat and assume that the distri-
butions Di are known. (This is not the case, of course, and our analysis of Phase I does
not assume this.) In this case, we might as well proceed in the optimal way: that is, for
a given element xi from distribution Di , we ask comparisons between xi and the bucket
boundaries in the way that minimizes the expected number of comparisons. For example,
if xi lies in bucket #17 90% of the time, then our first two comparisons should be between
xi and the two bucket boundaries that define bucket #17 (most of the time, we can then
stop immediately). As with sorting (Figure 1), comparison-based algorithms for searching
can be visualized as decision trees (Figure 3), where each leaf corresponds to the (unique)
bucket to which the given element can belong given the results of the comparisons. Unlike
sorting, however, it can be practical to explicitly construct these trees for optimal searching.
For starters, they have only O(n) size, as opposed to the Ω(n!) size trees that are generally
required for optimal sorting.
Precisely, let Bi denote the distribution on buckets induced by the distribution Di . (This
is with respect to a choice of V , which is now fixed forevermore.) Computing the optimal
search tree Ti for a given distribution Bi is then bread-and-butter dynamic programming.

The key recurrence is

    E[T_i] = 1 + min_{j=1,2,...,n} { E[T_i^{1,2,...,j−1}] + E[T_i^{j+1,j+2,...,n}] },

where T_i^{a,...,b} denotes the optimal search tree for locating x_i with respect to the
bucket boundaries a, a + 1, ..., b. In English, this recurrence says that if you knew the
right comparison to ask first, then you would solve the subsequent subproblems optimally.
The dynamic program effectively tries all n possibilities for the first comparison. There
are O(n²) subproblems (one per contiguous subset of {1, 2, ..., n}) and the overall running
time is O(n³). Details are left to Homework #10. For a harder (optional) exercise, try to
improve the running time to O(n²) by exploiting extra structure in the dynamic program
(an old trick of Knuth).
Given such a solution, the total time needed to construct all n trees — a different one for
each Di , of course — is O(n3 ). We won’t worry about how to account for this work until
the end of the lecture.
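The dynamic program can be sketched as follows, written in the standard weighted optimal-search-tree form over the induced bucket probabilities (each comparison costs 1 for every element routed through it; the memoized formulation and names are ours):

```python
from functools import lru_cache

def optimal_tree_cost(q):
    """Expected number of comparisons of an optimal search tree.

    q[k] is the probability that the element lands in bucket k, for
    buckets 0..n separated by boundaries 1..n. A range (i, j) covers
    boundaries i..j together with the buckets between and around them.
    """
    n = len(q) - 1
    prefix = [0.0]                     # prefix sums for range probability mass
    for x in q:
        prefix.append(prefix[-1] + x)

    @lru_cache(maxsize=None)
    def cost(i, j):
        if i > j:                      # empty range: a lone leaf, no comparison
            return 0.0
        w = prefix[j + 1] - prefix[i - 1]   # mass of buckets i-1 .. j
        # try every boundary r as the first comparison; everyone in the
        # range pays for it, hence the additive w
        return w + min(cost(i, r - 1) + cost(r + 1, j) for r in range(i, j + 1))

    return cost(1, n)
```

With O(n²) ranges and O(n) choices of root per range, this is the O(n³) bound from the text.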
Since binary encodings of the buckets {0, 1, 2, . . . , n} and decision trees for searching are
essentially in one-to-one correspondence (with bits corresponding to comparison outcomes),
Shannon’s Theorem implies that the expected number of comparisons used to classify xi via
the optimal search tree Ti is essentially the entropy H(Bi ) of the corresponding distribution
on buckets. We relate this to the entropy that we actually care about (that of the distribution
on permutations) in Section 4.

3.4 Phase II: The Steady State


For the moment we continue to assume that the distributions Di are known. After the V -list
is constructed in Phase I and the optimal search trees T1 , . . . , Tn over buckets are built in
the Interlude phase, the self-improving algorithm runs as shown in Figure 4 for every future
random instance.

Input: a random instance x1 , . . . , xn , with each xi drawn independently from Di .

1. For each i, use Ti to put xi into the correct bucket.

2. Sort each bucket (e.g., using InsertionSort).

3. Concatenate the sorted buckets and return the result.

Figure 4: The steady state of the basic self-improving sorter.
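The three steps of Figure 4 can be sketched end to end; here plain binary search over V stands in for the optimal trees T_i (so step 1 costs O(log n) per element rather than the entropy-optimal bound), which keeps the illustration self-contained:

```python
import bisect

def steady_state_sort(xs, v_list):
    """One steady-state iteration: bucket, sort buckets, concatenate.

    v_list must be sorted; element x belongs to bucket
    bisect_left(v_list, x), one of the len(v_list) + 1 buckets.
    """
    buckets = [[] for _ in range(len(v_list) + 1)]
    for x in xs:                              # step 1: locate each bucket
        buckets[bisect.bisect_left(v_list, x)].append(x)
    out = []
    for b in buckets:                         # step 2: sort each small bucket
        b.sort()
    for b in buckets:                         # step 3: concatenate
        out.extend(b)
    return out
```

Because the bucket index is monotone in x, concatenating the sorted buckets yields a sorted output.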

4 Running Time Analysis of the Basic Algorithm


We have already established that Phase I runs in O(n log n) time per iteration, and that the
Interlude requires O(n3 ) computation (to be refined in Section 5.2). As for Phase II, it is

obvious that the third step can be implemented in O(n) time (with probability 1, for any
distribution). The expected running time of the second step of Phase II is O(n). This
follows easily from Lemma 3.1: with probability at least 1 − 1/n² over the choice of V in
Phase I, the expectation of the squared size of every bucket is at most 20, so InsertionSort
runs in O(1) expected time on each of the O(n) buckets. For the remaining probability of at
most 1/n², we can use the O(n²) worst-case running time bound for InsertionSort; even then,
this exceptional case contributes only O(1) to the expected running time of the
self-improving sorter.

For the first step, we have already argued that the expected running time is proportional
to Σ_{i=1}^{n} H(B_i), where B_i is the distribution on buckets induced by D_i (because we
use optimal binary search trees). The next lemma shows that this quantity is no more than
the target running time bound of O(n + H(Π(D))).

Lemma 4.1 For every choice of V in Phase I,

    Σ_{i=1}^{n} H(B_i) = O(n + H(Π(D))).

Proof: Fix an arbitrary choice of bucket boundaries V. By Shannon's Theorem, we only
need to exhibit a binary encoding of the buckets b_1, ..., b_n to which x_1, ..., x_n
belong, such that the expected coding length (over D) is O(n + H(Π(D))).⁵

As usual, our encoding consists of the comparison results of an algorithm, which we
define for the purposes of analysis only. Given x_1, ..., x_n, the first step is to sort
the x_i's using an optimal (for D) sorting algorithm. Such an algorithm exists (in
principle), and Shannon's Theorem implies that it uses ≈ H(Π(D)) comparisons on average.
The second step is to merge the sorted list of x_i's together with the bucket boundaries V
(which are also sorted). Merging these two lists requires O(n) comparisons, the results of
which uniquely identify the correct buckets for all of the x_i's: the last two bucket
boundaries to which x_i is compared are the left and right endpoints of its bucket. Thus
the bucket memberships of all of x_1, ..., x_n can be uniquely reconstructed from the
results of the comparisons. This means that the comparison results can be interpreted as a
binary encoding of the buckets to which the x_i's belong, with expected length
O(n + H(Π(D))). □

5 Extensions
5.1 Optimizing the Space
We begin with a simple and clever optimization that also segues into the next extension,
which is about generalizing the basic algorithm to unknown distributions.
⁵ Actually, this only provides an upper bound on the entropy of the joint distribution
H(B_1 × B_2 × ··· × B_n) of the buckets, rather than on Σ_i H(B_i). But it is an intuitive
and true fact that the entropies of independent random variables add (roughly equivalently,
to encode all of them, you may as well encode each separately). So
Σ_{i=1}^{n} H(B_i) = H(B_1 × ··· × B_n), and it is enough to exhibit a single encoding of
b_1, ..., b_n.

The space required by our self-improving sorter is Θ(n²), for the n optimal search trees
(the T_i's). Assuming still that the D_i's are known, suppose that we truncate each T_i
after level ε log n, for some ε > 0. These truncated trees require only O(n^ε) space each,
for a total of O(n^{1+ε}). What happens now when we search for x_i's bucket in T_i and fall
off the bottom of the truncated tree? We just locate the correct bucket by standard binary
search! This takes O(log n) time, and we would have had to spend at least ε log n time
searching for it in the original tree T_i anyway. Thus the price we pay for maintaining
only the truncated versions of the optimal search trees is a 1/ε blow-up in the expected
bucket location time, which is a constant-factor loss for any fixed ε > 0. Conceptually,
the big gains from using an optimal search tree instead of standard binary search occur at
leaves that are at very shallow levels (and presumably are visited quite frequently).
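Truncation can be sketched as follows: follow the search tree for at most about ε log₂ n comparisons, then fall back to binary search on overflow. The nested-tuple tree encoding and parameter names here are our own illustrative choices:

```python
import bisect
import math

def truncated_search(x, tree, v_list, eps=0.5):
    """Locate x's bucket: follow `tree` for at most ~eps*log2(n) comparisons,
    then fall back to binary search over the sorted boundaries v_list.

    `tree` is a hypothetical encoding: either a bucket index (leaf), or a
    triple (boundary_index, left_subtree, right_subtree).
    """
    n = len(v_list)
    depth_cap = max(1, int(eps * math.log2(max(n, 2))))
    for _ in range(depth_cap):
        if not isinstance(tree, tuple):      # reached a leaf: bucket found
            return tree
        b, left, right = tree
        tree = left if x < v_list[b] else right
    if not isinstance(tree, tuple):
        return tree
    # fell off the truncated tree: standard O(log n) binary search
    return bisect.bisect_left(v_list, x)
```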

5.2 Unknown Distributions


The elephant in the room is that the Interlude and Phase II of our self-improving sorter
currently assume that the distributions D are known a priori, while the whole point of the
self-improving paradigm is to design algorithms that work well for unknown distributions.
The fix is the obvious one: we build (near-optimal) search trees using empirical distributions,
based on how frequently the ith element lands in the various buckets.
More precisely, the general self-improving sorter is defined as follows. Phase I is defined
as before and uses O(log n) training phases to identify the bucket boundaries V . The new
Interlude uses further training phases to estimate the Bi ’s (the distributions of the xi ’s over
the buckets).
The analysis in Section 5.1 suggests that, for each i, only the ≈ n^ε most frequent bucket
locations for x_i are actually needed to match the entropy lower bound (up to a constant
factor). For V and i fixed, call a bucket frequent if the probability that x_i lands in it
is at least 1/n^ε, and infrequent otherwise. An extra Θ(n^ε log n) training phases suffice
to get accurate estimates of the probabilities of frequent buckets (for all i): after this
point one expects Ω(log n) appearances of x_i in every frequent bucket for i, and using
Chernoff bounds as in Section 4 implies that all of the empirical frequency counts are
close to their expectations (with high probability). The algorithm then builds a search
tree T̂_i for i using only the buckets (if any) in which Ω(log n) samples of x_i landed.
This involves O(n^ε) buckets and can be done in O(n^{2ε}) time and O(n^ε) space.
Homework #10 shows that the T̂_i's are essentially as good as the truncated optimal search
trees of Section 5.1, with buckets outside the trees being located via standard binary
search, so that the first step of Phase II of the self-improving sorter continues to have
an expected running time of O(Σ_i H(B_i)) = O(n + H(Π(D))) (for constant ε). Finally, the
second and third steps of Phase II of the self-improving sorter obviously continue to run
in O(n) expected and O(n) worst-case time, respectively.
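The counting step of the new Interlude can be sketched directly; the `threshold` argument stands in for the sample-count cutoff described above, and the names are illustrative:

```python
import bisect
from collections import Counter

def frequent_buckets(samples_i, v_list, threshold):
    """Estimate coordinate i's frequent buckets from training samples.

    samples_i: observed values of x_i across training instances.
    Returns {bucket: empirical probability} for buckets hit at least
    `threshold` times; all other (infrequent) buckets are left to the
    binary-search fallback.
    """
    counts = Counter(bisect.bisect_left(v_list, x) for x in samples_i)
    m = len(samples_i)
    return {b: c / m for b, c in counts.items() if c >= threshold}
```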

5.3 Beyond Independent Distributions


The assumption that the Di ’s are independent distributions is strong. Some assumption is
needed, however, as a self-improving sorter for arbitrary distributions Π over permutations

provably requires exponential space (Homework #10). Intuitively, there are too many
fundamentally distinct distributions that need to be distinguished. An interesting open
question is to find an assumption weaker than (or incomparable to) independence that is
strong enough to allow interesting positive results. The reader is encouraged to go back
over the analysis above and identify all of the (many) places where we used the
independence of the D_i's.

Figure 5: A Delaunay triangulation in the plane with circumcircles shown.

5.4 Delaunay Triangulations⁶


Clarkson and Seshadhri [2] give a non-trivial extension of the algorithm and analysis in [1], to
the problem of computing the Delaunay triangulation of a point set. The input is n points in
the plane, where each point xi is an independent draw from a distribution Di . One definition
of a Delaunay triangulation is that, for every face of the triangulation, the circle that goes
through the three corners of the face encloses no other points of the input (see Figure 5
for an example and the textbook [3, Chapter 9] for much more on the problem). The main
result in [2] is again an optimal self-improving algorithm, with steady-state expected running
time O(n + H(∆(D))), where H(∆(D)) is the suitable definition of entropy for the induced
distribution ∆(D) over triangulations. The algorithm is again an analog of BucketSort, but
a number of the details are challenging. For example, while the third step of Phase II of the
self-improving sorter — concatenating the sorted results from different buckets — is trivially
linear-time, it is much less obvious how to combine Delaunay triangulations of constant-size
⁶ We did not cover this in lecture.
“buckets” into one for the entire point set. It can be done, however; see [2].

5.5 Further Problems


It is an open question to design self-improving algorithms for problems beyond sorting and
low-dimensional computational geometry (convex hulls, maxima, Delaunay triangulations).
The paradigm may be limited to problems where there are instance-optimality results (Lec-
ture #2). The reason is that proving the competitiveness of a self-improving algorithm seems
to require understanding lower bounds on algorithms (with respect to an input distribution),
and such lower bounds tend to be known only for problems with near-linear worst-case run-
ning time.

6 Final Scorecard
To recap, we designed a self-improving sorting algorithm that is simultaneously competitive
(up to a constant factor) with the optimal sorting algorithm for every input distribution
with independent array elements. The algorithm's requirements are (for a user-specified
constant ε > 0):

• Θ(n^ε log n) training samples (O(log n) to pick good bucket boundaries, the rest to
estimate frequent buckets);

• Θ(n^{1+ε}) space (n search trees with O(n^ε) nodes each);

• Θ(n^{1+2ε}) time to construct the search trees T̂_i following the training phase;⁷

• O(n log n) comparisons (worst-case) per training input;

• O(n + H(Π(D))) comparisons (expected) per input after the T̂_i's have been built (with
high probability over the first Θ(log n) training samples).

We note that these requirements are much more reasonable than for an algorithm that
attempts to explicitly learn the distribution Π(D) over permutations (which could take
space and a number of samples exponential in n).

⁷ This computation can either be absorbed as a one-time cost, or spread out over another
O(n^{2ε}) training samples to keep the per-sample computation at O(n log n).

References

[1] N. Ailon, B. Chazelle, S. Comandur, and D. Liu. Self-improving algorithms. In
Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
261–270, 2006.

[2] K. L. Clarkson and C. Seshadhri. Self-improving algorithms for Delaunay triangulations.
In Proceedings of the 24th Annual ACM Symposium on Computational Geometry (SCG),
pages 148–155, 2008.

[3] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry:
Algorithms and Applications. Second Edition. Springer, 2000.
