A Reliable Randomized Algorithm For The Closest-Pair Problem
Martin Dietzfelbinger
Fachbereich Informatik
Universität Dortmund
D-44221 Dortmund, Germany

Torben Hagerup
Max-Planck-Institut für Informatik
Im Stadtwald
D-66123 Saarbrücken, Germany

Jyrki Katajainen
Datalogisk Institut
Københavns Universitet
Universitetsparken 1
DK-2100 København Ø, Denmark

Martti Penttonen
Tietojenkäsittelytieteen laitos
Joensuun yliopisto
PL 111
FIN-80101 Joensuu, Finland
Partially supported by the ESPRIT Basic Research Actions Program of the EC under
contract No. 7141 (project ALCOM II).
Partially supported by the Academy of Finland under contract No. 1021129 (project
Efficient Data Structures and Algorithms).
1 Introduction
The closest-pair problem is often introduced as the first nontrivial proximity problem
in computational geometry; see, e. g., [26]. In this problem we are given a
collection of n points in d-dimensional space, where d ≥ 1 is a fixed integer, and a
metric specifying the distance between points. The task is to find a pair of points
whose distance is minimal. We assume that each point is represented as a d-tuple
of real numbers, or of integers in a fixed range, and that the distance measure is
the standard Euclidean metric.
In his seminal paper on randomized algorithms, Rabin [27] proposed an algorithm
for solving the closest-pair problem. The key idea of the algorithm is to
determine the minimal distance δ₀ within a random sample of points. When the
points are grouped according to a grid with resolution δ₀, the points of a closest
pair fall in the same cell or in neighboring cells. This considerably decreases the
number of possible closest-pair candidates from the total of n(n − 1)/2. Rabin
proved that with a suitable sample size the total number of distance calculations
performed will be of order n with overwhelming probability.
A question that was not solved satisfactorily by Rabin is how the points are
grouped according to a δ₀-grid. Rabin suggested that this could be implemented
by dividing the coordinates of the points by δ₀, truncating the quotients to
integers, and hashing the resulting integer d-tuples. Fortune and Hopcroft [15],
in their more detailed examination of Rabin's algorithm, assumed the existence
of a special operation findbucket(δ₀, p), which returns an index of the cell into
which the point p falls in some fixed δ₀-grid. The indices are integers in the range
{1, . . . , n}, and distinct cells have distinct indices.
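For illustration, the bucketing step just described can be sketched in Python as follows; the names cell_of and find_bucket are ours, and the dictionary plays the role of the hashing step, which is precisely the part that needs care in the RAM model:

    def cell_of(point, delta):
        # Divide each coordinate by delta and truncate to an integer:
        # the resulting tuple identifies the grid cell of the point.
        return tuple(int(c // delta) for c in point)

    def find_bucket(delta, points):
        # Assign to each distinct cell an index in {1, ..., n} and
        # return the bucket index of every point.
        index_of_cell = {}
        buckets = []
        for p in points:
            cell = cell_of(p, delta)
            if cell not in index_of_cell:
                index_of_cell[cell] = len(index_of_cell) + 1
            buckets.append(index_of_cell[cell])
        return buckets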
On a real RAM (for the definition see [26]), where the generation of random
numbers, comparisons, arithmetic operations from {+, −, ∗, /, √}, and
findbucket require unit time, Rabin's random-sampling algorithm runs in O(n)
expected time [27]. (Under the same assumptions the closest-pair problem can
even be solved in O(n log log n) time in the worst case, as demonstrated by Fortune
and Hopcroft [15].) We next introduce terminology that allows us to characterize
the performance of Rabin's algorithm more closely. Every execution of a randomized
algorithm succeeds or fails. The meaning of failure depends on the context,
but an execution typically fails if it produces an incorrect result or does not finish
in time. We say that a randomized algorithm is exponentially reliable if, on inputs
of size n, its failure probability is bounded by 2^{-n^ε} for some fixed ε > 0.
2.2 A universal class of multiplicative hash functions

Let k and ℓ be integers with 1 ≤ ℓ ≤ k. We consider the class

H_{k,ℓ} = {h_a | 0 < a < 2^k and a is odd}, where h_a is defined by

h_a(x) = (ax mod 2^k) div 2^{k−ℓ}, for 0 ≤ x < 2^k.

Each h_a maps {0, . . . , 2^k − 1} to {0, . . . , 2^ℓ − 1}.
The class H_{k,ℓ} contains 2^{k−1} (distinct) hash functions. Since we assume that on
the RAM model a random number can be generated in constant time, a function
from H_{k,ℓ} can be chosen at random in constant time, and functions from H_{k,ℓ}
can be evaluated in constant time on a RAM with arithmetic operations from
{+, −, ∗, div} (for this, 2^k and 2^{k−ℓ} should be precomputed).

Lemma 2.4 Let k and ℓ be integers with 1 ≤ ℓ ≤ k, and let x and y be two distinct
elements of {0, . . . , 2^k − 1}. If h is chosen at random from H_{k,ℓ}, then
Prob(h(x) = h(y)) ≤ 1/2^{ℓ−1}.

Proof. Let A denote the set of odd numbers in {1, . . . , 2^k − 1}, so that
H_{k,ℓ} = {h_a | a ∈ A} and |A| = 2^{k−1}. A simple calculation shows that
h_a(x) = h_a(y) can happen only if

a(x − y) mod 2^k ∈ {1, . . . , 2^{k−ℓ} − 1} ∪ {2^k − 2^{k−ℓ} + 1, . . . , 2^k − 1}. (2.1)

Assume x > y and write x − y = z2^s, where z is odd and 0 ≤ s < k. Since z is odd,
the mapping a ↦ az mod 2^k is a permutation of A. Consequently, the mapping

a2^s ↦ az2^s mod 2^{k+s}

is a permutation of the set {a2^s | a ∈ A}. Thus, the number of a ∈ A that
satisfy (2.1) is the same as the number of a ∈ A that satisfy

a2^s mod 2^k ∈ {1, . . . , 2^{k−ℓ} − 1} ∪ {2^k − 2^{k−ℓ} + 1, . . . , 2^k − 1}. (2.2)

Now, a2^s mod 2^k is just the number whose binary representation is given by the
k − s least significant bits of a, followed by s zeroes. This easily yields the following.
If s ≥ k − ℓ, no a ∈ A satisfies (2.2). For smaller s, the number of a ∈ A satisfying
(2.2) is at most 2^{k−ℓ}. Hence the probability that a randomly chosen a ∈ A satisfies
(2.1) is at most 2^{k−ℓ}/2^{k−1} = 1/2^{ℓ−1}.
Remark 2.5 The lemma says that the class H_{k,ℓ} of multiplicative hash functions
is 2-universal in the sense of [24, p. 140] (this notion slightly generalizes that of [8]).
As discussed in [21, p. 509] (the multiplicative hashing scheme), the functions
in this class are particularly simple to evaluate, since the division and the modulo
operation correspond to selecting a segment of the binary representation of the
product ax, which can be done by means of shifts. Other universal classes use
functions that involve division by prime numbers [8, 14], arithmetic in finite fields
[8], matrix multiplication [8], or convolution of binary strings over the two-element
field [22], i. e., operations that are more expensive than multiplications and shifts
unless special hardware is available.
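To make Remark 2.5 concrete, the following Python sketch chooses and evaluates a function from H_{k,ℓ} with one multiplication, one mask, and one shift; the names are ours, and Python's arbitrary-precision integers stand in for k-bit machine words:

    import random

    def random_hash(k, l):
        # Pick a random odd multiplier a with 0 < a < 2^k, i.e., a random
        # h_a from H_{k,l}; the mod and div of the definition become a
        # mask and a shift.
        a = random.randrange(1, 2 ** k, 2)
        mask = (1 << k) - 1
        def h(x):
            return ((a * x) & mask) >> (k - l)
        return h

    h = random_hash(20, 8)   # hash 20-bit keys to 8-bit values
    print(h(12345), h(54321))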
It is worth noting that the class H_{k,ℓ} of multiplicative hash functions may be
used to improve the efficiency of the static and dynamic perfect-hashing schemes
described in [14] and [12], in place of the functions of the type x ↦ (ax mod
p) mod m, for a prime p, which are used in these papers, and which involve
integer division. For an experimental evaluation of this approach, see [18]. In
another interesting development, Raman [29] has shown that the so-called method
of conditional probabilities can be used to obtain a function in H_{k,ℓ} with desirable
properties (few collisions) in a deterministic manner (previously known deterministic
methods for this purpose use exhaustive search in suitable probability
spaces [14]); this allowed him to derive an efficient deterministic scheme for the
construction of perfect hash functions.
The following is a well-known property of universal classes.
Lemma 2.6 Let n, k and ℓ be positive integers with ℓ ≤ k and let S be a set of
n integers in the range {0, . . . , 2^k − 1}. Choose h ∈ H_{k,ℓ} at random. Then

Prob(h is 1-1 on S) ≥ 1 − n²/2^ℓ.

Proof. By Lemma 2.4,

Prob(h(x) = h(y) for some distinct x, y ∈ S) ≤ (n(n − 1)/2) · (1/2^{ℓ−1}) ≤ n²/2^ℓ.
2.3 Duplicate grouping via universal hashing
Having provided the universal class H_{k,ℓ}, we are now ready to describe our first
randomized duplicate-grouping algorithm.

Theorem 2.7 Let U ≥ 2 be known and a power of 2 and let α ≥ 1 be an arbitrary
integer. The duplicate-grouping problem for a multiset of n integers in the range
{0, . . . , U − 1} can be solved stably by a conservative randomized algorithm that
needs O(n) space and O(n) time on a unit-cost RAM with arithmetic operations
from {+, −, ∗, div}; the probability that the time bound is exceeded is bounded by
n^{-α}.

Proof. The algorithm works in four steps. First, write 2^k = U and let ℓ be minimal
with 2^ℓ ≥ n^{α+2}; if ℓ ≥ k, the input can be grouped directly by radix sort
(Fact 2.1), so assume ℓ < k. Second, a hash function h is chosen at random from
H_{k,ℓ}, and the hash value h(x) ∈ {0, . . . , 2^ℓ − 1} is computed for each x ∈ S. Third,
the resulting pairs (x, h(x)), where x ∈ S, are sorted by radix sort (Fact 2.1)
according to their second components. Fourth, it is checked whether all elements
of S that have the same hash value are in fact equal. If this is the case, the third
step has produced the correct result; if not, the whole input is sorted, e. g., with
mergesort.

The hash function h is 1-1 on the distinct elements of S with probability at least
1 − n²/2^ℓ ≥ 1 − n^{-α} by Lemma 2.6. In case the final check indicates that the outcome of the third step
is incorrect, the call of mergesort produces a correct output in O(n log n) time,
which does not impair the linear expected running time. The space requirements
of the algorithm are dominated by those of the sorting subroutines, which need
O(n) space. Since both radix sort and mergesort rearrange the elements stably,
duplicate grouping is performed stably. It is immediate that the algorithm is
conservative and that the number of random bits needed is k − 1 < log₂ U.
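Under simplifying assumptions, the four steps of this proof can be sketched in Python; the built-in stable sort stands in for both radix sort and mergesort, and all names are ours:

    import random

    def duplicate_grouping(xs, k, alpha=1):
        # Stably rearrange the integers xs (from {0, ..., 2^k - 1}) so
        # that equal elements become adjacent.
        n = len(xs)
        l = (alpha + 2) * max(1, n.bit_length())   # ensures 2^l >= n^(alpha+2)
        if l >= k:
            return sorted(xs)                      # small universe: sort directly
        a = random.randrange(1, 2 ** k, 2)         # random h from H_{k,l}
        def h(x):
            return ((a * x) & ((1 << k) - 1)) >> (k - l)
        grouped = sorted(xs, key=h)                # stable sort by hash value
        # Final check: elements sharing a hash value must be equal.
        ok = all(h(u) != h(v) or u == v for u, v in zip(grouped, grouped[1:]))
        return grouped if ok else sorted(xs)       # rare collision: full sort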
2.4 Duplicate grouping via perfect hashing
We now show that there is another, asymptotically even more reliable, duplicate-grouping
algorithm that also works in linear time and space. The algorithm is
based on the randomized perfect-hashing scheme of Bast and Hagerup [4].

The perfect-hashing problem is the following: Given a multiset S ⊆ {0, . . . , U − 1},
for some universe size U, construct a function h: S → {0, . . . , c|S|}, for some
constant c, so that h is 1-1 on (the distinct elements of) S. In [4] a parallel
algorithm for the perfect-hashing problem is described; we need the following
sequential version.

Fact 2.8 [4] Assume that U is a known prime. Then the perfect-hashing problem
for a multiset of n integers from {0, . . . , U − 1} can be solved by a randomized
algorithm that requires O(n) space and runs in O(n) time with probability
1 − 2^{-n^{Ω(1)}}. The hash function produced by the algorithm can be evaluated in constant
time.
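The scheme of [4] is involved; purely to illustrate what a perfect-hashing subroutine provides, the following Python sketch follows the classical two-level construction of [14] rather than the algorithm of [4]. It assumes that p is prime and the key set is nonempty, and it returns slot identifiers as pairs (bucket, slot), which are easily flattened into a single range of size O(n):

    import random

    def perfect_hash(keys, p):
        # Build a function that is 1-1 on the distinct elements of keys,
        # which must lie in {0, ..., p - 1} for a prime p.
        keys = sorted(set(keys))
        n = len(keys)
        while True:
            a = random.randrange(1, p)                 # top-level hash multiplier
            buckets = [[] for _ in range(n)]
            for x in keys:
                buckets[a * x % p % n].append(x)
            if sum(len(b) ** 2 for b in buckets) < 3 * n:
                break                                  # buckets are small enough
        subs = []
        for b in buckets:
            m = max(1, len(b) ** 2)                    # quadratic-size second level
            while True:
                c = random.randrange(1, p)
                if len({c * x % p % m for x in b}) == len(b):
                    break                              # collision-free on this bucket
            subs.append((c, m))
        def h(x):
            i = a * x % p % n
            c, m = subs[i]
            return (i, c * x % p % m)
        return h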
In order to use this perfect-hashing scheme, we need to have a method for
computing a prime larger than a given number m. In order to find such a prime,
we again use a randomized algorithm. The simple idea is to combine a randomized
primality test (as described, e. g., in [10, pp. 839 ff.]) with random sampling.
Such algorithms for generating a number that is probably prime are described
or discussed in several papers, e. g., in [5], [11], and [23]. As we are interested
in the situation where the running time is guaranteed and the failure probability
is extremely small, we use a variant of the algorithms tailored to meet these
requirements. The proof of the following lemma, which includes a description of
the algorithm, can be found in Section A of the appendix.
Lemma 2.9 There is a randomized algorithm that, for any given positive integers
m and n with 2 ≤ m ≤ 2^{n^{1/4}}, returns an integer p with m < p ≤ 2m such that
the running time is O(n) and the probability that p is not prime is at most 2^{-n^{1/4}}.

Using these tools, we obtain the following duplicate-grouping result.

Theorem 2.10 Let U ≥ 2 be known. The duplicate-grouping problem for a multiset
S of n integers in the range {0, . . . , U − 1} can be solved stably by a conservative
randomized algorithm that needs O(n) space and O(n) time on a unit-cost RAM
with arithmetic operations from {+, −, ∗, div}; the probability that the time bound
is exceeded is 2^{-n^{Ω(1)}}.

Proof. Call U large if U > 2^{n^{1/4}}, and take U′ = min{U, 2^{n^{1/4}}}. We
distinguish between two cases. If U is not large, i. e., U = U′, we first apply
the method of Lemma 2.9 to find a prime p between U′ and 2U′. Then, the hash
function from Fact 2.8 is applied to map the distinct elements of S ⊆ {0, . . . , p − 1}
to {0, . . . , cn}, where c is a constant. Finally, the values obtained are grouped
by one of the deterministic algorithms described in Section 2.1 (Fact 2.1 and
Lemma 2.3 are equally suitable). In case U is large, we first collapse the universe
by mapping the elements of S ⊆ {0, . . . , U − 1} into the range {0, . . . , U′ − 1} by a
randomly chosen multiplicative hash function, as described in Section 2.2. Then,
using the collapsed keys, we proceed as above for a universe that is not large.
Let us now analyze the resource requirements of the algorithm. It is easy
to check (conservatively) in O(min{n^{1/4}, log U}) time whether or not U is large.
Lemma 2.9 shows how to find the required prime p in the range {U′ + 1, . . . , 2U′}
in O(n) time with error probability at most 2^{-n^{1/4}}. In case U is large, we must
choose a function h at random from H_{k,ℓ}, where 2^k = U is known and
ℓ = ⌈n^{1/4}⌉; since U is large, 2^ℓ ≤ U, i.e., ℓ ≤ k, as required.

For the grouping of d-tuples, the following phase structure is used: the entries are
grouped in d phases, one for each component d′ = 1, . . . , d. In phase d′, the entries of S
(in the order produced by the previous phase or in the initial order if d′ = 1) are
grouped with respect to component d′. A simple induction on d′ shows that after phase d′ the
d-tuples are grouped stably according to components 1, . . . , d′, which establishes
the correctness of the algorithm. The time and probability bounds are obvious.
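A Python sketch of this phase structure, with an insertion-ordered dictionary serving as the stable duplicate-grouping subroutine (names are ours):

    def group_tuples(tuples, d):
        # Group equal d-tuples stably, one component per phase; Python
        # dictionaries preserve insertion order, so each phase is a
        # stable grouping of its input sequence.
        seq = list(tuples)
        for comp in range(d):
            groups = {}
            for t in seq:
                groups.setdefault(t[comp], []).append(t)
            seq = [t for bucket in groups.values() for t in bucket]
        return seq

    print(group_tuples([(1, 2), (0, 1), (1, 2), (0, 2)], 2))
    # -> [(1, 2), (1, 2), (0, 2), (0, 1)]; equal tuples are adjacent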
3 A randomized closest-pair algorithm
In this section we describe a variant of the random-sampling algorithm of Rabin
[27] for solving the closest-pair problem, complete with all details concerning the
hashing procedure. For the sake of clarity, we provide a detailed description for
the two-dimensional case only.
Let us first define the notion of grids in the plane, which is central to the
algorithm (and which generalizes easily to higher dimensions). For all δ > 0,
a grid G with resolution δ, or briefly a δ-grid G, consists of two infinite sets of
equidistant lines, one parallel to the x-axis, the other parallel to the y-axis, where
the distance between two neighboring lines is δ. In precise terms, G is the set

{(x, y) ∈ IR² : |x − x₀|/δ ∈ ZZ or |y − y₀|/δ ∈ ZZ},

for some origin (x₀, y₀) ∈ IR². The grid G partitions IR² into disjoint regions
called cells of G: two points (x, y) and (x′, y′) lie in the same cell if and only if
⌊(x − x₀)/δ⌋ = ⌊(x′ − x₀)/δ⌋ and ⌊(y − y₀)/δ⌋ = ⌊(y′ − y₀)/δ⌋ (that is, G partitions
the plane into half-open squares of side length δ).
Let S = {p_1, . . . , p_n} be a multiset of points in the Euclidean plane. We assume
that these points are stored in an array S[1..n]. Further, let c be a fixed constant
with 0 < c < 1/2, to be specified later. The algorithm for computing a closest
pair in S consists of the following steps.

1. Fix a sample size s with 18n^{1/2+c} ≤ s = O(n/log n). Choose a sequence
t_1, . . . , t_s of s elements of {1, . . . , n} randomly. Let T = {t_1, . . . , t_s} and let
R be the multiset of points {p_t | t ∈ T} (since indices may repeat, R may
contain fewer than s points).

Steps 2 and 3, described formally in Figure 1, compute a closest pair (p_a, p_b)
within the sample R and then, with δ₀ = dist(p_a, p_b), group the points of S
according to four overlapping 2δ₀-grids, shifted relative to each other by δ₀ in each
coordinate; within each group a closest pair is found, and the best pair encountered
is returned. The grids are anchored at the minimum coordinates of the input, and
for dx, dy ∈ {0, δ₀} the group index of a point p = (x, y) is

group_{dx,dy,2δ₀}(p) = (⌊(x + dx − x_min)/(2δ₀)⌋, ⌊(y + dy − y_min)/(2δ₀)⌋),

a pair of numbers of O(log((x_max − x_min)/δ₀)) and O(log((y_max − y_min)/δ₀)) bits.
To implement this function, we have to preprocess the points to compute the
minimum coordinates x_min and y_min.
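Assuming that the minimum coordinates have been precomputed, the group function can be sketched in Python as follows (the function names are ours):

    import math

    def make_group_fn(points, dx, dy, delta):
        # Return group_{dx,dy,delta}: the pair of cell coordinates of a
        # point in the delta-grid shifted by (dx, dy) and anchored at the
        # minimum coordinates of the input points.
        x_min = min(x for x, y in points)
        y_min = min(y for x, y in points)
        def group(p):
            x, y = p
            return (math.floor((x + dx - x_min) / delta),
                    math.floor((y + dy - y_min) / delta))
        return group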
The correctness of the procedure randomized-closest-pair follows from the
fact that, since δ₀ is an upper bound on the minimum distance between two points
of the multiset S, a closest pair falls into the same cell in at least one of the shifted
2δ₀-grids.
Remark 3.1 When computing the distances we have assumed implicitly that
the square-root operation is available. However, this is not really necessary. In
Step 2 of the algorithm we could calculate the distance δ₀ of a closest pair p_a, p_b
of the sample using the Manhattan metric L₁ instead of the Euclidean metric L₂.
In Step 3b of the algorithm we could compare the squares of the L₂-distances
instead of the actual distances. Since even with this change δ₀ is an upper bound
on the L₂-distance of a closest pair, the algorithm will still be correct; on the
other hand, the running-time estimate for Step 3, as given in the next section,
does not change. (See the analysis of Step 3b following Corollary 4.4.) The tricks
just mentioned suffice for showing that the closest-pair algorithm can be made
to work for any fixed L_p metric without computing pth roots, if p is a positive
integer or ∞.
procedure randomized-closest-pair(modifies S: array[1..n] of points)
returns(a pair of points)
% Step 1. Take a random sample of size at most s from the multiset S.
t[1..s] := a random sequence of s indices in [1..n]
% Eliminate repetitions in t[1..s]; store the chosen points in R.
for j := 1 to s do
    T[t[j]] := true
s′ := 0
for j := 1 to s do
    if T[t[j]] then
        s′ := s′ + 1
        R[s′] := S[t[j]]
        T[t[j]] := false
% Step 2. Deterministically compute a closest pair within the random sample.
(p_a, p_b) := deterministic-closest-pair(R[1..s′])
δ₀ := dist(p_a, p_b) % dist is the distance function.
if δ₀ > 0 then
    % Step 3. Consider the four overlapping grids.
    for dx, dy ∈ {0, δ₀} do
        % Step 3a. Group the points.
        duplicate-grouping(S[1..n], group_{dx,dy,2δ₀})
        % Step 3b. In each group find a closest pair.
        j := 0
        while j < n do
            i := j + 1
            j := i
            while j < n and group_{dx,dy,2δ₀}(S[i]) = group_{dx,dy,2δ₀}(S[j + 1]) do
                j := j + 1
            if i ≠ j then
                (p_c, p_d) := deterministic-closest-pair(S[i..j])
                if dist(p_c, p_d) < dist(p_a, p_b) then
                    (p_a, p_b) := (p_c, p_d)
return (p_a, p_b)

Figure 1: A formal description of the closest-pair algorithm.
Remark 3.2 The randomized closest-pair algorithm generalizes naturally to any
d-dimensional space. Note that while two shifts (by 0 and δ₀) of 2δ₀-grids are
needed in the one-dimensional case, in the two-dimensional case 4 and in the
d-dimensional case 2^d shifted grids must be taken into account.
Remark 3.3 For implementing the procedure deterministic-closest-pair any of
a number of algorithms can be used. Small input sets are best handled by the
brute-force algorithm, which calculates the distances between all n(n − 1)/2 pairs
of points; in particular, all calls to deterministic-closest-pair in Step 3b are executed
in this way. For larger input sets, in particular, for the call to deterministic-closest-pair
in Step 2, we use an asymptotically faster algorithm. For different
numbers d of dimensions various algorithms are available. In the one-dimensional
case the closest-pair problem can be solved by sorting the points and finding
the minimum distance between two consecutive points. In the two-dimensional
case one can use the simple plane-sweep algorithm of Hinrichs et al. [17]. In the
multi-dimensional case, the divide-and-conquer algorithm of Bentley and Shamos
[7] and the incremental algorithm of Schwarz et al. [30] are applicable. Assuming
d to be constant, all the algorithms mentioned above run in O(n log n) time and
O(n) space. One should be aware, however, that the complexity depends heavily
on d.
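Combining Figure 1 with these remarks, a small self-contained Python version can be sketched as follows. It illustrates the control flow only: a dictionary stands in for the duplicate-grouping routine, brute force serves as deterministic-closest-pair throughout, and the sample indices are drawn without replacement, so the guarantees proved in the next section do not literally apply to this toy version.

    import math, random

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def brute_force_pair(pts):
        # deterministic-closest-pair by brute force (fine for small inputs).
        best = (pts[0], pts[1])
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                if dist(pts[i], pts[j]) < dist(*best):
                    best = (pts[i], pts[j])
        return best

    def randomized_closest_pair(S, c=0.25):
        n = len(S)
        s = min(int(18 * n ** (0.5 + c)) + 1, n)
        sample = [S[i] for i in random.sample(range(n), s)]   # Step 1
        pa, pb = brute_force_pair(sample)                     # Step 2
        d0 = dist(pa, pb)
        if d0 == 0:
            return pa, pb
        x_min = min(x for x, y in S)
        y_min = min(y for x, y in S)
        for dx in (0.0, d0):                                  # Step 3: four grids
            for dy in (0.0, d0):
                groups = {}
                for p in S:                                   # Step 3a
                    key = (math.floor((p[0] + dx - x_min) / (2 * d0)),
                           math.floor((p[1] + dy - y_min) / (2 * d0)))
                    groups.setdefault(key, []).append(p)
                for g in groups.values():                     # Step 3b
                    if len(g) > 1:
                        pc, pd = brute_force_pair(g)
                        if dist(pc, pd) < dist(pa, pb):
                            pa, pb = pc, pd
        return pa, pb

    pts = [(random.random(), random.random()) for _ in range(200)]
    print(randomized_closest_pair(pts))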
4 Analysis of the closest-pair algorithm
In this section, we prove that the algorithm given in Section 3 has linear time
complexity with high probability. Again, we treat only the two-dimensional case
in detail. Time bounds for most parts of the algorithm were established in previous
sections or are immediately clear: Step 1 of the algorithm (taking the sample of
size s′ ≤ s) obviously uses O(s) time. Since we assumed that s = O(n/log n), no
more than O(n) time is consumed in Step 2 for finding a closest pair within the
sample (see Remark 3.3). The complexity of the grouping performed in Step 3a
was analyzed in Section 2. In order to implement the function group_{dx,dy,δ}, which
returns the group indices, we need some preprocessing that takes O(n) time.
It remains only to analyze the cost of Step 3b, where closest pairs are found
within each group. It will be shown that a sample of size s ≥ 18n^{1/2+c}, for any fixed
c with 0 < c < 1/2, guarantees O(n)-time performance with a failure probability
of at most 2^{-n^c}. This holds even if a closest pair within each group is computed by
the brute-force algorithm (see Remark 3.3). On the other hand, if the sampling
procedure is modified in such a way that only a few 4-wise independent sequences
are used to generate the sampling indices t_1, . . . , t_s, linear running time will still
be guaranteed with probability 1 − O(n^{-λ}), for an arbitrary fixed integer λ ≥ 1.

For a multiset S of points and a grid G, let S_1, . . . , S_m be the multisets of
points in the nonempty cells of G, and let

N(S, G) = Σ_{ν=1}^{m} (1/2)|S_ν|(|S_ν| − 1),

which is the number of (unordered) pairs of elements of S that lie in the same
set S_ν, i.e., in the same cell of G. Analogously, let N(S) denote the number of
(unordered) pairs of equal elements of S.

Lemma 4.1 Let S be a multiset of n points in the plane, let G be a grid, and let
G′ be a grid whose resolution is twice that of G and each of whose cells is the
union of 4 cells of G. Then N(S, G′) ≤ 4N(S, G) + (3/2)n.
Proof. We consider 4 cells of G whose union is one cell of G′. Let k_1, k_2, k_3, k_4
denote the numbers of points of S in these four cells, so that their contribution
to N(S, G) is b = (1/2) Σ_{i=1}^{4} k_i(k_i − 1). The contribution
of the one (larger) cell to N(S, G′) is (1/2)k(k − 1), where k = Σ_{i=1}^{4} k_i. We want to
give an upper bound on (1/2)k(k − 1) in terms of b.

The function x ↦ x(x − 1) is convex in [0, ∞). Hence

(k/4)((k/4) − 1) ≤ (1/4) Σ_{i=1}^{4} k_i(k_i − 1) = b/2.

This implies

(1/2)k(k − 1) = (1/2)k(k − 4) + (3/2)k = 8 · (k/4)((k/4) − 1) + (3/2)k ≤ 4b + (3/2)k.

Summing the last inequality over all cells of G′ yields N(S, G′) ≤ 4N(S, G) + (3/2)n.
Remark 4.2 In the case of d-dimensional space, this calculation can be carried
out in exactly the same way; this results in the estimate N(S, G′) ≤ 2^d N(S, G) +
(1/2)(2^d − 1)n.
Corollary 4.3 Let S be a multiset of n points that satisfies N(S) < n. Then
there is a grid G′ with n ≤ N(S, G′) < 5.5n.

Proof. We start with a grid G so fine that no cell of the grid contains two distinct
points in S. Then, obviously, N(S, G) = N(S) < n. By repeatedly doubling the
grid size as in Lemma 4.1 until N(S, G′) ≥ n holds for the first time, we obtain a
grid G′ with n ≤ N(S, G′) < 4n + (3/2)n = 5.5n.

Corollary 4.4 Let S be a multiset of n points in the plane, let G be a grid with
resolution δ, and let G′ be an arbitrary grid with resolution at most δ. Then
N(S, G′) ≤ 16N(S, G) + 6n.

Proof. Let G_i, for i = 1, 2, 3, 4, be the four different grids with resolution 2δ that
overlap G, i.e., each cell of each G_i is the union of 4 cells of G. Each cell of G′
has side length at most δ and is therefore contained in a single cell of at least one
of the grids G_i; hence N(S, G′) ≤ Σ_{i=1}^{4} N(S, G_i). Applying Lemma 4.1
to each G_i and summing up yields N(S, G′) ≤ 16N(S, G) + 6n.
Consider now the cost of Step 3b. Assume first that N(S) ≥ n. Then, by
Corollary B.2 (applied to the partition of S into groups of equal points), the
sample contains two equal points with probability at least 1 − 2^{-s/(18√n)} ≥
1 − 2^{-n^c}; in this case δ₀ = 0, and Step 3 is not executed at all. Assume from
now on that N(S) < n, so that the existence of a grid G′ as in Corollary 4.3 is
ensured. Let δ′ denote the resolution of G′.

We apply Corollary B.2 from the appendix to the partition of S (with duplicates)
induced by G′. Since N(S, G′) ≥ n and s ≥ 18n^{1/2+c} ≥ 9√n, with probability
at least 1 − 2^{-s/(18√n)} ≥ 1 − 2^{-n^c} the sample contains two (not necessarily
distinct) points that fall into the same cell of G′; whenever this happens, we must
have δ₀ ≤ 2δ′. (This
is the case even if in Step 2 the Manhattan metric L₁ is used.) Thus the four
grids G_1, G_2, G_3, G_4 used in Step 3 have resolution 2δ₀ ≤ 4δ′. We form a new
conceptual grid G″ with resolution 4δ′ by twice doubling the resolution of G′ as
in Lemma 4.1; then N(S, G″) ≤ 16N(S, G′) + (15/2)n < 95.5n, and the resolution of the grid G″ is at
least 2δ₀. Hence we may apply Corollary 4.4 (with G″ in the role of G) to obtain
that the four grids G_1, G_2, G_3, G_4 used in Step 3 of the algorithm satisfy
N(S, G_i) = O(n), for i = 1, 2, 3, 4.
But obviously the running time of Step 3b is O(Σ_{i=1}^{4}(N(S, G_i) + n)); by the above,
this bound is linear in n. This finishes the analysis of the cost of Step 3b.
It is easy to see that Corollaries 4.3 and 4.4 as well as the analysis of Step 3b
generalize from the plane to any fixed dimension d. Combining the discussion
above with Theorem 2.13, we obtain the following.
Theorem 4.5 The closest-pair problem for a multiset of n points in d-dimensional
space, where d ≥ 1 is a fixed integer, can be solved by a randomized algorithm
that needs O(n) space and

(1) O(n) time on a real RAM with operations from {+, −, ∗, div, log₂, exp₂};
or

(2) O(n + log log(δ_max/δ_min)) time on a real RAM with operations from {+, −, ∗,
div},

where δ_max and δ_min denote the maximum and the minimum distance between any
two distinct points, respectively. The probability that the time bound is exceeded is
2^{-n^{Ω(1)}}.
Proof. The running time of the randomized closest-pair algorithm is dominated
by that of Step 3a. The group indices used in Step 3a are d-tuples of integers
in the range {0, . . . , ⌈δ_max/δ_min⌉}. By Theorem 2.14, parts (1) and (2) of the
theorem follow directly from the corresponding parts of Theorem 2.13. Since all
the subroutines used finish within their respective time bounds with probability
1 − 2^{-n^{Ω(1)}}, the same is true for the whole algorithm. The amount of space required
is obviously linear.
In the situation of Theorem 4.5, if the coordinates of the input points happen
to be integers drawn from a range {0, . . . , U − 1}, we can replace the real RAM by
a conservative unit-cost RAM with integer operations; the time bound of part (2)
then becomes O(n + log log U). The number of random bits used by either version
of the algorithm is quite large, namely essentially as large as possible with the
given running time. Even if the number of random bits used is severely restricted,
we can still retain an algorithm that is polynomially reliable.
Theorem 4.6 Let λ, d ≥ 1 be arbitrary fixed integers. The closest-pair problem
for a multiset of n points in d-dimensional space can be solved by a randomized
algorithm with the time and space requirements stated in Theorem 4.5 that uses
only O(log n + log(δ_max/δ_min)) random bits (or O(log n + log U) random bits for
integer input coordinates in the range {0, . . . , U − 1}), and that exceeds the time
bound with probability O(n^{-λ}).
Proof. We let s = ⌈16λn^{3/4}⌉ and generate the sequence t_1, . . . , t_s in the algorithm
as the concatenation of 4λ independently chosen sequences of 4-independent
random values that are approximately uniformly distributed in {1, . . . , n}. This
random experiment and its properties are described in detail in Corollary B.4 and
Lemma B.5 in Section B of the appendix. The time needed is o(n), and the number
of random bits needed is O(log n). The duplicate grouping is performed with
the simple method described in Section 2.3. This requires only O(log(δ_max/δ_min))
or O(log U) random bits. The analysis is exactly the same as in the proof of
Theorem 4.5, except that Corollary B.4 is used instead of Corollary B.2.
5 Conclusions
We have provided an asymptotically efficient algorithm for computing a closest
pair of n points in d-dimensional space. The main idea of the algorithm is to
use random sampling in order to reduce the original problem to a collection of
duplicate-grouping problems. The performance of the algorithm depends on the
operations assumed to be primitive in the underlying machine model. We proved
that, with high probability, the running time is O(n) on a real RAM capable of
executing the arithmetic operations from {+, −, ∗, div, log₂, exp₂} in constant
time. Without the operations log₂ and exp₂, the running time increases by an
additive term of O(log log(δ_max/δ_min)), where δ_max and δ_min denote the maximum
and the minimum distance between two distinct points, respectively. When the
coordinates of the points are integers in the range {0, . . . , U − 1}, the running
times are O(n) and O(n + log log U), respectively. For integer data the algorithm
is conservative, i.e., all the numbers manipulated contain O(log n + log U) bits.

We proved that the bounds on the running times hold also when the collection
of input points contains duplicates. As an immediate corollary of this result we
get that the following decision problems, which are often used in lower-bound
arguments for geometric problems (see [26]), can be solved as efficiently as the
one-dimensional closest-pair problem on the real RAM (Theorems 4.5 and 4.6):

(1) Element-distinctness problem: Given n real numbers, decide if any two of
them are equal.

(2) ε-closeness problem: Given n real numbers and a threshold value ε > 0,
decide if any two of the numbers are at distance less than ε from each other.
Finally, we would like to mention practical experiments with our simple duplicate-grouping
algorithm. The experiments were conducted by Tomi Pasanen
(University of Turku, Finland). He found that the duplicate-grouping algorithm
described in Theorem 2.7, which is based on radix sort (with μ = 3 passes), behaves
essentially as well as heapsort. For small inputs (n < 50 000) heapsort was slightly
faster, whereas for large inputs heapsort was slightly slower. Randomized quicksort
turned out to be much faster than any of these algorithms for all n ≤ 1 000 000.
One drawback of the radix-sort algorithm is that it requires extra memory space
for linking the duplicates, whereas heapsort (as well as in-place quicksort) does not
require any extra space. One should also note that in some applications the word
length of the actual machine can be restricted to, say, 32 bits. This means that
when n > 2^{11} and μ = 3, the hash function h ∈ H_{k,ℓ} (see the proof of Theorem
2.7) is not needed for collapsing the universe; radix sort can be applied directly.
Therefore the integers must be long before the full power of our methods comes
into play.
Acknowledgements
We would like to thank Ivan Damgård for his comments concerning Lemma A.1
and Tomi Pasanen for his assistance in evaluating the practical efficiency of the
duplicate-grouping algorithm. The question of whether the class of multiplicative
hash functions is universal was posed to the first author by Ferri Abolhassan and
Jörg Keller. We also thank Kurt Mehlhorn for useful comments on this universal
class and on the issue of 4-independent sampling.
References
[1] A. Aggarwal, H. Edelsbrunner, P. Raghavan, and P. Tiwari, Optimal time bounds for some proximity problems in the plane, Inform. Process. Lett. 42 (1992), 55–60.

[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, 1974.

[3] A. Andersson, T. Hagerup, S. Nilsson, and R. Raman, Sorting in linear time?, in Proc. 27th Annual ACM Symposium on the Theory of Computing, pp. 427–436, Association for Computing Machinery, New York, 1995.

[4] H. Bast and T. Hagerup, Fast and reliable parallel hashing, in Proc. 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 50–61, Association for Computing Machinery, New York, 1991.

[5] P. Beauchemin, G. Brassard, C. Crépeau, C. Goutier, and C. Pomerance, The generation of random numbers that are probably prime, J. Cryptology 1 (1988), 53–64.

[6] M. Ben-Or, Lower bounds for algebraic computation trees, in Proc. 15th Annual ACM Symposium on Theory of Computing, pp. 80–86, Association for Computing Machinery, New York, 1983.

[7] J. L. Bentley and M. I. Shamos, Divide-and-conquer in multidimensional space, in Proc. 8th Annual ACM Symposium on Theory of Computing, pp. 220–230, Association for Computing Machinery, New York, 1976.

[8] J. L. Carter and M. N. Wegman, Universal classes of hash functions, J. Comput. System Sci. 18 (1979), 143–154.

[9] B. Chor and O. Goldreich, On the power of two-point based sampling, J. Complexity 5 (1989), 96–106.

[10] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, The MIT Press, Cambridge, 1990.

[11] I. Damgård, P. Landrock, and C. Pomerance, Average case error estimates for the strong probable prime test, Math. Comp. 61 (1993), 177–194.

[12] M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and R. E. Tarjan, Dynamic perfect hashing: Upper and lower bounds, SIAM J. Comput. 23 (1994), 738–761.

[13] M. Dietzfelbinger and F. Meyer auf der Heide, Dynamic hashing in real time, in Informatik, Festschrift zum 60. Geburtstag von Günter Hotz (J. Buchmann, H. Ganzinger, and W. J. Paul, Eds.), Teubner-Texte zur Informatik, Band 1, pp. 95–119, B. G. Teubner, Stuttgart, 1992.

[14] M. L. Fredman, J. Komlós, and E. Szemerédi, Storing a sparse table with O(1) worst case access time, J. Assoc. Comput. Mach. 31 (1984), 538–544.

[15] S. Fortune and J. Hopcroft, A note on Rabin's nearest-neighbor algorithm, Inform. Process. Lett. 8 (1979), 20–23.

[16] M. Golin, R. Raman, C. Schwarz, and M. Smid, Simple randomized algorithms for closest pair problems, Nordic J. Comput. 2 (1995), 3–27.

[17] K. Hinrichs, J. Nievergelt, and P. Schorn, Plane-sweep solves the closest pair problem elegantly, Inform. Process. Lett. 26 (1988), 255–261.

[18] J. Katajainen and M. Lykke, Experiments with universal hashing, Technical Report 96/8, Dept. of Computer Science, Univ. of Copenhagen, Copenhagen, 1996.

[19] S. Khuller and Y. Matias, A simple randomized sieve algorithm for the closest-pair problem, Inform. and Comput. 118 (1995), 34–37.

[20] D. Kirkpatrick and S. Reisch, Upper bounds for sorting integers on random access machines, Theoret. Comput. Sci. 28 (1984), 263–276.

[21] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, Reading, 1973.

[22] Y. Mansour, N. Nisan, and P. Tiwari, The computational complexity of universal hashing, in Proc. 22nd Annual ACM Symposium on Theory of Computing, pp. 235–243, Association for Computing Machinery, New York, 1990.

[23] Y. Matias and U. Vishkin, On parallel hashing and integer sorting, Technical Report UMIACS-TR-90-13.1, Inst. for Advanced Computer Studies, Univ. of Maryland, College Park, 1990. (Journal version: J. Algorithms 12 (1991), 573–606.)

[24] K. Mehlhorn, Data Structures and Algorithms, Vol. 1: Sorting and Searching, Springer-Verlag, Berlin, 1984.

[25] G. L. Miller, Riemann's hypothesis and tests for primality, J. Comput. System Sci. 13 (1976), 300–317.

[26] F. P. Preparata and M. I. Shamos, Computational Geometry: An Introduction, Springer-Verlag, New York, 1985.

[27] M. O. Rabin, Probabilistic algorithms, in Algorithms and Complexity: New Directions and Recent Results (J. F. Traub, Ed.), pp. 21–39, Academic Press, New York, 1976.

[28] M. O. Rabin, Probabilistic algorithm for testing primality, J. Number Theory 12 (1980), 128–138.

[29] R. Raman, Priority queues: small, monotone and trans-dichotomous, in Proc. 4th Annual European Symposium on Algorithms, Lecture Notes in Comput. Sci. 1136, pp. 121–137, Springer, Berlin, 1996.

[30] C. Schwarz, M. Smid, and J. Snoeyink, An optimal algorithm for the on-line closest-pair problem, in Proc. 8th Annual Symposium on Computational Geometry, pp. 330–336, Association for Computing Machinery, New York, 1992.

[31] W. Sierpiński, Elementary Theory of Numbers, Second English Edition (A. Schinzel, Ed.), North-Holland, Amsterdam, 1988.

[32] A. C.-C. Yao, Lower bounds for algebraic computation trees with integer inputs, SIAM J. Comput. 20 (1991), 655–668.
A Generating primes
In this section we provide a proof of Lemma 2.9. The main idea is expressed in
the proof of the following lemma.
Lemma A.1 There is a randomized algorithm that, for any given integer m ≥ 2,
returns an integer p with m < p ≤ 2m such that the following holds: the running
time is O((log m)⁴), and the probability that p is not prime is at most 1/m.
Proof. The heart of the construction is the randomized primality test due to Miller
[25] and Rabin [28] (for a description and an analysis see, e. g., [10, pp. 839 ff.]). If
an arbitrary number x of b bits is given to the test as an input, then the following
holds:

(a) If x is prime, then Prob(the result of the test is "prime") = 1;

(b) if x is composite, then Prob(the result of the test is "prime") ≤ 1/4;

(c) performing the test once requires O(b) time, and all numbers manipulated
in the test are O(b) bits long.

By repeating the test t times, the reliability of the result can be increased such
that for composite x we have

Prob(the result of the test is "prime") ≤ (1/4)^t.
In order to generate a probable prime that is greater than m we use a random
sampling algorithm. We select s (to be specified later) integers from the interval
{m + 1, . . . , 2m} at random. Then these numbers are tested one by one until the
result of the test is "prime". If no such result is obtained, the number m + 1 is
returned.

The algorithm fails to return a prime number (1) if there is no prime among
the numbers in the sample, or (2) if one of the composite numbers in the sample
is accepted by the primality test. We estimate the probabilities of these events.
It is known that the function π(x) = |{p | p ≤ x and p is prime}|, defined for
any real number x, satisfies

π(2n) − π(n) > n/(3 ln(2n)),

for all integers n > 1. (For a complete proof of this fact, also known as the
inequality of Finsler, see [31, Sections 3.10 and 3.14].) That is, the number of
primes in the set {m + 1, . . . , 2m} is at least m/(3 ln(2m)). We choose

s = s(m) = ⌈3(ln(2m))²⌉ and t = t(m) = max{⌈log₂ s(m)⌉, ⌈log₂(2m)⌉}.
(Note that t(m) = O(log m).) Then the probability that the random sample
contains no prime at all is bounded by

(1 − 1/(3 ln(2m)))^s ≤ ((1 − 1/(3 ln(2m)))^{3 ln(2m)})^{ln(2m)} < e^{−ln(2m)} = 1/(2m).
The probability that one of the at most s composite numbers in the sample will
be accepted is smaller than

s(m) · (1/4)^t ≤ s(m) · 2^{−log₂ s(m)} · 2^{−log₂(2m)} ≤ 1/(2m).

Summing up, the failure probability of the algorithm is at most 2 · (1/(2m)) =
1/m, as claimed. If m is a b-bit number, the time required is O(s · t · b), that is,
O((log m)⁴).
Remark A.2 The problem of generating primes is discussed in greater detail by
Damgård et al. [11]. Their analysis shows that the proof of Lemma A.1 is overly
pessimistic. Therefore, without sacrificing the reliability, the sample size s and/or
the repetition count t can be decreased; in this way considerable savings in the
running time are possible.
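For concreteness, here is a Python sketch of the algorithm of Lemma A.1; the Miller-Rabin test is standard, the parameters s and t follow the proof above, and the function names are ours:

    import math, random

    def miller_rabin(x, t):
        # t rounds of the Miller-Rabin test: False means x is certainly
        # composite, True means x is probably prime.
        if x < 4:
            return x in (2, 3)
        if x % 2 == 0:
            return False
        d, r = x - 1, 0
        while d % 2 == 0:
            d, r = d // 2, r + 1
        for _ in range(t):
            a = random.randrange(2, x - 1)
            y = pow(a, d, x)
            if y in (1, x - 1):
                continue
            for _ in range(r - 1):
                y = y * y % x
                if y == x - 1:
                    break
            else:
                return False
        return True

    def random_probable_prime(m):
        # Lemma A.1: return p with m < p <= 2m that is prime with
        # probability at least 1 - 1/m.
        s = math.ceil(3 * math.log(2 * m) ** 2)
        t = max(math.ceil(math.log2(s)), math.ceil(math.log2(2 * m)))
        for _ in range(s):
            p = random.randrange(m + 1, 2 * m + 1)
            if miller_rabin(p, t):
                return p
        return m + 1   # reached only with probability at most 1/(2m)

    print(random_probable_prime(10 ** 6))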
Lemma 2.9 There is a randomized algorithm that, for any given positive integers
m and n with 2 ≤ m ≤ 2^{n^{1/4}}, returns an integer p with m < p ≤ 2m such that
the running time is O(n) and the probability that p is not prime is at most 2^{-n^{1/4}}.

Proof. We proceed as in the proof of Lemma A.1, but increase the sample size
and the repetition count by choosing

s = s(m, n) = ⌈6(ln(2m)) · n^{1/4}⌉ and t = t(m, n) = 1 + max{⌈log₂ s(m, n)⌉, ⌈n^{1/4}⌉}.

As above, the failure probability is bounded by the sum of the following two terms:

(1 − 1/(3 ln(2m)))^{s(m,n)} ≤ e^{−2n^{1/4}} < 2^{−1−n^{1/4}}

and

s(m, n) · (1/4)^{t(m,n)} ≤ 2^{−(1+n^{1/4})} ≤ 2^{−1−n^{1/4}}.

This proves the bound 2^{-n^{1/4}} on the failure probability. The running time is

O(s · t · log m) = O((log m) · n^{1/4} · (log log m + log n + n^{1/4}) · log m) = O(n).
B Random sampling in partitions

In this section we deal with some technical details of the analysis of the closest-pair
algorithm. For a finite set S and a partition D = (S_1, . . . , S_m) of S into
nonempty subsets, let

P(D) = {π ⊆ S | |π| = 2 and ∃ ν ∈ {1, . . . , m} : π ⊆ S_ν}.

Note that the quantity N(D) defined in Section 4 equals |P(D)|. For the analysis
of the closest-pair algorithm, we need the following technical fact: if N(D) is
linear in n and more than 8√n random elements of S are drawn, then it is likely
that two of them lie in the same subset of the partition.

B.1 Fully random sampling

Lemma B.1 Let n, m and s be positive integers, let S be a set of size n ≥ 800,
let D = (S_1, . . . , S_m) be a partition of S into nonempty subsets with N(D) ≥ n,
and assume that s random elements t_1, . . . , t_s are drawn independently from the
uniform distribution over S. Then if s ≥ 8√n,

Prob(∃ i, j ∈ {1, . . . , s} ∃ ν ∈ {1, . . . , m} : t_i ≠ t_j ∧ t_i, t_j ∈ S_ν) > 1 − 4√n/s. (B.1)
Proof. We first note that we may assume, without loss of generality, that

n ≤ N(D) ≤ 1.1n. (B.2)

To see this, assume that N(D) > 1.1n and consider a process of repeatedly refining
D by splitting off an element x in a largest set in D, i.e., by making x into a
singleton set. Each such step decreases N(D) by one less than the size of a largest
set in D; near the end of the process this decrease is at most

√(2n) + 1 = √(200/n) · 0.1n + 1,

which for n ≥ 800 is at most 0.1n. Hence if we stop the process with the first
partition D′ for which N(D′) ≤ 1.1n, we still have N(D′) ≥ n. Since D′ is a
refinement of D, we have for all i and j that if

t_i and t_j are contained in the same set S′_ν of D′, then
t_i and t_j are contained in the same set S_ν of D;

thus, it suffices to prove (B.1) for D′.
We define random variables X_{π,i,j}, for π ∈ P(D) and 1 ≤ i < j ≤ s, as follows:

X_{π,i,j} := 1 if {t_i, t_j} = π, and X_{π,i,j} := 0 otherwise.

Further, we let

X = Σ_{π∈P(D)} Σ_{1≤i<j≤s} X_{π,i,j}.

Clearly, by the definition of P(D),

X = |{(i, j) | 1 ≤ i < j ≤ s, t_i ≠ t_j, and t_i, t_j ∈ S_ν for some ν}|.

Thus, to establish (B.1), we only have to show that

Prob(X = 0) < 4√n/s.
For this, we estimate the expectation E(X) and the variance Var(X) of the
random variable X, with the intention of applying Chebyshev's inequality:

Prob(|X − E(X)| ≥ t) ≤ Var(X)/t², for all t > 0. (B.3)

(For another, though simpler, application of Chebyshev's inequality in a similar
context see [9].)
First note that for each π = {x, y} ∈ P(D) and 1 ≤ i < j ≤ s the following
holds:

E(X_{π,i,j}) = Prob(t_i = x ∧ t_j = y) + Prob(t_i = y ∧ t_j = x) = 2/n². (B.4)

Thus,

E(X) = Σ_{π∈P(D)} Σ_{1≤i<j≤s} E(X_{π,i,j}) = |P(D)| · (s(s − 1)/2) · (2/n²) = N(D) · (s²/n²) · (1 − 1/s). (B.5)

By assumption, s ≥ 8√n ≥ 8√800 > 226, so that 1 − 1/s ≥ 1/1.01.
Using the assumption N(D) ≥ n, we get from (B.5) that

E(X) ≥ s²/(1.01n). (B.6)
Next we derive an upper bound on the variance of X. With the (standard)
notation

Cov(X_{π,i,j}, X_{π′,i′,j′}) = E(X_{π,i,j} · X_{π′,i′,j′}) − E(X_{π,i,j}) · E(X_{π′,i′,j′})

we may write

Var(X) = E(X²) − (E(X))² = Σ_{π,π′∈P(D)} Σ_{1≤i<j≤s} Σ_{1≤i′<j′≤s} Cov(X_{π,i,j}, X_{π′,i′,j′}). (B.7)
We split the summands Cov(X_{π,i,j}, X_{π′,i′,j′}) occurring in this sum into several classes
and estimate the contribution to Var(X) of the summands in each of these classes.
For all except the first class, we use the simple bound

Cov(X_{π,i,j}, X_{π′,i′,j′}) ≤ E(X_{π,i,j} · X_{π′,i′,j′}) = Prob(X_{π,i,j} = X_{π′,i′,j′} = 1).

For i ∈ {1, . . . , s}, if t_i = x ∈ S, we will say that i is mapped to x. Below we
bound the probability that {i, j} is mapped onto π, while at the same time {i′, j′}
is mapped onto π′. Let J = {i, j, i′, j′}.
Class 1. |J| = 4. In this case the random variables X_{π,i,j} and X_{π′,i′,j′} are
independent, so that Cov(X_{π,i,j}, X_{π′,i′,j′}) = 0.

Class 2. |J| = 2 and π = π′. Now E(X_{π,i,j} · X_{π′,i′,j′}) = E(X_{π,i,j}), so the total
contribution to Var(X) of summands of Class 2 is at most

Σ_{π∈P(D)} Σ_{1≤i<j≤s} E(X_{π,i,j}) = E(X).

Class 3. |J| < |π ∪ π′|. In this case X_{π,i,j} · X_{π′,i′,j′} ≡ 0, so
Cov(X_{π,i,j}, X_{π′,i′,j′}) ≤ 0.

Since |J| ∈ {2, 3, 4} and |π ∪ π′| ∈ {2, 3, 4}, the only summands not yet accounted
for have |J| = 3 and |π ∪ π′| ∈ {2, 3}. In these cases, exactly one element of J
belongs to both {i, j} and {i′, j′}; we call it the central domain element. If π and
π′ share an element, we call that element the central
range element. The argument proceeds by counting the number of summands of
certain kinds as well as estimating the size of each summand.
Class 4a. |J| = 3 and |π ∪ π′| = 2. By definition, π = π′, and π can be chosen in
N(D) ways. Furthermore, X_{π,i,j} = X_{π′,i′,j′} = 1 only if the central domain element
is mapped to one element of π, while the two remaining elements of J are both
mapped to the other element of π, the probability of which is (2/n)(1/n)(1/n) =
2/n³. Since there are at most s³ choices of i, j, i′, j′ with |J| = 3, the contribution
to Var(X) of summands of Class 4a is at most

s³ · N(D) · 2/n³ ≤ 2.2s³/n².
Class 4b. |J| = 3 and |π ∪ π′| = 3. The set π ∪ π′ can be chosen in

Σ_{ν=1}^{m} (|S_ν| choose 3)

ways, after which there are three choices for the central range element and two
ways of completing π (and, implicitly, π′). X_{π,i,j} = X_{π′,i′,j′} = 1 only if the central
domain element is mapped to the central range element, while the remaining
element of {i, j} is mapped to the remaining element of π and the remaining
element of {i′, j′} is mapped to the remaining element of π′; this happens with
probability (1/n)³. With at most s³ choices for the indices, the total contribution
of the summands of Class 4b is at most

[Σ_{ν=1}^{m} |S_ν|(|S_ν| − 1)(|S_ν| − 2)] · (s/n)³ ≤ [Σ_{ν=1}^{m} (|S_ν| − 1)³] · (s/n)³. (B.8)
We use the inequality Σ_{ν=1}^{m} a_ν³ ≤ (Σ_{ν=1}^{m} a_ν²)^{3/2} (a special case of Jensen's
inequality, valid for all a_1, . . . , a_m ≥ 0) and the assumption (B.2) to bound the right-hand
side in (B.8) by

(Σ_{ν=1}^{m} |S_ν|(|S_ν| − 1))^{3/2} · (s/n)³ ≤ (2 · 1.1n)^{3/2} · (s/n)³ = 2.2^{3/2} · s³/n^{3/2} < 3.3s³/n^{3/2}.
Bounding the contributions of the summands of the various classes to the sum in
equation (B.7), we get (using that √n ≥ √800 > 25)

Var(X) ≤ E(X) + 2.2s³/n² + 3.3s³/n^{3/2} = E(X) + (2.2n^{−1/2} + 3.3) · s³/n^{3/2} < E(X) + 3.5s³/n^{3/2}. (B.9)
By (B.3) we have

Prob(X = 0) ≤ Prob(|X − E(X)| ≥ E(X)) ≤ Var(X)/(E(X))²;

by (B.9) and (B.6) this yields

Prob(X = 0) ≤ 1/E(X) + 3.5s³/(n^{3/2}(E(X))²) ≤ 1.01n/s² + 3.5 · 1.01² · √n/s.

Since 1.01√n/s + 3.5 · 1.01² < 4 (recall that s ≥ 8√n), we get

Prob(X = 0) < 4√n/s,

as claimed.
In case the size of the chosen sample is much larger than √n, the estimate in
the lemma can be considerably sharpened.
Corollary B.2 Let n, m and s be positive integers, let S be a set of size n ≥ 800,
let D = (S_1, . . . , S_m) be a partition of S into nonempty subsets with N(D) ≥ n,
and assume that s random elements t_1, . . . , t_s are drawn independently from the
uniform distribution over S. Then if s ≥ 9√n,

Prob(∃ i, j ∈ {1, . . . , s} ∃ ν ∈ {1, . . . , m} : t_i ≠ t_j ∧ t_i, t_j ∈ S_ν) > 1 − 2^{−s/(18√n)}.
Proof. Split the sequence t_1, . . . , t_s into disjoint subsequences of length s′ = ⌈8√n⌉,
discarding any leftover elements. By Lemma B.1, each subsequence hits two distinct
elements of the same subset S_ν with probability greater than 1 − 4√n/s′ ≥ 1/2.
Since the subexperiments are independent and their number is at least ⌊s/(9√n)⌋ ≥
s/(18√n), the probability that none of them hits two distinct elements from the
same subset is less than 2^{−s/(18√n)}.
Clearly, the complement of this event, which has probability greater than
1 − 2^{−s/(18√n)}, implies that the whole sequence
t_1, . . . , t_s hits two elements from the same S_ν.
B.2 Sampling with few random bits

In this section we show that the effect described in Lemma B.1 can be achieved
also with a random experiment that uses very few random bits.

Corollary B.3 Let n, m, s, S, and D be as in Lemma B.1. Then the conclusion
of Lemma B.1 also holds if the s elements t_1, . . . , t_s are chosen according to a
distribution over S that only satisfies the following two conditions:

(a) the sequence is 4-independent, i. e., for all sets {i, j, k, ℓ} ⊆ {1, . . . , s} of
size 4 the values t_i, t_j, t_k, t_ℓ are independent;

(b) the distribution is approximately uniform: for all i ∈ {1, . . . , s} and all
x ∈ S, (1 − ε)/n ≤ Prob(t_i = x) ≤ (1 + ε)/n, where ε = 0.0025.

Proof. We indicate how the proof of Lemma B.1 has to be modified. Equation
(B.4) changes into

E(X_{π,i,j}) ≥ 2 · ((1 − ε)/n)² ≥ 2(1 − 2ε)/n².

Equation (B.5) changes into

E(X) ≥ N(D) · (s²/n²) · (1 − 2ε) · (1 − 1/s).

As s ≥ 8√800 and ε = 0.0025, we get (1 − 2ε)(1 − 1/s) ≥ 1/1.01, such that (B.6)
remains valid. The contributions to Var(X) of the summands of the various
classes defined in the proof of Lemma B.1 are bounded as follows.

Class 1: The contribution is 0. For justifying this, 4-wise independence is sufficient.

Class 2: at most E(X).

Class 3: at most 0.

Class 4a: at most s³ · N(D) · (2/n³) · (1 + ε)³ ≤ 2.3s³/n².

Class 4b: at most (2.2n)^{3/2} · (s/n)³ · (1 + ε)³ ≤ 3.3s³/n^{3/2}.

Finally, estimate (B.9) is replaced by

Var(X) ≤ E(X) + (2.3n^{−1/2} + 3.3) · s³/n^{3/2} < E(X) + 3.5s³/n^{3/2},

where we used that √n > 25. The rest of the argument is verbally the same as
in the proof of Lemma B.1.
In the random sampling experiment, we can even achieve polynomial reliability
with a moderate number of random bits.

Corollary B.4 In the situation of Lemma B.1, let s ≥ 4n^{3/4}, and let λ ≥ 1
be an arbitrary integer. If the experiment described in Corollary B.3 is repeated
independently 4λ times to generate 4λ sequences (t_{ℓ,1}, . . . , t_{ℓ,s}), with 1 ≤ ℓ ≤ 4λ,
of elements of S, then

Prob(∃ k ∈ {1, . . . , 4λ} ∃ i, j ∈ {1, . . . , s} ∃ ν ∈ {1, . . . , m} :
t_{k,i} ≠ t_{k,j} ∧ t_{k,i}, t_{k,j} ∈ S_ν) > 1 − n^{-λ}.

Proof. By Corollary B.3, for each fixed ℓ the probability that the sequence t_{ℓ,1}, . . . ,
t_{ℓ,s} hits two different elements in the same subset S_ν is at least 1 − 4√n/s ≥
1 − n^{−1/4}. By independence, the probability that this happens for at least one of the
4λ sequences is at least 1 − (n^{−1/4})^{4λ} = 1 − n^{-λ}; clearly, this is also a lower bound on the
probability that the whole sequence t_{ℓ,i}, with 1 ≤ ℓ ≤ 4λ and 1 ≤ i ≤ s, hits two
different elements in the same set S_ν.
Lemma B.5 Let S = {1, . . . , n} for some n ≥ 800 and take s = ⌈4n^{3/4}⌉. Then
the random experiment described in Corollary B.3 can be carried out in o(n) time
using a sample space of size O(n⁶) (or, informally, using 6 log₂ n + O(1) random
bits).
Proof. Let us assume for the time being that a prime number p with s < p ≤ 2s is
given. (We will see at the end of the proof how such a p can be found within the
time bound claimed.) According to [9], a 4-independent sequence t′_1, . . . , t′_p, where
each t′_j is uniformly distributed in {0, . . . , p − 1}, can be generated as follows:
Choose 4 coefficients α_0, α_1, α_2, α_3 randomly from {0, . . . , p − 1} and let

t′_j = (Σ_{r=0}^{3} α_r j^r) mod p, for 1 ≤ j ≤ p.

By repeating this experiment once (independently), we obtain another such sequence
t″_1, . . . , t″_p. We let

t_j = 1 + (t′_j + p·t″_j) mod n, for 1 ≤ j ≤ s.
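(In Python, the experiment just described can be sketched as follows; the function name is ours, and p is assumed to be a prime with s < p ≤ 2s.)

    import random

    def four_independent_sample(n, s, p):
        # Two independent random polynomials of degree 3 modulo p yield a
        # 4-independent sequence t_1, ..., t_s that is approximately
        # uniform on {1, ..., n}.
        def poly_seq():
            a = [random.randrange(p) for _ in range(4)]   # alpha_0, ..., alpha_3
            return [sum(a[r] * pow(j, r, p) for r in range(4)) % p
                    for j in range(1, s + 1)]
        t1, t2 = poly_seq(), poly_seq()
        return [1 + (t1[j] + p * t2[j]) % n for j in range(s)]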
Clearly, the overall size of the sample space is (p⁴)² = p⁸ = O(n⁶), and the time
needed for generating the sample is O(s). We must show that the distribution of
t_1, . . . , t_s satisfies conditions (a) and (b) of Corollary B.3. Since the two sequences
(t′_1, . . . , t′_p) and (t″_1, . . . , t″_p) originate from independent experiments and each of
them is 4-independent, the sequence

t′_1 + p·t″_1, . . . , t′_s + p·t″_s

is 4-independent; hence the same is true for t_1, . . . , t_s, and (a) is proved.
Further, t′_j + p·t″_j is uniformly distributed in {0, . . . , p² − 1}, for 1 ≤ j ≤ s. From this, it is
easily seen that, for x ∈ S,

Prob(t_j = x) ∈ {⌊p²/n⌋ · (1/p²), ⌈p²/n⌉ · (1/p²)}.

Now observe that (p²/n − 1)/p² < 1/n < (p²/n + 1)/p², and that

⌈p²/n⌉ · (1/p²) − ⌊p²/n⌋ · (1/p²) ≤ 1/p² < 1/s² ≤ 1/(16n^{3/2}) = (1/(16√n)) · (1/n) < ε/n,

where we used that n ≥ 800, whence 1/(16√n) < 0.0025 = ε. Hence
|Prob(t_j = x) − 1/n| < ε/n, which establishes condition (b).
Finally, we must show how to find a prime p with s < p ≤ 2s. By the inequality
of Finsler (see the proof of Lemma A.1), such a prime exists. It can be found by
running the sieve of Eratosthenes on {1, . . . , 2s}, which takes time

O(Σ_{p ≤ 2s, p prime} 2s/p) = O(s · (1 + Σ_{p ≤ 2s, p prime} 1/p)) = O(s log log s),

where the last estimate results from the fact that

Σ_{p ≤ x, p prime} 1/p = O(log log x).

(For instance, this can easily be derived from the inequality π(2n) − π(n) <
7n/(5 ln n), valid for all integers n > 1, which is proved in [31, Section 3.14].)
Since s = ⌈4n^{3/4}⌉, we have O(s log log s) = o(n), as required.