Efficiency of A Good But Not Linear Set Union Algorithm. Tarjan
ROBERT ENDRE TARJAN
ABSTRACT. Two types of instructions for manipulating a family of disjoint sets which partition a
universe of n elements are considered. FIND(x) computes the name of the (unique) set containing
element x. UNION(A, B, C) combines sets A and B into a new set named C. A known algorithm for
implementing sequences of these instructions is examined. It is shown that, if t(m, n) is the maximum
time required by a sequence of $m \ge n$ FINDs and n - 1 intermixed UNIONs, then
$k_1 m\alpha(m, n) \le t(m, n) \le k_2 m\alpha(m, n)$ for some positive constants $k_1$ and $k_2$, where $\alpha(m, n)$ is related to a functional
inverse of Ackermann's function and is very slow-growing.
KEY WORDS AND PHRASES: algorithm, complexity, equivalence, partition, set union, tree
Introduction
Suppose we want to use two types of instructions for manipulating disjoint sets. FIND(x)
computes the name of the unique set containing element x. UNION(A, B, C) combines
sets A and B into a new set named C. Initially we are given n elements, each in a singleton
set. We then wish to carry out a sequence of $m \ge n$ FINDs and n - 1 intermixed
UNIONs.
An algorithm for solving this problem is useful in many contexts, including handling
EQUIVALENCE and COMMON statements in FORTRAN [3, 6], finding minimum spanning
trees [9], computing dominators in directed graphs [14], checking flow graphs for
reducibility [13], calculating depths in trees [2], computing least common ancestors in
trees [2], and solving an offline minimum problem [7].
Several algorithms have been developed [3, 5-7, 10, 12], notably a very complicated one
by Hopcroft and Ullman [7]. It is an extension of an idea by Stearns and Rosenkrantz
[12] and has an $O(m \log^* n)$ worst-case running time, where
$$\log^* n = \min\{\, i \mid \underbrace{\log \log \cdots \log}_{i \text{ times}} n \le 1 \,\}.$$
set, and the root of the tree represents the entire set as well as some element. Each tree
vertex is represented in a computer by a cell containing two items: the element corre-
sponding to the vertex, and either the name of the set (if the vertex is the root of the
tree) or a pointer to the father of the vertex in the tree. Initially, each singleton set is
represented by a tree with one vertex. The basic notion of representing the sets by trees
was presented by Galler and Fischer [6].
To carry out FIND(x), we locate the cell containing x; then we follow pointers to the
root of the corresponding tree to get the name of the set. In addition, we may collapse the
tree as follows.
Collapsing Rule. After a FIND, make all vertices reached during the FIND operation
sons of the root of the tree.
Figure 1 illustrates a FIND operation with collapsing. Collapsing at most multiplies
the time a FIND takes by a constant factor and may save time in later finds. Knuth [4]
attributes the collapsing rule to Tritter; independently, McIlroy and Morris [8] used it
in an algorithm for finding spanning trees.
To carry out UNION(A, B, C), we locate the roots named A and B, make one a son
of the other, and name the new root C, after deleting the old names (Figure 2). We may
arbitrarily pick A or B as the new root, or we may apply a union rule, such as the fol-
lowing:
Weighted Union Rule. If set A contains more elements than set B, make B a son of A.
Otherwise make A a son of B.
In order to implement this rule, we must attach a third item to each cell, namely the
number of its descendants. Morris apparently first described the weighted union rule [8].
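The representation described above can be sketched in Python as follows. This is a minimal illustration, not the paper's own code: the class name `DisjointSets`, the dictionary-based cells, and the convention that each singleton set is initially named after its element are assumptions made for the example. The descendant count of a vertex includes the vertex itself.

```python
class DisjointSets:
    """Forest of trees; each root carries the name of its set."""

    def __init__(self, elements):
        self.parent = {x: None for x in elements}     # None marks a root
        self.size = {x: 1 for x in elements}          # descendants, counting the vertex itself
        self.name_of_root = {x: x for x in elements}  # assumption: singleton {x} is named x
        self.root_of_name = {x: x for x in elements}

    def find(self, x):
        """FIND(x): return the name of the set containing x."""
        path = []
        while self.parent[x] is not None:
            path.append(x)
            x = self.parent[x]
        for v in path:                 # collapsing rule: every vertex reached
            self.parent[v] = x         # becomes a son of the root
        return self.name_of_root[x]

    def union(self, a, b, c):
        """UNION(A, B, C): merge the distinct sets named a and b into one named c."""
        ra = self.root_of_name.pop(a)
        rb = self.root_of_name.pop(b)
        del self.name_of_root[ra], self.name_of_root[rb]   # delete the old names
        # weighted union rule: if A has more elements, B becomes a son of A;
        # otherwise A becomes a son of B
        if self.size[ra] > self.size[rb]:
            root, child = ra, rb
        else:
            root, child = rb, ra
        self.parent[child] = root
        self.size[root] += self.size[child]
        self.name_of_root[root] = c
        self.root_of_name[c] = root


ds = DisjointSets(range(6))
ds.union(0, 1, "A")
ds.union(2, 3, "B")
ds.union("A", "B", "C")
print(ds.find(0), ds.find(4))
```

Keeping a map from set names to roots lets UNION locate the roots named A and B in constant time, which matches the assumption in the text that each UNION takes a fixed finite time.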
We can easily implement these instructions on a random-access computer. Suppose we
carry out $m \ge n$ FINDs and n - 1 intermixed UNIONs. Each UNION requires a fixed
finite time. Each FIND requires some fixed amount of time plus time proportional to the
length of the path from the vertex representing the element to the root of the corresponding
tree. Let t(m, n) be the maximum time required by any such sequence of instructions.
(Often in practice m is O(n). Previous researchers have restricted their attention to bounding
$\max_{m+n=N} t(m, n)$ by some function of N. Any upper or lower bound on t(kN, N) for
some constant k gives an upper or lower bound on $\max_{m+n=N} t(m, n)$ and vice versa; only
the constants in the bound change.)
If neither the weighting nor the collapsing rule is used, it is easy to show that
$$k_1 mn \le t(m, n) \le k_2 mn \qquad (1)$$
for suitable positive constants $k_1$ and $k_2$. If only the weighting rule is used, it is similarly
easy to show that
$$k_1 m \log n \le t(m, n) \le k_2 m \log n \qquad (2)$$
for some positive constants $k_1$ and $k_2$. Fischer [5] gave (1) and (2) for m = n. If only the
collapsing rule is used, we shall show that
$$t(m, n) \le k m \max\bigl(1, \log(n^2/m)/\log(2m/n)\bigr) \qquad (3)$$
for some positive constant k. Paterson [11] proved this bound for m = n and Fischer [5]
proved that it is tight to within a constant factor when m = n.
If we use both the weighting rule and the collapsing rule, the algorithm becomes much
harder to analyze. Fischer [5] showed that $t(m, n) \le km \log \log n$ in this case, and Hopcroft
and Ullman [7] improved this bound to $t(m, n) \le km \log^* n$. Here we show that
$$k_1 m\alpha(m, n) \le t(m, n) \le k_2 m\alpha(m, n)$$
for some positive constants $k_1$ and $k_2$, where $\alpha(m, n)$ is related to a functional inverse of
Ackermann's function and is very slow-growing. Thus, t(m, n) is $o(m \log^* n)$ but not
O(m).
FIG. 1. A FIND on element a, with collapsing. Triangles denote subtrees. Collapsing converts tree T into tree T'.
FIG. 2. Union of two trees. Root $r_1$ of $T_1$ has a descendants; root $r_2$ of $T_2$ has b descendants. Root of new tree has a + b descendants.
An Upper Bound
It is useful to think about the set union algorithm in the following way [5]: suppose we
perform all n - 1 UNIONs first. Then we have a single tree with n vertices. Each of the
original FINDs now is a "partial" find in the new tree: to carry out FIND(x) we follow
the path in the tree from x to the furthest ancestor of x corresponding to a UNION
which appears before FIND(x) in the original sequence of operations, and we collapse
this path. Thus any sequence of m FINDs and n - 1 intermixed UNIONs corresponds
to a sequence of m partial finds performed on the tree created by carrying out the n - 1
UNIONs (without any FINDs).
Furthermore, let T be any tree created by a sequence of n - 1 UNIONs. For any vertex
v, let r(v), the rank of v, be the height of v in T. Then any sequence of m partial finds
performed on T, such that the ranks of the last vertices on the finds are nondecreasing,
corresponds to a sequence of m FINDs intermixed with the n - 1 UNIONs used to create
T. Thus we can get an upper bound on t(m, n) by bounding the length of m partial finds
performed on a tree created by a sequence of n - 1 UNIONs. We can get a lower bound
on t(m, n) by bounding the maximum total length of m partial finds, whose last vertices
have nondecreasing ranks, performed on a tree created by n - 1 UNIONs.
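A partial find, as used in this reduction, can be sketched as follows. The function name `partial_find` and the dictionary of father pointers are assumptions made for illustration; the sketch collapses the path from x to a chosen ancestor and returns the number of edges traversed, which is the quantity the bounds below add up.

```python
def partial_find(parent, x, top):
    """Collapse the path from x up to its ancestor `top`.

    `parent` maps each vertex to its father (None for the root).
    Returns the length of the path in edges.
    """
    path = []
    v = x
    while v != top:
        if v is None:
            raise ValueError("top is not an ancestor of x")
        path.append(v)
        v = parent[v]
    for u in path:            # every vertex on the path becomes a son of `top`
        parent[u] = top
    return len(path)


# A chain 0 -> 1 -> 2 -> 3 -> 4 with root 4.
chain = {0: 1, 1: 2, 2: 3, 3: 4, 4: None}
print(partial_find(chain, 0, 3), partial_find(chain, 0, 4))
```

In the usage above the second find is shorter than it would have been on the original chain, because the first find already made vertex 0 a son of vertex 3.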
To get an upper bound, we use a refinement and generalization of ideas in [7]. Since the
techniques involved are somewhat complicated, we introduce them by proving the upper
bound of (3), generalizing Paterson's result.
Let t'(m, n) be the maximum time used by the set union algorithm for $m \ge n$ FINDs
and n - 1 intermixed UNIONs, assuming that the algorithm uses the collapsing rule
but not the weighted union rule. We can bound t'(m, n) by bounding the maximum total
length of m partial finds performed on any tree with n vertices.
Let T be any tree with n vertices. If $v \in T$, let r(v), the rank of v, be the height of v
in T. Then $0 \le r(v) \le n - 1$, and $v \to w$ in T implies r(v) < r(w). Furthermore, writing
f(v) for the current father of v, if
f(v) = w before a partial find and $f(v) = w' \ne w$ after the find, then r(w) < r(w').
Thus ranks strictly increase along any path in any tree formed from T by carrying out
partial finds. (Note. r(v) is defined and fixed with respect to the original tree T, and
does not change even though the tree changes when partial finds are carried out.)
To bound the total length of m partial finds performed on T, we shall partition the
edges (v, w) on the find paths into various sets, bound the number of edges in each set,
and add the bounds.
Let F be the set of edges (v, w) on the m partial find paths. For $1 \le i \le z$, where z
and b are arbitrary integer parameters to be fixed later, let
$$M_i = \{(v, w) \in F \mid \exists j \text{ such that } b^i j \le r(v) < r(w) < b^i (j + 1) \text{ and } \exists k \text{ such that } r(v) < b^{i-1} k \le r(w)\}.$$
(Note. i - 1 is the most significant position where the b-ary representations of r(v)
and r(w) differ.)
Let $M_{z+1} = F - \bigcup_{i=1}^{z} M_i$. Clearly the sets $M_i$ partition F. For $1 \le i \le z + 1$, let
$$L_i = \{(v, w) \in M_i \mid \text{of the edges on the find path containing } (v, w),\ (v, w) \text{ is the last one in } M_i\}.$$
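The classification of an edge by the b-ary digits of its endpoint ranks can be made concrete with a short sketch. The function name `edge_class` is hypothetical; it returns the index i of the set containing an edge whose endpoints have ranks rv < rw, or z + 1 when the most significant differing digit lies at position z or higher.

```python
def edge_class(rv, rw, b, z):
    """Index i of the set M_i holding an edge with ranks rv < rw (z + 1 if none of M_1..M_z)."""
    assert 0 <= rv < rw and b >= 2 and z >= 1
    for i in range(1, z + 1):
        # b^i * j <= r(v) < r(w) < b^i * (j + 1): digits at positions >= i agree
        same_block = rv // b**i == rw // b**i
        # some multiple of b^(i-1) lies in (r(v), r(w)]: a digit at position >= i - 1 differs
        crosses = rv // b**(i - 1) != rw // b**(i - 1)
        if same_block and crosses:
            return i
    return z + 1


# With b = 10 the digits are the usual decimal digits.
print(edge_class(123, 127, 10, 3))   # ranks differ first in the units digit
print(edge_class(123, 157, 10, 3))   # ranks differ first in the tens digit
```

At most one i can satisfy both conditions, since the most significant differing digit position is unique; this is why the sets partition F.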
LEMMA 1. $|L_i| \le m$.
PROOF. Obvious.
LEMMA 2. $|M_i - L_i| \le bn$ for $1 \le i \le z$.
PROOF. Let $v \in T$. Suppose $(v, w) \in M_i - L_i$. Then there is an edge $(v', w') \in M_i$
following (v, w) on the same find path. It follows from the definition of $M_i$ that for some
$k_0$, $r(w) \le r(v') < b^{i-1} k_0 \le r(w')$. If w'' = f(v) after this find is performed, it follows
that if $r(w) \ge b^{i-1} k$, then $r(w'') \ge b^{i-1}(k + 1)$.
Suppose that $M_i - L_i$ contains x(v) edges of the form (v, w). Let w''' = f(v) just before
the last find corresponding to such an edge is performed. Then by the reasoning above,
$r(w''') \ge b^{i-1}(b \lfloor r(v)/b^i \rfloor + x(v) - 1)$. But by the definition of $M_i$, $b^i(\lfloor r(v)/b^i \rfloor + 1) > r(w''')$.
Therefore x(v) - 1 < b, and $x(v) \le b$. Summing over all vertices, $|M_i - L_i| \le bn$. Q.E.D.
LEMMA 3. $|M_{z+1} - L_{z+1}| \le n^2/b^z + n$.
PROOF. Let $v \in T$. Suppose $(v, w) \in M_{z+1} - L_{z+1}$. Then there is an edge $(v', w') \in M_{z+1}$
following (v, w) on the same find path. Let $j = \lfloor r(w')/b^z \rfloor$. Then $r(w) \le r(v') < b^z j \le r(w') < b^z(j + 1)$;
otherwise $(v', w') \in M_i$ for some $1 \le i \le z$. If w'' = f(v) after
this find is performed, $\lfloor r(w'')/b^z \rfloor \ge \lfloor r(w')/b^z \rfloor \ge \lfloor r(w)/b^z \rfloor + 1$. Thus $\lfloor r(f(v))/b^z \rfloor$ increases
by at least one each time an edge $(v, w) \in M_{z+1} - L_{z+1}$ occurs on a find path, and since
$0 \le \lfloor r(f(v))/b^z \rfloor \le \lfloor (n - 1)/b^z \rfloor$, v can occur in only $\lfloor (n - 1)/b^z \rfloor + 1$ edges $(v, w) \in M_{z+1} - L_{z+1}$.
Summing over all n vertices gives the bound. Q.E.D.
THEOREM 4. $|F| \le 3m \cdot \max(1, \lceil \log(n^2/m)/\log \lfloor 2m/n \rfloor \rceil) + 2m + n$.
PROOF.
$$|F| = \sum_{i=1}^{z+1} |L_i| + \sum_{i=1}^{z+1} |M_i - L_i| \le (z + 1)m + bzn + n^2/b^z + n$$
for all $z \ge 1$, b > 1.
Pick $b = \lfloor 2m/n \rfloor$ and $z = \max(1, \lceil \log(n^2/m)/\log b \rceil)$. Then $bn \le 2m$ and $n^2/b^z \le m$, so
$|F| \le 3m \cdot \max(1, \lceil \log(n^2/m)/\log \lfloor 2m/n \rfloor \rceil) + 2m + n$. Q.E.D.
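The choice of parameters in this proof can be checked numerically. The sketch below assumes the reading b = floor(2m/n) and z = max(1, ceil(log(n^2/m)/log b)) of the garbled brackets, and the helper names are hypothetical. It computes z with integer arithmetic (the smallest z >= 1 with b^z >= n^2/m) to avoid floating-point logarithms, then compares the sum of the three lemma bounds with the closed form of the theorem.

```python
from fractions import Fraction


def theorem4_parameters(m, n):
    """Return (b, z) with b = floor(2m/n) and z the least integer >= 1 with b**z >= n*n/m."""
    assert m >= n >= 1
    b = (2 * m) // n
    z = 1
    while b**z * m < n * n:
        z += 1
    return b, z


def lemma_sum(m, n, b, z):
    """(z + 1)m + bzn + n^2/b^z + n, the bound obtained from Lemmas 1, 2, and 3."""
    return (z + 1) * m + b * z * n + Fraction(n * n, b**z) + n


def theorem4_bound(m, n):
    """3mz + 2m + n for the z chosen above."""
    _, z = theorem4_parameters(m, n)
    return 3 * m * z + 2 * m + n


for m, n in [(8, 8), (100, 10), (1000, 1000), (5000, 100)]:
    b, z = theorem4_parameters(m, n)
    print(m, n, b, z, lemma_sum(m, n, b, z) <= theorem4_bound(m, n))
```

For m = n = 8 the two quantities coincide, which suggests the constants in the theorem cannot be lowered by this particular argument.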
COROLLARY 5.
Let r(v) be defined as before, fixed with respect to T. For $v \in T$, let d(v) be the number
of descendants of v, again fixed with respect to T.
LEMMA 6. If $v \to w$ in T, then $d(w) \ge 2d(v)$.
PROOF. In the process of forming T, some union makes v a son of w. At this time
$d(w) \ge 2d(v)$ because the union obeys the weighted union rule. Subsequent unions cannot
change the number of descendants of v and can only increase the number of descendants
of w. Q.E.D.
COROLLARY 7. $d(v) \ge 2^{r(v)}$ and $0 \le r(v) \le \log_2 n$ for all $v \in T$.
PROOF. By induction on r(v), using Lemma 6.
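Lemma 6 and Corollary 7 can be checked empirically on trees built with the weighted union rule and no collapsing. The sketch below is an illustration under stated assumptions: the function names are hypothetical, pairs of roots are chosen at random, and d(v) counts v itself as a descendant (so a leaf has d(v) = 1).

```python
import random


def build_weighted_tree(n, seed=0):
    """Perform n - 1 random UNIONs with the weighted union rule; return (parent, desc)."""
    rng = random.Random(seed)
    parent = [None] * n
    desc = [1] * n                      # d(v), counting v itself
    roots = list(range(n))
    while len(roots) > 1:
        a, b = rng.sample(roots, 2)
        if desc[a] > desc[b]:           # A has more elements: B becomes a son of A
            a, b = b, a
        parent[a] = b                   # otherwise A becomes a son of B
        desc[b] += desc[a]
        roots.remove(a)
    return parent, desc


def ranks(parent):
    """r(v) = height of v, the length of the longest path from a descendant up to v."""
    rank = [0] * len(parent)
    for v in range(len(parent)):
        h, u = 0, v
        while parent[u] is not None:
            u = parent[u]
            h += 1
            rank[u] = max(rank[u], h)
    return rank


parent, desc = build_weighted_tree(200, seed=1)
rank = ranks(parent)
print(max(rank), all(desc[v] >= 2 ** rank[v] for v in range(200)))
```

Since no vertex has more than n descendants, the inequality d(v) >= 2^r(v) forces every rank to be at most log2 n, which is what keeps find paths short even before collapsing is taken into account.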
Let the function A(i, x) on integers be defined by
$$A(0, x) = 2x, \qquad A(i, 0) = 0 \ \text{for } i \ge 1,$$
$$A(i, 1) = 2 \ \text{for } i \ge 1, \qquad A(i, x) = A(i - 1, A(i, x - 1)) \ \text{for } i \ge 1,\ x \ge 2. \qquad (5)$$
A(i, x) is a slight variant of Ackermann's function [1]; it is not primitive recursive. Some
important facts about A(i, x) appear below.
$A(0, 2) = 2 \cdot 2 = 4$. $A(i + 1, 2) = A(i, A(i + 1, 1)) = A(i, 2)$. Therefore, by induction,
$$A(i, 2) = 4 \ \text{for all } i.$$
$A(1, 1) = 2$. $A(1, x + 1) = A(0, A(1, x)) = 2 \cdot A(1, x)$. Therefore,
$$A(1, x) = 2^x \ \text{for } x \ge 1. \qquad (6)$$
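Definition (5) translates directly into a short recursive function, which makes the facts above easy to check for small arguments. This is only a sketch: the memoization is an implementation convenience, and the values grow so quickly that anything beyond very small i and x is out of reach (the recursion depth alone becomes prohibitive).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def A(i, x):
    """The variant of Ackermann's function given by definition (5)."""
    if i == 0:
        return 2 * x
    if x == 0:
        return 0
    if x == 1:
        return 2
    return A(i - 1, A(i, x - 1))


print([A(i, 2) for i in range(6)])        # A(i, 2) = 4 for all i
print([A(1, x) for x in range(1, 8)])     # powers of two, by (6)
print(A(2, 3), A(2, 4), A(3, 3))
```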
$A(2, 1) = 2$. $A(2, x + 1) = A(1, A(2, x)) = 2^{A(2, x)}$. Therefore, for $x \ge 1$, A(2, x) is a tower of x twos.
The following rather weak inequalities will be used in the lower bound proof. By (6), (9),
and (10),
Let m partial finds be performed on T. Let F be the set of edges (v, w) on these find
paths. Partition F as follows: if $(v, w) \in F$ and for some i and j, $v \in S_{ij}$ and $w \in S_{ij}$,
let $(v, w) \in N_k$, where $k = \min\{i \mid \exists j \text{ such that } v, w \in S_{ij}\}$. If for all i and j, either $v \notin S_{ij}$ or
$w \notin S_{ij}$, let $(v, w) \in N_{z+1}$. For $0 \le i \le z + 1$, let
$$L_i = \{(v, w) \in N_i \mid \text{of the edges on the find path containing } (v, w),\ (v, w) \text{ is the last one in } N_i\}.$$
LEMMA 9. $|L_i| \le m$.
PROOF. Obvious.
LEMMA 10. $|N_0 - L_0| \le n$.
PROOF. Let $v \in T$. Suppose $(v, w) \in N_0 - L_0$. Then there is an edge $(v', w') \in N_0$
following (v, w) on the same find path. It follows that for some j, $2j \le r(v) < r(w) < 2j + 2 \le r(w')$,
and $r(w') - r(v) \ge 2$. If f(v) = w'' after this find is performed, $r(w'') - r(v) \ge 2$;
and no finds after this one can contain an edge $(v, w''') \in N_0$. Thus each
vertex v is in at most one edge $(v, w) \in N_0 - L_0$. Q.E.D.
LEMMA 11. For $1 \le i \le z$, $|N_i - L_i| \le \frac{5}{8} n$.
PROOF. Let $v \in T$ and suppose $v \in S_{ij}$; i.e. $A(i, j) \le r(v) < A(i, j + 1)$. Suppose
$(v, w) \in N_i - L_i$. Since $S_{i0} = S_{00}$ and $S_{i1} = S_{01}$ for all i, it must be the case that $j \ge 2$.
There is an edge $(v', w') \in N_i$ following (v, w) on the same find path. From the definition
of $N_i$, there is some $k_0$ such that $r(w) \le r(v') < A(i - 1, k_0) \le r(w')$. If w'' = f(v)
after this find is performed, it follows that if $A(i - 1, k) \le r(w)$, then $A(i - 1, k + 1) \le r(w'')$.
Suppose that $N_i - L_i$ contains x(v) edges of the form (v, w). Let w''' = f(v) just before
the last find corresponding to such an edge is performed. Then by the reasoning above,
$A(i - 1, x(v) - 1) \le r(w''')$, and by the definition of $N_i$, $r(w''') < A(i, j + 1)$. Since
$j \ge 2$,
$A(i - 1, x(v) - 1) < A(i, j + 1) = A(i - 1, A(i, j))$, and since A is increasing
in its second argument, $x(v) - 1 < A(i, j)$; or, $x(v) \le A(i, j)$.
$\le n + \frac{5}{8} zn + n\,a(z, n) + (z + 2)m$ for any $z \ge 1$, by Lemmas 9, 10, 11, and 12
$\le (\frac{5}{8} n + m)\,\alpha(m, n) + 10m + n$, by choosing $z = \alpha(m, n)$, since
$a(\alpha(m, n), n) \le 4\lceil m/n \rceil \le 8m/n$. Q.E.D.
COROLLARY 14. $t(m, n) \le k_2 m\alpha(m, n)$ for some constant $k_2$. (17)
The function $\alpha(m, n)$ is maximized when m = n; if k is a fixed constant and
$m \ge n \cdot a(k, n)$, (17) implies t(m, n) is O(m). If $\log_2 n < A(3, 4)$, $\alpha(m, n) \le 3$.
A Lower Bound
Let t(m, n) be the maximum time required by the set union algorithm for $m \ge n$ FINDs
and n - 1 UNIONs, assuming that both collapsing and weighted union are used. To get
a nonlinear lower bound on t(m, n), we shall show that for any k, there is a finite n(k)
such that a tree with n(k) vertices exists in which n(k) partial finds, each of length k,
may be performed. The argument is quite complicated, but is devised to be as general
as possible.
Let T be a tree contained in an acyclic graph G, called a shortcut graph of T. If
$v_0 \to v_1 \to \cdots \to v_k$ is a path in T, we perform a generalized find (g-find) in T by adding to G each
edge $(v_i, v_j)$, i < j, which is not already present in G. The cost of the g-find is the length l
of the shortest path from $v_0$ to $v_k$ in G (before the new edges are added). We shall only
allow g-finds such that $v_k$ is the root of T or the father of $v_k$ is at distance l + 1 from $v_0$
in G before the new edges are added. We shall provide a nonlinear lower bound on the
maximum total cost of a sequence of $m \ge n$ g-finds performed on a tree T with n vertices
constructed using any union rule, assuming that the shortcut graph is initially T itself.
This gives a nonlinear lower bound on t(m, n). It also gives a nonlinear lower bound on
the running time of almost any conceivable set union algorithm.
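The mechanics of a g-find can be sketched as follows. The representation (a dictionary mapping each vertex to the set of vertices it has an edge to, directed toward the root) and the function names are assumptions made for illustration. The sketch computes the cost by breadth-first search before adding the shortcut edges; it does not check the admissibility condition on $v_k$ stated above.

```python
from collections import deque


def shortest_distance(graph, src, dst):
    """Breadth-first distance from src to dst along directed edges; None if unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for w in graph.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return None


def g_find(graph, path):
    """Perform a g-find along `path` = [v0, ..., vk], a path in T; return its cost."""
    cost = shortest_distance(graph, path[0], path[-1])   # measured before new edges are added
    for i in range(len(path)):
        for j in range(i + 1, len(path)):
            graph.setdefault(path[i], set()).add(path[j])
    return cost


# Shortcut graph initially equal to the tree 0 -> 1 -> 2 -> 3 -> 4 (root 4).
G = {0: {1}, 1: {2}, 2: {3}, 3: {4}, 4: set()}
print(g_find(G, [0, 1, 2]), g_find(G, [0, 1, 2, 3, 4]), g_find(G, [0, 1, 2, 3, 4]))
```

A g-find adds at least the edges that collapsing would add, so a lower bound on total g-find cost carries over to partial finds with collapsing.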
Let T be any tree. Let T(0) = T. For any $i \ge 1$, let T(i) be formed from two copies
of T(i - 1) by making the root of one of them the son of the root of the other. If T is
the tree having a single vertex, T(i) is called an $S_i$ tree. $S_i$ has $2^i$ vertices and $2^{i-1}$ leaves.
Removal of all the leaves from $S_i$ produces $S_{i-1}$. $S_i$ may be formed using any union rule,
since the trees combined at each step are identical.
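The doubling construction is easy to simulate. In the sketch below a tree is stored as a list of father indices (None for the root); the function names are hypothetical. It builds $S_i$ from the single-vertex tree and counts vertices and leaves, so the counts stated above can be checked for small i.

```python
def double(parent, root):
    """Given T(i - 1) as (parent, root), return T(i): two copies, one root under the other."""
    n = len(parent)
    copy = [None if p is None else p + n for p in parent]   # second copy, indices shifted by n
    new_parent = parent + copy
    new_parent[root + n] = root        # root of the second copy becomes a son of the first root
    return new_parent, root


def s_tree(i):
    """Return the father list of S_i, built from the single-vertex tree."""
    parent, root = [None], 0
    for _ in range(i):
        parent, root = double(parent, root)
    return parent


def leaf_count(parent):
    """Number of vertices that are nobody's father."""
    fathers = {p for p in parent if p is not None}
    return len(parent) - len(fathers)


for i in range(1, 6):
    t = s_tree(i)
    print(i, len(t), leaf_count(t))
```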
Let G be a shortcut graph of T. Let G(0) = G. For any $i \ge 1$, let G(i) be formed from
two copies of G(i - 1) by adding an edge from the root of T(i - 1) embedded in one to
the root of T(i - 1) embedded in the other. Then G(i) is a shortcut graph of T(i).
THEOREM 15. Let T be any tree with two or more vertices and $s \ge 1$ leaves, and let G be
any shortcut graph of T. If $i \ge A(4k, 4s)$, we can perform a g-find of cost k on at least half
the leaves in T(i), starting with shortcut graph G(i).
PROOF. We prove the theorem by double induction on k and s. Suppose $k \le 1$ and
s is arbitrary. For any $i \ge A(4k, 4s) \ge 0$, each leaf in T(i) is at a distance of one or more
from the root of T(i) in any shortcut graph. Thus the theorem is true for $k \le 1$.
Suppose k = 2 and s is arbitrary. Half the leaves in T(1) are at a distance of at least
two in G(1) from the root of T(1) and remain that way regardless of what g-finds are
done on the other leaves. Thus g-finds of cost two can be done on all these leaves. It follows
that if $i \ge A(4k, 4s) = A(8, 4s) > 1$, g-finds of cost two can be done on half the
leaves of T(i). Thus the theorem is true for k = 2.
Suppose the theorem holds for all k' < k and arbitrary s. We prove the theorem holds
for k with s = 1. We can assume $k \ge 3$. The tree T(1) has two leaves. One is in the copy
of T whose root is the root of T(1). Call this the r-leaf of T(1) and call the leaf in the
other copy of T the u-leaf of T(1). The u-leaf has a father different from the root of T(1).
If T' is the tree consisting of the path from the father of the u-leaf to the root of T(1),
then by the induction hypothesis a g-find of length k - 1 may be performed in $T'(A(4k - 4, 4))$
on half the leaves, starting with shortcut graph $G(1 + A(4k - 4, 4))$. Thus in
$T(1 + A(4k - 4, 4))$ a g-find of length k - 1 can be performed on the fathers of one-half
of the u-leaves, starting with shortcut graph $G(1 + A(4k - 4, 4))$. This means that
in $T(1 + A(4k - 4, 4))$ a g-find of length k can be performed on one-half of the u-leaves,
starting with shortcut graph $G(1 + A(4k - 4, 4))$. Let G' be the resulting shortcut
graph.
Consider the u-leaves of the T(1) trees embedded in $T(1 + A(4k - 4, 4))$ on which
g-finds have not been performed. There are $2^{A(4k-4, 4)-1}$ such leaves. Each of these has a
distinct father and no pair of these fathers is related in $T(1 + A(4k - 4, 4))$. It follows
by the induction hypothesis that in $T(1 + A(4k - 4, 4) + A(4k - 4, 2^{A(4k-4, 4)+1}))$ a
g-find of length k - 1 can be performed on one-half of these fathers, starting with shortcut
graph $G'(A(4k - 4, 2^{A(4k-4, 4)+1}))$. Let $n_1 = 1 + A(4k - 4, 4) + A(4k - 4, 2^{A(4k-4, 4)+1})$.
Then in $T(n_1)$ a g-find of length k can be performed on an additional one-fourth of the
u-leaves of the embedded trees T(1), starting with shortcut graph
$G'(A(4k - 4, 2^{A(4k-4, 4)+1}))$. Let the resulting shortcut graph be G''.
Now consider the r-leaves of the trees T(1) embedded in $T(n_1)$. No g-finds have been
performed on these $2^{n_1 - 1} > 2$ leaves. The fathers of all these leaves are distinct and half
of them are unrelated to each other. By the induction hypothesis, in $T(n_1 + A(4k - 4, 2^{n_1}))$
we may perform a g-find of length k - 1 on one-half of these unrelated fathers,
starting with shortcut graph $G''(A(4k - 4, 2^{n_1}))$. It follows that in $T(n_1 + A(4k - 4, 2^{n_1}))$,
g-finds of length k can be performed on an additional one-eighth of the leaves of
the embedded T(1) trees, starting with shortcut graph $G''(A(4k - 4, 2^{n_1}))$.
Combining these results, we see that in $T(n_1 + A(4k - 4, 2^{n_1}))$, starting with shortcut
graph $G(n_1 + A(4k - 4, 2^{n_1}))$, we can perform g-finds of length k on one-half of the
leaves. Furthermore,
$$\begin{aligned}
n_1 &= 1 + A(4k - 4, 4) + A(4k - 4, 2^{A(4k-4, 4)+1}) \\
&\le 1 + A(4k - 4, 4) + A(4k - 3, A(4k - 4, 4) + 2) && \text{by (11)} \\
&\le 1 + A(4k - 4, 4) + A(4k - 2, 4) && \text{by (5), (12)} \qquad (18)
\end{aligned}$$
and
$$\begin{aligned}
n_1 + A(4k - 4, 2^{n_1}) &< n_1 + A(4k - 3, n_1 + 1) && \text{by (11)} \\
&< n_1 + A(4k - 3, A(4k - 4, 4) + A(4k - 2, 4) + 2) && \text{by (9), (10), (13), (18)} \\
&< 4A(4k - 2, 6) \\
&< A(4k, 4) && \text{by (14).}
\end{aligned}$$
Thus the theorem holds for k with s = 1, since if we can perform the desired g-finds in
$T(n_1 + A(4k - 4, 2^{n_1}))$, we can certainly perform them in T(i) if $i \ge A(4k, 4) > n_1 + A(4k - 4, 2^{n_1})$.
Suppose the theorem holds for all k', s' such that k' < k, or k' = k and s' < s. We
prove the theorem for k and s, using an argument like that above but slightly more
complicated.
Consider $T(A(4k, 4s - 4))$. Ignoring one leaf in each copy of T, by the induction
hypothesis we can perform g-finds of length k on one-half of the remaining leaves, producing
a shortcut graph G' from $G(A(4k, 4s - 4))$. Now consider the ignored leaves, one
per copy of T. Since $A(4k, 4s - 4) \ge 1$, $T(A(4k, 4s - 4))$ consists of $2^{A(4k, 4s-4)-1}$ copies
of T(1). In each copy of T(1) there is one ignored leaf in the copy of T whose root is that
of T(1); call this the r-leaf. Call the ignored leaf in the other copy of T in T(1) the u-leaf.
In $T(A(4k, 4s - 4) + A(4k - 4, 2^{A(4k, 4s-4)+1}))$ we can, by the induction hypothesis,
perform g-finds of length k - 1 on one-half of the fathers of the u-leaves (since the
fathers of the u-leaves are unrelated). Alternatively we can perform g-finds of length k on
one-half of the u-leaves, producing shortcut graph G'' from shortcut graph
$G'(A(4k - 4, 2^{A(4k, 4s-4)+1}))$.
Let $n_2 = A(4k, 4s - 4) + A(4k - 4, 2^{A(4k, 4s-4)+1})$. In $T(n_2 + A(4k - 4, 2^{n_2}))$ we can
perform g-finds of length k - 1 on the fathers of an additional one-fourth of the u-leaves,
or instead g-finds of length k on an additional one-fourth of the u-leaves, producing shortcut
graph G''' from shortcut graph $G''(A(4k - 4, 2^{n_2}))$.
Let $n_3 = n_2 + A(4k - 4, 2^{n_2})$. Now consider the r-leaves of T(1) contained in $T(n_3)$.
Half of them have distinct, unrelated fathers. Hence in $T(n_3 + A(4k - 4, 2^{n_3}))$ we can
perform g-finds of length k - 1 on the fathers of one-fourth of the r-leaves, or instead
g-finds of length k on one-fourth of the r-leaves, starting with shortcut graph
$G'''(A(4k - 4, 2^{n_3}))$.
Combining, we see that in T(i), if $i \ge n_3 + A(4k - 4, 2^{n_3})$, we can perform g-finds of
length k on one-half of the leaves, starting with shortcut graph G(i).
But we have
$$\begin{aligned}
n_2 &= A(4k, 4s - 4) + A(4k - 4, 2^{A(4k, 4s-4)+1}) \\
&< A(4k, 4s - 4) + A(4k - 3, A(4k, 4s - 4) + 2) && \text{by (11)} \\
&< A(4k, 4s - 4) + A(4k, 4s - 3) && \text{by (15).} \qquad (19)
\end{aligned}$$
$$\begin{aligned}
n_3 &= n_2 + A(4k - 4, 2^{n_2}) \\
&\le A(4k, 4s - 4) + A(4k, 4s - 3) + A(4k - 3, A(4k, 4s - 4) + A(4k, 4s - 3) + 1) && \text{by (11), (19)} \\
&< A(4k, 4s - 4) + A(4k, 4s - 3) + A(4k - 3, 3A(4k, 4s - 3)) && \text{by (10)} \\
&< A(4k, 4s - 4) + A(4k, 4s - 3) + A(4k, 4s - 2) && \text{by (16)} \\
&< 3A(4k, 4s - 2) && \text{by (10).} \qquad (20)
\end{aligned}$$
$$\begin{aligned}
n_3 + A(4k - 4, 2^{n_3}) &< 3A(4k, 4s - 2) + A(4k - 3, 3A(4k, 4s - 2) + 1) && \text{by (11), (20)} \\
&< 3A(4k, 4s - 2) + A(4k, 4s - 1) && \text{by (16)} \\
&< 4A(4k, 4s - 1) && \text{by (10)} \\
&< 2^{A(4k, 4s-1)} < A(4k - 1, A(4k, 4s - 1)) = A(4k, 4s) && \text{by (5), (6), (9).}
\end{aligned}$$
Thus the theorem holds for k and s, and the theorem holds in general by double induc-
tion.
Let t''(m, n) be the maximum cost of a sequence of $m \ge n$ g-finds performed on any
tree T of n vertices formed using some union rule, starting with T itself as the shortcut
graph.
THEOREM 16. For some constant $k_1$, $k_1 m\alpha(m, n) \le t''(m, n)$.
PROOF. Let T be a tree consisting of a root and s sons of the root. By Theorem 15, a
g-find of cost k can be performed on half the leaves of $T(A(4k, 4s))$. It follows that a
total of $s \cdot 2^{A(4k, 4s)-2}$ g-finds of cost k - 1 can be performed on vertices of $S_{A(4k, 4s)}$.
Let m and n satisfy $\alpha(m, n) \ge 2$. Let $k = \lfloor \frac{1}{4}\alpha(m, n) \rfloor - 1$. Then $A(4k, 4\lceil m/n \rceil) \le \log n$.
From n vertices, using any union rule, we can construct one or more copies of
$S_{A(4k, 4\lceil m/n \rceil)}$. We can use up at least half the available vertices forming such trees. Within
each such tree, we can perform $\lceil m/n \rceil \cdot 2^{A(4k, 4\lceil m/n \rceil)-2}$ g-finds, each of cost k - 1. Thus the
total cost of all such finds in all the trees is at least $(m/8)(\lfloor \frac{1}{4}\alpha(m, n) \rfloor - 2)$. This gives
the theorem.
COROLLARY 17. For some positive constant $k_1$, $k_1 m\alpha(m, n) \le t(m, n)$.
PROOF. The sequence of g-finds given by Theorem 16 can be interpreted as a sequence
of partial finds, each of length at least k. These partial finds can be ordered so that the
ranks of their final vertices are nondecreasing. This gives a sequence of n - 1 UNIONs
and m interspersed FINDs of total cost $k_1 m\alpha(m, n)$ for a suitable positive constant $k_1$.
Thus the bound (17) is tight to within a constant factor.
REFERENCES
1. ACKERMANN, W. Zum Hilbertschen Aufbau der reellen Zahlen. Math. Ann. 99 (1928), 118-133.
2. AHO, A. V., HOPCROFT, J. E., AND ULLMAN, J. D. On computing least common ancestors in
trees. Proc. 5th Annual ACM Symp. on Theory of Computing, Austin, Texas, 1973, pp. 253-265.
3. ARDEN, B. W., GALLER, B. A., AND GRAHAM, R. M. An algorithm for equivalence declarations.
Comm. ACM 4, 7 (July 1961), 310-314.
4. CHVÁTAL, V., KLARNER, D. A., AND KNUTH, D. E. Selected combinatorial research problems.
Tech. Rep. STAN-CS-72-292, Comput. Sci. Dep., Stanford U., Stanford, Calif., 1972.
5. FISCHER, M. J. Efficiency of equivalence algorithms. In Complexity of Computer Computations,
R. E. Miller and J. W. Thatcher, Eds., Plenum Press, New York, 1972, pp. 153-168.
6. GALLER, B. A., AND FISCHER, M. J. An improved equivalence algorithm. Comm. ACM 7, 5 (May
1964), 301-303.
7. HOPCROFT, J., AND ULLMAN, J. D. Set-merging algorithms. SIAM J. Comput. 2 (Dec. 1973),
294-303.
8. HOPCROFT, J. Private communication.
9. KERSCHENBAUM, A., AND VAN SLYKE, R. Computing minimum spanning trees efficiently.
Proc. 25th Annual Conf. of the ACM, 1972, pp. 518-527.
Journal of the Association for Computing Machinery, Vol. 22, No. 2, April 1975