Bitonic Sort
Batcher’s bitonic sort is a parallel sorting algorithm whose main operation is a technique for merging
two bitonic sequences. A bitonic sequence is the concatenation of an ascending and a descending
sequence of numbers. For example, 2, 4, 6, 8, 9, 24, 6, 3, 2, 0 is a bitonic sequence. Various
adaptations of Batcher’s original algorithm have been proposed. To sort a sequence of n numbers,
Batcher’s algorithm proceeds as follows. The first step is to convert the n numbers into a bitonic sequence
with n/2 numbers in an increasing subsequence and n/2 numbers in a decreasing subsequence. This is
done recursively, as illustrated in Figure 1. After the bitonic sequence with n numbers is obtained, it is
merged into an ordered sequence (either increasing or decreasing, depending upon which is needed). The
merging technique consists of log n stages, and each stage consists of three operations: shuffle, compare,
and unshuffle. Since this algorithm is well known, we do not explain these operations in detail here;
however, an example is shown in Figure 2.
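The recursive structure described above can be illustrated with a minimal sequential sketch in Python.
It is not the implementation discussed in the text: it assumes n is a power of two, it compares
element i directly with element i + n/2 instead of using the shuffle, compare, and unshuffle
formulation, and the function names bitonic_merge and bitonic_sort are chosen here for illustration.

```python
def bitonic_merge(seq, ascending=True):
    """Merge a bitonic sequence (length a power of two) into sorted order."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    lo, hi = list(seq[:half]), list(seq[half:])
    # Compare step: element i is compared with element i + n/2.
    for i in range(half):
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    # Both halves are again bitonic, so each is merged recursively.
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(seq, ascending=True):
    """Sort a sequence whose length is a power of two."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    # Build a bitonic sequence: increasing first half, decreasing second half.
    first = bitonic_sort(seq[:half], True)
    second = bitonic_sort(seq[half:], False)
    return bitonic_merge(first + second, ascending)


if __name__ == "__main__":
    print(bitonic_merge([2, 4, 6, 8, 9, 24, 6, 3]))
    print(bitonic_sort([5, 1, 7, 3, 8, 2, 6, 4]))
```

The first call merges an 8-element bitonic sequence; the second builds the bitonic sequence
recursively before merging it.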
Batcher’s original bitonic sort. Each of the p processors works on n/p numbers in every stage. At each
stage, the compare operation takes O((n/p)L) local computation, and the shuffle and unshuffle operations
have communication costs with other processors of O((n/p)C). Therefore the time complexity is the
product of the number of stages and the per-stage cost O((n/p)L + (n/p)C). Given n numbers, we have
log n phases. Phase 1 consists of 1 stage, phase 2 consists of 2 stages, and so on. Thus, the total number
of stages is O(log^2 n), and the overall time complexity of Batcher’s bitonic sort is
O(((n/p)L + (n/p)C) log^2 n).
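The stage count used in this bound can be checked with a short Python sketch. It assumes n is a power
of two, and the helper name bitonic_stage_count is hypothetical. Phase k contributes k stages, so the
total is 1 + 2 + ... + log n = (log n)(log n + 1)/2, which is O(log^2 n).

```python
def bitonic_stage_count(n):
    """Total number of stages over all log n phases (n a power of two)."""
    log_n = n.bit_length() - 1          # exact log2 for powers of two
    return sum(range(1, log_n + 1))     # phase k has k stages


if __name__ == "__main__":
    for n in (8, 1024):
        log_n = n.bit_length() - 1
        # The sum agrees with the closed form (log n)(log n + 1) / 2.
        print(n, bitonic_stage_count(n), log_n * (log_n + 1) // 2)
```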
Bitonic sorting with radix sort. To convert the n input elements into a bitonic sequence with n/2
elements in each of the increasing and decreasing subsequences, we use parallel radix sort. We then use
Batcher’s bitonic merge to convert the n-element bitonic sequence into an increasing sequence of n
numbers. This is done in log n stages and takes time O(((n/p)L + (n/p)C) log n). Thus the total time
complexity for this modification of bitonic sort is O(T_rs(n) + ((n/p)L + (n/p)C) log n), where T_rs(n)
denotes the time taken by the parallel radix sort on n keys.
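A minimal sequential sketch of this modification follows, under stated assumptions: n is a power of
two, Python's built-in sorted stands in for the parallel radix sort, and the names
bitonic_merge_stages and bitonic_sort_presorted are hypothetical. The merge is written iteratively so
that the log n stages can be counted.

```python
def bitonic_merge_stages(seq):
    """Merge a bitonic sequence iteratively; return (sorted list, stage count)."""
    a = list(seq)
    n = len(a)
    stages = 0
    gap = n // 2
    while gap >= 1:
        for i in range(n):
            # Positions whose 'gap' bit is clear are paired with i + gap.
            if i & gap == 0 and a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
        stages += 1
        gap //= 2
    return a, stages


def bitonic_sort_presorted(keys):
    """Sort each half independently, then apply a single bitonic merge."""
    half = len(keys) // 2
    increasing = sorted(keys[:half])                 # stand-in for radix sort
    decreasing = sorted(keys[half:], reverse=True)   # stand-in for radix sort
    return bitonic_merge_stages(increasing + decreasing)


if __name__ == "__main__":
    print(bitonic_sort_presorted([7, 3, 9, 1, 8, 2, 6, 4]))
```

For 8 keys the merge runs log 8 = 3 stages, in contrast to the 6 stages of the fully recursive
algorithm.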
Sample Sort
There have been many sorting algorithms proposed that use a random sample of the input elements to
partition the input into distinct subproblems that can be sorted independently (see, e.g., [8, 9, 16, 19]).
The basic version we implemented consists of the following three phases.
First, a random sample of p − 1 splitter elements S = {s_1, s_2, ..., s_{p−1}} is selected from the n
input elements. Once sorted, these splitter elements define a set of p buckets, e.g., bucket i covers the
range [s_i, s_{i+1}). Second, each input element is placed in (sent to) the appropriate bucket, and finally,
the elements in each bucket are sorted. Typically, one bucket is created for each processor, and the sorting
of the elements in each bucket is performed in parallel. The random selection of the splitter elements is
performed in parallel; if the size of the random sample is large enough, then it may be beneficial to
perform the sort of the splitter elements in parallel as well.
Note that the running time of this algorithm does not depend on the distribution of the keys, since the
splitters are selected from the input elements. It does depend, however, on the maximum bucket size,
i.e., the maximum number of elements contained in any bucket. Ideally, one would like all buckets to
contain an equal number of elements. Oversampling is a technique that has been used to try to reduce
the deviation in bucket size: in Step 1, one selects a set of ps input elements, sorts them, and uses the
elements with ranks s, 2s, ..., (p − 1)s as the splitter elements. The integer parameter s is called the
oversampling ratio. The ratio of the largest bucket size to the average bucket size, n/p, is called the
bucket expansion. Clearly, the choice of s is crucial to the expected running time of the algorithm. In
particular, s should be chosen large enough to achieve a small deviation in bucket sizes and small enough
so that the time required to sort the sample in Step 1 does not become too large. Thus, the appropriate
choice for s depends on both the value of n and the number of processors in the system.
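The effect of the oversampling ratio on the bucket expansion can be explored with a small Python
sketch. It draws the ps samples from the whole input rather than per processor, and the function
names bucket_sizes and bucket_expansion, the input size, and the seed are all assumptions made for
illustration.

```python
import bisect
import random


def bucket_sizes(keys, p, s, rng):
    """Pick p - 1 splitters from a sorted sample of p * s keys; return bucket sizes."""
    sample = sorted(rng.sample(keys, p * s))
    splitters = sample[s - 1::s][:p - 1]      # elements with ranks s, 2s, ..., (p - 1)s
    sizes = [0] * p
    for x in keys:
        sizes[bisect.bisect_right(splitters, x)] += 1
    return sizes


def bucket_expansion(sizes):
    """Ratio of the largest bucket size to the average bucket size."""
    return max(sizes) / (sum(sizes) / len(sizes))


if __name__ == "__main__":
    rng = random.Random(0)
    keys = [rng.random() for _ in range(20000)]
    for s in (1, 8, 64):
        print(s, round(bucket_expansion(bucket_sizes(keys, 16, s, rng)), 2))
```

Running the script prints one expansion value per oversampling ratio. Larger values of s would be
expected to bring the expansion closer to 1, at the price of a larger sample to sort.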
The sample sort algorithm is summarized below.
SAMPLESORT (A)
Step 1 Selecting the splitters. Each of the p processors randomly selects a set of s elements from
       its ⌈n/p⌉ elements of A. These ps keys are sorted by parallel radix sort, and then the p − 1
       splitter elements are selected. Duplicates can be distinguished by tagging the elements before
       sorting.
Step 2 Distributing the keys to the buckets. Each processor determines the bucket to which each of
       its ⌈n/p⌉ elements of A belongs by performing a binary search of the array of splitters (which
       may be stored separately on each processor). The keys are then distributed to their respective
       buckets.
Step 3 Sorting the keys in each bucket. Each processor sorts the keys locally within its bucket using
       a sequential radix sort.
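The three steps can be simulated sequentially with the following Python sketch. It makes several
assumptions that are not part of the algorithm as stated: the p processors are modeled as contiguous
blocks of a list, Python's built-in sort stands in for both the parallel and the sequential radix sort,
duplicates are not tagged, and the name sample_sort and its seed parameter are hypothetical.

```python
import bisect
import random


def sample_sort(A, p, s, seed=0):
    """Sequential simulation of sample sort with p buckets and oversampling ratio s."""
    rng = random.Random(seed)
    chunk = -(-len(A) // p)                               # ceil(n / p)
    blocks = [A[i * chunk:(i + 1) * chunk] for i in range(p)]

    # Step 1: each "processor" picks s of its elements; the sample is sorted
    # and the elements with ranks s, 2s, ..., (p - 1)s become the splitters.
    sample = []
    for block in blocks:
        sample.extend(rng.sample(block, min(s, len(block))))
    sample.sort()
    splitters = sample[s - 1::s][:p - 1]

    # Step 2: binary search over the splitters decides the bucket of each key.
    buckets = [[] for _ in range(p)]
    for block in blocks:
        for x in block:
            buckets[bisect.bisect_right(splitters, x)].append(x)

    # Step 3: each bucket is sorted locally; concatenation gives the result.
    result = []
    for bucket in buckets:
        bucket.sort()
        result.extend(bucket)
    return result


if __name__ == "__main__":
    data = [random.Random(42).randrange(1000) for _ in range(50)]
    print(sample_sort(data, p=4, s=3) == sorted(data))
```

Because every key in bucket i is no larger than every key in bucket i + 1, concatenating the sorted
buckets yields a sorted sequence regardless of how balanced the buckets turn out to be.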
Assuming that the input elements are initially distributed to the processors, Step 1 uses O(s) time in
local computation to select the ps elements, T_rs(ps) time to sort them (by Equation 5), and O(p) local
computation time and communication to select and collect the splitter elements. In Step 2, the local
computation costs are O((n/p) log p) to determine the buckets for all elements and O(n/p) in
communication to place the elements in the buckets. Step 3 uses expected time T_srs(n/p) (by Equation 4)
in local computation and O(n/p) time in communication.
PARALLEL COUNTING SORT (A; i;
Step 1 In parallel, each processor counts the number of its (approximately) ⌈n/p⌉ assigned elements
       with each value v in [0, 2^r), recording the counts in its count array R_i.
Step 2 Processors cooperatively compute prefix sums of the processor-wise prefix sums.
       (a) Each processor is assigned (approximately) ⌈2^r/p⌉ values of [0, 2^r) and computes the
           prefix sums of all the count arrays for those values. For example, the processor assigned
           value v will set R_i[v] = Σ_{j=0..i} R_j[v], for each 0 ≤ i ≤ p − 1.
       (b) Each processor now adds an offset to each element in its count array. For example,
           processor i adds R_{p−1}[v] to R_i[v + 1], for each v in [0, 2^r).
Step 3 In parallel, the processors copy their assigned elements to the output array B. This is done
       just like sequential counting sort, except processor i uses its count array R_i.
Combining the costs of the three steps of sample sort, its total running time is
T_ss(n) = O(T_rs(ps) + T_srs(n/p) + (p + n/p)C + (p + (n/p) log p)L).
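The parallel counting sort steps can be simulated sequentially with the Python sketch below. It is an
illustration under assumptions, not the procedure exactly as listed: processors are modeled as
contiguous blocks of the input, keys are integers in [0, 2^r), and sub-steps (a) and (b) of Step 2 are
folded into one loop that computes, for processor i and value v, the output position of that
processor's first key with value v. The names parallel_counting_sort and start are hypothetical.

```python
def parallel_counting_sort(keys, p, r):
    """Sequential simulation of counting sort over p blocks of r-bit keys."""
    n = len(keys)
    num_values = 1 << r
    chunk = -(-n // p)                                   # ceil(n / p)
    blocks = [keys[i * chunk:(i + 1) * chunk] for i in range(p)]

    # Step 1: processor i counts its own keys in count array R[i].
    R = [[0] * num_values for _ in range(p)]
    for i, block in enumerate(blocks):
        for k in block:
            R[i][k] += 1

    # Step 2: prefix sums across values and processors. start[i][v] is the
    # number of keys placed before processor i's first key with value v.
    start = [[0] * num_values for _ in range(p)]
    offset = 0
    for v in range(num_values):
        for i in range(p):
            start[i][v] = offset
            offset += R[i][v]

    # Step 3: each processor copies its keys to the output array B.
    B = [None] * n
    for i, block in enumerate(blocks):
        pos = list(start[i])
        for k in block:
            B[pos[k]] = k
            pos[k] += 1
    return B


if __name__ == "__main__":
    print(parallel_counting_sort([3, 1, 2, 3, 0, 1, 2, 0], p=3, r=2))
```

Since the offsets are ordered first by value and then by processor index, keys with equal values keep
their input order, which is the stability property a radix sort pass relies on.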