
10 Sorting

The document discusses several parallel sorting algorithms: 1. Bitonic sort recursively merges bitonic sequences, sorting an array in O(log² n) time with O(n log² n) work. 2. Batcher's odd-even merge sort merges the even- and odd-indexed subsequences recursively and interleaves the results with compare-exchanges. 3. An optimal merge sort pipelines merges up the recursion tree using c-cover sampling to reach O(log n) time with O(n log n) work. 4. Other parallel sorting algorithms discussed include quicksort, bucket/sample sort, radix sort, and blocked radix sort, which sorts keys in blocks to improve cache locality.



CSL 860: Modern Parallel Computation
PARALLEL SORTING
Bitonic Merge and Sort
• Bitonic sequence {a0, a1, …, an-1}:
– A sequence with a monotonically increasing part
and a monotonically decreasing part
• For some i, {a0 <= … <= ai} and {ai+1 >= … >= an-1}
– Or, a cyclic shift of indices makes it bitonic
Two Bitonic Sequences
[Figure: pairwise min/max split of a bitonic sequence into Subsequence 1 and Subsequence 2]

• Say {a0 <= … <= an/2 >= an/2+1 >= … >= an-1}
– Subsequence 1: {min(a0, an/2+1), min(a1, an/2+2), …}
– Subsequence 2: {max(a0, an/2+1), max(a1, an/2+2), …}
• Recursively sort each bitonic subsequence
– Every element of Subsequence 1 <= every element of Subsequence 2
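The split above can be sketched serially as follows (`bitonic_split` is an illustrative name, not from the slides; in parallel, each of the n/2 min/max pairs is an independent compare-exchange):

```python
def bitonic_split(a):
    """One split step: pairwise min/max of a[i] and a[i + n/2].

    Both returned halves are again bitonic, and every element of
    the min-half is <= every element of the max-half.
    """
    half = len(a) // 2
    sub1 = [min(a[i], a[i + half]) for i in range(half)]  # Subsequence 1
    sub2 = [max(a[i], a[i + half]) for i in range(half)]  # Subsequence 2
    return sub1, sub2
```

On the bitonic input {1, 3, 5, 7, 8, 6, 4, 2} this yields {1, 3, 4, 2} and {8, 6, 5, 7}, both bitonic, with every element of the first at most every element of the second.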
Bitonic Sort
• Sort each pair of elements
– alternately in increasing and decreasing orders
• Every sequence of length four is now bitonic
• Sort recursively
– Again alternate increasing and decreasing orders
– Forming bitonic sequences of length eight now
– And so on ..
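A minimal serial sketch of the recursion, assuming a power-of-two input length (function names are illustrative; in parallel, every compare-exchange within a level runs concurrently):

```python
def bitonic_sort(a, ascending=True):
    """Sort a list whose length is a power of two."""
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    # Sort the halves in opposite directions: their concatenation is bitonic.
    left = bitonic_sort(a[:half], True)
    right = bitonic_sort(a[half:], False)
    return bitonic_merge(left + right, ascending)

def bitonic_merge(a, ascending):
    """Sort a bitonic sequence by repeated min/max splitting."""
    if len(a) == 1:
        return a
    half = len(a) // 2
    lo = [min(a[i], a[i + half]) for i in range(half)]
    hi = [max(a[i], a[i + half]) for i in range(half)]
    if ascending:
        return bitonic_merge(lo, True) + bitonic_merge(hi, True)
    return bitonic_merge(hi, False) + bitonic_merge(lo, False)
```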
Bitonic Network
[Figure: bitonic sorting network sorting 8 keys (7, 3, 8, 6, 4, 1, 5, 2) in three stages]

• log n stages
• O(log² n) time
• O(n log² n) work


Batcher’s Odd-Even Merge
• Merge the even-indexed elements of the two sorted lists
– Get c0, c1, c2, c3, …
• Merge the odd-indexed elements
– Get d0, d1, d2, d3, …
• Perform an odd-even interleave with compare-exchanges
– c0, min(c1, d0), max(c1, d0), min(c2, d1), max(c2, d1), …, dn-1
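A serial sketch of the merge, assuming two sorted inputs of equal power-of-two length (the function name is illustrative):

```python
def odd_even_merge(a, b):
    """Batcher merge of two sorted lists of equal power-of-two length."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Recursively merge the even-indexed and odd-indexed subsequences.
    c = odd_even_merge(a[::2], b[::2])    # c0, c1, c2, ...
    d = odd_even_merge(a[1::2], b[1::2])  # d0, d1, d2, ...
    # Interleave: c0, then compare-exchange (c_i, d_{i-1}) pairs, then d_last.
    out = [c[0]]
    for i in range(1, len(c)):
        out.append(min(c[i], d[i - 1]))
        out.append(max(c[i], d[i - 1]))
    out.append(d[-1])
    return out
```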
Example of a Fast Sort
• Rank sort
• Given array A,
– For each i, find Rank[i], the rank of A[i]
– Output[Rank[i]] = A[i]
• How fast can you find Rank(A:A)?
– If you had n² processors (Quiz)
– If you had n³ processors
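A serial sketch of rank sort (with n² processors each pairwise comparison inside the rank computation can run concurrently; the tie-break on index keeps equal keys ordered):

```python
def rank_sort(a):
    """Sort by all-pairs rank computation (illustrative serial form)."""
    n = len(a)
    # Rank[i]: keys smaller than A[i], plus equal keys appearing earlier
    # (the index tie-break makes the ranks a permutation of 0..n-1).
    rank = [sum(1 for j in range(n)
                if a[j] < a[i] or (a[j] == a[i] and j < i))
            for i in range(n)]
    out = [None] * n
    for i in range(n):
        out[rank[i]] = a[i]   # scatter: Output[Rank[i]] = A[i]
    return out
```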
Optimal Merge Sort
• Sort recursively, as in sequential merge sort
• Merge at each level of recursion using the
optimal parallel merge
– O(log log n) time
– O(n) work
• Log n merge-tree levels =>
– O(log n loglog n) time
– O(n log n) work
Sort n/p elements, then Merge
[Figure: merge tree over p processors]

• P0 … Pp-1: each processor sorts its n/p elements
• Pairs of processors (P0,P1), (P2,P3), … merge the sorted n/p-element pairs
• Larger groups merge the resulting 2n/p-element pairs
• … until all p processors cooperate in the final merge
HOW EFFICIENTLY CAN YOU MERGE?


c-Cover Merging
• Consider sorted sequences A and B
• X: a c-cover of A and B
– Two consecutive elements of X have at most c
elements of A (and at most c elements of B) between
them
• Given Rank(X:A) and Rank(X:B)
– Find Rank(A:B) and Rank(B:A)
– In O(1) time, with O(n) work
• If X is a c-cover of B, and we know Rank(A:X) and
Rank(X:B)
– Compute Rank(A:B) in O(1)
Optimal O(log n)-time Merge Sort
• Works for any proper binary tree
– Not necessarily balanced
• The sorted list L[i] at a given node i is divided
into stages
– sth stage generates list Ls[i]
• Initially:
– L0[i] = null for internal nodes
– L0[i] = value at leaf node i
Fast Optimal Merge Sort: Definitions
• Algorithm proceeds up the tree, one stage at a
time
• At stage s, node n is active if
– height(n) <= s <= 3*height(n)
– where height(n) = height(Tree) – path-length from root to n
• At each stage a node is active,
– It merges a sample of lists of its children
Algorithm
• parallel for n = active nodes:
– Ls+1[n] = Merge(Samples(left), Samples(right))
• SUBi(l1, l2, l3, …) = (li, l2i, l3i, …)
– Samples(n) =
• SUB4(Ls[n]), if s <= 3*height(n)
• SUB2(Ls[n]), if s = 3*height(n) + 1
• SUB1(Ls[n]) = Ls[n], if s >= 3*height(n) + 2
Analysis
• Number of elements at a node roughly doubles each stage
– |Ls+1[n]| <= 2|Ls[n]| + 4
• Samples(n) is a 4-cover for Samples+1(n)
– i.e., no more than 4 items of Ls+1[n] lie between two
consecutive items of Ls[n]
• For each stage s > height(n),
– Ls[n] is a 4-cover for Samples(left) and Samples(right)
• For s >= 2, merge, i.e., compute:
– Rank(Samples(left), Samples(right))
– Rank(Samples(right), Samples(left))
• Use Rank(Ls:Samples(left)) (similarly right), computed from:
– Rank(Ls:Samples-1(left)) and Rank(Samples-1(left):Samples(left))
Parallel Quick-Sort
• Group of p processors sort a sub-sequence
• Initially all processors sort the entire sequence
• The sequence is divided into n/p blocks
• Together processors choose a pivot
• Processors rearrange elements into two ‘halves’ around the pivot
– Using a prefix sum
• The group is subdivided into two, one subgroup per ‘half’
– The size of each subgroup is proportional to the size of its ‘half’
– If subgroup size = 1, stop subdividing and sort serially
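The prefix-sum rearrangement step can be sketched serially as follows (`partition_by_prefix_sum` is an illustrative name; in parallel the flag scan and the scatter are each one data-parallel step):

```python
def partition_by_prefix_sum(a, pivot):
    """Stable partition around a pivot using an exclusive prefix sum."""
    flags = [1 if x <= pivot else 0 for x in a]
    # Exclusive prefix sum of the flags gives each small element its slot.
    psum, total = [], 0
    for f in flags:
        psum.append(total)
        total += f
    out = [None] * len(a)
    for i, x in enumerate(a):
        if flags[i]:
            out[psum[i]] = x               # small elements, in order
        else:
            out[total + (i - psum[i])] = x  # large elements after all small ones
    return out, total
```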
Bucket Sort
• Decide buckets
• Parallel for element i:
– Put it in bucket b
• Keep a private (incoherent) copy of each bucket per processor
• Merge buckets
• Sort each bucket separately
• Buckets need not be of equal size
– Load imbalance
• Sample sort:
– Choose a sample of size s
– Sort the samples
– Choose B-1 evenly spaced elements from the sorted list
– These elements (splitters) demarcate B buckets
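A serial sketch of splitter selection and bucket lookup for sample sort (names are illustrative; a real implementation would sample randomly rather than by stride):

```python
def choose_splitters(a, s, B):
    """Pick B-1 splitters from a sorted sample of roughly s elements."""
    step = max(1, len(a) // s)
    sample = sorted(a[::step][:s])          # strided stand-in for random sampling
    return [sample[(i * len(sample)) // B] for i in range(1, B)]

def bucket_of(x, splitters):
    """Index of the bucket demarcated by the sorted splitters."""
    for b, sp in enumerate(splitters):
        if x < sp:
            return b
    return len(splitters)                   # last bucket
```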
Parallel Splitter Selection
• Divide n elements into B (equi-sized) blocks
• (Quick)Sort each block
• For each sorted block:
– Choose B-1 evenly spaced splitters
• Use the B*(B-1) elements as samples
• Sort the samples
– Choose B-1 Splitters
• No bucket contains more than 2*n/B elements
Radix Sort
• Bucket-sort by one digit of the key at a time, least significant first
– Each pass must be stable
• For each key, find its rank within the pass
– Count the number of keys ‘<’ it in the sequence
– Count the number of keys ‘=’ before it in the sequence
• Use a parallel prefix sum (shown for 1-bit digits):
– notbit = !bit
– psum = escan(notbit)
– nZeros = psum[n-1] + notbit[n-1]
– nBefore = idx – psum + nZeros
– Rank = bit ? nBefore : psum
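A serial sketch of the one-bit stable split primitive (the function name is illustrative; the exclusive scan runs over the zero flags, and becomes a parallel prefix sum):

```python
def split_by_bit(keys, b):
    """Stable one-bit partition: the per-bit step of LSD radix sort."""
    n = len(keys)
    bit = [(k >> b) & 1 for k in keys]
    notbit = [1 - x for x in bit]
    # psum = escan(notbit): exclusive prefix sum of the zero flags.
    psum, total = [], 0
    for x in notbit:
        psum.append(total)
        total += x
    n_zeros = total                       # = psum[n-1] + notbit[n-1]
    out = [None] * n
    for i, k in enumerate(keys):
        # Zeros keep their order among zeros; ones follow after all zeros.
        rank = (i - psum[i] + n_zeros) if bit[i] else psum[i]
        out[rank] = k
    return out
```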
Blocked Radix-sort
• Consider b bits at a time as “key”
– Prefix-sum still required per-bit?
• Satish/Harris/Garland:
– Divide sequence into blocks, Divide key into nibbles
– Load each block into shared memory
– Sort by 4 iterations of single-bit stable sorts (How?)
– For each block, write its 16-entry digit histogram and
the sorted block to global memory.
– Perform a global prefix sum over the blocks’ 16-entry histograms
– Copy elements of each block to their correct output
position
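A serial sketch of the digit-at-a-time idea with b = 4 bits per pass (a stable counting pass per digit stands in for the per-block histograms and global scan; names are illustrative):

```python
def lsd_radix_sort(keys, key_bits=16, d=4):
    """LSD radix sort: one stable bucket pass per d-bit digit, low digits first."""
    mask = (1 << d) - 1
    for shift in range(0, key_bits, d):
        buckets = [[] for _ in range(1 << d)]
        for k in keys:                        # append order keeps each pass stable
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return keys
```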
