Efficient Implementation of Sorting On Multi-Core SIMD CPU Architecture
Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, William Macy, Mostafa Hagog, Yen-Kuang Chen, Akram Baransi, Sanjeev Kumar, Pradeep Dubey
ABSTRACT
Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and
high-throughput database applications in particular. Although literature abounds with various flavors of sorting
algorithms, different architectures call for customized implementations to achieve faster sorting times.
This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures.
Our SIMD implementation with 128-bit SSE is 3.3X faster
than the scalar version. In addition, our algorithm performs
an efficient multiway merge, and is not constrained by the
memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than 0.5
seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously
published results.
Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to
certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and
core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than the current SSE width of 128 bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms the scalability of our proposed algorithm.
1. INTRODUCTION
2. RELATED WORK
Over the past few decades, a large number of sorting algorithms have been proposed [13, 15]. In this section, we focus
mainly on the algorithms that exploit the SIMD and multicore capability of modern processors, including GPUs, Cell,
and others.
Quicksort has been one of the fastest algorithms used in practice. However, an efficient implementation of it that exploits SIMD is not known. In contrast, Bitonic sort [1] is implemented using a sorting network that sets all comparisons in advance, without unpredictable branches, and permits multiple comparisons in the same cycle. These two characteristics
make it well suited for SIMD processors. Radix Sort [13] can
also be SIMDfied effectively, but its performance depends on
the support for handling simultaneous updates to a memory
location within the same SIMD register.
As far as utilizing multiple cores is concerned, there exist algorithms [16] that can scale well with an increasing number of
cores. Parikh et al. [17] propose a load-balanced scheme for
parallelizing quicksort using the hyperthreading technology
available on modern processors. An intelligent scheme for
splitting the list is proposed that works well in practice.
Nakatani et al. [16] present bitonic sort based on a k-way
decomposition. We use their algorithm for performing load
balanced merges when the number of threads is greater than
the number of arrays to be sorted.
The high compute density and bandwidth of GPUs have also been targeted to achieve faster sorting times. Several implementations of bitonic sort on GPUs have been described [8, 18]. Bitonic sort was selected for early sorting implementations since its data-parallel, predetermined pattern of comparisons mapped well to the fragment processors of GPUs. GPUTeraSort [8], which is based on bitonic sort, exploits data parallelism by representing data as 2-D arrays or textures, and hides memory latency by overlapping pointer and fragment processor memory accesses. Furthermore, GPU-ABiSort [9], which is based on adaptive bitonic sort [2], rearranges the data using bitonic trees to reduce the number of comparisons. Recently added GPU capabilities like scattered writes, flexible comparisons, and atomic operations on memory have enabled methods combining radix sort and mergesort to achieve faster performance on modern GPUs [19, 21, 22].
The Cell processor is a heterogeneous processor with 8 SIMD-only special-purpose processors (SPEs). CellSort [7] is based on a distributed bitonic merge with a bitonic sort kernel implemented with SIMD instructions. It exploits the high bandwidth for cross-SPE communication by executing as many iterations of sorting as possible while the data is in the SPEs' local stores.
3. ARCHITECTURE SPECIFICATION

3.1 ILP

3.2 DLP
SIMD, since 128-bit wide SIMD can operate on four single-precision floating-point values simultaneously.
Second, in the future we will have 256-bit wide SIMD instructions [12] and even wider [20]. Widening the SIMD width to process more elements in parallel will improve the compute density of the processor for code with a large amount of data-level parallelism. Nevertheless, for wider SIMD, shuffles will become more complicated and will probably have longer latencies. This is because the chip area needed to implement a shuffle network is proportional to the square of the number of inputs; for example, doubling the SIMD width from 4 to 8 lanes roughly quadruples the area of the shuffle crossbar. On the other hand, operations like min/max that process the elements of the vector independently will not suffer from wider SIMD and should retain the same throughput and latency.
Besides the latency and throughput issues mentioned in Section 3.1, instruction definitions can be restrictive and performance limiting in some cases. For example, in the current SSE architecture, the functionality of the two-register shuffle instruction is limited: elements from the first source can only go to the lower 64 bits of the destination, while elements from the second source (the destination operand) can only go to the upper 64 bits of the destination. As input we have two vectors:
A1 A2 A3 A4 and B1 B2 B3 B4
We want to compare A1 to A2, A3 to A4, B1 to B2, and B3 to B4. To achieve that, we must shuffle the two vectors into a form where elements from the two vectors can be compared directly. Two acceptable forms of the vectors are:
A1 B1 A3 B3
A2 B2 A4 B4
or
A1 B2 A3 B4
A2 B1 A4 B3
The current SSE shuffle that takes two sources does not allow elements from both sources to be placed in the lower 64 bits of the output. Thus, instead of two shuffle instructions, we have to use a three-instruction sequence involving the Blend instruction. The instruction sequence for shuffling the elements becomes:
C = Blend (A, B, 0xA) // gives: A1 B2 A3 B4
D = Blend (B, A, 0xA) // gives: B1 A2 B3 A4
D = Shuffle (D, D, 0xB1) // gives: A2 B1 A4 B3
This results in sub-optimal performance. As we will see in Section 4.2.1, this is in fact the pattern that a bitonic merge network uses at its lowest level. We expect future shuffle instructions to provide such capability and thus improve sort performance. For the rest of the paper, we use this three-instruction sequence for the real-machine performance, while assuming only two instructions for the analytical model.
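To make this concrete, here is a minimal intrinsics-level sketch of the sequence above, together with the min/max comparisons it enables. This is our own illustration, assuming SSE4.1 for the Blend instruction:

#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps */

/* Sketch: shuffle A = (A1 A2 A3 A4) and B = (B1 B2 B3 B4) into two
   registers whose lanes pair A1-A2, B2-B1, A3-A4, B4-B3, then compare. */
static inline void shuffle_and_compare(__m128 A, __m128 B,
                                       __m128 *lo, __m128 *hi)
{
    __m128 C = _mm_blend_ps(A, B, 0xA);  /* gives: A1 B2 A3 B4 */
    __m128 D = _mm_blend_ps(B, A, 0xA);  /* gives: B1 A2 B3 A4 */
    D = _mm_shuffle_ps(D, D, 0xB1);      /* gives: A2 B1 A4 B3 */
    *lo = _mm_min_ps(C, D);              /* lane-wise minima    */
    *hi = _mm_max_ps(C, D);              /* lane-wise maxima    */
}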
3.3 TLP

3.4 MLP
4. ALGORITHMIC DETAILS

4.1 Algorithm Overview
Phase 2. The second phase consists of (log N - log M) iterations. In each iteration, we merge pairs of lists to obtain sorted sequences of twice the length of those in the previous iteration. All P processors work simultaneously to merge the pairs of lists, using an algorithm similar to 1(b). However, for large N, the memory bandwidth may dictate the runtime and prevent the application from scaling. For such cases, we use multiway merging, where the N/M lists are merged simultaneously in a hierarchical fashion, by reading (and writing) the data only once from (and to) main memory.

4.2 Exploiting SIMD

Figure 1: Odd-Even merge network for merging sequences of length 4 elements each.

On the other hand, bitonic merge compares all the elements at every network step, and overwrites each SIMD lane. Note that the pattern of comparison is much simpler, and the resultant data movement can easily be captured using the existing register shuffle instructions. Therefore, we mapped the bitonic merging network onto the current SSE instruction set. In this subsection, we describe our SIMD algorithm in detail. We start by explaining the three important kernels, namely the bitonic merge kernel, the in-register sorting kernel, and the merging kernel. This is followed by the algorithm description and various design choices. For ease of explanation, we focus on a single thread in this subsection; this is extended to multiple threads in the next subsection.

4.2.1 Bitonic Merge Kernel

Figure 2: Bitonic merge network for merging sequences of length 4 elements each.

Figure 2 shows the pattern for merging two sequences of four elements each. Arrays A and B are held in SIMD registers. Initially, A and B are sorted in the same (ascending) order.
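As an illustration, the following is a minimal sketch of such a 4-wide bitonic merge kernel in SSE intrinsics. It is our own reconstruction, not necessarily the paper's exact instruction sequence (it uses movelh/movehl and unpack shuffles rather than the blend sequence of Section 3.2); it merges two ascending registers a and b into an ascending 8-element result:

#include <xmmintrin.h>

/* Merge two ascending 4-element registers into an ascending 8-element
   sequence (out_lo holds the 4 smallest, out_hi the 4 largest). */
static inline void bitonic_merge_4x4(__m128 a, __m128 b,
                                     __m128 *out_lo, __m128 *out_hi)
{
    /* Reverse b so that (a, b) forms a bitonic sequence. */
    b = _mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 1, 2, 3));
    /* Level 1: element i vs element i+4. */
    __m128 lo = _mm_min_ps(a, b);
    __m128 hi = _mm_max_ps(a, b);
    /* Level 2: distance-2 comparisons. */
    __m128 t0 = _mm_movelh_ps(lo, hi);   /* lo0 lo1 hi0 hi1 */
    __m128 t1 = _mm_movehl_ps(hi, lo);   /* lo2 lo3 hi2 hi3 */
    __m128 m  = _mm_min_ps(t0, t1);
    __m128 M  = _mm_max_ps(t0, t1);
    /* Level 3: distance-1 comparisons. */
    __m128 p  = _mm_unpacklo_ps(m, M);   /* m0 M0 m1 M1 */
    __m128 q  = _mm_unpackhi_ps(m, M);   /* m2 M2 m3 M3 */
    __m128 u0 = _mm_movelh_ps(p, q);     /* m0 M0 m2 M2 */
    __m128 u1 = _mm_movehl_ps(q, p);     /* m1 M1 m3 M3 */
    __m128 v  = _mm_min_ps(u0, u1);
    __m128 V  = _mm_max_ps(u0, u1);
    /* Interleave minima and maxima into the final ascending order. */
    *out_lo = _mm_unpacklo_ps(v, V);
    *out_hi = _mm_unpackhi_ps(v, V);
}

For example, merging a = (1, 3, 5, 7) and b = (2, 4, 6, 8) yields out_lo = (1, 2, 3, 4) and out_hi = (5, 6, 7, 8).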
4.2.2 In-Register Sorting
4.2.3
4.2.4
Figure 5: Bitonic merge network for merging sequences of length 16 elements each.
The main reason for the poor performance is that the min/max and the shuffle instructions have an inherent dependency that exposes the min/max latency. There are two ways to overcome this:
1. Simultaneously merging multiple lists. Since merge operations for different lists are independent of each other, we can hide the latency exposed above by interleaving the min/max and shuffle instructions for the different lists. In fact, for the current latency numbers, it is sufficient (and necessary) to execute four merge operations simultaneously (details in Section 5.2) to hide the latency. The resultant number of cycles is 32, which implies 2 cycles per element, a 3.25X speedup over the previous execution time. The encouraging news is that in the case of the log N iterations of mergesort, all except the last two iterations have at least 4 independent pairs of lists that can be merged.
2. Use a wider network. For example, consider a 16 by 16 bitonic merge network (Figure 5). The interesting observation here is that the last 4 levels (L2, L3, L4, and L5) essentially execute 2 independent copies of the 8 by 8 network. The levels L3, L4, and L5 are independent copies of the 4 by 4 network. Hence, executing two 8 by 8
4.3

4.4 Multiway Merging

Figure: Multiway merging data structure; each node maintains Head, Tail, and Count fields.
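As a sketch of how such a structure could drive hierarchical multiway merging (our own reconstruction; the struct layout and names are hypothetical, and a production version would run the SIMD bitonic merge kernel at each node rather than this scalar loop):

/* Hypothetical FIFO queue with the Head, Tail, and Count fields
   shown in the figure; each node of the merge tree owns one. */
typedef struct {
    float *buf;        /* circular buffer backing the queue */
    int    head, tail; /* read and write positions          */
    int    count;      /* number of elements in the queue   */
} Fifo;

/* One step at an internal tree node: while both children have data
   and the output queue has room, move the smaller head element up.
   All queues are assumed to share the same capacity 'cap'. */
static void node_step(Fifo *left, Fifo *right, Fifo *out, int cap)
{
    while (left->count > 0 && right->count > 0 && out->count < cap) {
        Fifo *src = (left->buf[left->head] <= right->buf[right->head])
                        ? left : right;
        out->buf[out->tail] = src->buf[src->head];
        out->tail = (out->tail + 1) % cap;
        src->head = (src->head + 1) % cap;
        src->count--;
        out->count++;
    }
}

Since only the root of the tree writes its output to main memory, the lists make a single trip through the memory hierarchy, which is the property the earlier phase-2 description relies on to avoid being bandwidth bound.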
4.5
So far, we have focused on sorting an input list of numbers (keys). However, database applications typically sort tuples of (key, data), where data represents a pointer to the structure containing the key. Our algorithm can easily be extended in one of the following ways to handle this case.
1. Treating the (key, data) tuple as a single entity (e.g., a 64-bit value). Our sorting algorithm can seamlessly handle this by comparing only the first 32 bits of the 64 bits when computing the min/max, and shuffling the data appropriately. Effectively, the SIMD width reduces by a factor of 2X (2-wide instead of 4-wide), and the performance slows down by 1.5X-2X for SSE, as compared to sorting just the keys.
2. Storing the key and the data values in different SIMD registers. The result of the min/max comparison is used to set a mask that is then used for blending the data registers (i.e., moving the data values into the same positions as their corresponding keys). The shuffle instructions permute the keys and data in the same fashion. Hence, the final result consists of the sorted keys with the data at the appropriate locations. The performance slows down by 2X; a sketch of this approach is shown below.
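As a sketch of the second option (our own illustration; the mask-driven blend assumes SSE4.1's variable blend instruction), a single min/max step on keys with the corresponding data movement might look like:

#include <smmintrin.h>  /* SSE4.1: _mm_blendv_ps */

/* Keys in __m128 registers, data (e.g., 32-bit pointers or row ids)
   in parallel __m128i registers. The key comparison mask moves each
   data value alongside its key. */
static inline void minmax_with_data(__m128 ka, __m128 kb,
                                    __m128i da, __m128i db,
                                    __m128 *kmin, __m128 *kmax,
                                    __m128i *dmin, __m128i *dmax)
{
    *kmin = _mm_min_ps(ka, kb);
    *kmax = _mm_max_ps(ka, kb);
    /* Lane is all-ones where ka holds the smaller key. */
    __m128 mask = _mm_cmple_ps(ka, kb);
    *dmin = _mm_castps_si128(_mm_blendv_ps(
                _mm_castsi128_ps(db), _mm_castsi128_ps(da), mask));
    *dmax = _mm_castps_si128(_mm_blendv_ps(
                _mm_castsi128_ps(da), _mm_castsi128_ps(db), mask));
}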
Henceforth, we report our analysis and results for sorting
an input list of key values.
5. ANALYSIS

This section describes a simple yet representative analytical model that characterizes the performance of our algorithm. We first analyze the single-threaded scalar implementation, then enhance it with the SIMD implementation, followed by the parallel SIMD implementation.

5.1

5.2
5.3 Multi-threaded Implementation

Given the execution time (in cycles) per element for each iteration of the single-threaded SIMDfied mergesort, and other system parameters, this model aims at computing the total time taken by a multi-threaded implementation. As explained in Section 4.3, the algorithm has two phases. In the first phase, we divide the input list into N/M blocks. Let us now consider the computation for one block. To exploit the P threads, the block is further divided into P lists, wherein each thread sorts its M/P elements. In this part, there is no communication between the threads, but there is a barrier at the end. However, since the work division is constant and the entire application runs in a lock-step fashion, we do not expect much load imbalance, and hence the barrier cost is almost negligible. The execution time for this phase can be modeled as:
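As an illustrative sketch in our own notation (with c denoting the measured cycles per element per iteration; the paper's exact expression may differ), the independent-sort part of the first phase costs roughly

T_1 \approx \frac{N}{M} \cdot \left( c \cdot \frac{M}{P} \log_2 \frac{M}{P} \right) = c \, \frac{N}{P} \log_2 \frac{M}{P}

since each of the N/M blocks is sorted by P threads, each performing log2(M/P) mergesort iterations over its M/P elements.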
6. RESULTS

In this section, we present the results of our MergeSort algorithm. The input dataset was a random distribution of single-precision floating-point numbers (32 bits each). The runtime is measured on a system with a single Intel Q9550 quad-core processor, as described in Table 1.

Table 1: System parameters.

Core clock speed            3.22 GHz
Number of cores             4
L1 Cache                    32 KB/core
L2 Cache                    12 MB
Front-side Bus Speed        1333 MHz
Front-side Bus Bandwidth    10.6 GB/s
Memory Size                 4 GB
6.1 Single-Thread Performance
Number of     Cell    8600     8800     Quadro    PowerPC[11]      Intel     Our Implementation
Elements      [7]     GTS[22]  GTX[19]  5600[19]  1C      4C       2C[7]     1CScalar  1CSSE    2C       4C
BW (GB/sec)   25.6    32.0     86.4     76.8      16.8    16.8     10.6      10.6      10.6     10.6     10.6
Peak GFLOPS   204.8   86.4     345.6    345.6     20.0    80.0     51.2      25.8      25.8     51.6     103.0
512K          -       0.05     0.0067   0.0067    -       -        -         0.0314    0.0088   0.0046   0.0027
1M            0.009   0.07     0.0130   0.0130    0.031   0.014    0.098     0.0660    0.0195   0.0105   0.0060
2M            0.023   0.13     0.0257   0.0258    0.070   0.028    0.205     0.1385    0.0393   0.0218   0.0127
4M            0.056   0.25     0.0514   0.0513    0.147   0.056    0.429     0.2902    0.0831   0.0459   0.0269
8M            0.137   0.50     0.1030   0.1027    0.311   0.126    0.895     0.6061    0.1763   0.0970   0.0560
16M           0.317   -        0.2056   0.2066    0.653   0.251    1.863     1.2636    0.3739   0.2042   0.1170
32M           0.746   -        0.4108   0.4131    1.363   0.505    3.863     2.6324    0.7894   0.4292   0.2429
64M           1.770   -        0.8476   0.8600    2.872   1.118    7.946     5.4702    1.6738   0.9091   0.4989
128M          4.099   -        -        1.8311    6.316   2.280    16.165    11.3532   3.6151   1.9725   1.0742
256M          -       -        -        -         -       -        -         23.5382   7.7739   4.3703   2.4521

Table 2: Performance comparison across various platforms. Running times are in seconds. (1C represents 1-core, 2C represents 2-cores, and 4C represents 4-cores.)
per element per iteration. We observed two things. First, the performance of the scalar version is constant across all dataset sizes, at 10.1 cycles per element for each iteration. This means the scalar MergeSort is essentially compute bound (for the range of problem sizes we considered). In addition, this implies that our implementation is able to hide the memory latency quite well.
Cycles per element per iteration:

No. of Elements    512K    1M      4M      16M     64M     256M
Scalar             10.1    10.1    10.1    10.1    10.1    10.1
SIMD               2.9     2.9     2.9     3.0     3.1     3.3

Figure: Parallel speedup of the scalar and SIMD implementations for 1M and 256M elements as the number of cores increases.
Second, the (4-wide) SIMD implementation is 3.0X-3.6X faster than the scalar version. The cycles per element (per iteration) grow from 2.9 to 3.3 (from 512K to 256M elements). Without our multiway merge implementation, the number varied from 2.9 cycles (for 512K elements) to 3.7 cycles (for 256M elements, with the last 8 iterations being bandwidth bound). Therefore, our efficient multiway merge implementation ensures that the performance for sizes as large as 256M elements is not bandwidth bound, and the runtime increases by only 5-10% as compared to sizes that fit in the L2 cache. The increase in runtime with increasing data size is due to the fact that our multiway merge operation uses a 16 by 16 network, which consumes 30% more cycles than the optimal combination of merging networks (Section 5.2). For an input size of 256M elements, the last 8 iterations are performed in the multiway merge, which leads to the reported increase in runtime.
6.2 Multi-Thread Performance

We observe several things from the figure. First, our implementation scales well with the number of cores: around 3.5X on 4 cores for scalar code, and 3.2X for SSE code. In particular, our multiway merge implementation is very effective. Without it, the SSE version for 256M elements scales only 1.75X because it is limited by the memory bandwidth. Second, our implementation is able to efficiently utilize the computational resources. The scaling of the larger size (256M), which does not fit in the cache, is almost as good as the scaling of the smaller size (1M), which fits in the cache. The scaling for the SSE code is slightly lower than for the scalar code due to the larger synchronization overhead.
6.3
reduce the resultant running time. For different sizes, our running times are around 30% larger than the predicted values. This can be attributed to two reasons. First, an extra shuffle instruction after the second level of comparisons (for a 4 by 4 network) accounts for around 10% of the extra cycles. Second, extra move instructions are generated, which may execute on either the min/max or the shuffle functional units, accounting for around 15% overhead on average. Accounting for these two factors, our analytical model reasonably predicts the actual performance.
Both of these experiments confirm the soundness of our
analysis for the sorting algorithm, and provide confidence
that the analytical model developed can be used for predicting performance on various architectural configurations.
6.4
Table 2 lists the best reported execution times (in seconds) for varying dataset sizes on various architectures (IBM Cell [7], Nvidia 8600 GTS [22], Nvidia 8800 GTX and Quadro FX 5600 [19], Intel 2-core Xeon with Quicksort [7], and IBM PowerPC 970MP [11]). Our performance numbers are faster than those reported on the other architectures. Since sorting performance depends critically on both the compute power and the bandwidth available, we also present the peak GFLOPS (billions of floating-point operations per second) and the peak bandwidth (GB/sec) for a fair comparison. The columns at the end show the running times of our implementation on 1, 2, and 4 cores. There are two key observations:

1. Our single-threaded implementation is 7X-10X faster than previously reported numbers for 1-core IA-32 platforms [7]. This can be attributed to our efficient implementation. In particular, our cache-friendly implementation and multiway merging aim at reducing the bandwidth-bound stages in the sorting algorithm, thereby achieving reasonable scaling even for large dataset sizes.

2. Our 4-core implementation is competitive with the performance on any of the other modern architectures, even though Cell/GPU architectures have at least 2X more compute power and bandwidth. Our performance is 1.6X-4X faster than the Cell architecture and 1.7X-2X faster than the latest Nvidia GPUs (8800 GTX and Quadro FX 5600).
6.5

Figure: Performance scaling with SIMD width (16, 32, 64) for three architectural configurations, labeled (3,1,1), (1,1,1), and (1,1,0).
              8C      16C     32C
              0.636   0.324   0.174
              2.742   1.399   0.751

7. CONCLUSIONS
We have presented an efficient implementation and detailed analysis of MergeSort on current CPU architectures. This implementation exploits salient architectural features of modern processors to deliver significant performance benefits. These features include cache blocking to minimize access latency, vectorizing for SIMD to increase compute density, partitioning work and load balancing among multiple cores, and multiway merging to eliminate the bandwidth-bound stages for large input sizes. In addition, our implementation uses either a wider sorting network or multiple independent networks to increase parallelism and hide back-to-back instruction dependences. These optimizations enable us to sort 256M numbers in less than 2.5 seconds on a 4-core processor. We project near-linear scalability of our implementation with up to 64-wide SIMD, and good scaling well beyond 32 cores.
8. ACKNOWLEDGEMENTS

9. REFERENCES