Influence of Branch Mispredictions On Sorting Algorithms
Bachelor Thesis
Supervised by
Univ.-Prof. Dr. Martin Dietzfelbinger
Acknowledgement
List of Figures
1.1 The relation between execution time, number of branch misses and
number of executed instructions [KS06]. . . . . . . . . . . . . . . . . 3
4.1 First step of lean symmetric dual pivot partitioning (Listing 4.7) . . 23
4.2 Pipelining loop body of naive-MinMax assuming no misprediction
happens. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Pipelining loop body of lean naive-MinMax. . . . . . . . . . . . . . . 26
4.4 Unrollsix partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 Unrollsix partitioning - State 7 = [000 111]2 . . . . . . . . . . . . . 30
4.6 Unrollsix partitioning - State 11 = [001 011]2 . . . . . . . . . . . . 31
4.7 Unrollsix partitioning - State 47 = [101 111]2 . . . . . . . . . . . . 31
4.8 Lookup table guided power (introduced by Dietzfelbinger, Listing 4.17) 32
4.9 Lookup table MinMax (Listing 4.18). . . . . . . . . . . . . . . . . . 33
4.10 Variants of selection (Left: conditional select, Right: pseudo condi-
tional select). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.11 Variants of conditional swap (Left: conditional swap, Right: pseudo
conditional swap). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Chapter 1
Introduction
In the real world, a very wide range of applications handles data: databases, web servers, schedulers, networking, and many engineering systems.
Sorting makes many of these tasks easier, for example search (binary search in O(log₂ n) time) and selection/order statistics (in a sorted array, the k-th smallest element can be found in constant time by simply looking at the k-th position).
• Many engineering issues come to the fore when implementing sorting algo-
rithms, which depend on many factors such as prior knowledge about the
keys and satellite data, the memory hierarchy (caches and virtual memory)
of the host computer, and the software environment [CLRS09].
A sorting algorithm does not permute only the keys but the satellite data as well. If a record includes a large amount of satellite data, these movements (by permutation) are costly; in order to minimize data movement, we let the permutation act on pointers to the records rather than on the records themselves [CLRS09].
Sorting algorithms are classified according to many aspects, such as comparison (comparison sort versus counting sort), memory usage (in-place sort versus not in-place sort), and so on [CLRS09].
Comparison is one of the most important aspects: it classifies sorting algorithms into comparison-based (Quicksort, Insertionsort, etc.) and counting-based (Radixsort, Bucketsort, etc.). Comparison-based algorithms determine the sorted order of an input array by comparing elements, while counting-based algorithms sort n numbers using array indexing as a tool for determining relative order [CLRS09].
1.1 History
The Quicksort algorithm was invented by Tony Hoare in 1962 and is considered to be one of the most efficient sorting algorithms. It gained widespread adoption, appearing, for example, in the C standard library and in the Java runtime library. Quicksort follows a divide-and-conquer strategy: it chooses one pivot element from the input, puts it in its correct position in the array, and classifies the remaining elements around it [EW16].
1.2 Motivation
In 2006, Kaligosi and Sanders [KS06] observed that in comparison-based sorting algorithms like Quicksort or Mergesort, neither the executed instructions nor the cache faults dominate execution time. Comparisons are much more important, but only indirectly, since they cause the branch instructions that depend on them to be mispredicted. Hence improvements of Quicksort such as median-of-three pivot selection bring no significant benefit in practice (at least for sorting small objects), because they increase the number of branch mispredictions. Therefore, it is not only the number of executed instructions that plays a major role in the running time (Figure 1.1).
The findings of Kaligosi and Sanders [KS06] showed that branch mispredictions may have a significant effect on the speed of programs: in Quicksort, a skewed pivot led to better branch prediction and, possibly, a decrease in computation time.
Figure 1.1: The relation between execution time, number of branch misses and
number of executed instructions [KS06].
These results inspired Elmasry and Katajainen. They took the idea of decoupling element comparisons from branches from Mortensen [Mor01], which was also used by Sanders and Winkel in their Samplesort [SW04], and tried to avoid unpredictable branches altogether. (After decoupling element comparisons from branches, the lower bound on the number of branch mispredictions proved by Brodal and Moruz in [BM05] is no longer valid.)
In 2012, Katajainen showed in his talk (updated in 2014) [Kat12] that Quicksort really is faster than Mergesort, and that the secret behind this result is avoiding unpredictable branches as far as possible.
More recently, in 2017, Seidel² shared with us, in private communication, his new idea for avoiding unpredictable branches in single-pivot partitioning algorithms. The idea relies strongly on blind swaps (Section 4.5), uses no extra space, and needs only a small number of assembly instructions.
¹ Many optimizations exist, for example for memory (an explicit stack instead of recursion, and sorting the smaller side first in order to reduce the worst-case stack space from O(n) to O(log n)), and median-of-3 pivoting, which reduces both the chance of very unbalanced partitioning and the instruction count.
² Prof. Dr. Raimund Seidel, Head of Theoretical Computer Science at Saarland University.
Chapter 2
Background
2.1 Pipelining
Pipelining is an implementation technique in which multiple instructions are
overlapped in execution, which exploits parallelism among the instructions in a
sequential instruction stream [PH13].
In modern processors, in order to speed up execution and reduce run time, instructions are executed in a pipelined fashion. Each instruction is split into several stages/phases (the number of stages depends significantly on the architecture). In the best case, pipelining speeds up execution by a factor close to the number of stages (classic MIPS instructions, for instance, take five steps). To keep things simple, we assume five stages: F (instruction fetch), D (decode), O (operand fetch), X (execute) and W (write-back).
Each stage takes one CPU cycle, so each instruction takes as many cycles as it has stages. The speedup comes from executing different phases of different instructions in parallel, so that no instruction unit is idle. The processor can then complete at most one entire instruction per CPU cycle (Figure 2.1 at tick t4).
However, instruction pipelining has to overcome some difficulties. In various situations the next instruction cannot be executed in its clock slot. There are three different types of such problems, called hazards: structural, data, and control/branch hazards [PH13].
Data Hazard. Data that is needed to execute the instruction is not yet available.
Control/Branch Hazards. The instruction that was fetched is not the one that is needed, because of a branch misprediction (this topic is explained in depth in Section 2.3).
Figure 2.1: Pipelined execution (stages F, D, O, X, W): once the pipeline is filled, one instruction completes at every clock tick. [Diagram omitted.]
Branching. Branch instructions cost 1-2 CPU cycles if the branch has been predicted correctly. However, in some cases the processor mispredicts the outcome of the branch, which leads to a very costly penalty (10-20 CPU cycles) [Fog17a].
Memory access. Loading data from RAM is even more expensive than loading from cache (hundreds of CPU cycles). A memory store is performed in about 1 CPU cycle [Fog17a].
Listing 2.1: A simple conditional branch written in C++.

    int main(int argc, char *argv[]) {
        int a, b, c;

        // func(&a, &b)

        if (a < b) {
            c = a;
        } else {
            c = b;
        }

        return 0;
    }

Listing 2.2: Listing 2.1 transformed to assembly code using the g++ compiler.

     1   ; %rbp: base pointer
     2   ; %rsp: stack pointer
     3
     4   pushq %rbp
     5   movq %rsp, %rbp
     6   movl -4(%rbp), %eax
     7   cmpl -8(%rbp), %eax
     8   ; go to L2, if a >= b
     9   jge L2
    10   movl -4(%rbp), %eax
    11   movl %eax, -12(%rbp)
    12   jmp L3
    13   L2:
    14   movl -8(%rbp), %eax
    15   movl %eax, -12(%rbp)
As we see in the assembly code (Listing 2.2), at line 9 the processor has to guess whether the next instruction to fetch is the one at line 10 or the one at line 13.
Figure 2.3: Branch misprediction (Listing 2.2), assuming an always-taken static predictor (figure in the style of [Zar96]). [Pipeline diagram omitted: after jge L2, the processor speculatively fetches and executes the instructions at the branch target; when the misprediction is detected, those instructions are flushed and the fall-through instructions are fetched instead.]
Static prediction schemes take a fixed decision that does not depend on the history of the code execution. There are also adaptive local prediction schemes, called dynamic, that take the decision about a particular branch according to its previous outcome(s) [ANP16].
The main difference between the 1-bit (Figure 2.5) and 2-bit (Figures 2.6 and 2.7) prediction schemes is the depth of the behaviour history of the branch. The first has only one bit, which can represent only two decision states (T/taken, N/not-taken), and so has information only about the previous outcome. In contrast, 2-bit predictors have four decision states (ST/strongly taken, WT/weakly taken, SN/strongly not-taken, WN/weakly not-taken) and rely on more information to take the decision, which makes the prediction more adaptive [ANP16].
Table 2.2 shows details about five prediction schemes (two of them static), assuming p is the probability that the branch is actually taken (as opposed to predicted taken).
Figure 2.4: Always-taken static predictor (left) and always not-taken static predictor (right). [State diagrams omitted: single states T and N, respectively.]
[State diagram omitted: 2-bit saturating-counter predictor with states ST, WT, WN, SN.]
Table 2.2 lists the prediction schemes considered here, with expected number
of branch misses. Proofs of formulas can be found in appendix 7.1.
[State diagram omitted: 2-bit flip-on-consecutive predictor with states ST, WT, WN, SN.]
[Plot omitted: Pr(branch miss) as a function of p in [0, 1] for the five schemes: always taken, always not-taken, 1-bit, 2-bit-saturate-counter, and 2-bit-flip-on-consecutive.]
Auger, Nicaud and Pivoteau [ANP16] implemented a new Guided approach (Listing 7.1) to reduce the harmful misprediction effect of the Classical and Unrolled approaches. The result was very good after they replaced the branch causing the sizable number of branch misses with a less harmful one.
The problem is discussed in depth in Sections 4.3 and 4.4.
1. Lean: O(1) branch mispredictions incurred when the branch predictor used
by the underlying hardware is static.
⁴ A pure-C program is a sequence of possibly labelled statements that are executed sequentially unless the order is altered by a branch statement [EK12]. See also the pure-C cost model [Mor01] and Chapter 3.
2.5 Quicksort
Quicksort is one of the most widely used sorting algorithms. It was introduced by Hoare in 1962 [ADK16] and is considered to be one of the most efficient sorting algorithms [EW16].
Quicksort is a comparison-based sorting algorithm which follows the divide-and-conquer paradigm (as Mergesort does). It is the most frequently used sorting algorithm, since it is very fast in practice and needs almost no additional memory (except for the recursion stack), without any assumptions on the distribution of the input [KS06].
The main part of Quicksort is the partitioning procedure, which is explained in the following Section 2.5.1. This is why we consider only partitioning algorithms when we try to improve the runtime of Quicksort in this work.
• Recursive step: Apply Quicksort to the left and right sides recursively.
[Diagram omitted: single-pivot partitioning invariant on A[l..r]: after partitioning, the pivot v is at position p; all elements left of p are < v and all elements right of p are ≥ v.]
Hoare partitioning uses two pointers. One of them scans the input from the left side and stops when it finds an element larger than the pivot. The second one scans the input from the right side and stops when a smaller element has been found. After this scanning phase, if the pointers have not crossed yet, the algorithm exchanges the elements at which the pointers stopped and returns to the scanning phase; otherwise, the algorithm is done.
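For concreteness, a minimal C++ sketch of this scheme (our hedged reconstruction for illustration only, not the thesis's Listing 7.4; SWAP is the procedure of Listing 4.1):

    // Hoare-style partitioning: the pivot is the first element; returns the
    // crossing point j, so that A[left..j] and A[j+1..right] are the two sides.
    int *hoare_partition_sketch(int *left, int *right) {
        int pivot = *left;
        int *i = left - 1;
        int *j = right + 1;
        while (true) {
            do { ++i; } while (*i < pivot);  // scan from the left for an element >= pivot
            do { --j; } while (*j > pivot);  // scan from the right for an element <= pivot
            if (i >= j) return j;            // pointers crossed: partitioning is done
            SWAP(*i, *j);                    // exchange the two misplaced elements
        }
    }

The two data-dependent comparisons *i < pivot and *j > pivot are exactly the unpredictable branches that this work tries to tame.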
YBB Quicksort became the new standard Quicksort algorithm in Oracle's Java 7 runtime library.
Figure 2.11: Quicksort using the Hoare partitioning method and median-of-three pivot selection, illustrated on an example. [Nine snapshot plots of A[k] versus k omitted: before Quicksort, seven partitioning snapshots, and after Quicksort.]
[Diagram omitted: dual-pivot partitioning invariant on A[l..r]: after partitioning, pivot v1 is at position p1 and pivot v2 at position p2; elements left of p1 are < v1, elements between p1 and p2 lie between v1 and v2, and elements right of p2 are > v2.]
YBB Quicksort makes about 1.9 n ln n + O(n) comparisons on average, in contrast to the 2 n ln n + O(n) of standard Quicksort and the (32/15) n ln n + O(n) of Sedgewick's dual-pivot algorithm. Although its number of swaps, about 0.6 n ln n + O(n), is much larger than the 0.33 n ln n + O(n) swap operations of classical Quicksort, it is still faster [Wil16].
Chapter 3
Branch categorization
Elmasry and Katajainen [EK12] have pointed out that some branches, called easy-to-predict branches, are friendly and need not be eliminated, while others, called unpredictable branches, should be eliminated as long as no extra complications are introduced.
They defined two classes of conditional branches. Easy-to-predict branches incur O(1) mispredictions; such branches are usually seen in loops, which mispredict only in the last iteration, and they do not need special attention. The second class is unpredictable branches (also called hard-to-predict in [Mor01]), which incur a variable number of mispredictions and lead to worse behaviour.
“Hard-to-predict are the primary source for branch misses. Branch misses are
very expensive, costing up to 15 clock cycles on Pentium Pro architecture.” [Mor01]
The key point of Elmasry and Katajainen's observation is that unpredictable branches should be eliminated as long as no extra complications are introduced.
Predictor                   E      M
always taken                n/2    n − Hn
always not-taken            n/2    Hn
1-bit                       n/2    2(Hn − ζ(2))
2-bit-saturate-counter      n/2    Θ(Hn)
2-bit-flip-on-consecutive   n/2    Θ(Hn)

Table 2.2: Expected number of branch mispredictions for the five prediction schemes.
Summary. Some types of branches are harmful and induce a large number of mispredictions because they are hard to predict. As Elmasry and Katajainen [EK12] mentioned, such unpredictable branches should be eliminated as long as no extra complications are introduced.
We made the same observation when we tried to make the naive-Maximum algorithm (Listing 4.9) lean, which did not lead to better behaviour. One explanation is that the benefit of avoiding branch mispredictions could not cover the loss in other performance aspects (e.g. memory I/O).
[Classification diagram: a branch is either predictable (O(1) misses, negligible) or unpredictable (critical).]
Chapter 4
Eliminating branch misses
In the course of this work, various strategies were applied in order to eliminate branch misses in Quicksort algorithms. Moreover, we show how branch mispredictions affect other algorithms, like simultaneous maximum and minimum and exponentiation by squaring.
Before presenting the techniques for eliminating branch misses, we would like to introduce the simple swap and rotation procedures which are used by the partitioning algorithms in this work (Listings 4.1, 4.2 and 4.3).
Listing 4.1: Swap two elements.

    void SWAP(int &a, int &b) {
        int t = a;
        a = b;
        b = t;
    }

Listing 4.2: Three elements rotation.

    void ROTATE(int &a, int &b, int &c) {
        int t = a;
        a = b;
        b = c;
        c = t;
    }
4.1 Conditional instruction

The CMOVcc⁶ instruction was introduced in the P6 processor family (Intel Pentium II).
The same methods were suggested in [Cor16] in order to optimize the pre-
diction of a conditional branch: Arrange the code to make basic blocks con-
tiguous, unroll loops and use CMOVcc and SETcc instructions. Consider the
following line of C that has a condition dependent upon one of the constants:
x = (a < b) ? const1 : const2;
We consider three possible ways to compile the previous line of C. Listing 4.4 comes from [Cor16] and shows bad assembly code, Listing 4.5 uses a SETcc instruction to avoid the branch, and Listing 4.6 uses a CMOVcc instruction.
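For illustration, here are hedged C++-level counterparts of the three variants (our sketch, not the thesis's Listings 4.4, 4.5 and 4.6); with optimization, g++ typically maps the second to a SETcc-based sequence and the third to a CMOVcc instruction:

    int select_branchy(int a, int b, int c1, int c2) {
        if (a < b) return c1;        // compiled as cmp + conditional jump
        return c2;
    }

    int select_setcc(int a, int b, int c1, int c2) {
        int f = (a < b);             // cmp + SETcc: f is 0 or 1, no branch
        return c2 + f * (c1 - c2);   // arithmetic selection
    }

    int select_cmov(int a, int b, int c1, int c2) {
        return (a < b) ? c1 : c2;    // typically compiled as cmp + CMOVcc
    }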
⁶ See [ALSU06], page 718.
Figure 4.1: First step of lean symmetric dual pivot partitioning (Listing 4.7). [Diagram omitted: the current element x at position k is swapped with the element at position g if v2 < x, and with the element at position l otherwise.]
Listing 4.8: The for-loop of listing 4.7 transformed to assembly code using g++.
1 L5: ; Step 1
2 movl (%edx), %eax ; x = *k
3 movl %esi, %ecx ; alpha = g
4 cmpl %eax, %edi ; x <= p2
5 cmovge %ebx, %ecx ; alpha = l, if x <= p2
6
7 ; SWAP(*k, *alpha)
8 movl (%ecx), %ebp
9 movl %ebp, (%edx)
10 movl %eax, (%ecx)
11
12 ; Step 2
13 ; [*] integer pointer += 1 ~ +4 Bytes
14 setl %cl
15 movzbl %cl, %ecx
16 sall $2, %ecx ; [*] ecx = 4*(p2 < x)
17 subl %ecx, %esi ; g -= ecx
18
19 xorl %ecx, %ecx
20 cmpl %eax, 4(%esp)
21 setg %cl ; ecx = (x < p1)
22 cmpl %eax, %edi
23 setge %al ; eax = (x <= p2)
24 addl $1, (%esp)
25 leal (%ebx,%ecx,4), %ebx ; [*] l += 4*ecx
26 movzbl %al, %eax
27 movl 8(%esp), %ecx
28 leal (%edx,%eax,4), %edx ; [*] k += 4*eax
29 movl (%esp), %eax
30 cmpl %ecx, %eax
31 jne L5
Listing 4.10: The main loop of listing 4.9 transformed to assembly code using g++
compiler.
1 L7:
2 movl (%eax), %edx
3 cmpl (%ebx), %edx
4 jge L3
5 ; min = A[i]
6 movl %edx, (%ebx)
7 movl (%eax), %edx
8 L3:
9 cmpl %edx, (%ecx)
10 jge L4
11 ; max = A[i]
12 movl %edx, (%ecx)
13 L4:
14 addl $4, %eax
15 cmpl %eax, %esi
16 jne L7
Listing 4.12: The main loop of listing 4.11 transformed to assembly code using g++
compiler.
1 L14:
2 movl (%edx), %eax
3 cmpl %eax, (%ebx)
4 movl %eax, %esi
5 cmovle (%ebx), %esi
6 movl %esi, (%ebx)
7 cmpl %eax, (%ecx)
8 cmovge (%ecx), %eax
9 addl $4, %edx
10 cmpl %edx, %edi
11 movl %eax, (%ecx)
12 jne L14
If we assume a simple pipelined processor with 5 stages, the naive approach incurs almost no mispredictions, since its branches are negligible. The lean approach spends 3 ticks (Figure 4.3) more than the naive one (Figure 4.2) in each iteration.
Figure 4.2: Pipelining loop body of naive-MinMax assuming no misprediction happens. [Diagram omitted.]
Figure 4.3: Pipelining loop body of lean naive-MinMax. [Diagram omitted.]
    t0              t1     t2       t1 + t0   t2 − t0
    0               A[i]   A[i+1]   A[i]      A[i+1]
    A[i+1] − A[i]   A[i]   A[i+1]   A[i+1]    A[i]
Listing 4.14: The main loop in listing 4.13 transformed to assembly code using
g++ compiler.
1 L43:
2 movl 4(%esi,%ecx,4), %eax ; t2 = A[i + 1]
3 movl (%esi,%ecx,4), %edx ; t1 = A[i]
4
5 ; Step 1
6 movl $0, %edi ; zero constant for the conditional move
7 movl %eax, %ebx
8 subl %edx, %ebx ; t0 = (t2 - t1)
9 cmpl %eax, %edx ; compare t1 with t2
10 cmovle %edi, %ebx ; if (t1 <= t2) t0 = 0
11 addl %ebx, %edx ; t1 += t0
12 subl %ebx, %eax ; t2 -= t0
13
14
15 ; Step 2
16 cmpl 0(%ebp), %edx
17 jge L38 ; if (min <= t1) goto L38
18 movl %edx, 0(%ebp) ; min = t1
19 L38:
20 movl 32(%esp), %edi
21 cmpl (%edi), %eax
22 jle L39 ; if (t2 <= max) goto L39
23 movl %eax, (%edi) ; max = t2
24 L39:
25 addl $2, %ecx
26 cmpl %ecx, 24(%esp)
27 ja L43
4.2 Loop Unrolling
Listing 4.15: Loop before unrolling.

    for (int i = 0; i < N; i++) {
        if (i % 2) {
            A[i] = X;
        } else {
            A[i] = Y;
        }
    }

Listing 4.16: Loop after unrolling.

    for (int i = 0; i < N - 1; i += 2) {
        A[i] = Y;      // even index
        A[i + 1] = X;  // odd index
    }
Figure 4.4: Unrollsix partitioning. [Diagram omitted: a state machine consumes six elements per round, three at pointer k and three at pointer g, and restores the partitioning invariant.]
Naturally, the code size increases with the unrolling level: in our approach, unrolling six elements means 2⁶ = 64 states must be handled (in general, unrolling m elements means 2^m states).
The state machine (Figure 4.4), which decides what to do with the current six elements, is implemented using switch-case control flow. Since this is a form of multiway branching, it can take advantage of a branch table/jump table instead of a tree of conditional branches (in other words, we avoid unpredictable branches)⁸.
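To illustrate, the 6-bit state for one round might be computed as follows (a hedged sketch: the comparison directions and bit order are our assumptions, not necessarily those of Listing 7.8):

    int state = 0;
    for (int j = 0; j < 3; ++j)              // three elements at the left pointer k
        state = (state << 1) | (k[j] < p);   // bit = 1: element already on the small side
    for (int j = 0; j < 3; ++j)              // three elements at the right pointer g
        state = (state << 1) | (p < g[-j]);  // bit = 1: element already on the large side

    switch (state) {
        // 64 cases, compiled to a jump table
    }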
Figure 4.5 represents state 7 (flag pattern [000 111]₂). The rearrangement is done by three swaps: (A[k] ↔ A[g − 5]), (A[k + 1] ↔ A[g − 4]) and (A[k + 2] ↔ A[g − 3]). Then the pointers are modified as follows: (k, g) ← (k, g − 6). Figures 4.6 and 4.7 explain states 11 and 47.
Figure 4.5: Unrollsix partitioning - State 7 = [000 111]₂. [Example diagram omitted: the three misplaced elements at k, k + 1, k + 2 are swapped with the elements at g − 5, g − 4, g − 3.]
⁸ Use the flag --param case-values-threshold=8 (or any number < 64) to make sure that the switch-case is transformed into a branch table/jump table and not a simple tree of conditional branches, which would lead to branch mispredictions.
Figure 4.6: Unrollsix partitioning - State 11 = [001 011]₂. [Example diagram omitted.]
Figure 4.7: Unrollsix partitioning - State 47 = [101 111]₂. [Example diagram omitted.]
4.3 Lookup table

With a lookup table a = (1, x, t, xt) indexed by the two exponent bits n1 n0, the update of r is

    r ← r · a[n1 n0] =   r,    if n1 n0 = 00
                         rx,   if n1 n0 = 01
                         rt,   if n1 n0 = 10
                         rxt,  if n1 n0 = 11

Figure 4.8: Lookup table guided power (introduced by Dietzfelbinger, Listing 4.17).
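A minimal C++ sketch of this lookup idea (our assumption of how Listing 4.17 might look; for simplicity we use plain multiplication instead of the counted mult of Listing 7.1):

    double guided_power_lookup(double x, int n) {
        double r = 1;
        while (n > 0) {
            double t = x * x;
            double a[4] = {1, x, t, x * t};  // a[n1 n0] as in Figure 4.8
            r *= a[n & 3];                   // one unconditional lookup, no branch
            x = t * t;
            n >>= 2;                         // consume two bits of the exponent
        }
        return r;
    }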
Figure 4.9: Lookup table MinMax (Listing 4.18). The table T has three cells: T[0] holds the current minimum, T[2] the current maximum, and T[1] is a waste slot. For each element ak:

    b0 ← (ak < T[0]);  T[1 − b0] ← ak
    b1 ← (T[2] < ak);  T[1 + b1] ← ak
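In C++, the whole scan of Figure 4.9 might read as follows (a hedged sketch, not the thesis's Listing 4.18; requires <climits>):

    void lookup_minmax(const int *A, long N, int &min, int &max) {
        int T[3] = {INT_MAX, 0, INT_MIN};  // T[0] = min, T[2] = max, T[1] = waste slot
        for (long k = 0; k < N; ++k) {
            int b0 = (A[k] < T[0]);
            T[1 - b0] = A[k];              // new minimum goes to T[0], otherwise to T[1]
            int b1 = (T[2] < A[k]);
            T[1 + b1] = A[k];              // new maximum goes to T[2], otherwise to T[1]
        }
        min = T[0];
        max = T[2];
    }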
4.4 Pseudo conditional branch

The following two examples show how to apply this technique in Quicksort and Guided Power.
    Conditional select:           if (α) x ← s1 else x ← s2
    Pseudo conditional select:    x ← s2 + α(s1 − s2)

Figure 4.10: Variants of selection (Left: conditional select, Right: pseudo conditional select).
    Conditional swap:             if (c) { t ← a; a ← b; b ← t }
    Pseudo conditional swap:      t ← a; a ← a + c(b − a); b ← b + c(t − b)

Figure 4.11: Variants of conditional swap (Left: conditional swap, Right: pseudo conditional swap).
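As a worked C++ example of the right-hand variant (a sketch; the flag c must be exactly 0 or 1):

    void pseudo_conditional_swap(int &a, int &b, int c) {
        int t = a;
        a = a + c * (b - a);  // a unchanged if c == 0; a becomes b if c == 1
        b = b + c * (t - b);  // b unchanged if c == 0; b becomes the old a if c == 1
    }

For example, pseudo_conditional_swap(x, y, x > y) sorts the pair (x, y) without any conditional branch.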
Listing 4.21: The main loop of lean Lomuto partitioning (Listing 7.5).

    int *lomuto_sl(..) {

        // ..

        while (q <= r) {
            x = *q;
            smaller = (x < v);
            p += smaller;
            delta = smaller * (q - p);

            // ..
        }

        // ..
    }

Listing 4.22: Listing 4.21 transformed to assembly code using g++ compiler.

    L3:
        movl (%ebx), %ecx        ; x = *q

        xorl %edx, %edx
        cmpl %ecx, %edi
        setg %dl                 ; smaller = (x < v)

        ; [*] integer pointer += 1 ~ +4 Bytes
        movl %edx, %esi
        movl %ebx, %edx
        leal (%eax,%esi,4), %eax ; [*] p += 4*smaller

        subl %eax, %edx          ; q - p (in bytes)
        movl (%eax), %ebp        ; *p
        sarl $2, %edx            ; [*] (q - p) in elements
        imull %esi, %edx         ; delta = smaller*(q - p)

        ; ..

        jnb L3
Listing 4.23: The main loop of lean Lomuto partitioning using CMOVcc (Listing 7.6).

    int *lomuto_sl_cmov(..) {

        // ..

        while (q <= r) {
            x = *q;
            smaller = (x < v);
            p += smaller;
            delta = (smaller ? q - p : 0);

            // ..
        }

        // ..
    }

Listing 4.24: Listing 4.23 transformed to assembly code using g++ compiler.

    L3:
        movl (%ecx), %ebx        ; x = *q

        xorl %edx, %edx
        movl $0, %ebp            ; delta = 0
        cmpl %ebx, %esi
        setg %dl                 ; smaller = (x < v)

        ; [*] integer pointer += 1 ~ +4 Bytes
        leal (%eax,%edx,4), %eax ; [*] p += 4*smaller
        movl %ecx, %edx          ; edx = q
        movl (%eax), %edi        ; edi = *p

        subl %eax, %edx          ; (q - p)
        cmpl %ebx, %esi          ; x < v
        cmovg %edx, %ebp         ; delta = (q - p) if (x < v)

        ; ..

        jnb L3
Let n = (.. n5 n4 n3 n2 n1 n0)₂ be the bits of the exponent.

Left (Guided Power, Listing 7.1):

    if (n1 n0 ≠ 0) {
        if (n0) r ← rx
        if (n1) r ← rt
    }

Right (pseudo conditional branch Guided Power, Listing 7.2):

    r ← n0 · r(x − 1) + r
    r ← n1 · r(t − 1) + r

Figure 4.12: Variants of Guided Power (Left: Guided Power (Listing 7.1) [ANP16], Right: pseudo conditional branch Guided Power (Listing 7.2)).
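Our reading of the right-hand variant as C++ (a hedged sketch of what Listing 7.2 plausibly does; again with plain multiplication instead of mult):

    double guided_power_pseudo(double x, int n) {
        double r = 1;
        while (n > 0) {
            double t = x * x;
            double n0 = (n & 1);
            double n1 = (n >> 1) & 1;
            r = n0 * r * (x - 1) + r;  // r <- r*x if n0 == 1, else r unchanged
            r = n1 * r * (t - 1) + r;  // r <- r*t if n1 == 1, else r unchanged
            x = t * t;
            n >>= 2;
        }
        return r;
    }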
4.5 Safety blind swaps
Quicksort. The idea of Seidel is very simple (Figure 4.13). The algorithm (Listing 7.7) has two pointers k and g, where all elements in A[l . . . k] are smaller than the pivot and all elements in A[k + 1 . . . g − 1] are larger than or equal to the pivot. At the same time, g points to the current element that should be classified. We swap the elements A[k + 1] and A[g] without any condition (we call this a blind swap), then we move pointer g one position forward. After this blind swap, the new element that should be classified is exactly at position k + 1: if the element A[k + 1] is smaller than the pivot, pointer k should move one position forward, otherwise it should not. This step is done branchlessly by casting the outcome of (A[k + 1] < p) to a numerical type, which is then added to the pointer k as follows: k ← k + (A[k + 1] < p). The algorithm stops once the pointer g crosses the rightmost position of the input array.
We applied the concept to dual and triple pivot partitioning algorithms (Listings 7.11 and 7.14) as variants of Always-Large-First partitioning algorithms. Through this technique, we reduced the running time of the Hoare algorithm by more than half.
Figure 4.13 (one blind-swap round): Step 1: the elements α = A[k + 1] and β = A[g] are swapped unconditionally and g is incremented. Step 2: depending on whether β < v or β ≥ v, pointer k is incremented or left unchanged. [Diagrams omitted.]
[Worked example: blind-swap partitioning of the array (5 8 0 7 3 6 3 8 1) with pivot 5; each row shows the array after one step, ending with the final pivot swap:
    5 8 0 7 3 6 3 8 1
    5 8 0 7 3 6 3 8 1
    5 0 8 7 3 6 3 8 1
    5 0 7 8 3 6 3 8 1
    5 0 3 8 7 6 3 8 1
    5 0 3 6 7 8 3 8 1
    5 0 3 3 7 8 6 8 1
    5 0 3 3 8 8 6 7 1
    5 0 3 3 1 8 6 7 8
    1 0 3 3 5 8 6 7 8 ]
In our block partitioning algorithm (Listing 7.9)⁹, we store the indexes of the elements of the left side which are not smaller than the pivot in a buffer Bk, and we do the same for the right side of the input in a buffer Bg (Figure 4.15); a sketch of this scanning phase follows.
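The scanning phase can fill such a buffer entirely without conditional branches; a minimal sketch under our assumptions (block size B, buffers of offsets; not the exact code of Listing 7.9):

    int num_k = 0;
    for (int i = 0; i < B; ++i) {
        Bk[num_k] = i;               // store the offset unconditionally
        num_k += (pivot <= k[i]);    // keep it only if the element is misplaced
    }
    int num_g = 0;
    for (int i = 0; i < B; ++i) {
        Bg[num_g] = i;
        num_g += (g[-i] <= pivot);
    }

The counts num_k and num_g then determine which of the three cases below applies.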
This leads to three cases. First case: Bk has the same number of indexes as Bg, say m; then we only swap all corresponding elements and move the pointers (k, g) ← (k + m, g − m). Second case: if |Bk| > |Bg|, we swap the first |Bg| corresponding pairs, then we move the remaining elements that should be swapped to their correct positions and move the pointers k and g accordingly. Third case: |Bk| < |Bg|, analogous to the second case.
The main differences between our implementation and the Block partitioning in [EW16] are:
• In the rearrangement phase, they swap until one of the buffers contains no more pointers to elements which should be swapped, and they move the corresponding scanning pointer only for the emptied buffer; then they restart the scanning phase. In our implementation, we make sure that both buffers are empty and all elements have been moved to their correct positions before restarting the scanning phase (so the distance between the scanning pointers k and g is reduced by 2B in every iteration, where B is the size of the block).
• In the scanning phase, they do not refill the non-empty buffer (there is at most one non-empty buffer). In our implementation there is no non-empty buffer, since we empty both in the rearrangement phase, so we refill both again.
⁹ The idea was implemented in [EW16] a little differently from our implementation.
Figure 4.15: Storing indexes in preparation for the rearrangement phase in single pivot block partitioning (Listing 7.9). [Diagram omitted: Bk holds the indexes {i | v ≤ ai} of misplaced elements in the left block, Bg holds the indexes {i | ai ≤ v} of misplaced elements in the right block; corresponding entries are swapped.]
Chapter 5
Experiments
[Figure 5.1 omitted: plot of Time/n in seconds (×10⁻⁹) versus input size (log₁₀ n).]
We performed the experiments (Figure 5.1) to clarify two things.
First: negligible conditional branches need not be eliminated at all. For example, naive-MinMax has two negligible branches which induce only O(log n) branch mispredictions (a very small number); in other words, the prediction strategy helps the processor very well. As an attempt to optimize naive-MinMax, we tried to use the conditional instruction CMOVcc (Section 4.1), which did not lead to better behavior (Table 5.1). We also applied the lookup table (Section 4.3), which was the most harmful MinMax approach we have: it takes about 3 times as long as naive-MinMax does. This was expected because of the use of memory store/load instructions.
Second: how helpful the elimination of critical conditional branches is. 3/2-MinMax suffers a lot from branch misses, and by applying some simple techniques we saved about 25% of the execution time (Section 4.1).
[Plots omitted: Time/n in seconds versus input size (log₁₀ n).]
We started from the Hoare algorithm (Listing 7.4), which suffers a lot from branch misses. The Unrolling strategy (Listing 7.8) was not that good a variant, but it is still better than Hoare, since it saves about 9% of the time. Two main ideas are behind this improvement: the first is loop unrolling, and the second is using switch-case control flow instead of if-else, which enables the use of a branch table in order to avoid branch misses.
Katajainen implemented a lean version of Lomuto (Listing 7.5) in [Kat14], which behaves well on his machine. Surprisingly, however, it was not much better than Hoare on our machine, and the reason behind this unexpected result was the difference between the processor architectures. On our machine, multiplication instructions may cost approximately as much as the branch mispredictions cost in Hoare, and they are doubtless more expensive than conditional instructions like CMOVcc. This observation motivated us to replace the multiplication instructions with conditional instructions (Listing 7.6), which speeds up the algorithm by about 1.3 times.
The real competition is between the Block algorithm (Listing 7.9) and the lean-Blind algorithm (Listing 7.7). Both save about 55% of the time needed by Hoare (they can partition 2 arrays while Hoare is still processing the first one). The lean-Blind approach looks very simple and has a small number of instructions in its main loop, without using any extra space, unlike the Block approach, which needs about 16 KB in our implementation (the algorithm uses 2 buffers, each of which can store 1025 64-bit addresses, so both together need 16 KB and 16 Byte).
[Plots omitted: Time/n in seconds (×10⁻⁹) versus input size (log₁₀ n).]
Apparently, applying the safety blind swap idea (Section 4.5) to the dual and triple pivot partitioning algorithms improves both of them. It speeds up YBB (Listing 7.10) by about 1.9 times and Symmetric triple-pivot (Listing 7.12) by approximately 1.5 times.
Using the conditional instructions (Listings 4.7 and 7.13) also led to better behavior for both. Lean dual symmetric (Listing 4.7) saved about 20% of the execution time of YBB (Listing 7.10), and lean triple symmetric (Listing 7.13) saved about 25% of the execution time of symmetric triple-pivot (Listing 7.12).
Both ideas, blind swaps and conditional statements, make the main loop of the partitioning procedure lean. Seeking the reason behind the superiority of the blind swap strategy, we translated the code of both approaches to assembly level. Indeed, both induce only O(1) branch misses, but the blind approach has fewer instructions.
5.6 Quicksort
[Plot omitted: Time/(n log n) in seconds (×10⁻⁹) versus input size (log₁₀ n).]
The idea of blind swaps showed very efficient behaviour: the Blind Single-Pivot Quicksort (Listing 7.7) was the fastest algorithm, being twice as fast as Hoare. Even the Dual- and Triple-Pivot variants (Always-Large-First approaches) outperform the competition. The two main reasons are the small number of instructions and the constant, O(1), number of branch misses.
The Lean Symmetric Triple- and Dual-Pivot Quicksorts (Listings 7.13 and 4.7) were attempts to replace the conditional branches in Symmetric Triple-Pivot (Listing 7.12 [AD16]) and YBB (Listing 7.10) with conditional instructions. Although we expected a speedup very near that of the Blind (Always-Large-First) approaches, the idea was not on the same level of efficiency. One of the reasons is that the blind approaches execute fewer instructions, although both induce only O(1) branch misses.
Block Quicksort came directly in second position, after Blind Single-Pivot Quicksort. Indeed, it does excellent work, but it needs extra space for its accessory buffers, which should be adapted to the cache size (in our implementation the algorithm needs 16 KB and 16 Byte, where the size of the L1 cache is 32 KB).
The Unrolling strategy, together with using switch-case in order to take advantage of branch tables, was not a bad idea, and it may perform better if more elements are unrolled, while taking care of the code size (in our implementation we have 2⁶ = 64 cases in the switch-case statement). However, implementing a switch-case with 2^m cases does not look like a practical solution.
The lean version of Lomuto introduced by Katajainen [Kat14] (Listing 7.5) was surprisingly not a good variant. The only interpretation we have is that the algorithm uses 64-bit multiplication instructions in the main loop of the partitioning phase. Once we transformed the multiplication instruction into a conditional move instruction, the performance increased.
[Plot omitted: Time/(n log n) in seconds (×10⁻⁹) versus input size (log₁₀ n).]
Figure 5.7: Quicksorts race of all partitioning algorithms which have been consid-
ered in this thesis.
Figure 5.8 ranks the Quicksorts which we considered by their speedup relative to Hoare CQS. Let T_Hoare be the average runtime of Hoare CQS and T_Q the average runtime of Quicksort Q, for input of size 10⁸. Then the speedup of algorithm Q is defined as follows:

    Speedup_Q = T_Hoare / T_Q
    Rank  Algorithm                        Speedup
     1    Lean Blind CQS                   1.959
     2    Block CQS                        1.859
     3    Lean Blind Dual-Pivot QS         1.812
     4    Lean Blind Triple-Pivot QS       1.606
     5    Lean Symmetric Triple-Pivot QS   1.405
     6    cmov-Lean Lomuto CQS             1.291
     7    Lean Symmetric Dual-Pivot QS     1.285
     8    Unrollsix CQS                    1.104
     9    Symmetric Triple-Pivot QS        1.054
    10    Lean Lomuto CQS                  1.045
    11    YBB Dual-Pivot QS                1.032
    12    Hoare CQS                        1.000

Figure 5.8: Speedup of the considered Quicksorts relative to Hoare CQS.
Chapter 6
Conclusion
The main contribution of this thesis is to show the important role that modern processor architectures play with regard to the execution time of algorithms. Conditional branch instructions induce one of the most harmful hazards, which we should take care of. Reducing the probability of branch misprediction, or even eliminating branches altogether (lean procedures), can be a good solution. The results we obtained on our machine showed a clear improvement in Quicksort and some other algorithms. The developer is responsible for the optimization strategy (which branch should be optimized/eliminated and which not, taking the architecture of the processor into account).
All strategies and techniques we used in this work are language-dependent (implemented using C/C++), and this leads to an interesting question I look forward to answering: "Which strategies and techniques are applicable in Java, and are there new factors which should be considered in such an optimization process?"
Chapter 7
Appendix
7.1 Proofs
Proof (1-bit prediction scheme). We prove the formula for the expected number of branch mispredictions given in Table 2.2 (Section 2.3.1). To get the probability of a branch miss, we use the stationary distribution of the corresponding Markov chain, as follows. Assume p is the probability that the branch is taken, and let (π1, π2) represent the states (T, N).
The transition matrix is

             T      N
    Π =  T ( p    1 − p )
         N ( p    1 − p )

The stationarity conditions are

    π1 + π2 = 1
    (1 − p)π1 = pπ2

We conclude π2 = ((1 − p)/p) π1, and together with π1 + π2 = 1 we get π(p) = (p, 1 − p).
A branch miss occurs when the prediction automaton is in state T and the branch is not taken, or when the automaton is in state N and the branch is taken. So the probability of a branch miss is π(p) · (1 − p, p) = 2p(1 − p), and the expected number of branch misses over n executions is 2np(1 − p), for all n > 0.
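As a quick sanity check of this formula, one can simulate the 1-bit predictor (a hedged sketch, not part of the thesis's experiments):

    #include <cstdio>
    #include <cstdlib>

    int main() {
        const int n = 1000000;
        const double p = 0.3;   // probability that the branch is taken
        int state = 1;          // 1 = predict taken (T), 0 = predict not-taken (N)
        int misses = 0;
        for (int k = 0; k < n; ++k) {
            int taken = (rand() < p * RAND_MAX);  // branch outcome with Pr(taken) = p
            misses += (taken != state);
            state = taken;                        // 1-bit predictor: remember last outcome
        }
        printf("measured: %f  expected 2p(1-p): %f\n",
               (double)misses / n, 2 * p * (1 - p));
        return 0;
    }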
Proof (2-bit-saturate-counter prediction scheme). Let (π1, π2, π3, π4) represent the states (ST, WT, WN, SN). The transition matrix is

              ST     WT     WN     SN
    Π =  ST (  p   1 − p    0      0   )
         WT (  p     0    1 − p    0   )
         WN (  0     p      0    1 − p )
         SN (  0     0      p    1 − p )

The stationarity conditions are

    π1 + π2 + π3 + π4 = 1
    (1 − p)π1 = pπ2
    π2 = (1 − p)π1 + pπ3
    π3 = (1 − p)π2 + pπ4
    pπ4 = (1 − p)π3

We conclude π2 = ((1 − p)/p) π1, π3 = ((1 − p)/p)² π1 and π4 = ((1 − p)/p)³ π1. Together with π1 + π2 + π3 + π4 = 1, we get

    π(p) = (1/(1 − 2p(1 − p))) · (p³, p²(1 − p), p(1 − p)², (1 − p)³).

A branch miss occurs when the prediction automaton is in state ST or WT and the branch is not taken, or when the automaton is in state WN or SN and the branch is taken. So the probability of a branch miss is

    π(p) · (1 − p, 1 − p, p, p) = p(1 − p) / (1 − 2p(1 − p)),

and the expected number of branch misses over n executions is np(1 − p)/(1 − 2p(1 − p)).
Proof (2-bit-flip-on-consecutive prediction scheme). Let (π1, π2, π3, π4) represent the states (ST, WT, WN, SN). The transition matrix is

              ST     WT     WN     SN
    Π =  ST (  p   1 − p    0      0   )
         WT (  p     0      0    1 − p )
         WN (  p     0      0    1 − p )
         SN (  0     0      p    1 − p )

The stationarity conditions are

    π1 + π2 + π3 + π4 = 1
    (1 − p)π1 = pπ2 + pπ3
    π2 = (1 − p)π1
    π3 = pπ4
    pπ4 = (1 − p)π2 + (1 − p)π3

We conclude π2 = (1 − p)π1, π3 = ((1 − p)²/p) π1 and π4 = ((1 − p)²/p²) π1. Together with π1 + π2 + π3 + π4 = 1, we get

    π(p) = (1/(1 − p(1 − p))) · (p², p²(1 − p), p(1 − p)², (1 − p)²).

A branch miss occurs when the prediction automaton is in state ST or WT and the branch is not taken, or when the automaton is in state WN or SN and the branch is taken. So the probability of a branch miss is

    π(p) · (1 − p, 1 − p, p, p) = (2p²(1 − p)² + p(1 − p)) / (1 − p(1 − p)),

and the expected number of branch misses over n executions is n(2p²(1 − p)² + p(1 − p))/(1 − p(1 − p)).
7.2 GNU Compiler Collection
• -O0 : Reduce compilation time and make debugging produce the expected
results. This is the default.
• -O1 : Optimizing compilation takes somewhat more time, and a lot more
memory for a large function. With -O, the compiler tries to reduce code size
and execution time, without performing any optimizations that take a great
deal of compilation time.
• -O2 : Optimize even more. GCC performs nearly all supported optimizations
that do not involve a space-speed tradeoff. As compared to -O, this option
increases both compilation time and the performance of the generated code.
• -O3 : Optimize yet more. ‘-O3’ turns on all optimizations specified by -O2 and also turns on further optimization flags.
7.3 Routines
Listing 7.1: Guided Power (Auger, Nicaud and Pivoteau [ANP16]) (Section 4.3
and 4.4).
1 double guided_power(double x,int n){
2 double r = 1, t;
3
4 while (n > 0) {
5 t = mult(x,x);
6
7 if (n & 3) {
8 if (n & 1) r = mult(r,x);
9 if (n & 2) r = mult(r,t);
10 }
11
12 x = mult(t,t);
13 n >>= 2;
14 }
15
16 return r;
17 }
Listing 7.3: 3/2 MinMax (Auger, Nicaud and Pivoteau [ANP16], Section 4.1).
1 void half3_minmax(int *A, unsigned long int N, int &min, int &max) {
2 max = - 1; min = MAX + 1; // MAX = 10000000
3
4 for (unsigned long int i = 0; i < N; i += 2) {
5 if (A[i] < A[i + 1]) {
6 if (A[i] < min) min = A[i];
7 if (A[i + 1] > max) max = A[i + 1];
8 } else {
9 if (A[i + 1] < min) min = A[i + 1];
10 if (A[i] > max) max = A[i];
11 }
12 }
13 }
Listing 7.7: Lean Blind Single-Pivot Partitioning (Seidel idea, Section 4.5).
1 int *blind_sl(int *left, int *right) {
2
3 int *k = left;
4 int *g = left + 1;
5 int p = *left;
6
7 for (; g <= right; g++) {
8 SWAP(*g, *(k + 1));
9 k += (*(k + 1) < p);
10 }
11
12 SWAP(*left, *k);
13 return k;
14 }
69 k += 3; g -= 3; break;
70
71 case 18: SWAP(*k, *g); SWAP(*(k + 2), *(g - 2));
72 k += 3; g -= 3; break;
73
74 case 19: SWAP(*k, *(g - 2)); SWAP(*(k + 2), *(g - 3));
75 k += 2; g -= 4; break;
76
77 case 20: SWAP(*k, *g); SWAP(*(k + 2), *(g - 1));
78 k += 3; g -= 3; break;
79
80 case 21: SWAP(*k, *(g - 1)); SWAP(*(k + 2), *(g - 3));
81 k += 2; g -= 4; break;
82
83 case 22: SWAP(*k, *g); SWAP(*(k + 2), *(g - 3));
84 k += 2; g -= 4; break;
85
86 case 23: SWAP(*(k + 2), *(g - 3)); ROTATE(*k, *(k + 1), *(g - 4));
87 k += 1; g -= 5; break;
88
89 case 24: SWAP(*k, *g); SWAP(*(k + 3), *(g - 2)); SWAP(*(k + 4), *(g - 1));
90 k += 5; g -= 1; break;
91
92 case 25: SWAP(*k, *(g - 1)); SWAP(*(k + 3), *(g - 2));
93 k += 4; g -= 2; break;
94
95 case 26: SWAP(*k, *g); SWAP(*(k + 3), *(g - 2));
96 k += 4; g -= 2; break;
97
98 case 27: SWAP(*k, *(g - 2));
99 k += 3; g -= 3; break;
100
101 case 28: SWAP(*k, *g); ROTATE(*(g - 1), *(g - 2), *(k + 3));
102 k += 4; g -= 2; break;
103
104 case 29: SWAP(*k, *(g - 1));
105 k += 3; g -= 3; break;
106
107 case 30: SWAP(*k, *g);
108 k += 3; g -= 3; break;
109
110 case 31: ROTATE(*k, *(k + 2), *(g - 3));
111 k += 2; g -= 4; break;
112
113 case 32: SWAP(*(k + 1), *g); SWAP(*(k + 2), *(g - 1)); SWAP(*(k + 3), *(g - 2));
114 k += 4; g -= 2; break;
115
116 case 33: SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 2), *(g - 2));
117 k += 3; g -= 3; break;
118
119 case 34: SWAP(*(k + 1), *g); SWAP(*(k + 2), *(g - 2));
120 k += 3; g -= 3; break;
121
122 case 35: SWAP(*(k + 1), *(g - 2)); SWAP(*(k + 2), *(g - 3));
123 k += 2; g -= 4; break;
124
125 case 36: SWAP(*(k + 1), *g); SWAP(*(k + 2), *(g - 1));
126 k += 3; g -= 3; break;
127
128 case 37: SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 2), *(g - 3));
129 k += 2; g -= 4; break;
130
131 case 38: SWAP(*(k + 1), *g); SWAP(*(k + 2), *(g - 3));
132 k += 2; g -= 4; break;
133
134 case 39: SWAP(*(k + 1), *(g - 3)); SWAP(*(k + 2), *(g - 4));
135 k += 1; g -= 5; break;
136
137 case 40: SWAP(*(k + 1), *g); SWAP(*(k + 3), *(g - 1)); SWAP(*(k + 4), *(g - 2));
138 k += 5; g -= 1; break;
139
140 case 41: SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 3), *(g - 2));
141 k += 4; g -= 2; break;
142
143 case 42: SWAP(*(k + 1), *g); SWAP(*(k + 3), *(g - 2));
144 k += 4; g -= 2; break;
145
146 case 43: SWAP(*(k + 1), *(g - 2));
147 k += 3; g -= 3; break;
148
149 case 44: SWAP(*(k + 1), *g); ROTATE(*(g - 1), *(g - 2), *(k + 3));
150 k += 4; g -= 2; break;
151
152 case 45: SWAP(*(k + 1), *(g - 1));
153 k += 3; g -= 3; break;
154
155 case 46: SWAP(*(k + 1), *g);
156 k += 3; g -= 3; break;
157
158 case 47: ROTATE(*(k + 1), *(k + 2), *(g - 3));
159 k += 2; g -= 4; break;
160
161 case 48: SWAP(*(k + 2), *g); SWAP(*(k + 3), *(g - 1)); SWAP(*(k + 4), *(g - 2));
162 k += 5; g -= 1; break;
163
164 case 49: SWAP(*(k + 2), *(g - 1)); SWAP(*(k + 3), *(g - 2));
165 k += 4; g -= 2; break;
166
167 case 50: SWAP(*(k + 2), *g); SWAP(*(k + 3), *(g - 2));
168 k += 4; g -= 2; break;
169
170 case 51: SWAP(*(k + 2), *(g - 2));
171 k += 3; g -= 3; break;
172
173 case 52: SWAP(*(k + 2), *g); ROTATE(*(g - 1), *(g - 2), *(k + 3));
174 k += 4; g -= 2; break;
175
176 case 53: SWAP(*(k + 2), *(g - 1));
177 k += 3; g -= 3; break;
178
179 case 54: SWAP(*(k + 2), *g);
180 k += 3; g -= 3; break;
181
182 case 55: SWAP(*(k + 2), *(g - 3));
183 k += 2; g -= 4; break;
184
185 case 56: SWAP(*(k + 3), *g); SWAP(*(k + 4), *(g - 1)); SWAP(*(k + 5), *(g - 2));
186 k += 6; break;
187
188 case 57: SWAP(*(k + 3), *(g - 1)); SWAP(*(k + 4), *(g - 2));
189 k += 5; g -= 1; break;
190
191 case 58: SWAP(*(k + 3), *(g - 2)); ROTATE(*g, *(g - 1), *(k + 4));
192 k += 5; g -= 1; break;
193
194 case 59: SWAP(*(k + 3), *(g - 2));
195 k += 4; g -= 2; break;
196
197 case 60: SWAP(*(k + 4), *(g - 1)); ROTATE(*g, *(g - 2), *(k + 3));
198 k += 5; g -= 1; break;
199
200 case 61: ROTATE(*(g - 1), *(g - 2), *(k + 3));
201 k += 4; g -= 2; break;
202
203 case 62: ROTATE(*g, *(g - 2), *(k + 3));
204 k += 4; g -= 2; break;
205
206 case 63: k += 3; g -= 3;
207
208 }
209 }
210
211 SWAP(*left, *--k);
212 k = blind_sl(k, g);
213
214 return k;
215 }
Chapter 8
Bibliography
[ALSU06] Aho, Alfred V.; Lam, Monica S.; Sethi, Ravi; Ullman, Jeffrey D.: Compilers: Principles, Techniques, and Tools (2nd Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2006. ISBN 0321486811
[Mor01] Mortensen, Sofus: Refining the pure-C cost model. Master thesis, 2001
[Sta03] Stallman, Richard M.: Using the GNU Compiler Collection. GNU Press, 2003
[SW04] Sanders, Peter; Winkel, Sebastian: Super Scalar Sample Sort. In: Algorithms - ESA 2004, 12th Annual European Symposium, Bergen, Norway, September 14-17, 2004, Proceedings, 2004, pp. 784-796
[Wil13] Wild, Sebastian: Java 7's Dual Pivot Quicksort. Master thesis, 2013
I hereby declare in lieu of oath that I have written this thesis independently and without the use of any aids other than those stated. All passages taken verbatim or in substance from published or unpublished writings are marked as such. This thesis has not been submitted in the same or a similar form to any other examination authority.