Parallel Algorithms WS 20
Parallel Algorithms
Peter Sanders
Institute of Theoretical Informatics
Sanders: Parallel Algorithms December 10, 2020 2
save energy: two processors at half the clock frequency need less
power than one processor at full speed (power ≈ voltage² · clock frequency)
Overview
Models, simple examples
Matrix multiplication
Broadcasting
Sorting
General data exchange
Load balancing I, II, III
List ranking (conversion list → array)
Hashing, priority queues
Simple graph algorithms
Sanders: Parallel Algorithms December 10, 2020 5
Literature
Script (in German)
+
More Literature
[Sanders, Worsch],
Parallele Programmierung mit MPI – ein Praktikum, Logos, 1997.
Sanders: Parallel Algorithms December 10, 2020 7
Related Courses
Parallel programming: Tichy, Karl, Streit
Computer architecture (Rechnerarchitektur): Karl
GPUs: Dachsbacher
Algorithmenanalyse:
Count cycles: T(I), for a given problem instance I.
Worst case depending on problem size: T(n) = max_{|I|=n} T(I)
Average case: T_avg(n) = (∑_{|I|=n} T(I)) / |{I : |I| = n}|
Example: Quicksort has average case complexity O(n log n)
shared memory
Sanders: Parallel Algorithms December 10, 2020 14
Access Conflicts?
Example: Global Or
Input in x[1..p]
Parallel on PE i = 1..p
Global And
O(1) time
Θ(n²) PEs (!)
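To make the CRCW-style global OR concrete, here is a minimal shared-memory sketch (my own, not from the script): every PE whose input bit is set concurrently writes 1 into a common location, i.e., concurrent writes of the same value as in the COMMON CRCW model.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Global OR of x[0..p-1] in O(1) parallel time: every PE whose bit is set
// writes 1 into the shared result (all writers write the same value).
int globalOr(const std::vector<int>& x) {
  std::atomic<int> result{0};
  std::vector<std::thread> pes;
  for (std::size_t i = 0; i < x.size(); ++i)
    pes.emplace_back([&result, &x, i] {
      if (x[i]) result.store(1, std::memory_order_relaxed);
    });
  for (auto& t : pes) t.join();
  return result.load();
}

int main() { std::cout << globalOr({0, 0, 1, 0}) << '\n'; }  // prints 1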
Sanders: Parallel Algorithms December 10, 2020 17
i A B 1 2 3 4 5 <- j M
1 3 * 0 1 0 1 1
2 5 1 * 1 0 1 1
3 2 0 0 * 0 1 1
4 8 1 1 1 * 1 1
5 1 0 0 0 0 * 1
A 3 5 2 8 1
-------------------------------
i A B 1 2 3 4 5 <- j M
1 3 * 0 1 0 1 0
2 5 1 * 1 0 1 0
3 2 0 0 * 0 1 0
4 8 1 1 1 * 1 1->maxValue=8
5 1 0 0 0 0 * 0
Sanders: Parallel Algorithms December 10, 2020 18
Distributed Memory
[Figure: distributed memory — PEs 1..p, each with its own local memory, connected by a network]
Sanders: Parallel Algorithms December 10, 2020 21
[Figure: shared memory — processors 0..p−1 with caches, connected via a network to memory modules]
Sanders: Parallel Algorithms December 10, 2020 22
Problems
Function fetchAndAdd(a, ∆)
repeat
expected := ∗a
desired := expected + ∆
until CAS(a, expected, desired)
return desired
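In C++ the same CAS loop can be written with std::atomic (a sketch, not the script's code); compare_exchange_weak reloads expected on failure, which is why no explicit re-read is needed inside the loop.

#include <atomic>

// fetchAndAdd via a CAS loop; returns the new value, as in the pseudocode above.
int fetchAndAdd(std::atomic<int>& a, int delta) {
  int expected = a.load();
  int desired;
  do {
    desired = expected + delta;
    // on failure, compare_exchange_weak writes the current value of a into 'expected'
  } while (!a.compare_exchange_weak(expected, desired));
  return desired;
}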
Sanders: Parallel Algorithms December 10, 2020 26
PEM
[Figure: p processors, each with a private cache of size M, exchanging blocks of size B with a shared memory]
Sanders: Parallel Algorithms December 10, 2020 27
[Figure from the book: the real memory hierarchy — threads/cores with L1 and L2 caches, a shared L3 cache per processor, main memory and SSD within a compute node, further compute nodes reachable over the network, plus disks, tape, and the Internet]
Sanders: Parallel Algorithms December 10, 2020 29
Proposition: we get quite far with flat models, in particular for shared memory.
Explicit „Store-and-Forward“
We know the interconnection graph
(V = {1, . . . , p} , E ⊆ V ×V ). Variants:
– V = {1, . . . , p} ∪ R with additional
„dumb“ router nodes (perhaps with buffer memory).
– Buses → Hyperedges
In each time step each edge can transport up to k′ data packets of
constant length (usually k′ = 1)
Discussion
+ simple formulation
− low level ⇒ „messy algorithms“
− Hardware routers allow fast communication whenever an unloaded
communication path is found.
Sanders: Parallel Algorithms December 10, 2020 32
[Book]
Sanders: Parallel Algorithms December 10, 2020 33
BSP∗
BSP+
MapReduce
A ⊆ I
1: map      B = ⋃_{a∈A} µ(a) ⊆ K × V
2: shuffle
3: reduce   C = {(k, X) : k ∈ K ∧ X = {x : (k, x) ∈ B} ∧ X ≠ ∅}
            D = ⋃_{c∈C} ρ(c)
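A minimal sequential sketch of this µ/ρ formalism with word count as the usual example (all names are mine, not from the slides):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// map mu: one input line -> (word, 1) pairs
std::vector<KV> mapper(const std::string& line) {
  std::vector<KV> out;
  std::istringstream in(line);
  for (std::string w; in >> w;) out.push_back({w, 1});
  return out;
}

// reduce rho: (key, all values for that key) -> aggregated pair
KV reducer(const std::string& key, const std::vector<int>& vals) {
  int sum = 0;
  for (int v : vals) sum += v;
  return {key, sum};
}

int main() {
  std::vector<std::string> A = {"a rose is", "a rose"};
  std::vector<KV> B;                               // 1: map
  for (auto& a : A) for (auto& kv : mapper(a)) B.push_back(kv);
  std::map<std::string, std::vector<int>> C;       // 2: shuffle (group by key)
  for (auto& [k, v] : B) C[k].push_back(v);
  for (auto& [k, X] : C) {                          // 3: reduce
    auto [key, cnt] = reducer(k, X);
    std::cout << key << ": " << cnt << '\n';
  }
}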
Sanders: Parallel Algorithms December 10, 2020 41
MapReduce Discussion
+ Abstracts away difficult issues:
  * parallelization
  * load balancing
  * fault tolerance
  * memory hierarchies
  * ...
− Large overheads
− Limited functionality
Sanders: Parallel Algorithms December 10, 2020 43
µ and ρ use space O(n^{1−ε}) ("substantially sublinear")
overall space for B: O(n^{2−ε}) ("substantially subquadratic")
solvable in O(polylog(n)) MapReduce steps
2 supersteps:
1. Run the mapper µ on the local part of A and send each resulting pair (k, x) ∈ B to the PE responsible for key k (e.g., by hashing).
2. Receive elements of B. Build elements of C. Run the reducer.
Send all but the first result of a reducer to a random PE.
Sanders: Parallel Algorithms December 10, 2020 47
[Figure: emulating one MapReduce step in two BSP supersteps — superstep 1: map and shuffle; superstep 2: reduce]
Sanders: Parallel Algorithms December 10, 2020 48
Time
2L + O(w/p + g·m/p)
if w = Ω(ŵ·p·log p) and m = Ω(m̂·p·log p)
Sanders: Parallel Algorithms December 10, 2020 49
Circuits
k = 0: trivial
k → k + 1:
⊕_{i < 2^{k+1}} x_i = (⊕_{i < 2^k} x_i) ⊕ (⊕_{i < 2^k} x_{i+2^k})
Both operands are computable at depth k (induction hypothesis), so the whole sum has depth k + 1.
[Figure: the resulting circuit with levels 0 .. k + 1]
Sanders: Parallel Algorithms December 10, 2020 53
PRAM Code
PE index i ∈ {0, . . . , n − 1}
active := 1
for 0 ≤ k < ⌈log n⌉ do
  if active then
    if bit k of i then
      active := 0
    else if i + 2^k < n then
      x_i := x_i ⊕ x_{i+2^k}
// result is in x_0
[Figure: the reduction tree on 16 PEs 0..f]
Careful: much more complicated on a real asynchronous
shared-memory machine.
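A small sequential simulation of these synchronous rounds (my sketch; a real shared-memory version would additionally need a barrier after every round):

#include <cstddef>
#include <iostream>
#include <vector>

// Simulate the synchronous PRAM reduction: in round k (stride 2^k), every PE i
// with k+1 zero low bits adds its partner's value.
int pramSum(std::vector<int> x) {
  std::size_t n = x.size();
  for (std::size_t stride = 1; stride < n; stride *= 2)   // round k, stride = 2^k
    for (std::size_t i = 0; i < n; ++i)                   // "parallel" over PEs
      if (i % (2 * stride) == 0 && i + stride < n)        // PE active and partner exists
        x[i] += x[i + stride];
  return x[0];                                            // result is in x[0]
}

int main() { std::cout << pramSum({1, 2, 3, 4, 5}) << '\n'; }  // prints 15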
Analysis
n PEs
Time O(log n)
Speedup O(n/ log n)
Efficiency O(1/ log n)
[Figure: the reduction tree on 16 PEs]
Sanders: Parallel Algorithms December 10, 2020 55
p PEs
Each PE adds n/p elements sequentially.
Then parallel sum over the p partial sums.
Time Tseq(n/p) + Θ(log p)
Efficiency:
Tseq(n) / (p·(Tseq(n/p) + Θ(log p))) = 1/(1 + Θ(p log p)/n) = 1 − Θ(p log p / n)
if n ≫ p log p
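A hedged MPI sketch of this scheme — local summation followed by a logarithmic-depth reduction (the data distribution is invented for the example):

#include <mpi.h>
#include <cstdio>
#include <numeric>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, p;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  // Each PE holds n/p elements; here simply the value rank+1 repeated.
  std::vector<long long> local(1000, rank + 1);
  long long localSum = std::accumulate(local.begin(), local.end(), 0LL);  // Tseq(n/p)
  long long globalSum = 0;
  // Tree reduction over the p partial sums: Theta(log p) communication rounds.
  MPI_Reduce(&localSum, &globalSum, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) std::printf("sum = %lld\n", globalSum);
  MPI_Finalize();
}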
Sanders: Parallel Algorithms December 10, 2020 56
Analysis
Matrix Multiplication
Given: matrices A ∈ R^{n×n}, B ∈ R^{n×n}
with A = ((a_ij)) and B = ((b_ij))
R: a semiring
C = ((c_ij)) = A · B, well known:
c_ij = ∑_{k=1}^{n} a_ik · b_kj
work: Θ(n³) arithmetic operations
(better algorithms exist if R allows subtraction)
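For reference, the Θ(n³)-work sequential computation that the following slides parallelize (a plain sketch over the (+, ·) semiring):

#include <vector>

using Matrix = std::vector<std::vector<double>>;

// C = A * B over the (+, *) semiring; Theta(n^3) arithmetic operations.
Matrix multiply(const Matrix& A, const Matrix& B) {
  std::size_t n = A.size();
  Matrix C(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        C[i][j] += A[i][k] * B[k][j];   // c_ij = sum_k a_ik * b_kj
  return C;
}

int main() { Matrix A{{1, 2}, {3, 4}}, B{{5, 6}, {7, 8}}; return multiply(A, B)[0][0] == 19 ? 0 : 1; }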
Sanders: Parallel Algorithms December 10, 2020 60
n³ PEs
for i := 1 to n dopar
  for j := 1 to n dopar
    c_ij := ∑_{k=1}^{n} a_ik · b_kj   // parallel sum using n PEs
Distributed Implementation I
p ≤ n² PEs
for i := 1 to n dopar
  for j := 1 to n dopar
    c_ij := ∑_{k=1}^{n} a_ik · b_kj
− limited scalability
− high communication volume: time Ω(β · n²/√p)
[Figure: each PE owns an (n/√p) × (n/√p) block of C]
Sanders: Parallel Algorithms December 10, 2020 62
Assume p = N³ and n is a multiple of N.
View A, B, C as N × N matrices whose elements are (n/N) × (n/N) matrices.
[Figure: the 3D view with block indices i, j, k and block size n/N]
Sanders: Parallel Algorithms December 10, 2020 64
Broadcast / Reduction
[Figure: a message of length n and PEs 0, 1, 2, ..., p−1]
Reduction: turn around the direction of communication and add the corresponding parts of the arriving messages and the own message.
Modelling Assumptions
fully connected
full-duplex – parallel send and receive
Variants: half-duplex (i.e., send or receive), BSP, embedding into
concrete networks
Sanders: Parallel Algorithms December 10, 2020 69
Procedure naiveBroadcast(m[1..n])
PE 0: for i := 1 to p − 1 do send m to PE i
PE i > 0: receive m
Time: (p − 1)(nβ + α)
nightmare for implementing a scalable algorithm
[Figure: PE 0 sends the length-n message directly to each of PEs 1..p−1]
Sanders: Parallel Algorithms December 10, 2020 70
Procedure binomialTreeBroadcast(m[1..n])
PE index i ∈ {0, . . . , p − 1}
// Message m located on PE 0
if i > 0 then receive m
for k := min{⌈log p⌉ , trailingZeroes(i)} − 1 downto 0 do
  send m to PE i + 2^k   // noop if receiver ≥ p
[Figure: binomial tree broadcast on 16 PEs 0..f]
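A hedged MPI rendering of the binomial tree broadcast (my sketch, not the lecture's reference implementation):

#include <mpi.h>
#include <cstdio>
#include <vector>

// Binomial-tree broadcast of a buffer from PE 0, following the pseudocode above.
void binomialBroadcast(std::vector<char>& m, MPI_Comm comm) {
  int i, p;
  MPI_Comm_rank(comm, &i);
  MPI_Comm_size(comm, &p);
  int logp = 0;
  while ((1 << logp) < p) ++logp;                  // ceil(log2 p)
  int tz = 0;                                      // trailing zero bits of i
  while (i != 0 && tz < logp && ((i >> tz) & 1) == 0) ++tz;
  if (i == 0) tz = logp;                           // PE 0 behaves as if it had logp of them

  int n = static_cast<int>(m.size());
  if (i > 0)                                       // receive from parent i - 2^tz
    MPI_Recv(m.data(), n, MPI_CHAR, i - (1 << tz), 0, comm, MPI_STATUS_IGNORE);
  for (int k = tz - 1; k >= 0; --k)
    if (i + (1 << k) < p)                          // noop if receiver >= p
      MPI_Send(m.data(), n, MPI_CHAR, i + (1 << k), 0, comm);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::vector<char> m(1 << 20, rank == 0 ? 'x' : '?');
  binomialBroadcast(m, MPI_COMM_WORLD);
  std::printf("PE %d: m[0] = %c\n", rank, m[0]);
  MPI_Finalize();
}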
Sanders: Parallel Algorithms December 10, 2020 71
Analysis
Time: ⌈log p⌉ (nβ + α )
Optimal for n = 1
Embeddable into a linear array
Can the term n · f(p) be improved to n + log p?
[Figure: the binomial tree embedded into a linear array of 16 PEs]
Sanders: Parallel Algorithms December 10, 2020 72
Linear Pipeline
Procedure linearPipelineBroadcast(m[1..n], k)
PE index i ∈ {0, . . . , p − 1}
// Message m located on PE 0
// assume k divides n
define piece j as m[(j − 1)·n/k + 1 .. j·n/k]
for j := 1 to k do
  if i > 0 then receive piece j
  if i < p − 1 then send piece j to PE i + 1
[Figure: the k pieces moving through the linear array, one hop per step]
Analysis
Time (n/k)·β + α per step (≠ iteration)
5
[Plot: normalized broadcast time T/(nTbyte + ⌈log p⌉Tstart) vs. nTbyte/Tstart for p = 16; curves bino16, pipe16]
Sanders: Parallel Algorithms December 10, 2020 75
[Plot: the same comparison for p = 1024; curves bino1024, pipe1024]
Sanders: Parallel Algorithms December 10, 2020 76
Discussion
Linear pipelining is optimal for fixed p and n → ∞
But for large p extremely large messages are needed
Can the αp term be reduced to α log p?
Sanders: Parallel Algorithms December 10, 2020 77
Procedure binaryTreePipelinedBroadcast(m[1..n], k)
// Message m located on root, assume k divides n
define piece j as m[( j − 1) nk + 1.. j nk ]
for j := 1 to k do
if parent exists then receive piece j
if left child ℓ exists then send piece j to ℓ
if right child r exists then send piece j to r
[Figure: schedule of steps 1–13 — each PE alternates recv, send-left, send-right]
Sanders: Parallel Algorithms December 10, 2020 78
Example
[Figure: the schedule over steps 1–13, listing the recv / send-left / send-right operations per step]
Sanders: Parallel Algorithms December 10, 2020 79
Analysis
Time (n/k)·β + α per step (≠ iteration)
2j steps until the first packet reaches level j
d := ⌊log p⌋ levels
Overall: T(n, p, k) := (2d + 3(k − 1)) · ((n/k)·β + α)
optimal k: √(n(2d − 3)β / (3α))
substituted: T*(n, p) = 2dα + 3nβ + O(√(n·d·α·β))
Sanders: Parallel Algorithms December 10, 2020 81
Fibonacci-Trees
[Figure: Fibonacci trees of depth 0..4 with 1, 2, 4, 7, 12 nodes]
Analysis
Time (n/k)·β + α per step (≠ iteration)
j steps until the first packet reaches level j
How many PEs p_j are in levels 0..j?
p_0 = 1, p_1 = 2, p_j = p_{j−2} + p_{j−1} + 1; ask Maple:
rsolve({p(0)=1, p(1)=2, p(i)=p(i-2)+p(i-1)+1}, p(i));
p_j ≈ (3√5 + 5) / (5(√5 − 1)) · Φ^j ≈ 1.89 · Φ^j
with Φ = (1 + √5)/2 (golden ratio)
d ≈ log_Φ p levels
overall: T*(n, p) = dα + 3nβ + O(√(n·d·α·β))
Sanders: Parallel Algorithms December 10, 2020 83
Procedure fullDuplexBinaryTreePipelinedBroadcast(m[1..n], k)
// Message m located on root, assume k divides n
define piece j as m[( j − 1) nk + 1.. j nk ]
for j := 1 to k + 1 do
receive piece j from parent // noop for root or j = k + 1
and, concurrently, send piece j − 1 to right child
// noop if no such child or j = 1
send piece j to left child
// noop if no such child or j = k + 1
[Figure: alternating even and odd steps]
Sanders: Parallel Algorithms December 10, 2020 84
Analysis
Time (n/k)·β + α per step
j steps until the first packet reaches level j
d ≈ log_Φ p levels
Then 2 steps for each further packet
Overall: T*(n, p) = dα + 2nβ + O(√(n·d·α·β))
Sanders: Parallel Algorithms December 10, 2020 85
[Plot: normalized broadcast time vs. nTbyte/Tstart for p = 16; curves bino16, pipe16, btree16]
Sanders: Parallel Algorithms December 10, 2020 86
[Plot: the same comparison for p = 1024; curves bino1024, pipe1024, btree1024]
Sanders: Parallel Algorithms December 10, 2020 87
Discussion
General p:
use next larger tree. Then drop a subtree
Sanders: Parallel Algorithms December 10, 2020 88
H-Trees
Sanders: Parallel Algorithms December 10, 2020 89
H-Trees
Sanders: Parallel Algorithms December 10, 2020 90
Root Process
for j := 1 to k step 2 do
send piece j + 0 along edge labelled 0
send piece j + 1 along edge labelled 1
[Figure: the two trees on PEs 0..14 with edges labelled 0 and 1]
Sanders: Parallel Algorithms December 10, 2020 93
Other Processes
Wait for the first piece to arrive.
if it comes from the upper tree over an edge labelled b then
  ∆ := 2 · (distance of the node from the bottom in the upper tree)
for j := 1 to k + ∆ step 2 do
  along b-edges: receive piece j and send piece j − 2
  along (1 − b)-edges: receive piece j + 1 − ∆ and send piece j
[Figure: the two edge-labelled trees on PEs 0..14]
Sanders: Parallel Algorithms December 10, 2020 94
[Figure: two-tree (23-broadcast) construction for p = 14 and p = 13]
Sanders: Parallel Algorithms December 10, 2020 95
[Figure: construction for p = 13 and p = 12]
Sanders: Parallel Algorithms December 10, 2020 96
[Figure: construction for p = 12 and p = 11]
Sanders: Parallel Algorithms December 10, 2020 97
[Figure: construction for p = 11 and p = 10]
Sanders: Parallel Algorithms December 10, 2020 98
[Figure: constructions for p = 10, 9, 7, and 8]
Sanders: Parallel Algorithms December 10, 2020 99
Coloring Edges
[Figure: bipartite graph for the edge coloring — sender row s (PEs 0..14) and receiver row r (PEs 0..13)]
Sanders: Parallel Algorithms December 10, 2020 102
Analysis
Time (n/k)·β + α per step
2j steps until all PEs in level j are reached
d = ⌈log(p + 1)⌉ levels
Then 2 steps for 2 further packets
T(n, p, k) ≈ ((n/k)·β + α)·(2d + k − 1), with d ≈ log p
optimal k: √(n(2d − 1)β / α)
T*(n, p) ≈ nβ + 2α·log p + 2·√(n·log p·α·β)
Sanders: Parallel Algorithms December 10, 2020 105
[Plot: normalized broadcast time vs. nTbyte/Tstart for p = 16; curves bino16, pipe16, 2tree16]
Sanders: Parallel Algorithms December 10, 2020 106
[Plot: the same comparison for p = 1024; curves bino1024, pipe1024, 2tree1024]
Sanders: Parallel Algorithms December 10, 2020 107
23-Reduction
[Figure: 23-reduction on 15 PEs; the handling of the length-n message parts differs for PEs left of the root, the root itself, and PEs right of the root]
Sanders: Parallel Algorithms December 10, 2020 109
Hypercube Hd
p = 2d PEs
nodes V = {0, 1}^d, i.e., write node numbers in binary
edges in dimension i: E_i = {(u, v) : u ⊕ v = 2^i}
E = E_0 ∪ · · · ∪ E_{d−1}
[Figure: hypercubes of dimension d = 0, 1, 2, 3, 4]
Sanders: Parallel Algorithms December 10, 2020 111
ESBT-Broadcasting
In step i, communication runs along dimension i mod d
Decompose H_d into d edge-disjoint spanning binomial trees (ESBTs)
PE 0^d cyclically distributes the packets to the roots of the ESBTs
The ESBT roots perform binomial tree broadcasting
(except for the missing smallest subtree 0^d)
[Figure: decomposition of H_3 into three ESBTs; steps 0 mod 3, 1 mod 3, 2 mod 3]
Sanders: Parallel Algorithms December 10, 2020 112
Discussion
Which algorithm is best depends on n, on p, and on the network:
binomial tree — small n
linear pipeline — very large n
binary tree / 23-broadcast — large n
ESBT — large n, requires p = 2^d
special algorithms for special cases
Sanders: Parallel Algorithms December 10, 2020 114
Reality Check
Libraries (e.g. MPI) often do not have a pipelined implementation
of collective operations ⇒ your own broadcast may be significantly
faster than a library routine
Beyond Broadcast
Pipelining is an important technique for handling large data sets
Hypercube algorithms are often elegant and efficient (and often
simpler than ESBT)
Sorting
Fast inefficient ranking
Quicksort
Sample Sort
Multiway Mergesort
Selection
More on sorting
Sanders: Parallel Algorithms December 10, 2020 118
n elements, n² processors:
Input: A[1..n] // distinct elements
Output: M[1..m] // M[i] =rank of A[i]
i A B 1 2 3 4 5 <- j M
1 3 1 0 1 0 1 1
2 5 1 1 1 0 1 1
3 2 0 0 1 0 1 1
4 8 1 1 1 1 1 1
5 1 0 0 0 0 1 1
A 3 5 2 8 1
-------------------------------
i A B 1 2 3 4 5 <- j M
1 3 1 0 1 0 1 3
2 5 1 1 1 0 1 4
3 2 0 0 1 0 1 2
4 8 1 1 1 1 1 5
5 1 0 0 0 0 1 1
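A sequential simulation of this brute-force ranking (on the PRAM, each of the n² comparisons and each row sum would be computed in parallel):

#include <iostream>
#include <vector>

// M[i] = rank of A[i]: one PE per pair (i,j) computes the comparison,
// then n PEs per row sum it up (here simulated sequentially).
std::vector<int> bruteForceRank(const std::vector<int>& A) {
  std::size_t n = A.size();
  std::vector<int> M(n, 1);               // the rank counts A[i] itself
  for (std::size_t i = 0; i < n; ++i)     // n^2 "PEs"
    for (std::size_t j = 0; j < n; ++j)
      if (A[i] > A[j]) ++M[i];            // B[i][j] = [A[i] > A[j]]
  return M;
}

int main() {
  for (int r : bruteForceRank({3, 5, 2, 8, 1})) std::cout << r << ' ';  // 3 4 2 5 1
  std::cout << '\n';
}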
Sanders: Parallel Algorithms December 10, 2020 120
Time
O(α·log p + β·n/√p + (n/p)·log(n/p)).   (1)
Sanders: Parallel Algorithms December 10, 2020 122
Example
[Figure: 12 elements h g d l | e a m b | k i c j distributed over a grid of PEs]
Sanders: Parallel Algorithms December 10, 2020 123
[Figure: after a row-wise all-gather-merge every PE of a row holds its row's sorted sequence: d g h l | a b e m | c i j k]
Sanders: Parallel Algorithms December 10, 2020 124
[Figure: each PE ranks the elements of its row's sorted sequence against one column's elements, producing local rank vectors]
Sanders: Parallel Algorithms December 10, 2020 126
[Figure: summing the local ranks column-wise yields the global ranks — d g h l → 4 6 7 11, a b e m → 1 2 5 12, c i j k → 3 8 9 10]
Sanders: Parallel Algorithms December 10, 2020 127
Numerical Example:
[Plot: running time for uniform input as a function of n/p, p = 2^18]
Sanders: Parallel Algorithms December 10, 2020 130
Quicksort
Sequential
Procedure qSort(d[], n)
if n = 1 then return
select a pivot v
reorder the elements in d such that
d0 · · · dk−1 ≤ v < dk · · · dn−1
qSort([d0 , . . . , dk−1 ], k)
qSort([dk+1 , . . . , dn−1 ], n − k − 1)
Sanders: Parallel Algorithms December 10, 2020 131
Tpar = Ω (n)
For simplicity: n = p.
1. Select a pivot (randomly)
2. Broadcast
3. Local comparison
4. Prefix sums (→ target positions)
5. Redistribute data
6. Split PEs
7. Parallel recursion
Sanders: Parallel Algorithms December 10, 2020 133
Example
pivot v = 44
PE number:        0   1   2   3   4   5   6   7
value before:    44  77  11  55  00  33  66  22
elements ≤ pivot get positions 0 1 2 3 4 (prefix sum); elements > pivot get positions 0 1 2
value after:     44  11  00  33  22  77  55  66
new PE number:  0+0 0+1 0+2 0+3 0+4 5+0 5+1 5+2
Sanders: Parallel Algorithms December 10, 2020 135
/* determine a pivot */
int getPivot(int item, MPI_Comm comm, int nP)
{ int pivot = item;
int pivotPE = globalRandInt(nP);/* from random PE */
/* overwrite pivot by that one from pivotPE */
MPI_Bcast(&pivot, 1, MPI_INT, pivotPE, comm);
return pivot;
}
Analysis
Per recursion level:
– 2× broadcast
– 1× prefix sum (→ later)
Time O(α·log p)
For distributed memory it is bad that many elements are moved
Ω(log p) times.
⋯ Time O((n/p)·(log n + β·log p) + α·log² p)
Sanders: Parallel Algorithms December 10, 2020 139
Example: 10 elements on p′ = 3 PEs (i = 1, j = 3)
PE 1: 8 2 0   PE 2: 5 4 1 7   PE 3: 3 9 6
partition with pivot v: a-parts 2 0 | 1 | 3 (na = 4), b-parts 8 | 5 4 7 | 9 6 (nb = 6)
k′ = 4/(4 + 6) · 3 = 6/5, rounded: k = 1
PE 1 (i = j = 1): quickSort(2 0 1 3) → 0 1 2 3; PEs 2–3 continue with 8 5 4 7 9 6 (i = 2, j = 3, p′ = 2)
partition: a-parts 5 4 (na = 2), b-parts 8 7 9 6 (nb = 4), k′ = 2/(2 + 4) · 2 = 2/3, k = 1
PE 2 (i = j = 2): quickSort(5 4) → 4 5; PE 3 (i = j = 3): quickSort(8 7 9 6) → 6 7 8 9
Sanders: Parallel Algorithms December 10, 2020 141
Load Balance
∏_{i=1}^{k} (1 + 1/(p·(2/3)^i))
= exp(∑_{i=1}^{k} ln(1 + 1/(p·(2/3)^i)))
≤ exp(∑_{i=1}^{k} 1/(p·(2/3)^i))                 [ln(1 + x) ≤ x]
≤ exp((1/p) · ∑_{i=0}^{k} (3/2)^i)               [geometric sum]
= exp((1/p) · ((3/2)^{k+1} − 1)/(3/2 − 1))
≤ exp((3/p) · (3/2)^k) = e³ ≈ 20.1 .
Sanders: Parallel Algorithms December 10, 2020 142
Better Balance?
Janus-quicksort? Axtmann, Wiebigke, Sanders, IPDPS 2018
for small p′ choose pivot carefully
for small p′ (Θ(log p)) switch to sample sort?
Alternative: always halve the number of PEs, randomization, careful choice of pivot
Axtmann, Sanders, ALENEX 2017
Sanders: Parallel Algorithms December 10, 2020 144
[Plot: running time for uniform input as a function of n/p, p = 2^18]
Sanders: Parallel Algorithms December 10, 2020 145
Multi-Pivot Methods
Analysis
Tpar = O((n/p)·log p) [distribution] + Tseq(n/p) [local sorting] + Tall−to−all(p, n/p) [data exchange]
     ≈ Tseq(n)/p + 2·(n/p)·β + p·α
The idealizing assumption is realistic for (random) permutations.
Sanders: Parallel Algorithms December 10, 2020 147
Sample Sort
[Figure from the book: sample sort with p = 3 — input "of ikc ms ta" | "phr ej" | "ndqbgul"; sample a b c f h j l r; splitters s1 = c, s2 = j; partition into local buckets; move data; sort locally, which yields the globally sorted sequence a..u distributed over the PEs]
Sanders: Parallel Algorithms December 10, 2020 149
Lemma 2. a = O(log(n)/ε²) suffices such that, with probability
≥ 1 − 1/n, no PE gets more than (1 + ε)·n/p elements.
Sanders: Parallel Algorithms December 10, 2020 150
Lemma: a = O(log(n)/ε²) suffices such that with probability ≥ 1 − 1/n no PE gets
more than (1 + ε)·n/p elements.
Proof idea: We analyze an algorithm that chooses the global samples with replacement.
Let ⟨e_1, . . . , e_n⟩ denote the input in sorted order.
fail: some PE gets more than (1 + ε)·n/p elements
→ ∃ j: at most a samples stem from ⟨e_j, . . . , e_{j+(1+ε)n/p}⟩ (event E_j)
→ P[fail] ≤ n · P[E_j], j fixed.
Let X_i := 1 if sample s_i ∈ ⟨e_j, . . . , e_{j+(1+ε)n/p}⟩ and 0 otherwise; X := ∑_i X_i
E[X_i] = P[X_i = 1] = (1 + ε)/p
P[E_j] = P[X < a] = P[X < E[X]/(1 + ε)] ≈ P[X < (1 − ε)·E[X]]
Sanders: Parallel Algorithms December 10, 2020 151
Chernoff bound: P[X < (1 − ε)·E[X]] ≤ e^{−ε²·E[X]/2}, which is ≤ n^{−2} for a = Ω(log(n)/ε²).
TsampleSort(p, n) = Tfastsort(p, O(p·log(n)/ε²))   [sort sample; small if n ≫ p²·log p]
  + Tallgather(p)                 [collect/distribute splitters]
  + O((n/p)·log p)                [partition]
  + Tseq((1 + ε)·n/p)             [local sorting]
  + Tall−to−all(p, (1 + ε)·n/p)   [data exchange]
Sanders: Parallel Algorithms December 10, 2020 153
Sorting Samples
Using gather/gossiping
Using gather–merge
Fast ranking
Parallel quicksort
Recursively using sample sort
Sanders: Parallel Algorithms December 10, 2020 154
Using gather/gossiping: (p² log p / ε²)·Tcompr
Using gather–merge: (p²/ε²)·β, Tcompr
Fast ranking: p²·β, log p·Tcompr
Parallel quicksort: p²·β, log p·Tcompr
Recursively using sample sort
Sanders: Parallel Algorithms December 10, 2020 155
template<class Element>
{ random_device rd;
  mt19937 rndEngine(rd());
  locS.push_back(data[dataGen(rndEngine)]);
Sanders: Parallel Algorithms December 10, 2020 156
Find Splitters
for (size_t i = 0; i < p-1; ++i) s[i] = s[(a+1)*(i+1)]; // select splitters
s.resize(p-1);
Sanders: Parallel Algorithms December 10, 2020 157
Partition Locally
buckets[bound - s.begin()].push_back(el);
}
data.clear();
Sanders: Parallel Algorithms December 10, 2020 158
sDispls.push_back(0);
sCounts.push_back(bucket.size());
sDispls.push_back(bucket.size() + sDispls.back());
}
MPI_Alltoall(sCounts.data(),1,MPI_INT,rCounts.data(),1,MPI_INT,comm);
rDispls[0] = 0;
sort(rData.begin(), rData.end());
rData.swap(data);
}
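To tie these fragments together, a compact sequential sketch of sampling, splitter selection, and bucketing (names are mine; in the MPI version the buckets are then exchanged, e.g. with MPI_Alltoallv, and sorted locally):

#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

// Partition data into p buckets using ap random samples, as in sample sort.
std::vector<std::vector<int>> sampleSortBuckets(const std::vector<int>& data,
                                                std::size_t p, std::size_t a) {
  std::mt19937 rng(12345);
  std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
  std::vector<int> s;
  for (std::size_t i = 0; i < a * p; ++i) s.push_back(data[pick(rng)]);  // sample
  std::sort(s.begin(), s.end());
  std::vector<int> splitters;
  for (std::size_t i = 1; i < p; ++i) splitters.push_back(s[a * i]);     // every a-th sample
  std::vector<std::vector<int>> buckets(p);
  for (int el : data) {
    auto bound = std::upper_bound(splitters.begin(), splitters.end(), el);
    buckets[bound - splitters.begin()].push_back(el);                    // as in the fragment above
  }
  return buckets;
}

int main() {
  auto b = sampleSortBuckets({5, 3, 8, 1, 9, 2, 7, 4, 6, 0}, 3, 2);
  for (auto& bucket : b) { for (int x : bucket) std::cout << x << ' '; std::cout << '\n'; }
}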
Sanders: Parallel Algorithms December 10, 2020 160
[Plot: running time vs. input size per thread (2^10 .. 2^22 elements)]
Sanders: Parallel Algorithms December 10, 2020 161
Multisequence Selection
Splitter Selection
Processor i selects the element with global rank k = i·n/p.
Simple algorithm: quickSelect exploiting the sortedness of the sequences.
[Figure: each sorted sequence is split by comparing against a pivot v (yes/no)]
Sanders: Parallel Algorithms December 10, 2020 164
Idea:
Ordinary select but p× binary search instead of partitioning
Function msSelect(S : Array of Sequence of Element; k : N) : Array of N
for i := 1 to |S| do (ℓi , ri ):= (0, |Si |)
invariant ∀i : ℓi ..ri contains the splitting position of Si
invariant ∀i, j : ∀a ≤ ℓi , b > r j : Si [a] ≤ S j [b]
while ∃i : ℓi < ri do
v:= pickPivot(S, ℓ, r)
for i := 1 to |S| do mi := binarySearch(v, Si [ℓi ..ri ])
if ∑i mi ≥ k then r:= m else ℓ:= m
return ℓ
Sanders: Parallel Algorithms December 10, 2020 165
efficient if n ≫ p2 log p
deterministic (almost)
perfect load balance
somewhat worse constant factors than sample sort
Sanders: Parallel Algorithms December 10, 2020 166
Expected time
O(log n · (p·(log(n/p) + β) + α·log p))
Sanders: Parallel Algorithms December 10, 2020 167
Consider case n = p.
sample of size √p
k = Θ(√p / log p) splitters
Buckets have size ≤ cp/k elements whp
Allocate buckets of size 2cp/k
Write elements to random free positions within their bucket
Compactify using prefix sums
Recursion
Sanders: Parallel Algorithms December 10, 2020 169
Example
[Figure: sample & sort → splitters; move elements to their buckets; compact via prefix sums; sort each bucket → 0123456789a..z]
Sanders: Parallel Algorithms December 10, 2020 170
More on Sorting I
Cole's mergesort: [JáJá, Section 4.3.2]
Time O(n/p + log p), deterministic, EREW PRAM (CREW in
[JáJá]). Idea: pipelined parallel merge sort; uses (deterministic)
sampling to predict where data comes from.
More on Sorting II
Integer sorting: (close to) linear work; very fast algorithms on the CRCW PRAM.
[Plot: running time for uniform input as a function of n/p, p = 2^18]
Sanders: Parallel Algorithms December 10, 2020 173
Collective Communication
Broadcast
Reduction
Prefix sum
Not here: gather / scatter
Gossiping (= all-gather = gather + broadcast)
All-to-all personalized communication
– equal message lengths
– arbitrary message lengths, = h-relation
Sanders: Parallel Algorithms December 10, 2020 175
Prefix sums
[Figure: inclusive vs. exclusive prefix sums over PEs p−1, p−2, ..., 0]
Sanders: Parallel Algorithms December 10, 2020 176
Plain Pipeline
As in broadcast
[Figure: pipelined prefix sum along a linear array, steps 1–9]
Sanders: Parallel Algorithms December 10, 2020 177
Hypercube prefix sum (PE index i, d = log p):
x := σ := m_i    // x: prefix sum so far, σ: sum over the current subcube
for k := 0 to d − 1 do
  invariant σ = ⊗_{j = i[k..d−1]0^k}^{i[k..d−1]1^k} m@j
  y := σ@(i ⊕ 2^k)    // sendRecv
  if bit k of i then x := x ⊗ y    // the partner's subcube lies below i
  σ := σ ⊗ y
return x    // inclusive prefix ⊗_{j ≤ i} m@j
[Figure: example on a 3-dimensional hypercube — after step k each PE knows the sum of its subcube (a–d, e–h) and its own prefix (a–a, a–b, ..., e–f, ...)]
Analysis
Telephone model:
Tprefix = (α + nβ ) log p
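A hedged MPI sketch of the hypercube prefix sum above for single integers with ⊗ = + (assumes p is a power of two; MPI_Scan would give the same result):

#include <mpi.h>
#include <cstdio>

// Hypercube prefix sum of one int per PE; returns the inclusive prefix.
int hcPrefixSum(int m, MPI_Comm comm) {
  int i, p;
  MPI_Comm_rank(comm, &i);
  MPI_Comm_size(comm, &p);                        // assumed to be 2^d
  int x = m, sigma = m;                           // prefix / subcube sum
  for (int k = 1; k < p; k <<= 1) {               // dimensions 0..d-1
    int partner = i ^ k, y;
    MPI_Sendrecv(&sigma, 1, MPI_INT, partner, 0,
                 &y,     1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
    if (i & k) x += y;                            // partner holds the lower half
    sigma += y;
  }
  return x;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::printf("PE %d: prefix = %d\n", rank, hcPrefixSum(rank + 1, MPI_COMM_WORLD));
  MPI_Finalize();
}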
Inorder numbering of the nodes
Upward phase: as with reduction, but PE i stores ∑_{j=i′}^{i} x@j
(i′ is the first and i″ the last inorder number in i's subtree)
Downward phase: PE i receives ∑_{j=1}^{i′−1} x@j (the root receives 0!)
and forwards this value to the left;
the right subtree gets ∑_{j=1}^{i} x@j.
Each PE is only active once per phase → pipelining OK
[Figure: inorder-numbered tree with the ranges 1..i′−1, i′..i, i+1..i″]
Sanders: Parallel Algorithms December 10, 2020 180
Pseudocode
Function InOrderTree::prefixSum(m)
// upward phase:
x := 0; receive(leftChild, x)
z := 0; receive(rightChild, z)
send(parent, x + m + z)
// downward phase:
ℓ := 0; receive(parent, ℓ)
send(leftChild, ℓ)
send(rightChild, ℓ + x + m)
return ℓ + x + m
[Figure: PE i sends x + m + z upwards; downwards it forwards ℓ to the left child and ℓ + x + m to the right child]
Sanders: Parallel Algorithms December 10, 2020 181
23-Prefix Sums
[Figure: 23-prefix sums — odd-numbered packets use one of the two inorder-numbered trees, even-numbered packets the other; each PE i handles the ranges 1..i′−1, i′..i, i+1..i″ as before]
Sanders: Parallel Algorithms December 10, 2020 182
Analysis
Generalization:
Applies to any algorithm based on inorder numbered trees
ESBT does not work?
Sanders: Parallel Algorithms December 10, 2020 184
Gossiping
Each PE has a message m of length n.
At the end, each PE should know all messages.
Hypercube Algorithm
Example
g h gh gh efgh efgh abcdefgh abcdefgh
c d cd cd abcd abcd abcdefgh abcdefgh
e f ef ef efgh efgh abcdefgh abcdefgh
a b ab ab abcd abcd abcdefgh abcdefgh
Sanders: Parallel Algorithms December 10, 2020 186
Analysis
All-Reduce
Hypercube Algorithm
PE i
for j := d − 1 downto 0 do
  get from PE i ⊕ 2^j all its messages destined for my j-dimensional subcube
  move to PE i ⊕ 2^j all my messages destined for its j-dimensional subcube
Sanders: Parallel Algorithms December 10, 2020 188
Fully Connected:
1-Factor Algorithm
[König 1936]
p odd, i is the PE index:
for r := 0 to p − 1 do
  k := 2r mod p
  j := (k − i) mod p
  send(j, m_ij) || recv(j, m_ji)
pairwise communication (telephone model):
the partner of j's partner in round r is k − (k − j) ≡ j (mod p)
Time: p(nβ + α), optimal for n → ∞
[Figure: the pairings in rounds i = 0..4 for p = 5]
Sanders: Parallel Algorithms December 10, 2020 190
1-Factor Algorithm
p even:
// PE index j ∈ {0, . . . , p − 1}
for i := 0 to p − 2 do
  idle := (p/2) · i mod (p − 1)
  if j = p − 1 then exchange data with PE idle
  else if j = idle then exchange data with PE p − 1
  else exchange data with PE ((i − j) mod (p − 1))
[Figure: the pairings in rounds i = 0..3 for p = 6]
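A small helper (my sketch) that computes the 1-factor partner for even p; the assertion checks that the pairing is an involution, i.e., partner(i, partner(i, j)) == j:

#include <cassert>

// Communication partner of PE j in round i of the 1-factor algorithm, p even.
int partner(int i, int j, int p) {
  int idle = (p / 2) * i % (p - 1);
  if (j == p - 1) return idle;
  if (j == idle) return p - 1;
  return ((i - j) % (p - 1) + (p - 1)) % (p - 1);   // nonnegative modulo
}

int main() {
  const int p = 8;
  for (int i = 0; i < p - 1; ++i)
    for (int j = 0; j < p; ++j)
      assert(partner(i, partner(i, j), p) == j);    // every PE has exactly one partner
}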
The “Ostrich”-Algorithm
Ostrich-Analysis:
BSP-Model: Time L + gh
h-Relation
[König 1916]
Consider the bipartite multigraph G = ({s1, . . . , sp} ∪ {r1, . . . , rp}, E) with
|{(si, rj) ∈ E}| = number of packets from PE i to PE j.
Theorem: there is an edge coloring φ : E → {1..h}, i.e.,
no two equally colored edges are incident to the same node.
for j := 1 to h do
  send the messages with color j
optimal when postulating packet-wise delivery
[Figure: senders and receivers with edges colored 1..4]
Sanders: Parallel Algorithms December 10, 2020 195
Problems:
Computing edge colorings online is complicated and expensive.
Is it even optimal?
[Figure: a small example with three nodes]
Sanders: Parallel Algorithms December 10, 2020 197
[Figure: example pattern with nodes a0, a1, a2 and d0]
Sanders: Parallel Algorithms December 10, 2020 199
Two Triangles
[Figure: a 12-round schedule for the two triangles a0 a1 a2 and b0 b1 b2]
Sanders: Parallel Algorithms December 10, 2020 200
Reduction: h-Relation → ⌈h/2⌉ 2-Relations
Ignore the direction of the communications for now
Connect nodes with odd degree ⇒ all nodes have even degree
Euler tour technique: decompose the graph into edge-disjoint cycles
Direct the cycles in clockwise direction ⇒ indegree and outdegree ≤ ⌈h/2⌉
Build the bipartite graph (as before)
Color the bipartite graph
Color classes in the bipartite graph ⇒ edge-disjoint simple cycles in
the input graph (2-relations)
Reinstate the original communication direction
Sanders: Parallel Algorithms December 10, 2020 201
[Figure: the resulting rounds 1–6 of the decomposed schedule]
Odd p
Open Problems
Get rid of splitting into 5 subpackages?
Conjecture:
3
One h-Relation with ≤ hp packets can be delivered in ≈ h
8
steps.
Simplifying assumptions:
Sanders: Parallel Algorithms December 10, 2020 207
Sanders: Parallel Algorithms December 10, 2020 208
Abstract Description
Summary: All-to-All
Ostrich: Delegate to online, asynchronous routing.
Good when that is implemented well.
Comparison of approaches?
Sanders: Parallel Algorithms December 10, 2020 213
Applications
Priority-driven scheduling
Best first branch-and-bound:
Find best solution in a large, implicitly defined tree. (later more)
Naive Implementation
PE 0 manages a sequential PQ
All others send requests
Branch-and-Bound
H: tree (V, E) with bounded node degree
c(v): node costs — they grow when descending a path
v*: leaf with minimal cost
Ṽ := {v ∈ V : c(v) ≤ c(v*)}
m := |Ṽ|; simplification: m = Ω(p log p)
h: depth of H̃ (the subgraph of H induced by Ṽ)
Tx: time for generating the successors of a node
Tcoll: upper bound for broadcast, min-reduction, prefix sum, and routing
one element from/to a random partner;
O(α log p) on many networks.
Sanders: Parallel Algorithms December 10, 2020 218
Sequential Branch-and-Bound
Parallel Branch-and-Bound
Analysis
Theorem: Tpar = (m/p + h)·(Tx + O(TqueueOp))
Case 1 (at most m/p iterations): all processed nodes are in Ṽ.
Case 2 (at most h iterations): some nodes outside Ṽ are processed
→ the maximal path length from a node in the queue Q to the optimal solution v* decreases.
[Figure: H, Ṽ, and v*]
Sanders: Parallel Algorithms December 10, 2020 221
Our Approach
PE: 1 2 3 4
B&B Processes
New Nodes
Random
Placement
Local Queues
Top−Nodes
Filter p best
Assign to PEs
Sanders: Parallel Algorithms December 10, 2020 223
Sanders: Parallel Algorithms December 10, 2020 224
Parallel Implementation I
Insert
Sending: Tcoll
Sending: Tcoll
Local insertions: O((log p / log log p) · log(n/p)).
(Better with "advanced" local queues. Careful: amortized bounds
are not sufficient.)
Sanders: Parallel Algorithms December 10, 2020 226
Parallel Implementation I
deleteMin*
Procedure deleteMin*(Q1 , p)
Q0 := the O(log p) smallest elements of Q1
M:= select(Q0 , p) // later
enumerate M = e1 , . . . , e p
assign ei to PE i // use prefix sums
if maxi ei > min j Q1 @ j then expensive special case treatment
empty Q0 back into Q1
Sanders: Parallel Algorithms December 10, 2020 227
Analysis
Remove locally: O(log p · log(n/p))
Parallel Implementation II
PE: 1 2 3 4
Q1
Q0
Filter n best
Assign to PEs
Sanders: Parallel Algorithms December 10, 2020 229
Parallel Implementation II
Insert
Send: Tcoll
Insert locally: merge Q0 and the new elements, O(log p) whp.
Cleanup: empty Q0 every log p iterations;
cost O(log p · log(n/p)) per log p iterations,
i.e., average cost O(log(n/p))
Sanders: Parallel Algorithms December 10, 2020 230
Parallel Implementation II
deleteMin*
Procedure deleteMin*(Q0 , Q1 , p)
while |{e ∈ Q̆0 : e < min Q̆1 }| < p do
Q0 := Q0 ∪ {deleteMin(Q1 )}
M:= select(Q0 , p) // later
enumerate M = e1 , . . . , e p
assign ei to PE i // use prefix sums
Sanders: Parallel Algorithms December 10, 2020 231
Analysis
Remove locally: O(1) expected iterations ⇒ O(Tcoll + log(n/p))
Result
insert*: expected O(Tcoll + log(n/p))
deleteMin*: expected O(Tcoll + log(n/p))
Sanders: Parallel Algorithms December 10, 2020 233
choose a sample s
u := the element with rank (k/n)·|s| + ∆ in s
ℓ := the element with rank (k/n)·|s| − ∆ in s
Partition Q into
Q< := {q ∈ Q : q < ℓ},
Q> := {q ∈ Q : q > u},
Q′ := Q \ Q< \ Q>
If |Q<| < k and |Q<| + |Q′| ≥ k, output Q< and find the
k − |Q<| smallest elements of Q′.
All other cases are unlikely if |s|, ∆ are sufficiently large.
Sanders: Parallel Algorithms December 10, 2020 234
Parallel Implementation
|s| = √p, so the sample can be sorted in time O(Tcoll).
∆ = Θ(p^(1/4+ε)) for a small constant ε.
This makes difficult cases unlikely.
Procedure deleteMin*(Q0 , Q1 , p)
while |{e ∈ Q̆0 : e < min Q̆1 }| < p do
Q0 := Q0 ∪ {deleteMin(Q1 )} // select immediately
M:= select(Q0 , p) // later
enumerate M = e1 , . . . , e p
assign ei to PE i // use prefix sums
Or just use sufficiently many locally smallest elements and check later
Sanders: Parallel Algorithms December 10, 2020 237
Asynchronous Variant
[Plot: time T [ms] vs. n (up to 64)]
Sanders: Parallel Algorithms December 10, 2020 240
p = 256
insert 256 elements and a deleteMin*:
centralized: > 28.16ms
parallel: 3.73ms
break-even at 34 PEs
Sanders: Parallel Algorithms December 10, 2020 241
[Plot: throughput (MOps/s, up to 80) vs. number of threads (0–56) for MultiQ c=2, MultiQ HT c=2, MultiQ c=4, Spraylist, Linden, Lotan]
Sanders: Parallel Algorithms December 10, 2020 245
[Plot: cumulative frequency (10%–100%) for MultiQ c=2, MultiQ c=4, Spraylist, and the theoretical distributions for c=2 and c=4]
List Ranking
Motivation:
List Ranking
n: number of elements
L: list, given (in unordered storage) by S
S(i): successor of element i; S(i) = i at the end of the list
P(i): predecessor of element i
R(i): rank = distance of i to the end of the list
[Figure: example with elements i = 1..9, R initialized to 1 1 1 1 1 1 1 1 0 and final ranks 4 3 5 8 7 2 6 1 0]
Exercise: compute P in constant time on a PRAM with n PEs.
Motivation II
Lists are very simple graphs
Pointer Chasing
Analysis
Invariant (after iteration k): R(i) = 2^k or R(i) = final result
Proof: true for k = 0.
k → k + 1:
Case R(i) < 2^k: already the final value (IH)
Case R(i) = 2^k, R(Q(i)) < 2^k: now the final value (invariant, IH)
Case R(i) = R(Q(i)) = 2^k: now 2^{k+1}
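A sequential simulation of the doubling step (a sketch; Q and R as in the slides, with each synchronous round emulated by building new arrays before overwriting):

#include <iostream>
#include <vector>

// List ranking by pointer doubling: R[i] becomes the distance of i to the end.
std::vector<int> listRank(const std::vector<int>& S) {   // S[i] = successor, S[i] = i at the end
  std::size_t n = S.size();
  std::vector<int> Q = S, R(n, 1);
  for (std::size_t i = 0; i < n; ++i) if (S[i] == (int)i) R[i] = 0;
  for (std::size_t round = 0; (1u << round) < n; ++round) {   // ceil(log n) rounds
    std::vector<int> Q2 = Q, R2 = R;                          // one synchronous PRAM round
    for (std::size_t i = 0; i < n; ++i) {
      R2[i] = R[i] + R[Q[i]];
      Q2[i] = Q[Q[i]];
    }
    Q = Q2; R = R2;
  }
  return R;
}

int main() {
  // 0-based encoding of the 9-element example list from the slides
  std::vector<int> S = {1, 5, 0, 4, 6, 7, 2, 8, 8};
  for (int r : listRank(S)) std::cout << r << ' ';   // prints 4 3 5 8 7 2 6 1 0
  std::cout << '\n';
}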
[Figure: recursion via an independent set — the ranks of the remaining elements are computed recursively (R′ = 4 8 2 6 0) and the excluded elements then fill in their final ranks, giving R = 4 3 5 8 7 2 6 1 0]
Sanders: Parallel Algorithms December 10, 2020 254
f(i) = ∑_{j ≤ i} [j ∉ I]
Sanders: Parallel Algorithms December 10, 2020 256
Analysis
T(n) = O(n/p + log p) + T(4n/5) in expectation
O(log(n/p)) levels of recursion
Sum: O(n/p + log(n/p) · log p)   (geometric sum)
Linear work, time O(log n · log log n) with n/(log n · log log n) PEs
[Figure: recursion — the problem shrinks by a factor 4/5 per level over log(n/p) levels]
Sanders: Parallel Algorithms December 10, 2020 257
− DFS
− BFS
− shortest paths
  (nonnegative SSSP: O(n) parallel time; interesting for m = Ω(np))
  (what about APSP?)
− topological sorting
+ connected components (but not strongly connected components)
+ minimum spanning trees
+ graph partitioning
Sanders: Parallel Algorithms December 10, 2020 260
Find a tree (V, T ) with minimum weight ∑e∈T c(e) that connects all
nodes.
Sanders: Parallel Algorithms December 10, 2020 261
T := ∅
S := {s} for an arbitrary start node s
repeat n − 1 times
  find (u, v) fulfilling the cut property for S
  S := S ∪ {v}
  T := T ∪ {(u, v)}
[Figure: example graph with edge weights]
Sanders: Parallel Algorithms December 10, 2020 263
Analysis
Edge Contraction
forall (w, v) ∈ E do
  E := E \ {(w, v)} ∪ {(w, u)}   // but remember the original terminals
[Figure: contracting an edge of the example graph; the new weight-7 edge remembers its original terminals {2, 3}]
Sanders: Parallel Algorithms December 10, 2020 268
Boruvka’s Algorithm
[Figure: the example graph with edge weights]
Sanders: Parallel Algorithms December 10, 2020 269
Analysis (Sequential)
forall v ∈ V dopar
  allocate |Γ(v)|·p/(2m) processors to node v   // prefix sum
  find w such that c(v, w) is minimal among Γ(v)   // reduction
  output the original edge corresponding to (v, w)
  pred(v) := w
Time O(m/p + log p)
[Figure: the lightest incident edge chosen at each node of the example graph]
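A sequential sketch of one Borůvka round — the lightest incident edge per node is selected and the 2-cycles of the resulting pseudoforest are broken, exactly as in the pred(v) rule on the next slide (representation and names are mine):

#include <iostream>
#include <limits>
#include <vector>

struct Edge { int u, v; double w; };

// One Boruvka round: returns pred[] where pred[v] is v's lightest neighbour
// (pred[v] == v for the chosen root of each component).
std::vector<int> lightestEdges(int n, const std::vector<Edge>& E) {
  std::vector<double> best(n, std::numeric_limits<double>::infinity());
  std::vector<int> pred(n);
  for (int v = 0; v < n; ++v) pred[v] = v;
  for (const auto& e : E) {                      // a min-reduction per node
    if (e.w < best[e.u]) { best[e.u] = e.w; pred[e.u] = e.v; }
    if (e.w < best[e.v]) { best[e.v] = e.w; pred[e.v] = e.u; }
  }
  // break the 2-cycles of the pseudotrees: keep exactly one root per component
  for (int v = 0; v < n; ++v)
    if (v < pred[v] && pred[pred[v]] == v) pred[v] = v;
  return pred;                                    // MST edges: (v, pred[v]) for pred[v] != v
}

int main() {
  std::vector<Edge> E = {{0,1,1}, {1,2,5}, {0,3,9}, {1,3,7}, {2,4,2}, {3,4,4}};
  for (int p : lightestEdges(5, E)) std::cout << p << ' ';
  std::cout << '\n';
}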
Sanders: Parallel Algorithms December 10, 2020 271
out-degree 1, |C| edges per component C ⇒ each component is a pseudotree, i.e., a tree plus one edge
forall v ∈ V dopar
  w := pred(v)
  if v < w ∧ pred(w) = v then pred(v) := v
Time O(n/p)
[Figure: the pseudotrees before and after removing one edge of each 2-cycle]
Sanders: Parallel Algorithms December 10, 2020 273
Time O((n/p)·log n)
[Figure: pointer doubling turns each rooted tree into a star]
Sanders: Parallel Algorithms December 10, 2020 274
Contraction
k:= #components
V ′ = 1..k
find a bijective mapping f : star-roots→ 1..k // prefix sum
E′ := {(f(pred(u)), f(pred(v)), c, e_old) : (u, v, c, e_old) ∈ E ∧ pred(u) ≠ pred(v)}
Time O(m/p + log p)
[Figure: the contracted example graph]
Sanders: Parallel Algorithms December 10, 2020 275
Recursion
Analysis
Alternate
5. T ⇐ MST (L ∪ F) [Recursively].
T (n, m) ≤ T (n/8, m/2) + T (n/8, n/4) + c(n + m)
T (n, m) ≤ 2c(n + m) fulfills this recurrence.
Sanders: Parallel Algorithms December 10, 2020 279
Kruskal
[Plot: running times vs. m/n (1–16) for qKruskal, Kruskal8, filterKruskal+, filterKruskal, filterKruskal8, qJP, pJP]
Sanders: Parallel Algorithms December 10, 2020 281
Load Balancing
[Sanders Worsch 97]
Given
work to be done
PEs
Load balancing = assigning work → PEs
Goal: minimize parallel execution time
Sanders: Parallel Algorithms December 10, 2020 283
Measuring Cost
Maximal load: max_{i=1..p} ∑_{j ∈ jobs @ PE i} T(j, i, . . .)
In this Lecture
Independent jobs
– Sizes exactly known — fully parallel implementation
– Sizes unknown or inaccurate — random assignment,
master-worker-scheme, random polling
Sanders: Parallel Algorithms December 10, 2020 288
Sanders: Parallel Algorithms December 10, 2020 291
Time C + O(n/p + log p) if the jobs are initially distributed randomly.
Sanders: Parallel Algorithms December 10, 2020 292
[Figure: example jobs, their size prefix sums 0 2 5 9 13 15 20 23 25, and the PE boundaries 0, 7, 14, 21]
Sanders: Parallel Algorithms December 10, 2020 293
Atomic Jobs
assign job j to PE ⌊pos/C⌋   (pos: prefix sum of the job sizes)
Parallel: 11/9 · opt
[Anderson, Mayr, Warmuth 89]
Sanders: Parallel Algorithms December 10, 2020 294
[Figure: the same example (prefix sums 0 2 5 9 13 15 20 23 25, boundaries 0 7 14 21) and an optimal assignment of the jobs 4 3 4 3 5 2 2 2 3]
Sanders: Parallel Algorithms December 10, 2020 295
z_c(m) : N → C
z_c(0) := 0, z_c(m + 1) := z_c(m)² + c
M := {c ∈ C : z_c(m) is bounded} .
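A small sketch of the approximate membership test used for such a fractal computation (iteration bound and escape radius 2 are the usual choices, not prescribed by the slides):

#include <complex>
#include <iostream>

// Approximate test whether c is in the Mandelbrot set: iterate z <- z^2 + c and
// report how many iterations stay inside |z| <= 2 (the work varies strongly with c!).
int iterations(std::complex<double> c, int mMax = 1000) {
  std::complex<double> z = 0;
  for (int m = 0; m < mMax; ++m) {
    if (std::abs(z) > 2.0) return m;   // certainly unbounded
    z = z * z + c;
  }
  return mMax;                          // probably bounded
}

int main() {
  std::cout << iterations({0.0, 0.0}) << ' '     // stays bounded: prints 1000
            << iterations({1.0, 1.0}) << '\n';   // escapes quickly
}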
Sanders: Parallel Algorithms December 10, 2020 296
Approximate Computation
[Figure: the complex plane (real and imaginary part of c), cut into square patches numbered 0, 1, 2, ..., 7, 8, 9, ..., 15, 16, ..., 23]
Where is the load balancing problem?
Code
Random Assignment
Given: n jobs of size ℓ1 ,. . . , ℓn
Let L := ∑_{i ≤ n} ℓ_i
Discussion
one iteration
The Master-Worker-Scheme
Initially all jobs are on a master-PE
Job sizes can be unknown [Figure: master PE distributing jobs to worker PEs]
Discussion
+ Simple
+ Natural input-output scheme (but perhaps a separate disk slave)
+ Suggests itself when the job generator is not parallelized
+ Easy to debug
− communication bottleneck ⇒ tradeoff communication cost versus
imbalance
− How to split?
− Multi-level schemes are complicated and of limited help
Sanders: Parallel Algorithms December 10, 2020 306
Work Stealing
(Almost) arbitrarily subdivisible load
Initially all the work on PE 0
Almost nothing is known on job sizes
Preemption is allowed. (Successive splitting)
Sanders: Parallel Algorithms December 10, 2020 308
[Figure: a scrambled 15-puzzle position and the goal position]
Korf 85: iterative deepening depth-first search with ≈ 10^9 tree nodes.
Sanders: Parallel Algorithms December 10, 2020 309
#G.# #G.#
#GG# #.G#
#FF# #FF#
#G..# #G..#
#...# #...#
#G..# #...#
#...# #...#
#G??# #.??#
Sanders: Parallel Algorithms December 10, 2020 311
An Abstract Model:
Tree Shaped Computations
[Figure: tree-shaped computation — a subproblem is either worked on sequentially (if atomic), split into new subproblems, or empty; subproblems (description length l) can be sent to other processors, e.g. Proc. 1022]
Sanders: Parallel Algorithms December 10, 2020 313
Splitting Stacks
[Figure: two ways a) and b) of splitting a stack of subproblems]
Sanders: Parallel Algorithms December 10, 2020 316
An Application List
Discrete Mathematics (Toys?):
– Golomb Rulers
– Cellular Automata, Trellis Automata
– 15-Puzzle, n-Queens, Pentominoes . . .
NP-complete Problems (nondeterminism branching)
– 0/1 Knapsack Problem (fast!)
– Quadratic Assignment Problem
– SAT
Functional, Logical Programming Languages
Constraint Satisfaction, Planning, . . .
Numerical: Adaptive Integration, Nonlinear Optimization by Interval
Arithmetic, Eigenvalues of Tridiagonal Matrices
Sanders: Parallel Algorithms December 10, 2020 318
[Figure: request state machine of a PE — idle, send request, waiting; a request may be rejected]
Sanders: Parallel Algorithms December 10, 2020 320
Random Polling
[Figure: random polling — an idle PE sends a request to a randomly chosen PE, which splits its subproblem and sends one part back]
Sanders: Parallel Algorithms December 10, 2020 321
Õ (·) Calculus
X ∈ Õ ( f (n)) – iff ∀β > 0 :
Termination Detection
not here
Sanders: Parallel Algorithms December 10, 2020 323
Analysis
Theorem 6. For all ε > 0 there is a choice of ∆t and m such that
Tpar ≤ (1 + ε)·Tseq/p + Õ(Tatomic + h·(Trout(l) + Tcoll + Tsplit)).
[Figure: number of active PEs (up to p) over time — sequential work interleaved with splits]
Sanders: Parallel Algorithms December 10, 2020 325
Bounding Idleness
Lemma 7. Let m < p with m ∈ Ω(p).
Then Õ(h) iterations with at least m empty subproblems
suffice to ensure ∀P : gen(P) ≥ h.
[Figure: PEs 1..p over successive iterations — busy and idle phases, successful requests marked]
Sanders: Parallel Algorithms December 10, 2020 326
Busy phases
Lemma 8. There are at most Tseq/((p − m)·∆t) iterations with ≤ m idle
PEs at their end.
Sanders: Parallel Algorithms December 10, 2020 327
A Simplified Algorithm
P, P′ : Subproblem
P := if i_PE = 0 then P_root else P_∅
while not finished do
  P := work(P, ∆t)
  select a global value 0 ≤ s < p uniformly at random
  if T(P@((i_PE − s) mod p)) = 0 then
    (P, P@((i_PE − s) mod p)) := split(P)
Analysis
Theorem 10.
E[Tpar] ≤ (1 + ε)·Tseq/p + O((1/ε)·(Tatomic + h·(Trout + Tsplit)))
for an appropriate choice of ∆t.
Sanders: Parallel Algorithms December 10, 2020 330
[Figure: lower-bound example — a complete binary tree of depth log(Tseq/Tatomic) with atomic subproblems at the leaves, extended by chains of empty subproblems to depth h]
Additional term: h − log(Tseq/Tatomic).
Sanders: Parallel Algorithms December 10, 2020 332
[Figure: a second lower-bound example — log p − 1 upper levels and complete subtrees of depth log(2Tseq/(p·Tatomic)), total depth h]
Golomb Rulers
Find n marks {m_1, . . . , m_n} ⊆ N_0 with m_1 = 0, m_n = m (total length m) such that
|{m_j − m_i : 1 ≤ i < j ≤ n}| = n(n − 1)/2
Applications: radar astronomy, codes, . . .
[Figure: the Golomb ruler 0 1 4 10 12 17 — all 15 pairwise differences are distinct]
Sanders: Parallel Algorithms December 10, 2020 335
Many Processors
Parsytec GCel-3/1024 with COSY (PB)
Verification search
[Plot: speedup up to 1024 on the Parsytec; a second plot shows speedups on a LAN with 2–12 PEs]
Superlinear Speedup
Parsytec GCel-3/1024 under C OSY (PB)
1024 processors
2000 items
Splitting on all levels
256 random instances at the border between simple and difficult
overall 1410× faster than seq. computation!
Sanders: Parallel Algorithms December 10, 2020 340
[Plot: speedup (log scale, up to 65536) vs. sequential execution time [s]]
Sanders: Parallel Algorithms December 10, 2020 341
Fast Initialization
[Plot: speedup (up to 16) vs. sequential time [s]; curves: without initialization, with initialization]
Sanders: Parallel Algorithms December 10, 2020 342
Static vs Dynamic LB
[Plots: speedup (up to 16) vs. sequential time [s]; left: dynamic load balancing vs. 16 and 16384 static subproblems, right: dynamic vs. 16 and 256 static subproblems]
Sanders: Parallel Algorithms December 10, 2020 343
[Plot: speedup (128–640) vs. sequential time [s]]
Sanders: Parallel Algorithms December 10, 2020 344
MapReduce in 10 Minutes
[Google, DeanGhemawat OSDI 2004] see Wikipedia
// M ⊆ K × V
// MapF : K × V → K′ × V′
// ReduceF : K′ × 2^{V′} → V′′
Refinements
Fault Tolerance
Load balancing using hashing (default) and master-worker
Associative commutative reduce functions
Sanders: Parallel Algorithms December 10, 2020 349
Examples
Grep
URL access frequencies
build inverted index
Build reverse graph adjacency array
Sanders: Parallel Algorithms December 10, 2020 350
Graph Partitioning
Contraction
while |V | > c · k do
find a matching M ⊆ E
contract M // similar to the MST algorithm (but simpler)
Finding a Matching
expansion({u, v}) := ω({u, v}) / (c(u) + c(v))
expansion*({u, v}) := ω({u, v}) / (c(u)·c(v))
expansion*2({u, v}) := ω({u, v})² / (c(u)·c(v))
innerOuter({u, v}) := ω({u, v}) / (Out(u) + Out(v) − 2ω(u, v))
Sanders: Parallel Algorithms December 10, 2020 352
todo
Sanders: Parallel Algorithms December 10, 2020 353