Parallel Algorithms WS 20
Parallel Algorithms
Peter Sanders
Institute of Theoretical Informatics
Sanders: Parallel Algorithms December 10, 2020 2
save energy: two processors at half the clock frequency need less
power than one processor at full speed (power ≈ voltage² · clock frequency)
Overview
Models, simple examples
Matrix multiplication
Broadcasting
Sorting
General data exchange
Load balancing I, II, III
List ranking (conversion list → array)
Hashing, priority queues
Simple graph algorithms
Sanders: Parallel Algorithms December 10, 2020 5
Literature
Script (in German)
+
More Literature
[Sanders, Worsch],
Parallele Programmierung mit MPI – ein Praktikum, Logos, 1997.
Sanders: Parallel Algorithms December 10, 2020 7
Related Courses
Parallel programming: Tichy, Karl, Streit
Computer architecture (Rechnerarchitektur): Karl
GPUs: Dachsbacher
Algorithmenanalyse:
Count cycles: T(I), for a given problem instance I.
Worst case depending on problem size: T(n) = max_{|I|=n} T(I)
Average case: T_avg(n) = (∑_{|I|=n} T(I)) / |{I : |I| = n}|
Example: Quicksort has average case complexity O(n log n)
shared memory
Sanders: Parallel Algorithms December 10, 2020 14
Access Conflicts?
Example: Global Or
Input in x[1..p]
Parallel on PE i = 1..p
Global And
O(1) time
Θ(n²) PEs (!)
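To make the CRCW-style global OR concrete, here is a minimal shared-memory sketch (my own, not from the script): every PE whose input bit is set concurrently writes 1 into a common location, i.e., concurrent writes of the same value as in the COMMON CRCW model.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Global OR of x[0..p-1] in O(1) parallel time: every PE whose bit is set
// writes 1 into the shared result (all writers write the same value).
int globalOr(const std::vector<int>& x) {
  std::atomic<int> result{0};
  std::vector<std::thread> pes;
  for (std::size_t i = 0; i < x.size(); ++i)
    pes.emplace_back([&result, &x, i] {
      if (x[i]) result.store(1, std::memory_order_relaxed);
    });
  for (auto& t : pes) t.join();
  return result.load();
}

int main() { std::cout << globalOr({0, 0, 1, 0}) << '\n'; }  // prints 1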
Sanders: Parallel Algorithms December 10, 2020 17
i A B 1 2 3 4 5 <- j M
1 3 * 0 1 0 1 1
2 5 1 * 1 0 1 1
3 2 0 0 * 0 1 1
4 8 1 1 1 * 1 1
5 1 0 0 0 0 * 1
A 3 5 2 8 1
-------------------------------
i A B 1 2 3 4 5 <- j M
1 3 * 0 1 0 1 0
2 5 1 * 1 0 1 0
3 2 0 0 * 0 1 0
4 8 1 1 1 * 1 1->maxValue=8
5 1 0 0 0 0 * 0
Sanders: Parallel Algorithms December 10, 2020 18
Distributed Memory
[Figure: distributed memory — PEs 1..p, each with its own local memory, connected by a network]
Sanders: Parallel Algorithms December 10, 2020 21
[Figure: shared memory — processors 0..p−1 with caches, connected via a network to memory modules]
Sanders: Parallel Algorithms December 10, 2020 22
Problems
Function fetchAndAdd(a, ∆)
repeat
expected := ∗a
desired := expected + ∆
until CAS(a, expected, desired)
return desired
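In C++ the same CAS loop can be written with std::atomic (a sketch, not the script's code); compare_exchange_weak reloads expected on failure, which is why no explicit re-read is needed inside the loop.

#include <atomic>

// fetchAndAdd via a CAS loop; returns the new value, as in the pseudocode above.
int fetchAndAdd(std::atomic<int>& a, int delta) {
  int expected = a.load();
  int desired;
  do {
    desired = expected + delta;
    // on failure, compare_exchange_weak writes the current value of a into 'expected'
  } while (!a.compare_exchange_weak(expected, desired));
  return desired;
}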
Sanders: Parallel Algorithms December 10, 2020 26
PEM
[Figure: p processors, each with a private cache of size M, exchanging blocks of size B with a shared memory]
Sanders: Parallel Algorithms December 10, 2020 27
[Figure from the book: the real memory hierarchy — threads/cores with L1 and L2 caches, a shared L3 cache per processor, main memory and SSD within a compute node, further compute nodes reachable over the network, plus disks, tape, and the Internet]
Sanders: Parallel Algorithms December 10, 2020 29
Proposition: we get quite far with flat models, in particular for shared memory.
Explicit „Store-and-Forward“
We know the interconnection graph
(V = {1, . . . , p} , E ⊆ V ×V ). Variants:
– V = {1, . . . , p} ∪ R with additional
„dumb“ router nodes (perhaps with buffer memory).
– Buses → Hyperedges
In each time step each edge can transport up to k′ data packets of
constant length (usually k′ = 1)
Discussion
+ simple formulation
− low level ⇒ „messy algorithms“
− Hardware routers allow fast communication whenever an unloaded
communication path is found.
Sanders: Parallel Algorithms December 10, 2020 32
[Book]
Sanders: Parallel Algorithms December 10, 2020 33
BSP∗
BSP+
MapReduce
A ⊆ I
1: map      B = ⋃_{a∈A} µ(a) ⊆ K × V
2: shuffle
3: reduce   C = {(k, X) : k ∈ K ∧ X = {x : (k, x) ∈ B} ∧ X ≠ ∅}
            D = ⋃_{c∈C} ρ(c)
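A minimal sequential sketch of this µ/ρ formalism with word count as the usual example (all names are mine, not from the slides):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// map mu: one input line -> (word, 1) pairs
std::vector<KV> mapper(const std::string& line) {
  std::vector<KV> out;
  std::istringstream in(line);
  for (std::string w; in >> w;) out.push_back({w, 1});
  return out;
}

// reduce rho: (key, all values for that key) -> aggregated pair
KV reducer(const std::string& key, const std::vector<int>& vals) {
  int sum = 0;
  for (int v : vals) sum += v;
  return {key, sum};
}

int main() {
  std::vector<std::string> A = {"a rose is", "a rose"};
  std::vector<KV> B;                               // 1: map
  for (auto& a : A) for (auto& kv : mapper(a)) B.push_back(kv);
  std::map<std::string, std::vector<int>> C;       // 2: shuffle (group by key)
  for (auto& [k, v] : B) C[k].push_back(v);
  for (auto& [k, X] : C) {                          // 3: reduce
    auto [key, cnt] = reducer(k, X);
    std::cout << key << ": " << cnt << '\n';
  }
}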
Sanders: Parallel Algorithms December 10, 2020 41
MapReduce Discussion
+ Abstracts away difficult issues:
  * parallelization
  * load balancing
  * fault tolerance
  * memory hierarchies
  * ...
− Large overheads
− Limited functionality
Sanders: Parallel Algorithms December 10, 2020 43
µ and ρ use space O(n^{1−ε}) ("substantially sublinear")
overall space for B: O(n^{2−ε}) ("substantially subquadratic")
solvable in O(polylog(n)) MapReduce steps
2 supersteps:
1. Run the mapper µ on the local part of A and send each resulting pair (k, x) ∈ B to the PE responsible for key k (e.g., by hashing).
2. Receive elements of B. Build elements of C. Run the reducer.
Send all but the first result of a reducer to a random PE.
Sanders: Parallel Algorithms December 10, 2020 47
[Figure: emulating one MapReduce step in two BSP supersteps — superstep 1: map and shuffle; superstep 2: reduce]
Sanders: Parallel Algorithms December 10, 2020 48
Time
2L + O(w/p + g·m/p)
if w = Ω(ŵ·p·log p) and m = Ω(m̂·p·log p)
Sanders: Parallel Algorithms December 10, 2020 49
Circuits
k = 0: trivial
k → k + 1:
⊕_{i < 2^{k+1}} x_i = (⊕_{i < 2^k} x_i) ⊕ (⊕_{i < 2^k} x_{i+2^k})
Both operands are computable at depth k (induction hypothesis), so the whole sum has depth k + 1.
[Figure: the resulting circuit with levels 0 .. k + 1]
Sanders: Parallel Algorithms December 10, 2020 53
PRAM Code
PE index i ∈ {0, . . . , n − 1}
active := 1
for 0 ≤ k < ⌈log n⌉ do
  if active then
    if bit k of i then
      active := 0
    else if i + 2^k < n then
      x_i := x_i ⊕ x_{i+2^k}
// result is in x_0
[Figure: the reduction tree on 16 PEs 0..f]
Careful: much more complicated on a real asynchronous
shared-memory machine.
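A small sequential simulation of these synchronous rounds (my sketch; a real shared-memory version would additionally need a barrier after every round):

#include <cstddef>
#include <iostream>
#include <vector>

// Simulate the synchronous PRAM reduction: in round k (stride 2^k), every PE i
// with k+1 zero low bits adds its partner's value.
int pramSum(std::vector<int> x) {
  std::size_t n = x.size();
  for (std::size_t stride = 1; stride < n; stride *= 2)   // round k, stride = 2^k
    for (std::size_t i = 0; i < n; ++i)                   // "parallel" over PEs
      if (i % (2 * stride) == 0 && i + stride < n)        // PE active and partner exists
        x[i] += x[i + stride];
  return x[0];                                            // result is in x[0]
}

int main() { std::cout << pramSum({1, 2, 3, 4, 5}) << '\n'; }  // prints 15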
Analysis
n PEs
Time O(log n)
Speedup O(n/ log n)
Efficiency O(1/ log n)
[Figure: the reduction tree on 16 PEs]
Sanders: Parallel Algorithms December 10, 2020 55
p PEs
Each PE adds n/p elements sequentially.
Then parallel sum over the p partial sums.
Time Tseq(n/p) + Θ(log p)
Efficiency:
Tseq(n) / (p·(Tseq(n/p) + Θ(log p))) = 1/(1 + Θ(p log p)/n) = 1 − Θ(p log p / n)
if n ≫ p log p
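A hedged MPI sketch of this scheme — local summation followed by a logarithmic-depth reduction (the data distribution is invented for the example):

#include <mpi.h>
#include <cstdio>
#include <numeric>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, p;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  // Each PE holds n/p elements; here simply the value rank+1 repeated.
  std::vector<long long> local(1000, rank + 1);
  long long localSum = std::accumulate(local.begin(), local.end(), 0LL);  // Tseq(n/p)
  long long globalSum = 0;
  // Tree reduction over the p partial sums: Theta(log p) communication rounds.
  MPI_Reduce(&localSum, &globalSum, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) std::printf("sum = %lld\n", globalSum);
  MPI_Finalize();
}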
Sanders: Parallel Algorithms December 10, 2020 56
Analysis
Matrix Multiplication
Given: matrices A ∈ R^{n×n}, B ∈ R^{n×n}
with A = ((a_ij)) and B = ((b_ij))
R: a semiring
C = ((c_ij)) = A · B, well known:
c_ij = ∑_{k=1}^{n} a_ik · b_kj
work: Θ(n³) arithmetic operations
(better algorithms exist if R allows subtraction)
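For reference, the Θ(n³)-work sequential computation that the following slides parallelize (a plain sketch over the (+, ·) semiring):

#include <vector>

using Matrix = std::vector<std::vector<double>>;

// C = A * B over the (+, *) semiring; Theta(n^3) arithmetic operations.
Matrix multiply(const Matrix& A, const Matrix& B) {
  std::size_t n = A.size();
  Matrix C(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        C[i][j] += A[i][k] * B[k][j];   // c_ij = sum_k a_ik * b_kj
  return C;
}

int main() { Matrix A{{1, 2}, {3, 4}}, B{{5, 6}, {7, 8}}; return multiply(A, B)[0][0] == 19 ? 0 : 1; }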
Sanders: Parallel Algorithms December 10, 2020 60
n³ PEs
for i := 1 to n dopar
  for j := 1 to n dopar
    c_ij := ∑_{k=1}^{n} a_ik · b_kj   // parallel sum using n PEs
Distributed Implementation I
p ≤ n² PEs
for i := 1 to n dopar
  for j := 1 to n dopar
    c_ij := ∑_{k=1}^{n} a_ik · b_kj
− limited scalability
− high communication volume: time Ω(β · n²/√p)
[Figure: each PE owns an (n/√p) × (n/√p) block of C]
Sanders: Parallel Algorithms December 10, 2020 62
Assume p = N³ and n is a multiple of N.
View A, B, C as N × N matrices whose elements are (n/N) × (n/N) matrices.
[Figure: the 3D view with block indices i, j, k and block size n/N]
Sanders: Parallel Algorithms December 10, 2020 64
Broadcast / Reduction
[Figure: a message of length n and PEs 0, 1, 2, ..., p−1]
Reduction: turn around the direction of communication and add the corresponding parts of the arriving messages and the own message.
Modelling Assumptions
fully connected
full-duplex – parallel send and receive
Variants: half-duplex (i.e., send or receive), BSP, embedding into
concrete networks
Sanders: Parallel Algorithms December 10, 2020 69
Procedure naiveBroadcast(m[1..n])
PE 0: for i := 1 to p − 1 do send m to PE i
PE i > 0: receive m
Time: (p − 1)(nβ + α)
nightmare for implementing a scalable algorithm
[Figure: PE 0 sends the length-n message directly to each of PEs 1..p−1]
Sanders: Parallel Algorithms December 10, 2020 70
Procedure binomialTreeBroadcast(m[1..n])
PE index i ∈ {0, . . . , p − 1}
// Message m located on PE 0
if i > 0 then receive m
for k := min{⌈log p⌉ , trailingZeroes(i)} − 1 downto 0 do
  send m to PE i + 2^k   // noop if receiver ≥ p
[Figure: binomial tree broadcast on 16 PEs 0..f]
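A hedged MPI rendering of the binomial tree broadcast (my sketch, not the lecture's reference implementation):

#include <mpi.h>
#include <cstdio>
#include <vector>

// Binomial-tree broadcast of a buffer from PE 0, following the pseudocode above.
void binomialBroadcast(std::vector<char>& m, MPI_Comm comm) {
  int i, p;
  MPI_Comm_rank(comm, &i);
  MPI_Comm_size(comm, &p);
  int logp = 0;
  while ((1 << logp) < p) ++logp;                  // ceil(log2 p)
  int tz = 0;                                      // trailing zero bits of i
  while (i != 0 && tz < logp && ((i >> tz) & 1) == 0) ++tz;
  if (i == 0) tz = logp;                           // PE 0 behaves as if it had logp of them

  int n = static_cast<int>(m.size());
  if (i > 0)                                       // receive from parent i - 2^tz
    MPI_Recv(m.data(), n, MPI_CHAR, i - (1 << tz), 0, comm, MPI_STATUS_IGNORE);
  for (int k = tz - 1; k >= 0; --k)
    if (i + (1 << k) < p)                          // noop if receiver >= p
      MPI_Send(m.data(), n, MPI_CHAR, i + (1 << k), 0, comm);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::vector<char> m(1 << 20, rank == 0 ? 'x' : '?');
  binomialBroadcast(m, MPI_COMM_WORLD);
  std::printf("PE %d: m[0] = %c\n", rank, m[0]);
  MPI_Finalize();
}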
Sanders: Parallel Algorithms December 10, 2020 71
Analysis
Time: ⌈log p⌉ (nβ + α )
Optimal for n = 1
Embeddable into a linear array
Can the term n · f(p) be improved to n + log p?
[Figure: the binomial tree embedded into a linear array of 16 PEs]
Sanders: Parallel Algorithms December 10, 2020 72
Linear Pipeline
Procedure linearPipelineBroadcast(m[1..n], k)
PE index i ∈ {0, . . . , p − 1}
// Message m located on PE 0
// assume k divides n
define piece j as m[(j − 1)·n/k + 1 .. j·n/k]
for j := 1 to k do
  if i > 0 then receive piece j
  if i < p − 1 then send piece j to PE i + 1
[Figure: the k pieces moving through the linear array, one hop per step]
Analysis
Time (n/k)·β + α per step (≠ iteration)
5
[Plot: normalized broadcast time T/(nTbyte + ⌈log p⌉Tstart) vs. nTbyte/Tstart for p = 16; curves bino16, pipe16]
Sanders: Parallel Algorithms December 10, 2020 75
[Plot: the same comparison for p = 1024; curves bino1024, pipe1024]
Sanders: Parallel Algorithms December 10, 2020 76
Discussion
Linear pipelining is optimal for fixed p and n → ∞
But for large p extremely large messages are needed
Can the αp term be reduced to α log p?
Sanders: Parallel Algorithms December 10, 2020 77
Procedure binaryTreePipelinedBroadcast(m[1..n], k)
// Message m located on root, assume k divides n
define piece j as m[( j − 1) nk + 1.. j nk ]
for j := 1 to k do
if parent exists then receive piece j
if left child ℓ exists then send piece j to ℓ
if right child r exists then send piece j to r
[Figure: schedule of steps 1–13 — each PE alternates recv, send-left, send-right]
Sanders: Parallel Algorithms December 10, 2020 78
Example
[Figure: the schedule over steps 1–13, listing the recv / send-left / send-right operations per step]
Sanders: Parallel Algorithms December 10, 2020 79
Analysis
Time (n/k)·β + α per step (≠ iteration)
2j steps until the first packet reaches level j
d := ⌊log p⌋ levels
Overall: T(n, p, k) := (2d + 3(k − 1)) · ((n/k)·β + α)
optimal k: √(n(2d − 3)β / (3α))
substituted: T*(n, p) = 2dα + 3nβ + O(√(n·d·α·β))
Sanders: Parallel Algorithms December 10, 2020 81
Fibonacci-Trees
[Figure: Fibonacci trees of depth 0..4 with 1, 2, 4, 7, 12 nodes]
Analysis
Time (n/k)·β + α per step (≠ iteration)
j steps until the first packet reaches level j
How many PEs p_j are in levels 0..j?
p_0 = 1, p_1 = 2, p_j = p_{j−2} + p_{j−1} + 1; ask Maple:
rsolve({p(0)=1, p(1)=2, p(i)=p(i-2)+p(i-1)+1}, p(i));
p_j ≈ (3√5 + 5) / (5(√5 − 1)) · Φ^j ≈ 1.89 · Φ^j
with Φ = (1 + √5)/2 (golden ratio)
d ≈ log_Φ p levels
overall: T*(n, p) = dα + 3nβ + O(√(n·d·α·β))
Sanders: Parallel Algorithms December 10, 2020 83
Procedure fullDuplexBinaryTreePipelinedBroadcast(m[1..n], k)
// Message m located on root, assume k divides n
define piece j as m[( j − 1) nk + 1.. j nk ]
for j := 1 to k + 1 do
receive piece j from parent // noop for root or j = k + 1
and, concurrently, send piece j − 1 to right child
// noop if no such child or j = 1
send piece j to left child
// noop if no such child or j = k + 1
[Figure: alternating even and odd steps]
Sanders: Parallel Algorithms December 10, 2020 84
Analysis
Time (n/k)·β + α per step
j steps until the first packet reaches level j
d ≈ log_Φ p levels
Then 2 steps for each further packet
Overall: T*(n, p) = dα + 2nβ + O(√(n·d·α·β))
Sanders: Parallel Algorithms December 10, 2020 85
[Plot: normalized broadcast time vs. nTbyte/Tstart for p = 16; curves bino16, pipe16, btree16]
Sanders: Parallel Algorithms December 10, 2020 86
[Plot: the same comparison for p = 1024; curves bino1024, pipe1024, btree1024]
Sanders: Parallel Algorithms December 10, 2020 87
Discussion
General p:
use next larger tree. Then drop a subtree
Sanders: Parallel Algorithms December 10, 2020 88
H-Trees
Sanders: Parallel Algorithms December 10, 2020 89
H-Trees
Sanders: Parallel Algorithms December 10, 2020 90
Root Process
for j := 1 to k step 2 do
send piece j + 0 along edge labelled 0
send piece j + 1 along edge labelled 1
[Figure: the two trees on PEs 0..14 with edges labelled 0 and 1]
Sanders: Parallel Algorithms December 10, 2020 93
Other Processes
Wait for the first piece to arrive.
if it comes from the upper tree over an edge labelled b then
  ∆ := 2 · (distance of the node from the bottom in the upper tree)
for j := 1 to k + ∆ step 2 do
  along b-edges: receive piece j and send piece j − 2
  along (1 − b)-edges: receive piece j + 1 − ∆ and send piece j
[Figure: the two edge-labelled trees on PEs 0..14]
Sanders: Parallel Algorithms December 10, 2020 94
[Figure: two-tree (23-broadcast) construction for p = 14 and p = 13]
Sanders: Parallel Algorithms December 10, 2020 95
[Figure: construction for p = 13 and p = 12]
Sanders: Parallel Algorithms December 10, 2020 96
[Figure: construction for p = 12 and p = 11]
Sanders: Parallel Algorithms December 10, 2020 97
[Figure: construction for p = 11 and p = 10]
Sanders: Parallel Algorithms December 10, 2020 98
[Figure: constructions for p = 10, 9, 7, and 8]
Sanders: Parallel Algorithms December 10, 2020 99
Coloring Edges
[Figure: bipartite graph for the edge coloring — sender row s (PEs 0..14) and receiver row r (PEs 0..13)]
Sanders: Parallel Algorithms December 10, 2020 102
Analysis
Time (n/k)·β + α per step
2j steps until all PEs in level j are reached
d = ⌈log(p + 1)⌉ levels
Then 2 steps for 2 further packets
T(n, p, k) ≈ ((n/k)·β + α)·(2d + k − 1), with d ≈ log p
optimal k: √(n(2d − 1)β / α)
T*(n, p) ≈ nβ + 2α·log p + 2·√(n·log p·α·β)
Sanders: Parallel Algorithms December 10, 2020 105
[Plot: normalized broadcast time vs. nTbyte/Tstart for p = 16; curves bino16, pipe16, 2tree16]
Sanders: Parallel Algorithms December 10, 2020 106
[Plot: the same comparison for p = 1024; curves bino1024, pipe1024, 2tree1024]
Sanders: Parallel Algorithms December 10, 2020 107
23-Reduction
[Figure: 23-reduction on 15 PEs; the handling of the length-n message parts differs for PEs left of the root, the root itself, and PEs right of the root]
Sanders: Parallel Algorithms December 10, 2020 109
Hypercube Hd
p = 2d PEs
nodes V = {0, 1}^d, i.e., write node numbers in binary
edges in dimension i: E_i = {(u, v) : u ⊕ v = 2^i}
E = E_0 ∪ · · · ∪ E_{d−1}
[Figure: hypercubes of dimension d = 0, 1, 2, 3, 4]
Sanders: Parallel Algorithms December 10, 2020 111
ESBT-Broadcasting
In step i, communication runs along dimension i mod d
Decompose H_d into d edge-disjoint spanning binomial trees (ESBTs)
PE 0^d cyclically distributes the packets to the roots of the ESBTs
The ESBT roots perform binomial tree broadcasting
(except for the missing smallest subtree 0^d)
[Figure: decomposition of H_3 into three ESBTs; steps 0 mod 3, 1 mod 3, 2 mod 3]
Sanders: Parallel Algorithms December 10, 2020 112
Discussion
Which algorithm is best depends on n, on p, and on the network:
binomial tree — small n
linear pipeline — very large n
binary tree / 23-broadcast — large n
ESBT — large n, requires p = 2^d
special algorithms for special cases
Sanders: Parallel Algorithms December 10, 2020 114
Reality Check
Libraries (e.g. MPI) often do not have a pipelined implementation
of collective operations ⇒ your own broadcast may be significantly
faster than a library routine
Beyond Broadcast
Pipelining is an important technique for handling large data sets
Hypercube algorithms are often elegant and efficient (and often
simpler than ESBT)
Sorting
Fast inefficient ranking
Quicksort
Sample Sort
Multiway Mergesort
Selection
More on sorting
Sanders: Parallel Algorithms December 10, 2020 118
n elements, n² processors:
Input: A[1..n] // distinct elements
Output: M[1..m] // M[i] =rank of A[i]
i A B 1 2 3 4 5 <- j M
1 3 1 0 1 0 1 1
2 5 1 1 1 0 1 1
3 2 0 0 1 0 1 1
4 8 1 1 1 1 1 1
5 1 0 0 0 0 1 1
A 3 5 2 8 1
-------------------------------
i A B 1 2 3 4 5 <- j M
1 3 1 0 1 0 1 3
2 5 1 1 1 0 1 4
3 2 0 0 1 0 1 2
4 8 1 1 1 1 1 5
5 1 0 0 0 0 1 1
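A sequential simulation of this brute-force ranking (on the PRAM, each of the n² comparisons and each row sum would be computed in parallel):

#include <iostream>
#include <vector>

// M[i] = rank of A[i]: one PE per pair (i,j) computes the comparison,
// then n PEs per row sum it up (here simulated sequentially).
std::vector<int> bruteForceRank(const std::vector<int>& A) {
  std::size_t n = A.size();
  std::vector<int> M(n, 1);               // the rank counts A[i] itself
  for (std::size_t i = 0; i < n; ++i)     // n^2 "PEs"
    for (std::size_t j = 0; j < n; ++j)
      if (A[i] > A[j]) ++M[i];            // B[i][j] = [A[i] > A[j]]
  return M;
}

int main() {
  for (int r : bruteForceRank({3, 5, 2, 8, 1})) std::cout << r << ' ';  // 3 4 2 5 1
  std::cout << '\n';
}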
Sanders: Parallel Algorithms December 10, 2020 120
Time
O(α·log p + β·n/√p + (n/p)·log(n/p)).   (1)
Sanders: Parallel Algorithms December 10, 2020 122
Example
[Figure: 12 elements h g d l | e a m b | k i c j distributed over a grid of PEs]
Sanders: Parallel Algorithms December 10, 2020 123
[Figure: after a row-wise all-gather-merge every PE of a row holds its row's sorted sequence: d g h l | a b e m | c i j k]
Sanders: Parallel Algorithms December 10, 2020 124
[Figure: each PE ranks the elements of its row's sorted sequence against one column's elements, producing local rank vectors]
Sanders: Parallel Algorithms December 10, 2020 126
[Figure: summing the local ranks column-wise yields the global ranks — d g h l → 4 6 7 11, a b e m → 1 2 5 12, c i j k → 3 8 9 10]
Sanders: Parallel Algorithms December 10, 2020 127
Numerical Example:
[Plot: running time for uniform input as a function of n/p, p = 2^18]
Sanders: Parallel Algorithms December 10, 2020 130
Quicksort
Sequential
Procedure qSort(d[], n)
if n = 1 then return
select a pivot v
reorder the elements in d such that
d0 · · · dk−1 ≤ v < dk · · · dn−1
qSort([d0 , . . . , dk−1 ], k)
qSort([dk+1 , . . . , dn−1 ], n − k − 1)
Sanders: Parallel Algorithms December 10, 2020 131
Tpar = Ω (n)
For simplicity: n = p.
1. Select a pivot (randomly)
2. Broadcast
3. Local comparison
4. Prefix sums (→ target positions)
5. Redistribute data
6. Split PEs
7. Parallel recursion
Sanders: Parallel Algorithms December 10, 2020 133
Example
pivot v = 44
PE number:        0   1   2   3   4   5   6   7
value before:    44  77  11  55  00  33  66  22
elements ≤ pivot get positions 0 1 2 3 4 (prefix sum); elements > pivot get positions 0 1 2
value after:     44  11  00  33  22  77  55  66
new PE number:  0+0 0+1 0+2 0+3 0+4 5+0 5+1 5+2
Sanders: Parallel Algorithms December 10, 2020 135
/* determine a pivot */
int getPivot(int item, MPI_Comm comm, int nP)
{ int pivot = item;
int pivotPE = globalRandInt(nP);/* from random PE */
/* overwrite pivot by that one from pivotPE */
MPI_Bcast(&pivot, 1, MPI_INT, pivotPE, comm);
return pivot;
}
Analysis
Per recursion level:
– 2× broadcast
– 1× prefix sum (→ later)
Time O(α·log p)
For distributed memory it is bad that many elements are moved
Ω(log p) times.
⋯ Time O((n/p)·(log n + β·log p) + α·log² p)
Sanders: Parallel Algorithms December 10, 2020 139
Example: 10 elements on p′ = 3 PEs (i = 1, j = 3)
PE 1: 8 2 0   PE 2: 5 4 1 7   PE 3: 3 9 6
partition with pivot v: a-parts 2 0 | 1 | 3 (na = 4), b-parts 8 | 5 4 7 | 9 6 (nb = 6)
k′ = 4/(4 + 6) · 3 = 6/5, rounded: k = 1
PE 1 (i = j = 1): quickSort(2 0 1 3) → 0 1 2 3; PEs 2–3 continue with 8 5 4 7 9 6 (i = 2, j = 3, p′ = 2)
partition: a-parts 5 4 (na = 2), b-parts 8 7 9 6 (nb = 4), k′ = 2/(2 + 4) · 2 = 2/3, k = 1
PE 2 (i = j = 2): quickSort(5 4) → 4 5; PE 3 (i = j = 3): quickSort(8 7 9 6) → 6 7 8 9
Sanders: Parallel Algorithms December 10, 2020 141
Load Balance
∏_{i=1}^{k} (1 + 1/(p·(2/3)^i))
= exp(∑_{i=1}^{k} ln(1 + 1/(p·(2/3)^i)))
≤ exp(∑_{i=1}^{k} 1/(p·(2/3)^i))                 [ln(1 + x) ≤ x]
≤ exp((1/p) · ∑_{i=0}^{k} (3/2)^i)               [geometric sum]
= exp((1/p) · ((3/2)^{k+1} − 1)/(3/2 − 1))
≤ exp((3/p) · (3/2)^k) = e³ ≈ 20.1 .
Sanders: Parallel Algorithms December 10, 2020 142
Better Balance?
Janus-quicksort? Axtmann, Wiebigke, Sanders, IPDPS 2018
for small p′ choose pivot carefully
for small p′ (Θ(log p)) switch to sample sort?
Alternative: always halve the number of PEs, randomization, careful choice of pivot
Axtmann, Sanders, ALENEX 2017
Sanders: Parallel Algorithms December 10, 2020 144
[Plot: running time for uniform input as a function of n/p, p = 2^18]
Sanders: Parallel Algorithms December 10, 2020 145
Multi-Pivot Methods
Analysis
Tpar = O((n/p)·log p) [distribution] + Tseq(n/p) [local sorting] + Tall−to−all(p, n/p) [data exchange]
     ≈ Tseq(n)/p + 2·(n/p)·β + p·α
The idealizing assumption is realistic for (random) permutations.
Sanders: Parallel Algorithms December 10, 2020 147
Sample Sort
[Figure from the book: sample sort with p = 3 — input "of ikc ms ta" | "phr ej" | "ndqbgul"; sample a b c f h j l r; splitters s1 = c, s2 = j; partition into local buckets; move data; sort locally, which yields the globally sorted sequence a..u distributed over the PEs]
Sanders: Parallel Algorithms December 10, 2020 149
Lemma 2. a = O(log(n)/ε²) suffices such that, with probability
≥ 1 − 1/n, no PE gets more than (1 + ε)·n/p elements.
Sanders: Parallel Algorithms December 10, 2020 150
Lemma: a = O(log(n)/ε²) suffices such that with probability ≥ 1 − 1/n no PE gets
more than (1 + ε)·n/p elements.
Proof idea: We analyze an algorithm that chooses the global samples with replacement.
Let ⟨e_1, . . . , e_n⟩ denote the input in sorted order.
fail: some PE gets more than (1 + ε)·n/p elements
→ ∃ j: at most a samples stem from ⟨e_j, . . . , e_{j+(1+ε)n/p}⟩ (event E_j)
→ P[fail] ≤ n · P[E_j], j fixed.
Let X_i := 1 if sample s_i ∈ ⟨e_j, . . . , e_{j+(1+ε)n/p}⟩ and 0 otherwise; X := ∑_i X_i
E[X_i] = P[X_i = 1] = (1 + ε)/p
P[E_j] = P[X < a] = P[X < E[X]/(1 + ε)] ≈ P[X < (1 − ε)·E[X]]
Sanders: Parallel Algorithms December 10, 2020 151
Chernoff bound: P[X < (1 − ε)·E[X]] ≤ e^{−ε²·E[X]/2}, which is ≤ n^{−2} for a = Ω(log(n)/ε²).
TsampleSort(p, n) = Tfastsort(p, O(p·log(n)/ε²))   [sort sample; small if n ≫ p²·log p]
  + Tallgather(p)                 [collect/distribute splitters]
  + O((n/p)·log p)                [partition]
  + Tseq((1 + ε)·n/p)             [local sorting]
  + Tall−to−all(p, (1 + ε)·n/p)   [data exchange]
Sanders: Parallel Algorithms December 10, 2020 153
Sorting Samples
Using gather/gossiping
Using gather–merge
Fast ranking
Parallel quicksort
Recursively using sample sort
Sanders: Parallel Algorithms December 10, 2020 154
Using gather/gossiping: (p² log p / ε²)·Tcompr
Using gather–merge: (p²/ε²)·β, Tcompr
Fast ranking: p²·β, log p·Tcompr
Parallel quicksort: p²·β, log p·Tcompr
Recursively using sample sort
Sanders: Parallel Algorithms December 10, 2020 155
template<class Element>
{ random_device rd;
  mt19937 rndEngine(rd());
  locS.push_back(data[dataGen(rndEngine)]);
Sanders: Parallel Algorithms December 10, 2020 156
Find Splitters
for (size_t i = 0; i < p-1; ++i) s[i] = s[(a+1)*(i+1)]; // select splitters
s.resize(p-1);
Sanders: Parallel Algorithms December 10, 2020 157
Partition Locally
buckets[bound - s.begin()].push_back(el);
}
data.clear();
Sanders: Parallel Algorithms December 10, 2020 158
sDispls.push_back(0);
sCounts.push_back(bucket.size());
sDispls.push_back(bucket.size() + sDispls.back());
}
MPI_Alltoall(sCounts.data(),1,MPI_INT,rCounts.data(),1,MPI_INT,comm);
rDispls[0] = 0;
sort(rData.begin(), rData.end());
rData.swap(data);
}
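To tie these fragments together, a compact sequential sketch of sampling, splitter selection, and bucketing (names are mine; in the MPI version the buckets are then exchanged, e.g. with MPI_Alltoallv, and sorted locally):

#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

// Partition data into p buckets using ap random samples, as in sample sort.
std::vector<std::vector<int>> sampleSortBuckets(const std::vector<int>& data,
                                                std::size_t p, std::size_t a) {
  std::mt19937 rng(12345);
  std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
  std::vector<int> s;
  for (std::size_t i = 0; i < a * p; ++i) s.push_back(data[pick(rng)]);  // sample
  std::sort(s.begin(), s.end());
  std::vector<int> splitters;
  for (std::size_t i = 1; i < p; ++i) splitters.push_back(s[a * i]);     // every a-th sample
  std::vector<std::vector<int>> buckets(p);
  for (int el : data) {
    auto bound = std::upper_bound(splitters.begin(), splitters.end(), el);
    buckets[bound - splitters.begin()].push_back(el);                    // as in the fragment above
  }
  return buckets;
}

int main() {
  auto b = sampleSortBuckets({5, 3, 8, 1, 9, 2, 7, 4, 6, 0}, 3, 2);
  for (auto& bucket : b) { for (int x : bucket) std::cout << x << ' '; std::cout << '\n'; }
}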
Sanders: Parallel Algorithms December 10, 2020 160
[Plot: running time vs. input size per thread (2^10 .. 2^22 elements)]
Sanders: Parallel Algorithms December 10, 2020 161
Multisequence Selection
Splitter Selection
Processor i selects the element with global rank k = i·n/p.
Simple algorithm: quickSelect exploiting the sortedness of the sequences.
[Figure: each sorted sequence is split by comparing against a pivot v (yes/no)]
Sanders: Parallel Algorithms December 10, 2020 164
Idea:
Ordinary select but p× binary search instead of partitioning
Function msSelect(S : Array of Sequence of Element; k : N) : Array of N
for i := 1 to |S| do (ℓi , ri ):= (0, |Si |)
invariant ∀i : ℓi ..ri contains the splitting position of Si
invariant ∀i, j : ∀a ≤ ℓi , b > r j : Si [a] ≤ S j [b]
while ∃i : ℓi < ri do
v:= pickPivot(S, ℓ, r)
for i := 1 to |S| do mi := binarySearch(v, Si [ℓi ..ri ])
if ∑i mi ≥ k then r:= m else ℓ:= m
return ℓ
Sanders: Parallel Algorithms December 10, 2020 165
efficient if n ≫ p2 log p
deterministic (almost)
perfect load balance
somewhat worse constant factors than sample sort
Sanders: Parallel Algorithms December 10, 2020 166
Expected time
O(log n · (p·(log(n/p) + β) + α·log p))
Sanders: Parallel Algorithms December 10, 2020 167
Consider case n = p.
sample of size √p
k = Θ(√p / log p) splitters
Buckets have size ≤ cp/k elements whp
Allocate buckets of size 2cp/k
Write elements to random free positions within their bucket
Compactify using prefix sums
Recursion
Sanders: Parallel Algorithms December 10, 2020 169
Example
[Figure: sample & sort → splitters; move elements to their buckets; compact via prefix sums; sort each bucket → 0123456789a..z]
Sanders: Parallel Algorithms December 10, 2020 170
More on Sorting I
Cole's mergesort: [JáJá, Section 4.3.2]
Time O(n/p + log p), deterministic, EREW PRAM (CREW in
[JáJá]). Idea: pipelined parallel merge sort; uses (deterministic)
sampling to predict where data comes from.
More on Sorting II
Integer sorting: (close to) linear work; very fast algorithms on the CRCW PRAM.
[Plot: running time for uniform input as a function of n/p, p = 2^18]
Sanders: Parallel Algorithms December 10, 2020 173
Collective Communication
Broadcast
Reduction
Prefix sum
Not here: gather / scatter
Gossiping (= all-gather = gather + broadcast)
All-to-all personalized communication
– equal message lengths
– arbitrary message lengths, = h-relation
Sanders: Parallel Algorithms December 10, 2020 175
Prefix sums
[Figure: inclusive vs. exclusive prefix sums over PEs p−1, p−2, ..., 0]
Sanders: Parallel Algorithms December 10, 2020 176
Plain Pipeline
As in broadcast
[Figure: pipelined prefix sum along a linear array, steps 1–9]
Sanders: Parallel Algorithms December 10, 2020 177
Hypercube prefix sum (PE index i, d = log p):
x := σ := m_i    // x: prefix sum so far, σ: sum over the current subcube
for k := 0 to d − 1 do
  invariant σ = ⊗_{j = i[k..d−1]0^k}^{i[k..d−1]1^k} m@j
  y := σ@(i ⊕ 2^k)    // sendRecv
  if bit k of i then x := x ⊗ y    // the partner's subcube lies below i
  σ := σ ⊗ y
return x    // inclusive prefix ⊗_{j ≤ i} m@j
[Figure: example on a 3-dimensional hypercube — after step k each PE knows the sum of its subcube (a–d, e–h) and its own prefix (a–a, a–b, ..., e–f, ...)]
Analysis
Telephone model:
Tprefix = (α + nβ ) log p
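A hedged MPI sketch of the hypercube prefix sum above for single integers with ⊗ = + (assumes p is a power of two; MPI_Scan would give the same result):

#include <mpi.h>
#include <cstdio>

// Hypercube prefix sum of one int per PE; returns the inclusive prefix.
int hcPrefixSum(int m, MPI_Comm comm) {
  int i, p;
  MPI_Comm_rank(comm, &i);
  MPI_Comm_size(comm, &p);                        // assumed to be 2^d
  int x = m, sigma = m;                           // prefix / subcube sum
  for (int k = 1; k < p; k <<= 1) {               // dimensions 0..d-1
    int partner = i ^ k, y;
    MPI_Sendrecv(&sigma, 1, MPI_INT, partner, 0,
                 &y,     1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
    if (i & k) x += y;                            // partner holds the lower half
    sigma += y;
  }
  return x;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::printf("PE %d: prefix = %d\n", rank, hcPrefixSum(rank + 1, MPI_COMM_WORLD));
  MPI_Finalize();
}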
Inorder numbering of the nodes
Upward phase: as with reduction, but PE i stores ∑_{j=i′}^{i} x@j
(i′ is the first and i″ the last inorder number in i's subtree)
Downward phase: PE i receives ∑_{j=1}^{i′−1} x@j (the root receives 0!)
and forwards this value to the left;
the right subtree gets ∑_{j=1}^{i} x@j.
Each PE is only active once per phase → pipelining OK
[Figure: inorder-numbered tree with the ranges 1..i′−1, i′..i, i+1..i″]
Sanders: Parallel Algorithms December 10, 2020 180
Pseudocode
Function InOrderTree::prefixSum(m)
// upward phase:
x := 0; receive(leftChild, x)
z := 0; receive(rightChild, z)
send(parent, x + m + z)
// downward phase:
ℓ := 0; receive(parent, ℓ)
send(leftChild, ℓ)
send(rightChild, ℓ + x + m)
return ℓ + x + m
[Figure: PE i sends x + m + z upwards; downwards it forwards ℓ to the left child and ℓ + x + m to the right child]
Sanders: Parallel Algorithms December 10, 2020 181
23-Prefix Sums
[Figure: 23-prefix sums — odd-numbered packets use one of the two inorder-numbered trees, even-numbered packets the other; each PE i handles the ranges 1..i′−1, i′..i, i+1..i″ as before]
Sanders: Parallel Algorithms December 10, 2020 182
Analysis
Generalization:
Applies to any algorithm based on inorder numbered trees
ESBT does not work?
Sanders: Parallel Algorithms December 10, 2020 184
Gossiping
Each PE has a message m of length n.
At the end, each PE should know all messages.
Hypercube Algorithm
Example
g h gh gh efgh efgh abcdefgh abcdefgh
c d cd cd abcd abcd abcdefgh abcdefgh
e f ef ef efgh efgh abcdefgh abcdefgh
a b ab ab abcd abcd abcdefgh abcdefgh
Sanders: Parallel Algorithms December 10, 2020 186
Analysis
All-Reduce
Hypercube Algorithm
PE i
for j := d − 1 downto 0 do
  get from PE i ⊕ 2^j all its messages destined for my j-dimensional subcube
  move to PE i ⊕ 2^j all my messages destined for its j-dimensional subcube
Sanders: Parallel Algorithms December 10, 2020 188
Fully Connected:
1-Factor Algorithm
[König 1936]
p odd, i is the PE index:
for r := 0 to p − 1 do
  k := 2r mod p
  j := (k − i) mod p
  send(j, m_ij) || recv(j, m_ji)
pairwise communication (telephone model):
the partner of j's partner in round r is k − (k − j) ≡ j (mod p)
Time: p(nβ + α), optimal for n → ∞
[Figure: the pairings in rounds i = 0..4 for p = 5]
Sanders: Parallel Algorithms December 10, 2020 190
1-Factor Algorithm
p even:
// PE index j ∈ {0, . . . , p − 1}
for i := 0 to p − 2 do
  idle := (p/2) · i mod (p − 1)
  if j = p − 1 then exchange data with PE idle
  else if j = idle then exchange data with PE p − 1
  else exchange data with PE ((i − j) mod (p − 1))
[Figure: the pairings in rounds i = 0..3 for p = 6]
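A small helper (my sketch) that computes the 1-factor partner for even p; the assertion checks that the pairing is an involution, i.e., partner(i, partner(i, j)) == j:

#include <cassert>

// Communication partner of PE j in round i of the 1-factor algorithm, p even.
int partner(int i, int j, int p) {
  int idle = (p / 2) * i % (p - 1);
  if (j == p - 1) return idle;
  if (j == idle) return p - 1;
  return ((i - j) % (p - 1) + (p - 1)) % (p - 1);   // nonnegative modulo
}

int main() {
  const int p = 8;
  for (int i = 0; i < p - 1; ++i)
    for (int j = 0; j < p; ++j)
      assert(partner(i, partner(i, j), p) == j);    // every PE has exactly one partner
}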
The “Ostrich”-Algorithm
Ostrich-Analysis:
BSP-Model: Time L + gh
h-Relation
[König 1916]
Consider the bipartite multigraph G = ({s1, . . . , sp} ∪ {r1, . . . , rp}, E) with
|{(si, rj) ∈ E}| = number of packets from PE i to PE j.
Theorem: there is an edge coloring φ : E → {1..h}, i.e.,
no two equally colored edges are incident to the same node.
for j := 1 to h do
  send the messages with color j
optimal when postulating packet-wise delivery
[Figure: senders and receivers with edges colored 1..4]
Sanders: Parallel Algorithms December 10, 2020 195
Problems:
Computing edge colorings online is complicated and expensive.
Is it even optimal?
[Figure: a small example with three nodes]
Sanders: Parallel Algorithms December 10, 2020 197
[Figure: example pattern with nodes a0, a1, a2 and d0]
Sanders: Parallel Algorithms December 10, 2020 199
Two Triangles
[Figure: a 12-round schedule for the two triangles a0 a1 a2 and b0 b1 b2]
Sanders: Parallel Algorithms December 10, 2020 200
Reduction: h-Relation → ⌈h/2⌉ 2-Relations
Ignore the direction of the communications for now
Connect nodes with odd degree ⇒ all nodes have even degree
Euler tour technique: decompose the graph into edge-disjoint cycles
Direct the cycles in clockwise direction ⇒ indegree and outdegree ≤ ⌈h/2⌉
Build the bipartite graph (as before)
Color the bipartite graph
Color classes in the bipartite graph ⇒ edge-disjoint simple cycles in
the input graph (2-relations)
Reinstate the original communication direction
Sanders: Parallel Algorithms December 10, 2020 201
[Figure: the resulting rounds 1–6 of the decomposed schedule]
Odd p
Open Problems
Get rid of splitting into 5 subpackages?
Conjecture:
3
One h-Relation with ≤ hp packets can be delivered in ≈ h
8
steps.
Simplifying assumptions:
Sanders: Parallel Algorithms December 10, 2020 207
Sanders: Parallel Algorithms December 10, 2020 208
Abstract Description
Summary: All-to-All
Ostrich: Delegate to online, asynchronous routing.
Good when that is implemented well.
Comparison of approaches?
Sanders: Parallel Algorithms December 10, 2020 213
Applications
Priority-driven scheduling
Best first branch-and-bound:
Find best solution in a large, implicitly defined tree. (later more)
Naive Implementation
PE 0 manages a sequential PQ
All others send requests
Branch-and-Bound
H: tree (V, E) with bounded node degree
c(v): node costs — they grow when descending a path
v*: leaf with minimal cost
Ṽ := {v ∈ V : c(v) ≤ c(v*)}
m := |Ṽ|; simplification: m = Ω(p log p)
h: depth of H̃ (the subgraph of H induced by Ṽ)
Tx: time for generating the successors of a node
Tcoll: upper bound for broadcast, min-reduction, prefix sum, and routing
one element from/to a random partner;
O(α log p) on many networks.
Sanders: Parallel Algorithms December 10, 2020 218
Sequential Branch-and-Bound
Parallel Branch-and-Bound
Analysis
Theorem: Tpar = (m/p + h)·(Tx + O(TqueueOp))
Case 1 (at most m/p iterations): all processed nodes are in Ṽ.
Case 2 (at most h iterations): some nodes outside Ṽ are processed
→ the maximal path length from a node in the queue Q to the optimal solution v* decreases.
[Figure: H, Ṽ, and v*]
Sanders: Parallel Algorithms December 10, 2020 221
Our Approach
PE: 1 2 3 4
B&B Processes
New Nodes
Random
Placement
Local Queues
Top−Nodes
Filter p best
Assign to PEs
Sanders: Parallel Algorithms December 10, 2020 223
Sanders: Parallel Algorithms December 10, 2020 224
Parallel Implementation I
Insert
Sending: Tcoll
Sending: Tcoll
Local insertions: O((log p / log log p) · log(n/p)).
(Better with "advanced" local queues. Careful: amortized bounds
are not sufficient.)
Sanders: Parallel Algorithms December 10, 2020 226
Parallel Implementation I
deleteMin*
Procedure deleteMin*(Q1 , p)
Q0 := the O(log p) smallest elements of Q1
M:= select(Q0 , p) // later
enumerate M = e1 , . . . , e p
assign ei to PE i // use prefix sums
if maxi ei > min j Q1 @ j then expensive special case treatment
empty Q0 back into Q1
Sanders: Parallel Algorithms December 10, 2020 227
Analysis
Remove locally: O(log p · log(n/p))
Parallel Implementation II
PE: 1 2 3 4
Q1
Q0
Filter n best
Assign to PEs
Sanders: Parallel Algorithms December 10, 2020 229
Parallel Implementation II
Insert
Send: Tcoll
Insert locally: merge Q0 and the new elements, O(log p) whp.
Cleanup: empty Q0 every log p iterations;
cost O(log p · log(n/p)) per log p iterations,
i.e., average cost O(log(n/p))
Sanders: Parallel Algorithms December 10, 2020 230
Parallel Implementation II
deleteMin*
Procedure deleteMin*(Q0 , Q1 , p)
while |{e ∈ Q̆0 : e < min Q̆1 }| < p do
Q0 := Q0 ∪ {deleteMin(Q1 )}
M:= select(Q0 , p) // later
enumerate M = e1 , . . . , e p
assign ei to PE i // use prefix sums
Sanders: Parallel Algorithms December 10, 2020 231
Analysis
Remove locally: O(1) expected iterations ⇒ O(Tcoll + log(n/p))
Result
insert*: expected O(Tcoll + log(n/p))
deleteMin*: expected O(Tcoll + log(n/p))
Sanders: Parallel Algorithms December 10, 2020 233
choose a sample s
u := the element with rank (k/n)·|s| + ∆ in s
ℓ := the element with rank (k/n)·|s| − ∆ in s
Partition Q into
Q< := {q ∈ Q : q < ℓ},
Q> := {q ∈ Q : q > u},
Q′ := Q \ Q< \ Q>
If |Q<| < k and |Q<| + |Q′| ≥ k, output Q< and find the
k − |Q<| smallest elements of Q′.
All other cases are unlikely if |s|, ∆ are sufficiently large.
Sanders: Parallel Algorithms December 10, 2020 234
Parallel Implementation
|s| = √p, so the sample can be sorted in time O(Tcoll).
∆ = Θ(p^(1/4+ε)) for a small constant ε.
This makes difficult cases unlikely.
Procedure deleteMin*(Q0 , Q1 , p)
while |{e ∈ Q̆0 : e < min Q̆1 }| < p do
Q0 := Q0 ∪ {deleteMin(Q1 )} // select immediately
M:= select(Q0 , p) // later
enumerate M = e1 , . . . , e p
assign ei to PE i // use prefix sums
Or just use sufficiently many locally smallest elements and check later
Sanders: Parallel Algorithms December 10, 2020 237
Asynchronous Variant
[Plot: time T [ms] vs. n (up to 64)]
Sanders: Parallel Algorithms December 10, 2020 240
p = 256
insert 256 elements and a deleteMin*:
centralized: > 28.16ms
parallel: 3.73ms
break-even at 34 PEs
Sanders: Parallel Algorithms December 10, 2020 241
[Plot: throughput (MOps/s, up to 80) vs. number of threads (0–56) for MultiQ c=2, MultiQ HT c=2, MultiQ c=4, Spraylist, Linden, Lotan]
Sanders: Parallel Algorithms December 10, 2020 245
[Plot: cumulative frequency (10%–100%) for MultiQ c=2, MultiQ c=4, Spraylist, and the theoretical distributions for c=2 and c=4]
List Ranking
Motivation:
List Ranking
n: number of elements
L: list, given (in unordered storage) by S
S(i): successor of element i; S(i) = i at the end of the list
P(i): predecessor of element i
R(i): rank = distance of i to the end of the list
[Figure: example with elements i = 1..9, R initialized to 1 1 1 1 1 1 1 1 0 and final ranks 4 3 5 8 7 2 6 1 0]
Exercise: compute P in constant time on a PRAM with n PEs.
Motivation II
Lists are very simple graphs
Pointer Chasing
Analysis
Invariant (after iteration k): R(i) = 2^k or R(i) = final result
Proof: true for k = 0.
k → k + 1:
Case R(i) < 2^k: already the final value (IH)
Case R(i) = 2^k, R(Q(i)) < 2^k: now the final value (invariant, IH)
Case R(i) = R(Q(i)) = 2^k: now 2^{k+1}
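A sequential simulation of the doubling step (a sketch; Q and R as in the slides, with each synchronous round emulated by building new arrays before overwriting):

#include <iostream>
#include <vector>

// List ranking by pointer doubling: R[i] becomes the distance of i to the end.
std::vector<int> listRank(const std::vector<int>& S) {   // S[i] = successor, S[i] = i at the end
  std::size_t n = S.size();
  std::vector<int> Q = S, R(n, 1);
  for (std::size_t i = 0; i < n; ++i) if (S[i] == (int)i) R[i] = 0;
  for (std::size_t round = 0; (1u << round) < n; ++round) {   // ceil(log n) rounds
    std::vector<int> Q2 = Q, R2 = R;                          // one synchronous PRAM round
    for (std::size_t i = 0; i < n; ++i) {
      R2[i] = R[i] + R[Q[i]];
      Q2[i] = Q[Q[i]];
    }
    Q = Q2; R = R2;
  }
  return R;
}

int main() {
  // 0-based encoding of the 9-element example list from the slides
  std::vector<int> S = {1, 5, 0, 4, 6, 7, 2, 8, 8};
  for (int r : listRank(S)) std::cout << r << ' ';   // prints 4 3 5 8 7 2 6 1 0
  std::cout << '\n';
}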
[Figure: recursion via an independent set — the ranks of the remaining elements are computed recursively (R′ = 4 8 2 6 0) and the excluded elements then fill in their final ranks, giving R = 4 3 5 8 7 2 6 1 0]
Sanders: Parallel Algorithms December 10, 2020 254
f(i) = ∑_{j ≤ i} [j ∉ I]
Sanders: Parallel Algorithms December 10, 2020 256
Analysis
T(n) = O(n/p + log p) + T(4n/5) in expectation
O(log(n/p)) levels of recursion
Sum: O(n/p + log(n/p) · log p)   (geometric sum)
Linear work, time O(log n · log log n) with n/(log n · log log n) PEs
[Figure: recursion — the problem shrinks by a factor 4/5 per level over log(n/p) levels]
Sanders: Parallel Algorithms December 10, 2020 257
− DFS
− BFS
− shortest paths
  (nonnegative SSSP: O(n) parallel time; interesting for m = Ω(np))
  (what about APSP?)
− topological sorting
+ connected components (but not strongly connected components)
+ minimum spanning trees
+ graph partitioning
Sanders: Parallel Algorithms December 10, 2020 260
Find a tree (V, T ) with minimum weight ∑e∈T c(e) that connects all
nodes.
Sanders: Parallel Algorithms December 10, 2020 261
T := ∅
S := {s} for an arbitrary start node s
repeat n − 1 times
  find (u, v) fulfilling the cut property for S
  S := S ∪ {v}
  T := T ∪ {(u, v)}
[Figure: example graph with edge weights]
Sanders: Parallel Algorithms December 10, 2020 263
Analysis
Edge Contraction
forall (w, v) ∈ E do
  E := E \ {(w, v)} ∪ {(w, u)}   // but remember the original terminals
[Figure: contracting an edge of the example graph; the new weight-7 edge remembers its original terminals {2, 3}]
Sanders: Parallel Algorithms December 10, 2020 268
Boruvka’s Algorithm
[Figure: the example graph with edge weights]
Sanders: Parallel Algorithms December 10, 2020 269
Analysis (Sequential)
forall v ∈ V dopar
  allocate |Γ(v)|·p/(2m) processors to node v   // prefix sum
  find w such that c(v, w) is minimal among Γ(v)   // reduction
  output the original edge corresponding to (v, w)
  pred(v) := w
Time O(m/p + log p)
[Figure: the lightest incident edge chosen at each node of the example graph]
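A sequential sketch of one Borůvka round — the lightest incident edge per node is selected and the 2-cycles of the resulting pseudoforest are broken, exactly as in the pred(v) rule on the next slide (representation and names are mine):

#include <iostream>
#include <limits>
#include <vector>

struct Edge { int u, v; double w; };

// One Boruvka round: returns pred[] where pred[v] is v's lightest neighbour
// (pred[v] == v for the chosen root of each component).
std::vector<int> lightestEdges(int n, const std::vector<Edge>& E) {
  std::vector<double> best(n, std::numeric_limits<double>::infinity());
  std::vector<int> pred(n);
  for (int v = 0; v < n; ++v) pred[v] = v;
  for (const auto& e : E) {                      // a min-reduction per node
    if (e.w < best[e.u]) { best[e.u] = e.w; pred[e.u] = e.v; }
    if (e.w < best[e.v]) { best[e.v] = e.w; pred[e.v] = e.u; }
  }
  // break the 2-cycles of the pseudotrees: keep exactly one root per component
  for (int v = 0; v < n; ++v)
    if (v < pred[v] && pred[pred[v]] == v) pred[v] = v;
  return pred;                                    // MST edges: (v, pred[v]) for pred[v] != v
}

int main() {
  std::vector<Edge> E = {{0,1,1}, {1,2,5}, {0,3,9}, {1,3,7}, {2,4,2}, {3,4,4}};
  for (int p : lightestEdges(5, E)) std::cout << p << ' ';
  std::cout << '\n';
}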
Sanders: Parallel Algorithms December 10, 2020 271
out-degree 1, |C| edges per component C ⇒ each component is a pseudotree, i.e., a tree plus one edge
forall v ∈ V dopar
  w := pred(v)
  if v < w ∧ pred(w) = v then pred(v) := v
Time O(n/p)
[Figure: the pseudotrees before and after removing one edge of each 2-cycle]
Sanders: Parallel Algorithms December 10, 2020 273
Time O((n/p)·log n)
[Figure: pointer doubling turns each rooted tree into a star]
Sanders: Parallel Algorithms December 10, 2020 274
Contraction
k:= #components
V ′ = 1..k
find a bijective mapping f : star-roots→ 1..k // prefix sum
E′ := {(f(pred(u)), f(pred(v)), c, e_old) : (u, v, c, e_old) ∈ E ∧ pred(u) ≠ pred(v)}
Time O(m/p + log p)
[Figure: the contracted example graph]
Sanders: Parallel Algorithms December 10, 2020 275
Recursion
Analysis
Alternate
5. T ⇐ MST (L ∪ F) [Recursively].
T (n, m) ≤ T (n/8, m/2) + T (n/8, n/4) + c(n + m)
T (n, m) ≤ 2c(n + m) fulfills this recurrence.
Sanders: Parallel Algorithms December 10, 2020 279
Kruskal
[Plot: running times vs. m/n (1–16) for qKruskal, Kruskal8, filterKruskal+, filterKruskal, filterKruskal8, qJP, pJP]
Sanders: Parallel Algorithms December 10, 2020 281
Load Balancing
[Sanders Worsch 97]
Given
work to be done
PEs
Load balancing = assigning work → PEs
Goal: minimize parallel execution time
Sanders: Parallel Algorithms December 10, 2020 283
Measuring Cost
Maximal load: max_{i=1..p} ∑_{j ∈ jobs @ PE i} T(j, i, . . .)
In this Lecture
Independent jobs
– Sizes exactly known — fully parallel implementation
– Sizes unknown or inaccurate — random assignment,
master-worker-scheme, random polling
Sanders: Parallel Algorithms December 10, 2020 288
Sanders: Parallel Algorithms December 10, 2020 291
Time C + O(n/p + log p) if the jobs are initially distributed randomly.
Sanders: Parallel Algorithms December 10, 2020 292
[Figure: example jobs, their size prefix sums 0 2 5 9 13 15 20 23 25, and the PE boundaries 0, 7, 14, 21]
Sanders: Parallel Algorithms December 10, 2020 293
Atomic Jobs
assign job j to PE ⌊pos/C⌋   (pos: prefix sum of the job sizes)
Parallel: 11/9 · opt
[Anderson, Mayr, Warmuth 89]
Sanders: Parallel Algorithms December 10, 2020 294
[Figure: the same example (prefix sums 0 2 5 9 13 15 20 23 25, boundaries 0 7 14 21) and an optimal assignment of the jobs 4 3 4 3 5 2 2 2 3]
Sanders: Parallel Algorithms December 10, 2020 295
z_c(m) : N → C
z_c(0) := 0, z_c(m + 1) := z_c(m)² + c
M := {c ∈ C : z_c(m) is bounded} .
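A small sketch of the approximate membership test used for such a fractal computation (iteration bound and escape radius 2 are the usual choices, not prescribed by the slides):

#include <complex>
#include <iostream>

// Approximate test whether c is in the Mandelbrot set: iterate z <- z^2 + c and
// report how many iterations stay inside |z| <= 2 (the work varies strongly with c!).
int iterations(std::complex<double> c, int mMax = 1000) {
  std::complex<double> z = 0;
  for (int m = 0; m < mMax; ++m) {
    if (std::abs(z) > 2.0) return m;   // certainly unbounded
    z = z * z + c;
  }
  return mMax;                          // probably bounded
}

int main() {
  std::cout << iterations({0.0, 0.0}) << ' '     // stays bounded: prints 1000
            << iterations({1.0, 1.0}) << '\n';   // escapes quickly
}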
Sanders: Parallel Algorithms December 10, 2020 296
Approximate Computation
[Figure: the complex plane (real and imaginary part of c), cut into square patches numbered 0, 1, 2, ..., 7, 8, 9, ..., 15, 16, ..., 23]
Where is the load balancing problem?
Code
Random Assignment
Given: n jobs of size ℓ1 ,. . . , ℓn
Let L := ∑_{i ≤ n} ℓ_i
Discussion
one iteration
The Master-Worker-Scheme
Initially all jobs are on a master-PE
Job sizes can be unknown [Figure: master PE distributing jobs to worker PEs]
Discussion
+ Simple
+ Natural input-output scheme (but perhaps a separate disk slave)
+ Suggests itself when the job generator is not parallelized
+ Easy to debug
− communication bottleneck ⇒ tradeoff communication cost versus
imbalance
− How to split?
− Multi-level schemes are complicated and of limited help
Sanders: Parallel Algorithms December 10, 2020 306
Work Stealing
(Almost) arbitrarily subdivisible load
Initially all the work on PE 0
Almost nothing is known on job sizes
Preemption is allowed. (Successive splitting)
Sanders: Parallel Algorithms December 10, 2020 308
[Figure: a scrambled 15-puzzle position and the goal position]
Korf 85: iterative deepening depth-first search with ≈ 10^9 tree nodes.
Sanders: Parallel Algorithms December 10, 2020 309
#G.# #G.#
#GG# #.G#
#FF# #FF#
#G..# #G..#
#...# #...#
#G..# #...#
#...# #...#
#G??# #.??#
Sanders: Parallel Algorithms December 10, 2020 311
An Abstract Model:
Tree Shaped Computations
[Figure: tree-shaped computation — a subproblem is either worked on sequentially (if atomic), split into new subproblems, or empty; subproblems (description length l) can be sent to other processors, e.g. Proc. 1022]
Sanders: Parallel Algorithms December 10, 2020 313
Splitting Stacks
[Figure: two ways a) and b) of splitting a stack of subproblems]
Sanders: Parallel Algorithms December 10, 2020 316
An Application List
Discrete Mathematics (Toys?):
– Golomb Rulers
– Cellular Automata, Trellis Automata
– 15-Puzzle, n-Queens, Pentominoes . . .
NP-complete Problems (nondeterminism branching)
– 0/1 Knapsack Problem (fast!)
– Quadratic Assignment Problem
– SAT
Functional, Logical Programming Languages
Constraint Satisfaction, Planning, . . .
Numerical: Adaptive Integration, Nonlinear Optimization by Interval
Arithmetic, Eigenvalues of Tridiagonal Matrices
Sanders: Parallel Algorithms December 10, 2020 318
[Figure: request state machine of a PE — idle, send request, waiting; a request may be rejected]
Sanders: Parallel Algorithms December 10, 2020 320
Random Polling
[Figure: random polling — an idle PE sends a request to a randomly chosen PE, which splits its subproblem and sends one part back]
Sanders: Parallel Algorithms December 10, 2020 321
Õ (·) Calculus
X ∈ Õ ( f (n)) – iff ∀β > 0 :
Termination Detection
not here
Sanders: Parallel Algorithms December 10, 2020 323
Analysis
Theorem 6. For all ε > 0 there is a choice of ∆t and m such that
Tpar ≤ (1 + ε)·Tseq/p + Õ(Tatomic + h·(Trout(l) + Tcoll + Tsplit)).
[Figure: number of active PEs (up to p) over time — sequential work interleaved with splits]
Sanders: Parallel Algorithms December 10, 2020 325
Bounding Idleness
Lemma 7. Let m < p with m ∈ Ω(p).
Then Õ(h) iterations with at least m empty subproblems
suffice to ensure ∀P : gen(P) ≥ h.
[Figure: PEs 1..p over successive iterations — busy and idle phases, successful requests marked]
Sanders: Parallel Algorithms December 10, 2020 326
Busy phases
Lemma 8. There are at most Tseq/((p − m)·∆t) iterations with ≤ m idle
PEs at their end.
Sanders: Parallel Algorithms December 10, 2020 327
A Simplified Algorithm
P, P′ : Subproblem
P := if i_PE = 0 then P_root else P_∅
while not finished do
  P := work(P, ∆t)
  select a global value 0 ≤ s < p uniformly at random
  if T(P@((i_PE − s) mod p)) = 0 then
    (P, P@((i_PE − s) mod p)) := split(P)
Analysis
Theorem 10.
E[Tpar] ≤ (1 + ε)·Tseq/p + O((1/ε)·(Tatomic + h·(Trout + Tsplit)))
for an appropriate choice of ∆t.
Sanders: Parallel Algorithms December 10, 2020 330
[Figure: lower-bound example — a complete binary tree of depth log(Tseq/Tatomic) with atomic subproblems at the leaves, extended by chains of empty subproblems to depth h]
Additional term: h − log(Tseq/Tatomic).
Sanders: Parallel Algorithms December 10, 2020 332
[Figure: a second lower-bound example — log p − 1 upper levels and complete subtrees of depth log(2Tseq/(p·Tatomic)), total depth h]
Golomb Rulers
Find n marks {m_1, . . . , m_n} ⊆ N_0 with m_1 = 0, m_n = m (total length m) such that
|{m_j − m_i : 1 ≤ i < j ≤ n}| = n(n − 1)/2
Applications: radar astronomy, codes, . . .
[Figure: the Golomb ruler 0 1 4 10 12 17 — all 15 pairwise differences are distinct]
Sanders: Parallel Algorithms December 10, 2020 335
Many Processors
Parsytec GCel-3/1024 with COSY (PB)
Verification search
[Plot: speedup up to 1024 on the Parsytec; a second plot shows speedups on a LAN with 2–12 PEs]
Superlinear Speedup
Parsytec GCel-3/1024 under C OSY (PB)
1024 processors
2000 items
Splitting on all levels
256 random instances at the border between simple and difficult
overall 1410× faster than seq. computation!
Sanders: Parallel Algorithms December 10, 2020 340
[Plot: speedup (log scale, up to 65536) vs. sequential execution time [s]]
Sanders: Parallel Algorithms December 10, 2020 341
Fast Initialization
[Plot: speedup (up to 16) vs. sequential time [s]; curves: without initialization, with initialization]
Sanders: Parallel Algorithms December 10, 2020 342
Static vs Dynamic LB
[Plots: speedup (up to 16) vs. sequential time [s]; left: dynamic load balancing vs. 16 and 16384 static subproblems, right: dynamic vs. 16 and 256 static subproblems]
Sanders: Parallel Algorithms December 10, 2020 343
[Plot: speedup (128–640) vs. sequential time [s]]
Sanders: Parallel Algorithms December 10, 2020 344
MapReduce in 10 Minutes
[Google, DeanGhemawat OSDI 2004] see Wikipedia
// M ⊆ K × V
// MapF : K × V → K′ × V′
// ReduceF : K′ × 2^{V′} → V′′
Refinements
Fault Tolerance
Load balancing using hashing (default) and master-worker
Associative commutative reduce functions
Sanders: Parallel Algorithms December 10, 2020 349
Examples
Grep
URL access frequencies
build inverted index
Build reverse graph adjacency array
Sanders: Parallel Algorithms December 10, 2020 350
Graph Partitioning
Contraction
while |V | > c · k do
find a matching M ⊆ E
contract M // similar to the MST algorithm (but simpler)
Finding a Matching
expansion({u, v}) := ω({u, v}) / (c(u) + c(v))
expansion*({u, v}) := ω({u, v}) / (c(u)·c(v))
expansion*2({u, v}) := ω({u, v})² / (c(u)·c(v))
innerOuter({u, v}) := ω({u, v}) / (Out(u) + Out(v) − 2ω(u, v))
Sanders: Parallel Algorithms December 10, 2020 352
todo
Sanders: Parallel Algorithms December 10, 2020 353