f32 Book Parallel Pres pt4
Low-Diameter Architectures
The recursive structure of the hypercube:
(d) Binary 4-cube, built of two binary 3-cubes, labeled 0 and 1
Diameter: $D = q = \log_2 p$
Node degree: $d = q = \log_2 p$
Only sample wraparound links are shown to avoid clutter.
Isomorphic to the 4 × 4 × 4 3D torus (each has 64 × 6/2 links).
[Figure: embedding a seven-node binary tree into 2D meshes of various sizes.]
Parameters shown for two of the embeddings: Dilation = 2, Congestion = 2, Load factor = 1; and Dilation = 1, Congestion = 2, Load factor = 2.
Expansion: ratio of the number of nodes (9/7, 8/7, and 4/7 here).
[Figure labels: Dim 0 to Dim 3 paired with Column 0 to Column 3.]
We prove this to be the case for a torus (and thus for a mesh)
Spring 2006 Parallel Processing, Low-Diameter Architectures Slide 12
Torus is a Subgraph of Same-Size Hypercube
A tool used in our proof: the product graph $G_1 \times G_2$
Has $n_1 \times n_2$ nodes
Each node is labeled by a pair of labels, one from each component graph
Two nodes are connected if their labels agree in one component and their other components are connected in the corresponding component graph
Fig. 13.4 Examples of product graphs (one example is the 3-by-2 torus, with nodes labeled 0a, 0b, 1a, 1b, 2a, 2b).
The $2^a \times 2^b \times 2^c \times \cdots$ torus is the product of $2^a$-, $2^b$-, $2^c$-, . . . node rings
The $(a + b + c + \cdots)$-cube is the product of a-cube, b-cube, c-cube, . . .
The $2^q$-node ring is a subgraph of the q-cube
If a set of component graphs are subgraphs of another set, the product
graphs will have the same relationship
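One standard way to see that the $2^q$-node ring is a subgraph of the q-cube is to visit the cube nodes in reflected Gray code order. The sketch below is an illustration, not part of the slides; the function names `gray_code`, `ring_in_cube`, and `is_cube_edge` are hypothetical.

```python
def gray_code(i: int) -> int:
    """Return the i-th word of the binary reflected Gray code."""
    return i ^ (i >> 1)


def ring_in_cube(q: int) -> list[int]:
    """Order the 2**q cube nodes so that consecutive nodes are cube neighbors."""
    return [gray_code(i) for i in range(1 << q)]


def is_cube_edge(u: int, v: int) -> bool:
    """Two hypercube nodes are adjacent iff their labels differ in exactly one bit."""
    diff = u ^ v
    return diff != 0 and diff & (diff - 1) == 0


q = 3
ring = ring_in_cube(q)
# Every ring link, including the wraparound link, maps onto a cube link.
ok = all(is_cube_edge(ring[i], ring[(i + 1) % len(ring)]) for i in range(len(ring)))
print([format(x, "03b") for x in ring], ok)
```

Because each ring of a torus embeds this way in its own subcube, the product-graph argument carries the embedding over to the whole torus.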
Embedding Trees in the Hypercube
The $(2^q - 1)$-node complete binary tree is not a subgraph of the q-cube
Proof by contradiction based on the parity of node label weights (number of 1s in the labels); in the figure, successive tree levels alternate between even and odd weights
The $2^q$-node double-rooted complete binary tree is a subgraph of the q-cube
Fig. 13.7 Embedding a 15-node complete binary tree into the 3-cube.
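The parity argument can be checked by counting. Cube neighbors differ in one bit, so tree levels must alternate in weight parity; the leaves and every second level above them would need more same-parity labels than the q-cube has. The sketch below illustrates that count; the function names are hypothetical.

```python
def same_parity_demand(q: int) -> int:
    """Tree nodes forced to share the leaves' weight parity.

    Levels q-1, q-3, q-5, ... of the (2**q - 1)-node complete binary tree
    (root at level 0) must all have the same label-weight parity.
    """
    return sum(2 ** level for level in range(q - 1, -1, -2))


def cube_parity_supply(q: int) -> int:
    """Number of q-cube labels with a given weight parity (half of 2**q)."""
    return 2 ** (q - 1)


for q in range(3, 8):
    # Demand exceeds supply, so no embedding as a subgraph can exist.
    print(q, same_parity_demand(q), cube_parity_supply(q))
```

For q = 3 the tree needs 5 labels of one parity while the 3-cube offers only 4.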
13.4 A Few Simple Algorithms
Semigroup computation on the q-cube
Processor x, 0 ≤ x < p do t[x] := v[x]
  {initialize “total” to own value}
for k = 0 to q – 1 processor x, 0 ≤ x < p, do
  get y := t[Nk(x)]
  set t[x] := t[x] ⊗ y
endfor
Commutativity of the operator ⊗ is implicit here. How can we ...
[Figure: the subcube ranges covered by the totals grow from 6-7 to 4-7 to 0-7.]
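The loop above can be simulated sequentially. The sketch below is an illustration under stated assumptions: the dimension-k neighbor $N_k(x)$ is taken to be x with bit k complemented, all processors act in lockstep, and the operator is assumed commutative. The name `semigroup_on_cube` is hypothetical.

```python
from operator import add


def semigroup_on_cube(values, op):
    """Simulate the q-step semigroup computation on a p = 2**q node hypercube."""
    p = len(values)
    q = p.bit_length() - 1
    assert 1 << q == p, "number of processors must be a power of 2"
    t = list(values)  # each processor's running "total"
    for k in range(q):
        # Every processor reads its dimension-k neighbor's total (x with bit k flipped),
        # using the values from before this step.
        received = [t[x ^ (1 << k)] for x in range(p)]
        t = [op(t[x], received[x]) for x in range(p)]
    return t


print(semigroup_on_cube([3, 1, 4, 1, 5, 9, 2, 6], add))
print(semigroup_on_cube([3, 1, 4, 1, 5, 9, 2, 6], max))
```

After q steps every processor holds the combined result of all p values.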
g
111
c 010 011
d
f e h g d c
100 101 100 101 100 101
b a d c h g
000 001 000 001 000 001
h g f e b a
110 111 110 111 110 111
d c b a f e
010 011 010 011 010 011
Semigroup
Hypercube
Algorithms 2
6
3
7
0-7
Dimension 4-5 4-5 4-7 4-7 0-7
Dimension-order
q–1 communication 6-7 6-7
0-3
4-7
0-3
4-7
0-7
0-7
0-7
0-7
2-3 2-3
4 5
. Ascend Legend t 0 1
. u Parallel
. t: Subcube "total"
u: Subcube prefix
6 7
prefix
2 3
4-5 4-5 All "totals" 0-7
4-7
3
4-7
Normal 0-1
4 0-1 4-5
0-3 4 0-3 4-5 0-4 0-5
Sequence
000 001
0 110
g
111 h
reversal
0 1 2 3 . . . c 010 011
d
Algorithm Steps f
100
e
101
h
100
g
101
d
100
c
101
b a d c h
g
h
001
g
000
f
001
e
000
b
001
d c b a f e
010 011 010 011 010 011
13 24
5 6
7 8
2. Replicate inputs: communicate across 1/3 of the dimensions
3, 4. Rearrange the data by communicating across the remaining 2/3 of dimensions so that processor ijk has Aji and Bik
RC := RA × RB
6. Move Cjk to processor 0jk
Fig. 13.12 Multiplying two 2 × 2 matrices on a 3-cube.
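The data placement in these steps can be mimicked without modeling the communication. The sketch below is a sequential illustration, not the hypercube algorithm itself: processor ijk is assumed to hold Aji and Bik, multiply them, and the products are summed over i into processor 0jk. The matrices are the values that appear in the figure residue, taken here as an assumption; `cube_matmul` is a hypothetical name.

```python
def cube_matmul(A, B):
    """Multiply two m x m matrices using one virtual processor per (i, j, k)."""
    m = len(A)
    # After replication and rearrangement, processor (i, j, k) holds A[j][i] and B[i][k].
    RC = {
        (i, j, k): A[j][i] * B[i][k]
        for i in range(m)
        for j in range(m)
        for k in range(m)
    }
    # Summing along the i index collects C[j][k] in processor (0, j, k).
    return [[sum(RC[(i, j, k)] for i in range(m)) for k in range(m)] for j in range(m)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(cube_matmul(A, B))
```

With $m^3 = 2^q$ processors, each of the three index fields is q/3 bits wide, which is why each communication loop runs for q/3 iterations.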
Analysis of Matrix Multiplication
The algorithm involves communication steps in three loops, each with q/3 iterations (in one of ... exchanged per iteration)
Total time: O(q) = O(log m)
[Figure: contents of registers RA, RB, and RC := RA × RB on the 3-cube for the 2 × 2 example.]
Categories of practical sorting algorithms:
1. Deterministic 1-1, $O(\log^2 p)$-time
2. Deterministic k-k, optimal for n >> p (that is, for large k)
[Figure: chart of sorting-time bounds, with labels (n log n)/p for n >> p; $\log n (\log\log n)^2$; $\log n \log\log n$; $\log n$; ?; and the years 1988 and 1990.]
Not competitive, because we can sort an arbitrary sequence in 2p – 2 unidirectional communication steps using odd-even transposition
Each half is a bitonic sequence that can be sorted independently
Fig. 14.2 Sorting a bitonic sequence on a linear array.
Bitonic Sorting on a Linear Array
5 9 10 15 3 7 14 12 8 1 4 13 16 11 6 2
----> <---- ----> <---- ----> <---- ----> <----
5 9 15 10 3 7 14 12 1 8 13 4 11 16 6 2
------------> <------------ ------------> <------------
5 9 10 15 14 12 7 3 1 4 8 13 16 11 6 2
----------------------------> <----------------------------
3 5 7 9 10 12 14 15 16 13 11 8 6 4 2 1
------------------------------------------------------------>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Alternate derivation:
$T(p) = B(2) + B(4) + \cdots + B(p) = 2 + 6 + \cdots + (2p - 2) = 4p - 4 - 2\log_2 p$
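The stage pattern of the example and the step count can both be reproduced in a few lines. The sketch below is an illustration only: Python's `sorted` stands in for the bitonic merge of each block, blocks alternate between ascending and descending order as in the arrows above, and $B(n) = 2n - 2$ is taken from the derivation. The names `staged_sort` and `total_steps` are hypothetical.

```python
def staged_sort(seq):
    """Return the rows of the example: blocks of size 2, 4, 8, ... sorted in alternating directions."""
    n = len(seq)
    rows = [list(seq)]
    size = 2
    while size <= n:
        row = []
        for start in range(0, n, size):
            block = rows[-1][start:start + size]
            descending = (start // size) % 2 == 1  # odd-numbered blocks point left
            row.extend(sorted(block, reverse=descending))
        rows.append(row)
        size *= 2
    return rows


def total_steps(p):
    """Sum B(2) + B(4) + ... + B(p) with B(n) = 2n - 2, for p a power of 2."""
    q = p.bit_length() - 1
    return sum(2 * 2 ** i - 2 for i in range(1, q + 1))


data = [5, 9, 10, 15, 3, 7, 14, 12, 8, 1, 4, 13, 16, 11, 6, 2]
for row in staged_sort(data):
    print(row)
print(total_steps(16), 4 * 16 - 4 - 2 * 4)
```

For p = 16 both expressions give 52 steps, which exceeds the 2p – 2 = 30 steps of odd-even transposition noted earlier.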
Unfolded hypercube (indirect cube, butterfly) facilitates the discussion, visualization, and analysis of routing algorithms
Fig. 14.5 Unfolded 3-cube or the 32-node butterfly network (q + 1 columns; unfolding the hypercube gives the butterfly, and folding reverses it).
Number of cross links taken = length of path in hypercube
[Figure: routing of messages A, B, C, and D through an unfolded 3-cube (rows 0 to 7, columns 0 to 3).]
Bad news (false): The delay can be on the order of p for some permutations
[Figure: broadcasting ABCD.]
Fig. 15.3 Folded 3-cube viewed as 3-cube with a redundant dimension.
[Figure panels: folded 3-cube with dim-0 links removed; rotate 180 degrees; after renaming, diametral links replace dim-0 links.]
[Figure: two networks with dimension-0 links marked. Left: $2^q$ rows, q + 1 columns. Right: columns 0 and 3 merged (3 = 0), giving $2^q$ rows, q columns.]
Switching these two row pairs converts this to the original butterfly network.
Changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this ...
Fig. 15.7 Butterfly network redrawn as a fat tree.
[Figure panels: front view, binary tree; side view, inverted binary tree.]
[Figure labels: links marked ±2 and ±4.]
Data manipulator network was used in Goodyear MPP, an early SIMD parallel machine.
“Augmented” means that switches in a column are independent, as opposed to all being set to the same state (simplified control).
Fig. 15.12 Augmented data manipulator network ($2^q$ rows, q + 1 columns).
15.4 The Cube-Connected Cycles Network
[Figure: two drawings with rows 0 to 7, one with q columns (column 3 = column 0) and one with q columns/dimensions (Dim 0, Dim 1, Dim 2).]
Node address: Cycle ID = x ($2^m$ bits), Proc ID = y (m bits)
Ascend, descend, and normal algorithms. [Figure: dimension (0 to q – 1) versus algorithm steps (0, 1, 2, 3, . . .).]
Node (x, j) is communicating along dimension j; after the next rotation, it will be linked to its dimension-(j + 1) neighbor.
Fig. 15.15 CCC emulating a normal hypercube algorithm.
15.5 Shuffle and Shuffle-Exchange Networks
[Figure: eight nodes, 000 to 111, shown under five connection patterns: Shuffle, Exchange, Shuffle-Exchange, Alternate Structure, Unshuffle.]
Fig. 15.17 Alternate views of an eight-node shuffle–exchange network (links labeled S and SE).
Fig. 15.18 Eight-node network with separate shuffle (solid) and exchange (dotted) links.
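The three connection patterns named in the figure have compact bit-level definitions. The sketch below uses the usual ones, stated here as assumptions because the slides show only the drawings: shuffle is a 1-bit left rotation of the q-bit label, unshuffle is the inverse right rotation, and exchange complements the least significant bit.

```python
def shuffle(x: int, q: int) -> int:
    """Rotate the q-bit label of x left by one position."""
    return ((x << 1) | (x >> (q - 1))) & ((1 << q) - 1)


def unshuffle(x: int, q: int) -> int:
    """Rotate the q-bit label of x right by one position (inverse of shuffle)."""
    return (x >> 1) | ((x & 1) << (q - 1))


def exchange(x: int) -> int:
    """Complement the least significant bit of the label."""
    return x ^ 1


q = 3
print([shuffle(x, q) for x in range(1 << q)])
print([unshuffle(x, q) for x in range(1 << q)])
print([exchange(x) for x in range(1 << q)])
```

Nodes 0 and 7 are fixed points of the shuffle, which shows up as self-loops on the shuffle links of the eight-node network.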
Multistage Shuffle-Exchange Network
Fig. 15.19 Multistage shuffle–exchange network (omega network) is the same as butterfly network.
[Figure panels: two drawings with q + 1 columns and two with q columns, rows 0 to 7, each tracing a sample path A.]
15.6 That’s Not All, Folks!
When q is a power of 2, the $2^q q$-node cube-connected cycles network, derived from the q-cube by replacing each node with a q-node cycle, is a subgraph of the $(q + \log_2 q)$-cube; thus, CCC is a pruned hypercube
Other pruning strategies are possible, leading to interesting tradeoffs
[Figure labels: B = p/4; all dimension-0 links are kept.]
Fig. 15.21 Two 8-node Möbius cubes (0-Möbius cube and 1-Möbius cube).
16 A Sampler of Other Networks
Complete the picture of the “sea of interconnection networks”:
• Examples of composite, hybrid, and multilevel networks
• Notions of network performance and cost-effectiveness
Bisection width: indicator of random communication capacity
Node bisection and link bisection
Hard to determine; intuition can be very misleading
[Figure: network drawn with nodes numbered around a ring] . . . not as large as it appears.
Determining the Bisection Width
Establish upper bound by taking a number of trial cuts.
Then, try to match the upper bound by a lower bound.
Establishing a lower bound on B:
Embed $K_p$ into the p-node network
Let c be the maximum congestion
$B \ge \lfloor p^2/4 \rfloor / c$
[Figure: an embedding of $K_9$ (nodes P0 to P8) into the 3 × 3 mesh; the bisection width of $K_9$ is 4 × 5 = 20.]
For d = 2, we have $p \le 2D + 1$
Odd-sized ring satisfies this bound
For d = 3, we have $p \le 3 \times 2^D - 2$
D = 1 leads to $p \le 4$ ($K_4$ satisfies the bound)
D = 2 leads to $p \le 10$ and the first nontrivial example (Petersen graph)
Summary:
For node degree d, Moore’s bounds establish the lowest possible
diameter D that we can hope to achieve with p nodes, or the largest
number p of nodes that we can hope to accommodate for a given D.
Coming within a constant factor of the bound is usually good enough;
the smaller the constant factor, the better.
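The two special cases above follow from the general Moore bound for undirected graphs, $p \le 1 + d \sum_{i=0}^{D-1} (d-1)^i$. The sketch below evaluates it; `moore_bound` is a hypothetical name.

```python
def moore_bound(d: int, D: int) -> int:
    """Largest node count possible with node degree d and diameter D (undirected Moore bound)."""
    # One root, d nodes at distance 1, d*(d-1) at distance 2, and so on.
    return 1 + d * sum((d - 1) ** i for i in range(D))


for D in range(1, 5):
    print(D, moore_bound(2, D), moore_bound(3, D))
```

For d = 2 this reduces to 2D + 1, and for d = 3 to $3 \times 2^D - 2$, matching the values 4 and 10 quoted for $K_4$ and the Petersen graph.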
Network diameter    4    5    6    7    8    9
Star nodes         24   --  120  720   -- 5040
Hypercube nodes    16   32   64  128  256  512
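The table entries can be regenerated from two known facts, stated here as assumptions because the slides give only the numbers: a q-symbol star graph has q! nodes and diameter $\lfloor 3(q-1)/2 \rfloor$, and a hypercube of diameter D has $2^D$ nodes. The name `star_nodes_by_diameter` is hypothetical.

```python
from math import factorial


def star_nodes_by_diameter(max_q: int) -> dict[int, int]:
    """Map each achievable star-graph diameter to the node count q!."""
    return {3 * (q - 1) // 2: factorial(q) for q in range(2, max_q + 1)}


star = star_nodes_by_diameter(7)
for D in range(4, 10):
    # "--" marks diameters that no star graph attains.
    print(D, star.get(D, "--"), 2 ** D)
```

The gaps at diameters 5 and 8 arise because the star-graph diameter skips values as q grows.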
The Star-Connected Cycles Network
Replace degree-(q – 1) nodes with (q – 1)-cycles
[Figure labels: nodes 1234,4; 1234,3; 1234,2 of one cycle.]
Routing example:
Source node                 1 5 4 3 6 2
Dimension-2 link to         5 1 4 3 6 2
Dimension-6 link to         2 6 3 4 1 5
  Last 2 symbols now adjusted
Dimension-4 link to         4 3 6 2 1 5
  Last 4 symbols now adjusted
Dimension-2 link (Dest’n)   3 4 6 2 1 5
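In the routing example, each dimension-i link reverses the first i symbols of the node label, which can be read off the listed labels. The sketch below replays the route under that reading; the names `flip` and `follow` are hypothetical.

```python
def flip(node: tuple, i: int) -> tuple:
    """Follow a dimension-i link: reverse the first i symbols of the label."""
    return node[:i][::-1] + node[i:]


def follow(source: tuple, dims) -> list:
    """Return the sequence of labels visited when taking the given dimension links in order."""
    path = [source]
    for i in dims:
        path.append(flip(path[-1], i))
    return path


for label in follow((1, 5, 4, 3, 6, 2), [2, 6, 4, 2]):
    print(*label)
```

Each long flip moves one more pair of symbols into its final position at the end of the label, which is the "last 2 symbols" and "last 4 symbols" progress noted in the example.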
Cayley Networks
Group: A semigroup with an identity element and inverses for all elements.
Example 1: Integers with the addition operator form a group (with multiplication they do not, because most integers lack integer inverses).
Example 2: Permutations, with the composition operator, form a group.
[Figure: node x is linked to three neighbors, one per generator (Gen 1, Gen 2, Gen 3).]
[Figure labels: permutations 1342, 4123, 4132, 1243, 3142, 2143; cycle notation (1 3) (2) (4). The identity element is (1) (2) (3) (4).]
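A Cayley network can be built directly from a group and a generator set: each node x is linked to x composed with each generator. The sketch below is an illustration under stated assumptions: the group is the permutations of four symbols, and the generators swap the first symbol with each other symbol, which is the star-graph choice and not necessarily the one drawn in the figure. All names are hypothetical.

```python
from itertools import permutations


def compose(x: tuple, g: tuple) -> tuple:
    """Apply generator g to x by rearranging the positions of x."""
    return tuple(x[i] for i in g)


def cayley_graph(elements, generators):
    """Adjacency lists: node x is linked to compose(x, g) for every generator g."""
    return {x: [compose(x, g) for g in generators] for x in elements}


def swap_first_with(i: int, q: int) -> tuple:
    """Generator that exchanges positions 0 and i and leaves the rest in place."""
    g = list(range(q))
    g[0], g[i] = g[i], g[0]
    return tuple(g)


q = 4
generators = [swap_first_with(i, q) for i in range(1, q)]
graph = cayley_graph(list(permutations(range(q))), generators)
print(len(graph), {len(neighbors) for neighbors in graph.values()})
```

Because every generator here is its own inverse, the links are bidirectional, and every node has the same degree q – 1.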
Hence, a variety of multilevel and augmented ring networks have been proposed
Fig. 16.6 Unidirectional ring, two chordal rings, and node connectivity in general (node v is linked to nodes $v \pm s_1, \ldots, v \pm s_{k-1}$).
[Figure labels: message source S; skip distance s1.]
[Figure: nodes arranged in groups of g. Labeled ranges: nodes p – g to p – 1; nodes 0 to g – 1; Group 1 = nodes g to 2g – 1; Group 2 = nodes 2g to 3g – 1; Group i = nodes ig to (i + 1)g – 1. Skip links are labeled s0, s1, s2.]
A skip link leads to the same relative position in the destination group
Fig. 16.8 Periodically regular chordal ring.
Fig. 16.9 VLSI layout for a 64-node ...
[Figure labels: nodes 0 to 63; No skip in this dimension; Dimension 1; s 4 = 16.]
16.4 Composite or Hybrid Networks
Input     After sort   After sort by    After sort by
          by MSB       the middle bit   LSB (output)
7 (111)   0            0                0 (000)
0 (000)   1            1                1 (001)
4 (100)   3            3                2 (010)
6 (110)   2            2                3 (011)
1 (001)   7            4                4 (100)
5 (101)   4            5                5 (101)
3 (011)   6            7                6 (110)
2 (010)   5            6                7 (111)
Fig. 16.14 Example of sorting on a binary radix sort network.
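The columns of the example can be reproduced with a stable split on one bit per stage, applied within blocks that halve in size each time. The sketch below is a sequential illustration of that pattern, not a model of the network's switches; it assumes the keys are a permutation of 0 to $2^{bits} - 1$, so every split is even. The name `radix_sort_stages` is hypothetical.

```python
def radix_sort_stages(keys, bits):
    """Return the key order after each stage: split on MSB first, then on lower bits within blocks."""
    stages = [list(keys)]
    for stage in range(bits):
        bit = bits - 1 - stage          # bit examined in this stage
        block_size = len(keys) >> stage  # blocks halve from stage to stage
        current, nxt = stages[-1], []
        for start in range(0, len(keys), block_size):
            block = current[start:start + block_size]
            # Stable split: keys with a 0 in this bit go first, keeping their relative order.
            nxt += [k for k in block if not (k >> bit) & 1]
            nxt += [k for k in block if (k >> bit) & 1]
        stages.append(nxt)
    return stages


for column in radix_sort_stages([7, 0, 4, 6, 1, 5, 3, 2], 3):
    print(column)
```

Each printed list corresponds to one column of the example, read top to bottom.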
Partial List of Important MINs
Augmented data manipulator (ADM): aka unfolded PM2I (Fig. 15.12)
Banyan: Any MIN with a unique path between any input and any output (e.g. butterfly)
Baseline: Butterfly network with nodes labeled differently
Beneš: Back-to-back butterfly networks, sharing one column (Figs. 15.9-10)
Bidelta: A MIN that is a delta network in either direction
Butterfly: aka unfolded hypercube (Figs. 6.9, 15.4-5)
Data manipulator: Same as ADM, but with switches in a column restricted to same state
Delta: Any MIN for which the outputs of each switch have distinct labels (say 0 and 1
for 2 × 2 switches) and the path label, formed by concatenating the switch output labels
along the path from an input to an output, depends only on the output
Flip: Reverse of the omega network (inputs ↔ outputs)
Indirect cube: Same as butterfly or omega
Omega: Multi-stage shuffle-exchange network; isomorphic to butterfly (Fig. 15.19)
Permutation: Any MIN that can realize all permutations
Rearrangeable: Same as permutation network
Reverse baseline: Baseline network, with the roles of inputs and outputs interchanged