f32 Book Parallel Pres pt4

This document covers low-diameter interconnection architectures: hypercubes and their algorithms, sorting and routing on hypercubes, other hypercubic architectures, and a sampler of other networks. Specific topics include the definition and main properties of hypercubes, embeddings and their usefulness, embedding of arrays and trees, simple hypercube algorithms, and matrix operations on hypercubes.


Part IV

Low-Diameter Architectures



About This Presentation
This presentation is intended to support the use of the textbook
Introduction to Parallel Processing: Algorithms and Architectures
(Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by
the author in connection with teaching the graduate-level course
ECE 254B: Advanced Computer Architecture: Parallel Processing,
at the University of California, Santa Barbara. Instructors can use
these slides in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. © Behrooz Parhami

First edition: released Spring 2005, revised Spring 2006.


IV Low-Diameter Architectures
Study the hypercube and related interconnection schemes:
• Prime example of low-diameter (logarithmic) networks
• Theoretical properties, realizability, and scalability
• Complete our view of the “sea of interconnection nets”

Topics in This Part


Chapter 13 Hypercubes and Their Algorithms
Chapter 14 Sorting and Routing on Hypercubes
Chapter 15 Other Hypercubic Architectures
Chapter 16 A Sampler of Other Networks



13 Hypercubes and Their Algorithms
Study the hypercube and its topological/algorithmic properties:
• Develop simple hypercube algorithms (more in Ch. 14)
• Learn about embeddings and their usefulness

Topics in This Chapter


13.1 Definition and Main Properties
13.2 Embeddings and Their Usefulness
13.3 Embedding of Arrays and Trees
13.4 A Few Simple Algorithms
13.5 Matrix Multiplication
13.6 Inverting a Lower-Triangular Matrix



13.1 Definition and Main Properties
[Figure: two 9-node example networks illustrating intermediate architectures with logarithmic or sublogarithmic diameter.]

Begin studying networks that are intermediate between the diameter-1 complete network and the diameter-p^(1/2) mesh:

Sublogarithmic diameter:
  1 (complete network); 2 (PDN); log n / log log n (star, pancake); log n (binary tree, hypercube)
Superlogarithmic diameter:
  n^(1/2) (torus); n/2 (ring); n – 1 (linear array)


Hypercube and Its History
Binary tree has logarithmic diameter, but small bisection
Hypercube has a much larger bisection
Hypercube is a mesh with the maximum possible number of dimensions
2 × 2 × 2 × … × 2  (q = log2 p twos)
We saw that increasing the number of dimensions made it harder to
design and visualize algorithms for the mesh
Oddly, at the extreme of log2 p dimensions, things become simple again!

Brief history of the hypercube (binary q-cube) architecture


Concept developed: early 1960s [Squi63]
Direct (single-stage) and indirect (multistage) versions: mid 1970s
Initial proposals [Peas77], [Sull77] included no hardware
Caltech’s 64-node Cosmic Cube: early 1980s [Seit85]
Introduced an elegant solution to routing (wormhole switching)
Several commercial machines: mid to late 1980s
Intel iPSC (personal supercomputer), CM-2, nCUBE (Section 22.3)
Basic Definitions

Hypercube is a generic term; 3-cube, 4-cube, …, q-cube are used in specific cases.

Fig. 13.1 The recursive structure of binary hypercubes:
(a) Binary 1-cube, built of two binary 0-cubes, labeled 0 and 1
(b) Binary 2-cube, built of two binary 1-cubes, labeled 0 and 1
(c) Binary 3-cube, built of two binary 2-cubes, labeled 0 and 1
(d) Binary 4-cube, built of two binary 3-cubes, labeled 0 and 1

Parameters:
p = 2^q
B = p/2 = 2^(q–1)
D = q = log2 p
d = q = log2 p


The 64-Node Hypercube

Only sample wraparound links are shown to avoid clutter.

Isomorphic to the 4 × 4 × 4 3D torus (each has 64 × 6/2 = 192 links).


Neighbors of a Node in a Hypercube

Node x has the ID x_(q–1) x_(q–2) … x_2 x_1 x_0. Its q neighbors are obtained by complementing a single bit of the label: the dimension-k neighbor N_k(x) = x ⊕ 2^k has bit k complemented, for k = 0, 1, …, q – 1.

Nodes whose labels differ in k bits (i.e., are at Hamming distance k) are connected by a shortest path of length k.

The hypercube is both node- and edge-symmetric.

Strengths: symmetry, logarithmic diameter, and linear bisection width
Weakness: poor scalability
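A minimal Python sketch of this XOR view of neighbors and distances (illustrative code, not from the textbook; the helper names are ours):

def neighbors(x, q):
    """The q neighbors of node x in a q-cube: complement one bit at a time."""
    return [x ^ (1 << k) for k in range(q)]

def hamming(x, y):
    """Hamming distance = length of a shortest path between nodes x and y."""
    return bin(x ^ y).count("1")

# Node 0101 in the 4-cube:
assert sorted(neighbors(0b0101, 4)) == [0b0001, 0b0100, 0b0111, 0b1101]
assert hamming(0b0101, 0b1010) == 4   # diametrically opposite nodes are q apart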
13.2 Embeddings and Their Usefulness
Fig. 13.2 Embedding a seven-node binary tree into 2D meshes of various sizes:
3 × 3 mesh: dilation = 1, congestion = 1, load factor = 1
2 × 4 mesh: dilation = 2, congestion = 2, load factor = 1
2 × 2 mesh: dilation = 1, congestion = 2, load factor = 2

Dilation: longest path onto which an edge is mapped (routing slowdown)
Congestion: maximum number of edges mapped onto one edge (contention slowdown)
Load factor: maximum number of nodes mapped onto one node (processing slowdown)
Expansion: ratio of the number of nodes in the host network to that in the guest (9/7, 8/7, and 4/7 here)
13.3 Embedding of Arrays and Trees

Fig. 13.3 Hamiltonian cycle in the q-cube: traverse (q – 1)-cube 0, cross to (q – 1)-cube 1, traverse it in reverse order, and return.

Alternate inductive proof: Hamiltonicity of the q-cube is equivalent to the existence of a q-bit Gray code (prefix the (q – 1)-bit Gray code with 0, then the same code in reverse with 1).

Basis: a q-bit Gray code beginning with the all-0s codeword and ending with 10^(q–1) exists for q = 2: 00, 01, 11, 10.
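A short Python sketch of the reflected construction just described (illustrative, not from the slides):

def gray_code(q):
    """q-bit Gray code: the 0-prefixed (q-1)-bit code, then its reverse prefixed
    with 1. Consecutive codewords differ in one bit, tracing a Hamiltonian
    cycle in the q-cube."""
    if q == 1:
        return [0, 1]
    prev = gray_code(q - 1)
    return prev + [(1 << (q - 1)) | x for x in reversed(prev)]

code = gray_code(3)
assert code == [0b000, 0b001, 0b011, 0b010, 0b110, 0b111, 0b101, 0b100]
# Every step, including the wraparound, flips exactly one bit:
assert all(bin(a ^ b).count("1") == 1 for a, b in zip(code, code[1:] + code[:1]))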
Mesh/Torus Embedding in a Hypercube

Fig. 13.5 The 4 × 4 mesh/torus is a subgraph of the 4-cube.

Is a mesh or torus a subgraph of the hypercube of the same size? We prove this to be the case for a torus (and thus for a mesh).
Torus is a Subgraph of Same-Size Hypercube

A tool used in our proof is the product graph G1 × G2:
It has n1 × n2 nodes
Each node is labeled by a pair of labels, one from each component graph
Two nodes are connected iff their labels agree in one component and the other two components were connected in their component graph

Fig. 13.4 Examples of product graphs (one example yields a 3-by-2 torus).

The 2^a × 2^b × 2^c × … torus is the product of 2^a-, 2^b-, 2^c-, …-node rings
The (a + b + c + …)-cube is the product of the a-cube, the b-cube, the c-cube, …
The 2^q-node ring is a subgraph of the q-cube
If a set of component graphs are subgraphs of another set, the product graphs will have the same relationship; hence the torus (a product of rings) is a subgraph of the same-size hypercube (the product of the corresponding smaller cubes)
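The subgraph claim can be checked concretely. The Python sketch below (illustrative; it assumes the gray_code helper defined earlier) maps torus node (i, j) to the hypercube label formed by concatenating Gray codewords, so every torus edge becomes a hypercube edge:

def torus_to_cube(i, j, a, b):
    """Map node (i, j) of a 2^a-by-2^b torus to a label of the (a+b)-cube."""
    return (gray_code(a)[i] << b) | gray_code(b)[j]

a, b = 2, 3                                   # a 4-by-8 torus inside the 5-cube
for i in range(1 << a):
    for j in range(1 << b):
        x = torus_to_cube(i, j, a, b)
        for y in (torus_to_cube((i + 1) % (1 << a), j, a, b),
                  torus_to_cube(i, (j + 1) % (1 << b), a, b)):
            assert bin(x ^ y).count("1") == 1    # torus edge -> hypercube edge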
Embedding Trees in the Hypercube

The (2^q – 1)-node complete binary tree is not a subgraph of the q-cube.

Proof by contradiction, based on the parity of node label weights (the number of 1s in the labels): in any subgraph embedding, the tree's levels would have to alternate between even-weight and odd-weight labels, but then one of the two parity classes would need more than the 2^(q–1) nodes of that parity available in the q-cube.

The 2^q-node double-rooted complete binary tree is a subgraph of the q-cube.

Fig. 13.6 The 2^q-node double-rooted complete binary tree is a subgraph of the q-cube: it is built from a double-rooted tree in (q–1)-cube 0 and one in (q–1)-cube 1, joined through new roots.
A Useful Tree Embedding in the Hypercube

The (2^q – 1)-node complete binary tree can be embedded into the (q – 1)-cube, with a load factor of q. Despite the load factor of q, many tree algorithms entail no slowdown, since only one tree level is active at any given time.

Fig. 13.7 Embedding a 15-node complete binary tree into the 3-cube (successive tree levels connected by dim-2, dim-1, and dim-0 links, respectively).
13.4 A Few Simple Algorithms

Semigroup computation on the q-cube:

  Processor x, 0 ≤ x < p, do t[x] := v[x]
    {initialize "total" to own value}
  for k = 0 to q – 1, processor x, 0 ≤ x < p, do
    get y := t[N_k(x)]
    set t[x] := t[x] ⊗ y
  endfor

Commutativity of the operator ⊗ is implicit here. How can we remove this assumption?

Fig. 13.8 Semigroup computation on a 3-cube.
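A sequential Python simulation of this loop (a sketch, with the operator ⊗ taken to be addition):

q = 3
p = 1 << q
v = list(range(p))                  # each processor's own value
t = v[:]                            # t[x]: the subcube "total" held by processor x

for k in range(q):                  # one exchange along each dimension
    t = [t[x] + t[x ^ (1 << k)] for x in range(p)]   # get y from N_k(x); combine

assert all(total == sum(v) for total in t)   # every node ends with the global total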
Parallel Prefix Computation

Parallel prefix computation on the q-cube (t: subcube "total"; u: subcube prefix):

  Processor x, 0 ≤ x < p, do t[x] := u[x] := v[x]
    {initialize subcube "total" and partial prefix}
  for k = 0 to q – 1, processor x, 0 ≤ x < p, do
    get y := t[N_k(x)]
    set t[x] := t[x] ⊗ y
    if x > N_k(x) then set u[x] := u[x] ⊗ y
  endfor

Commutativity of the operator ⊗ is implicit in this algorithm as well. How can we remove this assumption?

Fig. 13.9 Parallel prefix computation on a 3-cube.
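The corresponding Python simulation of the prefix loop (again a sketch with ⊗ = +):

q = 3
p = 1 << q
v = list(range(1, p + 1))
t = v[:]                            # subcube total
u = v[:]                            # subcube prefix

for k in range(q):
    y = [t[x ^ (1 << k)] for x in range(p)]   # value fetched from N_k(x)
    for x in range(p):
        t[x] += y[x]
        if x > (x ^ (1 << k)):                # upper node folds in the lower half's total
            u[x] += y[x]

assert u == [sum(v[:x + 1]) for x in range(p)]   # u[x] = v[0] + ... + v[x]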
Sequence Reversal on the Hypercube

Reversing a sequence on the q-cube:

  for k = 0 to q – 1, processor x, 0 ≤ x < p, do
    get y := v[N_k(x)]
    set v[x] := y
  endfor

Fig. 13.11 Sequence reversal on a 3-cube.
Ascend, Descend, and Normal Hypercube Algorithms

Graphical depiction of ascend, descend, and normal algorithms: plot the dimension used for communication (0 through q – 1) against the algorithm steps.

Ascend algorithms use the dimensions in ascending order 0, 1, …, q – 1; the semigroup computation, parallel prefix, and sequence reversal algorithms above are all of this type.
Descend algorithms use the dimensions in descending order q – 1, …, 1, 0.
Normal algorithms communicate across adjacent dimensions in consecutive steps.


13.5 Matrix Multiplication

p = m^3 = 2^q processors, indexed as ijk (with three q/3-bit segments)

1. Place the elements of A and B in registers RA and RB of the m^2 processors with IDs 0jk (RA = a_jk, RB = b_jk)
2. Replicate the inputs: communicate across 1/3 of the dimensions
3, 4. Rearrange the data by communicating across the remaining 2/3 of the dimensions, so that processor ijk has a_ji and b_ik
5. Compute RC := RA × RB, so that processor ijk holds a_ji × b_ik; summing RC over i yields c_jk = Σ_i a_ji b_ik
6. Move c_jk to processor 0jk

Fig. 13.12 Multiplying two 2 × 2 matrices on a 3-cube:
[1 2; 3 4] × [5 6; 7 8] = [19 22; 43 50]
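The data movement can be followed with a small NumPy sketch (illustrative: the steps are modeled as array operations rather than actual hypercube communications; the values match Fig. 13.12):

import numpy as np

m = 2                                        # p = m**3 processors, indexed (i, j, k)
A = np.array([[1, 2], [3, 4]], dtype=float)
B = np.array([[5, 6], [7, 8]], dtype=float)

# Steps 1-2: place a_jk, b_jk in processors (0, j, k), then replicate over i.
# Steps 3-4: rearrange so that processor (i, j, k) holds a_ji and b_ik.
RA = np.zeros((m, m, m))
RB = np.zeros((m, m, m))
for i in range(m):
    for j in range(m):
        for k in range(m):
            RA[i, j, k] = A[j, i]
            RB[i, j, k] = B[i, k]

# Step 5: local products; then sum over i and collect c_jk (step 6).
RC = (RA * RB).sum(axis=0)                   # RC[j, k] = sum_i a_ji * b_ik

assert np.allclose(RC, A @ B)                # [[19, 22], [43, 50]]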
Analysis of Matrix Multiplication

The algorithm involves communication steps in three loops, each with q/3 iterations (in one of the loops, two values are exchanged per iteration).

T_mul(m, m^3) = O(q) = O(log m)

Analysis in the case of block matrix multiplication (m × m matrices):
Matrices are partitioned into p^(1/3) × p^(1/3) blocks of size (m / p^(1/3)) × (m / p^(1/3))
Each communication step deals with m^2 / p^(2/3) block elements
Each multiplication entails 2m^3 / p arithmetic operations

T_mul(m, p) = m^2 / p^(2/3) × O(log p) + 2m^3 / p
              (communication)               (computation)
13.6 Inverting a Lower-Triangular Matrix

For A = [B 0; C D] we have A^(–1) = [B^(–1) 0; –D^(–1) C B^(–1)  D^(–1)]

Check:
[B 0; C D] × [B^(–1) 0; –D^(–1) C B^(–1)  D^(–1)]
  = [B B^(–1)  0; C B^(–1) – D D^(–1) C B^(–1)  D D^(–1)] = [I 0; 0 I]

Because B and D are both lower triangular, the same algorithm can be used recursively to invert them in parallel.

T_inv(m) = T_inv(m/2) + 2 T_mul(m/2) = T_inv(m/2) + O(log m) = O(log^2 m)
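A recursive NumPy sketch of this scheme (illustrative and sequential; on the hypercube the two half-size inversions would proceed in parallel):

import numpy as np

def invert_lower_triangular(A):
    """Block inversion: inv([[B, 0], [C, D]]) = [[inv(B), 0], [-inv(D) C inv(B), inv(D)]]."""
    n = A.shape[0]
    if n == 1:
        return np.array([[1.0 / A[0, 0]]])
    h = n // 2
    B, C, D = A[:h, :h], A[h:, :h], A[h:, h:]
    Bi = invert_lower_triangular(B)      # the two recursive calls are independent
    Di = invert_lower_triangular(D)
    out = np.zeros_like(A, dtype=float)
    out[:h, :h] = Bi
    out[h:, h:] = Di
    out[h:, :h] = -Di @ C @ Bi           # two matrix multiplications per level
    return out

L = np.tril(np.random.rand(8, 8)) + np.eye(8)
assert np.allclose(invert_lower_triangular(L) @ L, np.eye(8))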


14 Sorting and Routing on Hypercubes
Study routing and data movement problems on hypercubes:
• Learn about limitations of oblivious routing algorithms
• Show that bitonic sorting is a good match to hypercube

Topics in This Chapter


14.1 Defining the Sorting Problem
14.2 Bitonic Sorting on a Hypercube
14.3 Routing Problems on a Hypercube
14.4 Dimension-Order Routing
14.5 Broadcasting on a Hypercube
14.6 Adaptive and Fault-Tolerant Routing



14.1 Defining the Sorting Problem

Arrange data in order of processor ID numbers (labels): the smallest value ends up in processor 0, the largest in processor p – 1.

The ideal parallel sorting algorithm: T(p) = Θ((n log n)/p). This ideal has not been achieved in all cases for the hypercube.

1-1 sorting (p items to sort, p processors):
Batcher's odd-even merge or bitonic sort: O(log^2 p) time
An O(log p)-time deterministic algorithm is not known

k-k sorting (n = kp items to sort, p processors):
Optimal algorithms known for n >> p or when average running time is considered (randomized)
Hypercube Sorting: Attempts and Progress

Bitonic sorting is one of the oldest parallel algorithms: discovered in 1960, published in 1968.

There are three categories of practical sorting algorithms:
1. Deterministic 1-1, O(log^2 p)-time
2. Deterministic k-k, optimal for n >> p (that is, for large k)
3. Probabilistic (1-1 or k-k)

Successive results, from the 1960s through 1990, have approached but not hit the O(log p) target (no bull's-eye yet!):
log^2 n, for n = p (bitonic)
log p × log n / log(p/n), for n ≤ p/4; in particular, log p for n = p^(1–ε)
log n, randomized
(n log n)/p, for n >> p
log n (log log n)^2
log n log log n

Pursuit of an O(log p)-time algorithm is of theoretical interest only.


Bitonic Sequences

In Chapter 7, we designed bitonic sorting networks; bitonic sorting is ideally suited to the hypercube.

A bitonic sequence either rises then falls (e.g., 1 3 3 4 6 6 6 2 2 1 0 0), falls then rises (e.g., 8 7 7 6 6 6 5 4 6 8 8 9), or is a cyclic shift of such a sequence (e.g., 8 9 8 7 7 6 6 6 5 4 6 8, the previous sequence right-rotated by 2).

Fig. 14.1 Examples of bitonic sequences.
Sorting a Bitonic Sequence on a Linear Array

Shift the right half of the data to the left half (superimposing the two halves). In each position, keep the smaller of the two values and ship the larger value to the right. Each half is then a bitonic sequence that can be sorted independently.

Time needed to sort a bitonic sequence on a p-processor linear array:
B(p) = p + p/2 + p/4 + … + 2 = 2p – 2

This is not competitive, because we can sort an arbitrary sequence in 2p – 2 unidirectional communication steps using odd-even transposition.

Fig. 14.2 Sorting a bitonic sequence on a linear array.
Bitonic Sorting on a Linear Array
5 9 10 15 3 7 14 12 8 1 4 13 16 11 6 2
----> <---- ----> <---- ----> <---- ----> <----
5 9 15 10 3 7 14 12 1 8 13 4 11 16 6 2
------------> <------------ ------------> <------------
5 9 10 15 14 12 7 3 1 4 8 13 16 11 6 2
----------------------------> <----------------------------
3 5 7 9 10 12 14 15 16 13 11 8 6 4 2 1
------------------------------------------------------------>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Fig. 14.3 Sorting an arbitrary sequence on a linear array through recursive application of bitonic sorting.

Sorting an arbitrary sequence of length p: recall that B(p) = 2p – 2, so
T(p) = T(p/2) + B(p) = T(p/2) + 2p – 2 = 4p – 4 – 2 log2 p

Alternate derivation:
T(p) = B(2) + B(4) + … + B(p) = 2 + 6 + … + (2p – 2) = 4p – 4 – 2 log2 p


14.2 Bitonic Sorting on a Hypercube
For the linear array, the 4p-step bitonic sorting algorithm is inferior to odd-even transposition, which requires p compare-exchange steps (or 2p unidirectional communications).
The situation is quite different for a hypercube

Sorting a bitonic sequence on a hypercube: compare-exchange values in the upper subcube (nodes with x_(q–1) = 1) with those in the lower subcube (x_(q–1) = 0); then sort the resulting bitonic half-sequences.

B(q) = B(q – 1) + 1 = q    Complexity: 2q communication steps

Sorting a bitonic sequence of size n on the q-cube, q = log2 n:

  for l = q – 1 downto 0, processor x, 0 ≤ x < p, do
    if x_l = 0
      then get y := v[N_l(x)]; keep min(v(x), y); send max(v(x), y) to N_l(x)
    endif
  endfor

This is a "descend" algorithm.
Bitonic Sorting on a Hypercube

Fig. 14.4 Sorting a bitonic sequence of size 8 on the 3-cube: compare-exchange along dimension 2, then dimension 1, then dimension 0 (data orderings in the upper and lower subcubes run in opposite directions).

T(q) = T(q – 1) + B(q) = T(q – 1) + q = q(q + 1)/2 = O(log^2 p)
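A sequential Python sketch of this descend loop (illustrative; bit l of the array index stands in for hypercube dimension l):

def bitonic_merge(v):
    """Sort a bitonic sequence of length 2^q by compare-exchanging along
    dimensions q-1, q-2, ..., 0 (the 'descend' schedule of Section 14.2)."""
    v, n = v[:], len(v)
    q = n.bit_length() - 1
    for l in range(q - 1, -1, -1):
        for x in range(n):
            if x & (1 << l) == 0:            # x is the lower node of the pair
                y = x | (1 << l)
                if v[x] > v[y]:
                    v[x], v[y] = v[y], v[x]  # keep the min; send the max to N_l(x)
    return v

assert bitonic_merge([1, 3, 6, 8, 7, 5, 2, 0]) == [0, 1, 2, 3, 5, 6, 7, 8]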


14.3 Routing Problems on a Hypercube
Recall the following categories of routing algorithms:
Off-line: Routes precomputed, stored in tables
On-line: Routing decisions made on the fly
Oblivious: Path depends only on source and destination
Adaptive: Path may vary by link and node conditions

Good news for routing on a hypercube:


Any 1-1 routing problem with p or fewer packets can be solved in
O(log p) steps, using an off-line algorithm; this is a consequence
of there being many paths to choose from

Bad news for routing on a hypercube:

Oblivious routing requires Ω(p^(1/2)/log p) time in the worst case (only slightly better than the mesh).

In practice, actual routing performance is usually much closer to the log-time best case than to the worst case.
Limitations of Oblivious Routing

Theorem 14.1: Let G = (V, E) be a p-node, degree-d network. Any oblivious routing algorithm for routing p packets in G needs Ω(p^(1/2)/d) worst-case time.

Proof sketch:
Let P(u, v) be the unique path used for routing messages from u to v.
There are p(p – 1) possible paths for routing among all node pairs.
These paths are predetermined and do not depend on traffic within the network.
Our strategy: find k node pairs u_i, v_i (1 ≤ i ≤ k) such that u_i ≠ u_j and v_i ≠ v_j for i ≠ j, and all the paths P(u_i, v_i) pass through the same edge e.
Because at most 2 packets can go through a link in one step, Ω(k) steps will be needed for some 1-1 routing problem.
The main part of the proof consists of showing that k can be as large as p^(1/2)/d.
14.4 Dimension-Order Routing

Example:
  Source       01011011
  Destination  11010110
  Differences  ^   ^^ ^
  Path: 01011011
        11011011
        11010011
        11010111
        11010110

The unfolded hypercube (indirect cube, butterfly) facilitates the discussion, visualization, and analysis of routing algorithms.

Fig. 14.5 Unfolded 3-cube, or the 32-node butterfly network: 2^q rows and q + 1 columns, with dimension-0, -1, and -2 links between successive columns; folding the columns back onto one another recovers the hypercube.

Dimension-order routing between nodes i and j in the q-cube can be viewed as routing from node i in column 0 (q) to node j in column q (0) of the butterfly.
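In code, the routing tag is simply the XOR of the source and destination labels; a minimal Python sketch (illustrative, not from the textbook):

def dimension_order_path(src, dst, q):
    """Correct the differing label bits one dimension at a time (order 0 to q-1)."""
    path, node = [src], src
    for k in range(q):
        if (node ^ dst) & (1 << k):     # bit k of the routing tag is 1: cross dim k
            node ^= 1 << k
            path.append(node)
    return path

# Node 3 to node 6 in the 3-cube: tag 011 XOR 110 = 101, i.e., cross-straight-cross
assert dimension_order_path(0b011, 0b110, 3) == [0b011, 0b010, 0b110]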
Self-Routing on a Butterfly Network

Fig. 14.6 Example dimension-order routing paths. The number of cross links taken equals the length of the path in the hypercube.

The routing tag is the XOR of the source and destination node labels; each tag bit tells the switch at the corresponding stage to route "straight" (0) or "cross" (1):
From node 3 to 6: routing tag = 011 ⊕ 110 = 101, "cross-straight-cross"
From node 3 to 5: routing tag = 011 ⊕ 101 = 110, "cross-cross-straight"
From node 6 to 1: routing tag = 110 ⊕ 001 = 111, "cross-cross-cross"
Butterfly Is Not a Permutation Network

Fig. 14.7 Packing is a "good" routing problem for dimension-order routing on the hypercube.

Fig. 14.8 Bit-reversal permutation is a "bad" routing problem for dimension-order routing on the hypercube.
Why Does Bit-Reversal Routing Lead to Conflicts?

Consider the (2a + 1)-cube and messages that must go from nodes 0 0 … 0 x_1 x_2 … x_(a–1) x_a (a + 1 leading zeros) to nodes x_a x_(a–1) … x_2 x_1 0 0 … 0 (a + 1 trailing zeros).

If we route messages in dimension order, starting from the right end, all of these 2^a = Θ(p^(1/2)) messages will pass through node 0.

Consequences of this result:
1. The Θ(p^(1/2)) delay is even worse than the Ω(p^(1/2)/d) of Theorem 14.1
2. Besides delay, large buffers are needed within the nodes

True or false? If we limit nodes to a constant number of message buffers, then the Θ(p^(1/2)) bound still holds, except that messages are queued at several levels before reaching node 0.

Bad news (the statement is false): the delay can be Θ(p) for some permutations.
Good news: performance is usually much better, i.e., log2 p + o(log p).
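A quick Python check of the claim (illustrative; it reuses the dimension_order_path sketch above):

a = 3
q = 2 * a + 1                                    # the (2a+1)-cube
through_node_0 = 0
for x in range(1, 1 << a):                       # sources 0...0 x1 x2 ... xa
    dst = int(bin(x)[2:].zfill(a)[::-1], 2) << (a + 1)   # xa ... x1, then a+1 zeros
    if 0 in dimension_order_path(x, dst, q):
        through_node_0 += 1

assert through_node_0 == (1 << a) - 1   # every nonzero message passes through node 0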


Wormhole Routing on a Hypercube

Good/bad routing problems are good/bad for wormhole routing as well.

Dimension-order routing is deadlock-free.

(The packing and bit-reversal examples of Figs. 14.7 and 14.8 apply here as well.)


14.5 Broadcasting on a Hypercube
Flooding: applicable to any network with all-port communication
00000 Source node
00001, 00010, 00100, 01000, 10000 Neighbors of source
00011, 00101, 01001, 10001, 00110, 01010, 10010, 01100, 10100, 11000 Distance-2 nodes
00111, 01011, 10011, 01101, 10101, 11001, 01110, 10110, 11010, 11100 Distance-3 nodes
01111, 10111, 11011, 11101, 11110 Distance-4 nodes
11111 Distance-5 node

Binomial broadcast tree with single-port communication:

Fig. 14.9 The binomial broadcast tree for a 5-cube. At time 1, source 00000 informs 10000; at time 2, these two inform 01000 and 11000; the number of informed nodes doubles at each step, so all 32 nodes are informed by time 5.
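The doubling pattern is easy to simulate (a Python sketch; the dimension order, q–1 down to 0, follows Fig. 14.9):

q = 5
informed = {0b00000}                      # the source node
for t in range(q):
    dim = q - 1 - t                       # dimension used at time t+1
    informed |= {x ^ (1 << dim) for x in informed}   # each informed node sends once

assert len(informed) == 1 << q            # all 32 nodes reached after q = 5 steps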
Hypercube Broadcasting Algorithms

Fig. 14.10 Three hypercube broadcasting schemes, as performed on a 4-cube for a message broken into pieces A, B, C, D:
Binomial-tree scheme (nonpipelined)
Pipelined binomial-tree scheme
Johnsson & Ho's method (to avoid clutter, only piece A is shown)


14.6 Adaptive and Fault-Tolerant Routing

There are up to q node-disjoint and edge-disjoint shortest paths between any pair of nodes in a q-cube. Thus, one can route messages around congested or failed nodes/links.

A useful notion for designing adaptive wormhole routing algorithms is that of virtual communication networks.

Fig. 14.11 Partitioning a 3-cube into subnetworks for deadlock-free routing.

Each of the two subnetworks in Fig. 14.11 is acyclic. Hence, any routing scheme that begins by using links in subnet 0, at some point switches the path to subnet 1, and from then on remains in subnet 1, is deadlock-free.
Robustness of the Hypercube

Rich connectivity provides many alternate paths for message routing: even with three faulty nodes, the source S can route around the faults to reach the destination.

The node that is furthest from S is not its diametrically opposite node in the fault-free hypercube: the fault diameter of the q-cube is q + 1.


15 Other Hypercubic Architectures
Learn how the hypercube can be generalized or extended:
• Develop algorithms for our derived architectures
• Compare these architectures based on various criteria

Topics in This Chapter


15.1 Modified and Generalized Hypercubes
15.2 Butterfly and Permutation Networks
15.3 Plus-or-Minus-2i Network
15.4 The Cube-Connected Cycles Network
15.5 Shuffle and Shuffle-Exchange Networks
15.6 That’s Not All, Folks!



15.1 Modified and Generalized Hypercubes

Fig. 15.1 Deriving a twisted 3-cube by redirecting two links in a 4-cycle.

The diameter of the twisted q-cube is one less than that of the original hypercube.


Folded Hypercubes

Fig. 15.2 Deriving a folded 3-cube by adding four diametral links. The diameter is half that of the original hypercube.

Fig. 15.3 Folded 3-cube viewed as a 3-cube with a redundant dimension: after renaming, diametral links replace the dimension-0 links.


Generalized Hypercubes

A hypercube is a power or homogeneous product network: the q-cube is the qth power of K_2 (the two-node complete graph).

Generalized hypercube = the qth power of K_r (node labels are radix-r numbers). Node x is connected to y iff x and y differ in one digit; each node has r – 1 dimension-k links for each dimension k.

Example: a radix-4 generalized hypercube, with node labels x_3 x_2 x_1 x_0 being radix-4 numbers.


15.2 Butterfly and Permutation Networks

Fig. 7.4 Butterfly and wrapped butterfly networks: the butterfly has 2^q rows and q + 1 columns; merging columns 0 and q yields the wrapped butterfly, with 2^q rows and q columns.


Structure of Butterfly Networks

Fig. 15.5 Butterfly network with permuted dimensions (order 1, 0, 2 instead of 0, 1, 2). Switching two row pairs converts this to the original butterfly network; changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this example, row xyz becomes row xzy).

[Figure: the 16-row butterfly network, with 2^4 rows and 5 columns.]
Fat Trees

Fig. 15.6 Two representations of a fat tree: link capacity grows toward the root (an ordinary tree, by contrast, is a "skinny" tree).

Fig. 15.7 Butterfly network redrawn as a fat tree; the front view is a binary tree, the side view an inverted binary tree.


Butterfly as a Multistage Interconnection Network

Fig. 6.9 Example of a multistage memory access network: p processors connected to p memory banks through log2 p columns of 2-by-2 switches.

Fig. 15.8 Butterfly network (log2 p + 1 columns of 2-by-2 switches) used to connect modules that are on the same side.

Generalization of the butterfly network: the high-radix or m-ary butterfly, built of m × m switches, has m^q rows and q + 1 columns (q if wrapped).
Beneš Network

Fig. 15.9 Beneš network formed from two back-to-back butterflies: processors connected to memory banks through 2 log2 p – 1 columns of 2-by-2 switches.

A 2^q-row Beneš network can route any 2^q × 2^q permutation: it is "rearrangeable."


Routing Paths in a Beneš Network

Fig. 15.10 Another example of a Beneš network: 2^(q+1) inputs and 2^(q+1) outputs, with 2^q rows and 2q + 1 columns.

To which memory modules can we connect processor 4 without rearranging the other paths? What about processor 6?
15.3 Plus-or-Minus-2^i Network

Fig. 15.11 Two representations of the eight-node PM2I network: node x is connected to x ± 1, x ± 2, and x ± 4 (mod 8).

The hypercube is a subgraph of the PM2I network.
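A small Python check of the subgraph relation (an illustrative sketch):

def pm2i_neighbors(x, q):
    """Neighbors of node x in the 2^q-node PM2I network: x +/- 2^i mod 2^q."""
    p = 1 << q
    return ({(x + (1 << i)) % p for i in range(q)} |
            {(x - (1 << i)) % p for i in range(q)})

q = 3
# Every hypercube edge x -- x XOR 2^k is also a PM2I edge (x + 2^k or x - 2^k):
assert all((x ^ (1 << k)) in pm2i_neighbors(x, q)
           for x in range(1 << q) for k in range(q))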


Unfolded PM2I Network

Fig. 15.12 Augmented data manipulator (ADM) network: 2^q rows, q + 1 columns.

The data manipulator network was used in the Goodyear MPP, an early SIMD parallel machine.

"Augmented" means that the switches in a column are independent, as opposed to all being set to the same state (which simplifies control).
15.4 The Cube-Connected Cycles Network

The cube-connected cycles (CCC) network can be viewed as a simplified wrapped butterfly whose node degree is reduced from 4 to 3.

Fig. 15.13 A wrapped butterfly (2^q rows, q columns) converted into cube-connected cycles (q columns/dimensions).
Another View of the CCC Network

Fig. 15.14 Alternate derivation of CCC from a hypercube: replacing each node of a high-dimensional q-cube by a cycle of length q is how CCC was originally proposed. This is an example of hierarchical substitution for deriving a lower-cost network from a basis network.


Emulation of Hypercube Algorithms by CCC

A CCC node is identified as (x, j), where x is the cycle ID (2^m bits, for q = 2^m) and j = y is the processor ID within the cycle (m bits). Node (x, j) has cycle links to (x, j – 1) and (x, j + 1) and a dimension-j link to (N_j(x), j).

In emulating a normal hypercube algorithm, node (x, j) communicates along dimension j; after the next rotation of the data around the cycle, it will be linked to its dimension-(j + 1) neighbor.

Fig. 15.15 CCC emulating a normal hypercube algorithm (ascend, descend, and normal schedules as depicted earlier).
15.5 Shuffle and Shuffle-Exchange Networks

Fig. 15.16 Shuffle, exchange, and shuffle–exchange connectivities on eight nodes (panels: shuffle, exchange, shuffle–exchange, alternate structure, unshuffle).
Shuffle-Exchange Network

Fig. 15.17 Alternate views of an eight-node shuffle–exchange network (S: shuffle links; SE: shuffle–exchange links).


Routing in Shuffle-Exchange Networks

In the 2^q-node shuffle network, node x = x_(q–1) x_(q–2) … x_2 x_1 x_0 is connected to x_(q–2) … x_2 x_1 x_0 x_(q–1) (cyclic left-shift of x).

In the 2^q-node shuffle-exchange network, node x is additionally connected to the cyclic left-shift of x with the shifted-in bit x_(q–1) complemented.
01011011 Source
11010110 Destination
^ ^^ ^ Positions that differ
01011011 Shuffle to 10110110 Exchange to 10110111
10110111 Shuffle to 01101111
01101111 Shuffle to 11011110
11011110 Shuffle to 10111101
10111101 Shuffle to 01111011 Exchange to 01111010
01111010 Shuffle to 11110100 Exchange to 11110101

11110101 Shuffle to 11101011


11101011 Shuffle to 11010111 Exchange to 11010110
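A Python sketch of the routing rule illustrated by this trace (illustrative, not from the textbook): q shuffles, with an exchange whenever the bit just rotated into position 0 disagrees with the destination.

def shuffle(x, q):
    """Cyclic left-shift of a q-bit node label."""
    return ((x << 1) | (x >> (q - 1))) & ((1 << q) - 1)

def se_route(src, dst, q):
    node = src
    for i in range(q):
        node = shuffle(node, q)
        if (node ^ (dst >> (q - 1 - i))) & 1:   # wrong bit now in position 0?
            node ^= 1                           # take the exchange link
    return node

assert se_route(0b01011011, 0b11010110, 8) == 0b11010110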
Diameter of Shuffle-Exchange Networks

For the 2^q-node shuffle-exchange network: D = q = log2 p, d = 4.

With shuffle and exchange links provided separately, as in Fig. 15.18, the diameter increases to 2q – 1 and the node degree is reduced to 3.

Fig. 15.18 Eight-node network with separate shuffle (solid) and exchange (dotted) links.
Multistage Shuffle-Exchange Network

Fig. 15.19 The multistage shuffle–exchange network (omega network) is the same as the butterfly network; it is drawn with q + 1 columns (unwrapped) or q columns (wrapped).
15.6 That's Not All, Folks!

When q is a power of 2, the 2^q q-node cube-connected cycles network derived from the q-cube, by replacing each node with a q-node cycle, is a subgraph of the (q + log2 q)-cube, so CCC is a pruned hypercube.

Other pruning strategies are possible, leading to interesting tradeoffs. In the example of Fig. 15.20, all dimension-0 links are kept, odd-dimension links are kept only in the odd subcube, and even-dimension links only in the even subcube:

D = log2 p + 1,  d = (log2 p + 1)/2,  B = p/4

Fig. 15.20 Example of a pruned hypercube.


Möbius Cubes

The dimension-i neighbor of x = x_(q–1) x_(q–2) … x_(i+1) x_i … x_1 x_0 is obtained by:
complementing x_i alone (as in the q-cube) if x_(i+1) = 0
complementing x_i and all bits to its right if x_(i+1) = 1

For dimension q – 1, since there is no x_q, the neighbor can be defined in two possible ways, leading to 0- and 1-Möbius cubes.

A Möbius cube has a diameter of about 1/2, and an average internode distance of about 2/3, of that of a hypercube.

Fig. 15.21 Two 8-node Möbius cubes (the 0-Möbius cube and the 1-Möbius cube).
16 A Sampler of Other Networks
Complete the picture of the “sea of interconnection networks”:
• Examples of composite, hybrid, and multilevel networks
• Notions of network performance and cost-effectiveness

Topics in This Chapter


16.1 Performance Parameters for Networks
16.2 Star and Pancake Networks
16.3 Ring-Based Networks
16.4 Composite or Hybrid Networks
16.5 Hierarchical (Multilevel) Networks
16.6 Multistage Interconnection Networks



16.1 Performance Parameters for Networks
A wide variety of direct
interconnection networks
have been proposed for, or
used in, parallel computers

They differ in topological,


performance, robustness,
and realizability attributes.

Fig. 4.8 (expanded) The sea of direct interconnection networks.


Diameter and Average Distance

Diameter D (indicator of worst-case message latency)
Routing diameter D(R), based on routing algorithm R
Average internode distance Δ (indicator of average-case latency)
Routing average internode distance Δ(R)

Finding the average internode distance of a 3 × 3 mesh:
Sum of distances from a corner node: 2 × 1 + 3 × 2 + 2 × 3 + 1 × 4 = 18
Sum of distances from a side node: 3 × 1 + 3 × 2 + 2 × 3 = 15
Sum of distances from the center node: 4 × 1 + 4 × 2 = 12
Δ = (4 × 18 + 4 × 15 + 12) / (9 × 8) = 2  [or 144 / 81 = 16 / 9, if each node's zero distance to itself is averaged in]

For the 3 × 3 torus:
Δ = (4 × 1 + 4 × 2) / 8 = 1.5  [or 12 / 9 = 4 / 3]
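These sums can be verified mechanically; a BFS-based Python sketch (illustrative, not from the textbook):

from collections import deque

def avg_distance(n, torus):
    """Average internode distance of an n-by-n mesh or torus (BFS from every node)."""
    def nbrs(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if torus:
                yield rr % n, cc % n
            elif 0 <= rr < n and 0 <= cc < n:
                yield rr, cc
    total, p = 0, n * n
    for s in [(r, c) for r in range(n) for c in range(n)]:
        dist, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for w in nbrs(*u):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total / (p * (p - 1))        # average over ordered pairs of distinct nodes

assert avg_distance(3, torus=False) == 2.0
assert avg_distance(3, torus=True) == 1.5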
Bisection Width

Indicator of random communication capacity. One can consider node bisection and link bisection.

Bisection width is hard to determine, and intuition can be very misleading.

Fig. 16.2 A network whose bisection width is not as large as it appears.
Determining the Bisection Width

Establish an upper bound by taking a number of trial cuts; then try to match the upper bound by a lower bound.

Establishing a lower bound on B: embed the complete graph K_p into the p-node network and let c be the maximum congestion. Since K_p has bisection width ⌊p/2⌋ × ⌈p/2⌉ = ⌊p^2/4⌋ (= 4 × 5 = 20 for K_9), we get B ≥ ⌊p^2/4⌋ / c.

Example: an embedding of K_9 into the 3 × 3 mesh.


Degree-Diameter Relationship

Age-old question: What is the best way to interconnect p nodes of degree d to minimize the diameter D of the resulting network? Alternatively: given a desired diameter D and nodes of degree d, what is the maximum number of nodes p that can be accommodated?

Moore bounds (digraphs): at most d nodes at distance 1, d^2 at distance 2, and so on:
p ≤ 1 + d + d^2 + … + d^D = (d^(D+1) – 1)/(d – 1)
D ≥ log_d [p(d – 1) + 1] – 1
Only the ring and K_p match these bounds.

Moore bounds (undirected graphs): d nodes at distance 1, at most d(d – 1) at distance 2, etc.:
p ≤ 1 + d + d(d – 1) + … + d(d – 1)^(D–1) = 1 + d [(d – 1)^D – 1]/(d – 2)
D ≥ log_(d–1) [(p – 1)(d – 2)/d + 1]
Only the ring with odd size p and a few other networks match these bounds.
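A one-line Python evaluation of the undirected bound (a sketch; the d = 3, D = 2 case anticipates the Petersen graph of the next slide):

def moore_bound(d, D):
    """Largest p allowed by the undirected Moore bound for degree d >= 3, diameter D."""
    return 1 + d * ((d - 1) ** D - 1) // (d - 2)

assert moore_bound(3, 1) == 4     # attained by K_4
assert moore_bound(3, 2) == 10    # attained by the Petersen graph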
Moore Graphs

A Moore graph matches the bounds on diameter and number of nodes.

For d = 2, we have p ≤ 2D + 1; the odd-sized ring satisfies this bound.

For d = 3, we have p ≤ 3 × 2^D – 2:
D = 1 leads to p ≤ 4 (K_4 satisfies the bound)
D = 2 leads to p ≤ 10 and the first nontrivial example (the Petersen graph)

Fig. 16.1 The 10-node Petersen graph.
How Good Are Meshes and Hypercubes?

For d = 4, we have D ≥ log_3 [(p + 1)/2]. So 2D mesh and torus networks are far from optimal in diameter, whereas the butterfly is asymptotically optimal within a constant factor.

For d = log2 p (as for the d-cube), we have D = Ω(d / log d). So the diameter d of a d-cube is a factor of log d over the best possible; we will see that star graphs match this bound asymptotically.

Summary: for node degree d, Moore's bounds establish the lowest possible diameter D that we can hope to achieve with p nodes, or the largest number p of nodes that we can hope to accommodate for a given D. Coming within a constant factor of the bound is usually good enough; the smaller the constant factor, the better.


Layout Area and Longest Wire

The VLSI layout area required by an interconnection network is intimately related to its bisection width B: if B wires must cross the bisection in a 2D layout of a network and wire separation is 1 unit, the smallest dimension of the VLSI chip will be ≥ B, and the chip area will thus be Ω(B^2) units.
A p-node 2D mesh needs O(p) area
A p-node hypercube needs at least Ω(p^2) area

The longest wire required in a VLSI layout affects network performance. For example, any 2D layout of a p-node hypercube requires wires of length Ω((p / log p)^(1/2)); the wire length of a mesh does not grow with size. When wire length grows with size, the per-node performance is bound to degrade for larger systems, thus implying sublinear speedup.


Measures of Network Cost-Effectiveness

Composite measures, which take both the network performance and its implementation cost into account, are useful in comparisons. One such measure is the degree-diameter product dD:
Mesh / torus: Θ(p^(1/2))
Binary tree: Θ(log p)
Pyramid: Θ(log p) (the same dD as the binary tree, though the two are not quite similar in cost-performance)
Hypercube: Θ(log^2 p)

However, this measure is somewhat misleading, as the node degree d is not an accurate measure of cost; e.g., VLSI layout area also depends on wire lengths and wiring pattern, and bus-based systems have low node degrees and diameters without necessarily being cost-effective.

Robustness must be taken into account in any practical comparison of interconnection networks (e.g., the tree is not as attractive in this regard).


16.2 Star and Pancake Networks

The q-dimensional star graph has p = q! nodes. Each node is labeled with a string x_1 x_2 … x_q which is a permutation of {1, 2, …, q}.

Node x_1 x_2 … x_i … x_q is connected to x_i x_2 … x_1 … x_q for each i (note that x_1 and x_i are interchanged). When the ith symbol is switched with x_1, the corresponding link is called a dimension-i link.

Fig. 16.3 The four-dimensional star graph.

d = q – 1; D = ⌊3(q – 1)/2⌋; hence D, d = O(log p / log log p)
Routing in the Star Graph

Example (adjusting symbols from right to left):
  Source node                      1 5 4 3 6 2
  Dimension-2 link to              5 1 4 3 6 2
  Dimension-6 link to              2 1 4 3 6 5   (last symbol now adjusted)
  Dimension-2 link to              1 2 4 3 6 5
  Dimension-5 link to              6 2 4 3 1 5   (last 2 symbols now adjusted)
  Dimension-2 link to              2 6 4 3 1 5
  Dimension-4 link to              3 6 4 2 1 5   (last 3 symbols now adjusted)
  Dimension-2 link to              6 3 4 2 1 5
  Dimension-3 link to              4 3 6 2 1 5   (last 4 symbols now adjusted)
  Dimension-2 link (destination)   3 4 6 2 1 5

We need a maximum of two routing steps per symbol, except that the last two symbols need at most one step for adjustment, so D ≤ 2q – 3. Clearly, this is not a shortest-path routing algorithm: the diameter of the star graph is in fact somewhat less, D = ⌊3(q – 1)/2⌋. (Correction to the text, p. 328: the diameter is not 2q – 3.)
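A Python sketch of this greedy rule (illustrative, not from the textbook: fix the rightmost unadjusted symbol, using at most two dimension moves per symbol):

def star_route(src, dst):
    """Greedy star-graph routing; returns the list of nodes visited."""
    node, path = list(src), [tuple(src)]
    q = len(node)

    def move(i):                              # dimension-i link: swap x1 and xi
        node[0], node[i - 1] = node[i - 1], node[0]
        path.append(tuple(node))

    for j in range(q, 1, -1):                 # adjust positions q, q-1, ..., 2
        if node[j - 1] == dst[j - 1]:
            continue
        if node[0] != dst[j - 1]:
            move(node.index(dst[j - 1]) + 1)  # bring the needed symbol to the front
        move(j)                               # then send it home
    return path

path = star_route((1, 5, 4, 3, 6, 2), (3, 4, 6, 2, 1, 5))
assert path[-1] == (3, 4, 6, 2, 1, 5) and len(path) - 1 <= 2 * 6 - 3  # 9 moves here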
Star's Sublogarithmic Degree and Diameter

d = Θ(q) and D = Θ(q); but how is q related to the number p of nodes?

p = q! ≈ e^(–q) q^q (2πq)^(1/2)   [using Stirling's approximation to q!]

ln p ≈ –q + (q + 1/2) ln q + ln(2π)/2 = Θ(q log q), or q = Θ(log p / log log p)

Hence, node degree and diameter are sublogarithmic; the star graph is asymptotically optimal to within a constant factor with regard to Moore's diameter lower bound.

Routing on star graphs is simple and reasonably efficient; however, virtually all other algorithms are more complex than the corresponding algorithms on hypercubes.

Network diameter    4    5    6    7    8    9
Star nodes         24   --  120  720   --  5040
Hypercube nodes    16   32   64  128  256  512
The Star-Connected Cycles Network

Replace the degree-(q – 1) nodes with (q – 1)-cycles. This leads to a scalable version of the star graph whose node degree of 3 does not grow with size.

The diameter of SCC is about the same as that of a comparably sized CCC network. However, routing and other algorithms for SCC are more complex.

Fig. 16.4 The four-dimensional star-connected cycles network.
Pancake Networks

Similar to star networks in terms of node degree and diameter. The dimension-i neighbor is obtained by "flipping" (reversing) the first i symbols; hence the name "pancake."

We need two flips per symbol in the worst case; D ≤ 2q – 3.

Example:
  Source node                      1 5 4 3 6 2
  Dimension-2 link to              5 1 4 3 6 2
  Dimension-6 link to              2 6 3 4 1 5   (last 2 symbols now adjusted)
  Dimension-4 link to              4 3 6 2 1 5   (last 4 symbols now adjusted)
  Dimension-2 link (destination)   3 4 6 2 1 5
Cayley Networks

Group: a semigroup with an identity element and inverses for all elements.
Example 1: The integers form a group under addition (though not under multiplication, which lacks inverses).
Example 2: Permutations, with the composition operator, form a group.

Cayley graph: node labels are taken from a group G, and a subset S of G defines the connectivity via the group operator ⊗: node x is connected to node y iff x ⊗ γ = y for some γ ∈ S. Elements of S are "generators" of G if every element of G can be expressed as a finite product of their powers.

Star and pancake networks are instances of Cayley graphs.
Star as a Cayley Network

Four-dimensional star: the group G of the permutations of {1, 2, 3, 4}.

The generators are the following permutations (in cycle notation):
(1 2) (3) (4)
(1 3) (2) (4)
(1 4) (2) (3)

The identity element is (1) (2) (3) (4).

Fig. 16.3 The four-dimensional star graph, with edges corresponding to the generators (e.g., node 1234 connects to 4231 via (1 4) (2) (3) and to 3214 via (1 3) (2) (4)).


16.3 Ring-Based Networks

Rings are simple, but have low performance and lack robustness. Hence, a variety of multilevel and augmented ring networks have been proposed.

Fig. 16.5 A 64-node ring-of-rings architecture composed of eight 8-node local rings and one second-level ring; a message travels from source S over its local ring and the second-level ring to the remote destination D.
Chordal Ring Networks

Fig. 16.6 A unidirectional ring, two chordal rings, and node connectivity in general: with skips s_0 = 1, s_1, …, s_(k–1), node v is connected to v ± s_0, v ± s_1, …, v ± s_(k–1).

Routing algorithm: greedy routing. Given one chord type s, the optimal length for s is approximately p^(1/2).
Chordal Rings Compared to Torus Networks

The ILLIAC IV interconnection scheme, often described as an 8 × 8 mesh or torus, was really a 64-node chordal ring with skip distance 8.

Fig. 16.7 Chordal rings redrawn to show their similarity to torus networks.

Perfect Difference Networks

A class of chordal rings, recently studied at UCSB (two-part paper in IEEE TPDS, August 2005), have a diameter of D = 2. Perfect difference set {0, 1, 3}: every number in the range 1-6 mod 7 can be formed as the difference of two numbers in the set.
Periodically Regular Chordal Rings

Fig. 16.8 Periodically regular chordal ring: the p nodes are divided into p/g groups of g nodes (group i holding nodes ig to (i + 1)g – 1), and a skip link leads to the node in the same relative position within the destination group.

Modified greedy routing: first route to the head of a group; then use pure greedy routing.


Some Properties of PRC Rings

Fig. 16.9 VLSI layout for a 64-node periodically regular chordal ring.

Some skip links can be removed for a cost-performance tradeoff; this is similar in nature to CCC with longer cycles.

Fig. 16.10 A PRC ring redrawn as a butterfly- or ADM-like network, with dimensions corresponding to the skips s_1 = nil, s_2 = 4, s_3 = 8, s_4 = 16.
16.4 Composite or Hybrid Networks

Motivation: combine the connectivity schemes from two (or more) "pure" networks in order to:
Achieve some advantages from each structure
Derive network sizes that are otherwise unavailable
Realize any number of performance / cost benefits

A very large set of combinations have been tried; new combinations are still being discovered.


Composition by Cartesian Product Operation

Properties of the product graph G = G′ × G″ (Fig. 13.4 shows, e.g., how a 3-by-2 torus arises as a product):
Nodes are labeled (x′, x″), with x′ ∈ V′ and x″ ∈ V″
p = p′ p″
d = d′ + d″
D = D′ + D″
Δ = Δ′ + Δ″

Routing: G′-first routing, (x′, x″) → (y′, x″) → (y′, y″)

Broadcasting, semigroup, and parallel prefix computations compose similarly from the component networks' algorithms.

Fig. 13.4 Examples of product graphs.
Other Properties and Examples of Product Graphs

If G′ and G″ are Hamiltonian, then the p′ × p″ torus is a subgraph of G′ × G″.

For results on connectivity and fault diameter, see [Day00], [AlAy02].

Fig. 16.11 Mesh of trees (Section 12.6) compared with mesh-connected trees (the product of two trees).
16.5 Hierarchical (Multilevel) Networks

We have already seen several examples of hierarchical networks: multilevel buses (Fig. 4.9), CCC, and PRC rings.

Fig. 16.13 Hierarchical or multilevel bus network.

Hierarchical networks can be defined from the bottom up (take first-level ring networks and interconnect them as a hypercube) or from the top down (take a top-level hypercube and replace its nodes with given networks).
Example: Mesh of Meshes Networks

Fig. 16.12 The mesh of meshes network exhibits greater modularity than a mesh (each cluster connects to neighboring clusters through its N, E, W, S ports).

The same idea can be used to form ring of rings, hypercube of hypercubes, complete graph of complete graphs, and more generally, X-of-Xs networks. When the network topologies at the two levels are different, we have X-of-Ys networks. This is generalizable to three levels (X of Ys of Zs networks), four levels, or more.
Example: Swapped Networks

Build a p^2-node network using p-node building blocks (nuclei or clusters) by connecting node i in cluster j to node j in cluster i. We can thus square the network size by adding one (level-2) link per node.

Also known in the literature as OTIS (optical transpose interconnect system) networks.

[Figure: a two-level swapped network with the 2 × 2 mesh as its nucleus; each node is labeled by its cluster # and node #, with level-1 links inside clusters and level-2 links between them.]
16.6 Multistage Interconnection Networks

Numerous indirect or multistage interconnection networks (MINs) have been proposed for, or used in, parallel computers. They differ in topological, performance, robustness, and realizability attributes.

We have already seen the butterfly, hierarchical bus, Beneš, and ADM networks.

Fig. 4.8 (modified) The sea of indirect interconnection networks.
Self-Routing Permutation Networks

Do there exist self-routing permutation networks? (The butterfly network is self-routing, but it is not a permutation network.)

Permutation routing through a MIN is the same problem as sorting.

Fig. 16.14 Example of sorting on a binary radix-sort network:

  Input     After sorting   After sorting by   After sorting
            by the MSB      the middle bit     by the LSB
  7 (111)   0               0                  0 (000)
  0 (000)   1               1                  1 (001)
  4 (100)   3               3                  2 (010)
  6 (110)   2               2                  3 (011)
  1 (001)   7               4                  4 (100)
  5 (101)   4               5                  5 (101)
  3 (011)   6               7                  6 (110)
  2 (010)   5               6                  7 (111)
Partial List of Important MINs

Augmented data manipulator (ADM): aka unfolded PM2I (Fig. 15.12)
Banyan: Any MIN with a unique path between any input and any output (e.g., butterfly)
Baseline: Butterfly network with nodes labeled differently
Beneš: Back-to-back butterfly networks, sharing one column (Figs. 15.9-10)
Bidelta: A MIN that is a delta network in either direction
Butterfly: aka unfolded hypercube (Figs. 6.9, 15.4-5)
Data manipulator: Same as ADM, but with switches in a column restricted to the same state
Delta: Any MIN for which the outputs of each switch have distinct labels (say 0 and 1 for 2 × 2 switches) and the path label, composed by concatenating the switch output labels leading from an input to an output, depends only on the output
Flip: Reverse of the omega network (inputs and outputs interchanged)
Indirect cube: Same as butterfly or omega
Omega: Multistage shuffle-exchange network; isomorphic to butterfly (Fig. 15.19)
Permutation: Any MIN that can realize all permutations
Rearrangeable: Same as permutation network
Reverse baseline: Baseline network, with the roles of inputs and outputs interchanged
