Chapter 4
Flynn’s Taxonomy
Computers can be classified by numbers of instruction
and data streams
✤ SISD: single instruction stream, single data stream
• pipelined computers
✤ SIMD: single instruction stream, multiple data streams
✤ MIMD: multiple instruction streams, multiple data streams
✤ SPMD: single program, multiple data, in which all processors execute same program, each on different data
• easier to program than true MIMD but more flexible than SIMD
✤ Most parallel computers today have MIMD architecture but are
programmed in SPMD style
Parallel Computer Architectures
Parallel architectural design issues
✤ Processor coordination: synchronous or asynchronous?
✤ Memory organization: distributed or shared?
✤ Address space: local or global?
✤ Memory access: uniform or nonuniform?
✤ Granularity: coarse or fine?
✤ Scalability: additional processors used efficiently?
✤ Interconnection network: topology, switching, routing?
Major Architectural Paradigms
Memory organization is fundamental architectural design
choice: How are processors connected to memory?
[Figure: shared-memory organization (processors P0 … PN connected to memory modules M0 … MN through network) and distributed-memory organization (processor/memory pairs connected by network)]
Cache Coherence
✤ In shared memory multiprocessor, same cache line in main memory
may reside in cache of more than one processor, so values could be
inconsistent
✤ Cache coherence protocol ensures consistent view of memory
regardless of modifications of values in cache of any processor
✤ Cache coherence protocol keeps track of state of each cache line
✤ MESI protocol is typical
• M, modified: has been modified, and resides in no other cache
• E, exclusive: not yet modified, and resides in no other cache
• S, shared: not yet modified, and resides in multiple caches
• I, invalid: may be inconsistent, value not to be trusted
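As a rough illustration (not from the slides), here is a minimal Python sketch of how the four MESI states might respond to local and remote accesses; the transition rules follow the standard protocol, and all function and variable names are illustrative.

from enum import Enum

class State(Enum):
    M = "modified"   # dirty, only copy in any cache
    E = "exclusive"  # clean, only copy in any cache
    S = "shared"     # clean, copies may exist in other caches
    I = "invalid"    # value not to be trusted

def on_local_read(state, others_have_copy):
    # A read miss fetches the line; it enters E if no other cache holds it, else S
    if state is State.I:
        return State.S if others_have_copy else State.E
    return state  # M, E, S all satisfy the read locally

def on_local_write(state):
    # Writing makes the line Modified; copies in other caches must be invalidated
    return State.M

def on_remote_read(state):
    # Another processor reads the line: a dirty copy is written back, both become Shared
    if state in (State.M, State.E):
        return State.S
    return state

def on_remote_write(state):
    # Another processor writes the line: the local copy becomes Invalid
    return State.I

# Example: a line read with no other copies, then written, then read remotely
s = on_local_read(State.I, others_have_copy=False)   # -> E
s = on_local_write(s)                                # -> M
s = on_remote_read(s)                                # -> S (after write-back)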
Cache Coherence
✤ Small systems often implement cache coherence using bus snooping
✤ Larger systems typically use directory-based protocol that keeps track
of all cache lines in system
✤ Coherence traffic can hurt application performance, especially if
same cache line is modified frequently by different processors, as in
false sharing
Hybrid Parallel Architectures
✤ Most large computers today have hierarchical combination of shared
and distributed memory, with memory shared locally within SMP
nodes but distributed globally across nodes interconnected by
network
Communication Networks
✤ Access to remote data requires communication
✤ Direct connections among p processors would require O(p²) wires
and communication ports, which is infeasible for large p
✤ Limited connectivity necessitates routing data through intermediate
processors or switches
✤ Topology of network affects algorithm design, implementation, and
performance
Common Network Topologies
[Figure: bus, star, crossbar, and 1-D mesh topologies]
Common Network Topologies
[Figure: binary tree, butterfly, and hypercube (0-cube, 1-cube, 2-cube, …) topologies]
Graph Terminology
✤ Graph: pair (V, E), where V is set of vertices or nodes connected by set
E of edges
✤ Complete graph: graph in which any two nodes are connected by an
edge
✤ Path: sequence of contiguous edges in graph
✤ Connected graph: graph in which any two nodes are connected by a
path
✤ Cycle: path of length greater than one that connects a node to itself
✤ Tree: connected graph containing no cycles
✤ Spanning tree: subgraph that includes all nodes of given graph and is
also a tree
Graph Models
✤ Graph model of network: nodes are processors (or switches or
memory units), edges are communication links
✤ Graph model of computation: nodes are tasks, edges are data
dependences between tasks
✤ Mapping task graph of computation to network graph of target
computer is instance of graph embedding
✤ Distance between two nodes: number of edges (hops) in shortest path
between them
Network Properties
✤ Degree: maximum number of edges incident on any node
• determines number of communication ports per processor
✤ Diameter: maximum distance between any pair of nodes
• determines maximum communication delay between processors
✤ Bisection width: smallest number of edges whose removal splits graph
into two subgraphs of equal size
• determines ability to support simultaneous global communication
✤ Edge length: maximum physical length of any wire
• may be constant or variable as number of processors varies
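For concreteness, a small sketch (not from the slides) that computes degree, diameter, and bisection width from an explicit edge list; for a d-dimensional hypercube the known values are degree d, diameter d, and bisection width 2^(d−1), which the brute-force check below reproduces for d = 3. The bisection search is exponential and only feasible for tiny graphs.

from itertools import combinations

def hypercube_edges(d):
    """Edges of a d-cube: nodes whose labels differ in exactly one bit."""
    n = 2 ** d
    return [(u, u ^ (1 << b)) for u in range(n) for b in range(d) if u < u ^ (1 << b)]

def degree(n, edges):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return max(deg)

def diameter(n, edges):
    """Maximum over all node pairs of the shortest-path length (BFS from every node)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    worst = 0
    for s in range(n):
        dist, frontier = {s: 0}, [s]
        while frontier:
            nxt = []
            for u in frontier:
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        nxt.append(w)
            frontier = nxt
        worst = max(worst, max(dist.values()))
    return worst

def bisection_width(n, edges):
    """Brute force: fewest edges crossing any split into two equal halves."""
    best = len(edges)
    for half in combinations(range(n), n // 2):
        half = set(half)
        cut = sum((u in half) != (v in half) for u, v in edges)
        best = min(best, cut)
    return best

d = 3
n, edges = 2 ** d, hypercube_edges(d)
print(degree(n, edges), diameter(n, edges), bisection_width(n, edges))  # 3 3 4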
Graph Embedding
✤ Graph embedding: φ: Vs → Vt maps nodes in source graph Gs = (Vs, Es)
to nodes in target graph Gt = (Vt, Et)
✤ Edges in Gs mapped to paths in Gt
✤ Load: maximum number of nodes in Vs mapped to same node in Vt
✤ Congestion: maximum number of edges in Es mapped to paths
containing same edge in Et
✤ Dilation: maximum distance between any two nodes φ(u), φ(v) ∈ Vt
such that (u,v) ∈ Es
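A small sketch (my own, illustrative) that measures load, congestion, and dilation of a given embedding φ by routing each source edge along a BFS shortest path in the target graph; the 4-node ring/mesh example at the end is hypothetical.

from collections import deque

def shortest_path(adj, s, t):
    """BFS path from s to t in the target graph, given as an adjacency dict."""
    prev = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            break
        for w in adj[u]:
            if w not in prev:
                prev[w] = u
                q.append(w)
    path, u = [], t
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

def embedding_metrics(src_edges, tgt_adj, phi):
    """Load, congestion, and dilation of embedding phi (dict: source node -> target node)."""
    load = max(list(phi.values()).count(v) for v in set(phi.values()))
    edge_use = {}        # how many source edges are routed over each target edge
    dilation = 0
    for u, v in src_edges:
        path = shortest_path(tgt_adj, phi[u], phi[v])
        dilation = max(dilation, len(path) - 1)
        for a, b in zip(path, path[1:]):
            e = tuple(sorted((a, b)))
            edge_use[e] = edge_use.get(e, 0) + 1
    congestion = max(edge_use.values()) if edge_use else 0
    return load, congestion, dilation

# Hypothetical example: embed a 4-node ring into a 2x2 mesh (itself a 4-cycle),
# so a perfect embedding with load = congestion = dilation = 1 is expected
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
mesh_adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
phi = {0: 0, 1: 1, 2: 3, 3: 2}
print(embedding_metrics(ring_edges, mesh_adj, phi))   # (1, 1, 1)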
Graph Embedding
✤ Uniform load helps balance work across processors
✤ Minimizing congestion optimizes use of available bandwidth of
network links
✤ Minimizing dilation keeps nearest-neighbor communications in
source graph as short as possible in target graph
✤ Perfect embedding has load, congestion, and dilation 1, but not
always possible
✤ Optimal embedding difficult to determine (NP-complete, in general),
so heuristics used to determine good embedding
Graph Embedding Examples
✤ For some important cases, good or optimal embeddings are known
Gray Code
✤ Gray code: ordering of integers 0 to 2ⁿ−1 such that consecutive
members differ in exactly one bit position
✤ Example: binary reflected Gray code of length 16
Computing Gray Code
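The computation itself is not reproduced on this slide, so as a sketch: the binary reflected Gray code of i is i XOR (i >> 1), and its inverse is a cumulative XOR over right shifts. The listing regenerates the length-16 example mentioned above and checks the one-bit-difference property.

def gray(i):
    """Binary reflected Gray code of i."""
    return i ^ (i >> 1)

def gray_inverse(g):
    """Recover i from its Gray code by XOR-ing together all right shifts of g."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

# Length-16 (4-bit) binary reflected Gray code; consecutive codes differ in one bit,
# and so do the first and last, which is what makes the ring embedding below work
codes = [gray(i) for i in range(16)]
print([format(c, "04b") for c in codes])
assert all(bin(a ^ b).count("1") == 1 for a, b in zip(codes, codes[1:] + codes[:1]))
assert all(gray_inverse(gray(i)) == i for i in range(16))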
Hypercube Embeddings
✤ Visiting nodes of hypercube in Gray code order gives Hamiltonian
cycle, which embeds ring in hypercube
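A short sketch (assumed mapping, consistent with the Gray code above): ring position k is placed on hypercube node gray(k), and the check confirms every ring edge, including the wraparound edge, lands on a hypercube edge.

def gray(i):
    return i ^ (i >> 1)

def ring_to_hypercube(d):
    """Map ring positions 0 .. 2^d - 1 to hypercube node labels in Gray code order."""
    return [gray(k) for k in range(2 ** d)]

d = 4
p = 2 ** d
phi = ring_to_hypercube(d)
# Each ring edge (k, k+1 mod p) lands on a hypercube edge: labels differ in exactly one bit
assert all(bin(phi[k] ^ phi[(k + 1) % p]).count("1") == 1 for k in range(p))
print(phi)   # Hamiltonian cycle through all 16 nodes of the 4-cube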
Message Routing
✤ Messages sent between nodes that are not directly connected must be
routed through intermediate nodes
✤ Message routing algorithms can be minimal or nonminimal, static or
dynamic, deterministic or randomized, and circuit switched or packet switched
Routing Schemes
✤ Store-and-forward routing: entire message is received and stored
at each node before being forwarded to next node on path, so
Tmsg = (ts + tw L) D, where D = distance in hops
✤ Cut-through (or wormhole) routing: message broken into segments
that are pipelined through network, with each segment
forwarded as soon as it is received, so Tmsg = ts + tw L + th D,
where th = incremental time per hop
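To compare the two cost models numerically, a small sketch with illustrative parameter values (here ts is startup time, tw per-word transfer time, th per-hop time, L message length in words, D distance in hops; the numbers are made up).

def store_and_forward(ts, tw, th, L, D):
    # Entire message is stored at each of the D hops before moving on
    return (ts + tw * L) * D

def cut_through(ts, tw, th, L, D):
    # Segments are pipelined, so only the leading segment pays the per-hop cost
    return ts + tw * L + th * D

# Made-up values: 10 us startup, 0.01 us/word, 0.1 us/hop, 10000-word message, 5 hops
params = dict(ts=10.0, tw=0.01, th=0.1, L=10_000, D=5)
print("store-and-forward:", store_and_forward(**params), "us")  # (10 + 100) * 5 = 550
print("cut-through:", cut_through(**params), "us")              # 10 + 100 + 0.5 = 110.5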
Communication Concurrency
✤ For given communication system, it may or may not be possible for
each node to send and receive messages simultaneously on one or more of its links
Communication Concurrency
✤ When multiple messages contend for network bandwidth, time
required to send message modeled by Tmsg = ts + tw S L, where S is
number of messages sent concurrently over same communication
link
✤ In effect, each message uses 1/S of available bandwidth
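A one-function sketch of the contention model with the same made-up parameters as before: S concurrent messages on a link multiply the bandwidth term by S.

def t_msg(ts, tw, L, S=1):
    # S messages sharing the same link each get 1/S of the bandwidth
    return ts + tw * S * L

print(t_msg(ts=10.0, tw=0.01, L=10_000))        # 110.0 us with the link to itself
print(t_msg(ts=10.0, tw=0.01, L=10_000, S=4))   # 410.0 us when four messages contend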
Collective Communication
✤ Collective communication: multiple nodes communicating
simultaneously in systematic pattern, such as
• broadcast: one-to-all
• reduction: all-to-one
• scatter/gather: one-to-all/all-to-one
• total or complete exchange: personalized all-to-all
• scan or prefix
• circular shift
• barrier
Broadcast
✤ Broadcast: source node sends same message to each of p−1 other
nodes
✤ Generic broadcast algorithm generates spanning tree, with source
node as root
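The generic spanning-tree algorithm is not detailed here, so a sketch assuming a hypercube (binomial-tree) broadcast: in step k every node that already holds the message forwards it across dimension k, so all p = 2^d nodes are reached in log2 p steps.

def binomial_broadcast(d, root=0):
    """Return (step, sender, receiver) triples for a broadcast on a 2^d-node hypercube."""
    have = {root}
    schedule = []
    for k in range(d):
        for u in sorted(have):
            v = u ^ (1 << k)          # partner across dimension k
            if v not in have:
                schedule.append((k, u, v))
        have |= {u ^ (1 << k) for u in list(have)}
    return schedule

for step, u, v in binomial_broadcast(3):
    print(f"step {step}: {u} -> {v}")
# 1 + 2 + 4 = 7 messages over log2(8) = 3 steps reach all 8 nodes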
Broadcast
✤ Cost of broadcast depends on network topology
✤ For long messages, available bandwidth can be used more fully by breaking message into segments and
• sending each segment along different spanning tree having same root
Reduction
✤ Reduction: data from all p nodes are combined by applying specified
associative operation ⊕ (e.g., sum, product, max, min, logical OR,
logical AND) to produce overall result
✤ Generic reduction algorithm uses spanning tree with destination node as
root; partial results are combined at each node as they move from leaves toward root
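A matching sketch of reduction on a hypercube, run in the reverse direction of the broadcast above: in each step the upper half of the remaining subcube sends its partial result to a partner in the lower half, so node 0 holds the full result after log2 p steps (indices and helper names are illustrative).

def hypercube_reduce(values, op):
    """Reduce a list of 2^d values (one per node) onto node 0 by recursive halving."""
    vals = list(values)
    p = len(vals)                     # assumes p is a power of two
    d = p.bit_length() - 1
    for k in reversed(range(d)):      # reverse of the broadcast dimension order
        for u in range(p):
            if u & (1 << k) and not u >> (k + 1):       # active sender in this step
                vals[u ^ (1 << k)] = op(vals[u ^ (1 << k)], vals[u])
    return vals[0]

print(hypercube_reduce(range(8), lambda a, b: a + b))   # 28 = 0 + 1 + ... + 7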
Reduction
✤ Subsequent broadcast required if all nodes need result of reduction
✤ Cost of reduction depends on network topology
Multinode Broadcast
✤ Multinode broadcast: each of p nodes sends message to all other nodes
(all-to-all)
✤ Logically equivalent to p broadcasts, one from each node, but
efficiency can often be enhanced by overlapping broadcasts
✤ Total time for multinode broadcast depends strongly on concurrency
supported by communication system
✤ Multinode broadcast need be no more costly than standard broadcast
if aggressive overlapping of communication is supported
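One concrete way to overlap the broadcasts is a ring all-gather, sketched below under the assumption that each node can send to its successor and receive from its predecessor in the same step; after p − 1 steps every node holds every block, with all links busy throughout.

def ring_allgather(blocks):
    """Simulate multinode broadcast on a ring: p-1 steps, each node forwards what it last received."""
    p = len(blocks)
    gathered = [[b] for b in blocks]       # node i starts with only its own block
    in_flight = list(blocks)               # block each node will forward this step
    for _ in range(p - 1):
        received = [in_flight[(i - 1) % p] for i in range(p)]   # all links used concurrently
        for i in range(p):
            gathered[i].append(received[i])
        in_flight = received
    return gathered

print(ring_allgather(["a", "b", "c", "d"]))
# every node ends up with all four blocks after 3 steps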
Multinode Broadcast
✤ Implementation of multinode broadcast in specific networks
Personalized Communication
✤ Personalized collective communication: each node sends (or receives)
distinct message to (or from) each other node
Scan or Prefix
✤ Scan (or prefix): given data values x0, x1, . . ., xp−1, one per node, along
with associative operation ⊕, compute sequence of partial results s0,
s1, . . ., sp−1, where sk = x0 ⊕ x1 ⊕ ⋅ ⋅ ⋅ ⊕ xk and sk is to reside on node k,
k = 0, . . ., p − 1
✤ Scan can be implemented similarly to multinode broadcast, except
intermediate results received by each node are selectively combined
depending on sending node's number, before being forwarded
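A sketch of a hypercube-style inclusive scan (an assumed pattern, since the slide describes the idea only in words): each node keeps a running prefix and a running subcube total, exchanges the total with its partner across each dimension, and folds the incoming total into its prefix only when the partner has a lower node number.

def hypercube_scan(values, op):
    """Inclusive prefix scan on p = 2^d nodes: node k ends with values[0] op ... op values[k]."""
    p = len(values)                  # assumes p is a power of two
    d = p.bit_length() - 1
    prefix = list(values)            # running prefix result at each node
    total = list(values)             # running total of the subcube each node belongs to
    for k in range(d):
        incoming = [total[i ^ (1 << k)] for i in range(p)]   # simulate pairwise exchange
        for i in range(p):
            if i & (1 << k):
                # partner has a lower node number: its subtotal precedes ours
                prefix[i] = op(incoming[i], prefix[i])
                total[i] = op(incoming[i], total[i])
            else:
                # partner has a higher node number: only the subcube total changes
                total[i] = op(total[i], incoming[i])
    return prefix

print(hypercube_scan(list(range(1, 9)), lambda a, b: a + b))   # [1, 3, 6, 10, 15, 21, 28, 36]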
Circular Shift
✤ Circular k-shift: for 0 < k < p, node i sends data to node (i + k) mod p
✤ Circular shift implemented naturally in ring network, and by
embedding ring in other networks
Barrier
✤ Barrier: synchronization point that all processes must reach before
any process is allowed to proceed beyond it
✤ For distributed-memory systems, barrier usually implemented by
message passing, using algorithm similar to all-to-all