
Introduction to High Performance Computing for Scientists and Engineers
Chapter 4: Parallel Computers
Parallel Computers
✤ World’s fastest supercomputers have always exploited some degree
of parallelism in their hardware
✤ With advent of multicore processors, virtually all computers today
are parallel computers, even desktop and laptop computers
✤ Today’s largest supercomputers have hundreds of thousands of cores
and soon will have millions of cores
✤ Parallel computers require more complex algorithms and
programming to divide computational work among multiple
processors and coordinate their activities
✤ Efficient use of additional processors becomes increasingly difficult
as total number of processors grows (scalability)

2
Flynn’s Taxonomy
Computers can be classified by numbers of instruction
and data streams
✤ SISD: single instruction stream, single data stream

• conventional serial computers


✤ SIMD: single instruction stream, multiple data streams

• vector or data parallel computers


✤ MISD: multiple instruction streams, single data stream

• pipelined computers
✤ MIMD: multiple instruction streams, multiple data streams

• general purpose parallel computers


3
SPMD Programming Style
SPMD (single program, multiple data): all processors
execute same program, but each operates on different
portion of problem data

✤ Easier to program than true MIMD but more flexible than SIMD
✤ Most parallel computers today have MIMD architecture but are
programmed in SPMD style
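A minimal sketch of the SPMD style using MPI; the problem (a block-partitioned partial sum) and the array size N are illustrative, not from the slides:

/* spmd_sum.c: every rank runs this same program (SPMD), but each
   works on its own block of a global index range. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const long N = 1000000;          /* assumed global problem size */
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank sums only its own block of indices */
    long lo = rank * N / size, hi = (rank + 1) * N / size;
    double local = 0.0, global = 0.0;
    for (long i = lo; i < hi; i++)
        local += 1.0 / (i + 1);

    /* combine the per-rank results into one value on rank 0 */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}

Every process executes the same binary (e.g. mpirun -np 4 ./spmd_sum); its rank determines which portion of the data it handles.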

4
Parallel Computer Architectures
Parallel architectural design issues
✤ Processor coordination: synchronous or asynchronous?
✤ Memory organization: distributed or shared?
✤ Address space: local or global?
✤ Memory access: uniform or nonuniform?
✤ Granularity: coarse or fine?
✤ Scalability: additional processors used efficiently?
✤ Interconnection network: topology, switching, routing?

5
Major Architectural Paradigms
Memory organization is fundamental architectural design
choice: How are processors connected to memory?

distributed-memory multicomputer: each processor P0, P1, ..., PN has its own local memory M0, M1, ..., MN, and processors communicate with one another over a network

shared-memory multiprocessor: processors P0, P1, ..., PN are connected through a network to a common set of memory modules M0, M1, ..., MN

Can also have hybrid combinations of these


6
Parallel Programming Styles
✤ Shared-memory multiprocessor
• Entire problem data stored in common memory
• Programs do loads and stores from common (and typically
remote) memory
• Protocols required to maintain data integrity
• Often exploit loop-level parallelism using pool of tasks paradigm
✤ Distributed-memory multicomputer
• Problem data partitioned among private processor memories
• Programs communicate by sending messages between processors
• Messaging protocol provides synchronization
• Often exploit domain decomposition parallelism
7
Distributed vs. Shared Memory
                              distributed memory    shared memory
scalability                   easier                harder
data mapping                  harder                easier
data integrity                easier                harder
incremental parallelization   harder                easier
automatic parallelization     harder                easier
8
Shared-Memory Computers
✤ UMA (uniform memory access): same latency and bandwidth for all
processors and memory locations
• sometimes called SMP (symmetric multiprocessor)
• often implemented using bus, crossbar, or multistage network
• multicore processor is typically SMP
✤ NUMA (nonuniform memory access): latency and bandwidth vary
with processor and memory location
• some memory locations “closer” than others, with different access
speeds
• consistency of multiple caches is crucial to correctness
• ccNUMA: cache coherent nonuniform memory access

9
Cache Coherence
✤ In shared memory multiprocessor, same cache line in main memory
may reside in cache of more than one processor, so values could be
inconsistent
✤ Cache coherence protocol ensures consistent view of memory
regardless of modifications of values in cache of any processor
✤ Cache coherence protocol keeps track of state of each cache line
✤ MESI protocol is typical
• M, modified: has been modified, and resides in no other cache
• E, exclusive: not yet modified, and resides in no other cache
• S, shared: not yet modified, and resides in multiple caches
• I, invalid: may be inconsistent, value not to be trusted
10
Cache Coherence
✤ Small systems often implement cache coherence using bus snoop
✤ Larger systems typically use directory-based protocol that keeps track
of all cache lines in system
✤ Coherence traffic can hurt application performance, especially if
same cache line is modified frequently by different processors, as in
false sharing
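A small sketch of the standard fix for false sharing, padding each counter to its own cache line (OpenMP; the 64-byte line size and iteration count are assumptions):

/* Each thread repeatedly increments its own counter.  If the two
   counters sit in the same cache line, the coherence protocol bounces
   the line between cores (false sharing); padding each counter to its
   own line avoids this. */
#include <stdio.h>
#include <omp.h>

#define LINE 64                                 /* assumed cache-line size in bytes */
struct padded { long v; char pad[LINE - sizeof(long)]; };

int main(void) {
    struct padded c[2] = {{0}, {0}};            /* one cache line per counter */
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 100000000L; i++)
            c[id].v++;                          /* no coherence traffic between threads */
    }
    printf("%ld %ld\n", c[0].v, c[1].v);
    return 0;
}

Replacing struct padded with a plain long c[2] places both counters in one line; the same loop then typically runs several times slower because of coherence traffic.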

11
Hybrid Parallel Architectures
✤ Most large computers today have hierarchical combination of shared
and distributed memory, with memory shared locally within SMP
nodes but distributed globally across nodes interconnected by
network

12
Communication Networks
✤ Access to remote data requires communication
✤ Direct connections among p processors would require O(p²) wires
and communication ports, which is infeasible for large p
✤ Limited connectivity necessitates routing data through intermediate
processors or switches
✤ Topology of network affects algorithm design, implementation, and
performance

13
Common Network Topologies

• 1-D mesh
• 1-D torus (ring)
• 2-D mesh
• 2-D torus
• bus
• star
• crossbar

14
Common Network Topologies

• binary tree
• butterfly
• hypercubes: 0-cube, 1-cube, 2-cube, 3-cube, 4-cube

15
Graph Terminology
✤ Graph: pair (V, E), where V is set of vertices or nodes connected by set
E of edges
✤ Complete graph: graph in which any two nodes are connected by an
edge
✤ Path: sequence of contiguous edges in graph
✤ Connected graph: graph in which any two nodes are connected by a
path
✤ Cycle: path of length greater than one that connects a node to itself
✤ Tree: connected graph containing no cycles
✤ Spanning tree: subgraph that includes all nodes of given graph and is
also a tree
16
Graph Models
✤ Graph model of network: nodes are processors (or switches or
memory units), edges are communication links
✤ Graph model of computation: nodes are tasks, edges are data
dependences between tasks
✤ Mapping task graph of computation to network graph of target
computer is instance of graph embedding
✤ Distance between two nodes: number of edges (hops) in shortest path
between them

17
Network Properties
✤ Degree: maximum number of edges incident on any node
• determines number of communication ports per processor
✤ Diameter: maximum distance between any pair of nodes
• determines maximum communication delay between processors
✤ Bisection width: smallest number of edges whose removal splits graph
into two subgraphs of equal size
• determines ability to support simultaneous global communication
✤ Edge length: maximum physical length of any wire
• may be constant or variable as number of processors varies

18
Network Properties
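For example, standard values of degree, diameter, and bisection width for some common topologies (p nodes; values assume p is a power of 2 or a perfect square where needed):

                     degree     diameter       bisection width
1-D mesh             2          p − 1          1
1-D torus (ring)     2          ⌊p/2⌋          2
2-D mesh (√p × √p)   4          2(√p − 1)      √p
hypercube            log p      log p          p/2
complete graph       p − 1      1              p²/4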

19
Graph Embedding
✤ Graph embedding: φ: Vs → Vt maps nodes in source graph Gs = (Vs, Es)
to nodes in target graph Gt = (Vt, Et)
✤ Edges in Gs mapped to paths in Gt
✤ Load: maximum number of nodes in Vs mapped to same node in Vt
✤ Congestion: maximum number of edges in Es mapped to paths
containing same edge in Et
✤ Dilation: maximum distance between any two nodes φ(u), φ(v) ∈ Vt
such that (u,v) ∈ Es

20
Graph Embedding
✤ Uniform load helps balance work across processors
✤ Minimizing congestion optimizes use of available bandwidth of
network links
✤ Minimizing dilation keeps nearest-neighbor communications in
source graph as short as possible in target graph
✤ Perfect embedding has load, congestion, and dilation 1, but not
always possible
✤ Optimal embedding difficult to determine (NP-complete, in general),
so heuristics used to determine good embedding

21
Graph Embedding Examples
✤ For some important cases, good or optimal embeddings are known

22
Gray Code
✤ Gray code: ordering of integers 0 to 2^n − 1 such that consecutive
members differ in exactly one bit position
✤ Example: binary reflected Gray code of length 16

23
Computing Gray Code
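A minimal sketch computing the binary reflected Gray code via the standard formula g(i) = i XOR (i >> 1); the code length n = 4 is illustrative:

/* gray.c: print the n-bit binary reflected Gray code.
   gray(i) = i ^ (i >> 1) lists all 2^n values so that consecutive
   entries differ in exactly one bit. */
#include <stdio.h>

unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    const int n = 4;                       /* 4-bit code: 16 entries */
    for (unsigned i = 0; i < (1u << n); i++)
        printf("%2u -> %2u\n", i, gray(i));
    return 0;
}

Visiting hypercube nodes in the order gray(0), gray(1), ..., gray(2^n − 1) makes each step a single-bit change, i.e. a hop to a hypercube neighbor, which is what the ring embedding on the next slides relies on.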

24
Hypercube Embeddings
✤ Visiting nodes of hypercube in Gray code order gives Hamiltonian
cycle, embedding ring in hypercube
✤ For mesh or torus of higher dimension, concatenating Gray codes for
each dimension gives embedding in hypercube
25
Communication Cost
✤ Simple model for time required to send message (move data)
between adjacent nodes: Tmsg = ts + tw L, where

• ts = startup time = latency (time to send message of length 0)


• tw = incremental transfer time per word (bandwidth = 1/tw)
• L = length of message in words
✤ For most real parallel systems, ts >> tw
✤ Caveats
• Some systems treat message of length 0 as special case or may
have minimum message size greater than 0
• Many systems use different protocols depending on message size
(e.g. 1-trip vs. 3-trip)
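A small sketch evaluating the model above under assumed parameters (ts = 10 µs and tw = 2 ns/word are illustrative, not measured values):

/* Cost model Tmsg = ts + tw * L from the slide above. */
#include <stdio.h>

double t_msg(double ts, double tw, double L) { return ts + tw * L; }

int main(void) {
    double ts = 10e-6;     /* assumed startup time: 10 microseconds */
    double tw = 2e-9;      /* assumed per-word time: 2 nanoseconds  */
    for (double L = 1; L <= 1e6; L *= 10)
        printf("L = %8.0f words  Tmsg = %.2e s\n", L, t_msg(ts, tw, L));
    return 0;
}

For small L the startup term dominates (ts >> tw), so short messages are latency-bound and long messages are bandwidth-bound.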

26
Message Routing
✤ Messages sent between nodes that are not directly connected must be
routed through intermediate nodes
✤ Message routing algorithms can be
• minimal or nonminimal, depending on whether shortest path is
always taken
• static or dynamic, depending on whether same path is always taken
• deterministic or randomized, depending on whether path is chosen
systematically or randomly
• circuit switched or packet switched, depending on whether entire
message goes along reserved path or is transferred in segments
that may not all take same path
✤ Most regular network topologies admit simple routing schemes that
are static, deterministic, and minimal
27
Message Routing Examples

28
Routing Schemes
✤ Store-and-forward routing: entire message is received and stored
at each node before being forwarded to next node on path, so
Tmsg = (ts + tw L) D, where D = distance in hops
✤ Cut-through (or wormhole) routing: message broken into segments
that are pipelined through network, with each segment
forwarded as soon as it is received, so Tmsg = ts + tw L + th D,
where th = incremental time per hop
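A small sketch comparing the two formulas above under assumed parameters (ts, tw, th, L, and the hop counts D are illustrative):

/* Store-and-forward vs. cut-through cost for a message of L words
   crossing D hops, using the formulas above. */
#include <stdio.h>

int main(void) {
    double ts = 10e-6, tw = 2e-9, th = 0.5e-6;   /* assumed parameters      */
    double L = 10000;                            /* message length in words */
    for (int D = 1; D <= 16; D *= 2) {
        double saf = (ts + tw * L) * D;          /* store-and-forward */
        double ct  = ts + tw * L + th * D;       /* cut-through       */
        printf("D = %2d  SAF = %.2e s  CT = %.2e s\n", D, saf, ct);
    }
    return 0;
}

In this model the distance D multiplies the whole message cost for store-and-forward but contributes only the small term th D for cut-through.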

29
Communication Concurrency
✤ For given communication system, it may or may not be possible for
each node to

• send message while receiving another simultaneously on same


communication link

• send message on one link while receiving simultaneously on


different link

• send or receive, or both, simultaneously on multiple links


✤ Depending on concurrency supported, time required for each step of
communication algorithm is effectively multiplied by appropriate
factor (e.g., degree of network graph)

30
Communication Concurrency
✤ When multiple messages contend for network bandwidth, time
required to send message modeled by Tmsg = ts + tw S L, where S is
number of messages sent concurrently over same communication
link
✤ In effect, each message uses 1/S of available bandwidth

31
Collective Communication
✤ Collective communication: multiple nodes communicating
simultaneously in systematic pattern, such as

• broadcast: one-to-all
• reduction: all-to-one

• multinode broadcast: all-to-all

• scatter/gather: one-to-all/all-to-one
• total or complete exchange: personalized all-to-all

• scan or prefix
• circular shift

• barrier
32
Collective Communication

33
Broadcast
✤ Broadcast: source node sends same message to each of p−1 other
nodes
✤ Generic broadcast algorithm generates spanning tree, with source
node as root
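Typical MPI libraries implement such spanning-tree broadcasts behind a single collective call; a minimal usage sketch (array contents are illustrative):

/* bcast.c: rank 0 broadcasts an array to all other ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {0, 0, 0, 0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) { data[0] = 1; data[1] = 2; data[2] = 3; data[3] = 4; }

    /* every rank calls the collective; afterwards all ranks hold rank 0's data */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);
    MPI_Finalize();
    return 0;
}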

34
Broadcast

35
Broadcast
✤ Cost of broadcast depends on network, for example

• 1-D mesh: Tbcast = (p − 1) (ts + tw L)

• 2-D mesh: Tbcast = 2 (√p − 1) (ts + tw L)

• hypercube: Tbcast = log p (ts + tw L)


✤ For long messages, bandwidth utilization may be enhanced by
breaking message into segments and either
• pipelining segments along single spanning tree, or
• sending each segment along different spanning tree having same root
• can also use scatter/allgather

36
Reduction
✤ Reduction: data from all p nodes are combined by applying specified
associative operation ⊕ (e.g., sum, product, max, min, logical OR,
logical AND) to produce overall result
✤ Generic reduction algorithm uses spanning tree, with destination
node as root and partial results combined as they move toward root

37
Reduction

38
Reduction
✤ Subsequent broadcast required if all nodes need result of reduction
✤ Cost of reduction depends on network, for example

• 1-D mesh: Treduce = (p − 1) (ts + (tw + tc) L)

• 2-D mesh: Treduce = 2 (√p − 1) (ts + (tw + tc) L)

• hypercube: Treduce = log p (ts + (tw + tc) L)


✤ Time per word for associative reduction operation, tc , is often much
smaller than tw , so is sometimes omitted from performance analyses
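A minimal MPI sketch of reduction, plus allreduce, which leaves the result on every rank and thus subsumes the subsequent broadcast mentioned above (the reduced values are illustrative):

/* reduce.c: sum one value per rank onto rank 0, then onto all ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* combine each rank's contribution using the associative op MPI_SUM */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %d\n", sum);

    /* allreduce: reduction whose result is needed by all ranks */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d also has sum = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}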

39
Multinode Broadcast
✤ Multinode broadcast: each of p nodes sends message to all other nodes
(all-to-all)
✤ Logically equivalent to p broadcasts, one from each node, but
efficiency can often be enhanced by overlapping broadcasts
✤ Total time for multinode broadcast depends strongly on concurrency
supported by communication system
✤ Multinode broadcast need be no more costly than standard broadcast
if aggressive overlapping of communication is supported
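In MPI, multinode broadcast corresponds to MPI_Allgather; a minimal sketch (the contributed values are illustrative):

/* allgather.c: each rank contributes one value; afterwards every rank
   holds the values from all p ranks (multinode broadcast). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = 100 + rank;                       /* this rank's message        */
    int *all = malloc(size * sizeof(int));       /* receives one int per rank  */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received: ", rank);
    for (int i = 0; i < size; i++) printf("%d ", all[i]);
    printf("\n");
    free(all);
    MPI_Finalize();
    return 0;
}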

40
Multinode Broadcast
✤ Implementation of multinode broadcast in specific networks

• 1D torus (ring): initiate broadcast from each node simultaneously
in same direction around ring; completes after p − 1 steps at same
cost as single-node broadcast
• 2D or 3D torus: apply ring algorithm successively in each
dimension
• hypercube: exchange messages pairwise in each of log p
dimensions, with messages concatenated at each stage
✤ Multinode broadcast can be used to implement reduction by
combining messages using associative operation instead of
concatenation, which avoids subsequent broadcast when result
needed by all nodes
41
Multinode Reduction
✤ Multinode reduction: each of p nodes is destination of reduction from
all other nodes
✤ Algorithms for multinode reduction are essentially reverse of
corresponding algorithms for multinode broadcast

42
Personalized Communication
✤ Personalized collective communication: each node sends (or receives)
distinct message to (or from) each other node

• scatter: analogous to broadcast, but root sends different message to
each other node
• gather: analogous to reduction, but data received by root are
concatenated rather than combined using associative operation
• total exchange: analogous to multinode broadcast, but each node
exchanges different message with each other node
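Minimal MPI sketches of scatter and total exchange, using MPI_Scatter and MPI_Alltoall (buffer contents are illustrative):

/* personalized.c: scatter from root, then a total exchange. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* scatter: root 0 sends a different int to each rank */
    int *src = NULL, piece;
    if (rank == 0) {
        src = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) src[i] = 10 * i;
    }
    MPI_Scatter(src, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* total exchange: every rank sends a distinct int to every other rank */
    int *sendbuf = malloc(p * sizeof(int)), *recvbuf = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) sendbuf[i] = 1000 * rank + i;
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: piece=%d, recv[0]=%d\n", rank, piece, recvbuf[0]);
    free(src); free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}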

43
Scan or Prefix
✤ Scan (or prefix): given data values x0, x1, . . ., xp−1, one per node, along
with associative operation ⊕, compute sequence of partial results s0,
s1, . . ., sp−1, where sk = x0 ⊕ x1 ⊕ ⋅ ⋅ ⋅ ⊕ xk and sk is to reside on node k,
k = 0, . . ., p − 1
✤ Scan can be implemented similarly to multinode broadcast, except
intermediate results received by each node are selectively combined
depending on sending node's numbering, before being forwarded
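A minimal MPI sketch using MPI_Scan, an inclusive prefix sum over ranks (the per-node values are illustrative):

/* scan.c: rank k ends up with x0 + x1 + ... + xk. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int x = rank + 1;     /* this rank's value x_k */
    int s = 0;
    MPI_Scan(&x, &s, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);   /* inclusive prefix sum */
    printf("rank %d: partial result s_%d = %d\n", rank, rank, s);

    MPI_Finalize();
    return 0;
}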

44
Circular Shift
✤ Circular k-shift: for 0 < k < p, node i sends data to node (i + k) mod p
✤ Circular shift implemented naturally in ring network, and by
embedding ring in other networks
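A minimal MPI sketch of a circular k-shift using MPI_Sendrecv (k = 1 here, purely illustrative):

/* shift.c: node i sends its value to node (i + k) mod p and receives
   from node (i - k + p) mod p, all in one step. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, p, k = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int dest = (rank + k) % p;
    int src  = (rank - k + p) % p;
    int out = rank, in = -1;
    MPI_Sendrecv(&out, 1, MPI_INT, dest, 0,
                 &in,  1, MPI_INT, src,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d received %d from rank %d\n", rank, in, src);

    MPI_Finalize();
    return 0;
}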

45
Barrier
✤ Barrier: synchronization point that all processes must reach before
any process is allowed to proceed beyond it
✤ For distributed-memory systems, barrier usually implemented by
message passing, using algorithm similar to all-to-all

• Some systems have special network for fast barriers


✤ For shared-memory systems, barrier usually implemented using
mechanism for enforcing mutual exclusion, such as test-and-set or
semaphore, or with atomic memory operations
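A minimal MPI sketch of a barrier; on shared-memory systems the analogous construct is, for example, an OpenMP barrier (#pragma omp barrier):

/* barrier.c: no rank passes the barrier until all ranks have reached it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d: before barrier\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);           /* synchronization point */
    printf("rank %d: after barrier\n", rank);

    MPI_Finalize();
    return 0;
}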

46
