Parallel Computing Platforms and Memory System Performance
John Mellor-Crummey
Department of Computer Science, Rice University
[email protected]
COMP 422, Lecture 9
SIMD
Single Instruction stream
single control unit dispatches the same instruction to processors
MIMD
Multiple Instruction streams
each processor has its own control unit
each processor can execute different instructions
Figures: SIMD architecture and MIMD architecture (PE = Processing Element)
SIMD Control
SIMD relies on the regular structure of computations
media processing, scientific kernels (e.g. linear algebra, FFT)
Activity mask
per PE predicated execution: turn off operations on certain PEs
each PE tests its own conditional and sets its own activity mask
a PE can conditionally perform an operation predicated on its mask value
Figure: SIMD conditional execution example (initial values and a conditional statement evaluated under the activity mask)
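The same idea shows up in short-vector instruction sets as per-lane masks. Below is a minimal C sketch (an illustration, not from the slides) using x86 SSE intrinsics: a comparison produces an all-ones/all-zeros mask per lane, which then selects between two results, the same activity-mask pattern described above.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = { 1.0f, -2.0f, 3.0f, -4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float c[4];

    __m128 va   = _mm_loadu_ps(a);
    __m128 vb   = _mm_loadu_ps(b);
    __m128 zero = _mm_set1_ps(0.0f);

    /* "activity mask": all-ones in lanes where a[i] > 0, all-zeros elsewhere */
    __m128 mask = _mm_cmpgt_ps(va, zero);

    /* per-lane select: c[i] = (a[i] > 0) ? a[i] : b[i] */
    __m128 vc = _mm_or_ps(_mm_and_ps(mask, va),
                          _mm_andnot_ps(mask, vb));

    _mm_storeu_ps(c, vc);
    for (int i = 0; i < 4; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}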
SIMD Examples
Many early parallel computers
Illiac IV, MPP, DAP, Connection Machine CM-1/2, and MasPar MP-1/2
Today
vector units: SSE, SSE2, AltiVec (Velocity Engine, VMX)
128-bit vector registers: 16 8-bit chars, 8 16-bit short ints, 4 32-bit ints, or 4 32-bit FP values
SSE2 also operates on 2 64-bit double-precision values
Scalar processing
traditional mode: one operation produces one result

SIMD processing
one operation produces multiple results

Figure: scalar add (X + Y → X+Y) vs. a 4-wide SIMD add ((x3, x2, x1, x0) + (y3, y2, y1, y0) → (x3+y3, x2+y2, x1+y1, x0+y0))
Data bytes must be contiguous in memory and aligned
Additional instructions are needed for:
masking data
moving data from one part of a register to another
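As a concrete version of the 4-wide add in the figure, here is a hedged C sketch with SSE intrinsics (illustrative, not from the slides); note the 16-byte alignment that _mm_load_ps / _mm_store_ps require, matching the contiguity and alignment constraint above.

#include <stdio.h>
#include <stdalign.h>
#include <xmmintrin.h>

int main(void) {
    /* 16-byte alignment is required for _mm_load_ps / _mm_store_ps */
    alignas(16) float x[4]   = { 1, 2, 3, 4 };
    alignas(16) float y[4]   = { 10, 20, 30, 40 };
    alignas(16) float sum[4];

    __m128 vx = _mm_load_ps(x);              /* loads x0..x3 in one instruction */
    __m128 vy = _mm_load_ps(y);
    _mm_store_ps(sum, _mm_add_ps(vx, vy));   /* four adds in one instruction */

    for (int i = 0; i < 4; i++)
        printf("%g + %g = %g\n", x[i], y[i], sum[i]);
    return 0;
}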
Figure credit: http://www.clearspeed.com/images/arch_mtap.jpg
Dimension-lifted Transformation
(a) 1D array in memory (b) 2D view of same array (c) Transposed 2D array brings non-interacting elements into contiguous vectors (d) New 1D layout after transformation
Figure credit: P. Sadayappan. See Henretty et al. [CC11]
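A minimal C sketch of the transformation in the caption, under the assumption that the vector length V divides the array length N; the function name, parameter names, and the choice of V are illustrative, not from the paper. Each group of V consecutive output elements holds one element from each of V widely separated chunks of the original array, so the elements in one vector do not interact in a short stencil.

#include <stdio.h>
#include <stddef.h>

/* Illustrative dimension-lifted layout transformation: split a[0..N-1]
 * into V contiguous chunks of length N/V and interleave them, so that
 * out[c*V + r] = a[r*(N/V) + c].  Assumes V divides N. */
void dlt_layout(const double *a, double *out, size_t N, size_t V) {
    size_t chunk = N / V;
    for (size_t r = 0; r < V; r++)            /* chunk index = SIMD lane   */
        for (size_t c = 0; c < chunk; c++)    /* position within the chunk */
            out[c * V + r] = a[r * chunk + c];
}

int main(void) {
    double a[16], out[16];
    for (int i = 0; i < 16; i++) a[i] = i;
    dlt_layout(a, out, 16, 4);                /* N = 16, V = 4 */
    for (int i = 0; i < 16; i++) printf("%g ", out[i]);  /* 0 4 8 12 1 5 ... */
    printf("\n");
    return 0;
}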
MIMD Processors
Execute different programs on different processors
distributed memory
clusters (e.g. sugar.rice.edu, stic.rice.edu, ada.rice.edu)
Cray XT, IBM Blue Gene
SIMD platforms
special purpose: not well-suited for all applications
custom designed with long design cycles
less hardware: single control unit
need less memory: only 1 copy of program
today: SIMD common only for accelerators and vector units
MIMD platforms
suitable for a broad range of applications
inexpensive: off-the-shelf components + short design cycle
need more memory: program and OS on each processor
Processor interactions
modify data objects stored in shared memory
UMA shared address space platform with cache (Sequent Symmetry, 1988)
J. Laudon and D. Lenoski. The SGI Origin: a ccNUMA highly scalable server. Proc. of the 24th Annual Intl. Symp. on Computer Architecture, Denver, 241-251, 1997.
Shared memory
access shared data with load/store
Can provide shared address space abstraction on distributed memory multicomputers using software
e.g. Unified Parallel C, Co-array Fortran
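A minimal pthreads sketch of the shared-address-space model (illustrative, not from the slides): threads communicate through ordinary loads and stores to a shared variable, with a lock coordinating the updates.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static double shared_sum = 0.0;                       /* shared data object */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    double local = (double)id;        /* stand-in for a real computation   */
    pthread_mutex_lock(&lock);        /* coordinate access to shared data  */
    shared_sum += local;              /* ordinary store to shared memory   */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %g\n", shared_sum);   /* expect 0+1+2+3 = 6 */
    return 0;
}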
Message-Passing Multicomputers
Components
set of processors
each processor has its own exclusive memory
Examples
clustered workstations
non-shared-address-space multicomputers
Cray XT, IBM Blue Gene, many others
Communication model
exchange data using send and receive primitives
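A minimal MPI sketch of this model (illustrative; run with at least two ranks): rank 0 sends one value to rank 1 with explicit send and receive calls.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double value = 3.14159;
        /* explicit communication: rank 0 sends one double to rank 1 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double value;
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }
    MPI_Finalize();
    return 0;
}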
Figure: laundry pipelining analogy, four loads A, B, C, D (wash 30 min, dry 40 min, fold 20 min per load) shown in task order on a 6 PM to 9 PM timeline

In this example:
sequential execution: 4 * 90 min = 6 hours
pipelined execution: 30 + 4 * 40 + 20 min = 3.5 hours
Bandwidth = loads/hour
BW = 4/6 loads/hour without pipelining
BW = 4/3.5 loads/hour with pipelining
BW <= 1.5 loads/hour with pipelining; the bound is approached as the number of loads grows
Pipelining helps bandwidth but not latency (still 90 min per load)
Bandwidth is limited by the slowest pipeline stage
Potential speedup = number of pipe stages
Figure: processor memory hierarchy (L1 instruction and data caches, write buffers, etc.)
L1: 16 KB I + 16 KB D, 1 cycle
L2: 256 KB, 5 (6 FP) cycles
L3: 3 MB, 13.3 (13.1 FP) cycles
memory: 209.6 ns
http://www.devx.com/Intel/Article/20521
Observations
fetching the two matrices into the cache: 2n^2 words
fetching 2K words at 100 ns per word: 100 ns x 2K = ~200 µs
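Spelling out the arithmetic (the 2K-word figure implies n = 32 in this example; the value of n is an inference from the numbers on the slide):

\[
  2n^2 = 2 \times 32^2 = 2048 \text{ words}, \qquad
  2048 \times 100\,\text{ns} = 204.8\,\mu\text{s} \approx 200\,\mu\text{s}
\]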
Memory Bandwidth
Limited by both
the bandwidth of the memory bus
the bandwidth of the memory modules
Can be improved by increasing the size of memory blocks
Memory system takes l time units to deliver b units of data
l is the latency of the system
b is the block size
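A compact way to state the effect of block size (a sketch, assuming one outstanding request at a time and that all b words of each block are actually used; the numeric example is illustrative, not from the slide):

\[
  \text{effective bandwidth} \approx \frac{b}{l}
\]
% e.g., with l = 100 ns and 4-byte words: b = 1 word gives 40 MB/s,
% while b = 4 words gives 160 MB/s, at the same latency l.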
Memory benchmark:
for array A of length L from 4 KB to 8 MB by 2x
  for stride s from 4 bytes (1 word) to L/2 by 2x
    time the following loop (repeat many times and average):
      for i from 0 to L by s
        load A[i] from memory (4 bytes)
(each array-size/stride pair is 1 experiment)
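A runnable C sketch of one such experiment (one array size and one stride, fixed here for brevity; a full benchmark would sweep both parameters and subtract loop overhead; clock_gettime is assumed available):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t L = 8 * 1024 * 1024 / sizeof(int);  /* 8 MB array          */
    const size_t stride = 16;                        /* in elements (64 B)  */
    const int repeats = 100;
    int *A = malloc(L * sizeof(int));
    for (size_t i = 0; i < L; i++) A[i] = (int)i;

    volatile int sink = 0;             /* keep loads from being optimized away */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < repeats; r++)
        for (size_t i = 0; i < L; i += stride)
            sink += A[i];              /* one 4-byte load per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    size_t loads = repeats * (L / stride);
    printf("%.1f ns per load (sink=%d)\n", 1e9 * sec / loads, sink);
    free(A);
    return 0;
}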
Figure: measured time per load vs. array size and stride (s = stride)
plot annotations: cache hit time; region where size > L1
machine parameters: L1: 16 KB, 2 cycles (6 ns), 16 B lines; L2: 2 MB, 12 cycles (36 ns); 8 KB pages, 32 TLB entries
Figure: additional benchmark results; annotations: L2: 512 KB, 60 ns
memory layout and computation organization significantly affect spatial and temporal locality
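For example (a minimal C sketch, not from the slides): with C's row-major storage, summing a matrix row by row walks memory with unit stride and exploits spatial locality, while summing it column by column strides by a whole row and touches a new cache line on nearly every access.

#include <stdio.h>

#define N 1024
static double a[N][N];    /* row-major in C */

/* Good spatial locality: innermost loop walks consecutive addresses. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: consecutive accesses are N*sizeof(double) apart. */
double sum_column_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%g %g\n", sum_row_major(), sum_column_major());
    return 0;
}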
Next cycle
data items for the next function instance arrive
Also requires the program to have explicit threaded concurrency
Machines such as the HEP, Tera, and Sun T2000 (Niagara-2) rely on multithreaded processors that:
can switch the context of execution every cycle
are able to hide latency effectively
Prefetching support
software only, e.g. Itanium 2
hardware and software, e.g. Opteron
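A hedged C sketch of software prefetching using the GCC/Clang __builtin_prefetch intrinsic (the prefetch distance of 16 elements is an illustrative guess that would need tuning per machine):

#include <stdio.h>
#include <stddef.h>

/* Sum an array while asking the hardware to start fetching data that will
 * be needed a few iterations from now, overlapping memory latency with
 * computation. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t dist = 16;            /* prefetch distance, in elements */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}

int main(void) {
    static double data[1 << 20];
    for (size_t i = 0; i < (1 << 20); i++) data[i] = 1.0;
    printf("sum = %g\n", sum_with_prefetch(data, 1 << 20));
    return 0;
}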
Multithreaded systems
bandwidth requirements
may increase very significantly because of reduced cache per thread
References
Adapted from slides "Parallel Programming Platforms" by Ananth Grama accompanying the course textbook.
Vivek Sarkar (Rice), COMP 422 slides from Spring 2008.
Jack Dongarra (U. Tenn.), CS 594 slides from Spring 2008, http://www.cs.utk.edu/%7Edongarra/WEB-PAGES/cs594-2008.htm
Kathy Yelick (UC Berkeley), CS 267 slides from Spring 2007, http://www.eecs.berkeley.edu/~yelick/cs267_sp07/lectures
Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures. In ETAPS Intl. Conf. on Compiler Construction (CC'11), Springer Verlag, Saarbrücken, Germany, March 2011.