
Parallel Computing Platforms and Memory System Performance

John Mellor-Crummey
Department of Computer Science, Rice University
[email protected]

COMP 422

Lecture 9

8-10 February 2011

Topics for Today



- SIMD and MIMD control structure
- Communication models for parallel platforms
- Memory hierarchy and performance

Parallel Computing Platforms


A parallel computing platform must specify
- concurrency: the control structure
- interaction between concurrent tasks: the communication model

Control Structure of Parallel Platforms


Parallelism ranges from instructions to processes

Processor control structure alternatives


- operate under the centralized control of a single control unit, or
- work independently

SIMD
- Single Instruction stream: a single control unit dispatches the same instruction to all processors
- Multiple Data streams: processors work on their own data

MIMD
- Multiple Instruction streams: each processor has its own control unit and can execute different instructions
- Multiple Data streams: processors work on their own data

SIMD and MIMD Processors

SIMD architecture

MIMD architecture

PE = Processing Element

SIMD Control

SIMD relies on the regular structure of computations
- media processing, scientific kernels (e.g. linear algebra, FFT)

Activity mask
- per-PE predicated execution: turn off operations on certain PEs
- each PE tests its own conditional and sets its own activity mask
- a PE conditionally performs an operation predicated on its mask value

Executing a Conditional on SIMD Processors


Conditional statement: if (B == 0) then C = A else C = A / B

[Figure: starting from the initial values of A and B on each PE, the machine first executes the "then" branch with the PEs where B != 0 masked off, then executes the "else" branch with the PEs where B == 0 masked off.]
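To make this two-phase, masked execution concrete, here is a small sketch in plain C (not from the original slides) that emulates what the SIMD hardware does; NUM_PE and the per-PE arrays are hypothetical names.

#define NUM_PE 96   /* hypothetical machine width */

/* Emulate SIMD execution of: if (B == 0) C = A; else C = A / B; */
void simd_conditional(const int A[NUM_PE], const int B[NUM_PE], int C[NUM_PE])
{
    int mask[NUM_PE];

    /* each PE tests its own condition and sets its activity mask */
    for (int pe = 0; pe < NUM_PE; pe++)
        mask[pe] = (B[pe] == 0);

    /* phase 1: every PE steps through the "then" branch; masked-off PEs stay idle */
    for (int pe = 0; pe < NUM_PE; pe++)
        if (mask[pe])
            C[pe] = A[pe];

    /* phase 2: the mask is inverted and every PE steps through the "else" branch */
    for (int pe = 0; pe < NUM_PE; pe++)
        if (!mask[pe])
            C[pe] = A[pe] / B[pe];
}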

SIMD Examples

Many early parallel computers
Illiac IV, MPP, DAP, Connection Machine CM-1/2, and MasPar MP-1/2

Today
- vector units: SSE, SSE2, Altivec (Velocity Engine, VMX)
  - 128-bit vector registers: 16 8-bit chars, 8 16-bit short ints, 4 32-bit ints, or 4 32-bit floats
  - SSE2 also operates on 2 64-bit double-precision values
- co-processors
  - ClearSpeed array processor (control PE + array of 96 PEs): http://www.clearspeed.com
  - nVidia G80 GPGPU

SSE/SSE2 as examples of SIMD vector units

Scalar processing
- traditional mode
- one operation produces one result

SIMD vector units
- with SSE / SSE2, one operation produces multiple results

[Figure: a 4-wide SIMD add X + Y computes (x3+y3, x2+y2, x1+y1, x0+y0) in a single operation.]
Slide Credit: Alex Klimovitski & Dean Macri, Intel Corporation

SSE / SSE2 SIMD Processing

SSE2 data types: anything that fits into 16 bytes, e.g.,
- 4x floats, 2x doubles, or 16x bytes

Instructions (add, multiply, etc.) operate in parallel on the data in this 16-byte register

Data bytes must be contiguous in memory and aligned

Additional instructions are needed for
- masking data
- moving data from one part of a register to another
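As a concrete illustration of "one operation, multiple results", here is a minimal sketch (not from the original slides) of a 4-wide single-precision add using the SSE intrinsics from <xmmintrin.h>; the function name add4 is hypothetical.

#include <xmmintrin.h>   /* SSE intrinsics */

/* Add two 4-float vectors with a single SIMD operation. */
void add4(const float *x, const float *y, float *out)
{
    __m128 vx = _mm_loadu_ps(x);      /* load x0..x3 (unaligned load)    */
    __m128 vy = _mm_loadu_ps(y);      /* load y0..y3                     */
    __m128 vs = _mm_add_ps(vx, vy);   /* one instruction: four additions */
    _mm_storeu_ps(out, vs);           /* store x0+y0 .. x3+y3            */
}

If the pointers were known to be 16-byte aligned, _mm_load_ps / _mm_store_ps could be used instead, which matches the alignment requirement noted above.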

SIMD: ClearSpeed MTAP Co-processor


MTAP processor
- features: hardware multi-threading, asynchronous overlapped I/O, extensible instruction set
- SIMD core: poly controller + poly execution unit
  - array of 96 PEs
  - 64- and 32-bit floating point
  - 250 MHz (key to low power)
  - 128 million transistors
  - low power: ~10 Watts

Figure credit: http://www.clearspeed.com/images/arch_mtap.jpg

The Subtlety of Short Vectors

Consider the following:

Stream alignment conflict!

A solution: data layout transformation

Figure credit: P. Sadayappan. See Henretty et al. [CC11]

Dimension-lifted Transformation

(a) 1D array in memory
(b) 2D view of the same array
(c) Transposed 2D array brings non-interacting elements into contiguous vectors
(d) New 1D layout after transformation

Figure credit: P. Sadayappan. See Henretty et al. [CC11]

MIMD Processors
Execute different programs on different processors

Platforms include current-generation systems

shared memory
- multicore laptop
- workstation with multiple quad-core processors
- SGI Altix UV (up to 32K sockets, each with an 8-core processor)
- legacy: Cray X1 (up to 8K processors)

distributed memory
- clusters (e.g. sugar.rice.edu, stic.rice.edu, ada.rice.edu)
- Cray XT, IBM Blue Gene

SPMD programming paradigm
- Single Program, Multiple Data streams
- same program on different PEs; behavior conditional on thread id

SIMD vs. MIMD

SIMD platforms
- special purpose: not well-suited for all applications
- custom designed, with long design cycles
- less hardware: single control unit
- need less memory: only one copy of the program
- today: SIMD is common only for accelerators and vector units

MIMD platforms
- suitable for a broad range of applications
- inexpensive: off-the-shelf components and short design cycles
- need more memory: program and OS on each processor

Communication Models for Parallel Platforms



Two primary forms of data exchange between parallel tasks
- accessing a shared data space
- exchanging messages

Platforms that provide a shared data space = shared-memory platforms (AKA multiprocessors)

Platforms that support messaging = message-passing platforms (AKA multicomputers)

Shared Memory Platforms



Components
- set of processors
- part (or all) of memory is accessible to all processors

Processor interactions
- modify data objects stored in shared memory

Flavors of shared address space platforms
- UMA: uniform memory access
  - the time taken by a processor to access any memory word is identical
- NUMA: non-uniform memory access
  - the time taken by a processor to access a memory word may vary

NUMA and UMA Platforms

UMA shared address space platform with cache (Sequent Symmetry, 1988)

NUMA shared address space platform (BBN Butterfly, 1988)

UMA shared address space platform (BBN Monarch, 1990)


SGI Origin 2000: NUMA Platform (1997)

J. Laudon and D. Lenoski. The SGI Origin: a ccNUMA highly scalable server. Proc. of the 24th Annual Intl. Symp. on Computer Architecture, Denver, pages 241-251, 1997.

Contemporary NUMA Multiprocessor

Figure credit: AMD (http://bit.ly/dT7Zxd)


SGI Origin Network Topology


Example: bristled hypercube for 64 processors

Figure credit: http://web.cecs.pdx.edu/~alaa/ece588/papers/laudon_isca_1997.pdf


NUMA and UMA Platforms



NUMA and UMA platforms differ with respect to locality
- algorithms must exploit locality for performance on NUMA platforms

Programming these platforms
- easy communication: reads and writes are implicitly visible to other processors
- tricky coordination: read-write operations on shared data must be coordinated, e.g. with Pthreads mutexes or OpenMP critical sections (a sketch follows below)
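For example, a minimal OpenMP sketch of coordinating updates to shared data; compute and parallel_sum are hypothetical names used only for illustration.

#include <omp.h>

extern double compute(int i);        /* hypothetical per-iteration work */

double parallel_sum(int n)
{
    double sum = 0.0;                /* shared data object */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double v = compute(i);       /* independent work: reads/writes implicitly visible */
        #pragma omp critical         /* coordinate the read-modify-write of sum */
        sum += v;
    }
    return sum;
}

In practice a reduction clause would avoid the serialization; the critical section is used here only to make the coordination explicit.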

Cache-coherent vs. non-cache-coherent architectures
- non-cache-coherent shared address space architectures
  - provide an address map, but not coordinated sharing
  - processors must explicitly flush data to memory before sharing
  - examples: BBN Butterfly, Cray T3E
- cache-coherent architectures: caches coordinate access to multiple copies
  - hardware support keeps copies consistent
  - up-to-date values can be retrieved from cache by remote processors
  - examples: SGI Origin, SGI Altix

Shared Memory vs. Shared Address Space

Shared memory
- access shared data with load/store

Shared address space
- name shared data with a global address
- can't necessarily access all shared data with load/store
  - e.g. Cray T3E: GET/PUT

Can provide a shared address space abstraction on distributed-memory multicomputers using software
- e.g. Unified Parallel C, Co-array Fortran

Message-Passing Multicomputers

Components
- set of processors
- each processor has its own exclusive memory

Examples
- clustered workstations
- non-shared-address-space multicomputers: Cray XT, IBM Blue Gene, many others

Communication model
- exchange data using send and receive primitives
- common message-passing library: MPI (see the sketch below)
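As an illustration of the send/receive model, here is a minimal MPI sketch (not part of the original slides) in which rank 0 sends one double to rank 1.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x = 3.14;
    if (rank == 0)                                   /* sender */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)                              /* receiver */
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}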


Important Aspects of Communication



Latency: How long does a single operation take?
- measured in microseconds

Bandwidth: What data rate can be sustained?
- measured in MBytes/second

These terms can be applied to both
- memory access
- messaging

Bandwidth vs. Latency in a Pipeline


Dave Patterson's laundry example: 4 people (A, B, C, D) doing laundry
- wash (30 min) + dry (40 min) + fold (20 min) = 90 min per load

[Figure: Gantt chart of the four loads A-D overlapped in time, from 6 PM to about 9:30 PM.]

In this example
- sequential execution: 4 x 90 min = 6 hours
- pipelined execution: 30 + 4 x 40 + 20 min = 3.5 hours
- bandwidth = loads/hour
  - BW = 4/6 loads/hour without pipelining
  - BW = 4/3.5 loads/hour with pipelining
  - BW approaches 1.5 loads/hour with pipelining and more total loads
- pipelining helps bandwidth but not latency (still 90 min per load)
- bandwidth is limited by the slowest pipeline stage
- potential speedup = number of pipeline stages

Memory Systems and Performance



Consider architecture model X
- a processor operating at 1 GHz (1 ns clock)
- connected to a DRAM with a latency of 100 ns, no cache

Assume that the processor can execute 1 FLOP per cycle
- FLOP = floating-point operation
- peak performance = 1 GFLOPS

Consider adding two vectors on this architecture
- each FLOP requires two data accesses
- peak speed of this computation: one addition (1 FLOP) every 200 ns = 5 MFLOPS
- achieves only a small fraction of peak processor performance

The memory system is the rate limiter for performance

A Modern Memory Hierarchy (Itanium 2)


[Figure: block diagram of the Itanium 2 memory hierarchy: processor with TLBs, L1 instruction and data caches (16K I + 16K D, 1 cycle), write buffers, L2 cache (256K, 5 cycles, 6 for FP), L3 cache (3M, 13.3 cycles, 13.1 for FP), memory controller, and two memory banks; memory latency 209.6 ns.]

http://www.devx.com/Intel/Article/20521

Consider the Impact of Adding a Cache



Add a cache of size 32 KB with a latency of 1 ns (one cycle)

Use this setup to multiply two n x n matrices: C = A x B
- let n = 32: carefully chosen so all matrices fit in the cache

Observations
- fetching the two input matrices into the cache: 2n^2 = 2K words
  - 2K words x 100 ns = ~200 µs
- multiplying two n x n matrices takes 2n^3 operations
  - 64K operations, 64K cycles = ~64 µs
- the total time for the computation is therefore approximately
  - time = time for load/store operations + time for the computation = 200 µs + 64 µs = 264 µs
- computation rate = (64K FLOPs) / (264 µs) = ~248 MFLOPS
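For concreteness, the computation analyzed above is the standard triply nested matrix multiply; a minimal sketch, using the n = 32 assumption from the slide.

#define N 32   /* chosen so A, B, and C all fit in the 32 KB cache */

/* C = A x B: 2*N^3 floating-point operations on 3*N^2 cached words. */
void matmul(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];   /* one multiply + one add per iteration */
            C[i][j] = sum;
        }
}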



Memory Bandwidth

Limited by both
- the bandwidth of the memory bus
- the bandwidth of the memory modules

Can be improved by increasing the size of memory blocks

Memory system takes l time units to deliver b units of data
- l is the latency of the system
- b is the block size

Consider the Impact of Non-unit Block Size



Again, consider architecture model X, except
- the block size is 4 words instead of 1 word

Repeat the vector-add computation analysis
- assume the vectors are laid out contiguously in memory
- a single memory access fetches four consecutive words
- four additions can be performed every 200 cycles (200 ns)
- performance = 1 FLOP / 50 ns; peak speed = 20 MFLOPS

Reusing Data in the Memory Hierarchy



Spatial reuse: using more than one word in a multi-word line
- using multiple words in a cache line

Temporal reuse: using a word repeatedly
- accessing the same word in a cache line more than once

Applies at every level of the memory hierarchy, e.g. the TLB
- spatial reuse: access multiple cache lines in a page
- temporal reuse: access data on the same page repeatedly
(see the loop sketch below)
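As a small illustration (a sketch, not from the slides), a dense matrix-vector multiply exhibits both kinds of reuse.

/* y = A x, with A stored row-major; assumes n <= 1024 (illustrative fixed width). */
void matvec(const double A[][1024], const double x[1024], double y[1024], int n)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i][j] * x[j];   /* spatial reuse: consecutive A[i][j] share a cache line */
                                     /* temporal reuse: every row re-reads the same vector x  */
        y[i] = sum;
    }
}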


Experimental Study of Memory (membench)


Microbenchmark for memory system performance

One experiment:

for array A of length L from 4 KB to 8 MB by 2x
  for stride s from 4 bytes (1 word) to L/2 by 2x
    time the following loop (repeat many times and average)
      for i from 0 to L by s
        load A[i] from memory (4 bytes)
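A hedged C sketch of one (length, stride) experiment; the timing harness and the helper name time_strided are illustrative, not the actual membench code.

#include <time.h>

#define L_MAX (8 * 1024 * 1024)           /* largest array size in bytes (8 MB) */
static int A[L_MAX / sizeof(int)];        /* 4-byte elements, as in the pseudocode */

/* Average cost of one 4-byte load when sweeping `len` bytes with stride `stride` bytes. */
double time_strided(long len, long stride, long repeats)
{
    volatile long sink = 0;
    long n = len / (long)sizeof(int), step = stride / (long)sizeof(int);
    clock_t start = clock();
    for (long r = 0; r < repeats; r++)
        for (long i = 0; i < n; i += step)
            sink += A[i];                 /* one load per iteration */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    (void)sink;
    return secs / ((double)repeats * (n / step));   /* seconds per access */
}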


Membench: What to Expect


[Figure: expected average cost per access vs. stride s, one curve per array size; arrays larger than L1 pay the memory access time, arrays that fit in L1 pay only the cache hit time.]

Consider the average cost per load
- plot one line for each array length, time vs. stride
- unit stride is best: if a cache line holds 4 words, at most 1 miss per 4 accesses
- if the array is smaller than a cache, all accesses will hit after the first run
  - time for the first run is negligible with enough repetitions
- the figure above assumes only one level of cache; the performance profile is more complicated on modern systems

Memory Hierarchy on a Sun Ultra-2i


Sun Ultra-2i, 333 MHz

[Figure: measured membench curves, one per array length, annotated with the machine's memory hierarchy parameters:]
- memory: 396 ns (132 cycles)
- L2: 2 MB, 64-byte lines, 12 cycles (36 ns)
- L1: 16 KB, 16-byte lines, 2 cycles (6 ns)
- 8 KB pages, 32 TLB entries

See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details

Memory Hierarchy on a Pentium III


Katmai processor on Millennium, 550 MHz

[Figure: measured membench curves, one per array size, annotated with the machine's memory hierarchy parameters:]
- L2: 512 KB, 60 ns
- L1: 64 KB, 5 ns, 4-way(?)
- L1 line size: 32 bytes(?)

Memory System Performance: Summary


The examples here illustrate the following concepts:
- exploiting spatial and temporal locality is critical for
  - amortizing memory latency
  - increasing effective memory bandwidth
- the ratio of the number of operations to the number of memory accesses is a good indicator of anticipated tolerance to memory bandwidth
- memory layout and computation organization significantly affect spatial and temporal locality

Multithreading for Latency Hiding



A thread is a single stream of control in the flow of a program.

We illustrate threads with a dense matrix-vector multiply:

for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

Each dot product is independent of the others
- thus, they can execute concurrently

Can rewrite the above code segment using threads:

for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
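The create_thread call above is pseudocode; a minimal POSIX-threads sketch of the same idea follows (the declarations of a, b, c, n, and dot_product are assumptions made only so the fragment is self-contained).

#include <pthread.h>

extern int n;
extern double **a, *b, *c;
extern double dot_product(double *row, double *vec);

static void *worker(void *arg)
{
    int i = *(int *)arg;            /* which row this thread handles  */
    c[i] = dot_product(a[i], b);    /* independent of every other row */
    return NULL;
}

void matvec_threaded(void)
{
    pthread_t tid[n];               /* C99 variable-length arrays */
    int idx[n];
    for (int i = 0; i < n; i++) {
        idx[i] = i;
        pthread_create(&tid[i], NULL, worker, &idx[i]);   /* one thread per dot product */
    }
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL); /* wait for all results */
}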


Multithreading for Latency Hiding (contd)

Consider how the code executes
- the first thread accesses a pair of vector elements and waits for them
- the second thread can access two other vector elements in the next cycle
- ...

After l units of time (l is the latency of the memory system)
- the first thread gets its data from memory and performs its multiply-add

Next cycle
- the data items for the next function instance arrive
- ...

Thus, every clock cycle we can perform a computation

Multithreading for Latency Hiding (contd)



The previous example makes two hardware assumptions
- the memory system can service multiple outstanding requests
- the processor is capable of switching threads every cycle

It also requires the program to have explicit threaded concurrency

Machines such as the HEP, Tera, and Sun T2000 (Niagara-2) rely on multithreaded processors that
- can switch the context of execution every cycle
- are able to hide latency effectively

Sun T2000: 64-bit SPARC v9 processor @ 1200 MHz
- organization: 8 cores, 4 strands per core, 8 KB data cache and 16 KB instruction cache per core, unified 12-way 3 MB L2 cache, 32 GB RAM

Prefetching for Latency Hiding



Misses on loads cause programs to stall, so why not load data before it is needed?
- by the time it is actually needed, it will be there!

Drawback: need space to store early loads
- may overwrite other necessary data in the cache
- if the early loads are themselves overwritten, we are little worse off than before!

Prefetching support
- software only, e.g. Itanium 2
- hardware and software, e.g. Opteron
(a software-prefetch sketch follows below)

Hardware prefetching requires
- a predictable access pattern
- a limited number of independent streams
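As an illustration of the software approach, a minimal sketch using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an illustrative guess, not a tuned value.

/* Stream through a[], requesting data a fixed distance ahead of its use. */
double sum_with_prefetch(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0, 1);  /* hint: read-only, low temporal locality */
        s += a[i];
    }
    return s;
}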


Tradeoffs in Multithreading and Prefetching

Multithreaded systems
- bandwidth requirements may increase very significantly because of the reduced cache per thread
- can become bandwidth bound instead of latency bound

Multithreading and prefetching
- only address latency
- may often exacerbate bandwidth needs
- have a significantly larger data footprint, and need hardware support for it

References

- Adapted from slides "Parallel Programming Platforms" by Ananth Grama, accompanying the course textbook
- Vivek Sarkar (Rice), COMP 422 slides from Spring 2008
- Jack Dongarra (U. Tennessee), CS 594 slides from Spring 2008, http://www.cs.utk.edu/%7Edongarra/WEB-PAGES/cs594-2008.htm
- Kathy Yelick (UC Berkeley), CS 267 slides from Spring 2007, http://www.eecs.berkeley.edu/~yelick/cs267_sp07/lectures
- Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures. In ETAPS Intl. Conf. on Compiler Construction (CC'2011), Springer Verlag, Saarbrücken, Germany, March 2011. To appear.
