Unit 2
Parallel Programming
Syllabus
• Principles of Parallel Algorithm Design: Preliminaries,
Decomposition Techniques, Characteristics of Tasks and
Interactions, Mapping Techniques for Load Balancing, Methods for
Containing Interaction Overheads, Parallel Algorithm Models,
• Processor Architecture, Interconnect, Communication, Memory
Organization, and Programming Models in high performance
computing architecture examples: IBM Cell BE, Nvidia Tesla GPU,
Intel Larrabee microarchitecture and Intel Nehalem
microarchitecture
• Memory hierarchy and transactional memory design, Thread
Organization
Preliminaries: Decomposition, Tasks, and
Dependency Graphs
• Observations (for the decomposition of the dense matrix-vector product y = Ab into one task per element of y):
– All tasks share the vector b, but they have no control dependencies.
– There are zero edges in the task-dependency graph.
– All tasks are of the same size in terms of the number of operations.
• Is this the maximum number of tasks we could decompose this
problem into?
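A minimal sketch of this decomposition (function and variable names are illustrative, assuming the y = Ab example above): each task computes a single element of y, reads its own row of A plus the shared vector b, and writes nothing else, so all n tasks are independent.

/* Sketch: row-wise task decomposition of y = A*b.
   Task i reads row i of A and the shared vector b, and writes only y[i],
   so the n tasks have no dependencies and may run in any order or in parallel. */
#include <stddef.h>

void task_compute_y_element(const double *A, const double *b,
                            double *y, size_t n, size_t i) {
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)      /* task i uses row i of A ... */
        sum += A[i * n + j] * b[j];     /* ... and all of the shared vector b */
    y[i] = sum;                         /* the only value task i writes */
}

/* Serial driver; a parallel runtime could execute the n calls concurrently. */
void matvec(const double *A, const double *b, double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        task_compute_y_element(A, b, y, n, i);
}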
Example: Database Query Processing
Consider the execution of the query:
MODEL = “CIVIC” AND YEAR = “2001” AND
(COLOR = “GREEN” OR COLOR = “WHITE”)
on the following database:
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
Example: Database Query Processing
• Assume the query is divided into four subtasks
– Each task generates an intermediate table of entries
• Processes (not in the UNIX sense): logical computing agents that perform tasks
– Task + task data + task code required to produce the task’s output
• Decomposition:
– The process of dividing the computation into smaller pieces of work, i.e., tasks
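A minimal sketch of the query decomposition described above (the record layout, helper names, and bitmask representation are illustrative, not from the source): independent tasks each scan the table and produce an intermediate result, and combining stages then AND/OR those results.

/* Sketch of decomposing the query into independent subtasks. Each leaf task
   produces an intermediate "table" (here a bitmask of matching rows); the
   combining steps depend on the leaf results, giving the task-dependency graph. */
#include <stdio.h>
#include <string.h>

typedef struct { int id; const char *model; int year; const char *color; } Record;

unsigned match_model(const Record *t, int n, const char *m) {
    unsigned bits = 0;
    for (int i = 0; i < n; i++) if (strcmp(t[i].model, m) == 0) bits |= 1u << i;
    return bits;
}
unsigned match_year(const Record *t, int n, int y) {
    unsigned bits = 0;
    for (int i = 0; i < n; i++) if (t[i].year == y) bits |= 1u << i;
    return bits;
}
unsigned match_color(const Record *t, int n, const char *c) {
    unsigned bits = 0;
    for (int i = 0; i < n; i++) if (strcmp(t[i].color, c) == 0) bits |= 1u << i;
    return bits;
}

int main(void) {
    Record t[] = { {4523, "Civic", 2002, "Blue"},  {6734, "Civic", 2001, "White"},
                   {9834, "Prius", 2001, "Green"}, {4395, "Civic", 2001, "Red"} };
    int n = 4;
    /* The leaf tasks are independent and could run concurrently. */
    unsigned civic = match_model(t, n, "Civic");
    unsigned y2001 = match_year(t, n, 2001);
    unsigned green = match_color(t, n, "Green");
    unsigned white = match_color(t, n, "White");
    /* Combining tasks depend on the leaf results (edges in the task graph). */
    unsigned result = civic & y2001 & (green | white);
    for (int i = 0; i < n; i++) if (result & (1u << i)) printf("%d\n", t[i].id);
    return 0;
}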
Example: Finding the Minimum
1. procedure SERIAL_MIN(A, n)
2. begin
3. min := A[0];
4. for i := 1 to n − 1 do
5.     if (A[i] < min) min := A[i];
6. endfor;
7. return min;
8. end SERIAL_MIN
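One possible recursive (tree-based) decomposition of the same computation, given as a minimal sketch (function names are illustrative; the textbook's exact parallel formulation may differ): the array is split in half, the two halves are reduced as independent tasks, and their partial minima are combined.

/* Sketch: recursive decomposition of min-finding. The two recursive calls on
   disjoint halves are independent tasks; a parallel runtime could execute
   them concurrently and then combine the two partial minima. */
double recursive_min(const double *A, int lo, int hi) {   /* range [lo, hi) */
    if (hi - lo == 1)
        return A[lo];
    int mid = lo + (hi - lo) / 2;
    double left  = recursive_min(A, lo, mid);   /* independent task 1 */
    double right = recursive_min(A, mid, hi);   /* independent task 2 */
    return (left < right) ? left : right;       /* combine partial results */
}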
• In the frequency counting example, the input (i.e., the transaction set) can
be partitioned.
– This induces a task decomposition in which each task generates partial counts
for all itemsets. These are combined subsequently for aggregate counts.
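A minimal sketch of this input-partitioned counting (the data layout and names are illustrative, and the "transaction contains itemset" test is simplified to an integer comparison): each task counts itemset occurrences in its own block of the transaction set, and the partial counts are then summed.

/* Sketch: partitioning the input (transactions) among tasks. Each task
   produces partial counts for every itemset over its block; the partial
   counts are combined afterwards into aggregate counts. */
void count_block(const int *transactions, int begin, int end,
                 const int *itemsets, int nitemsets, int *partial) {
    for (int s = 0; s < nitemsets; s++) {
        partial[s] = 0;
        for (int t = begin; t < end; t++)
            if (transactions[t] == itemsets[s])
                partial[s]++;               /* partial count for this block */
    }
}

void combine_counts(const int *partial, int ntasks, int nitemsets, int *total) {
    for (int s = 0; s < nitemsets; s++) {
        total[s] = 0;
        for (int k = 0; k < ntasks; k++)
            total[s] += partial[k * nitemsets + s];   /* sum over all tasks */
    }
}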
Partitioning Input and Output Data
Intermediate Data Partitioning
Intermediate Data Partitioning: Example
Stage I
Task 01: D1,1,1 = A1,1 B1,1    Task 02: D2,1,1 = A1,2 B2,1
Task 03: D1,1,2 = A1,1 B1,2    Task 04: D2,1,2 = A1,2 B2,2
Task 05: D1,2,1 = A2,1 B1,1    Task 06: D2,2,1 = A2,2 B2,1
Task 07: D1,2,2 = A2,1 B1,2    Task 08: D2,2,2 = A2,2 B2,2
Stage II
Task 09: C1,1 = D1,1,1 + D2,1,1    Task 10: C1,2 = D1,1,2 + D2,1,2
Task 11: C2,1 = D1,2,1 + D2,2,1    Task 12: C2,2 = D1,2,2 + D2,2,2
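A minimal sketch of the 12-task decomposition above for 2x2 block matrices (blocks are shown as scalars for brevity): the Stage I tasks compute the intermediate products D, and the Stage II tasks sum them into C.

/* Sketch: intermediate data partitioning for C = A * B with 2x2 blocks.
   Stage I: 8 independent tasks compute D[k][i][j] = A[i][k] * B[k][j].
   Stage II: 4 tasks compute C[i][j] = D[0][i][j] + D[1][i][j];
   each Stage II task depends on exactly two Stage I tasks. */
void two_stage_matmul_2x2(const double A[2][2], const double B[2][2],
                          double C[2][2]) {
    double D[2][2][2];                          /* intermediate data */
    for (int k = 0; k < 2; k++)                 /* Stage I: tasks 01-08 */
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                D[k][i][j] = A[i][k] * B[k][j];
    for (int i = 0; i < 2; i++)                 /* Stage II: tasks 09-12 */
        for (int j = 0; j < 2; j++)
            C[i][j] = D[0][i][j] + D[1][i][j];
}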
Speculative Execution
• Performs more or the same aggregate work (but not less) than the
sequential algorithm
• Example: Discrete Event Simulation
• Block Distribution
– Used to load-balance a variety of parallel computations that operate on
multi-dimensional arrays
• Cyclic Distribution
• Block-Cyclic Distribution
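A minimal sketch contrasting the three distributions above for a 1-D array of n elements over p processes (one common convention; boundary handling can differ): each function returns the process that owns element i, with block-cyclic generalizing the other two via a block size b.

/* Sketch: owner mappings for a 1-D array of n elements onto p processes.
   Block: contiguous chunks. Cyclic: round-robin single elements.
   Block-cyclic: round-robin chunks of size b. */
int block_owner(int i, int n, int p)        { return i / ((n + p - 1) / p); }
int cyclic_owner(int i, int p)              { return i % p; }
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }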
Hierarchical Mappings
• For example, the task mapping of the binary tree (quicksort) cannot
use a large number of processes.
• For this reason, task mapping can be used at the top level and data
partitioning within each level.
• When a process runs out of work, it requests more work from the master.
• When the number of processes increases, the master may become the
bottleneck.
• Selecting large chunk sizes may lead to significant load imbalances as well.
• Work Pool Model: The work pool or the task pool model is characterized
by a dynamic mapping of tasks onto processes for load balancing in which
any task may potentially be performed by any process. There is no desired
premapping of tasks onto processes. The mapping may be centralized or
decentralized. Pointers to the tasks may be stored in a physically shared
list, priority queue, hash table, or tree, or they could be stored in a physically
distributed data structure. The work may be statically available in the
beginning, or could be dynamically generated; i.e., the processes may
generate work and add it to the global (possibly distributed) work pool. If the
work is generated dynamically and a decentralized mapping is used, then a
termination detection algorithm would be required so that all processes can
actually detect the completion of the entire program (i.e., exhaustion of all
potential tasks) and stop looking for more work.
• Example: Parallelization of loops by chunk scheduling (see the sketch below)
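A minimal sketch of chunk-scheduled loop parallelization from a shared work pool, using C++11 threads and an atomic counter (the names and the loop body are illustrative): each worker repeatedly claims the next chunk of iterations until the pool is exhausted, which balances load when iteration costs vary; small chunks balance better, large chunks reduce scheduling overhead.

// Sketch: parallelizing a loop by chunk scheduling over a shared work pool.
#include <atomic>
#include <thread>
#include <vector>
#include <cmath>

int main() {
    const int n = 1 << 20, chunk = 1024, nworkers = 4;
    std::vector<double> y(n);
    std::atomic<int> next(0);                   // shared work-pool counter

    auto worker = [&]() {
        for (;;) {
            int begin = next.fetch_add(chunk);  // atomically claim the next chunk
            if (begin >= n) break;              // pool exhausted
            int end = begin + chunk < n ? begin + chunk : n;
            for (int i = begin; i < end; i++)
                y[i] = std::sqrt((double)i);    // the loop body
        }
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < nworkers; w++) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
    return 0;
}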
IBM Cell BE Processor
Cell Architecture: “System on a Chip”
● Heterogeneous chip multiprocessor
– 64-bit Power Architecture with cooperative offload processors, with direct memory access and communication synchronization methods
– 64-bit Power Processing Element Control Core (PPE)
– 8 Synergistic Processing Elements (SPEs)
– Single-instruction, multiple-data (SIMD) architecture, supported by both the vector media extensions on the PPE and the instruction set of the SPEs (i.e., gaming/media and scientific applications)
– High-bandwidth on-chip Element Interconnect Bus (EIB)
● PPE
– Main control unit for the entire Cell
– 32KB L1 instruction and data caches
– 512KB L2 unified cache
– Dual-threaded, static dual issue
– Composed of three main units:
● Instruction Unit (IU): fetch, decode, branch, issue, and completion
● Fixed-Point Execution Unit: fixed-point instructions and load/store instructions
● Vector Scalar Unit: vector and floating-point instructions
Cell Architecture: Synergistic Processing Element
• Synergistic Processing Element (SPE)
• SIMD instruction set architecture, optimized for power and performance of compute-intensive applications
• Local store (LS) memory for instructions and data
● Additional level of the memory hierarchy
● Largest component of the SPE
● Single-port SRAM, capable of reading and writing through both narrow 128-bit and wide 128-byte ports
• Data and instructions transferred between the local store and system memory by asynchronous DMA commands, controlled by the memory flow controller (MFC) on the SPE
● Support for 16 simultaneous DMA commands
• Programmable DMA options
● SPE instructions insert DMA commands into queues
● Pool DMA commands into a single “DMA list” command
● Insert commands into the DMA queue from other SPE processors
• 128-entry unified register file for improved memory bandwidth, and optimum power efficiency and performance
• Dynamically configurable to provide support for content protection and privacy
Cell Synergistic Memory Flow Controller
● XDR RAMBUS is used as the bus between the EIB and main memory
– 2 x 32-bit channels, with 3.2 Gbit/s data transfer per pin
– 25.6 GB/s peak data transfer
● Potentially scalable to 6.4 Gbit/s, offering a peak rate of 51.2 GB/s
– “FlexPhase” technology
● Signals can arrive at different times, reducing the need for exact wire lengths
– Configurable to attach to variable amounts of main memory
– Memory Interface Controller (MIC) controls the flow of data between the EIB and the XDR RAMBUS
● RAMBUS RRAC FlexIO is used for I/O
– 7 transmit and 5 receive 8-bit ports
● 35 GB/s transmit peak
● 25 GB/s receive peak
– Used as a high-speed interconnect between SPEs of different Cells when a multiple-Cell architecture is in place
● 4 Cells can be seamlessly connected given the current architecture, an additional 4 with an added switch
Compiler
● Octopiler
– “Single source” parallelizing, simdizing compiler
● Generates multiple binaries targeting both the PPE and SPE elements from one single source file
● Allows programmers to develop applications for parallel architectures with the illusion of a single shared-memory image
– Compiler-controlled software cache, memory-hierarchy optimizations, and code-partitioning techniques assume all data resides in a shared system memory
● Enables automatic transfer of data and code
● Preserves coherence across all local SPE memories and system memory
Cell Programming Models
● In order to obtain optimum performance within the Cell, both the SPEs' LS and the Cell's SIMD dataflow must be taken into account by the programmer, and ultimately the programming model
● Function Offload Model
– SPEs used as accelerators for critical functions designated by the PPE
● Original code optimized and recompiled for the SPE instruction set
● Designated program functions offloaded to SPEs for execution
● When the PPE calls a designated program function, it is automatically invoked on the SPE
– Programmer statically identifies which program functions will execute on the PPE and which will be offloaded to the SPEs
● Separate source files and compilation for PPE and SPE
– A prototype single-source approach using compiler directives as special offload hints has been developed for the Cell
● The challenge is to have the compiler automatically determine which functions execute on the PPE and SPE
– Allows applications to remain compatible with the Power Architecture by generating both PPE and SPE binaries
● Cell systems load SPE-accelerated binaries
● Non-Cell systems load PPE binaries
Cell Programming Models
● Device Extension Model
– Special type of the function offload model
– SPEs act as interfaces between the PPE and external devices
● Uses memory-mapped, SPE-accessible registers as a command/response FIFO between the PPE and SPEs
● Device memory can be mapped by DMA, supporting transfer-size granularity as small as a single byte
● Computational Acceleration Model
– SPE-centric model, offering greater integration of SPEs in application execution and programming than the function offload model
– Computationally intensive code is performed on the SPEs rather than the PPE
– The PPE acts as a control center and service engine for the SPE software
● Work is partitioned among multiple SPEs operating in parallel
– Done manually by programmers or automatically by the compiler
– Must include efficient scheduling of DMA operations for code and data transfers between the PPE and SPEs
– Utilizes a shared-memory programming model or a supported message-passing model
– Generally provides extensive acceleration of math-intensive functions without requiring significant rewrite or recompilation
Cell Programming Models
● Streaming Model
– Designed to take advantage of support for message passing between the PPE and SPEs
– Constructs serial or parallel pipelines using the SPEs
● Each SPE applies a particular application kernel to the received data stream
● The PPE acts as a stream controller, with the SPEs acting as data-stream processors
– Extremely efficient when each SPE performs an equivalent amount of work
● Shared-Memory Multiprocessor Model
– Utilizes the two different instruction sets of the PPE and SPE to execute applications
● Provides great benefits to performance when utilizing just a single instruction set for a program would be very inefficient
– Since DMA operations are cache coherent, combining DMA commands with SPE load/store commands provides the illusion of a single shared address space
● Conventional shared-memory loads are replaced by a DMA from system memory to an SPE's LS, and then a load from the LS to the SPE's register file
● Conventional shared-memory stores are replaced by a store from an SPE's register file to its LS, then a DMA from the SPE's LS to system memory
– Effective addresses common to both the PPE and SPE are used for all “pseudo” shared-memory load/store commands
– Possibilities exist for a compiler or interpreter to manage all SPE LSs as local caches for instructions and data
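A minimal, hedged sketch of the load/store replacement described above. The dma_get/dma_put helpers are hypothetical stand-ins (implemented here with memcpy so the sketch is self-contained and runnable anywhere); on a real SPE they would correspond to asynchronous MFC DMA commands plus a completion wait.

/* Sketch of the shared-memory illusion on an SPE: a "load" becomes a DMA
   from system memory into the local store (LS) followed by an LS access, and
   a "store" becomes an LS write followed by a DMA back to system memory.
   dma_get/dma_put are hypothetical stand-ins, not the real Cell SDK calls. */
#include <string.h>

static char local_store[256 * 1024];           /* stand-in for the SPE's 256KB LS */

static void dma_get(void *ls_dst, const void *ea_src, size_t size) {
    memcpy(ls_dst, ea_src, size);              /* system memory -> local store */
}
static void dma_put(void *ea_dst, const void *ls_src, size_t size) {
    memcpy(ea_dst, ls_src, size);              /* local store -> system memory */
}

/* "Pseudo" shared-memory load: DMA into the LS, then load from the LS. */
double shared_load(const double *effective_addr) {
    double *slot = (double *)local_store;
    dma_get(slot, effective_addr, sizeof *slot);
    return *slot;
}

/* "Pseudo" shared-memory store: store into the LS, then DMA to system memory. */
void shared_store(double *effective_addr, double value) {
    double *slot = (double *)local_store;
    *slot = value;
    dma_put(effective_addr, slot, sizeof *slot);
}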
Cell Programming Models
● Asymmetric Thread Runtime Model
– Using this model, the scheduling of threads on both the PPE and SPEs is possible
● Interaction among threads as in a conventional SMP
– Similar to the thread or lightweight-task models of conventional operating systems
● Extensions exist to include processing units with different instruction sets, as exist within the PPE and SPEs
– Tasks are scheduled on both the PPE and SPEs in order to optimize performance and utilization of resources
● The SPE's ability to run only a single thread is thus hidden from the programmer
Applications
• Games (PS3)
– Audio
– Video
– physics
Applications
• Imaging
– Medical, rendering, feature
extraction
– Mercury – Blade, Turismo
• Televisions (Toshiba)
– MPEG decode
Nvidia Tesla GPU
Introduction
• GPU: Graphics Processing Unit
• Hundreds of cores
• Programmable
• Can be easily installed in most desktops
• Similar price to a CPU
• GPU performance has followed Moore's Law better than CPU performance
Introduction
Motivation:
GPU Hardware
Multiprocessor Structure:
• N multiprocessors with M cores each
• SIMD: cores share an instruction unit with the other cores in a multiprocessor
• Diverging threads may not execute in parallel
GPU Hardware
Memory Hierarchy:
• Processors have 32-bit registers
• Multiprocessors have shared memory, a constant cache, and a texture cache
• Constant/texture caches are read-only and have faster access than shared memory
Programming Model
Past:
• The GPU was intended for graphics only, not general-purpose computing
• The programmer needed to rewrite the program using a graphics API such as OpenGL
• Complicated
Present:
• NVIDIA developed CUDA, a language for general-purpose GPU computing
• Simple
Programming Model
CUDA:
• Compute Unified Device Architecture
• Extension of the C language
• Used to control the device
• The programmer specifies CPU and GPU functions
• Host code can be C++
• Device code may only be C
• The programmer specifies the thread layout
Programming Model
Thread Layout:
• Threads are organized into blocks
• Blocks are organized into a grid
• A multiprocessor executes one block at a time
• A warp is the set of threads executed in parallel
• There are 32 threads in a warp
Programming Model
Heterogeneous Computing:
• The GPU and CPU execute different types of code
• The CPU runs the main program, sending tasks to the GPU in the form of kernel functions
• Multiple kernel functions may be declared and called
• Only one kernel may be called at a time
Programming Model: GPU vs. CPU Code
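The original slide's side-by-side code figure is not reproduced in the text; the following is a minimal CUDA sketch of the comparison it describes (array size, names, and launch parameters are illustrative): the same element-wise computation written as a CPU loop and as a GPU kernel, with the thread layout specified by the host at the launch site.

// Sketch: CPU loop vs. CUDA kernel for the same element-wise addition.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

void add_cpu(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)                   // CPU: one loop walks all elements
        c[i] = a[i] + b[i];
}

__global__ void add_gpu(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one GPU thread per element
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    add_cpu(ha, hb, hc, n);                       // CPU version, for comparison

    float *da, *db, *dc;                          // GPU version: allocate, copy, launch
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;                    // thread layout chosen by the programmer
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    add_gpu<<<blocksPerGrid, threadsPerBlock>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}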
Supercomputing Products
• Tesla C1060 GPU: 933 GFLOPS
• nForce motherboard
Intel Larrabee Microarchitecture
• L1
– 8KB I-cache and 8KB D-cache per thread
– 32KB I/D cache per core
– 2-way
– Treated as extended registers
• L2
– Coherent
– Shared and divided
– A 256KB subset for each core
On-Chip Network: Ring
Multithreading at Multiple Levels
• Threads (Hyper-Threading)
– Hardware-managed
– As heavy as an application program or OS
– Up to 4 threads per core
• Fibers
– Software-managed
– Chunks decomposed by compilers
– Typically up to 8 fibers per thread
• Strands
– Lowest level
– Individual operations in the SIMD engines
– One strand corresponds to a thread on GPUs
– 16 strands, because of the 16-lane VPU
Larrabee Programming Model
• “FULLY” programmable
• Legacy code is easy to migrate and deploy
– Runs both DirectX/OpenGL and C/C++ code
– Much C/C++ source code can be recompiled without modification, due to the x86 structure
– Crucial for large x86 legacy programs
• Limitations
– System calls
– Requires application recompilation
Software Threading
• System calls such as I/O functions are proxied from the Larrabee application back to the OS service.
High level programming
• Nvidia GeForce
– Memory sharing is supported by PBSM (Per-Block Shared Memory), 16KB on the GeForce 8
– Each PBSM is shared by 8 scalar processors
– Programmers MUST explicitly load data into PBSM (see the sketch after this list)
– Not directly sharable by different SIMD groups
• Larrabee
– All memory is shared by all processors
– Local data-structure sharing is transparently provided by the coherent cached memory hierarchy
– Programmers do not have to manage the data-loading procedure
– Scatter-gather mechanism
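A minimal CUDA sketch of the explicit PBSM loading noted above (tile size and names are illustrative; the launch assumes the array length is a multiple of TILE): each block stages its tile of the input into __shared__ memory, synchronizes, and then computes from the staged copy, which other blocks cannot see.

// Sketch: explicitly loading data into per-block shared memory (PBSM) in CUDA.
#include <cuda_runtime.h>

#define TILE 256

__global__ void reverse_tiles(const float *in, float *out) {
    __shared__ float tile[TILE];                 // per-block shared memory (PBSM)
    int i = blockIdx.x * TILE + threadIdx.x;     // assumes length is a multiple of TILE
    tile[threadIdx.x] = in[i];                   // explicit load into shared memory
    __syncthreads();                             // wait until the whole tile is staged
    out[i] = tile[TILE - 1 - threadIdx.x];       // each thread reads another thread's element
}

A launch such as reverse_tiles<<<n / TILE, TILE>>>(d_in, d_out) processes the whole array; on Larrabee, by contrast, the slide notes that such staging is handled transparently by the coherent cache hierarchy.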
Intel® Nehalem
Micro-Architecture
Outline
• Brief History
• Overview
• Memory System
• Core Architecture
• Hyper-Threading Technology
• QuickPath Interconnect (QPI)
Brief History of Intel Processors
Nehalem System Example: Building Blocks
How to Make a Silicon Die?
Overview of Nehalem Processor Chip
• Four identical compute cores
• UIU: un-core interface unit
• L3 cache memory and data block memory
Overview of Nehalem Processor Chip (cont.)
• IMC: Integrated Memory Controller with 3 DDR3 memory channels
• QPI: QuickPath Interconnect ports
• Auxiliary circuitry for cache coherence, power control, system management, and performance monitoring
Overview of Nehalem Processor Chip (cont.)
Components: Write-Back
5. Results written to
– Private L1/L2 cache
– Shared L3 cache
– QuickPath
• Dedicated channel to another CPU, chip, or device
• Replaces the front-side bus (FSB)
End of Unit 2