Introduction To Programming Massively Parallel Graphics Processors
Computation
Calculations vs. data communication/storage:
Calculation capabilities need data feed and storage.
The larger a storage structure, the slower it is.
It takes time to get data there and back: multiple cycles, even on the same die.
The ideal is unlimited bandwidth with zero/low latency; in practice there is a hierarchy of faster and slower caches.
CPUs automatically extract instruction-level parallelism and use large on-die caches to tolerate off-chip memory latency. A thread's flow of control executes one instruction at a time; optimizations are possible at the machine level.
CPU:
Handles sequential code well
Can't take advantage of massively parallel code
Lower off-chip bandwidth
Lower peak computation capability

GPU:
Requires massively parallel computation
Handles some control flow
Higher off-chip bandwidth
Higher peak computation capability
Programmer's view
The CPU talks to main memory at about 3-8 GB/s. The GPU talks to its own GPU memory (1 GB on our systems) at 141 GB/s.
Target Applications
Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.
Concurrency: doing multiple things in parallel needs more functional units.
Execution Timeline
CPU / Host:
1. Copy to GPU mem
2. Launch GPU Kernel
GPU / Device: runs the kernel
Computation partitioning:
At the highest level:
Think of computation as a series of loops:
for (i = 0; i < big_number; i++)
    a[i] = some function
for (i = 0; i < big_number; i++)
    a[i] = some other function
for (i = 0; i < big_number; i++)
    a[i] = some other function
Kernels are executed by threads; threads are grouped into blocks.
Why blocks? Realities of integrated circuits: computation and storage need to be clustered to achieve high speeds.
Figure: Block (1, 1), a 5 x 3 grid of threads indexed (0, 0) through (4, 2).
IDs and dimensions are accessible through predefined variables, e.g., blockDim.x and threadIdx.x
Programmer's view: Memory Model
Different memories with different uses and performance. Some are managed by the compiler; some must be managed by the programmer.
Blocks do not migrate: a block executes on the same processor throughout. Several blocks may run on the same processor.
Naming: ordinary host library calls look like fft(); CUDA runtime API calls are prefixed cuda...(); driver API calls are prefixed cu...().
CPU / GPU program structure:
1. Allocate CPU Data Structure
2. Initialize Data on CPU
3. Allocate GPU Data Structure
4. Copy Data from CPU to GPU
5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
8. Copy Data from GPU to CPU
9. De-allocate GPU and CPU memory
1. Allocate CPU Data

float *ha;

int main (int argc, char *argv[])
{
    int N = atoi (argv[1]);
    ha = (float *) malloc (sizeof (float) * N);
    ...
}

No memory is allocated on the GPU side.
Pinned memory allocation (cudaMallocHost ()) results in faster CPU to/from GPU copies, but pinned memory cannot be paged out. More on this later.
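As a sketch, the pinned-memory variant of the host allocation looks like this (the size N is illustrative; note that pinned memory is released with cudaFreeHost, not free):

```cuda
#include <cuda_runtime.h>

int main (void)
{
    float *ha;
    int N = 1024;                                /* illustrative size */

    /* Pinned (page-locked) allocation: enables faster DMA copies,
       but this memory cannot be paged out, so allocate sparingly. */
    cudaMallocHost ((void **) &ha, sizeof (float) * N);

    /* ... use ha as the source/destination of cudaMemcpy ... */

    cudaFreeHost (ha);                           /* not free() */
    return 0;
}
```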
3. Allocate GPU Data

float *da;
cudaMalloc ((void **) &da, sizeof (float) * N);

Notice: no assignment of a return value:
NOT: da = cudaMalloc ()
4. Copy Data from CPU to GPU

cudaMemcpy ((void *) da,             // DESTINATION
            (void *) ha,             // SOURCE
            sizeof (float) * N,      // #bytes
            cudaMemcpyHostToDevice); // DIRECTION

The host initiates all transfers:
cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
5. Define Execution Configuration & 6. Run Kernel

Alternatively:

blocks = (N + threads_block - 1) / threads_block;

This instructs the GPU to launch blocks x threads_block threads:

darradd <<<blocks, threads_block>>> (da, 10.0f, N);

The launch is asynchronous from the CPU's perspective: the CPU thread continues.

cudaThreadSynchronize (); // forces CPU to wait
CPU/GPU Synchronization
Eventually, CPU must know when GPU is done Then it can safely copy the GPU results
cudaThreadSynchronize ()
Blocks the CPU until all preceding cuda...() and kernel requests have completed.
8. Copy data from GPU to CPU & 9. De-Allocate Memory

float *da;
float *ha;

cudaMemcpy ((void *) ha,             // DESTINATION
            (void *) da,             // SOURCE
            sizeof (float) * N,      // #bytes
            cudaMemcpyDeviceToHost); // DIRECTION

cudaFree (da);
// display or process results here
free (ha);
The GPU Kernel

__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        da[i] = da[i] + x;
}
Assuming blockDim.x = 64:
blockIdx.x = 0 covers i = 0 .. 63 (a[0] .. a[63])
blockIdx.x = 1 covers i = 64 .. 127 (a[64] .. a[127])
blockIdx.x = 2 covers i = 128 .. 191 (a[128] .. a[191])
and so on, from a[192] onward.
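Putting steps 1 through 9 together with the darradd kernel, a minimal end-to-end sketch of the whole flow (error checking is omitted and the fixed N and block size are illustrative simplifications):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        da[i] = da[i] + x;
}

int main (void)
{
    int N = 1000;                                       /* illustrative size */
    int threads_block = 64;
    int blocks = (N + threads_block - 1) / threads_block;

    float *ha = (float *) malloc (sizeof (float) * N);  /* 1. allocate CPU  */
    for (int i = 0; i < N; i++) ha[i] = (float) i;      /* 2. initialize    */

    float *da;
    cudaMalloc ((void **) &da, sizeof (float) * N);     /* 3. allocate GPU  */
    cudaMemcpy (da, ha, sizeof (float) * N,
                cudaMemcpyHostToDevice);                /* 4. copy to GPU   */

    darradd <<<blocks, threads_block>>> (da, 10.0f, N); /* 5.-6. launch     */
    cudaThreadSynchronize ();                           /* 7. synchronize   */

    cudaMemcpy (ha, da, sizeof (float) * N,
                cudaMemcpyDeviceToHost);                /* 8. copy back     */

    printf ("ha[0] = %f\n", ha[0]);
    cudaFree (da);                                      /* 9. de-allocate   */
    free (ha);
    return 0;
}
```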
Function qualifiers determine where a function executes and from where it may be called:

__device__ float DeviceFunc() : executed on the device, callable from the device only
__global__ void KernelFunc()  : executed on the device, callable from the host only
__host__ float HostFunc()     : executed on the host, callable from the host only
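A minimal sketch combining the three qualifiers (the function names are illustrative, not part of any API):

```cuda
#include <cuda_runtime.h>

/* Runs on the device; callable only from device code. */
__device__ float scale (float x) { return 2.0f * x; }

/* Runs on the device; launched from the host with <<<blocks, threads>>>. */
__global__ void scale_all (float *da, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        da[i] = scale (da[i]);
}

/* Runs on the host; callable only from host code. */
__host__ float half_of (float x) { return 0.5f * x; }
```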
Kernel and Device Function Restrictions
__device__ functions cannot have their address taken, e.g.:
f = &addmany; (*f) ();
Control flow
On the GPU, 32 threads (a warp) run together. If they diverge, there is a performance penalty.
Texture cache: use it when you think there is locality.
But: the learning curve and the expertise needed, compared to CPUs, is much larger.
Computer Architecture
How to build the best possible system. "Best": performance, power, cost, etc.
Claims to fame
Memory Dependence Prediction
Commercially implemented and licensed
UofT-DRDC Partnership
1. Biomedical Engineering
2. Communications
3. Computer Engineering
4. Electromagnetics
5. Electronics
6. Energy Systems
7. Photonics
8. Systems Control
ECE
Human-Computer Interaction
Willy Wong, Steve Mann
Computer Hardware
Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson
Computer Architecture
Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger
Computer Security
Davie Lie, Ashvin Goel
Neurosystems
Biomedical Engineering: Berj L. Bardakjian, Roman Genov, Willy Wong, Hans Kunov, Moshe Eizenman
Rehabilitation
Brendan Frey.
Kevin Truong.
Communications Group
Study of the principles, mathematics and algorithms that underpin how information is encoded, exchanged and processed
Three Sub-Groups:
1. Networks 2. Signal Processing 3. Information Theory
Sequence Analysis
Networks
Computer Engineering
System Software
Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li
Electronics Group
Electronic device modelling
Semiconductor technology
VLSI CAD and Systems
FPGAs
DSP and Mixed-mode ICs
Biomedical microsystems
High-speed and mm-wave ICs and SoCs
Lab for (on-wafer) SoC and IC testing through 220 GHz. UofT-IBM Partnership
On-chip micro-sensors
Project examples
Modelling mm-wave and noise performance of active and passive devices past 300 GHz
60-120 GHz multi-gigabit data rate phased-array radios
Single-chip 76-79 GHz automotive radar
170 GHz transceiver with on-die antennas
Electromagnetics Group
Metamaterials: From microwaves to optics
Super-resolving lenses for imaging and sensing Small antennas Multiband RF components CMOS phase shifters
Antennas
Telecom and Wireless Systems
Reflectarrays
Wave electronics
Integrated antennas
Controlled-beam antennas
Adaptive and diversity antennas
METAMATERIALS (MTMs)
Super-lens capable of resolving details down to λ/6
Computational Electromagnetics
Fast CAD for RF/ optical structures
Microstrip spiral inductor Optical power splitter
Modeling of Metamaterials
Plasmonic Left-Handed Media
Leaky-Wave Antennas
Energy Systems Group
IC for cell phone power supplies
Voltage Control System for Wind Power Generators
Photonics Group
Systems Control Group
Basic & applied research in control engineering. World-leading group in control theory.
Optical Signal-to-Noise Ratio optimization with game theory
Erbium-doped fibre amplifier design
Analysis and design of digital watermarks for authentication
Nonlinear control theory, with application to magnetic levitation and micro-positioning systems