
Introduction to Programming Massively Parallel Graphics Processors: Introduction to CUDA Programming

Andreas Moshovos ([email protected]), ECE, Univ. of Toronto, Summer 2010


Some slides/material from:
UIUC course by Wen-Mei Hwu and David Kirk
UCSB course by Andrea Di Blas
Universitat Jena course by Waqar Saleem
NVIDIA by Simon Green
and others as noted on slides

How to Get High Performance

Computation: calculations
Data: communication/storage

Ideally: unlimited bandwidth, zero/low latency, tons of compute engines, tons of storage

Calculation capabilities

How many calculation units can be built? Today's silicon chips:

About 1B transistors; 30K transistors for a 52b multiplier
~30K multipliers

260 mm^2 area (mid-range); 112 microns^2 for an FP unit (overestimated)
~2K FP units

Frequency ~3GHz is common today
TFLOPs possible

Disclaimer: back-of-the-envelope calculations, take with a grain of salt

Can build lots of calculation units (ALUs)

How about Communication/Storage

Need data feed and storage
The larger the storage, the slower it is
Takes time to get there and back: multiple cycles, even on the same die

So we can build tons of compute engines, but unlimited bandwidth and zero/low latency do not exist; what we get is tons of slow storage

Is there enough parallelism?


Suppose we had unlimited bandwidth, zero/low latency, tons of compute engines and tons of storage

Could we keep all of that busy?
We would need lots of independent calculations
Parallelism/Concurrency

Much of what we do is sequential


First do 1, then do 2, then if X do 3 else do 4

Today's High-End General-Purpose Processors

Localize communication and computation
Try to automatically extract parallelism

Memory hierarchy: faster caches close to the core, slower caches below, backed by tons of slow off-chip storage

Automatically extract instruction-level parallelism
Large on-die caches to tolerate off-chip memory latency

Some things are naturally parallel

Sequential Execution Model

int a[N];   // N is large
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

Flow of control / thread: one instruction at a time
Optimizations possible at the machine level

Data Parallel Execution Model / SIMD

int a[N];   // N is large
for all elements do in parallel
    a[index] = a[index] * fade;

This has been tried before: ILLIAC III, UIUC, 1966

Single Program Multiple Data / SPMD

int a[N];   // N is large
for all elements do in parallel
    if (a[i] > threshold) a[i] *= fade;

The model used in today's Graphics Processors
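
For illustration only (not from the original slides), this is roughly how the fade example above might look as a CUDA kernel; fadeKernel, fade and threshold are placeholder names:

__global__ void fadeKernel (float *a, float fade, float threshold, int N)
{
    // Each thread computes its own element index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // All threads run the same program; each decides what to do with its own data.
    if (i < N && a[i] > threshold)
        a[i] *= fade;
}

Launched, for example, as fadeKernel <<< (N + 255) / 256, 256 >>> (d_a, fade, threshold, N); the i < N guard masks off the extra threads in the last block.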

CPU vs. GPU overview

CPU:
Handles sequential code well
Can't take advantage of massively parallel code
Lower off-chip bandwidth
Lower peak computation capability

GPU:
Requires massively parallel computation
Handles some control flow
Higher off-chip bandwidth
Higher peak computation capability

Programmer's view

GPU as a co-processor (2008)

CPU to GPU link: 3 GB/s - 8 GB/s
GPU to GPU memory (1GB on our systems): 141 GB/s
CPU to system memory: 6.4 - 31.92 GB/s, 8B per transfer

Target Applications

int a[N];   // N is large
for all elements of a compute
    a[i] = a[i] * fade

Lots of independent computations
CUDA threads need not be independent

Programmer's View of the GPU

GPU: a compute device that:
Is a coprocessor to the CPU (the host)
Has its own DRAM (device memory)
Runs many threads in parallel

Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads

Why are threads useful? Parallelism

Concurrency:
Do multiple things in parallel
Needs more functional units

Uses more hardware
Gets higher performance

Why are threads useful #2 Tolerating stalls

Often a thread stalls, e.g., memory access


Multiplex the same functional unit
Get more performance at a fraction of the cost

GPU vs. CPU Threads

GPU threads are extremely lightweight
Very little creation overhead, on the order of microseconds
All done in hardware

The GPU needs 1000s of threads for full efficiency
A multi-core CPU needs only a few

Execution Timeline

CPU / Host:
1. Copy to GPU memory
2. Launch GPU kernel
3. Synchronize with GPU
4. Copy from GPU memory

GPU / Device: runs the kernel between launch (2) and synchronization (3)

Programmer's view

1. First create data in CPU memory
2. Then copy it to GPU memory
3. The GPU starts computation (runs a kernel); the CPU can also continue
4. The CPU and GPU synchronize
5. Copy the results back to CPU memory

Computation partitioning:
At the highest level:
Think of the computation as a series of loops:

for (i = 0; i < big_number; i++) a[i] = some function
for (i = 0; i < big_number; i++) a[i] = some other function
for (i = 0; i < big_number; i++) a[i] = some other function

Each such loop becomes a kernel.
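
A minimal sketch (not from the slides) of how such a partitioning might look in CUDA; kernelA, kernelB and the per-element operations are hypothetical stand-ins:

__global__ void kernelA (float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f;      // "some function", stand-in
}

__global__ void kernelB (float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] + 1.0f;      // "some other function", stand-in
}

// Host side: each loop becomes one kernel launch over the whole array.
// kernelA <<< blocks, threads >>> (d_a, n);
// kernelB <<< blocks, threads >>> (d_a, n);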

Computation Partitioning -- Kernel


CUDA exposes the hardware to the programmer
The programmer must manually partition the work appropriately

The programmer's view is hierarchical:
Think of the data as an array

Per Kernel Computation Partitioning

Computation Grid: 2D case (a grid of blocks; each block is a grid of threads)

Threads within a block can communicate/synchronize
They run on the same multiprocessor

Threads across blocks can't communicate
They shouldn't touch each other's data; behavior is undefined

Thread Coordination Overview

Race-free access to data

GBT: Grids of Blocks of Threads


The programmer's view of data and computation partitioning

Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds

Block and Thread IDs

Threads and blocks have IDs
So each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D

[Figure: a device grid of blocks (0,0) through (2,1); block (1,1) expanded into threads (0,0) through (4,2)]

Simplifies memory addressing when processing multidimensional data
Convenience, not necessity

IDs and dimensions are accessible through predefined variables, e.g., blockDim.x and threadIdx.x
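
For illustration only (not part of the slides), a sketch of how the built-in 2D IDs can be combined to index a 2D array; kernel2D, width and height are placeholder names:

__global__ void kernel2D (float *a, int width, int height)
{
    // Combine block and thread IDs in each dimension to get a 2D coordinate.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        a[y * width + x] *= 2.0f;       // row-major flattening of the 2D index
}

// Host side: dim3 describes a 2D configuration.
// dim3 threads (16, 16);
// dim3 blocks ((width + 15) / 16, (height + 15) / 16);
// kernel2D <<< blocks, threads >>> (d_a, width, height);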

Execution Model: Ordering

Execution order is undefined

Do not assume and use:
that block 0 executes before block 1
that thread 10 executes before thread 20
or any other ordering, even if you can observe it

Future implementations may break this ordering
It's not part of the CUDA definition
Why? More flexible hardware options

Programmer's view: Memory Model
Different memories with different uses and performance
Some managed by the compiler
Some must be managed by the programmer

[Figure: the memory spaces; arrows show whether read and/or write is possible]

Execution Model Summary (for your reference)

Grid of blocks of threads
1D/2D grid of blocks; 1D/2D/3D blocks of threads

All blocks are identical: same structure and # of threads

Block execution order is undefined

Same-block threads:
can synchronize and share data fast (shared memory)

Threads from different blocks:
cannot cooperate; communication only through global memory

Threads and blocks have IDs:
simplifies data indexing; can be 1D, 2D, or 3D (threads)

Blocks do not migrate: they execute on the same processor
Several blocks may run on the same processor

CUDA Software Architecture

CUDA libraries, e.g., fft()
CUDA runtime API: cuda...() calls
CUDA driver API: cu...() calls

Reasoning about CUDA call ordering

GPU communication happens via cuda...() calls and kernel invocations
e.g., cudaMalloc, cudaMemcpy

Asynchronous from the CPU's perspective
The CPU places a request in a CUDA queue; requests are handled in-order

Streams allow for multiple queues
Order within each queue is honored
No order across queues

More on this much later on
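
As a forward-looking sketch only (streams are covered later), two queues might be set up roughly as below; kernelA, kernelB and the buffers are assumed to be declared elsewhere, and the host buffers would need to be pinned (cudaMallocHost) for the copies to actually overlap:

cudaStream_t s0, s1;
cudaStreamCreate (&s0);
cudaStreamCreate (&s1);

// Requests in the same stream execute in-order;
// requests in different streams may overlap.
cudaMemcpyAsync (d_a, h_a, nbytes, cudaMemcpyHostToDevice, s0);
kernelA <<< blocks, threads, 0, s0 >>> (d_a, N);

cudaMemcpyAsync (d_b, h_b, nbytes, cudaMemcpyHostToDevice, s1);
kernelB <<< blocks, threads, 0, s1 >>> (d_b, N);

cudaStreamSynchronize (s0);     // wait for everything queued on s0
cudaStreamSynchronize (s1);     // wait for everything queued on s1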

My first CUDA Program

// GPU code: the kernel
__global__ void arradd (float *a, float f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = a[i] + f;
}

// CPU (host) code
int main ()
{
    float h_a[N];
    float *d_a;
    cudaMalloc ((void **) &d_a, SIZE);
    cudaThreadSynchronize ();
    cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
    arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);
    cudaThreadSynchronize ();
    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
    CUDA_SAFE_CALL (cudaFree (d_a));
}

CUDA API: Example

int a[N];
for (i = 0; i < N; i++)
    a[i] = a[i] + x;

1. Allocate CPU Data Structure
2. Initialize Data on CPU
3. Allocate GPU Data Structure
4. Copy Data from CPU to GPU
5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
8. Copy Data from GPU to CPU
9. De-allocate GPU and CPU memory

1. Allocate CPU Data

float *ha;

main (int argc, char *argv[])
{
    int N = atoi (argv[1]);
    ha = (float *) malloc (sizeof (float) * N);
    ...
}

No memory is allocated on the GPU side
Pinned memory allocation (cudaMallocHost ()) results in faster CPU to/from GPU copies
But pinned memory cannot be paged out
More on this later
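
A minimal sketch (not from the slides) of the pinned-memory alternative mentioned above:

float *ha;
// Pinned (page-locked) host allocation: faster host/device copies,
// but this memory cannot be paged out by the OS.
cudaMallocHost ((void **) &ha, sizeof (float) * N);
...
cudaFreeHost (ha);              // release with cudaFreeHost, not free()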

2. Initialize CPU Data (dummy)

float *ha;
int i;

for (i = 0; i < N; i++)
    ha[i] = i;

3. Allocate GPU Data

float *da;
cudaMalloc ((void **) &da, sizeof (float) * N);

Notice: no assignment to da
NOT: da = cudaMalloc (...)

The assignment is done internally: that's why we pass &da
Space is allocated in Global Memory on the GPU

GPU Memory Allocation

The host manages GPU memory allocation:

cudaMalloc (void **ptr, size_t nbytes)
Must explicitly cast to (void **)
cudaMalloc ((void **) &da, sizeof (float) * N);

cudaFree (void *ptr);
cudaFree (da);

cudaMemset (void *ptr, int value, size_t nbytes);
cudaMemset (da, 0, N * sizeof (float));

Check the CUDA Reference Manual

4. Copy Initialized CPU data to GPU

float *da;
float *ha;

cudaMemcpy ((void *) da,             // DESTINATION
            (void *) ha,             // SOURCE
            sizeof (float) * N,      // #bytes
            cudaMemcpyHostToDevice); // DIRECTION

Host/Device Data Transfers

The host initiates all transfers:
cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)

Asynchronous from the CPU's perspective
The CPU thread continues
In-order processing with other CUDA requests

enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice

5. Define Execution Configuration

How many blocks and threads/block?

int threads_block = 64;
int blocks = N / threads_block;
if (N % threads_block != 0) blocks += 1;

Alternatively:
blocks = (N + threads_block - 1) / threads_block;
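
For example (not from the slides): with N = 1000 and threads_block = 64 this gives blocks = 16, i.e., 1024 threads in total; the i < N check inside the kernel masks off the 24 extra threads in the last block.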

6. Launch Kernel & 7. CPU/GPU Synchronization

Instructs the GPU to launch blocks x threads_block threads:

darradd <<< blocks, threads_block >>> (da, 10.0f, N);
cudaThreadSynchronize ();       // forces CPU to wait

darradd: kernel name
<<< >>>: execution configuration (more on this soon)
(da, 10.0f, N): arguments
256-byte limit / No variable arguments

CPU/GPU Synchronization

The CPU does not block on cuda...() calls
Kernels/requests are queued and processed in-order
Control returns to the CPU immediately

Good if there is other work to be done
e.g., preparing for the next kernel invocation

Eventually, the CPU must know when the GPU is done
Then it can safely copy the GPU results

cudaThreadSynchronize ()
Blocks the CPU until all preceding cuda...() and kernel requests have completed
8. Copy data from GPU to CPU & 9. De-allocate Memory

float *da;
float *ha;

cudaMemcpy ((void *) ha,             // DESTINATION
            (void *) da,             // SOURCE
            sizeof (float) * N,      // #bytes
            cudaMemcpyDeviceToHost); // DIRECTION

cudaFree (da);
// display or process results here
free (ha);

The GPU Kernel

__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) da[i] = da[i] + x;
}

blockIdx: unique block ID
Numerically ascending: 0, 1, ...

blockDim: dimensions of the block = how many threads it has
blockDim.x, blockDim.y, blockDim.z
Unused dimensions default to 1

threadIdx: unique per-block index
0, 1, ... per block

Array Index Calculation Example

int i = blockIdx.x * blockDim.x + threadIdx.x;

Assuming blockDim.x = 64:
blockIdx.x = 0: a[0] ... a[63],    i = 0 ... 63
blockIdx.x = 1: a[64] ... a[127],  i = 64 ... 127
blockIdx.x = 2: a[128] ... a[191], i = 128 ... 191
blockIdx.x = 3 starts at a[192],   i = 192, and so on

CUDA Function Declarations

                                   Executed on:   Only callable from:
__device__ float DeviceFunc()      device         device
__global__ void  KernelFunc()      device         host
__host__   float HostFunc()        host           host

__global__ defines a kernel function
Must return void
Can only call __device__ functions

__device__ and __host__ can be used together
Two different versions are generated
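
For illustration only, a small sketch of the combined qualifiers; scale is a placeholder name:

// Compiled twice: one version for the CPU, one for the GPU.
__host__ __device__ float scale (float v, float f)
{
    return v * f;
}

The same source can then be called from host code and from inside a kernel.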

__device__ Example

Add x to a[i] multiple times:

__device__ float addmany (float a, float b, int count)
{
    while (count--) a += b;
    return a;
}

__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) da[i] = addmany (da[i], x, 10);
}

Kernel and Device Function Restrictions

__device__ functions cannot have their address taken
e.g., f = &addmany; (*f)();

For functions executed on the device:

No recursion
e.g., darradd () { ... darradd (); ... }

No static variable declarations inside the function
e.g., darradd () { static int canthavethis; }

No variable number of arguments
e.g., something like printf ()

My first CUDA Program

// GPU code: the kernel
__global__ void arradd (float *a, float f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = a[i] + f;
}

// CPU (host) code
int main ()
{
    float h_a[N];
    float *d_a;
    cudaMalloc ((void **) &d_a, SIZE);
    cudaThreadSynchronize ();
    cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
    arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);
    cudaThreadSynchronize ();
    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
    CUDA_SAFE_CALL (cudaFree (d_a));
}

How to get high-performance #1

Programmer-managed scratchpad memory (shared memory)
Bring data in from global memory and reuse it
16KB, banked
Accessed in parallel by 16 threads

The programmer needs to:
Decide what to bring in and when
Decide which thread accesses what and when
Coordination is paramount
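
A minimal sketch (not from the slides) of this staging pattern; reverseBlock is hypothetical and assumes 64 threads per block and an array length that is a multiple of 64:

__global__ void reverseBlock (float *d)
{
    __shared__ float s[64];                     // per-block scratchpad tile
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    s[threadIdx.x] = d[i];                      // stage the tile from global memory
    __syncthreads ();                           // coordinate: wait until the whole tile is loaded

    d[i] = s[blockDim.x - 1 - threadIdx.x];     // reuse the staged data: reverse within the block
}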

How to get high-performance #2

Global memory accesses
32 threads access memory together
Can coalesce into a single reference
e.g., a[threadID] works well

Control flow
32 threads run together
If they diverge there is a performance penalty

Texture cache
Use when you think there is locality
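
For illustration only, a hypothetical pair of access patterns; copyGood coalesces, copyBad does not:

__global__ void copyGood (float *dst, const float *src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];                    // neighboring threads touch neighboring words: coalesces
}

__global__ void copyBad (float *dst, const float *src, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i * stride] = src[i * stride];  // strided accesses: many separate memory transactions
}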

Are GPUs really that much faster than CPUs

50x - 200x speedups are typically reported
Recent work found that not enough effort goes into optimizing code for CPUs

But:
The learning curve and expertise needed for CPUs is much larger

ECE Overview - ECE Research Profile

Personnel and budget
Partnerships with industry

Our areas of expertise:
Biomedical Engineering
Communications
Computer Engineering
Electromagnetics
Electronics
Energy Systems
Photonics
Systems Control

- Slides from F. Najm (Chair) and T. Sargent (Research Vice Chair)

About our group

Computer Architecture
How to build the best possible system Best: performance, power, cost, etc.

Expertise in high-end systems


Micro-architecture Multi-processor and Multi-core systems

Current Research Support:


AMD, IBM, NSERC, Qualcomm (planned)

Claims to fame
Memory Dependence Prediction
Commercially implemented and licensed

Snoop Filtering: IBM Blue Gene

UofT-DRDC Partnership

Examples of industry research contracts with ECE in the past 8 years


AMD Agile Systems Inc Altera ARISE Technologies Asahi Kasei Microsystems Bell Canada Bell Mobility Cellular Bioscrypt Inc Broadcom Corporation Ciclon Semiconductor Cybermation Inc Digital Predictive Systems Inc. DPL Science Eastman Kodak Electro Scientific Industries EMS Technologies Exar Corp FOX-TEK Firan Technology Group Fuji Electric Fujitsu Gennum H2Green Energy Corporation Honeywell ASCa, Inc. Hydro One Networks Inc. IBM Canada Ltd. IBM IMAX Corporation Intel Corporation Jazz Semiconductor KT Micro LG Electronics Maxim MPB Technologies Microsoft Motorola Northrop Grumman NXP Semiconductors ON Semiconductor Ontario Lottery and Gaming Corp Ontario Power Generation Inc. Panasonic Semiconductor Singapore Peraso Technologies Inc. Philips Electronics North America Redline Communications Inc. Research in Motion Ltd. Right Track CAD Robert Bosch Corporation Samsung Thales Co., Ltd Semiconductor Research Corporation Siemens Aktiengesellschaft Sipex Corporation STMicroelectronics Inc. Sun Microsystems of Canada Inc. Telus Mobility Texas Instruments Toronto Hydro-Electric System Toshiba Corporation Xilinx Inc.


Eight Research Groups

1. Biomedical Engineering
2. Communications
3. Computer Engineering
4. Electromagnetics
5. Electronics
6. Energy Systems
7. Photonics
8. Systems Control

Computer Engineering Group

Human-Computer Interaction
Willy Wong, Steve Mann

Multi-sensor information systems


Parham Aarabi

Computer Hardware
Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson

Computer Architecture
Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger

Computer Security
David Lie, Ashvin Goel

Biomedical Engineering

Neurosystems
Berj L. Bardakjian, Roman Genov, Willy Wong, Hans Kunov, Moshe Eizenman

Rehabilitation
Milos Popovic, Tom Chau

Medical Imaging
Michael Joy, Adrian Nachman, Richard Cobbold, Ofer Levi

Proteomics
Brendan Frey, Kevin Truong

Communications Group

Study of the principles, mathematics and algorithms that underpin how information is encoded, exchanged and processed
Three Sub-Groups:
1. Networks
2. Signal Processing
3. Information Theory

Sequence Analysis

Image Analysis and Computer Vision


Pattern recognition and detection

Embedded computer vision
Computer vision and graphics

Networks

Quantum Cryptography and Computing

Computer Engineering

System Software
Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li

Computer-Aided Design of Circuits


Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose

Electronics Group

14 active professors; largest electronics group in Canada. Breadth of research topics:

Electronic device modelling
Semiconductor technology
VLSI CAD and systems
FPGAs
DSP and mixed-mode ICs
Biomedical microsystems
High-speed and mm-wave ICs and SoCs

Lab for (on-wafer) SoC and IC testing through 220 GHz (UofT-IBM partnership)

Intelligent Sensory Microsystems

Mixed-signal VLSI circuits

Low-power, low-noise signal processing, computing and ADCs

On-chip micro-sensors

Electrical, chemical, optical

Project examples

Brain-chip interfaces
On-chip biochemical sensors
CMOS imagers

mm-Wave and 100+GHz systems on chip

Modelling mm-wave and noise performance of active and passive devices past 300 GHz
60-120 GHz multi-gigabit data rate phased-array radios
Single-chip 76-79 GHz automotive radar
170 GHz transceiver with on-die antennas

Electromagnetics Group
Metamaterials: From microwaves to optics
Super-resolving lenses for imaging and sensing
Small antennas
Multiband RF components
CMOS phase shifters

Electromagnetics of High-Speed Circuits


Signal integrity in high-speed digital systems

Microwave integrated circuit design, modeling and characterization Computational Electromagnetics


Interaction of Electromagnetic Fields with Living Tissue

Antennas
Telecom and wireless systems
Reflectarrays
Wave electronics
Integrated antennas
Controlled-beam antennas
Adaptive and diversity antennas

METAMATERIALS (MTMs)
Super-lens capable of resolving details down to λ/6

Scanning antennas with CMOS MTM chips


Small and broadband antennas

Computational Electromagnetics
Fast CAD for RF/ optical structures
Microstrip spiral inductor Optical power splitter

Modeling of Metamaterials
Plasmonic Left-Handed Media

Leaky-Wave Antennas

Energy Systems Group


Power Electronics
High power (> 1.2 MW): converter modeling, control, and digital control realization
Micro-power grids: converters for distributed resources, dc distribution systems, and HVdc systems
Low-power electronics: integrated power supplies and power management systems-on-chip for low-power electronics (computers, cell phones, PDAs, MP3 players, body implants)

Harvesting Energy from humans


Energy Systems Research

Matrix converter for micro-turbine generator (UofT)
IC for cell phone power supplies
Voltage control system for wind power generators

Photonics Group


Photonics Group: Bio-Photonics

Systems Control Group

Basic & applied research in control engineering
World-leading group in control theory

Optical signal-to-noise ratio optimization with game theory
Erbium-doped fibre amplifier design
Analysis and design of digital watermarks for authentication
Nonlinear control theory: applications to magnetic levitation and micro-positioning systems
