Topic GPU1
CS240A. 2017
T. Yang
• Hardware architecture
• Programming model
• Example
Historical PC
FIGURE A.2.1 Historical PC. VGA controller drives graphics display from framebuffer memory. Copyright © 2009
Elsevier, Inc. All rights reserved.
Intel/AMD CPU with GPU
FIGURE A.2.2 Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and
interconnects in this figure. Copyright © 2009 Elsevier
GPU Evolution
• GPU:
Ø SMs (streaming multiprocessors)
o 30 SMs on GT200
o 14-16 SMs on Fermi, depending on the product
Ø For example, GTX 480: 15 SMs x 32 cores = 480 cores on a GPU
[Figure: HOST connected to an array of SMs; each SM is a column of SP cores with its own SHARED MEMORY]
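The SM count of whatever GPU is installed can be checked at run time; below is a minimal sketch (not from the slides) using the CUDA runtime call cudaGetDeviceProperties:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query the properties of device 0
    cudaGetDeviceProperties(&prop, 0);
    // multiProcessorCount is the number of SMs; the number of cores
    // per SM depends on the architecture (e.g., 32 on Fermi)
    printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
    return 0;
}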
More Detailed GPU Architecture View
FIGURE A.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14
streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA
GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM
has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit,
and a shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.
CUDA essentials
• Download from developer.nvidia.com:
§ Driver
§ Toolkit (compiler nvcc)
§ SDK (examples) (recommended)
§ CUDA Programming Guide
• Other tools:
§ ‘Emulator’: executes on CPU; slow
§ Simple profiler
§ cuda-gdb (Linux)
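For reference, here is a complete .cu file the toolkit can build with, e.g., nvcc hello.cu -o hello (the file and kernel names are illustrative):

#include <stdio.h>

// Trivial kernel: each thread records its own index
__global__ void fill(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

int main(void)
{
    int h[4], *d;
    cudaMalloc((void **)&d, sizeof(h));
    fill<<<1, 4>>>(d);                        // 1 block of 4 threads
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%d %d %d %d\n", h[0], h[1], h[2], h[3]);
    cudaFree(d);
    return 0;
}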
How To Program For GPUs
n Parallelization
n Decomposition to threads
n Memory
n Shared memory, global memory
n Enormous processing power
n Thread communication
n Synchronization, no interdependencies (see the sketch below)
[Figure: SMs of SP cores, each with a SHARED MEMORY; GLOBAL MEMORY (on device); HOST]
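As one concrete illustration of shared memory plus synchronization (not from the slides; BLOCKSIZE is assumed to match the launch width):

#define BLOCKSIZE 4

// Each block stages its elements in fast on-chip shared memory,
// synchronizes, then reads them back in reverse order.
__global__ void reverse_block(int *data)
{
    __shared__ int tile[BLOCKSIZE];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();   // wait until every thread has written its element
    data[t] = tile[BLOCKSIZE - 1 - t];
}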
Application Thread Blocks
n Threads grouped in thread blocks
n Typically 128, 192, or 256 threads per block (sizing sketch below)
[Figure: BLOCK 1 with threads (0,0), (0,1), (0,2) on the DEVICE (GPU); HOST (CPU) processors (P) and memories (M) joined by an interconnect between devices and memories]
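A common sizing pattern (a sketch; the kernel scale and the element count n are illustrative): pick a block size such as 256 and derive each thread's global index from its block and thread IDs.

__global__ void scale(float *x, int n)
{
    // Global index across all thread blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the last block may be partially full
        x[i] *= 2.0f;
}

// Launch: 256 threads per block, enough blocks to cover n elements
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   scale<<<blocks, threads>>>(d_x, n);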
CUDA Runtime API
• Runtime API
§ Memory, symbol, and execution management
• cudaMalloc
§ Allocate memory on GPU (global memory); note that the real signature takes a pointer-to-pointer:
// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);
• cudaMemcpy
§ Transfer data to and from GPU (global memory)
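A typical allocate / copy in / compute / copy out sequence built from these two calls (a sketch; the buffer names are illustrative):

#include <cuda_runtime.h>
#define SIZE 16

int main(void)
{
    int h_in[SIZE] = {0}, h_out[SIZE];
    int *d_in, *d_out;
    size_t bytes = SIZE * sizeof(int);

    cudaMalloc((void **)&d_in,  bytes);   // allocate in GPU global memory
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // host -> GPU
    /* ... launch kernel(s) operating on d_in / d_out ... */
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // GPU -> host
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}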
Host Only:
int main_data;
Main() { }
__host__ hfunc() {
int hdata;
}
Interface (host-side launch of device code):
Main() {
gfunc<<<g, b, m>>>();
}
Device Only:
__shared__ int sdata;
__global__ gfunc() {
int gdata;
}
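The same split in compilable form; a sketch keeping the slide's placeholder names hfunc and gfunc, with the types and launch syntax CUDA actually requires:

// Host only: ordinary C that runs on the CPU
int main_data;

__host__ void hfunc(void)
{
    int hdata = 0;   // host-local data
    main_data = hdata;
}

// Device only: code that runs on the GPU
__global__ void gfunc(void)
{
    int gdata = 0;            // per-thread local variable
    __shared__ int sdata[4];  // shared by all threads in the block
    sdata[threadIdx.x] = gdata;
}

// Interface: the host configures and launches the device function
int main(void)
{
    hfunc();
    gfunc<<<1, 4>>>();        // <<<grid, block>>> launch configuration
    cudaDeviceSynchronize();
    return 0;
}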
Block and Thread IDs
§ Each thread uses its IDs to decide what data to work on
§ Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
§ Thread ID: 1D, 2D, or 3D (threadIdx.x, threadIdx.y, threadIdx.z)
§ Simplifies memory addressing when processing multidimensional data
[Figure: a 3x2 grid of blocks (0,0) through (2,1), with Block (1,1) expanded into its threads. Courtesy: NVIDIA]
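Putting block and thread IDs together for 2D data (a sketch; the image layout, sizes, and kernel name are assumptions):

// Each thread handles one (row, col) element of a row-major 2D array
__global__ void clear2d(float *img, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        img[row * width + col] = 0.0f;
}

// Launch with a 2D grid of 2D blocks:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   clear2d<<<grid, block>>>(d_img, width, height);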
Simple working code example
• Scan a 16-element array of digits and count how many times “6” appears:
3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6
• Each of 4 threads examines 4 elements:
threadIdx.x = 0 examines in_array elements 0, 4, 8, 12
threadIdx.x = 1 examines in_array elements 1, 5, 9, 13
threadIdx.x = 2 examines in_array elements 2, 6, 10, 14
threadIdx.x = 3 examines in_array elements 3, 7, 11, 15
• Known as a cyclic data distribution
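In code, this distribution is a strided loop starting at the thread's own index; a minimal sketch, using the SIZE and BLOCKSIZE constants from the listing further below:

#define SIZE 16
#define BLOCKSIZE 4

// Thread t visits elements t, t + BLOCKSIZE, t + 2*BLOCKSIZE, ...
__global__ void scan_cyclic(int *in_array)
{
    for (int i = threadIdx.x; i < SIZE; i += BLOCKSIZE) {
        /* examine in_array[i] */
    }
}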
CUDA Pseudo-Code

MAIN PROGRAM:
• Initialization
• Allocate memory on host for input and output
• Assign random numbers to input array
• Call host function
• Calculate final output from per-thread output
• Print result

HOST FUNCTION:
• Allocate memory on device for copies of input and output
• Copy input to device
• Set up grid/block
• Call global function
• Synchronize after completion
• Copy device output to host

GLOBAL FUNCTION:
• Thread scans subset of array elements
• Call device function to compare with “6”
• Compute local result

DEVICE FUNCTION:
• Compare current element and “6”
• Return 1 if same, else 0
Main Program
#include <stdio.h>

#define SIZE 16
#define BLOCKSIZE 4

__host__ void outer_compute(int *in_arr, int *out_arr);

int main(int argc, char **argv)
{
    int *in_array, *out_array;
    int sum = 0;

    /* initialization (omitted): allocate host memory for input and
       output, assign random numbers to input array */ …

    /* call host function */
    outer_compute(in_array, out_array);

    /* calculate final output from per-thread output */
    for (int i = 0; i < BLOCKSIZE; i++) {
        sum += out_array[i];
    }

    /* print result */
    printf("Result = %d\n", sum);
}
Host Function: Preliminaries & Allocation
Host Function: Copy Data To/From Host
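A sketch of the host function following the pseudo-code above: allocate device copies, copy the input over, launch one block of BLOCKSIZE threads, synchronize, and copy the per-thread results back. The device pointer names d_in/d_out and the kernel name compute are assumptions (compute is sketched under Device Function below); SIZE and BLOCKSIZE come from the main program.

__host__ void outer_compute(int *in_arr, int *out_arr)
{
    int *d_in, *d_out;

    /* allocate memory on device for copies of input and output */
    cudaMalloc((void **)&d_in,  SIZE * sizeof(int));
    cudaMalloc((void **)&d_out, BLOCKSIZE * sizeof(int));

    /* copy input to device */
    cudaMemcpy(d_in, in_arr, SIZE * sizeof(int), cudaMemcpyHostToDevice);

    /* set up grid/block and call global function */
    compute<<<1, BLOCKSIZE>>>(d_in, d_out);

    /* synchronize after completion, then copy device output to host */
    cudaDeviceSynchronize();
    cudaMemcpy(out_arr, d_out, BLOCKSIZE * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}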
Device Function
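A sketch of the global and device functions matching the pseudo-code: each thread scans its cyclic subset of the array, counting matches via the device function. The function names compute and compare are assumptions.

/* DEVICE FUNCTION: compare the current element with "6";
   return 1 if same, else 0 */
__device__ int compare(int a, int b)
{
    return (a == b) ? 1 : 0;
}

/* GLOBAL FUNCTION: each thread scans its cyclic subset of the
   array and accumulates its local result in out_arr[threadIdx.x] */
__global__ void compute(int *in_arr, int *out_arr)
{
    out_arr[threadIdx.x] = 0;
    for (int i = threadIdx.x; i < SIZE; i += BLOCKSIZE) {
        out_arr[threadIdx.x] += compare(in_arr[i], 6);
    }
}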
Summary of Lecture
• Introduction to CUDA: C + API supporting heterogeneous data-parallel CPU+GPU execution
§ Computation partitioning
§ Data partitioning (parts of this implied by decomposition into threads)
§ Data organization and management
§ Concurrency management
• The compiler nvcc takes a .cu program as input and produces
§ C code for the host processor (CPU), compiled by a native C compiler
§ code for the device processor (GPU), compiled by nvcc itself