
Introduction to GPU (Graphics Processing Unit) Architecture & Programming

CS240A, 2017
T. Yang

Some slides are from M. Hall of Utah, CS6235
Overview

• Hardware architecture
• Programming model
• Example
Historical PC

FIGURE A.2.1 Historical PC. VGA controller drives graphics display from framebuffer memory. Copyright © 2009
Elsevier, Inc. All rights reserved.
Intel/AMD CPU with GPU

FIGURE A.2.2 Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and
interconnects in this figure. Copyright © 2009 Elsevier
GPU Evolution

A GPU is a highly parallel, highly multithreaded multiprocessor optimized for graphics computing and other applications.
• New GPUs are released every 12 to 18 months
• Number crunching: 1 card ~= 1 teraflop ~= a small cluster

• 1980s – No GPU; PCs used a VGA controller
• 1990s – More functions added to the VGA controller
• 1997 – 3D acceleration functions:
  – Hardware for triangle setup and rasterization
  – Texture mapping
  – Shading
• 2000 – A single-chip graphics processor (beginning of the term "GPU")
• 2005 – Massively parallel programmable processors
GPU Programming API
• CUDA (Compute Unified Device Architecture): parallel GPU programming API created by NVIDIA
  – Hardware and software architecture for issuing and managing computations on the GPU
• Massively parallel architecture; running over 8,000 threads is common
• API libraries for the C/C++/Fortran languages
• Numerical libraries: cuBLAS, cuFFT, etc.
• OpenGL – an open standard for GPU programming
• DirectX – a series of Microsoft multimedia programming interfaces
GPU Architecture

[Figure: a GPU with multiple streaming multiprocessors (SMs), each containing scalar processors (SPs) and a shared memory, connected to global memory on the device and to the host.]

• SP: scalar processor ('CUDA core')
  – Executes one thread
• SM: streaming multiprocessor
  – 32 SPs (or 16, 48 or more)
  – Fast local 'shared memory' (shared between the SPs), 16 KiB (or 64 KiB)
• GPU:
  – Multiple SMs, e.g., 30 SMs on GT200, 14 SMs on Fermi
  – For example, GTX 480: 14 SMs x 32 cores = 448 cores on a GPU
  – GDDR global memory (on device), 512 MiB - 6 GiB
More Detailed GPU Architecture View

FIGURE A.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14
streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA
GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM
has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit,
and a shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.
CUDA essentials

• From developer.nvidia.com, download:
  § Driver
  § Toolkit (the nvcc compiler)
  § SDK (examples) (recommended)
  § CUDA Programming Guide

Other tools:
• 'Emulator' – executes on the CPU; slow
• Simple profiler
• cuda-gdb (Linux)
How To Program For GPUs

• Parallelization
  – Decomposition to threads
• Memory
  – Shared memory, global memory
• Enormous processing power
• Thread communication
  – Synchronization, no interdependencies
Application Thread blocks

[Figure: Block 1 containing threads (0,0) through (1,2).]

• Threads are grouped in thread blocks
  – 128, 192 or 256 threads in a block
• One thread block executes on one SM
  – All threads share the 'shared memory'
  – 32 threads are executed simultaneously (a 'warp')
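As a concrete illustration of the block sizes mentioned above, the host typically picks a block size and derives the number of blocks from the problem size. The sketch below is illustrative only; the names myKernel, d_data, and N are assumptions, not from the lecture.

  // Minimal sketch (assumed names) of choosing a launch configuration.
  int N = 4096;                                  // total number of elements (assumed)
  int threadsPerBlock = 256;                     // e.g., 128, 192, or 256 threads per block
  int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up
  myKernel<<<numBlocks, threadsPerBlock>>>(d_data, N);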
Application Thread blocks

[Figure: a grid of blocks 0 through 8; Block 1 shown with threads (0,0) through (1,2).]

• Blocks execute on SMs
  – They execute in parallel
  – They execute independently!
• Blocks form a GRID
• Thread ID: unique within a block
• Block ID: unique within the grid
Thread Batching: Grids and Blocks

[Figure: the host launches Kernel 1 on Grid 1 (blocks (0,0) through (2,1)) and Kernel 2 on Grid 2; Block (1,1) is expanded into threads (0,0) through (4,2). Courtesy: NVIDIA.]

• A kernel is executed as a grid of thread blocks
  § All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  § Synchronizing their execution
    – For hazard-free shared memory accesses
  § Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
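The cooperation described above happens only within a block: threads exchange data through __shared__ memory and meet at __syncthreads(); there is no equivalent barrier between blocks. The kernel below is an illustrative sketch, not from the lecture; blockSum, partial, d_in, and d_out are assumed names, and it assumes a block of at most 256 threads.

  // Sketch: threads of one block cooperate through shared memory.
  __global__ void blockSum(int *d_in, int *d_out) {
      __shared__ int partial[256];               // one slot per thread in the block
      partial[threadIdx.x] = d_in[blockIdx.x * blockDim.x + threadIdx.x];
      __syncthreads();                           // barrier: all loads visible to the block
      if (threadIdx.x == 0) {                    // thread 0 combines the block's data
          int sum = 0;
          for (int i = 0; i < blockDim.x; i++) sum += partial[i];
          d_out[blockIdx.x] = sum;               // one result per block
      }
  }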
What Programmer Expresses in CUDA

[Figure: host (CPU) and device (GPU), each with processors (P) and memories (M), linked by an interconnect.]

• Computation partitioning (where to run)
  § Declarations on functions: __host__, __global__, __device__
  § Mapping of thread programs to the device: compute<<<gs, bs>>>(<args>)
• Data partitioning (where does data reside, who may access it and how?)
  § Declarations on data: __shared__, __device__, __constant__, …
• Data management and orchestration
  § Copying to/from host: e.g., cudaMemcpy(h_obj, d_obj, size, cudaMemcpyDeviceToHost)
• Concurrency management
  § E.g., __syncthreads()
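One of the data declarations listed above, __constant__, is worth a small illustration: a read-only table placed in constant memory and filled from the host. This is a hypothetical sketch; coeff, h_coeff, apply, and d_a are assumed names.

  // Sketch (assumed names): __constant__ data filled from the host.
  __constant__ float coeff[16];                  // read-only table in constant memory

  __global__ void apply(float *d_a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d_a[i] *= coeff[i % 16];        // every thread reads the table
  }

  // Host side: copy the table into constant memory before launching.
  //   float h_coeff[16] = { /* ... */ };
  //   cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));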
Code that executes on GPU: Kernels
• A kernel is
  – a simple C function
  – that executes on the GPU in parallel,
  – as many times as there are threads
• The keyword __global__ tells the nvcc compiler to make a function a kernel (and compile it for the GPU instead of the CPU)
• Kernels are the functions you may call from the host side using the CUDA kernel call syntax (<<<...>>>)
• Device functions can only be called from other device or global functions; __device__ functions cannot be called from host code
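To make the calling rules concrete, here is a small hypothetical example (square, squareAll, and d_a are assumed names): the __device__ helper is callable only from GPU code, while the __global__ kernel is launched from the host.

  // Sketch of the calling rules above.
  __device__ int square(int x) {                 // callable only from GPU code
      return x * x;
  }

  __global__ void squareAll(int *d_a, int n) {   // launched from the host with <<<...>>>
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d_a[i] = square(d_a[i]);        // legal: device code calling a device function
  }

  // From host code:
  //   squareAll<<<numBlocks, threadsPerBlock>>>(d_a, n);   // legal
  //   square(3);                                           // illegal: host calling __device__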
Minimal Extensions to C + API

• Declspecs
  § global, device, shared, local, constant
• Keywords
  § threadIdx, blockIdx
• Intrinsics
  § __syncthreads
• Runtime API
  § Memory, symbol, execution management
• Function launch

Example:

  __device__ float filter[N];

  __global__ void convolve (float *image)
  {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
  }

  // Allocate GPU memory
  float *myimage;
  cudaMalloc((void **) &myimage, bytes);

  // 100 blocks, 10 threads per block
  convolve<<<100, 10>>> (myimage);
Setup and data transfer

• cudaMemcpy
  § Transfers data to and from the GPU (global memory)
• cudaMalloc
  § Allocates memory on the GPU (global memory)
• The GPU is the 'device', the CPU is the 'host'
• Kernel call syntax: kernel<<<gridDim, blockDim>>>(args)
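Putting these calls together, a typical host-side sequence looks like the sketch below. The names h_data, d_data, mykernel, SIZE, numBlocks, and threadsPerBlock are assumptions for illustration, not from the lecture.

  // Hypothetical host-side setup and transfer sequence.
  int h_data[SIZE];                          // input on the host
  int *d_data;

  cudaMalloc((void **) &d_data, SIZE * sizeof(int));             // allocate on device
  cudaMemcpy(d_data, h_data, SIZE * sizeof(int),
             cudaMemcpyHostToDevice);                            // host -> device

  mykernel<<<numBlocks, threadsPerBlock>>>(d_data);              // launch (assumed kernel)

  cudaMemcpy(h_data, d_data, SIZE * sizeof(int),
             cudaMemcpyDeviceToHost);                            // device -> host
  cudaFree(d_data);                                              // release device memory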


NVCC Compiler's Role: Partition Code and Compile for Device

The input file mycode.cu:

  int main_data;
  __shared__ int sdata;

  main() { }
  __host__ hfunc () {
    int hdata;
    <<<gfunc(g,b,m)>>>();
  }
  __global__ gfunc() {
    int gdata;
  }
  __device__ dfunc() {
    int ddata;
  }

nvcc partitions this file into three parts:

• Host Only – compiled by the native compiler (gcc, icc, cc):
  int main_data;   main() { }   __host__ hfunc () { int hdata; }

• Interface – the pieces that bridge host and device:
  __shared__ sdata;   <<<gfunc(g,b,m)>>>();

• Device Only – compiled by the nvcc compiler:
  __global__ gfunc() { int gdata; }   __device__ dfunc() { int ddata; }
CUDA Programming Model: How Threads are Executed
• The GPU is viewed as a compute device that:
  § Is a coprocessor to the CPU (the host)
  § Has its own DRAM (device memory)
  § Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
• Differences between GPU and CPU threads
  § GPU threads are extremely lightweight
    – Very little creation overhead
  § A GPU needs 1000s of threads for full efficiency
    – A multi-core CPU needs only a few
Block and Thread IDs

[Figure: Grid 1 with blocks (0,0) through (2,1); Block (1,1) expanded into threads (0,0) through (4,2). Courtesy: NVIDIA.]

• Threads and blocks have IDs
  § So each thread can decide what data to work on
  § Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
  § Thread ID: 1D, 2D, or 3D (threadIdx.{x,y,z})
• This simplifies memory addressing when processing multidimensional data
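For example, a kernel over a 2D array can combine block and thread IDs into global row and column indices. The sketch below is illustrative; brighten, d_img, width, and height are assumed names, and it presumes a 2D launch configuration.

  // Hypothetical 2D indexing sketch using block and thread IDs.
  __global__ void brighten(float *d_img, int width, int height) {
      int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
      int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
      if (row < height && col < width)
          d_img[row * width + col] += 1.0f;              // row-major addressing
  }

  // Host side, e.g.:
  //   dim3 block(16, 16);
  //   dim3 grid((width + 15) / 16, (height + 15) / 16);
  //   brighten<<<grid, block>>>(d_img, width, height);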
Simple working code example

• What does it do?
  § Scans the elements of an array of numbers (each 0 to 9)
  § Counts how many times "6" appears
  § Array of 16 elements, each thread examines 4 elements, 1 block in the grid, 1 grid

  3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

  threadIdx.x = 0 examines in_array elements 0, 4, 8, 12
  threadIdx.x = 1 examines in_array elements 1, 5, 9, 13
  threadIdx.x = 2 examines in_array elements 2, 6, 10, 14
  threadIdx.x = 3 examines in_array elements 3, 7, 11, 15

  This is known as a cyclic data distribution.
CUDA Pseudo-Code

MAIN PROGRAM:
  Initialization
  • Allocate memory on host for input and output
  • Assign random numbers to input array
  Call host function
  Calculate final output from per-thread output
  Print result

HOST FUNCTION:
  Allocate memory on device for copy of input and output
  Copy input to device
  Set up grid/block
  Call global function
  Synchronize after completion
  Copy device output to host

GLOBAL FUNCTION:
  Thread scans subset of array elements
  Call device function to compare with "6"
  Compute local result

DEVICE FUNCTION:
  Compare current element and "6"
  Return 1 if same, else 0
Main Program: Preliminaries

MAIN PROGRAM:
  Initialization
  • Allocate memory on host for input and output
  • Assign random numbers to input array
  Call host function
  Calculate final output from per-thread output
  Print result

  #include <stdio.h>
  #define SIZE 16
  #define BLOCKSIZE 4

  int main(int argc, char **argv)
  {
    int *in_array, *out_array;
    …
  }
Main Program: Invoke Global Function

MAIN PROGRAM:
  Initialization (OMIT)
  • Allocate memory on host for input and output
  • Assign random numbers to input array
  Call host function
  Calculate final output from per-thread output
  Print result

  #include <stdio.h>
  #define SIZE 16
  #define BLOCKSIZE 4

  __host__ void outer_compute (int *in_arr, int *out_arr);

  int main(int argc, char **argv)
  {
    int *in_array, *out_array;
    /* initialization */ …
    outer_compute(in_array, out_array);
    …
  }
Main Program: Calculate Output & Print Result

MAIN PROGRAM:
  Initialization (OMIT)
  • Allocate memory on host for input and output
  • Assign random numbers to input array
  Call host function
  Calculate final output from per-thread output
  Print result

  #include <stdio.h>
  #define SIZE 16
  #define BLOCKSIZE 4

  __host__ void outer_compute (int *in_arr, int *out_arr);

  int main(int argc, char **argv)
  {
    int *in_array, *out_array;
    int sum = 0;
    /* initialization */ …
    outer_compute(in_array, out_array);
    for (int i = 0; i < BLOCKSIZE; i++) {
      sum += out_array[i];
    }
    printf("Result = %d\n", sum);
  }
Host Function: Preliminaries & Allocation

HOST FUNCTION:
  Allocate memory on device for copy of input and output
  Copy input to device
  Set up grid/block
  Call global function
  Synchronize after completion
  Copy device output to host

  __host__ void outer_compute (int *h_in_array, int *h_out_array) {
    int *d_in_array, *d_out_array;

    cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
    cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
    …
  }
Host Function: Copy Data To/From Host

HOST FUNCTION:
  Allocate memory on device for copy of input and output
  Copy input to device
  Set up grid/block
  Call global function
  Synchronize after completion
  Copy device output to host

  __host__ void outer_compute (int *h_in_array, int *h_out_array) {
    int *d_in_array, *d_out_array;

    cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
    cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
    cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int),
               cudaMemcpyHostToDevice);
    … do computation ...
    cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int),
               cudaMemcpyDeviceToHost);
  }
Host Function: Setup & Call Global Function

HOST FUNCTION:
  Allocate memory on device for copy of input and output
  Copy input to device
  Set up grid/block
  Call global function
  Synchronize after completion
  Copy device output to host

  __host__ void outer_compute (int *h_in_array, int *h_out_array) {
    int *d_in_array, *d_out_array;

    cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
    cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
    cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int),
               cudaMemcpyHostToDevice);
    compute<<<1, BLOCKSIZE>>>(d_in_array, d_out_array);  // 1 block, BLOCKSIZE threads
    cudaThreadSynchronize();
    cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int),
               cudaMemcpyDeviceToHost);
  }
Global Function: How to distribute tasks?

GLOBAL FUNCTION:
  Thread scans subset of array elements
  Call device function to compare with "6"
  Compute local result

  __global__ void compute(int *d_in, int *d_out) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      int val = d_in[i*BLOCKSIZE + threadIdx.x];
      d_out[threadIdx.x] += compare(val, 6);
    }
  }

  3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

  threadIdx.x = 0 examines in_array elements 0, 4, 8, 12
  threadIdx.x = 1 examines in_array elements 1, 5, 9, 13
  threadIdx.x = 2 examines in_array elements 2, 6, 10, 14
  threadIdx.x = 3 examines in_array elements 3, 7, 11, 15

  (Cyclic distribution)
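For contrast, a blocked (contiguous) distribution, which is not what the lecture uses, would give each thread a consecutive chunk of the array instead. The variant below is a hypothetical sketch using the same SIZE, BLOCKSIZE, and compare names as the lecture code.

  // Hypothetical blocked-distribution variant: thread 0 would examine
  // elements 0-3, thread 1 elements 4-7, and so on.
  __global__ void compute_blocked(int *d_in, int *d_out) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      int val = d_in[threadIdx.x * (SIZE/BLOCKSIZE) + i];
      d_out[threadIdx.x] += compare(val, 6);
    }
  }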
Device Function

DEVICE FUNCTION:
  Compare current element and "6"
  Return 1 if same, else 0

  __device__ int compare(int a, int b) {
    if (a == b) return 1;
    return 0;
  }
Summary of Lecture
• Introduction to CUDA: C + API supporting heterogeneous data-parallel CPU+GPU execution
  § Computation partitioning
  § Data partitioning (parts of this implied by the decomposition into threads)
  § Data organization and management
  § Concurrency management
• The compiler nvcc takes a .cu program as input and produces
  § C code for the host processor (CPU), compiled by the native C compiler
  § Code for the device processor (GPU), compiled by the nvcc compiler
