Lec-2: Parallel and Distributed Computing (PDC)
• Number of AMD Interlagos processors: >380,000
• Number of NVIDIA Kepler GPUs: >3,000
[Figure: CPU chip (Core: Local Cache, Control, Registers, ALUs) vs. GPU chip (Compute Unit: Cache/Local Mem, Threading, Registers, ALUs)]
CPUs: Latency-Oriented Design
• Large caches
  – Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
• Powerful ALUs
[Figure: CPU block diagram (Control, ALUs, Cache, DRAM)]
Winning Applications Use Both CPU and GPU
– Data understanding:
  • It will involve Parallelism & Thread Mapping
    -- Kernel Code (SPMD Parallel Program)
    -- Data Allocation & Movement APIs
    -- Memory & Thread Hierarchy
CUDA – Execution Model
Heterogeneous host (CPU) + device (GPU) based application C program
– Serial parts in host C code
– Parallel parts in device SPMD kernel C code

Serial Code (host)
...
Parallel Kernel (device):
KernelA<<< nBlk, nTid >>>(args);
// N threads; each reads its input, computes its sum & stores its result in parallel

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, ECE408/CS483, University of Illinois, Urbana-Champaign
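To make the host/device split concrete, below is a minimal sketch of this execution pattern; the kernel body, the data, and the launch configuration (256 threads per block) are illustrative assumptions rather than code from the lecture.

// Minimal sketch of the host/device execution model above.
// Kernel body, data, and launch configuration are illustrative assumptions.
__global__ void KernelA(const float *d_in, float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // each thread computes its own index
    if (i < n)
        d_out[i] = d_in[i] * 2.0f;                   // each thread reads its input, computes & stores its result
}

void hostCode(const float *d_in, float *d_out, int n)
{
    // ... serial host C code runs here ...
    int nTid = 256;                                  // threads per block
    int nBlk = (n + nTid - 1) / nTid;                // blocks needed to cover n elements
    KernelA<<< nBlk, nTid >>>(d_in, d_out, n);       // parallel SPMD kernel runs on the device
    cudaDeviceSynchronize();                         // wait for the device to finish
    // ... serial host C code resumes here ...
}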
CUDA Grid (Array) of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (SPMD)
  – Each thread has an index (threadIdx) that it uses to compute (map to) memory addresses & make control decisions

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];

[Figure: threads 0, 1, 2, …, 254, 255 of the grid, each executing the two lines above]
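As a worked example of this index mapping, a complete vector-add kernel might look like the sketch below; the kernel name and the i < n bounds check are additions here, since the slide shows only the two index/assignment lines.

// Sketch of a full kernel built around the two lines shown on the slide.
// The kernel name and the bounds check are assumptions added for completeness.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index within the grid
    if (i < n)                                       // guard threads mapped past the end of the arrays
        C[i] = A[i] + B[i];
}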
CUDA C
• CUDA C is a heterogeneous parallel programming interface that enables exploitation of data parallelism using GPUs.
– Data understanding:
  • It will involve Parallelism & Thread Mapping
    -- Kernel Code (SPMD Parallel Program)
    -- Data Allocation & Movement APIs
    -- Memory & Thread Hierarchy
VecAdd: CUDA-C Host Kernel-Launch Code

int vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  // d_A, d_B, d_C mem-allocations & copying code included here (ref last slide)
  ...
}

cudaMalloc()
– Two parameters
  • Address of a pointer to the allocated object
  • Size of the allocated object in bytes

[Figure: CUDA device memory model (per-thread Registers, per-block Shared Memory, Constant Memory)]
Example: CUDA Memory Allocation/Transfer APIs
size = Mem-bytes required for H, D;
// size-in-bytes = num-of-elements x sizeof(element-datatype)
cudaMalloc ((void **) device-MemPtr, size);   // size = Mem-bytes
cudaMemcpy (destination, source, size, cudaMemcpySourceToDestination);   // direction flag, e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost
-------------------------------------------------------------------------------------------------------------
// Declare & allocate Host (H) memory;
// Declare Device (D) MemoryPtrs;
cudaMalloc ((void **) &D, size);
cudaMemcpy (D, H, size, cudaMemcpyHostToDevice);
// data processed (in place) at device & results returned to host
cudaMemcpy (H, D, size, cudaMemcpyDeviceToHost);
cudaFree(D);
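A concrete instance of this allocate/copy/process/copy-back/free pattern, for a float array of n elements, might look like the sketch below; the names h_data and d_data are illustrative, not from the lecture.

#include <cstdlib>
#include <cuda_runtime.h>

// Sketch of the allocation/transfer pattern above for a float array (illustrative names).
void roundTrip(int n)
{
    int size = n * sizeof(float);                    // size-in-bytes = num-of-elements x sizeof(element-datatype)

    float *h_data = (float *) malloc(size);          // host (H) memory
    float *d_data;                                   // device (D) memory pointer
    cudaMalloc((void **) &d_data, size);

    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // copy H -> D
    // ... kernel(s) process d_data in place on the device ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // copy results D -> H

    cudaFree(d_data);
    free(h_data);
}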
Vector Addition – Traditional C Code
// Compute vector sum C = A+B
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
int i;
for (i = 0; i < n; i++)
h_C[i] = h_A[i] + h_B[i];
}
int main()
{
// Memory allocation for h_A, h_B, and h_C
// generate/get h_A, h_B (N elements each)
…
vecAdd(h_A, h_B, h_C, N);
}
VecAdd: CUDA-C Host & Device Parts
0. // Host memory: allocate & input data (h_A, h_B) & allocate h_C for the result
1. // Allocate device memory d_A, d_B & d_C, equal in size to host memory h_A, h_B & h_C (Part 1)
   // copy inputs h_A & h_B from host to device d_A, d_B
Thread configuration & kernel launching in host code; thread mapping & kernel function in device code.
(The kernel function code will be different for different data-parallel tasks.)
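Putting these parts together, the complete host function plus kernel for VecAdd might look like the sketch below; error checking is omitted and the block size of 256 threads is an assumption.

// Sketch of the full VecAdd host + device code assembled from the steps above.
// Error checking omitted; 256 threads per block is an assumed configuration.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory and copy inputs host -> device
    cudaMalloc((void **) &d_A, size);
    cudaMalloc((void **) &d_B, size);
    cudaMalloc((void **) &d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 2. Configure threads and launch the kernel
    vecAddKernel<<< (n + 255) / 256, 256 >>>(d_A, d_B, d_C, n);

    // 3. Copy the result device -> host and free device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}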
CUDA Function Declarations

Function type        Executed on the:       Only callable from the:
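As a quick illustration of how these declaration qualifiers are written (the function names below are placeholders, not from the lecture):

// Illustration of CUDA function declaration qualifiers (placeholder names).
__device__ float DeviceFunc() { return 1.0f; }   // executed on the device, callable only from the device
__global__ void  KernelFunc() { }                // a kernel: executed on the device, callable from the host
__host__   float HostFunc()   { return 1.0f; }   // executed on the host, callable only from the host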