
Parallel & Distributed Computing
Lec-2

Source: Dr Wen-mei Hwu, University of Illinois at Urbana-Champaign


Graphics Processing Unit (GPU) & CUDA C Coding

Source: Dr Wen-mei Hwu, University of Illinois at Urbana-Champaign


GPU / CUDA C Topics & Books
• Introduction to Heterogeneous Computing
• Introduction to GPU & CUDA C
• Data Parallelism Model
• CUDA Memory Model
• 1D Vector Addition
• 2D Image Processing
• 1D & 2D Convolution

- Programming Massively Parallel Processors: A Hands-on Approach (2nd ed.), by D. Kirk & W. Hwu
- NVIDIA CUDA C Programming Guide, version 4.0, NVIDIA
Why Program the GPU?
• Historical perspective: from closed to open systems
• Compute
  – Intel Core i7 – 4 cores – 100 GFLOPS
  – NVIDIA GTX 280 – 240 cores – 1 TFLOPS
• Memory bandwidth
  – System memory – 60 GB/s
  – NVIDIA GT200 – 150 GB/s

Concurrency Revolution: a growing performance gap between GPUs and CPUs.
Blue Waters Supercomputer

Cray System & Storage cabinets:       >300
Compute nodes:                        >25,000
Usable Storage Bandwidth:             >1 TB/s
System Memory:                        >1.5 Petabytes
Memory per core module:               4 GB
Gemini Interconnect Topology:         3D Torus
Usable Storage:                       >25 Petabytes
Peak performance:                     >11.5 Petaflops
Number of AMD Interlagos processors:  >49,000
Number of AMD x86 core modules:       >380,000
Number of NVIDIA Kepler GPUs:         >3,000


Hierarchy of Platforms & Programming Models in Heterogeneous Parallel Computing

Heterogeneous Parallel Computing Platforms:
  Cluster + Multicore SMP + GPU

Heterogeneous Parallel Programming Models:
  MPI + OpenMP + CUDA-C


CPU & GPU Have Very Different Design Philosophies
Control Parallelism (CPU) vs Data Parallelism (GPU)
CPU: Latency Oriented Cores          GPU: Throughput Oriented Cores

[Figure: side-by-side chip diagrams – the CPU chip has a few cores, each with a large cache/local memory, threading control and registers; the GPU chip has many compute units, each with a small local cache, registers and many ALUs.]
CPUs: Latency Oriented Design
• Large caches
  – Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
• Powerful ALUs
  – Reduced operation latency

[Figure: CPU block diagram – Control unit, a few ALUs, Cache, DRAM.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, ECE408/CS483, University of Illinois, Urbana-Champaign
GPUs: Throughput Oriented Design
• Large number of ALUs
  – High-throughput data processing
• Small caches
  – To boost memory throughput
• Simple control
  – No branch prediction
  – No data forwarding
• Require a massive number of threads to tolerate latencies

[Figure: GPU block diagram – many simple cores with small caches, DRAM.]
Winning Applications Use Both CPU and GPU
• CPUs for sequential parts where latency matters
  – CPUs can be faster than GPUs for sequential code
• GPUs for data-parallel parts where throughput matters
  – GPUs can be faster than CPUs for parallel code

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, ECE408/CS483, University of Illinois, Urbana-Champaign
CUDA C
• CUDA C is a heterogeneous parallel programming interface that enables exploitation of data parallelism using GPUs.

– Data understanding:
  • It will involve parallelism & thread mapping
  -- Kernel code (SPMD parallel program)
  -- Data allocation & movement APIs
  -- Memory & thread hierarchy
CUDA – Execution Model
A heterogeneous application is a host (CPU) + device (GPU) C program:
– Serial parts run as host C code
– Parallel parts run as device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device):   KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device):   KernelB<<< nBlk, nTid >>>(args);
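A minimal sketch of this host/device alternation is shown below. The kernel, data and launch configuration are illustrative placeholders (assumptions), not part of the slide:

#include <cuda_runtime.h>

__global__ void KernelA(float *data, int n) {         // placeholder parallel phase A
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

int main() {
    const int n = 1024;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));    // serial host code: set up device data

    KernelA<<<(n + 255) / 256, 256>>>(d_data, n);      // parallel kernel runs on the device
    cudaDeviceSynchronize();                           // back to serial host code

    // ... more serial host work and/or another kernel launch (KernelB) would follow ...

    cudaFree(d_data);
    return 0;
}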
Vector Addition – Traditional C Code

// Compute vector sum C = A + B
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int i;
    for (i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // Generate / get h_A, h_B: N elements each
    vecAdd(h_A, h_B, h_C, N);
}
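A self-contained version of the same host-only program, filling in the allocation comments above (the value of N and the sample data are illustrative assumptions):

#include <stdlib.h>

// Compute vector sum C = A + B (host-only reference version)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    for (int i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    int N = 1024;                                        // illustrative size
    float *h_A = (float*) malloc(N * sizeof(float));
    float *h_B = (float*) malloc(N * sizeof(float));
    float *h_C = (float*) malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { h_A[i] = i; h_B[i] = 2.0f * i; }   // sample data

    vecAdd(h_A, h_B, h_C, N);

    free(h_A); free(h_B); free(h_C);
    return 0;
}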
Parallel Vector Addition – Conceptual View

[Figure: vector A (A[0] … A[N-1]) and vector B (B[0] … B[N-1]) are added element-wise to produce vector C (C[0] … C[N-1]).]

N threads; each reads its input elements, computes the sum & stores the result, all in parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, ECE408/CS483, University of Illinois, Urbana-Champaign
CUDA Grid (Array) of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (SPMD)
  – Each thread has an index (threadIdx) that it uses to compute (map to) a memory address & make control decisions:

    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];

[Figure: a grid of threads numbered 0, 1, 2, …, 255, each executing the two lines above.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign
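As a worked example (values chosen only for illustration): with 256 threads per block, the thread with threadIdx.x = 5 in block blockIdx.x = 2 computes

    i = blockIdx.x * blockDim.x + threadIdx.x
      = 2 * 256 + 5
      = 517            // so this thread performs C[517] = A[517] + B[517]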
CUDA Thread Blocks: Next Level of the Thread Hierarchy
• A thread grid (array) can be divided into multiple blocks
• Block index/dimension & threadIdx are combined to map threads to memory
• Threads in different blocks can all access the same (global) memory
• The system only limits the maximum number of threads and the maximum block size/dimension/number
• Within those limits, the thread hierarchy/configuration is the programmer's choice, to suit the application (see the worked example after this slide)

Thread Block 0              Thread Block 1              …   Thread Block N-1
threads 0 1 2 … 254 255     threads 0 1 2 … 254 255     …   threads 0 1 2 … 254 255

Each thread executes:
    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign
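For example (an illustrative size, not from the slide): to add n = 1000 elements with 256 threads per block, the host launches ceil(1000/256.0) = 4 blocks, i.e. 1024 threads; the last 24 threads have i >= n, which is why the kernel guards the addition with if (i < n).

    int n = 1000;                                                  // illustrative vector length
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // = 4, integer "ceiling divide"
    // 4 blocks * 256 threads = 1024 threads; threads 1000..1023 do no work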
CUDA C
• CUDA C is a heterogeneous parallel programming interface that enables exploitation of data parallelism using GPUs.

– Data understanding:
  • It will involve parallelism & thread mapping
  -- Kernel code (SPMD parallel program)
  -- Data allocation & movement APIs
  -- Memory & thread hierarchy
VecAdd: CUDA-C Host Kernel-Launch Code

int vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    // d_A, d_B, d_C memory allocation & copying code goes here (see the memory-API slides)

    // Kernel launch code: run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}

// Alternative to the kernel-launch line above:
dim3 DimGrid(ceil(n/256.0), 1, 1);
dim3 DimBlock(256, 1, 1);
vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
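One detail worth noting (not spelled out on the slide): ceil(n/256) with an int n performs integer division before the ceil, so it would under-count blocks; that is why the launch uses n/256.0. An all-integer form is a common alternative, shown here as an assumption rather than part of the original code:

    // Floating-point form, as used in the launch above (requires <math.h>):
    int numBlocks = (int) ceil(n / 256.0);

    // Equivalent all-integer form, avoiding float arithmetic:
    int numBlocksInt = (n + 255) / 256;     // integer "ceiling divide"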
VecAdd: CUDA-C Kernel Code Executed on the Device

// Compute vector sum C = A + B
__global__   // CUDA kernel: executes on the device, although launched by the host
void vecAddKernel(float* A, float* B, float* C, int n)   // A, B, C & n are formal parameters
{
    // Map this thread to a data index
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread i performs one pair-wise addition at index i
    if (i < n) C[i] = A[i] + B[i];
}
CUDA C
• CUDA C is a heterogeneous parallel programming interface that enables exploitation of data parallelism using GPUs.
• CUDA C vs OpenCL
– Data understanding:
  • It will involve parallelism & thread mapping
  -- Kernel code (SPMD parallel program)
  -- Data allocation & movement APIs
  -- Memory & thread hierarchy
• Memory allocation
  -- Static (compile-time: declared)
     vs
  -- Dynamic (run-time: malloc)
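A minimal sketch of the static-vs-dynamic distinction on the device side (the buffer name and size are illustrative assumptions; the slide itself does not show this code):

    // Static: declared at compile time, lives in device global memory
    __device__ float d_staticBuf[256];

    // Dynamic: allocated at run time with cudaMalloc
    float *d_dynBuf;
    cudaMalloc((void**)&d_dynBuf, 256 * sizeof(float));
    // ... use d_dynBuf in kernels ...
    cudaFree(d_dynBuf);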
CUDA Device (Dynamic) Memory Allocation/Free API Functions

• cudaMalloc()
  – Allocates an object in the device global memory
  – Two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object in bytes

• cudaFree()
  – Frees an object from device global memory
  – One parameter: pointer to the freed object

[Figure: CUDA memory model – a (device) grid of blocks, each block with shared memory and per-thread registers; the host accesses device global memory and constant memory.]
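A small sketch of the two calls with their return codes checked (the buffer name and size are illustrative assumptions; the error check is an addition, not part of the slide):

    float *d_A;                                          // will receive the device address
    size_t size = 1024 * sizeof(float);                  // illustrative allocation size

    cudaError_t err = cudaMalloc((void**)&d_A, size);    // param 1: &d_A, param 2: size in bytes
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));   // needs <stdio.h>
    }

    // ... kernels that read/write d_A would run here ...

    cudaFree(d_A);                                       // release the device allocation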
Host-Device Data Transfer API Functions

• cudaMemcpy()
  – Memory data transfer between host and device
  – Requires four parameters:
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type/direction of transfer
  – Transfer to device is asynchronous

[Figure: CUDA memory model – host memory on one side; the device grid with blocks, shared memory, registers, global memory and constant memory on the other.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, ECE408/CS483, University of Illinois, Urbana-Champaign
Example: CUDA Memory Allocation/Transfer APIs

size = memory bytes required for H, D;
// size in bytes = number of elements x sizeof(element datatype)
cudaMalloc((void **) &device_MemPtr, size);          // size = memory bytes
cudaMemcpy(destination, source, size, direction);    // direction: cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost
-------------------------------------------------------------------------------------------------------------
// Declare & allocate host (H) memory
// Declare device (D) memory pointer
cudaMalloc((void **) &D, size);
cudaMemcpy(D, H, size, cudaMemcpyHostToDevice);
// Data is processed (in place) on the device & the results are returned to the host
cudaMemcpy(H, D, size, cudaMemcpyDeviceToHost);
cudaFree(D);
Vector Addition – Traditional C Code (recap)

// Compute vector sum C = A + B
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int i;
    for (i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // Generate / get h_A, h_B: N elements each
    vecAdd(h_A, h_B, h_C, N);
}
VecAdd: CUDA-C Host & Device Parts

Part 0. Host memory: allocate & input data (h_A, h_B) & allocate h_C for the result
Part 1. Allocate device memory d_A, d_B & d_C, equal in size to host memory h_A, h_B & h_C;
        copy inputs h_A & h_B from host to device memory d_A, d_B
Part 2. a) Host launches the kernel on the device
        b) Device executes the kernel function
Part 3. Copy the result from device d_C to host h_C; free all allocated device memory

[Figure: Host Memory <-> Device Memory, with the CPU performing Parts 0, 1 & 3 and the GPU performing Part 2.]
VecAdd: CUDA-C Host Code

#include <cuda.h>
__host__
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    int size = n * sizeof(float);        // memory size in bytes required for each allocation
    float *d_A, *d_B, *d_C;              // declare memory-block pointers for the device
    cudaMalloc((void **) &d_A, size);    // memory allocation on the device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);    // memory allocation on the device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);    // memory allocation on the device
    // Kernel invocation code goes here (see the kernel-launch slide)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
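For completeness, the kernel invocation referred to in the comment above is the launch shown on the earlier kernel-launch slide; dropped in at that point it would read:

    // Launch ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);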
Overall View of Host & Device Code for VecAdd

Host code:
    __host__
    void vecAdd(...)
    {
        ...
        // Device dynamic memory allocation & data transfer from host to device
        dim3 DimGrid(ceil(n/256.0), 1, 1);
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
        ...
        // Data transfer from device to host
    }

    main()
    {
        // Host memory allocation & data initialization
        // Call host function vecAdd(...)
    }

Device code:
    __global__
    void vecAddKernel(float *A, float *B, float *C, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

[Figure: the GPU runs a grid of blocks Blk 0 … Blk N-1 over device memory modules M0 … Mk; the host RAM holds the host data.]

Thread configuration & kernel launching live in the host code; thread mapping & the kernel function live in the device code. The kernel function code will be different for different data-parallel tasks.
CUDA Function Declarations

                                     Executed on the:    Only callable from the:
__device__ float DeviceFunc()        device              device
__global__ void  KernelFunc()        device              host
__host__   float HostFunc()          host                host

• __global__ defines a kernel function
  – Each "__" consists of two underscore characters
  – A kernel function must return void
• __device__ and __host__ can be used together

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, ECE408/CS483, University of Illinois, Urbana-Champaign
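A brief sketch of how these qualifiers combine (the function names are illustrative assumptions, not from the slide):

    // __host__ __device__ together: compiled for both CPU and GPU
    __host__ __device__ float square(float x) { return x * x; }

    // __device__ only: callable from device code, e.g. from a kernel
    __device__ float addSquares(float a, float b) { return square(a) + square(b); }

    // __global__ kernel: runs on the device, launched from the host, must return void
    __global__ void sumOfSquaresKernel(float *A, float *B, float *C, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = addSquares(A[i], B[i]);
    }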
