High Performance Computing

(CUDA Programming)

Subhasis Bhattacharjee
Department of Computer Science and Engineering,
Indian Institute of Technology, Jammu

February 9, 2023
Heterogeneous Computing (CPU + GPGPU)

Heterogeneous computing refers to systems that use more than one kind of
processor or core (e.g., CPU & GPU together)
▶ CPUs for the sequential parts, where latency matters
♦ A CPU can be 10x faster than a GPU on sequential code
▶ GPUs for the parallel parts, where throughput wins
♦ A GPU can process many data elements at once

CUDA C/C++
▶ Based on industry-standard C/C++
▶ Small set of extensions to enable heterogeneous programming
▶ Straightforward APIs to manage devices, memory etc.
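
To make these extensions concrete before the vector-addition example, here is a minimal sketch of a complete CUDA C program (not from the slides): __global__ marks a device kernel, <<<blocks, threads>>> launches it, and cudaDeviceSynchronize() makes the host wait for the GPU to finish.

#include <stdio.h>

// Device kernel: runs on the GPU, one instance per thread
__global__ void hello()
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main(void)
{
    hello<<<1, 4>>>();           // launch 1 block of 4 threads
    cudaDeviceSynchronize();     // wait for the kernel (and flush its printf output)
    return 0;
}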



Vector Addition: Sequential on CPU

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char *argv[] ) {
    int n = 100000, i;                   // Size of vectors
    double *a, *b;                       // Input vectors
    double *c;                           // Output vector
    size_t bytes = n * sizeof(double);   // Size, in bytes, of each vector
    // Allocate memory for each vector
    a = (double *) malloc(bytes); b = (double *) malloc(bytes); c = (double *) malloc(bytes);
    // Initialize vectors
    for ( i = 0; i < n; i++ ) { a[i] = rand(); b[i] = rand(); }
    // Actual computation
    for ( i = 0; i < n; i++ ) {
        c[i] = a[i] + b[i];
    }
    // Free memory
    free(a); free(b); free(c);
    return 0;
}



Basic Philosophy of GPGPU Programming

1 Identify the basic computation
▶ Create a kernel function
2 Allocate memory on the device (GPU)
3 Move (input) data into device (GPU) memory
▶ Host & GPU synchronous
▶ This effectively blocks the host
4 Launch the kernel (host & GPU asynchronous)
▶ The GPU computes the kernel function
5 Move (output) data from GPU memory back to host memory
6 Free memory on the GPU



Identify the Kernel (GPU)

// CUDA kernel. Each thread takes care of one element of c
// threadIdx.x gives the thread id
__global__ void vecAdd( double *a, double *b, double *c, int n )
{
    // Get our global thread ID
    int id = threadIdx.x;

    // Make sure we do not go out of bounds
    if ( id < n )
        c[id] = a[id] + b[id];
}
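
Note, beyond this slide: threadIdx.x alone is a global thread ID only because these examples launch a single block. With multiple blocks (as in the unified-memory example later), the usual pattern is:

int id = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID across all blocks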



Transforming Vector Addition for CPU+GPU

int main( int argc, char *argv[] ) {
    int n = 1000, i;                     // Size of vectors
    double *h_a, *h_b;                   // Host input vectors
    double *h_c;                         // Host output vector
    size_t bytes = n * sizeof(double);   // Size, in bytes, of each vector
    // Allocate memory for each vector on host
    h_a = (double *) malloc(bytes); h_b = (double *) malloc(bytes); h_c = (double *) malloc(bytes);
    // Initialize vectors on host
    for ( i = 0; i < n; i++ ) { h_a[i] = rand(); h_b[i] = rand(); }
    double *d_a, *d_b;                   // Device input vectors
    double *d_c;                         // Device output vector
    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    // Copy data into device (GPU) memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    // Launch kernel
    vecAdd<<<1, n>>>(d_a, d_b, d_c, n);
    // Copy output data back into host memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    // WE ARE DONE - back in host (CPU) processing; free host memory
    free(h_a); free(h_b); free(h_c);
    return 0;
}
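
A practical aside, not on the slides: CUDA calls and kernel launches fail silently if their return status is ignored. A minimal sketch of the usual checking pattern, which could be dropped into the main() above:

cudaError_t err = cudaMalloc(&d_a, bytes);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));

vecAdd<<<1, n>>>(d_a, d_b, d_c, n);
err = cudaGetLastError();                 // catches launch-configuration errors
if (err != cudaSuccess)
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));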



Key facts in CPU-GPU execution

Kernel launches and CUDA calls are queued

The GPU executes them on a FIFO basis

All memory calls (e.g., cudaMemcpy) are synchronous with the host
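
A small sketch of what this means in practice (reusing the vecAdd example above): the kernel launch returns to the host immediately, while the subsequent cudaMemcpy blocks the host and, because calls are queued in order, does not start copying until the kernel has finished.

vecAdd<<<1, n>>>(d_a, d_b, d_c, n);                    // queued; the host continues immediately
// ... host code here runs concurrently with the GPU ...
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // waits for the kernel, then copies (blocking the host)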



Modified Example

Neighbourhood sum

Suppose we have a vector A of size n.

We want the 3-neighbour sum around each A[i]:

A[i] = A[i-1] + A[i] + A[i+1] for all 0 <= i <= n-1

(assuming A[-1] = 0 and A[n] = 0)
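
For reference, a sequential CPU sketch of this computation (not from the slides). Done in place, it must remember the old value of the left neighbour, since each output depends on inputs that are being overwritten:

double prev = 0;                                 // plays the role of A[-1] = 0
for (int i = 0; i < n; i++) {
    double cur = a[i];
    double right = (i == n - 1) ? 0 : a[i + 1];  // A[n] = 0
    a[i] = prev + cur + right;
    prev = cur;                                  // old a[i] is the left neighbour of a[i+1]
}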



How much code do we need to change?

Very little, ideally.

Only the computational requirement has changed.

Change the kernel.

New kernel:

__global__ void AddNeighbors( double *a, int n )
{
    // Get our global thread ID
    int id = threadIdx.x;
    a[id] = a[id-1] + a[id] + a[id+1];
}

Will it work? No!!



Correcting AddNeighbors Kernel

__global__ void AddNeighbors( double *a, int n )
{
    // Get our global thread ID
    int id = threadIdx.x;
    // Make sure we do not go out of bounds
    if ( id == 0 ) {
        a[id] = a[id] + a[id+1];
    } else if ( id == n - 1 ) {
        a[id] = a[id-1] + a[id];
    } else {
        a[id] = a[id-1] + a[id] + a[id+1];
    }
}

This logic would be correct on a CPU, but it is still incorrect on a GPU.

One thread may overwrite its element before another thread has finished reading it.



Correcting AddNeighbors Kernel

__global__ void AddNeighbors( double *a, int n )
{
    // Get our global thread ID
    int id = threadIdx.x;
    // Collect data in local memory
    double left, right;
    if ( id == 0 ) {
        left = 0;
    } else {
        left = a[id-1];
    }
    if ( id == n - 1 ) {
        right = 0;
    } else {
        right = a[id+1];
    }
    // Need to synchronize here
    __syncthreads();
    a[id] = left + a[id] + right;
}
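
On the host side, the launch could look like the sketch below (names follow the vector-addition example and are assumptions). Since __syncthreads() only synchronizes threads within one block, this version relies on the whole vector being handled by a single block:

AddNeighbors<<<1, n>>>(d_a, n);                        // one block of n threads
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);   // copy the result back to the host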



Analyzing Kernel

if ( id == 0 ) {
    left = 0;
} else {
    left = a[id-1];
}

Execution time of this if-else block

How much time will this code take?

Assume each assignment, addition & comparison takes 1 unit of time.

It will take 3 units of time !!! (NOT 2 units of time.)

Because this is a SIMD architecture, the threads taking the "then" branch and the threads taking the "else" branch cannot run simultaneously; the two branches are executed one after the other, with the non-participating threads idle.
1 Condition check (id == 0) takes 1 unit of time.
2 left = 0; takes 1 unit of time.
3 left = a[id-1]; takes 1 unit of time.



Total Time in AddNeighbors Kernel

__global__ void AddNeighbors( double *a, int n )
{
    int id = threadIdx.x;             // 1 unit
    double left, right;
    if ( id == 0 ) {                  // 1 unit
        left = 0;                     // 1 unit
    } else {
        left = a[id-1];               // 1 unit
    }
    if ( id == n - 1 ) {              // 1 unit
        right = 0;                    // 1 unit
    } else {
        right = a[id+1];              // 1 unit
    }
    // Need to synchronize here
    __syncthreads();                  // s units
    a[id] = left + a[id] + right;     // 3 units
}

Analysis of total execution time

__syncthreads() takes s units of time.

All threads have finished their work just before __syncthreads().

With divergence, each if-else costs 1 (test) + 1 (then) + 1 (else) = 3 units, so:

Total time = 1 (thread ID) + 3 + 3 + 3 (final sum) + s = 10 + s units



Observations

Thread divergence takes time
▶ Reduces performance
▶ Should be minimized

Barrier synchronization: __syncthreads()
▶ Blocks all threads in a block from moving ahead until every thread of the block has reached the barrier
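
A related caution, not spelled out on the slide: every thread of the block must reach the same __syncthreads() call. Placing it inside a divergent branch is undefined behaviour; a sketch of the pattern to avoid:

if (threadIdx.x < 16) {
    __syncthreads();   // WRONG: threads with threadIdx.x >= 16 never reach this barrier,
                       // so the block can hang or behave unpredictably
}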



Challenges in CUDA Programming

Decomposing the problem

Identifying the kernels

Identifying the communication patterns

Finding the exact locations for barrier synchronization

Reducing thread divergence



CUDA Threads

Data-parallel portions of an application are coded as device kernels and they run in
parallel on many threads.

Difference between CPU & GPU Threads

CPU threads =>
▶ Heavyweight - keep a lot of per-thread state and control information
▶ Users need to create / kill threads - creation has noticeable overhead
▶ Run asynchronously
GPU threads =>
▶ Lightweight - keep very little state
▶ All threads in an SM (streaming multiprocessor) run the same instruction
▶ Fully synchronous



GPU Memory Allocation / Release

cudaMemcpy - declaration

cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction );

direction specifies the locations (host or device) of src and dst

Blocks the CPU thread: returns only after the copy is complete

Doesn't start copying until previous CUDA calls complete


enum cudaMemcpyKind
▶ cudaMemcpyHostToHost
▶ cudaMemcpyHostToDevice
▶ cudaMemcpyDeviceToHost
▶ cudaMemcpyDeviceToDevice
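
For example (a sketch reusing the vector-addition names from earlier), the same call handles every direction; only the dst/src order and the kind argument change:

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);     // host -> device (input)
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);     // device -> host (result)
cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);   // device -> device copy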



Extension / keywords in C to support CUDA

Declaration
__global__, __device__, __host__, __shared__
Built-in variables
threadIdx, blockIdx
Runtime API
cudaMalloc, cudaFree
Function launch
kernel<<<1, 100>>>(data)
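
A small sketch (not from the slides) that exercises each of these extensions together; the function demo, the helper scale, and the sizes are illustrative assumptions:

__host__ __device__ double scale(double v) { return 2.0 * v; }   // callable from host and device

__global__ void demo(double *a, int n)
{
    __shared__ double buf[256];                        // shared by the threads of one block
    int id = blockIdx.x * blockDim.x + threadIdx.x;    // built-in variables
    if (id < n)
        buf[threadIdx.x] = scale(a[id]);
    __syncthreads();                                   // barrier for the whole block
    if (id < n)
        a[id] = buf[threadIdx.x];
}

int main(void)
{
    int n = 1000;
    double *d_a;
    cudaMalloc(&d_a, n * sizeof(double));              // runtime API
    demo<<<(n + 255) / 256, 256>>>(d_a, n);            // function launch
    cudaDeviceSynchronize();
    cudaFree(d_a);                                     // runtime API
    return 0;
}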



C + CUDA Program compilation / running

Source program contains:

CUDA-specific code => compiled by nvcc

Normal C code => normal gcc compilation

Linking the pieces together using gcc (see the sketch below)
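
A sketch of what this looks like on the command line; the file names (kernels.cu, main.c, app) and the CUDA library path are assumptions:

nvcc -c kernels.cu -o kernels.o        # CUDA-specific code -> nvcc
gcc  -c main.c     -o main.o           # normal C code -> gcc
gcc  main.o kernels.o -o app -L/usr/local/cuda/lib64 -lcudart   # link together with the CUDA runtime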



Unified memory example

#include <iostream>
#include <math.h>

// CUDA kernel to add the elements of two arrays
__global__
void add( int n, float *x, float *y )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for ( int i = index; i < n; i += stride )
        y[i] = x[i] + y[i];
}

int main( void )
{
    int N = 1<<20;
    float *x, *y;
    // Allocate Unified Memory == accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    // Initialize x and y arrays on the host
    for ( int i = 0; i < N; i++ ) {
        x[i] = 1.0f; y[i] = 2.0f;
    }
    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    ...



CUDA blocks and threads-per-block - TODO



CUDA Atomic Operations - TODO



End of GPGPU Programming - Part 1

