L06_GPGPU_CUDA_Programming_1
(CUDA Programming)
Subhasis Bhattacharjee
Department of Computer Science and Engineering,
Indian Institute of Technology, Jammu
February 9, 2023
Heterogeneous Computing: CPU + GPGPU
Heterogeneous computing refers to systems that use more than one kind of
processor or core (e.g., CPU and GPU together)
▶ CPUs for sequential parts where latency matters
♦ CPUs can be 10x faster than GPUs for sequential code
▶ GPUs for parallel parts where throughput wins
♦ GPUs can process many data elements at once
CUDA C/C++
▶ Based on industry-standard C/C++
▶ Small set of extensions to enable heterogeneous programming
▶ Straightforward APIs to manage devices, memory, etc.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int n = 100000, i;                   // Size of vectors
    double *a, *b;                       // Input vectors
    double *c;                           // Output vector
    size_t bytes = n * sizeof(double);   // Size, in bytes, of each vector

    // Allocate memory for each vector
    a = (double *) malloc(bytes);
    b = (double *) malloc(bytes);
    c = (double *) malloc(bytes);

    // Initialize vectors
    for (i = 0; i < n; i++) { a[i] = rand(); b[i] = rand(); }

    // Actual computation
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }

    // Free memory
    free(a); free(b); free(c);
    return 0;
}
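For contrast, here is a minimal sketch of the same vector addition ported to CUDA with explicit device memory management (the kernel name vecAdd and the block size of 256 are illustrative assumptions, not taken from the slides):

#include <stdio.h>
#include <stdlib.h>

// Kernel: each thread adds one pair of elements
__global__ void vecAdd(const double *a, const double *b, double *c, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (id < n)                                      // guard: grid may overshoot n
        c[id] = a[id] + b[id];
}

int main(void) {
    int n = 100000;
    size_t bytes = n * sizeof(double);
    double *a = (double *) malloc(bytes);
    double *b = (double *) malloc(bytes);
    double *c = (double *) malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = rand(); b[i] = rand(); }

    // Allocate device copies and move the inputs over
    double *d_a, *d_b, *d_c;
    cudaMalloc((void **) &d_a, bytes);
    cudaMalloc((void **) &d_b, bytes);
    cudaMalloc((void **) &d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // One thread per element, rounded up to whole blocks
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy the result back and release everything
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}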
Neighbourhood sum
Change kernel
New kernel
Will it work? No!
One thread may overwrite an element even before another thread has finished reading it.

if (id == 0) {
    left = 0;
} else {
    left = a[id - 1];
}
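A minimal sketch of why the in-place version races, and the usual fix of writing to a separate output array so that no thread's read depends on when a neighbour writes (the kernel names here are illustrative):

// Racy: each thread reads a[id-1] and writes a[id] in the same array,
// so a thread may see its neighbour's NEW value instead of the old one
__global__ void neighbourSumRacy(double *a, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        double left = (id == 0) ? 0.0 : a[id - 1];   // old or new? undefined
        a[id] = a[id] + left;                        // overwrites what thread id+1 reads
    }
}

// Safe: the input array is only read, the output array is only written
__global__ void neighbourSum(const double *a, double *out, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        double left = (id == 0) ? 0.0 : a[id - 1];
        out[id] = a[id] + left;                      // a[] is never modified
    }
}

Note that __syncthreads() alone cannot repair the in-place version: it synchronizes threads within one block only, while this race also crosses block boundaries.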
Data-parallel portions of an application are coded as device kernels, which run in
parallel across many threads.
Declaration: __global__, __device__, __host__, __shared__
Built-in variables: threadIdx, blockIdx
Runtime API: cudaMalloc, cudaFree, cudaMemcpy
Function launch: kernel<<<1, 100>>>(data)
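A small sketch showing these extensions together: a __device__ helper called from a __global__ kernel, a __shared__ staging buffer, and the built-in index variables (the names squareBlock and square are illustrative; the kernel assumes it is launched with 256 threads per block and an input size that is a multiple of 256):

// Callable from device code only
__device__ float square(float v) { return v * v; }

__global__ void squareBlock(const float *in, float *out) {
    __shared__ float tile[256];                      // visible to one block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in index variables
    tile[threadIdx.x] = in[i];                       // stage via shared memory
    __syncthreads();                                 // barrier: block has filled tile
    out[i] = square(tile[threadIdx.x]);
}

// Launched with the triple-bracket syntax, e.g. for n elements:
// squareBlock<<<n / 256, 256>>>(d_in, d_out);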
#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20;
    float *x, *y;

    // Allocate Unified Memory -- accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // Initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f; y[i] = 2.0f;
    }

    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    ...
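Compiled with nvcc (for example, nvcc add.cu -o add), every element of y should end up as 3.0f, since each thread adds the 1.0f in x to the 2.0f in y. The elided remainder of main would typically verify this on the host and release both arrays with cudaFree.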