High Performance Computing

(CUDA Programming)

Subhasis Bhattacharjee
Department of Computer Science and Engineering,
Indian Institute of Technology, Jammu

February 9, 2023
Heterogeneous Computing (CPU + GPGPU)

Heterogeneous computing refers to systems that use more than one kind of
processor or core (e.g., CPU & GPU together)
▶ CPUs for the sequential parts, where latency matters
♦ A CPU can be 10x faster than a GPU on sequential code
▶ GPUs for the parallel parts, where throughput wins
♦ A GPU can process many data elements at once

CUDA C/C++
▶ Based on industry-standard C/C++
▶ Small set of extensions to enable heterogeneous programming
▶ Straightforward APIs to manage devices, memory etc.
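
To make these extensions concrete before the vector-addition example, here is a minimal sketch of a complete CUDA C program (not from the slides): __global__ marks a device kernel, <<<blocks, threads>>> launches it, and cudaDeviceSynchronize() makes the host wait for the GPU to finish.

#include <stdio.h>

// Device kernel: runs on the GPU, one instance per thread
__global__ void hello()
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main(void)
{
    hello<<<1, 4>>>();           // launch 1 block of 4 threads
    cudaDeviceSynchronize();     // wait for the kernel (and flush its printf output)
    return 0;
}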



Vector Addition: Sequential on CPU

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char *argv[] ) {
    int n = 100000, i;                   // Size of vectors
    double *a, *b;                       // Input vectors
    double *c;                           // Output vector
    size_t bytes = n * sizeof(double);   // Size, in bytes, of each vector
    // Allocate memory for each vector
    a = (double *) malloc(bytes); b = (double *) malloc(bytes); c = (double *) malloc(bytes);
    // Initialize vectors
    for ( i = 0; i < n; i++ ) { a[i] = rand(); b[i] = rand(); }
    // Actual computation
    for ( i = 0; i < n; i++ ) {
        c[i] = a[i] + b[i];
    }
    // Free memory
    free(a); free(b); free(c);
    return 0;
}



Basic Philosophy of GPGPU Programming

1 Identify the basic computation
▶ Create a kernel function
2 Allocate memory on the device (GPU)
3 Move (input) data into device (GPU) memory
▶ Host & GPU synchronous
▶ This effectively blocks the host
4 Launch the kernel (host & GPU asynchronous)
▶ The GPU computes the kernel function
5 Move (output) data from GPU memory back to host memory
6 Free memory on the GPU



Identify the Kernel (GPU)

// CUDA kernel. Each thread takes care of one element of c
// threadIdx.x gives the thread id
__global__ void vecAdd( double *a, double *b, double *c, int n )
{
    // Get our global thread ID
    int id = threadIdx.x;

    // Make sure we do not go out of bounds
    if ( id < n )
        c[id] = a[id] + b[id];
}
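
Note, beyond this slide: threadIdx.x alone is a global thread ID only because these examples launch a single block. With multiple blocks (as in the unified-memory example later), the usual pattern is:

int id = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID across all blocks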



Transforming Vector Addition for CPU+GPU

int main( int argc, char *argv[] ) {
    int n = 1000, i;                     // Size of vectors
    double *h_a, *h_b;                   // Host input vectors
    double *h_c;                         // Host output vector
    size_t bytes = n * sizeof(double);   // Size, in bytes, of each vector
    // Allocate memory for each vector on host
    h_a = (double *) malloc(bytes); h_b = (double *) malloc(bytes); h_c = (double *) malloc(bytes);
    // Initialize vectors on host
    for ( i = 0; i < n; i++ ) { h_a[i] = rand(); h_b[i] = rand(); }
    double *d_a, *d_b;                   // Device input vectors
    double *d_c;                         // Device output vector
    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    // Copy data into device (GPU) memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    // Launch kernel
    vecAdd<<<1, n>>>(d_a, d_b, d_c, n);
    // Copy output data back into host memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    // WE ARE DONE - back in host (CPU) processing; free host memory
    free(h_a); free(h_b); free(h_c);
    return 0;
}
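
A practical aside, not on the slides: CUDA calls and kernel launches fail silently if their return status is ignored. A minimal sketch of the usual checking pattern, which could be dropped into the main() above:

cudaError_t err = cudaMalloc(&d_a, bytes);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));

vecAdd<<<1, n>>>(d_a, d_b, d_c, n);
err = cudaGetLastError();                 // catches launch-configuration errors
if (err != cudaSuccess)
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));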



Key facts in CPU-GPU execution

Kernel launches and CUDA calls are queued

The GPU executes them on a FIFO basis

All memory calls (e.g., cudaMemcpy) are synchronous with the host
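
A small sketch of what this means in practice (reusing the vecAdd example above): the kernel launch returns to the host immediately, while the subsequent cudaMemcpy blocks the host and, because calls are queued in order, does not start copying until the kernel has finished.

vecAdd<<<1, n>>>(d_a, d_b, d_c, n);                    // queued; the host continues immediately
// ... host code here runs concurrently with the GPU ...
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // waits for the kernel, then copies (blocking the host)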



Modified Example

Neighbourhood sum

Suppose we have a vector A of size n.

We want the 3-neighbour sum around each A[i]:

A[i] = A[i-1] + A[i] + A[i+1] for all 0 <= i <= n-1

(assuming A[-1] = 0 and A[n] = 0)
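
For reference, a sequential CPU sketch of this computation (not from the slides). Done in place, it must remember the old value of the left neighbour, since each output depends on inputs that are being overwritten:

double prev = 0;                                 // plays the role of A[-1] = 0
for (int i = 0; i < n; i++) {
    double cur = a[i];
    double right = (i == n - 1) ? 0 : a[i + 1];  // A[n] = 0
    a[i] = prev + cur + right;
    prev = cur;                                  // old a[i] is the left neighbour of a[i+1]
}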



How much code do we need to change?

Very little, ideally.

Only the computational requirement has changed.

Change the kernel.

New kernel:

__global__ void AddNeighbors( double *a, int n )
{
    // Get our global thread ID
    int id = threadIdx.x;
    a[id] = a[id-1] + a[id] + a[id+1];
}

Will it work? No!!



Correcting AddNeighbors Kernel

__global__ void AddNeighbors( double *a, int n )
{
    // Get our global thread ID
    int id = threadIdx.x;
    // Make sure we do not go out of bounds
    if ( id == 0 ) {
        a[id] = a[id] + a[id+1];
    } else if ( id == n - 1 ) {
        a[id] = a[id-1] + a[id];
    } else {
        a[id] = a[id-1] + a[id] + a[id+1];
    }
}

This logic would be correct on a CPU, but it is still incorrect on a GPU.

One thread may overwrite its element before another thread has finished reading it.



Correcting AddNeighbors Kernel

__global__ void AddNeighbors( double *a, int n )
{
    // Get our global thread ID
    int id = threadIdx.x;
    // Collect data in local memory
    double left, right;
    if ( id == 0 ) {
        left = 0;
    } else {
        left = a[id-1];
    }
    if ( id == n - 1 ) {
        right = 0;
    } else {
        right = a[id+1];
    }
    // Need to synchronize here
    __syncthreads();
    a[id] = left + a[id] + right;
}
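
On the host side, the launch could look like the sketch below (names follow the vector-addition example and are assumptions). Since __syncthreads() only synchronizes threads within one block, this version relies on the whole vector being handled by a single block:

AddNeighbors<<<1, n>>>(d_a, n);                        // one block of n threads
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);   // copy the result back to the host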



Analyzing Kernel

if ( id == 0 ) {
    left = 0;
} else {
    left = a[id-1];
}

Execution time of this if-else block

How much time will this code take?

Assume each assignment, addition & comparison takes 1 unit of time.

It will take 3 units of time !!! (NOT 2 units of time.)

Because this is a SIMD architecture, the threads taking the "then" branch and the threads taking the "else" branch cannot run simultaneously; the two branches are executed one after the other, with the non-participating threads idle.
1 Condition check (id == 0) takes 1 unit of time.
2 left = 0; takes 1 unit of time.
3 left = a[id-1]; takes 1 unit of time.



Total Time in AddNeighbors Kernel

__global__ void AddNeighbors( double *a, int n )
{
    int id = threadIdx.x;             // 1 unit
    double left, right;
    if ( id == 0 ) {                  // 1 unit
        left = 0;                     // 1 unit
    } else {
        left = a[id-1];               // 1 unit
    }
    if ( id == n - 1 ) {              // 1 unit
        right = 0;                    // 1 unit
    } else {
        right = a[id+1];              // 1 unit
    }
    // Need to synchronize here
    __syncthreads();                  // s units
    a[id] = left + a[id] + right;     // 3 units
}

Analysis of total execution time

__syncthreads() takes s units of time.

All threads have finished their work just before __syncthreads().

With divergence, each if-else costs 1 (test) + 1 (then) + 1 (else) = 3 units, so:

Total time = 1 (thread ID) + 3 + 3 + 3 (final sum) + s = 10 + s units



Observations

Thread divergence takes time
▶ Reduces performance
▶ Should be minimized

Barrier synchronization: __syncthreads()
▶ Blocks all threads in a block from moving ahead until every thread of the block has reached the barrier
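
A related caution, not spelled out on the slide: every thread of the block must reach the same __syncthreads() call. Placing it inside a divergent branch is undefined behaviour; a sketch of the pattern to avoid:

if (threadIdx.x < 16) {
    __syncthreads();   // WRONG: threads with threadIdx.x >= 16 never reach this barrier,
                       // so the block can hang or behave unpredictably
}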



Challenges in CUDA Programming

Decomposing the problem

Identifying the kernels

Identifying the communication patterns

Finding the exact locations for barrier synchronization

Reducing thread divergence



CUDA Threads

Data-parallel portions of an application are coded as device kernels and they run in
parallel on many threads.

Difference between CPU & GPU Threads

CPU threads =>
▶ Heavyweight - keep a lot of per-thread state and control information
▶ Users need to create / kill threads - creation has noticeable overhead
▶ Run asynchronously
GPU threads =>
▶ Lightweight - keep very little state
▶ All threads in an SM (streaming multiprocessor) run the same instruction
▶ Fully synchronous



GPU Memory Allocation / Release

cudaMemcpy - declaration

cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction );

direction specifies the locations (host or device) of src and dst

Blocks the CPU thread: returns only after the copy is complete

Doesn't start copying until previous CUDA calls complete


enum cudaMemcpyKind
▶ cudaMemcpyHostToHost
▶ cudaMemcpyHostToDevice
▶ cudaMemcpyDeviceToHost
▶ cudaMemcpyDeviceToDevice
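
For example (a sketch reusing the vector-addition names from earlier), the same call handles every direction; only the dst/src order and the kind argument change:

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);     // host -> device (input)
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);     // device -> host (result)
cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);   // device -> device copy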



Extension / keywords in C to support CUDA

Declaration
__global__, __device__, __host__, __shared__
Built-in variables
threadIdx, blockIdx
Runtime API
cudaMalloc, cudaFree
Function launch
kernel<<<1, 100>>>(data)
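
A small sketch (not from the slides) that exercises each of these extensions together; the function demo, the helper scale, and the sizes are illustrative assumptions:

__host__ __device__ double scale(double v) { return 2.0 * v; }   // callable from host and device

__global__ void demo(double *a, int n)
{
    __shared__ double buf[256];                        // shared by the threads of one block
    int id = blockIdx.x * blockDim.x + threadIdx.x;    // built-in variables
    if (id < n)
        buf[threadIdx.x] = scale(a[id]);
    __syncthreads();                                   // barrier for the whole block
    if (id < n)
        a[id] = buf[threadIdx.x];
}

int main(void)
{
    int n = 1000;
    double *d_a;
    cudaMalloc(&d_a, n * sizeof(double));              // runtime API
    demo<<<(n + 255) / 256, 256>>>(d_a, n);            // function launch
    cudaDeviceSynchronize();
    cudaFree(d_a);                                     // runtime API
    return 0;
}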



C + CUDA Program compilation / running

Source program contains:

CUDA-specific code => compiled by nvcc

Normal C code => normal gcc compilation

Linking the pieces together using gcc (see the sketch below)
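
A sketch of what this looks like on the command line; the file names (kernels.cu, main.c, app) and the CUDA library path are assumptions:

nvcc -c kernels.cu -o kernels.o        # CUDA-specific code -> nvcc
gcc  -c main.c     -o main.o           # normal C code -> gcc
gcc  main.o kernels.o -o app -L/usr/local/cuda/lib64 -lcudart   # link together with the CUDA runtime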



Unified memory example

#include <iostream>
#include <math.h>

// CUDA kernel to add the elements of two arrays
__global__
void add( int n, float *x, float *y )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for ( int i = index; i < n; i += stride )
        y[i] = x[i] + y[i];
}

int main( void )
{
    int N = 1<<20;
    float *x, *y;
    // Allocate Unified Memory == accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    // Initialize x and y arrays on the host
    for ( int i = 0; i < N; i++ ) {
        x[i] = 1.0f; y[i] = 2.0f;
    }
    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    ...



CUDA blocks and threads-per-block - TODO



CUDA Atomic Operations - TODO



End of GPGPU Programming - Part 1

