01 CUDA C Basics
NVIDIA Corporation
WHAT IS CUDA?
CUDA Architecture: expose GPU parallelism for general-purpose computing and expose/enable performance.
CUDA C++: based on industry-standard C++, a small set of extensions to enable heterogeneous programming, plus straightforward APIs to manage devices and memory.
INTRODUCTION TO CUDA C++
HETEROGENEOUS COMPUTING
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)
PORTING TO CUDA
[Figure: application code is split — compute-intensive functions are moved to the GPU and parallelized, while the rest of the sequential code runs unchanged on the CPU]
SIMPLE PROCESSING FLOW
1. Copy input data from CPU memory to GPU memory
2. Load the GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory
PARALLEL PROGRAMMING IN CUDA C++
[Figure: element-wise vector addition, c = a + b]
GPU KERNELS: DEVICE CODE
The CUDA C++ keyword __global__ indicates a function that runs on the device and is called from host code (it can also be called from other device code).
nvcc separates the source into host and device components: device functions are processed by the NVIDIA compiler, while host functions are processed by the standard host compiler (gcc, cl.exe).
GPU KERNELS: DEVICE CODE
mykernel<<<1,1>>>();
The parameters inside the triple angle brackets are the CUDA kernel execution configuration
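Putting the pieces together, a minimal complete program looks like the sketch below (mykernel is the empty example kernel; the printf and synchronize call are additions for illustration):

```cuda
#include <cstdio>

// Empty kernel: __global__ marks code that runs on the device
// but is launched from host code.
__global__ void mykernel(void) {
}

int main(void) {
    // Launch the kernel on the GPU: 1 block of 1 thread.
    mykernel<<<1,1>>>();
    // Kernel launches are asynchronous; wait for completion.
    cudaDeviceSynchronize();
    printf("Hello World!\n");
    return 0;
}
```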
MEMORY MANAGEMENT
Host and device memory are separate entities: device pointers point to GPU memory (typically passed to device code, typically not dereferenced in host code), and host pointers point to CPU memory (typically not dereferenced in device code).
Simple CUDA API for handling device memory — cudaMalloc(), cudaFree(), cudaMemcpy() — similar to the C equivalents malloc(), free(), memcpy().
add<<< 1, 1 >>>();   // one block, one thread
add<<< N, 1 >>>();   // N blocks, one thread each
Instead of executing add() once, execute it N times in parallel
VECTOR ADDITION ON THE DEVICE
With add() running in parallel we can do vector addition
Terminology: each parallel invocation of add() is referred to as a block
The set of all blocks is referred to as a grid
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
By using blockIdx.x to index into the array, each block handles a different index
Built-in variables like blockIdx.x are zero-indexed (C/C++ style), running 0..N-1, where N comes from the execution configuration given at the kernel launch.
VECTOR ADDITION ON THE DEVICE
#define N 512

int main(void) {
    int *a, *b, *c;       // host copies of a, b, c
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
VECTOR ADDITION ON THE DEVICE
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
REVIEW (1 OF 2)
Host: the CPU and its memory. Device: the GPU and its memory.
The __global__ keyword declares a function as device code: it executes on the device and is called from the host (or possibly from other device code).
REVIEW (2 OF 2)
Basic device memory management: cudaMalloc(), cudaMemcpy(), cudaFree() — analogous to malloc(), memcpy(), free().
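These API calls all return a cudaError_t status; checking it is good practice. A minimal sketch (the check helper is an illustration, not part of the original listing):

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA API call failed.
void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

// Usage:
//   check(cudaMalloc((void **)&d_a, size), "cudaMalloc d_a");
//   check(cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice), "cudaMemcpy a");
```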
CUDA THREADS
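A block can also be split into parallel threads. A sketch of the vector-add kernel rewritten to index by threadIdx.x instead of blockIdx.x, launched as one block of N threads:

```cuda
// Threads-only variant: one block of N threads.
// Each thread handles one element, indexed by threadIdx.x.
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// Launched from host code as:
//   add<<<1, N>>>(d_a, d_b, d_c);
```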
COMBINING BLOCKS AND THREADS
INDEXING ARRAYS WITH BLOCKS AND THREADS
[Figure: with M = 8 threads per block, threadIdx.x runs 0..7 within each of four blocks]
INDEXING ARRAYS: EXAMPLE
[Figure: global indices 0..31, grouped into four blocks of M = 8 threads each; element 21 falls in block 2, thread 5]
M = 8, threadIdx.x = 5, blockIdx.x = 2
int index = threadIdx.x + blockIdx.x * M;
= 5 + 2 * 8;
= 21;
VECTOR ADDITION WITH BLOCKS AND THREADS
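With both blocks and threads, the built-in variable blockDim.x (threads per block) takes the place of the hand-written M. A sketch of the combined kernel:

```cuda
// Each thread computes its global index from its block and thread
// coordinates; blockDim.x is the number of threads per block.
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
```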
ADDITION WITH BLOCKS AND THREADS
#define N (2048*2048)
#define THREADS_PER_BLOCK 512

int main(void) {
    int *a, *b, *c;       // host copies of a, b, c
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
ADDITION WITH BLOCKS AND THREADS
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
HANDLING ARBITRARY VECTOR SIZES
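Typical problem sizes are not exact multiples of the block size. The usual fix is to round the grid size up and guard the kernel against out-of-bounds access — a sketch:

```cuda
// Kernel takes the element count n and skips threads past the end.
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// Host launch: round the number of blocks up so all n elements
// are covered (M = threads per block).
//   add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);
```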
What do we gain from threads over blocks? Unlike parallel blocks, threads within a block have mechanisms to:
Communicate
Synchronize
REVIEW
FUTURE SESSIONS
Cooperative Groups
FURTHER STUDY
An introduction to CUDA:
https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/
https://devblogs.nvidia.com/even-easier-introduction-cuda/
CUDA C++ Programming Guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
CUDA Documentation:
https://docs.nvidia.com/cuda/index.html
HOMEWORK
https://github.com/olcf/cuda-training-series/blob/master/exercises/hw1/readme.md
Prerequisites: basic Linux skills (e.g. ls, cd), familiarity with a text editor such as vi or emacs, and some knowledge of C/C++ programming.
QUESTIONS?