27th Aug - Introduction To GPGPU - Part 1
[Figure: the host side (CPU and DRAM on the motherboard) connected to the graphics card, which holds the GPU (device) and its own device memory (DRAM).]
• GPUs are designed for tasks that can tolerate high latency, as long as a large number of tasks can be processed in one go (i.e., high latency, high throughput)
• So, data caching is not a priority
• So, more chip area can be devoted to ALUs instead of control logic and caches. Therefore, a large number of GPU threads (tens of thousands) can (and should) execute at a time (a minimal launch sketch follows below).
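As a concrete illustration of this throughput-oriented model, here is a minimal sketch (not from the slides) that requests on the order of a million lightweight threads; the kernel name scale, the array size, and the block size are assumptions.

#include <cuda_runtime.h>

// Illustrative kernel (name and operation are assumptions, not from the slides):
// each thread handles exactly one element, so performance comes from the sheer
// number of threads in flight rather than from caching.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;                            // ~1M elements -> ~1M threads requested
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, N);    // tens of thousands of threads run concurrently
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}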
[Figure: the device pointer ad refers to an array allocated in device memory (DRAM) on the GPU.]
CUDA function qualifiers (where the function runs and from where it is called):
• __host__ → runs on the CPU (host), called from host code
• __device__ → runs on the GPU (device), called from device code
• __host__ __device__ → compiled for both CPU and GPU
• __global__ → runs on the GPU but is launched from the CPU (a kernel)
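A minimal sketch (not from the slides) showing the qualifiers in use; the function names and the operation are assumptions.

// Illustrative only; names and operations are assumptions, not from the slides.
__host__ __device__ float square(float x)      // usable from both CPU and GPU code
{
    return x * x;
}

__device__ float plus_one(float x)             // GPU-only helper, called from device code
{
    return x + 1.0f;
}

__global__ void transform(float *out, int n)   // kernel: runs on the GPU, launched from the CPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = plus_one(square(out[i]));
}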
• Each block can execute in any order relative to other blocks
• Dimensions of the blocks (in terms of threads) → dim3 blockDim
• blockDim.x, blockDim.y, blockDim.z (see the sketch after this list)
[Figure: a grid of blocks, e.g. Block (0, 1) and Block (1, 1), each containing threads Thread (0,1,0), Thread (1,1,0), Thread (2,1,0), Thread (3,1,0). Courtesy: NVIDIA]
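A minimal sketch (not from the slides) of how the launch configuration becomes visible inside a kernel through gridDim and blockDim; the sizes chosen are assumptions.

#include <cuda_runtime.h>
#include <cstdio>

// Illustrative only; the configuration values are assumptions.
__global__ void show_dims()
{
    // For the launch in main(), every thread sees:
    //   gridDim.x == 2, gridDim.y == 2
    //   blockDim.x == 4, blockDim.y == 2, blockDim.z == 1
    if (blockIdx.x == 0 && blockIdx.y == 0 &&
        threadIdx.x == 0 && threadIdx.y == 0)
        printf("blockDim = (%d, %d, %d)\n", blockDim.x, blockDim.y, blockDim.z);
}

int main()
{
    dim3 grid_conf(2, 2);         // 2 x 2 blocks in the grid
    dim3 block_conf(4, 2, 1);     // 4 x 2 x 1 threads per block
    show_dims<<<grid_conf, block_conf>>>();
    cudaDeviceSynchronize();
    return 0;
}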
Matrix addition example: ad + bd = cd, where each matrix is 6 × 8.
...
dim3 grid_conf( ?, ? );
dim3 block_conf( ?, ?, ? );
matrix_add<<< grid_conf, block_conf >>>(ad, bd, cd, N);
...
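The slide leaves the configuration values as an exercise; one possible fill-in, as a sketch, assuming 6 rows and 8 columns and a single block covering the whole matrix (this is an assumption, not the lecturer's answer):

// One possible configuration (an assumption, not the slide's answer):
// a single block of 8 x 6 threads covers the whole 6 x 8 matrix,
// with x indexing columns and y indexing rows.
dim3 grid_conf(1, 1);
dim3 block_conf(8, 6, 1);
matrix_add<<< grid_conf, block_conf >>>(ad, bd, cd, N);

For larger matrices the grid would need more than one block, with each thread computing its own row and column from the indices shown below.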
Global (grid-wide) thread indices for a 2D configuration:
Column index: blockIdx.x * blockDim.x + threadIdx.x
Row index:    blockIdx.y * blockDim.y + threadIdx.y
(see the kernel sketch below)
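A minimal sketch (not from the slides) of how matrix_add might use these indices; the row-major layout and the assumption that N is the matrix width (number of columns) are mine, since the slide does not define them.

// Illustrative sketch; the layout and the meaning of N are assumptions.
__global__ void matrix_add(const float *a, const float *b, float *c, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // column within the matrix
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // row within the matrix
    int idx = row * N + col;                           // row-major element index, N assumed to be the width
    c[idx] = a[idx] + b[idx];                          // assumes the launch exactly covers the matrix
}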