CUDA Execution Model
Key components of an SM:
CUDA Cores
Shared Memory/L1 Cache
Register File
Load/Store Units
Special Function Units
Warp Scheduler
GPU Computing, PIEAS
GPU Architecture Overview
[Figure: Fermi block diagram; graphics-specific components largely omitted.]
512 CUDA cores in total.
Cores are grouped into 32-core groups (SMs).
Each SM is represented by a vertical rectangular strip, labeled:
CUDA Core, Register file, Warp scheduler & dispatching unit, S
The Fermi Architecture
Kepler architecture innovations:
Enhanced SMs
Dynamic Parallelism
Hyper-Q
Warps per thread block = ceil( threads per block / warpSize )
A warp is never split between different thread blocks.
If the thread block size is not a multiple of the warp size, the last warp is padded with inactive threads. These inactive threads still consume SM resources, such as registers.
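The padding rule can be checked with a little arithmetic. The sketch below is illustrative only: the names `kWarpSize`, `warpsPerBlock`, `inactiveLanes`, and `recordWarpAndLane` are hypothetical helpers, not part of the CUDA API, and it assumes a warp size of 32 and a one-dimensional thread block launched as a single block.

```cuda
#include <cuda_runtime.h>

// Assumption: 32 threads per warp, which holds for current NVIDIA GPUs.
// Device code can read the actual value from the built-in variable warpSize.
constexpr int kWarpSize = 32;

// Warps allocated for one thread block: ceiling of threadsPerBlock / warp.
__host__ __device__ constexpr int warpsPerBlock(int threadsPerBlock,
                                                int warp = kWarpSize) {
    return (threadsPerBlock + warp - 1) / warp;
}

// Padding lanes in the last warp that never do useful work.
__host__ __device__ constexpr int inactiveLanes(int threadsPerBlock,
                                                int warp = kWarpSize) {
    return warpsPerBlock(threadsPerBlock, warp) * warp - threadsPerBlock;
}

// For a 1D block, consecutive threadIdx.x values fill one warp after another.
// Each thread records which warp it belongs to and its lane within that warp.
// Both arrays must hold at least blockDim.x elements.
__global__ void recordWarpAndLane(int *warpIdx, int *laneIdx) {
    int t = threadIdx.x;
    warpIdx[t] = t / warpSize;  // warp index inside the block
    laneIdx[t] = t % warpSize;  // lane index inside the warp
}
```

For example, a block of 80 threads occupies `warpsPerBlock(80)` = 3 warps, and `inactiveLanes(80)` = 16 lanes of the third warp are padding. Choosing a block size that is a multiple of 32 makes the second number zero.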
Threadblock and Warps
Lesson:
Try to avoid different execution paths within the same warp.
Keep in mind that the assignment of threads to warps within a thread block is deterministic. Therefore, it may be possible (though not trivial, depending on the algorithm) to partition the data so that all threads in the same warp take the same control path.
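The lesson can be illustrated with two kernels that differ only in their branch condition. This is a minimal sketch, assuming one-dimensional thread blocks; `evenThread`, `evenWarp`, `branchPerThread`, and `branchPerWarp` are hypothetical names chosen for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical branch predicates, callable from host and device.
// t is the thread index within its block (threadIdx.x for a 1D block).
__host__ __device__ inline bool evenThread(int t) { return t % 2 == 0; }
__host__ __device__ inline bool evenWarp(int t, int warp) {
    return (t / warp) % 2 == 0;
}

// Divergent version: neighbouring threads in the same warp disagree on the
// condition, so the warp has to go through both branches.
__global__ void branchPerThread(float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (evenThread(threadIdx.x)) {
        out[tid] = 100.0f;
    } else {
        out[tid] = 200.0f;
    }
}

// Warp-aligned version: the condition depends only on the warp index within
// the block, so all threads of a warp take the same branch.
__global__ void branchPerWarp(float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (evenWarp(threadIdx.x, warpSize)) {
        out[tid] = 100.0f;
    } else {
        out[tid] = 200.0f;
    }
}
```

The two kernels write the values to different elements, so switching from one to the other only makes sense if the data can be laid out to match, which is the partitioning step the lesson refers to. For branches this short the compiler may replace the branch with predicated instructions, so a measurable difference is more likely with longer branch bodies.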