Chapter 3
Chapter 3
Grid 1
Block Block
Kernel 1 (0, 0) (0, 1)
Block Block
(1, 0) (1, 1)
Block (1,1)
(1,0,0) (1,0,1) (1,0,2) (1,0,3)
16×16 blocks
4
The pixels can be calculated
independently of each other
(review)
Covering a 76×62 picture with
16×16 blocks
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
M Row*Width+Col = 2*4+1 = 9
© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2016
2,1 of Illinois, Urbana-Champaign
ECE408/CS483/ECE498al, University
Image Blurring
Row
4
2
Blocks
• Threads are assigned to Streaming
Multiprocessors in block granularity
– Up to 32 blocks to each SM as
Memory
Shared
Memory
Shared resource allows
– Maxwell SM can take up to 2048
threads
• Threads run concurrently
– SM maintains thread/block id #s
– SM manages/schedules thread
execution
TB1, W1 stall
TB2, W1 stall TB3, W2 stall
– For 16X16, we have 256 threads per block. Since each SM can
take up to 1,536 threads (48 warps), which is 6 blocks (within the
8 block limit). Thus we use the full thread capacity of an SM.
– For 32X32, we would have 1,024 threads per Block. Only one
block can fit into an SM, using only 2/3 of the thread capacity of
an SM.
© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2016
ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign