Chapter 3

Multithreaded Architectures
(Applied Parallel Programming)

Lecture 3: Scalable Parallel Execution

© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2016
ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
A Multi-Dimensional Grid Example

[Figure: the host launches Kernel 1 on the device as Grid 1, a 2×2 grid of
blocks (0,0), (0,1), (1,0), (1,1). Block (1,1) is expanded to show its
threads with 3D indices, e.g. (0,0,0) through (0,0,3), (0,1,0) through
(0,1,3), and (1,0,0) through (1,0,3).]
Processing a Picture with a 2D Grid

16×16 blocks

Conversion of a Color Image to a
Greyscale Image (review)

The pixels can be calculated independently of each other (review).

Covering a 76×62 picture with 16×16 blocks: the grid is rounded up to 5×4
blocks (80×64 threads), so threads outside the image are disabled by
testing (Col < width && Row < height).
colorToGreyscaleConversion Kernel
with 2D thread mapping to data
// We have 3 channels corresponding to RGB.
// The input image is encoded as unsigned characters [0, 255].
#define CHANNELS 3

__global__
void colorToGreyscaleConversion(unsigned char *Pout, unsigned char *Pin,
                                int width, int height) {
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < width && Row < height) {
    // Get 1D coordinate for the greyscale image
    int greyOffset = Row * width + Col;
    // One can think of the RGB image as having
    // CHANNELS times the columns of the greyscale image
    int rgbOffset = greyOffset * CHANNELS;
    unsigned char r = Pin[rgbOffset];      // red value for pixel
    unsigned char g = Pin[rgbOffset + 1];  // green value for pixel
    unsigned char b = Pin[rgbOffset + 2];  // blue value for pixel
    // Perform the rescaling and store it.
    // We multiply by floating-point constants.
    Pout[greyOffset] = 0.21f * r + 0.71f * g + 0.07f * b;
  }
}
Row-Major Layout of 2D Arrays in C/C++

M0,0 M0,1 M0,2 M0,3
M1,0 M1,1 M1,2 M1,3
M2,0 M2,1 M2,2 M2,3
M3,0 M3,1 M3,2 M3,3

Linearized as M:
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
M0  M1  M2  M3  M4  M5  M6  M7  M8  M9  M10 M11 M12 M13 M14 M15

M2,1 → Row*Width + Col = 2*4 + 1 = 9
Image Blurring

Each output pixel is the average of
the pixels around it (BLUR_SIZE = 1)
An Image Blur Kernel
__global__
void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < w && Row < h) {
    int pixVal = 0;
    int pixels = 0;

    // Average the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
    for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
      for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
          pixVal += in[curRow * w + curCol];
          pixels++;  // Keep track of the number of pixels in the average
        }
      }
    }
    // Write our new pixel value out
    out[Row * w + Col] = (unsigned char)(pixVal / pixels);
  }
}
Handling boundary conditions for
pixels near the edges of the image

[Figure: four cases (1-4) of the blur box overlapping the image's corners
and edges; only the in-bounds pixels contribute to each average.]
CUDA Thread Block (review)
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size: 1 to 1024 concurrent threads
  – Block shape: 1D, 2D, or 3D
• Threads have thread index numbers within the block
  – Kernel code uses the thread index and block index to select work and
    address shared data
• Threads in the same block share data and synchronize while doing their
  share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!

[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, … m, each running
the thread program. Courtesy: John Nickolls, NVIDIA]
Compute Capabilities are GPU
Dependent

Executing Thread Blocks

[Figure: two Streaming Multiprocessors (SM 0, SM 1), each with an MT IU,
SPs, and shared memory; thread blocks of threads t0 t1 t2 … tm are
assigned to each.]

• Threads are assigned to Streaming Multiprocessors in block granularity
  – Up to 32 blocks per SM, as resources allow
  – A Maxwell SM can take up to 2048 threads
• Threads run concurrently
  – The SM maintains thread/block id #s
  – The SM manages/schedules thread execution
Thread Scheduling (1/2)
• Each block is executed as 32-thread warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how
  many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – 8 warps/block * 3 blocks = 24 warps

[Figure: the warps of Blocks 1-3 (threads t0 t1 t2 … t31) sharing the
SM's register file (128 KB), L1 cache (16 KB), and shared memory (48 KB).]
Thread Scheduling (2/2)

• The SM implements zero-overhead warp scheduling
  – Warps whose next instruction has its operands ready for consumption
    are eligible for execution
  – Eligible warps are selected for execution based on a prioritized
    scheduling policy
  – All threads in a warp execute the same instruction when selected

[Figure: a scheduling timeline. Warps from thread blocks TB1-TB3 are
interleaved instruction by instruction; when a warp stalls (e.g. TB1 W1,
TB2 W1, TB3 W2), another eligible warp is selected immediately.
TB = Thread Block, W = Warp]

Single Program Multiple Data (SPMD)
• The main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – The different execution paths are serialized on current GPUs
    • The control paths taken by the threads in a warp are traversed one
      at a time until there are no more
• A common case: divergence can occur when the branch condition is a
  function of the thread ID
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Branch granularity < warp size; threads 0, 1, and 2 follow a
      different path than the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • This also creates two different control paths for threads in a block
    • Branch granularity is a whole multiple of warp size; all threads in
      any given warp follow the same path
Block Granularity Considerations
• For matrix multiplication using multiple blocks, should one use 8×8,
  16×16, or 32×32 blocks? Assume that in the GPU used, each SM can take
  up to 1,536 threads and up to 8 blocks.

  – For 8×8, we have 64 threads per block. Each SM can take up to 1,536
    threads, which would be 24 blocks. But since each SM can only take up
    to 8 blocks, only 512 threads (16 warps) will go into each SM!

  – For 16×16, we have 256 threads per block. Each SM can take up to
    1,536 threads (48 warps), which is 6 blocks (within the 8-block
    limit). Thus we use the full thread capacity of an SM.

  – For 32×32, we would have 1,024 threads per block. Only one block can
    fit into an SM, using only 2/3 of the thread capacity of an SM.
