Chapter 3

Multithreaded Architectures
(Applied Parallel Programming)

Lecture 3: Scalable Parallel Execution

© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2016
ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
A Multi-Dimensional Grid Example

[Figure: the host launches Kernel 1 on the device as Grid 1, a 2×2 grid of
blocks (0,0), (0,1), (1,0), (1,1). Block (1,1) is expanded to show its
threads with 3D indices, e.g. (0,0,0) through (0,0,3), (0,1,0) through
(0,1,3), and (1,0,0) through (1,0,3).]
Processing a Picture with a 2D Grid

16×16 blocks

Conversion of a Color Image to a
Greyscale Image (review)

The pixels can be calculated independently of each other (review).

Covering a 76×62 picture with 16×16 blocks: the grid is rounded up to 5×4
blocks (80×64 threads), so threads outside the image are disabled by
testing (Col < width && Row < height).
colorToGreyscaleConversion Kernel
with 2D thread mapping to data
// We have 3 channels corresponding to RGB.
// The input image is encoded as unsigned characters [0, 255].
#define CHANNELS 3

__global__
void colorToGreyscaleConversion(unsigned char *Pout, unsigned char *Pin,
                                int width, int height) {
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < width && Row < height) {
    // Get 1D coordinate for the greyscale image
    int greyOffset = Row * width + Col;
    // One can think of the RGB image as having
    // CHANNELS times the columns of the greyscale image
    int rgbOffset = greyOffset * CHANNELS;
    unsigned char r = Pin[rgbOffset];      // red value for pixel
    unsigned char g = Pin[rgbOffset + 1];  // green value for pixel
    unsigned char b = Pin[rgbOffset + 2];  // blue value for pixel
    // Perform the rescaling and store it.
    // We multiply by floating-point constants.
    Pout[greyOffset] = 0.21f * r + 0.71f * g + 0.07f * b;
  }
}
Row-Major Layout of 2D Arrays in C/C++

M0,0 M0,1 M0,2 M0,3
M1,0 M1,1 M1,2 M1,3
M2,0 M2,1 M2,2 M2,3
M3,0 M3,1 M3,2 M3,3

Linearized as M:
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
M0  M1  M2  M3  M4  M5  M6  M7  M8  M9  M10 M11 M12 M13 M14 M15

M2,1 → Row*Width + Col = 2*4 + 1 = 9
Image Blurring

Each output pixel is the average of
the pixels around it (BLUR_SIZE = 1)
An Image Blur Kernel
__global__
void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  int Row = blockIdx.y * blockDim.y + threadIdx.y;

  if (Col < w && Row < h) {
    int pixVal = 0;
    int pixels = 0;

    // Average the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
    for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
      for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
          pixVal += in[curRow * w + curCol];
          pixels++;  // Keep track of the number of pixels in the average
        }
      }
    }
    // Write our new pixel value out
    out[Row * w + Col] = (unsigned char)(pixVal / pixels);
  }
}
Handling boundary conditions for
pixels near the edges of the image

[Figure: four cases (1-4) of the blur box overlapping the image's corners
and edges; only the in-bounds pixels contribute to each average.]
CUDA Thread Block (review)
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size: 1 to 1024 concurrent threads
  – Block shape: 1D, 2D, or 3D
• Threads have thread index numbers within the block
  – Kernel code uses the thread index and block index to select work and
    address shared data
• Threads in the same block share data and synchronize while doing their
  share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!

[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, … m, each running
the thread program. Courtesy: John Nickolls, NVIDIA]
Compute Capabilities are GPU
Dependent

Executing Thread Blocks

[Figure: two Streaming Multiprocessors (SM 0, SM 1), each with an MT IU,
SPs, and shared memory; thread blocks of threads t0 t1 t2 … tm are
assigned to each.]

• Threads are assigned to Streaming Multiprocessors in block granularity
  – Up to 32 blocks per SM, as resources allow
  – A Maxwell SM can take up to 2048 threads
• Threads run concurrently
  – The SM maintains thread/block id #s
  – The SM manages/schedules thread execution
Thread Scheduling (1/2)
• Each block is executed as 32-thread warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how
  many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – 8 warps/block * 3 blocks = 24 warps

[Figure: the warps of Blocks 1-3 (threads t0 t1 t2 … t31) sharing the
SM's register file (128 KB), L1 cache (16 KB), and shared memory (48 KB).]
Thread Scheduling (2/2)

• The SM implements zero-overhead warp scheduling
  – Warps whose next instruction has its operands ready for consumption
    are eligible for execution
  – Eligible warps are selected for execution based on a prioritized
    scheduling policy
  – All threads in a warp execute the same instruction when selected

[Figure: a scheduling timeline. Warps from thread blocks TB1-TB3 are
interleaved instruction by instruction; when a warp stalls (e.g. TB1 W1,
TB2 W1, TB3 W2), another eligible warp is selected immediately.
TB = Thread Block, W = Warp]

Single Program Multiple Data (SPMD)
• The main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – The different execution paths are serialized on current GPUs
    • The control paths taken by the threads in a warp are traversed one
      at a time until there are no more
• A common case: divergence can occur when the branch condition is a
  function of the thread ID
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Branch granularity < warp size; threads 0, 1, and 2 follow a
      different path than the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • This also creates two different control paths for threads in a block
    • Branch granularity is a whole multiple of warp size; all threads in
      any given warp follow the same path
Block Granularity Considerations
• For matrix multiplication using multiple blocks, should one use 8×8,
  16×16, or 32×32 blocks? Assume that in the GPU used, each SM can take
  up to 1,536 threads and up to 8 blocks.

  – For 8×8, we have 64 threads per block. Each SM can take up to 1,536
    threads, which would be 24 blocks. But since each SM can only take up
    to 8 blocks, only 512 threads (16 warps) will go into each SM!

  – For 16×16, we have 256 threads per block. Each SM can take up to
    1,536 threads (48 warps), which is 6 blocks (within the 8-block
    limit). Thus we use the full thread capacity of an SM.

  – For 32×32, we would have 1,024 threads per block. Only one block can
    fit into an SM, using only 2/3 of the thread capacity of an SM.
