CSC447 Multidimensional Grids and Data
A Hands-on Approach
WEEK 4
Multidimensional Grids and Data
gridDim.x: number of blocks in the grid
blockIdx.x: position of the block within the grid
threadIdx.x: position of the thread within the block
blockDim.x: number of threads per block
• If you create a dim3 object without parameters, it defaults all dimensions (x,
y, and z) to 1.
• Example: dim3 grid; // grid.x = 1, grid.y = 1, grid.z = 1
• x, y, z: These are the three dimensions, each of type unsigned int, and can be
accessed directly (e.g., grid.x, grid.y, grid.z).
• This structure is specifically designed for managing and organizing the parallel
execution of blocks and threads on a CUDA-enabled GPU.
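A minimal host-side sketch of dim3 in use (the variable names are illustrative, not from the slides):
dim3 grid;           // defaults to (1, 1, 1)
dim3 block(128);     // (128, 1, 1): unspecified y and z default to 1
dim3 tile(4, 2, 2);  // all three dimensions given explicitly
printf("grid = (%u, %u, %u)\n", grid.x, grid.y, grid.z);  // prints "grid = (1, 1, 1)"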
•Grid: 32 blocks
•Block: 128 threads per block
•Total threads: 32 × 128 = 4,096
•In CUDA, each block can have up to 1024 threads. If the total number of threads per
block exceeds this limit, the configuration is invalid. Here's an example of an invalid
block configuration:
blockDim.x = 32
blockDim.y = 32
blockDim.z = 2
•The total number of threads in this block would be:
• Total threads = 32 × 32 × 2 = 2048
• Since 2048 exceeds the 1024-thread limit per block, this configuration is invalid.
• If you need more than 1024 threads, you can increase the number of blocks in the grid,
but the block size itself must be within the 1024-thread limit.
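A hedged sketch of how such an invalid launch shows up at runtime (myKernel is a hypothetical kernel; the error-checking calls are the standard CUDA runtime API):
dim3 badBlock(32, 32, 2);      // 32 × 32 × 2 = 2048 threads: exceeds the limit
myKernel<<<1, badBlock>>>();   // the runtime rejects this launch
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    // reports "invalid configuration argument" for this launch
    printf("launch failed: %s\n", cudaGetErrorString(err));
}
dim3 goodBlock(32, 32, 1);     // 32 × 32 × 1 = 1024 threads: at the limit, valid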
Here's a breakdown of how the threads are indexed within the block:
•threadIdx.x ranges from 0 to 3 (4 threads in the x dimension).
•threadIdx.y ranges from 0 to 3 (4 threads in the y dimension).
•threadIdx.z ranges from 0 to 3 (4 threads in the z dimension).
So, in total, you have 4 × 4 × 4 = 64 threads per block, with the thread indices
running from (0, 0, 0) (the first thread) to (3, 3, 3) (the last thread).
threadIdx.z  threadIdx.y  threadIdx.x
     0            0            0
     0            0            1
     0            0            2
     0            0            3
     0            1            0
     0            1            1
     0            1            2
     0            1            3
    ...
Grid and Block Dimensions in CUDA
Here, gridSize is dynamically calculated to ensure enough blocks are created to cover
all vector elements (the examples below assume 256 threads per block; see the code sketch below).
For example:
•If n = 1000, grid size = ceil(1000/256) = 4 blocks
•If n = 4000, grid size = ceil(4000/256) = 16 blocks
This approach adapts the grid size based on the input size, ensuring all threads
are properly allocated.
• vecAddKernel<<<gridSize, blockSize>>>(...);
• Here:
• gridSize defines the number of blocks in the x-dimension.
• blockSize defines the number of threads per block in the x-dimension.
• Y and Z dimensions are automatically set to 1.
This approach works perfectly when you're using 1D grids and blocks.
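Putting the pieces together, a minimal launch sketch assuming 256 threads per block and device pointers A_d, B_d, C_d (names are illustrative):
int blockSize = 256;                              // threads per block (x-dimension)
int gridSize = (n + blockSize - 1) / blockSize;   // ceiling division covers all n elements
vecAddKernel<<<gridSize, blockSize>>>(A_d, B_d, C_d, n);
// n = 1000 -> gridSize = 4;  n = 4000 -> gridSize = 16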
• Built-in dimension and index variables each have three components: x, y, and z
[Figure: a 2D grid of thread blocks, gridDim.x blocks wide and gridDim.y blocks high; each block is labeled by its (blockIdx.x, blockIdx.y) position and contains blockDim.x × blockDim.y threads]
The complete list of thread indices:
(0, 0), (0, 1), (0, 2), (0, 3) // First row (y = 0)
(1, 0), (1, 1), (1, 2), (1, 3) // Second row (y = 1)
(2, 0), (2, 1), (2, 2), (2, 3) // Third row (y = 2)
(3, 0), (3, 1), (3, 2), (3, 3) // Fourth row (y = 3)
Multidimensional Indexing
• Built-in dimension and index variables each have three components: x, y, and z
[Figure: zooming into one block of the gridDim.x × gridDim.y grid; within block (blockIdx.x, blockIdx.y), each thread is labeled by (threadIdx.x, threadIdx.y) inside a blockDim.x × blockDim.y arrangement]
•Example:
•gridDim (2, 2, 1)
•blockDim (4, 2, 2)
•Important Detail:
•When blocks and threads are labeled in these figures, CUDA examples list the
highest dimension first: block (1,0) means blockIdx.y = 1 and blockIdx.x = 0.
This ordering maps intuitively onto row-major data indexing.
•Example Configuration:
•Block Dimension: (4, 2, 2)
•Threads per Block: 16
•Total Threads in Grid: 64
•Note:
•The example uses small numbers for simplicity; real CUDA applications often use
thousands to millions of threads.
•Extra Threads:
•Using a block size of 16 × 16 threads to process a 76 × 62 image requires 5 × 4
blocks (76/16 and 62/16, rounded up), generating 80 × 64 threads.
•These 5,120 threads exceed the 4,712 actual pixels.
•Solution: Use conditional checks (if statements) to ensure extra threads do not
operate on non-existent pixels.
•Processing Larger Images:
•For a 1500 × 2000 image:
•Grid dimensions: 94 blocks in the x-direction (1500 ÷ 16 = 93.75, rounded up
to 94) and 125 blocks in the y-direction (2000 ÷ 16 = 125).
•Total blocks: 11,750 blocks (94 × 125).
•Kernel Functionality:
•The kernel uses gridDim.x, gridDim.y, blockDim.x, and blockDim.y to
reference grid and block dimensions.
•In this case, gridDim.x =94, gridDim.y = 125, blockDim.x = 16, and
blockDim.y = 16.
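A host-side launch sketch matching these numbers (the kernel and pointer names are illustrative; width = 1500 runs along x, height = 2000 along y):
dim3 dimBlock(16, 16, 1);        // 256 threads per block
dim3 dimGrid((1500 + 15) / 16,   // ceil(1500 / 16) = 94 blocks in x
             (2000 + 15) / 16,   // ceil(2000 / 16) = 125 blocks in y
             1);                 // 94 × 125 = 11,750 blocks total
colorToGrayscaleConversion<<<dimGrid, dimBlock>>>(Pout_d, Pin_d, 1500, 2000);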
•Problem:
•In CUDA C, multidimensional arrays are difficult to access dynamically at
runtime because the number of columns in such arrays is unknown at
compile time.
•Example of static array: int array[10][20]; // Here, 10 and 20 are known at
compile time
•Key Concept:
•Linearization of arrays: All multidimensional arrays are stored in memory as
1D arrays.
•While static arrays are automatically flattened by compilers, for dynamically
allocated arrays, programmers must explicitly flatten the array.
•Example:
•A 4 × 4 matrix M is flattened into a 1D array of 16 elements.
•Each row is stored sequentially: elements 0 through 3 hold row 0, elements 4
through 7 hold row 1, and so on.
•Memory Layout:
•In memory, a 2D array is stored in a contiguous block, with each row placed
sequentially.
•The linearization formula calculates the 1D offset from the row index, the
column index, and the number of columns: index = row × num_columns + col.
•Example:
•For an array Pind with num_columns = 76, the element at row j = 2 and column i
= 3 is accessed as:
•Pind[2 × 76 + 3] = Pind[155]
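As a sketch, the row-major offset computation can be wrapped in small helper functions (the names are illustrative):
__host__ __device__ inline int idx2D(int row, int col, int num_columns) {
    return row * num_columns + col;               // e.g., idx2D(2, 3, 76) == 155
}
__host__ __device__ inline int idx3D(int z, int y, int x, int height, int width) {
    return z * (height * width) + y * width + x;  // plane offset, then row, then column
}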
•Why Flattening?:
•Flattening ensures that dynamically allocated arrays are accessed correctly in
CUDA C without compile-time dimensional information.
•Column-Major Layout:
•In column-major layout, all elements of a column are placed in consecutive
memory locations, so the 1D offset is index = col × num_rows + row.
•This layout is used by FORTRAN compilers.
•Thread Mapping:
•The kernel organizes threads in a 2D grid and 2D blocks to process the image.
•Each thread computes the horizontal index (col) and vertical index (row) using:
•col = blockIdx.x × blockDim.x + threadIdx.x
•row = blockIdx.y × blockDim.y + threadIdx.y
•Each thread is assigned to process one pixel, given by its row and
column indices.
Parallelization approach: assign one thread to convert each pixel in the image
• An unsigned char is a data type in C and C++ that can store integer values ranging
from 0 to 255. Here’s a quick summary:
•Unsigned: Indicates that the type cannot represent negative values.
•Char: Typically refers to a character type, but in this case, it's used to denote an 8-bit
data type.
•Range: Because it uses 8 bits, the values can go from 0 (00000000 in binary)
to 255 (11111111 in binary).
•In C and C++, when you see a number like 0.21f, the f suffix indicates that it is a
float data type.
•Float (float): Typically represents a single-precision floating-point number. It usually
uses 32 bits (4 bytes) and has about 7 decimal digits of precision.
•Double (double): Represents a double-precision floating-point number. It usually uses
64 bits (8 bytes) and offers about 15 decimal digits of precision.
•Grayscale Calculation:
•The grayscale value is calculated using the weighted sum:
L = 0.21 × r + 0.72 × g + 0.07 × b
•This value is stored in the Pout array at the grayOffset.
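A sketch of the per-pixel kernel these bullets describe, assuming a 3-channel interleaved RGB input Pin and a single-channel output Pout (the kernel name is illustrative):
__global__ void colorToGrayscaleConversion(unsigned char* Pout, unsigned char* Pin,
                                           int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {       // extra threads skip non-existent pixels
        int grayOffset = row * width + col;  // row-major 1D offset of the pixel
        int rgbOffset = grayOffset * 3;      // 3 interleaved channels per pixel
        unsigned char r = Pin[rgbOffset];
        unsigned char g = Pin[rgbOffset + 1];
        unsigned char b = Pin[rgbOffset + 2];
        Pout[grayOffset] = 0.21f * r + 0.72f * g + 0.07f * b;
    }
}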
•Example Calculation:
•For thread (0,0) of block (1,0), with 16 × 16 blocks on the 76-pixel-wide image
and the highest-dimension-first labeling (blockIdx.y = 1, blockIdx.x = 0):
row = 1 × 16 + 0 = 16 and col = 0, so grayOffset = 16 × 76 + 0 = 1216 and the
1D index into Pin is 1216 × 3 = 3648.
• Consider a 2D matrix with a width of 400 and a height of 500. The matrix is
stored as a one-dimensional array. Specify the array index of the matrix
element at row 20 and column 10 if the matrix is stored in row-major order.
• A. 5,020
• B. 4,020
• C. 10,010
• D. 8,010
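• Answer: row-major index = row × width + col = 20 × 400 + 10 = 8,010, so D.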
• Consider a 2D matrix with a width of 400 and a height of 500. The matrix is
stored as a one-dimensional array. Specify the array index of the matrix
element at row 20 and column 10 if the matrix is stored in column-major
order.
• A. 5,020
• B. 4,020
• C. 10,010
• D. 8,010
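• Answer: column-major index = col × height + row = 10 × 500 + 20 = 5,020, so A.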
• Consider a 3D tensor with a width of 400, a height of 500, and a depth of 300.
The tensor is stored as a one-dimensional array in row-major order. Specify
the array index of the tensor element at x=10, y=20, and z=5.
• Answer: index = 5 × (500 × 400) + 20 × 400 + 10 = 1,008,010
•Blurring in CUDA:
•Each thread in the image-blurring kernel will process a patch of pixels.
•The output pixel is computed as the average of an N×N patch of pixels.
Output pixel is the average of the corresponding input pixel and the pixels around it
•Thread-to-Pixel Mapping:
•Each thread computes an output pixel using its position as the center of the
patch.
•Calculating Col and Row:
•Each thread calculates the col and row indices for its output pixel.
•The if-statement ensures that only threads with valid indices (within image
boundaries) participate in the kernel.
•Patch Processing (BLUR_SIZE):
•BLUR_SIZE determines the radius of the patch:
•For a 3×3 patch: BLUR_SIZE = 1
•For a 7×7 patch: BLUR_SIZE = 3
•The outer loop iterates over rows, and the inner loop iterates over columns
within the patch.
•Example Calculation (BLUR_SIZE = 1):
•For output pixel (25,50):
•First iteration: Process row 24, pixels (24,49),(24,50),(24,51).
•Second iteration: Process row 25, pixels (25,49),(25,50),(25,51).
•Third iteration: Process row 26, pixels (26,49),(26,50),(26,51).
Even if the output pixel is in bounds, the kernel might access out-of-bounds input pixels
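A sketch of the blur kernel just described, assuming a single-channel image and BLUR_SIZE = 1 (a 3 × 3 patch); the boundary test inside the loops is what handles the out-of-bounds neighbors noted above:
#define BLUR_SIZE 1  // patch is (2*BLUR_SIZE + 1) x (2*BLUR_SIZE + 1) = 3 x 3

__global__ void blurKernel(unsigned char* in, unsigned char* out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {  // only threads with valid output pixels participate
        int pixVal = 0;
        int pixels = 0;        // count of in-bounds neighbors actually averaged
        for (int blurRow = -BLUR_SIZE; blurRow <= BLUR_SIZE; ++blurRow) {      // patch rows
            for (int blurCol = -BLUR_SIZE; blurCol <= BLUR_SIZE; ++blurCol) {  // patch columns
                int curRow = row + blurRow;
                int curCol = col + blurCol;
                // skip neighbors that fall outside the image
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    ++pixels;
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);
    }
}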
•Matrix Multiplication:
•Matrix multiplication is fundamental in linear algebra and applications like deep
learning.
•Multiply an i×j matrix M by a j×k matrix N to produce an i×k matrix P.
•Element Calculation:
•Each element P(row,col) is computed as the dot product (inner product) of:
•The row from matrix M (horizontal strip).
•The column from matrix N (vertical strip).
•Dot Product Formula: P(row,col) = Σ(k = 0 to Width-1) M(row,k) × N(k,col)
•Thread-to-Data Mapping:
•Each thread in the CUDA grid is responsible for computing one element of the
output matrix P.
Loop Structure
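The line references below point at a kernel like the following sketch (line numbers are shown because the notes cite them; assumes square Width × Width matrices):
01 __global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
02     int row = blockIdx.y * blockDim.y + threadIdx.y;
03     int col = blockIdx.x * blockDim.x + threadIdx.x;
04
05     if (row < Width && col < Width) {
06         float Pvalue = 0;                               // running dot product
07         for (int k = 0; k < Width; ++k) {
08             Pvalue += M[row*Width + k] * N[k*Width + col];
09         }
10         P[row*Width + col] = Pvalue;                    // row-major store
11     }
12 }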
1.Initialization: Local variable Pvalue is set to 0 before entering the loop (Line 06).
2.Element Access:
•Matrix M:
•Linearized using row-major order.
•Beginning of the row-th row: M[row × Width].
•k-th element of the row-th row: M[row × Width + k] (Line 08).
•Matrix N:
•Beginning of the col-th column: N[col].
•k-th element of the col-th column: N[k × Width + col] (Line 08).
•Accessing the next element in the col-th column requires skipping over an
entire row of Width elements.
•This is because the next element of the same column lies in the next row.
• Thread Responsibilities
• Each thread computes one element of the resultant matrix P using its unique
row and column indices.
• After the loop, threads store results using:
• Index expression: row × Width + col (Line 10).
• Example Execution
• Small Example: a 4 × 4 matrix P with BLOCK_WIDTH = 2.
• Block Structure:
• P is divided into 4 tiles, each handled by a 2 × 2 thread block.
• Each thread computes a specific P element:
• Thread (0,0) in block (0,0) computes P(0,0).
• Thread (0,1) in block (0,0) computes P(0,1).
• Thread (1,0) in block (0,0) computes P(1,0).
• Thread (1,1) in block (0,0) computes P(1,1).
• Block (1,0) computes the bottom-left tile, elements P(2,0), P(2,1), P(3,0),
P(3,1); for example, thread (0,0) in block (1,0) computes P(2,0).
• Visualization: Each thread maps its indices to corresponding rows of M and
columns of N for dot product calculations.
Accessing Multidimensional Data
[Figure: a row-major 2D array of the given width, with col indexing a position along a row]