
Programming Massively Parallel Processors

Wen-mei Hwu, David Kirk, Izzat El Hajj

A Hands-on Approach

WEEK 4
Multidimensional Grids and Data

Dr. Rachad Atat

Copyright © 2022 Elsevier


CUDA Thread and Block Organization

•Thread and Block Hierarchy:


•Threads in CUDA are organized in a two-level hierarchy:
•Grids: Consist of one or more blocks.
•Blocks: Consist of one or more threads.

•Thread and Block Indices:


•Built-in Variables:
•blockIdx: Block index in the grid.
•threadIdx: Thread index within a block.
•These indices allow threads to distinguish themselves and identify the
correct data to process.

•Execution Configuration Parameters:


•When a kernel is called, execution configuration parameters specify:
•The grid dimensions (number of blocks).
•The block dimensions (number of threads per block).
•These parameters are given in the form of dim3 objects, which are 3D
integer vectors.



Previously: One Dimensional Indexing

gridDim.x
# of blocks in grid

blockIdx.x
position of block in grid

threadIdx.x
position of thread in block

blockDim.x
# threads in block



CUDA Thread and Block Organization
•Grid and Block Dimensionality:
•A grid is a 3D array of blocks.
•A block is a 3D array of threads.
•Each dimension (x, y, z) is specified by the programmer using the dim3 type.
•Fewer than three dimensions can be used by setting the size of unused
dimensions to 1. What is the total number of threads in all 3 dimensions?

•Accessing Grid and Block Dimensions:


•CUDA provides built-in variables:
•gridDim: Number of blocks in the grid in each dimension.
•blockDim: Number of threads in a block in each dimension.
•These variables are used by threads to determine their coordinates within the grid
and block.
•For example, a kernel can be launched with a 2D grid and 2D blocks like this:
dim3 grid(3, 3); // 3x3 grid of blocks
dim3 block(8, 8); // Each block has 8x8 threads
kernel_name<<<grid, block>>>(parameters);
 gridDim.x = 3, gridDim.y = 3, gridDim.z = 1
 blockDim.x = 8, blockDim.y = 8, blockDim.z = 1
•Inside the kernel, each thread will calculate its position using blockDim, blockIdx,
and threadIdx.
dim3 built-in type

• If you create a dim3 object without parameters, it defaults all dimensions (x,
y, and z) to 1.
• Example: dim3 grid; // grid.x = 1, grid.y = 1, grid.z = 1
• x, y, z: These are the three dimensions, each of type unsigned int, and can be
accessed directly (e.g., grid.x, grid.y, grid.z).
• This structure is specifically designed for managing and organizing the parallel
execution of blocks and threads on a CUDA-enabled GPU.



Grid and Block Dimensions in CUDA

Example: Calculating a 2D global thread index


If you're working with a 2D grid and 2D blocks, the global position can be
calculated using both x and y dimensions:
int globalIdxX = blockIdx.x * blockDim.x + threadIdx.x;
int globalIdxY = blockIdx.y * blockDim.y + threadIdx.y;

CUDA Kernel Call Example


To call a CUDA kernel and define the grid and block dimensions:

// Defines block dimensions: 128 threads in x, 1 in y, and 1 in z.
dim3 dimBlock(128, 1, 1); // equivalently: dim3 dimBlock(128);
// Defines grid dimensions: 32 blocks in x, 1 in y, and 1 in z.
dim3 dimGrid(32, 1, 1); // equivalently: dim3 dimGrid(32);
// Launches the kernel
vecAddKernel<<<dimGrid, dimBlock>>>(...);

•Grid: 32 blocks
•Block: 128 threads per block
•Total threads: 32 blocks × 128 threads = 4096 threads



Thread limit per block

•In CUDA, each block can have up to 1024 threads. If the total number of threads per
block exceeds this limit, the configuration is invalid. Here's an example of an invalid
block configuration:
 blockDim.x = 32
 blockDim.y = 32
 blockDim.z = 2
•The total number of threads in this block would be:

• Total threads = 32 × 32 × 2 = 2048
• Since 2048 threads exceed the 1024 thread limit per block, this configuration is invalid.
• If you need more than 1024 threads, you can increase the number of blocks in the grid,
but the block size itself must be within the 1024-thread limit.
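To check the actual limits on a given device, the CUDA runtime can be queried at run time. A minimal host-side sketch using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock); // typically 1024
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}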



Thread Indices
If you're using a block with dimensions (4, 4, 4):

Here's a breakdown of how the threads are indexed within the block:
•threadIdx.x ranges from 0 to 3 (4 threads in the x dimension).
•threadIdx.y ranges from 0 to 3 (4 threads in the y dimension).
•threadIdx.z ranges from 0 to 3 (4 threads in the z dimension).

So, in total, you have 4 * 4 * 4 = 64 threads per block, with the thread indices
running from (0, 0, 0) (the first thread) to (3, 3, 3) (the last thread).
threadIdx.z   threadIdx.y   threadIdx.x
0             0             0
0             0             1
0             0             2
0             0             3
0             1             0
0             1             1
0             1             2
0             1             3
...
Grid and Block Dimensions in CUDA

Flexible Dimensions with Variables


Alternatively, you can use dynamic dimensions based on the input size, as shown
in this example:

// Fixed block size
dim3 blockSize(256, 1, 1);
// Grid size calculated to fit 'n' elements
dim3 gridSize((n + blockSize.x - 1) / blockSize.x, 1, 1);
vecAddKernel<<<gridSize, blockSize>>>(...);

Here, gridSize is dynamically calculated to ensure enough blocks are created to cover
all vector elements.
For example:
•If n = 1000, grid size = 4 blocks
•If n = 4000, grid size = 16 blocks
This approach adapts the grid size based on the input size, ensuring all threads
are properly allocated.



Shortcut for 1D Grid and Block Configurations in CUDA

• Convenient 1D Kernel Launching


• Instead of explicitly defining dim3 variables, you can use arithmetic expressions for
1D grids and blocks:

• vecAddKernel<<<gridSize, blockSize>>>(...);
• Here:
• gridSize defines the number of blocks in the x-dimension.
• blockSize defines the number of threads per block in the x-dimension.
• Y and Z dimensions are automatically set to 1.

int N = 10000;       // Total number of elements
int blockSize = 256; // Number of threads per block
int gridSize = (N + blockSize - 1) / blockSize; // Number of blocks

// Kernel launch without using dim3 explicitly
vecAddKernel<<<gridSize, blockSize>>>(...);

This approach works perfectly when you're using 1D grids and blocks.



CUDA Grid and Block Configurations
• Grid Dimensions (gridDim) Allowed ranges:
• gridDim.x: 1 to 2³¹ - 1 (over 2 billion)
• gridDim.y and gridDim.z: 1 to 2¹⁶ - 1 (which is 65,535)
• Block Dimensions (blockDim)
• 3D Thread Block:
• Threads in a block are organized in a 3D array, defined by blockDim.x, blockDim.y,
and blockDim.z.
• 1D Blocks: Set blockDim.y = 1 and blockDim.z = 1 (e.g., in vector addition).
• 2D Blocks: Set blockDim.z = 1.
• Examples: 1D Block (512 threads):
• blockDim.x = 512 blockDim.y = 1 blockDim.z = 1
• 2D Block (512 threads):
• blockDim.x = 16 blockDim.y = 32 blockDim.z = 1
• 3D Block (512 threads):
• blockDim.x = 8 blockDim.y = 16 blockDim.z = 4



Multidimensional Indexing

• Built-in dimension and index variables each have three components x, y, and z

[Figure: a 2D grid of thread blocks; gridDim.x is the number of blocks across, gridDim.y the number of blocks down.]


Remark

In CUDA, the vertical (row) direction corresponds to the y-coordinate in a 2D grid of
threads. The coordinates are typically represented as follows:
•threadIdx.x: the horizontal index (column) of the thread within the block.
•threadIdx.y: the vertical index (row) of the thread within the block.


Multidimensional Indexing

• Built-in dimension and index variables each have three components x, y, and z

[Figure: the same 2D grid; blockIdx.x and blockIdx.y give each block's position along the gridDim.x and gridDim.y dimensions.]

Block indices (blockIdx.y, blockIdx.x) for a 2×2 grid:
(0, 0) (0, 1) // First row
(1, 0) (1, 1) // Second row


Multidimensional Indexing

• Built-in dimension and index variables each have three components x, y, and z
[Figure: blockDim.x and blockDim.y give the number of threads per block in x and y, within a gridDim.x × gridDim.y grid of blocks.]

The complete list of thread indices (threadIdx.y, threadIdx.x) in a 4×4 block:
(0, 0), (0, 1), (0, 2), (0, 3) // First row (y = 0)
(1, 0), (1, 1), (1, 2), (1, 3) // Second row (y = 1)
(2, 0), (2, 1), (2, 2), (2, 3) // Third row (y = 2)
(3, 0), (3, 1), (3, 2), (3, 3) // Fourth row (y = 3)
Multidimensional Indexing

• Built-in dimension and index variables each have three components x, y, and z

[Figure: threadIdx.x and threadIdx.y locate a thread within its block (blockDim.x × blockDim.y threads); blockIdx.x and blockIdx.y locate the block within the grid (gridDim.x × gridDim.y blocks).]


Grid and Block Dimensionality in CUDA

•Grids and Blocks can have different dimensions


•Grid dimensionality can be higher or lower than block dimensionality.

•Example:
•gridDim (2, 2, 1)
•blockDim (4, 2, 2)

•Grid Example (2x2 array of blocks):


•Blocks labeled by (blockIdx.y, blockIdx.x)
•Example: Block (1,0) → blockIdx.y = 1, blockIdx.x = 0

•Important Detail:
•CUDA uses a reversed ordering of dimensions when labeling blocks.
The highest dimension comes first for intuitive mapping with data
indexing.



Thread Organization in CUDA

•Thread Dimensions within a Block:


•Thread Coordinates:
•threadIdx.x (x-coordinate)
•threadIdx.y (y-coordinate)
•threadIdx.z (z-coordinate)

•Example Configuration:
•Block Dimension: (4, 2, 2)
•Threads per Block: 16
•Total Threads in Grid: 64

•Example (Block (1,1) Expanded):


•Threads labeled as (threadIdx.z, threadIdx.y, threadIdx.x)
•Example: Thread (1,0,2) → threadIdx.x = 2, threadIdx.y = 0, threadIdx.z = 1

•Note:
•The example uses small numbers for simplicity; real CUDA applications often use
thousands to millions of threads.



Exercise
• Consider the following CUDA kernel and the corresponding host function that calls it
(the code is shown as an image on the original slide):

• a. What is the number of threads per block?


• 512
• b. What is the number of blocks in the grid?
• 95
• c. What is the number of threads in the grid?
• 48,640
• d. What is the number of threads that execute the code on line 05?
• 45,000
Thread Organization for 2D Data Processing
•Data Nature:
•Thread organization (1D, 2D, 3D) is chosen based on the data structure.
•For 2D data like images, a 2D grid with 2D blocks is often convenient.
•Example: Processing a 76 x 62 Pixel Image (Fig. 3.2)
•Image size: 76 pixels (x-direction) x 62 pixels (y-direction)
•Block size: 16 threads (x-direction) x 16 threads (y-direction)
o To accommodate the entire image, the grid is divided into blocks:
 Number of blocks in the x-direction: 76 ÷ 16 = 4.75, rounded up to 5
 Number of blocks in the y-direction: 62 ÷ 16 = 3.875, rounded up to 4
 Therefore, the grid size is:
 dim3 gridSize(5, 4, 1); // 5 blocks in x, 4 blocks in y
•Grid size: 5 blocks (x) × 4 blocks (y) = 20 blocks total
•Thread Assignment to Pixels:
•Each thread processes one pixel
•Vertical (y) coordinate:
•row coordinate = blockIdx.y × blockDim.y + threadIdx.y
•Horizontal (x) coordinate:
•column coordinate = blockIdx.x × blockDim.x + threadIdx.x
•Example Calculation:
•Thread (0,0) of block (1,0) processes the pixel at:
Pin[blockIdx.y × blockDim.y + threadIdx.y][blockIdx.x × blockDim.x + threadIdx.x]
= Pin[1 × 16 + 0][0 × 16 + 0] = Pin[16][0]
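To make the launch concrete, here is a minimal sketch for this 76 × 62 image (assuming a kernel named processImage and a device pointer Pin_d; both names are illustrative):

dim3 blockSize(16, 16, 1);
dim3 gridSize((76 + 16 - 1) / 16, (62 + 16 - 1) / 16, 1); // = (5, 4, 1)
processImage<<<gridSize, blockSize>>>(Pin_d, 76, 62);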
Handling Extra Threads and Processing Large Images

•Extra Threads:
•Using a block size of 16 × 16 threads to process a 76 × 62 image leads to extra
threads.
•The 80 × 64 threads generated (5 × 4 blocks of 16 × 16) exceed the 76 × 62 pixel count.
•Solution: Use conditional checks (if statements) to ensure extra threads do not
operate on non-existent pixels.
•Processing Larger Images:
•For a 1500 × 2000 image:
•Grid dimensions: 94 blocks in the x-direction (1500 ÷ 16 = 93.75, rounded up
to 94) and 125 blocks in the y-direction (2000 ÷ 16 = 125).
•Total blocks: 94 × 125 = 11,750 blocks.
•Kernel Functionality:
•The kernel uses gridDim.x, gridDim.y, blockDim.x, and blockDim.y to
reference grid and block dimensions.
•In this case, gridDim.x =94, gridDim.y = 125, blockDim.x = 16, and
blockDim.y = 16.



Understanding Multidimensional Arrays in CUDA C

•Problem:
•In CUDA C, multidimensional arrays are difficult to access dynamically at
runtime because the number of columns in such arrays is unknown at
compile time.
•Example of static array: int array[10][20]; // Here, 10 and 20 are known at
compile time

•Why the Challenge?:


•Dynamically allocated arrays allow flexible sizes at runtime, but the size
information (number of columns) is not available at compile time.
•Thus, CUDA C cannot directly access the arrays as a 2D matrix unless
programmers flatten the array manually.

•Key Concept:
•Linearization of arrays: All multidimensional arrays are stored in memory as
1D arrays.
•While static arrays are automatically flattened by compilers, for dynamically
allocated arrays, programmers must explicitly flatten the array.



Arrays in C

• C uses row-major order, which means that the elements of a multidimensional
array are stored row by row in contiguous memory.
• To calculate the memory address of an element in a multidimensional array, C
needs to know the size of the second dimension (number of columns).
• In a 2D array, the memory layout looks like:
• array[0][0], array[0][1], array[0][2], ..., array[1][0], array[1][1],
array[1][2], ...
• This is why the number of columns must be known at compile time — it helps C
calculate the memory address for elements using the formula:
• address = baseAddress + (row * numberOfColumns + column);

• int array[rows][columns]; // Number of columns must be known at compile time.


• Without the number of columns, the compiler wouldn’t know how far to jump in
memory to access elements in the next row.
• The compiler needs to know the number of columns at compile time to compute
the memory offset for each element during indexing.



Row-Major Layout for Linearizing 2D Arrays
•Row-Major Layout:
•In row-major layout, all elements of the same row are placed in consecutive
memory locations.
•Rows are stored one after another in memory, making this the default layout for
C programs.

•Example:
•A 4 × 4 matrix M is flattened into a 1D array.
•Each row is stored sequentially:
M[0][0] M[0][1] M[0][2] M[0][3] M[1][0] M[1][1] ... M[3][0] M[3][1] M[3][2] M[3][3]


Dynamic Arrays in CUDA

• You can allocate memory dynamically (using cudaMalloc) and access
elements using flat indexing.
• You calculate the index based on the current thread and block indices,
bypassing the need for compile-time knowledge of the array
dimensions.

__global__ void kernel(int *array, int rows, int cols) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (row < rows && col < cols) {
        int index = row * cols + col;
        array[index] = ...; // Perform some operation on the flattened element
    }
}
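On the host side, a minimal sketch of allocating the flat array and launching this kernel (error checking omitted; rows and cols are assumed to be set):

int *array_d;
cudaMalloc((void**)&array_d, rows * cols * sizeof(int)); // one flat allocation
dim3 block(16, 16);
dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
kernel<<<grid, block>>>(array_d, rows, cols);
cudaFree(array_d);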



Example of Flattening

• For a 2D array of size 5 x 4 (5 rows, 4 columns):


• int array[5][4]; // Conceptually a 2D array
• You flatten this into a 1D array like this:
• int array[5 * 4]; // Flattened 1D array
• To access an element at array[row][col] (e.g., array[3][2]), you calculate the
index:
• index = row * numCols + col = 3 * 4 + 2 = 12 + 2 = 14
• So, array[3][2] is accessed via array[14] in the flattened array.
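A quick way to verify the equivalence in plain C (a self-contained check; the pointer aliasing relies on row-major layout):

#include <stdio.h>

int main(void) {
    int a[5][4];
    int *flat = &a[0][0]; // same memory viewed as a 1D array
    a[3][2] = 42;
    printf("%d\n", flat[3 * 4 + 2]); // prints 42: flat[14] aliases a[3][2]
    return 0;
}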



Layout of Multidimensional Data

• The convention in C is to store data in row-major order


• Elements in the same row are contiguous in memory

Logical view of data


0,0 0,1 0,2 0,3

1,0 1,1 1,2 1,3

2,0 2,1 2,2 2,3

3,0 3,1 3,2 3,3

Actual layout in memory


0,0 0,1 0,2 0,3 1,0 1,1 1,2 1,3 2,0 2,1 2,2 2,3 3,0 3,1 3,2 3,3



Flattening 2D Arrays in CUDA C

•Memory Layout:
•In memory, a 2D array is stored in a contiguous block, with each row placed
sequentially.
•The linearization formula calculates the 1D offset based on the number of
columns.

•Example:
•For an array Pind with num_columns = 76, the element at row j = 2 and column i
= 3 is accessed as:
•Pind[2 × 76 + 3] = Pind[155]

•Why Flattening?:
•Flattening ensures that dynamically allocated arrays are accessed correctly in
CUDA C without compile-time dimensional information.



Column-Major Layout and Differences from Row-Major Layout

•Column-Major Layout:
•In column-major layout, all elements of a column are placed in consecutive
memory locations.
•This method is used by FORTRAN compilers.

•Difference from Row-Major:


•Row-major layout stores rows consecutively, while column-major layout stores
columns consecutively.

•Key Insight: Column-major layout is equivalent to row-major layout of the
transposed matrix.

•CUDA C and Row-Major Layout:


•CUDA C uses the row-major layout, matching C’s default behavior.
•When working with C libraries designed for FORTRAN programs (which use
column-major layout), developers may need to transpose arrays when passing
them between C and FORTRAN.
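A minimal sketch of that transpose step (a hypothetical helper; the name and signature are illustrative):

// Re-lay a row-major matrix as column-major (i.e., row-major of its transpose).
void rowToColMajor(const float *src, float *dst, int rows, int cols) {
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}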



colorToGrayscaleConversion Kernel for Image Processing

•Converting RGB to Grayscale:


•The kernel converts each color pixel to grayscale using the equation:
L = 0.21 × r + 0.72 × g + 0.07 × b

•Thread Mapping:
•The kernel organizes threads in a 2D grid and 2D blocks to process the image.
•Each thread computes the horizontal index (col) and vertical index (row) using:
•col = blockIdx.x × blockDim.x + threadIdx.x
•row = blockIdx.y × blockDim.y + threadIdx.y
•Each thread is assigned to process one pixel, given by its row and
column indices.

•1D Index Calculation:


•The grayscale output (Pout) is stored as a 1D array, and the 1D index for pixel
(row,col) is:
•grayOffset = row × width + col



Example: RGB to Grayscale

Parallelization approach: assign one thread to convert each pixel in the image



grayOffset vs rgbOffset
1. Grayscale Image (Single Channel)
In a grayscale image, each pixel is represented by a single value (intensity). So, if
you have a grayOffset = 6, it directly points to the grayscale value at index 6 in the
1D array.

2. RGB Image (Three Channels)


In an RGB image, each pixel is represented by three values: one for red, one for
green, and one for blue. These three values are stored consecutively in the array.
For example:
•The first pixel (at row = 0, col = 0) has its red component at index 0, green at index
1, and blue at index 2.
•The second pixel (at row = 0, col = 1) has its red component at index 3, green at
index 4, and blue at index 5.
•This means that the second pixel whose 1D offset index is 1 needs to be
multiplied by 3 to access its first color value (red in this case)
•If you have a color image with a width of 4 pixels and you're looking for the pixel
at row 1, column 2: grayOffset = 1 * 4 + 2 = 6
•If channels = 3, then rgbOffset = 6 * 3 = 18
•This means the RGB image's corresponding pixel starts at index 18 in the 1D
RGB array. Green at 18 + 1 = 19 Blue at 18 + 2 = 20.



Code
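The kernel code appears as an image on the original slide; below is a sketch consistent with the textbook's colorToGrayscaleConversion kernel and the offset formulas above (CHANNELS = 3 for RGB):

#define CHANNELS 3
__global__ void colorToGrayscaleConversion(unsigned char *Pout,
                                           unsigned char *Pin,
                                           int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int grayOffset = row * width + col;    // 1D offset in the grayscale output
        int rgbOffset = grayOffset * CHANNELS; // 3 consecutive bytes per RGB pixel
        unsigned char r = Pin[rgbOffset];
        unsigned char g = Pin[rgbOffset + 1];
        unsigned char b = Pin[rgbOffset + 2];
        Pout[grayOffset] = 0.21f * r + 0.72f * g + 0.07f * b; // weighted sum
    }
}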



Remarks

• An unsigned char is a data type in C and C++ that can store integer values ranging
from 0 to 255. Here’s a quick summary:
•Unsigned: Indicates that the type cannot represent negative values.
•Char: Typically refers to a character type, but in this case, it's used to denote an 8-bit
data type.
•Range: Because it uses 8 bits, the values can go from 0 (00000000 in binary)
to 255 (11111111 in binary).

•In C and C++, when you see a number like 0.21f, the f suffix indicates that it is a
float data type.
•Float (float): Typically represents a single-precision floating-point number. It usually
uses 32 bits (4 bytes) and has about 7 decimal digits of precision.
•Double (double): Represents a double-precision floating-point number. It usually uses
64 bits (8 bytes) and offers about 15 decimal digits of precision.



Detailed Execution of colorToGrayscaleConversion Kernel
•1D Index for RGB Input:
•The color image (Pin) is stored as RGB values, requiring 3 bytes per pixel. The 1D index
for pixel (row,col) is: rgbOffset = (row × width + col) × 3
•The RGB values (r, g, b) are read from 3 consecutive bytes starting at rgbOffset.

•Grayscale Calculation:
•The grayscale value is calculated using the weighted sum:
L = 0.21 × r + 0.72 × g + 0.07 × b
•This value is stored in the Pout array at the grayOffset.

•Handling Extra Threads:


•Since there are more threads than pixels (due to block padding), if conditions
are used to ensure that only threads with valid row and col indices process pixels:
• if (col < width) and (row < height)

•Example Calculation:
•For thread (0,0) of block (1,0): row = 1 × 16 + 0 = 16 and col = 0 × 16 + 0 = 0.
•The 1D index for Pout is: grayOffset = 16 × 76 + 0 = 1216, i.e., Pout[1216].
•The 1D index for Pin is: rgbOffset = (row × width + col) × channels = 1216 × 3 = 3648.
Handling Extra Threads in Image Processing (Areas 1 & 2)

•Area 1: Fully Utilized Threads:


•Covers 12 blocks in the dark-shaded area.
•All threads in these blocks (256 per block) are within the valid range.
•Each thread processes a pixel since both col and row values fall within bounds.

•Area 2: Partially Utilized Threads (Horizontal Overlap):


•3 blocks cover the medium-shaded upper-right area.
•In these blocks:
•All row values are within range, but some col values exceed the width
(76).
•Only 12 threads per row process valid pixels; the last 4 threads in each
row fall outside the image (col ≥ 76).
•192 threads process pixels per block (16 rows × 12 valid columns).



Handling Extra Threads in Image Processing (Areas 3 & 4)

•Area 3: Partially Utilized Threads (Vertical Overlap):


•4 blocks cover the medium-shaded lower-left area.
•In these blocks:
•All col values are valid, but some row values exceed the height (62).
•Only 14 threads per column process valid pixels (2 threads in each column
are out of bounds).
•224 threads process pixels per block (16 columns × 14 rows).

•Area 4: Partially Utilized Threads (Horizontal and Vertical Overlap):


•Covers the lower-right lightly shaded area.
•Similar to Area 2 and 3:
•4 threads per row exceed the valid range in the horizontal direction.
•The bottom 2 rows exceed the valid range in the vertical direction.
•168 threads process pixels per block (12 valid columns × 14 valid rows).



Multiple Choice Question

• https://strawpoll.live
• PIN: 524393

• Consider a 2D matrix with a width of 400 and a height of 500. The matrix is
stored as a one-dimensional array. Specify the array index of the matrix
element at row 20 and column 10 if the matrix is stored in row-major order.
• A. 5,020
• B. 4,020
• C. 10,010
• D. 8,010



Multiple Choice Question

• https://strawpoll.live
• PIN: 332893

• Consider a 2D matrix with a width of 400 and a height of 500. The matrix is
stored as a one-dimensional array. Specify the array index of the matrix
element at row 20 and column 10 if the matrix is stored in column-major
order.
• A. 5,020
• B. 4,020
• C. 10,010
• D. 8,010



1D index for 3D tensor

• To calculate the 1D array index for an element with 3 coordinates, the


formula for row-major order is:
• index=z×(height×width) + y×width + x

• Consider a 3D tensor with a width of 400, a height of 500, and a depth of 300.
The tensor is stored as a one-dimensional array in row-major order. Specify
the array index of the tensor element at x=10, y=20, and z=5.
• Answer: index = 5 × (500 × 400) + 20 × 400 + 10 = 1,008,010
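The same computation as a small helper function (an illustrative sketch):

// Row-major 1D index for a 3D tensor of the given width and height.
int tensorIndex(int x, int y, int z, int width, int height) {
    return z * (height * width) + y * width + x;
}
// tensorIndex(10, 20, 5, 400, 500) == 1008010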



Introduction to Complex Thread Cooperation in CUDA C

•Simple Operations in Previous Examples:


•Previous kernels like vecAddKernel and colorToGrayscaleConversion
demonstrated simple, independent operations by threads.
•Each thread performed small arithmetic operations on a single array element.

•Moving Towards Complex Operations:


•In real CUDA programs, threads often perform more complex operations
and must cooperate with each other.
•Next steps: Exploring kernels where threads share data and collaborate to
achieve tasks.

•Transition to Image Processing:


•To introduce complex operations, we will start with an image-blurring
function.



Introduction to Image Blurring in CUDA C

•What is Image Blurring?:


•Image blurring smooths abrupt variations in pixel values while preserving key
edges.
•The result is a “blurry” image where fine details are obscured, but major
objects are emphasized.

•Applications of Image Blurring:


•Noise reduction: Blurring corrects pixel values by averaging nearby pixels.
•Computer Vision: Helps algorithms focus on thematic objects by reducing
noise from fine details.
•Display Use: Blurring can be used to highlight certain parts of an image by
blurring the rest.

•Blurring in CUDA:
•Each thread in the image-blurring kernel will process a patch of pixels.
•The output pixel is computed as the average of an N×N patch of pixels.



Example: Blur

Output pixel is the average of the corresponding input pixel and the pixels around it



Example: Blur

Input Image Output Blurred Image

Parallelization approach: assign one thread to each output pixel, and have it read
multiple input pixels
Image Blurring with a 3×3 Patch

• Understanding the 3×3 Patch:


• The output pixel at position (row,col) is computed using a 3×3 patch centered at
that pixel.
• The patch spans three rows and three columns:
• Rows: row − 1, row, row + 1
• Columns: col − 1, col, col + 1
• Example: Pixel (25, 50):
• The patch for output pixel (25,50) includes the following 9 input pixels:
• (24,49),(24,50),(24,51)
• (25,49),(25,50),(25,51)
• (26,49),(26,50),(26,51)
• Thread Mapping:
• Each thread computes one output pixel, using its col and row values to locate the
center pixel of its patch.



Image Blurring Kernel and Patch Calculation

•Thread-to-Pixel Mapping:
•Each thread computes an output pixel using its position as the center of the
patch.
•Calculating Col and Row:
•Each thread calculates the col and row indices for its output pixel.
•The if-statement ensures that only threads with valid indices (within image
boundaries) participate in the kernel.
•Patch Processing (BLUR_SIZE):
•BLUR_SIZE determines the radius of the patch:
•For a 3×3 patch: BLUR_SIZE = 1
•For a 7×7 patch: BLUR_SIZE = 3
•The outer loop iterates over rows, and the inner loop iterates over columns
within the patch.

•Example Calculation:
•For output pixel (25,50):
•First iteration: Process row 24, pixels (24,49),(24,50),(24,51).
•Second iteration: Process row 25, pixels (25,49),(25,50),(25,51).
•Third iteration: Process row 26, pixels (26,49),(26,50),(26,51).



Handling Edge Cases in Image Blurring Kernel

•Pixel Accumulation in the Patch:


•In the image blurring kernel, the pixel value is accumulated into a variable
pixVal (Line 16).
•A running sum is calculated by adding the pixel values from the 3×3 patch.
•The variable pixels (Line 17) tracks how many valid pixels have been accumulated.

•Conditional Statement for Edge Pixels:


•At image edges, the patch may extend beyond the image boundary.
•If-statement (Line 15) ensures only valid pixels are processed by checking if:
•0 ≤ curRow < height and 0 ≤ curCol < width
•Skips pixels outside the valid image range.

•Example: Upper-Left Corner (Case 1 in Fig. 3.9):


•For pixel (0,0):
•Five out of nine pixels in the intended 3×3 patch are outside the image.
•Valid pixels processed: (0,0), (0,1), (1,0), (1,1).
•Only these four pixels are accumulated into the sum, and pixels is
incremented four times.



Boundary Conditions

Input Image Output Blurred Image

Even if the output pixel is in bounds, the kernel might access out-of-bounds input pixels



Calculating the Average and Handling Edge Variations

•Calculating the Average:


•After accumulating all the valid pixels in the 3×3 patch, the kernel calculates the
average (Line 22).
•The average is computed by dividing the pixVal (sum of pixel values) by the pixels
variable (number of valid pixels).
•Different Edge Scenarios:
•Corners: Threads responsible for corner pixels accumulate four pixels.
•Edges (not corners): Threads accumulate six pixels.
•Non-Edge Pixels: Threads accumulate nine pixels as all the patch pixels are valid.

•Why Track the Pixel Count?:


•Since patches near the edges include fewer pixels, the kernel needs to keep track
of the actual number of pixels processed to compute the correct average.
•Example of Final Calculation:
•For pixel (0,0), the kernel accumulates four valid pixels.
•The final pixel value is:
Pout[0] = (Pin(0,0) + Pin(0,1) + Pin(1,0) + Pin(1,1)) / 4
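The blur kernel itself appears as an image on the slides; the sketch below matches the steps described above (BLUR_SIZE is the patch radius; the accumulation, valid-pixel count, boundary check, and final average correspond to the numbered lines cited earlier):

#define BLUR_SIZE 1 // radius 1 => 3x3 patch
__global__ void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0;
        // Iterate over the (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) patch around (row, col).
        for (int blurRow = -BLUR_SIZE; blurRow <= BLUR_SIZE; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol <= BLUR_SIZE; ++blurCol) {
                int curRow = row + blurRow;
                int curCol = col + blurCol;
                // Only accumulate patch positions that lie inside the image.
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    ++pixels; // count of valid pixels actually accumulated
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels); // average
    }
}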


Introduction to Matrix Multiplication in CUDA

•Matrix Multiplication:
•Matrix multiplication is fundamental in linear algebra and applications like deep
learning.
•Multiply an i×j matrix M by a j×k matrix N to produce an i×k matrix P.
•Element Calculation:
•Each element P(row,col) is computed as the dot product (inner product) of:
•The row from matrix M (horizontal strip).
•The column from matrix N (vertical strip).
•Dot Product Formula:
•P(row,col) = Σₖ M(row,k) × N(k,col), summing over k = 0 … Width − 1


Matrix Multiplication in CUDA: Mapping Threads to Data

•Thread-to-Data Mapping:
•Each thread in the CUDA grid is responsible for computing one element of the
output matrix P.

•Familiar CUDA Pattern:


•Index Calculation:
•Each thread calculates its row and col indices:
•row = blockIdx.y × blockDim.y + threadIdx.y
•col = blockIdx.x × blockDim.x + threadIdx.x
•These indices are used to locate the element P(row,col).

•Handling Square Matrices:


•Simplifying assumption: The kernel handles only square matrices, where the
width and height are equal.



Accessing Elements in Matrix Multiplication

Inner Product Calculation


•Definition: P(row,col) is the inner product of the row-th row of matrix M and the
col-th column of matrix N.

Loop Structure
1.Initialization: Local variable Pvalue is set to 0 before entering the loop (Line 06).
2.Element Access:
•Matrix M:
•Linearized using row-major order.
•Beginning of the row-th row: M[row × Width].
•k-th element of the row-th row: M[row × Width + k] (Line 08).
•Matrix N:
•Beginning of the col-th column: N[col].
•k-th element of the col-th column: N[k × Width + col] (Line 08).
•Accessing the next element in the col-th column requires skipping over an
entire row, because the next element of the same column lies in the next row.
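Putting these pieces together, a sketch of the kernel consistent with the line numbers cited above (square Width × Width matrices, as assumed earlier):

__global__ void matrixMulKernel(float *M, float *N, float *P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float Pvalue = 0;                                      // Line 06
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[row * Width + k] * N[k * Width + col]; // Line 08
        }
        P[row * Width + col] = Pvalue;                         // Line 10
    }
}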


Thread Execution in Matrix Multiplication

• Thread Responsibilities
• Each thread computes one element of the resultant matrix P using its unique
row and column indices.
• After the loop, threads store results using:
• Index expression: row × Width + col (Line 10).
• Example Execution
• Small Example: 4x4 matrix P with BLOCK_WIDTH=2.
• Block Structure:
• Divided into 4 tiles, each handled by a 2x2 thread block.
• Each thread computes a specific P element:
• Thread (0,0) in block (0,0) computes P(0,0).
• Thread (0,1) in block (0,0) computes P(0,1).
• Thread (1,0) in block (0,0) computes P(1,0).
• Thread (1,1) in block (0,0) computes P(1,1).
• Thread (0,0) in block (1,0) computes P(2,0). Block (1,0) computes the bottom-left
tile (elements P(2,0), P(2,1), P(3,0), P(3,1)).
• Visualization: Each thread maps its indices to corresponding rows of M and
columns of N for dot product calculations.
Accessing Multidimensional Data

index = row*width + col

[Figure: a height × width matrix in its logical 2D view, elements (0,0) through (3,3); the element at (row, col) maps to the 1D index row × width + col.]


Execution of the For-Loop - Iteration Details

• Thread (0,0) in Block (0,0)


• Iteration 0 (k = 0): accesses M[0] → M(0,0) and N[0] → N(0,0)
• Iteration 1 (k = 1): accesses M[1] → M(0,1) and N[4] → N(1,0)
• Iteration 2 (k = 2): accesses M[2] → M(0,2) and N[8] → N(2,0)
• Iteration 3 (k = 3): accesses M[3] → M(0,3) and N[12] → N(3,0)
