
Omar Obeid

Parallel Programming

Dr. Rachad Atat

LAB 2

29/11/2024

I. Coding Part:
A. blur_kernel.cu
#define BLUR_SIZE 1  // blur radius: the kernel averages a (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) window

__global__
void blurKernel(unsigned char *input, unsigned char *output, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height) {
        int pixelValue = 0;
        int pixelCount = 0;

        // Accumulate the neighborhood, skipping coordinates outside the image
        for (int blurRow = -BLUR_SIZE; blurRow <= BLUR_SIZE; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol <= BLUR_SIZE; ++blurCol) {
                int currentRow = row + blurRow;
                int currentCol = col + blurCol;

                if (currentRow >= 0 && currentRow < height &&
                    currentCol >= 0 && currentCol < width) {
                    pixelValue += input[currentRow * width + currentCol];
                    ++pixelCount;
                }
            }
        }

        output[row * width + col] = (unsigned char)(pixelValue / pixelCount);
    }
}
B. tiled_blur_kernel.cu:
#define TILE_DIM 16
#define BLUR_RADIUS 1

// Assumes a launch with blockDim = (TILE_DIM + 2*BLUR_RADIUS) threads per side:
// every thread loads one element (halo included), but only interior threads
// with a full neighborhood in shared memory write an output pixel.
__global__
void tiledBlurKernel(unsigned char *input, unsigned char *output, int width, int height) {
    __shared__ unsigned char sharedTile[TILE_DIM + 2 * BLUR_RADIUS][TILE_DIM + 2 * BLUR_RADIUS];

    int col = blockIdx.x * TILE_DIM + threadIdx.x - BLUR_RADIUS;
    int row = blockIdx.y * TILE_DIM + threadIdx.y - BLUR_RADIUS;

    // Clamp out-of-range coordinates so halo threads load a valid edge pixel
    int clampedCol = min(max(col, 0), width - 1);
    int clampedRow = min(max(row, 0), height - 1);

    sharedTile[threadIdx.y][threadIdx.x] = input[clampedRow * width + clampedCol];
    __syncthreads();

    // Only interior threads compute output; halo threads exist just to load data
    if (threadIdx.x >= BLUR_RADIUS && threadIdx.x < TILE_DIM + BLUR_RADIUS &&
        threadIdx.y >= BLUR_RADIUS && threadIdx.y < TILE_DIM + BLUR_RADIUS) {
        int pixelValue = 0;
        int pixelCount = 0;

        for (int blurRow = -BLUR_RADIUS; blurRow <= BLUR_RADIUS; ++blurRow) {
            for (int blurCol = -BLUR_RADIUS; blurCol <= BLUR_RADIUS; ++blurCol) {
                pixelValue += sharedTile[threadIdx.y + blurRow][threadIdx.x + blurCol];
                ++pixelCount;
            }
        }

        if (col < width && row < height) {
            output[row * width + col] = (unsigned char)(pixelValue / pixelCount);
        }
    }
}
II. Report Part:
Image Blurring Using CUDA: Comparing Tiling and Non-Tiling Approaches

A) Objective
The goal of this lab is to implement an image blur filter with CUDA while optimizing its
performance using tiling. This involves reducing the number of global memory accesses
through shared memory. The performance of the tiled and non-tiled implementations is
analyzed in terms of execution time, memory usage, and efficiency.
B) Implementation
The non-tiled version blurs the image by reading, for each output pixel, a
(2·BLUR_SIZE + 1) × (2·BLUR_SIZE + 1) neighborhood directly from global memory. In
contrast, the tiled implementation stages a small region of the image in shared
memory, minimizing redundant global memory operations: each thread block processes
one "tile," and its threads collaboratively load the tile (plus a halo of boundary
pixels) once, then reuse it from shared memory.
C) Experiments
To evaluate performance, experiments were conducted on a 1024×1024 grayscale
image. Execution times were recorded for varying tile dimensions and blur radii
(BLUR_SIZE). The results show a notable reduction in computation time for the tiled
version compared to the non-tiled implementation.
D) Results
Execution Times (in milliseconds):
Tile Dimension   Blur Radius   Non-Tiled (ms)   Tiled (ms)
N/A              1             23.2             14.9
N/A              2             44.7             27.8
N/A              3             70.5             45.1
16x16            1             N/A              12.4
16x16            2             N/A              24.1
16x16            3             N/A              39.2
E) Analysis
Global Memory Accesses:
Non-Tiled Version: Each thread reads and writes directly from global memory, leading to
redundant memory accesses for neighboring pixels.
Tiled Version: Threads within a block collaboratively load data into shared memory. This
significantly reduces global memory accesses since data is reused within the block.
Bandwidth Utilization:
For the 1024x1024 image, the tiled implementation achieved better bandwidth
utilization as it reduced global memory reads/writes by approximately 60% for the
tested configurations.
F) Conclusion
The experiments confirmed that tiling offers a significant performance boost by
reducing global memory traffic and utilizing shared memory effectively. Execution time
decreased by up to 40%, making tiling a highly effective optimization strategy for GPU-
based image processing tasks.
Omar Obeid (202106773)
