Programming on GPU Parallel Processors with CUDA Core Support

The document describes a project to compare CPU and GPU performance for calculating the sum of prime numbers. It includes: 1) An overview of the research subjects including parallel processing, C++ programming, and CUDA programming. 2) Work assignments for group members on different aspects of the project. 3) Code examples for finding prime numbers and calculating their sum on the CPU and GPU, including flowcharts showing the parallel processing approach on the GPU.

Uploaded by

Huy Huy

HO CHI MINH UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING

COMPUTER ORGANIZATION AND ARCHITECTURE
***
FINAL ESSAY

PROJECT: THE GPU PROGRAMMING PARALLEL WITH CUDA CORE SUPPORT

NGÔ TIẾN TÚ 20119175
LÊ TRỌNG HOÀNG 20119132
DƯƠNG HOÀNG GIA 20119129
NGUYỄN THỊ LÂM TRÚC 20119172
HỒ NGUYỄN MINH THƯ 20119166

Ho Chi Minh City, …., ………2022

Content
1. Introduction
2. Overview
2.1. Research subjects
2.2. Support Tools
2.3. Work Assignment
3. Sum of prime number
3.1 Flowchart
3.2. Coding
3.3 Result
4. Image processing problem
4.1 Flowchart
4.2 Coding
4.3 Result
5. CONCLUSIONS AND FUTURE WORK
6. References

1. Introduction
The invention of the computer paved the way for the digital age, and its
applications in computation and image data analysis remain in high demand
today.
In the Computer Organization and Architecture course, we learned about the
architecture of a computer, including components such as the Central
Processing Unit (CPU), the Arithmetic Logic Unit (ALU), memory, I/O, and the
bus system. So, how can parallel processors be used to compute on data, and
how do they compare with the CPU?
To answer this question, the team conducted research and found that the same
calculation can run at different speeds depending on whether it is executed
on the CPU or in parallel on the GPU. To see the difference, the group
studied programming with C++ and CUDA and wrote examples of numerical
computation and image processing.
Since its initial release in 2007, the Compute Unified Device Architecture
(CUDA) has evolved into the de facto standard for using graphics processing
units (GPUs) for non-graphical applications. NVIDIA's CUDA is a widely used
parallel computing platform and programming model. It can only be used with
NVIDIA GPUs. OpenCL is a separate, open standard that is used to write
parallel code for GPUs from other vendors, such as AMD and Intel.
With simple programming APIs, CUDA allows the creation of massively parallel
applications that run on GPUs. CUDA C/C++ allows C and C++ software
developers to accelerate their applications and take advantage of GPU power.
CUDA programs are similar to plain C or C++ programs, with the addition of
keywords that expose GPU parallelism.
As a result, the team decided to research Programming on GPU Parallel
Processors with CUDA Core Support in order to gain better insight into the
problem.
2. Overview

2.1. Research subjects:


- Parallel processing is a computing technique that uses two or more
processors or processor cores to handle different parts of a larger task.
Splitting the parts of a task across multiple processors can reduce the time
it takes to run a program.
- C++ is a programming language with object-oriented features, developed by
the computer scientist Bjarne Stroustrup as part of the evolution of the C
language family. It was created as a cross-platform extension of C that
gives developers a high degree of control over memory and system resources.
- CUDA is a parallel computing platform and application programming
interface (API) that enables software to use certain types of graphics
processing units (GPUs) for general-purpose processing, a technique known as
general-purpose computing on GPUs (GPGPU).
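
To make the idea of splitting a task across processors concrete, the following is a minimal C++ sketch and not part of the project code. It sums an array by giving each CPU thread one contiguous slice and then combining the partial sums. The function name `parallel_sum` and the `workers` parameter are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` by giving each worker thread one contiguous slice.
// `workers` is a hypothetical parameter chosen for this illustration.
long long parallel_sum(const std::vector<int>& data, unsigned workers) {
    assert(workers > 0);
    std::vector<long long> partial(workers, 0);  // one result slot per worker
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers;  // ceiling division

    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            // Clamp the slice so that trailing workers may receive an empty range.
            const std::size_t first = std::min(data.size(), w * chunk);
            const std::size_t last = std::min(data.size(), first + chunk);
            partial[w] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(first),
                data.begin() + static_cast<std::ptrdiff_t>(last), 0LL);
        });
    }
    for (std::thread& t : pool) t.join();  // wait for every slice to finish

    // Combine the per-worker results on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Each worker writes only to its own slot of `partial`, so no locking is needed. A GPU applies the same decomposition with far more, and far lighter, threads.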

2.2. Support Tools


- Microsoft Visual Studio is an integrated development environment (IDE)
developed by Microsoft for various types of software development,
including computer programs, websites, web apps, web services, and
mobile apps. It includes code completion tools, compilers, and other
features that make the software development process easier.

2.3. Work Assignment

Members: Ngô Tiến Tú, Lê Trọng Hoàng, Dương Hoàng Gia, Nguyễn Thị Lâm Trúc,
Hồ Nguyễn Minh Thư

Work                                      Members assigned
Overview of CUDA programming              2
The functioning of parallel processors    3
Program                                   3
PowerPoint                                2
Report                                    5 (all members)

3. Sum of prime number
3.1 Flowchart

Figure 1: Flowchart of the sum of prime numbers problem

- Following the flowchart, we wrote two programs, one for the CPU and one
for the GPU.

3.2. Coding

#include <cstdio>
#include <cstdlib>
#include <ctime>

// Find prime numbers: s[i] = i if i is prime, otherwise 0 (for 2 <= i <= n)
void FPN(int *s, int n)
{
    int i, j;
    bool t;
    for (i = 2; i <= n; i++)
    {
        t = true;
        // Trial division by every j in [2, i)
        for (j = 2; j < i; j++)
        {
            if (i % j == 0)
            {
                t = false;
                break;
            }
        }
        s[i] = t ? i : 0;
    }
}

// Calculate the sum of prime numbers: tt[0] = s[2] + ... + s[n]
void ttcpu(int *s, int *tt, int n)
{
    int i;
    tt[0] = 0;
    for (i = 2; i <= n; i++)
    {
        tt[0] = tt[0] + s[i];
    }
}

int main(void)
{
    // Introduce program
    printf("Program : Find the sum of prime numbers from 1 to n using CPU \n\n");
    // Declare variables
    int n, *s, *tt;
    clock_t start, end;
    double time_use;
    // Enter the value n (1000 < n < 9999)
    // scanf_s is specific to the Microsoft compiler; use scanf elsewhere
    printf("Enter n : \nn=");
    if (scanf_s("%d", &n) != 1 || n < 2)
    {
        return 1;
    }
    // Memory allocation
    s = (int*)malloc((n + 1) * sizeof(int));
    tt = (int*)malloc(sizeof(int));
    if (s == NULL || tt == NULL)
    {
        return 1;
    }
    // Start recording time
    start = clock();
    // Find primes and calculate the sum of primes from 1 to n
    FPN(s, n);
    ttcpu(s, tt, n);
    // Finish recording time
    end = clock();
    // Find the time the CPU used
    time_use = (double)(end - start) / CLOCKS_PER_SEC;
    // Print results
    printf("Sum of primes from 1 to %d : %d\n", n, tt[0]);
    printf("Time used : %lfs\n", time_use);
    // Release memory
    free(s);
    free(tt);
    return 0;
}

Figure 2: Code for sum of prime number on CPU
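
When checking the output of either program, an independent reference value is useful. The sketch below is not the report's method: it swaps the trial division for the Sieve of Eratosthenes, and the function name `sum_primes_sieve` is an assumption made for this illustration. It should return the same sum as `ttcpu` for the same n.

```cpp
#include <cassert>
#include <vector>

// Sum of all primes <= n, computed with the Sieve of Eratosthenes.
// Intended only as a cross-check for the trial-division programs.
long long sum_primes_sieve(int n) {
    assert(n >= 0);
    if (n < 2) return 0;
    std::vector<bool> composite(static_cast<std::size_t>(n) + 1, false);
    long long sum = 0;
    for (int i = 2; i <= n; ++i) {
        if (composite[i]) continue;  // i has a smaller prime factor
        sum += i;                    // i is prime
        // Mark multiples of i, starting at i*i (smaller ones are already marked).
        for (long long j = 1LL * i * i; j <= n; j += i) {
            composite[static_cast<std::size_t>(j)] = true;
        }
    }
    return sum;
}
```

For example, `sum_primes_sieve(10)` adds 2 + 3 + 5 + 7 and returns 17.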

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Find prime numbers: each GPU thread tests one candidate i
__global__ void FPN(int *s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x, j;
    if ((i >= 2) && (i <= n))
    {
        bool t = true; // stays true for i == 2, where the loop does not run
        for (j = 2; j < i; j++)
        {
            if (i % j == 0)
            {
                t = false;
                break;
            }
        }
        s[i] = t ? i : 0;
    }
}

// Calculate the sum of prime numbers on a single GPU thread
__global__ void ttgpu(int *s, int *tt, int n)
{
    int i;
    tt[0] = 0;
    for (i = 2; i <= n; i++)
    {
        tt[0] = tt[0] + s[i];
    }
}

int main(void)
{
    // Introduce program
    printf("Program : Find the sum of prime numbers from 1 to n using GPU \n\n");
    // Declare variables
    int n, *s, *tt;
    clock_t start, end;
    double time_use;
    // Enter the value n (1000 < n < 9999)
    // scanf_s is specific to the Microsoft compiler; use scanf elsewhere
    printf("Enter n : \nn=");
    if (scanf_s("%d", &n) != 1 || n < 2)
    {
        return 1;
    }
    // Memory allocation (unified memory, accessible from CPU and GPU)
    cudaMallocManaged((void**)&s, (n + 1) * sizeof(int));
    cudaMallocManaged((void**)&tt, sizeof(int));
    // Start recording time
    start = clock();
    // Find primes and calculate the sum of primes from 1 to n
    // n + 1 blocks of one thread each: one block per candidate number
    FPN<<<n + 1, 1>>>(s, n);
    ttgpu<<<1, 1>>>(s, tt, n);
    // Wait for both kernels to finish before reading tt[0] on the CPU
    cudaDeviceSynchronize();
    // Finish recording time
    end = clock();
    // Find the time the GPU used
    time_use = (double)(end - start) / CLOCKS_PER_SEC;
    // Print results
    printf("Sum of primes from 1 to %d : %d\n", n, tt[0]);
    printf("Time used : %lfs\n", time_use);
    // Release memory
    cudaFree(s);
    cudaFree(tt);
    return 0;
}

Figure 3: Code for sum of prime number on GPU
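
The kernel `FPN` derives its element index from `blockIdx.x * blockDim.x + threadIdx.x`, and the program launches n + 1 blocks of one thread each. NVIDIA GPUs schedule threads in warps of 32, so a common alternative is a larger block size with a rounded-up block count. The host-side C++ sketch below only emulates that index arithmetic so it can be checked without a GPU. The names `blocks_for` and `emulate_launch`, and the block size of 256, are assumptions made for this illustration.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= work_items.
int blocks_for(int work_items, int threads_per_block) {
    assert(work_items >= 0 && threads_per_block > 0);
    return (work_items + threads_per_block - 1) / threads_per_block;  // ceiling division
}

// Host-side emulation of a 1-D launch: visits every (block, thread) pair and
// counts how often each element index in [0, work_items) is produced.
std::vector<int> emulate_launch(int work_items, int threads_per_block) {
    std::vector<int> hits(static_cast<std::size_t>(work_items), 0);
    const int blocks = blocks_for(work_items, threads_per_block);
    for (int block_idx = 0; block_idx < blocks; ++block_idx) {
        for (int thread_idx = 0; thread_idx < threads_per_block; ++thread_idx) {
            // Same formula as blockIdx.x * blockDim.x + threadIdx.x in the kernel.
            const int i = block_idx * threads_per_block + thread_idx;
            // Bounds guard, playing the role of the `i <= n` test in FPN.
            if (i < work_items) ++hits[static_cast<std::size_t>(i)];
        }
    }
    return hits;
}
```

With n = 9173 there are 9174 candidate indices, so `blocks_for(9174, 256)` gives 36 blocks instead of 9174. The guard matters because the last block has threads whose index falls past the end of the array.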

3.3 Result

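Figure 4: The sum of prime numbers program using the CPU, from 1 to 9173

Figure 5: The sum of prime numbers program using the GPU, from 1 to 9173
The two figures above show the execution time of the CPU and the GPU when
each runs the program (sum of prime numbers from 1 to n, where n is entered
by the user). The results show that the execution time on the CPU is lower
than on the GPU. Therefore, for this problem and this implementation, the
CPU is faster than the GPU. A likely explanation is that for such a small
input (n below 10,000) the fixed costs of the GPU run, such as kernel
launches and unified-memory transfers, outweigh the computation itself. In
addition, FPN is launched with one thread per block and ttgpu computes the
sum on a single GPU thread.

Figure 6: GPU usage while the program is running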

4. Image processing problem

4.1 Flowchart
The flowchart is designed as follows. The first step is to read the picture
from storage. After that, the program reads the properties of the picture
and then processes each pixel from beginning to end. In this case, the
program increases the red component. The image used in this problem is in
PPM format, which uses 8 bits to represent the red component, so its value
ranges from 0 to 255. If the red value of the pixel being processed plus 50
is greater than 255, the new value is 255; otherwise, the new value is the
old value plus 50.
Following the flowchart, we wrote two programs intended to do the same work
and produce the same result. The first runs on the CPU and the other on the
GPU.
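
The per-pixel rule described above can be written as a small, self-contained C++ function. This is a minimal sketch that assumes the pixels are already in memory as interleaved R, G, B integers. The function name `brighten_red` and its parameters are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Adds `delta` to the red component of every pixel in an interleaved
// R,G,B buffer and saturates the result at `max_value`.
void brighten_red(std::vector<int>& rgb, int delta, int max_value) {
    assert(rgb.size() % 3 == 0);  // three components per pixel
    for (std::size_t p = 0; p < rgb.size(); p += 3) {
        // rgb[p] is red; green (p + 1) and blue (p + 2) are left unchanged.
        rgb[p] = std::min(rgb[p] + delta, max_value);
    }
}
```

Calling `brighten_red(pixels, 50, 255)` applies the rule from the flowchart. Each pixel is updated independently of the others, which is what makes the task suitable for one GPU thread per pixel.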

Figure 7: Flowchart of the image processing program

4.2 Coding

#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    clock_t start, end;
    double time_use; // Time usage in seconds

    ifstream image;
    ofstream newimage;
    image.open("apollo.ppm");
    newimage.open("newimage.ppm");
    if (!image || !newimage)
    {
        cerr << "Cannot open the input or output file" << endl;
        return 1;
    }
    start = clock(); // The initial time

    // Copy over the header information of the plain-text PPM file:
    // type, width, height, and maximum color value
    string type = "", width = "", height = "", RGB = "";
    image >> type;
    image >> width;
    image >> height;
    image >> RGB;

    newimage << type << endl;
    newimage << width << " " << height << endl;
    newimage << RGB << endl;

    int r = 0, g = 0, b = 0;
    // Reading inside the loop condition stops cleanly at the end of the file,
    // so the last pixel is not written twice
    while (image >> r >> g >> b)
    {
        // Increase red by 50 and saturate at 255
        if (r + 50 >= 255)
            r = 255;
        else
            r += 50;

        newimage << r << " " << g << " " << b << endl;
    }
    end = clock(); // The final time
    time_use = (double)(end - start) / CLOCKS_PER_SEC; // Compute the time usage
    cout << "CPU time: " << time_use;

    return 0;
}

Figure 8: Program of image processing on CPU

#include <cuda_runtime.h>

// Kernel that builds a 256-bin histogram of 8-bit image values
__global__ void Histogram_CUDA(unsigned char* Image, int* Histogram);

void Histogram_Calculation_CUDA(unsigned char* Image, int Height, int Width, int Channels, int* Histogram){
    unsigned char* Dev_Image = NULL;
    int* Dev_Histogram = NULL;

    // Allocate CUDA variable memory
    cudaMalloc((void**)&Dev_Image, Height * Width * Channels);
    cudaMalloc((void**)&Dev_Histogram, 256 * sizeof(int));

    // Copy CPU data to GPU
    cudaMemcpy(Dev_Image, Image, Height * Width * Channels, cudaMemcpyHostToDevice);
    cudaMemcpy(Dev_Histogram, Histogram, 256 * sizeof(int), cudaMemcpyHostToDevice);

    // One block per pixel, one thread per block
    dim3 Grid_Image(Width, Height);
    Histogram_CUDA<<<Grid_Image, 1>>>(Dev_Image, Dev_Histogram);

    // Copy memory back to CPU from GPU (this copy waits for the kernel to finish)
    cudaMemcpy(Histogram, Dev_Histogram, 256 * sizeof(int), cudaMemcpyDeviceToHost);

    // Free up the memory of GPU
    cudaFree(Dev_Histogram);
    cudaFree(Dev_Image);
}

__global__ void Histogram_CUDA(unsigned char* Image, int* Histogram){
    int x = blockIdx.x;
    int y = blockIdx.y;

    // Linear index of pixel (x, y); this indexing reads one byte per pixel,
    // so it covers a single-channel image
    int Image_Idx = x + y * gridDim.x;

    // Many threads may hit the same bin, so the increment must be atomic
    atomicAdd(&Histogram[Image[Image_Idx]], 1);
}
Figure 9: Program of image processing on GPU
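
One way to check the output of `Histogram_Calculation_CUDA` is to compare it with a sequential reference computed on the CPU. The sketch below assumes the same 8-bit, one-byte-per-pixel input that the kernel indexes. The function name `histogram_cpu` is an assumption made for this illustration.

```cpp
#include <array>
#include <vector>

// Sequential 256-bin histogram of 8-bit values, usable as a reference
// against which a GPU result can be compared bin by bin.
std::array<int, 256> histogram_cpu(const std::vector<unsigned char>& image) {
    std::array<int, 256> bins{};  // value-initialized: every bin starts at 0
    for (unsigned char value : image) {
        ++bins[value];  // no atomics needed: only one thread updates the bins
    }
    return bins;
}
```

The GPU version needs `atomicAdd` because thousands of threads may increment the same bin at once, whereas this loop has a single writer.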

4.3 Result

Figure 4: CPU time for processing the image

Figure 10: GPU time for processing the image

Conclusion: For image processing, the GPU is faster than the CPU. However,
the GPU has a higher latency than the CPU. In the von Neumann architecture,
a program running on the CPU works directly on the data stored in main
memory. When the program runs on the GPU, the data first has to be copied
from main memory to the GPU's memory and then processed there. When the GPU
finishes, the processed data has to be copied back from the GPU's memory to
main memory so that the CPU can read it. These transfers are why the GPU has
a higher latency than the CPU. For bigger calculations, however, the
execution time is much greater than the transfer time, so the transfer time
can be ignored and the GPU is faster than the CPU in this situation.
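
The trade-off between transfer overhead and execution time can be illustrated with a simple linear timing model. This is a sketch under stated assumptions, not a measurement from the project. The struct `CostModel`, its fields, and any numbers passed to it are hypothetical.

```cpp
#include <cassert>

// Linear timing model. All parameters are illustrative assumptions.
struct CostModel {
    double cpu_seconds_per_item;        // CPU cost of processing one item
    double gpu_seconds_per_item;        // GPU cost per item, including per-item copy cost
    double gpu_fixed_overhead_seconds;  // launch and setup cost paid once per run
};

double cpu_time(const CostModel& m, double items) {
    return items * m.cpu_seconds_per_item;
}

double gpu_time(const CostModel& m, double items) {
    return m.gpu_fixed_overhead_seconds + items * m.gpu_seconds_per_item;
}

// Item count at which both processors take the same time; above it the GPU wins.
double break_even_items(const CostModel& m) {
    assert(m.cpu_seconds_per_item > m.gpu_seconds_per_item);
    return m.gpu_fixed_overhead_seconds /
           (m.cpu_seconds_per_item - m.gpu_seconds_per_item);
}
```

Under this model the GPU loses on small inputs, where the fixed overhead dominates, and wins on large ones. That matches the pattern reported for the two problems in this report.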

5. CONCLUSIONS AND FUTURE WORK

In this topic, the group learned and investigated how to program GPU
parallel processors using CUDA and C++. However, the topic only covers two
small applications, one numerical calculation and one image processing
example, which were programmed in order to compare CPU and GPU
performance.
For the first problem (sum of primes from 1 to n, where n is entered by the
user), the results show that the CPU takes less time to execute than the
GPU. Therefore, we can conclude that in this case the CPU is faster than
the GPU. For image processing, in contrast, the GPU outperforms the CPU.
The GPU has a higher latency than the CPU. In the von Neumann architecture,
a program running on the CPU accesses its data directly in main memory.
When running on the GPU, however, the program must first copy the data from
main memory to GPU memory and process it there. When it is done, the
processed data must be copied from the GPU's memory back to main memory,
where it can be read by the CPU. As a result, the GPU has a higher latency
than the CPU. However, because the execution time is much longer than the
transfer time for larger calculations, the transfer time can be ignored,
and the GPU is faster than the CPU in this case.
As future work, analyzing and comparing the speeds of the GPU and the CPU
in this way can help develop methods to optimize processing speed in
parallel computation.

6. References
[1] NVIDIA, CUDA C++ Programming Guide, November 2006, www.nvidia.com
[2] Jaegeun Han and Bharatkumar Sharma, Learn CUDA Programming: A
beginner's guide to GPU programming and parallel computing with CUDA 10.x
and C/C++.
[3] Bhaumik Vaidya, Hands-On GPU-Accelerated Computer Vision with OpenCV
and CUDA: Effective techniques for processing complex image data in
real-time using GPUs.
[4] Eric Young and Frank Jargstorff, Image Processing & Video Algorithms
with CUDA.
