Algorithmic Considerations For Graphical Hardware Accelerated Applications
I. INTRODUCTION
Research into general-purpose computation on
programmable graphics hardware (GPGPU) has become an
area of great interest to the high-performance
computing (HPC) community. GPGPU is the practice of
writing software that solves standard HPC problems
using commodity graphics hardware. The attraction of
GPGPU for those writing computationally intensive
scientific software lies in the price/performance
ratio of this hardware.
A high-end consumer-class GPU from a vendor such as
Nvidia costs between $300 and $500. For this price,
you get a card whose theoretical peak floating-point
performance is many times that of the system's main
CPU. The difficulty lies in harnessing that
computational performance.
Given the differences between CPUs and GPUs, it is not
always straightforward to take an algorithm written to
run efficiently on a standard CPU and get it to run
with comparable efficiency on graphics hardware. This
is due to a combination of architectural, algorithmic,
and development-tool issues.
We will examine the roots of the GPGPU movement, look
at hardware architectures and software development
tools, and discuss the algorithmic considerations
involved in moving HPC codes to graphics hardware.
II. HISTORY
A. The Power Wall
In the mid-to-late 20th century, CPU speeds
improved at an average rate of 52% a year. The
continued miniaturization of transistors provided
seemingly endless opportunities for the dies to become
smaller, the chips to become faster, and the power
requirements to fall. For 30 years, the fundamental
contributor to increasing system performance was the
rising speed of the CPU. Programmers assumed that
they could increase the speed of their programs simply
by waiting a few months or years for a faster chip to
become available. However, as the frequency of a
microprocessor increases, its power consumption rises
at roughly the cube of the increase. In other words,
doubling the clock speed requires roughly 8 times the
power. Prior to 2001, the problem had been kept under
control by the continuing miniaturization of the
transistors on the die. Each die shrink lowered the
power consumption of the affected transistors, largely
offsetting the increase due to the rising clock
frequency [1].
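The cubic relationship follows from the standard CMOS
dynamic-power model; as a short sketch (our own, not
from the cited source, under the common approximation
that supply voltage must scale roughly with clock
frequency):

\begin{align*}
P_{\text{dyn}} &\propto C V^2 f
  && \text{(dynamic power of switching capacitance } C\text{)}\\
V &\propto f
  && \text{(assumption: voltage scales with frequency)}\\
\Rightarrow\; P_{\text{dyn}} &\propto C f^3,
  \qquad f \to 2f \;\Rightarrow\; P_{\text{dyn}} \to 2^3 P_{\text{dyn}} = 8\,P_{\text{dyn}}.
\end{align*}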
However, around 2001, the microprocessor industry
ran into what is sometimes known as the power wall.
This is the point at which it is no longer possible to
offset power consumption increases with transistor
miniaturization. It marked the end of the era of
doubling CPU clock speeds every two years. Surprisingly
enough, it did not mark the end of doubling the
transistor capacity of fabricated dies every two years.
The power wall applies only to dies that carry a single
processor. Given a single-core CPU, it was therefore
possible to replicate that CPU on a miniaturized die,
placing two or more cores in the same area, and the
multicore era began.
C. GPGPU
Soon after the introduction of GPUs, people in the
HPC community realized that the cards represented an
unmatched price/performance bargain. For at most $300,
it was possible to get hardware that was capable
(theoretically, at least) of many times the
floating-point performance of a desktop CPU.
Like-minded individuals began corresponding, and a
new movement, the general-purpose computing on
GPUs (GPGPU) initiative, began.
D. GPU Limitations
We have already discussed some shortcomings of the GPU
programming model: limited flow control, the
requirement that programs run in lockstep, and the need
for massive parallelism. However, there are other, even
more severe limitations.
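To make the lockstep constraint concrete, here is a
minimal CUDA sketch (our illustration; it does not
appear in the original text). Threads in a 32-thread
warp execute in lockstep, so when a branch condition
differs within a warp, the hardware serializes the two
paths and every thread pays for both:

__global__ void divergent(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    /* If in[i] differs in sign across a warp, the warp runs
       path A and then path B, masking off inactive threads. */
    if (in[i] > 0.0f)
        out[i] = sqrtf(in[i]);   /* path A */
    else
        out[i] = 0.0f;           /* path B */
}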
The inner loop of an iterative sparse solver is the
accumulating matrix-vector product

    y_{i+1} = Ax + y_i

This operation may be executed thousands or even
millions of times over the course of solving a sparse
linear system: it is executed once for each non-zero
element in the matrix, in every iteration. Making this
computation efficient is the most important
computational optimization for these solvers. However,
it is overshadowed by the memory requirements of a
sparse solver.
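As an illustration only (the original gives no code),
a minimal CUDA kernel for this accumulating sparse
matrix-vector product, assuming the matrix is stored in
compressed sparse row (CSR) format; the array names
row_ptr, col_idx, and vals are our own choices:

/* y = A*x + y for a CSR matrix, one thread per row. */
__global__ void spmv_csr(int n_rows,
                         const int *row_ptr,   /* n_rows+1 row offsets   */
                         const int *col_idx,   /* column of each nonzero */
                         const float *vals,    /* value of each nonzero  */
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    float sum = y[row];                  /* accumulate into y */
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += vals[j] * x[col_idx[j]];  /* one multiply-add per nonzero */
    y[row] = sum;
}

One thread per row keeps the sketch simple, but the
indirect loads through col_idx leave accesses to x
uncoalesced; production SpMV kernels use more elaborate
layouts precisely because of the memory behavior
discussed above.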
VI. CONCLUSION

REFERENCES