
Algorithmic Considerations for Graphics Hardware Accelerated Applications

Saied Farivar and David J. Harr

Abstract: Recently, there has been much work done in the area of harnessing the computational power of
graphics hardware for non-graphics problems. In this
paper, we examine the actual performance capabilities of
modern graphics processing units (GPUs) and analyze the
characteristics of problems that successfully run on these
devices. We present the hardware architecture for one
representative GPU model, and discuss how the
architecture dictates the performance constraints for
client software. We then give an overview of several
active areas of interest in HPC and look at the
performance trade-offs between GPUs and more
traditional computing platforms. Finally, we will take an
in-depth look at implementing matrix operations on a
GPU as illustrative of any effort to employ GPUs in this
type of computational role.

I. INTRODUCTION
Research into general purpose computations on
graphics programmable hardware (GPGPU) has
become an area of great interest to the
high-performance computing (HPC) community.
GPGPU is the practice of writing software that solves
standard HPC problems using commodity graphics
hardware. The attraction of GPGPU for those writing
computationally intensive scientific software lies in the
price/performance ratio of this graphics hardware.
A high-end consumer class GPU from a vendor
such as Nvidia will cost between 300 and 500 dollars.
For this price, you get a card that is capable of
theoretical peak floating point performance many times
that of the main CPU in the system. The difficulty is in
harnessing that computational performance.
Given the difference between CPUs and GPUs, it is
not always a straightforward task to take an algorithm
written to run efficiently on a standard CPU and get it
to run with comparable efficiency on graphics
hardware. This is due to a combination of architectural,
algorithmic, and development tool issues.
We will examine the roots of the GPGPU
movement, look at hardware architectures and software

tools for doing computations on GPUs, discuss the


characteristics of algorithms that can be successfully
accelerated using GPUs, and then consider in detail the
performance of two algorithms running on a GPU, as
well as the reasons they can achieve this speedup.

II. HISTORY
A. The Power Wall
In the mid-to-late 20th century, CPU speeds
improved at an average rate of 52% a year. The
continued miniaturization of transistors provided
seemingly endless opportunities for the dies to become
smaller, the chips to become faster, and the power
requirements to fall. For 30 years, the fundamental
contributor to increasing system performance was the
rising speed of the CPU. Programmers assumed that
they could increase the speed of their programs simply
by waiting a few months or years for a faster chip to
become available. However, as the frequency of a microprocessor increases, its power consumption rises at roughly the cube of the increase. In other words, doubling the clock speed requires roughly eight times the power. Prior to 2001, the problem had
been kept under control due to the continuing
miniaturization of the transistors on the die. Each die
shrinkage lowered the power consumption of the
affected transistors, largely offsetting the power
consumption increase due to the clock frequency [1].
However, around 2001, the microprocessor industry
ran into what is sometimes known as the power wall.
This is the point at which it is no longer possible to
offset power consumption increases with transistor
miniaturization. It marked the end of the era of
doubling CPU clock speeds every 2 years. Surprisingly
enough, it did not mark the end of doubling the
capacity of chip fabrication dies every two years. The
power wall applies only to dies that have a single
processor on them. Given a single core CPU, it was
possible to replicate that CPU on a miniaturized die and have the power consumption remain relatively stable. This discovery marked the end of single-processor computing.


From this point on, CPU manufacturers such as


Intel and AMD used their increasing transistor density
to put more cores onto a single die, while only
modestly improving clock speeds. In order to take
advantage of increasing CPU capacity, for the first
time, applications programmers were required to write
multithreaded programs, rather than relying on raw
single threaded CPU performance to improve through
raw clock speed increases. Thus, the change in focus for CPUs from raw single-processor performance to multicore performance marked a fundamental paradigm shift in development strategy, even for vanilla application programmers. Where parallel computing had once been the purview of dedicated server, scientific, or numerical software programmers, it had now become the primary method for all applications programmers to improve the performance of their software.

While game developers were happy to make use of


these accelerated video cards (in fact, by 2000,
essentially any new computer game required a video
card with some degree of hardware acceleration built
in), they chafed at the inflexibility of the interface.
Because the operations that the cards would accelerate
were baked into the hardware, each card had to be
coded for individually, and the only features that were
available for acceleration were ones that the
manufacturer had actually put into the card. If a
developer wanted to do a graphic effect slightly
differently than the way the hardware expected it, they
were right back to coding the effect manually on the
CPU and then trying to integrate the results into the
rest of their rendering pipeline.
In 2001, the Xbox console shipped with the Nvidia
GeForce 3 video architecture. The GeForce 3 was the
first graphics card that, instead of fixed-function
hardware circuitry, had programmable elements that
allowed the game programmer to determine how the
graphical effects would be applied to the game. Since
the GeForce 3 was no longer really a video card, but
more of a set of programmable graphics hardware, it
was referred to as a graphics processing unit, which is
where the term GPU comes from.

B. Graphics Processing Units


At the same time this was happening, there was a revolution occurring in the computer graphics community, particularly among high-end PC game aficionados. Since the publication of the video game Doom by id Software in 1993, computer games had
been using ever more sophisticated graphics and
physics in order to improve the immersion of the
software. Game developers spent enormous amounts of
time writing hand-coded assembly language rendering
loops to maximize the performance of their games on
the target hardware. Often, a developer would create
multiple hand-tuned versions of their inner loops,
picking which one to use at runtime based on the
model of CPU the player's computer had installed. The
processing and rendering of 3d computer game
graphics could bring even the most powerful consumer
CPU to a grinding halt.

It was not long before ATI and other competitors of


Nvidia responded with their own GPUs. Each
generation of GPUs got faster, the range of possible
effects greater, the on-board memory larger, the
throughput higher.
The first generation of GPUs offered programmable logic that could only be used on the geometry of a scene: the x, y, and z coordinates that made up the vertices, along with normals and color values. Any operations on
the frame buffer were still handled by fixed-function
hardware. The programmable elements were known as
vertex shaders (which is also the term used for the
programs run on that hardware). In addition, the only
floating-point data types supported were 16-bit
half-floats and a variation of standard 32-bit floats.

To alleviate this problem, around 1997, a company


called 3dfx introduced a graphics coprocessor that
offloaded some of the more computationally expensive
numeric operations from the CPU to the add-in card.
Games that made use of 3dfx's card were able to achieve frame rates and polygon counts impossible on an unaccelerated machine. Eventually, 3dfx went out of business, but its legacy lived on in the creation of integrated graphics accelerator boards and video cards that combined hardware acceleration of common 3D graphics operations with video output circuitry on the same card. Soon, two of the biggest manufacturers of such cards were Nvidia and ATI.

Before long, the programmability had moved to


operations on the frame buffer, via fragment shaders.
However, the two pieces of hardware were still
separate. A vertex shader could only work on
geometric data, and a fragment shader could only work


on pixel data. Around 2007, both ATI and Nvidia went
to a unified shader architecture that, for the first time,
did not differentiate between geometric data and pixel
data, allowing any hardware to be used for any
purpose. Furthermore, with the most recent generation
of GPUs, full support for hardware calculations using
64-bit floating point data types has been added.

A. CPU Hardware Architectures


Computing devices are ubiquitous in modern life. From the desktop workstation to the modern cellular phone, most people own multiple devices that qualify under the broad term of computer. One
thing that all these devices have in common is the
general design of their central processing units.
Since the beginning of the computer era, CPUs
have used variants of the design propounded by the early computer pioneer John von Neumann. In the von
Neumann design, there is a single (or more recently, a
small number, typically two or four) processor that sits
astride a bus connecting it to main memory. The
processor treats both programs and data as simply
bits, and program instructions are shuffled around
identically to data until it becomes time to execute
them. The speed of the central processor is usually
much faster than the ability of the bus to supply the
processor with data, resulting in data starvation for the
processor if not alleviated. This has been dubbed the von Neumann bottleneck [2]. To alleviate this situation, most modern processors employ a multi-level memory hierarchy, with varying amounts of cache memory sitting between main memory and the CPU, in an attempt to hide memory latency. An Intel i7 CPU has a transistor count of 2.2 billion, and up to 30% of that number is devoted to memory control to ensure that the CPU gets data in a timely manner and the processor does not starve [3].

C. GPGPU
Soon after the introduction of GPUs, people in the
HPC community realized that the cards represented an
unmatched price/performance bargain. For a few hundred dollars, it was possible to get hardware that was (theoretically, at least) capable of many times the
floating-point performance of a desktop CPU.
Like-minded individuals began corresponding, and a
new movement, the general-purpose computing on
GPUs (GPGPU) initiative, began.

III. HARDWARE ARCHITECTURE AND SOFTWARE TOOLS
There are significant differences between the
hardware architectures of a central processing unit and
a GPU. The central processing unit is designed for
general-purpose computing. The GPU is designed for
specialized graphics operations. The sharp focus of the
GPU on graphics processing tasks is what makes it
useful for parallel computing. The differences between
the CPU and GPU are fundamental and deep. In this section we will profile a typical general-purpose CPU, the Intel i7, and look at the details of the Nvidia Tesla GPU architecture. The Tesla
architecture is representative of the principles
underlying modern GPU design. While GPUs made by
other manufacturers, such as ATI, may differ in detail,
the underlying principles will be similar. On the
software side, we will discuss Nvidia's CUDA language, which has become something of a de facto standard for programming on GPUs. In a highly informal survey on scholar.google.com of papers published in the last 3 years containing the term GPU, Nvidia GPUs and CUDA are mentioned more than five times as often as any competing architecture or programming language. While not scientific, this was persuasive enough for us to focus on Tesla and CUDA.

It is precisely the fact that CPUs require such a large amount of control hardware that makes it difficult for them to keep pace with the performance scaling of GPUs. The main culprit in the power wall was the proliferation of control hardware required of a CPU as the clock speed increased. As CPUs became more complex, the amount of control hardware required grew higher and higher. Presently, the control hardware of a CPU may take up as much as 40% of the transistors on the die.
The memory bandwidth available to the CPU is also a limiting factor in its capabilities. On a typical Intel motherboard, the system bus is capable of perhaps 12 GB/s of throughput. On a data-bound process, this limits the amount of useful work the CPU can perform within a particular task. As an example, the high-end Core i7 processor is capable of about 20 GB/s of memory throughput and a maximum sustained floating-point performance of about 20 Gflops [3].

B. GPU Hardware Architectures


Compared to the CPU, the GPU is extremely focused. A GPU is made up of many small (compared to a modern CPU) computational units with minimal control hardware. In contrast to a CPU, the majority of the transistors on the die are dedicated to computation hardware and to memory. This is one of the factors that allows GPUs to avoid the power-wall problem plaguing CPUs. If you can double the number
of transistors on a GPU die, you can almost double the
number of computational units. This has resulted in the
performance of GPUs increasing on an accelerating
curve.


C. CUDA - A Tool for Writing Parallel Computations on GPUs

The Nvidia Tesla is a graphics architecture that has


been extended to support general computation through
improved parallel control hardware, exposed cache
operations, and support for double-precision floating
point operations. It is representative of the current state
of the art in GPU design and is unlikely to differ
significantly from future architectures. Tests of the
Tesla have shown sustained performance of over 80
Gflops in double precision floating point and memory
throughput approaching 150 GB/s.

In the last few years, companies such as ATI and


Nvidia have started to directly court the HPC
community. These efforts have consisted of activities
such as providing simplified programming interfaces
for the cards, exposing more of the hardware model for
better access to memory caches and registers, and
providing software tools to allow for simpler
programming models for general computation.
One example of these software tools is the CUDA
language, sponsored by Nvidia. CUDA is a set of
extensions to the C language to enable programmers to
do parallel programming on GPUs without having to
know about the graphics rendering pipeline. CUDA
uses a parallel task model in its implementation. A
programmer writes a small chunk of code that can be
run in parallel across many processors at once, then
CUDA takes responsibility for thread setup and data
transport. Before the data can be processed, it must be
copied up to the memory on the graphics card by the
program running on the host CPU. This transfer is
done using DMA across the PCI bus.
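A minimal sketch of this model (our own illustration; the kernel name, array size, and scale factor are hypothetical and not drawn from the paper): the host allocates card memory, DMAs an array across the bus, launches one thread per element, and copies the result back.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Kernel: each thread processes exactly one element of the array. */
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *host = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i)
            host[i] = (float)i;

        /* Allocate card memory and DMA the data across the PCI bus. */
        float *dev;
        cudaMalloc((void **)&dev, bytes);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

        /* Launch one thread per element, grouped into blocks of 256. */
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(dev, 2.0f, n);

        /* Copy the results back to host memory over the same bus. */
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
        printf("host[10] = %f\n", host[10]);

        cudaFree(dev);
        free(host);
        return 0;
    }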

Clearly, these numbers are significantly higher than those of comparable CPUs. The difficulty is in translating these
theoretical capabilities into actual performance. In the
past, the programmer was forced to map the problem
into the graphics domain and to somehow adapt the
graphics operations of the GPU into data processing
operations. The most common technique was to
arrange the target data in a large 2-D array that was treated as a texture by the GPU. In these textures, the data was stored as 128-bit vectors, each composed of four 32-bit floating point values. Using texture sampling operations, a fragment shader would access the data, perform the desired computation, and output the result as another 128-bit vector to a target pixel buffer. These types of convoluted activities, required just to perform simple computations, made GPU programming the purview of numeric programmers who were also conversant with graphics programming techniques, and resulted in a high cost of entry to start using a GPU for HPC.

Once the data is in place and the thread has started


running, each thread is responsible for processing a
single set of data. Communication between the threads
is accomplished using shared memory, and
synchronization takes place through the use of barriers.

In addition, because the GPU uses a


single-instruction multiple-data (SIMD) model for its
computation, there will be some number of
computational units running a single program in
lockstep. The data for each unit will be different, but
the operations must be identical across the array of
processors. This makes the use of flow control in a
program difficult. If flow control were allowed, it
might lead to a situation where two different cores
would try to execute differing code paths. In previous
generations of GPUs, flow-control operations were
prohibited. Recent GPUs have relaxed this somewhat, but even on these models, executing a branch means that all processors in the block will have to follow both code paths, and each core will simply discard the result for the false path.
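A hedged sketch of this effect (the kernel and its threshold test are our own illustration, not an example from the paper): when some threads of a block satisfy the condition and others do not, the hardware issues both sides of the branch and masks out the untaken side for each thread.

    /* Illustrative only: threads in the same block may take different
       paths, so the hardware serializes both sides of the branch and
       each thread keeps only the result of the path it actually took. */
    __global__ void divergent(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        if (in[i] > 0.0f)      /* taken by some threads in the block     */
            out[i] = in[i] * 2.0f;
        else                   /* taken by the others; both paths are
                                  issued, with results masked per thread */
            out[i] = 0.0f;
    }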

Threads in CUDA are arranged into blocks of threads, which are executed in groups (warps, typically of 32 threads) that run in lockstep over the data. While communication between blocks must go through the larger device memory, each block of threads has a small amount of fast shared memory available for intra-block thread communication. This fast memory has a much lower latency than the device memory used between blocks, and functions in an equivalent fashion to a level-1 cache on a CPU [4].
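A sketch of how a block of threads might use this fast per-block memory, assuming a block of 256 threads and a simple per-block sum reduction (our own illustrative choice of operation):

    /* Each block sums 256 input elements in its fast on-chip shared
       memory, synchronizing with barriers between reduction steps. */
    #define BLOCK 256

    __global__ void block_sum(const float *in, float *partial, int n)
    {
        __shared__ float buf[BLOCK];       /* per-block fast memory */

        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        buf[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                   /* barrier: all loads done */

        /* Tree reduction within the block. */
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();               /* barrier after each step */
        }

        if (tid == 0)
            partial[blockIdx.x] = buf[0];  /* one result per block */
    }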


IV. ANALYZING GPU PERFORMANCE ON STANDARD HPC ALGORITHMS
In the initial stages of GPGPU computing, there
were some amazing claims made for the capabilities of
the graphics processors. As an example, when Sony introduced the PlayStation 3 video game console in 2005, they claimed that the machine was capable of 2 TFlops of performance, about 1.8 TFlops of which was attributed to the Nvidia GPU that provided graphics capabilities for game software. This was clearly a wildly exaggerated claim for a consumer game console. Similar claims of algorithmic speedups of two or three orders of magnitude over CPU-based algorithms floated around. However, many of these
claims were based upon toy benchmarks that allowed
the GPU to function at or near its theoretical
computational and throughput maximums.

B. Comparison of CPU and GPU Algorithmic Characteristics
CPUs rely on either temporal or spatial locality to
mask the cost of memory fetches. Therefore, programs that either use data located close together in RAM, or repeatedly use the same data, run efficiently on modern CPUs. On the other hand, algorithms that
use a scatter-gather model of memory access are
much more difficult to optimize for CPUs.
In contrast, GPU memory is often triple or quadruple ported, allowing multiple reads of the same location. In addition, the memory bus of modern GPUs is 256, 384, or even 512 bits wide, allowing the hardware to fetch as many as 16 words of memory in a single cycle. These memory characteristics mean that a GPU excels at pulling data from disparate locations in graphics RAM and getting it to the processing units in a timely manner.
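As a sketch of the scatter-gather style of access described above (the index array and kernel are our own illustration), each thread reads from an arbitrary location selected by an index and writes its result contiguously:

    /* Gather: each thread reads from an arbitrary location given by an
       index array. This access pattern is costly on cache-based CPUs,
       but the wide GPU memory system is built to keep such requests
       flowing to the processing units. */
    __global__ void gather(float *out, const float *table,
                           const int *idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = table[idx[i]];   /* scattered read, contiguous write */
    }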

When careful studies were made comparing similar algorithms, similarly tuned for both CPUs and GPUs, the results were somewhat different. The CPU fared better than it had been given credit for, but there was still a significant speed advantage for GPUs on a large class of problems.

Since GPU processing units are almost always vectorized, a single processor can operate on as many as four numbers in a single cycle. When the data can be vectorized, substantial performance gains can be realized over scalar processing.

A. Characteristics of GPU-friendly algorithms


In [5], Owens et al. describe three characteristics of
algorithms that can be mapped to GPUs with a high
probability of significant speed-ups. First, the algorithm
should have substantial computational requirements.
Without significant computational requirements, many
of the processors on the GPU will be idled, wasting
resources that could be applied to the problem solution.

C. GPU Limitations
We have already discussed some shortcomings of the GPU programming model: limited flow control, the requirement that programs run in lockstep, and the need for massive parallelism. However, there are other, even more severe limitations.

Next, the algorithm should have a high degree of


parallelism. Given the large numbers of processors on a
typical GPU, there must be enough independent work for the card to keep its hardware fully utilized. Without
significant parallelism, processors may be unable to
perform any computations as they wait for other
portions of the task chain to complete, once again
resulting in wasted computational resources.

The GPU performance advantage over a standard


CPU comes about because of its unique hardware
architecture characteristics. However, in order to take
advantage of its strengths, certain conditions must be
fulfilled. In particular, to take advantage of the extreme
bandwidth of the GPU bus, the data must be resident
in the graphics card memory.

Finally, the algorithm should be more reliant on aggregate data throughput than on the latency of individual data accesses. When there are large amounts of data flowing through the pipeline, the GPU is able to start and stop threads quickly to respond to the incoming data. However, if the threads are dependent on quick access to data, then the processors may block waiting for data to be retrieved from memory.

The only way to get initial data up to the GPU is to


perform DMA transfers from main memory over a
standard computer peripheral bus. In most cases, this is
some variant of a PCI bus. These buses have a
maximum theoretical bandwidth of about 6 GB/s, and
an actual throughput closer to 2 GB/s for real data.

Vector (SpMV) multiplication operator. Like BLAS,


SpMV is a foundational operation for linear algebra.
Unlike BLAS, SpMV operates on sparse matrices, which makes it extremely bandwidth intensive. Given the memory performance of GPUs, it
should also be a good match for the GPU hardware.

When one factors in the setup costs of running programs on a GPU, those costs can severely reduce the relative performance advantage GPUs enjoy over CPUs. For this reason, it is critical, when evaluating comparative benchmarks of CPU and GPU performance, that the total cost of each algorithm be weighed.

A. Case Study I: BLAS on GPUs

The basic linear algebra operations include vector-vector multiplication, matrix-vector multiplication, and matrix-matrix multiplication. BLAS supplies primitives for all of these operations. Of the three, the most time-consuming and computationally expensive is matrix-matrix multiplication, which is an O(n^3) operation.
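To make the O(n^3) cost concrete, a deliberately naive CUDA formulation (ours, for illustration only; the tuned implementations discussed below use blocking and on-chip memory) assigns one thread to each output element, with each thread performing n multiply-adds:

    /* C = A * B for square n x n matrices stored in row-major order.
       n^2 threads each perform n multiply-adds: n^3 operations total. */
    __global__ void matmul_naive(const float *A, const float *B,
                                 float *C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n || col >= n)
            return;

        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }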

A recent study compared actual CPU and GPU programs that implemented useful, significant algorithms and were highly tuned for their target architectures. Their results were instructive. Instead of the orders-of-magnitude performance increases reported by many researchers, they found much more modest improvements, ranging from 3x to 16x [3].

One group of researchers decided that the best way to attack the problem of accelerating BLAS was to maintain several alternative implementations of the same function. Each had nearly identical
floating-point operation and memory access counts, but
the order of operations and the locality of the memory
fetches were radically different. They derived a
heuristic for characterizing the target matrices, and
depending on the matrix characteristics, they chose one
of the algorithms.

Nevertheless, even with these more modest


performance improvements, it is clear that GPUs can
be effective platforms for numerically intensive
computations. The ability to plug a commodity
graphics card, or two, or three, into a computer and see
a speedup of several times, all for less than the price of
a low-end computer, is remarkable.

Their relatively simple, straightforward


implementation of these routines netted them an
average 2.3x speedup over the same routines from CuBLAS, Nvidia's official BLAS implementation for its GPUs. CuBLAS is distributed as part of
the CUDA developer tool chain. Given that Nvidia
supplies this library, making it a de facto system
library, it is natural to assume that they have spent
significant time tuning it for their hardware. Yet some
simple algorithmic changes to adapt to the memory
distribution of the target matrices and to minimize
redundant data loads and stores provided a doubling of
performance [7].

V. ALGORITHMIC CASE STUDIES


In order to illustrate concrete examples of the
techniques used to implement an algorithm efficiently
on GPUs, we here present analysis of various
implementations of two algorithms. Since linear
algebra is the primary computational activity of a wide
variety of HPC algorithms, both case studies are
techniques taken from matrix operations.
We have chosen to use operations from the Basic Linear Algebra Subprograms (BLAS) library as our first
case study. BLAS is a widely used foundation of
popular linear algebra libraries and in fact constitutes
the basic building block of most of the higher level
operations of these libraries. An efficient BLAS
implementation is critical for good performance from
these libraries. BLAS is computationally intensive, so it
should make a good candidate for GPU speedup [6].

Another group took the tack of forcing operations


to occur in multiples of the size of the GPU processor
blocks. Instead of trying to conform the processors to
the structure of the target matrices, they would have all
the threads in a computational block fetching data and
computing results. However, any redundant processors
were simply duplicating the operations and data of the
last valid processor, and their results were discarded.
This approach had several advantages. It allowed

Our second case study is of the Sparse Matrix



element of the solution vector, y_{i+1}, is calculated from the sum of a previous estimate and an element of the matrix

the GPU to allocate thread resources in its best native


granularity without requiring logic to try and determine
whether the current block would need to adjust the
number of allocated threads. At the same time, even though extra processors were performing extra work, it didn't actually tax the resources of the card at all, because the data being accessed by the redundant threads was already present in local registers, being the same data as that of a valid thread.

y_{i+1} = Ax + y_i.
This operation may be executed thousands, or even
millions of times during the course of solving a matrix.
It is executed once for each non-zero element in the
matrix for each iteration of the problem. Making this
computation efficient is the most important
computational optimization for these solvers. However, this optimization is overshadowed by the memory requirements of a sparse solver.
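A common formulation, sketched here under the assumption that the matrix is stored in compressed sparse row (CSR) form (the studies discussed below consider several storage formats), assigns one thread per row and performs one multiply-add per non-zero:

    /* y = A*x + y for a sparse n x n matrix A in CSR form.
       row_ptr has n+1 entries; col_idx and val hold one entry per
       non-zero element of the matrix. */
    __global__ void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                             const float *val, const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n)
            return;

        float acc = y[row];                   /* previous estimate   */
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            acc += val[j] * x[col_idx[j]];    /* one op per non-zero */
        y[row] = acc;
    }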

The combination of these factors resulted in a speedup of as much as 33% over the Nvidia implementation
for larger matrices. This illustrates the importance of
tuning the implementation to fit the vagaries of the
hardware [8].
Finally, another group created a framework to
auto-generate a whole family of kernels with different blocking sizes, memory access patterns, and
operator execution orders. Then, using a profile of the
target GPU, the tool would pick one of the kernels to
use for the target board.

Since an iterative solver is operating on a sparse


matrix, its memory access patterns are highly
non-localized. Even with the use of a purpose-designed
data structure for sparse matrices, it is difficult to
reliably predict the memory access behavior of a
matrix before the calculations are carried out. In fact,
this is one of the features that makes sparse matrix
multiplication so attractive on a GPU. A standard CPU
is limited to a maximum data throughput of around 12
GB/s, which places an effective upper limit on the
number of SpMV operations the CPU can perform. On
the other hand, a high end GPU is capable of data
throughput of as much as 20 times that number, giving
the GPU an enormous edge in performing these
operations.

Their results were interesting, if not as spectacular


as some of the other efforts. Their kernels consistently
executed at about 85% of the speed of the Nvidia-supplied CuBLAS implementation which, as noted above, is hand-tuned for the target GPUs. We do not
know which version of CuBLAS they tested against,
whether it was the same version as the group in [7]
used or a different version, so there is no way to
directly compare the results of the two efforts [6].

One technique to maximize memory bandwidth is


to tune the sizes of the client matrices so that fetch
operations can be made in multiples of the bus size,
giving the hardware maximum opportunity to optimize
the data fetches. One group working with this
technique found that they were able to achieve a 6.5x
speedup on large sparse matrices over a similar
program running on a regular CPU.
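A minimal host-side sketch of the idea (the multiple of 16 floats is our own assumed granularity, standing in for whatever fetch width a particular card prefers) pads each row of a matrix so that every row begins on a full bus-width boundary:

    /* Pad the leading dimension of an n x m row-major matrix up to a
       multiple of 16 floats so each row starts on a bus-width boundary. */
    #include <stdlib.h>
    #include <string.h>

    static float *pad_rows(const float *src, int n, int m, int *ld_out)
    {
        int ld = (m + 15) / 16 * 16;        /* round up to multiple of 16 */
        float *dst = (float *)calloc((size_t)n * ld, sizeof(float));
        for (int i = 0; i < n; ++i)
            memcpy(dst + (size_t)i * ld, src + (size_t)i * m,
                   (size_t)m * sizeof(float)); /* zeros fill the padding */
        *ld_out = ld;
        return dst;
    }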

B. Case Study II: Sparse Solvers on GPUs


Sparse linear solvers are concerned with solving
sparse linear systems. In sparse systems, the vast
majority of the coefficients of the representative
equations are 0. Any attempt to use traditional matrix
operations on a sparse matrix is hugely wasteful of
both computer memory and of processor cycles. As a
result, the scientific community has developed an entire
class of algorithms that is concerned with performing
multiplication on sparse matrices. These algorithms
have very different characteristics from dense methods such as those built on BLAS, and getting a sparse solver to run efficiently on a GPU requires different techniques than are required for BLAS and similar software.

In addition, their algorithm achieved a throughput


rate of 97% of the theoretical maximum bandwidth of
the GPU memory bus. Their results indicate that it is
possible to extract almost the maximum performance
of the hardware with careful tuning [9].
Because a sparse matrix consists of mostly 0
elements, the format the matrix is stored in can have a
large impact on the performance of the algorithm. A
research group looked at popular sparse matrix storage
formats and attempted to modify them to improve locality of reference for SpMV operations.

Many solvers are based around the Sparse Matrix Vector (SpMV) operator. In these solvers, an updated

Finally, we examined, in detail, several


implementations of two classes of problems that are
representative of many areas of HPC activity. Through
this examination, we were able to illustrate some of the
techniques used to translate parallel algorithms
successfully to run efficiently on GPUs.

Their efforts proved quite fruitful. In one set of


tests, they achieved an effective throughput of 110% of the theoretical bus bandwidth of the target card. They attributed this to
high locality of reference allowing the threads to reuse
previous data from the small local cache without
having to actually fetch it from main memory.

It is clear that GPUs are an effective platform for


HPC. Even those skeptical of the possibilities of
GPGPU admit that significant speedups are likely with
correct algorithm design and careful tuning of the
implementations. The most pessimistic benchmarks
show GPU implementations beating CPUs by factors of
four or five. With the capability to put multiple GPUs
in a single computer, these speedups can be further
extended.

Other tests were not quite so successful, but in all


cases, a tuned data structure improved the performance
of the SpMV operation and even, in some cases,
reduced the memory bandwidth required to transfer the
original matrix data from main memory over the PCI
bus, which, as we have indicated before, is perhaps the
most difficult bottleneck to overcome in GPU
programming [10].

REFERENCES

[1] J. Owens, "GPU architecture overview," in SIGGRAPH '07: ACM SIGGRAPH 2007 Courses. New York, NY, USA: ACM, 2007, pp. 2+. [Online]. Available: http://dx.doi.org/10.1145/1281500.1281643
[2] J. W. Backus, "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Commun. ACM, vol. 21, no. 8, pp. 613-641, 1978.
[3] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, "Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU," SIGARCH Comput. Archit. News, vol. 38, pp. 451-460, June 2010. [Online]. Available: http://doi.acm.org/10.1145/1816038.1816021
[4] S. Che, J. Meng, J. W. Sheaffer, and K. Skadron, "A performance study of general purpose applications on graphics processors," in First Workshop on General Purpose Processing on Graphics Processing Units, 2007.
[5] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, May 2008. [Online]. Available: http://www.idav.ucdavis.edu/publications/print_pub?pub_id=936
[6] Y. Li, J. Dongarra, and S. Tomov, "A note on auto-tuning GEMM for GPUs," in Computational Science - ICCS 2009, ser. Lecture Notes in Computer Science, G. Allen, J. Nabrzyski, E. Seidel, G. van Albada, J. Dongarra, and P. Sloot, Eds. Springer Berlin / Heidelberg, 2009, vol. 5544, pp. 884-892. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-01970-8_89
[7] F. D. Igual, G. Quintana-Ortí, and R. van de Geijn, "Level-3 BLAS on a GPU: Picking the low hanging fruit," FLAME Working Note #37, Universidad Jaume I, Depto. de Ingenieria y Ciencia de Computadores, Technical Report DICC 2009-04-01, Apr. 2009, updated May 21, 2009.
[8] R. Nath, S. Tomov, and J. Dongarra, "Accelerating GPU kernels for dense linear algebra," in Proceedings of the 9th International Conference on High Performance Computing for Computational Science, ser. VECPAR'10. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 83-92. [Online]. Available: http://dl.acm.org/citation.cfm?id=1964238.1964250
[9] O. Schenk, M. Christen, and H. Burkhart, "Algorithmic performance studies on graphics processing units," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1360-1369, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.jpdc.2008.05.008

VI. CONCLUSION

We have looked at GPUs as a platform for high-performance computing applications. In the past,
GPUs have been difficult to program and programs had
to be carefully modified to fit into the graphics
computation model. In addition, hardware shortcomings
of the platform made it unsuitable for some
applications.
Despite these problems, the allure of commoditized
performance hardware spurred many individuals to
continue to develop applications for the hardware. The
attractiveness of GPUs has only increased as GPU performance continues to improve faster than CPU performance.
Gradually, hardware manufacturers and GPGPU
application programmers have overcome many of the
initial barriers to effective use of the platform.
Double-precision floating point data became a
first-class citizen for hardware calculations. Software
tools eased the requirements for GPU programmers to
be familiar with the graphics pipeline. More hardware
interfaces were exposed, allowing for finer tuning of
memory management and computational
load-balancing.
We examined the hardware architectures of GPUs
and considered how those impacted the performance of
various classes of algorithms. Also, we noted the
classes of problems that were likely to be successful in
exploiting the GPU for increased performance.

[10] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009, pp. 18:1-18:11. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654078
[11] N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha, "A memory model for scientific algorithms on graphics processors," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ser. SC '06. New York, NY, USA: ACM, 2006. [Online]. Available: http://doi.acm.org/10.1145/1188455.1188549
[12] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," Queue, vol. 6, pp. 40-53, March 2008. [Online]. Available: http://doi.acm.org/10.1145/1365490.1365500
[13] K. Fatahalian and M. Houston, "A closer look at GPUs," Commun. ACM, vol. 51, pp. 50-57, Oct. 2008. [Online]. Available: http://doi.acm.org/10.1145/1400181.1400197
[14] S. Matsuoka, T. Aoki, T. Endo, A. Nukada, T. Kato, and A. Hasegawa, "GPU accelerated computing: from hype to mainstream, the rebirth of vector computing," Journal of Physics: Conference Series, vol. 180, 2008.
[15] E. Lindholm, J. Nickolls, S. F. Oberman, and J. Montrym, "Nvidia Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[16] S. Kestur, J. D. Davis, and O. Williams, "BLAS comparison on FPGA, CPU and GPU," IEEE Computer Society, 2010, pp. 288-293. [Online]. Available: http://research.microsoft.com/pubs/130834/ISVLSI_FINAL.pdf
[17] R. Vuduc, A. Chandramowlishwaran, J. Choi, M. Guney, and A. Shringarpure, "On the limits of GPU acceleration," in Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, ser. HotPar'10. Berkeley, CA, USA: USENIX Association, 2010, pp. 13-13. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863086.1863099
