Understanding The Efficiency of GPU Algorithms For Matrix-Matrix Multiplication
Stanford University
Abstract
Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable
interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse
of input data, has been widely explored on GPUs. We relax the streaming model’s constraint on input reuse and
perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices
O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix
multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal
GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find
the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations
per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to
cached data will impair the performance of GPU implementations of any computation featuring significant input
reuse.
Categories and Subject Descriptors (according to ACM CCS): I.3.1 [Computer Graphics]: Graphics processors
[…] applications and must run efficiently if GPUs are to become a useful platform for numerical computing.

Cache-aware implementations for CPU architectures have been well studied [TCL98, WPD01], and several GPU algorithms have been developed. Larsen and McAllister [LM01] initially proposed an approach for computing matrix products on graphics hardware. They observed performance equaling that of CPUs, but their computations were limited to 8-bit fixed-point data. Both Hall et al. [HCH03] and Moravánszky [Mor03] describe improved algorithms for modern hardware. Moravánszky reports his implementation is outperformed by optimized CPU code. Despite these results, little work has been done to analyze the performance-limiting factors of these GPU implementations. We perform such an analysis and determine that the algorithms are bandwidth limited due to the inability of GPUs to provide high-bandwidth access to frequently accessed data. Surprisingly, we find that existing algorithms are near-optimal for current hardware. That is, better performance will require architectural changes.
2. Implementations of Matrix-Matrix Multiplication

We consider the problem of computing the product, C = AB, of two large, dense, N×N matrices. We quickly describe naive and optimized CPU algorithms and then delve more deeply into solutions for a GPU.
2.1. Matrix-Matrix Multiplication on CPUs
The following CPU algorithm for multiplying matrices exactly mimics computing the product by hand:

for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
        C[i,j] = 0;
        for (k=0; k<N; k++)
            C[i,j] += A[i,k] * B[k,j];
    }
Unfortunately, while simple, this algorithm suffers from poor locality. Elements of B are accessed columnwise, and therefore not in sequential order in memory (if matrices are stored in row-major order). While each iteration in j reuses row i of A, that row may have been evicted from the cache by the time the inner-most loop completes. As a result, the algorithm is bandwidth limited and displays poor performance and low efficiency (i.e., time spent doing computation versus loading data).
When sustaining cached accesses, the Pentium 4 can fetch a 128-bit SSE value (4 packed 32-bit floats) in a single cycle [HSU∗01]. The basic flaw of the naive triple-for-loop implementation is that the size of its per-loop working set prevents bandwidth amplification from the closest (and fastest) levels of the memory hierarchy.
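As a rough illustration of that fetch width, consider the following SSE-intrinsics sketch (ours, not code from this paper). It assumes B is stored transposed, a layout the next paragraph notes is common, and that N is a multiple of 4; each 128-bit load brings in four packed floats that feed a 4-wide multiply-add:

#include <xmmintrin.h>

/* Dot product of row i of A with row j of the transposed B (Bt).
 * Each _mm_loadu_ps pulls 4 packed 32-bit floats in one 128-bit access;
 * 16-byte-aligned data would permit _mm_load_ps instead. */
float dot4(int N, const float *Arow, const float *Btrow)
{
    __m128 acc = _mm_setzero_ps();
    for (int k = 0; k < N; k += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(Arow + k),
                                         _mm_loadu_ps(Btrow + k)));
    float lane[4];
    _mm_storeu_ps(lane, acc);           /* horizontal sum of the partials */
    return lane[0] + lane[1] + lane[2] + lane[3];
}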
A standard solution “blocks” the inputs into small submatrices that fit in the processor caches, and decomposes the computation of C into subcomputations on the submatrices. Additionally, B is often stored transposed to offer a more cache-friendly representation when walking its columns. The ATLAS [WPD01] software package, benchmarked in section 3, works even harder to achieve cache efficiency. ATLAS tests the host CPU to determine the sizes of system caches, then self-tunes its algorithms to use code snippets that are optimal for the detected cache sizes. Effective cache-aware implementations of matrix multiplication on CPUs achieve much higher effective bandwidth and hence numerical efficiency.
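A minimal C sketch of this blocking strategy follows (our illustration, not ATLAS's tuned code). The block size BS is a hypothetical tuning parameter that a system like ATLAS would derive from the measured cache sizes; for brevity the loops assume N is a multiple of BS and that C starts zeroed:

#define BS 32  /* hypothetical block size; ATLAS would tune this per machine */

/* Blocked C = A*B for row-major N x N matrices. Each (i0,j0,k0) step
 * multiplies a pair of BS x BS submatrices, so all three blocks remain
 * cache-resident while their elements are reused. Assumes C is zeroed. */
void matmul_blocked(int N, const float *A, const float *B, float *C)
{
    for (int i0 = 0; i0 < N; i0 += BS)
        for (int j0 = 0; j0 < N; j0 += BS)
            for (int k0 = 0; k0 < N; k0 += BS)
                for (int i = i0; i < i0 + BS; i++)
                    for (int j = j0; j < j0 + BS; j++) {
                        float acc = C[i*N + j];
                        for (int k = k0; k < k0 + BS; k++)
                            acc += A[i*N + k] * B[k*N + j];
                        C[i*N + j] = acc;
                    }
}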
2.2. Matrix-Matrix Multiplication on the GPU

A simple approach to compute the product of two matrices on a GPU, although feasible only on architectures that support sufficiently long shaders, is to compute elements of the resulting matrix in a single rendering pass. We refer to this approach as NV Single. We store 2×2 blocks of matrix elements in 4-component texels as described by [HCH03]. This packing allows each input fetched from memory to be used twice in a fragment program. Thus the access of data used in the second operation is performed at register speed. Our shader, given below, reads 2 rows from matrix A and 2 columns of B to compute 4 elements of the output matrix. The values of variables i and j are provided to the shader via interpolated texture coordinates.

for (k=0; k<N/2; k++)
    C[i,j].xyzw += A[i,k].xxzz * B[k,j].xyxy +
                   A[i,k].yyww * B[k,j].zwzw;

Each shader program performs 4 inner loop iterations of the CPU algorithm given in section 2.1. GPU texture caches are organized to capture 2D locality, so sequential accesses along either axis are cache coherent. Additionally, shaders executing in parallel on adjacent fragments will fetch the same data from a row or column of the inputs simultaneously. In practice, for sufficiently large matrices, instruction counts exceed the limits of all architectures and a small number of passes are required to complete the computation.
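Written out in scalar form, the swizzled madds above are an ordinary 2×2 block multiply-accumulate. The following C sketch (our notation; the texel struct is an assumption, with x,y holding a block's top row and z,w its bottom row) computes the same update:

typedef struct { float x, y, z, w; } texel;  /* one 2x2 matrix block */

/* c += a*b for 2x2 blocks; the component products line up with the
 * A.xxzz*B.xyxy + A.yyww*B.zwzw swizzles in the shader above. */
texel block2x2_madd(texel c, texel a, texel b)
{
    c.x += a.x * b.x + a.y * b.z;
    c.y += a.x * b.y + a.y * b.w;
    c.z += a.z * b.x + a.w * b.z;
    c.w += a.z * b.y + a.w * b.w;
    return c;
}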
In contrast, multipass algorithms are potentially more cache friendly with respect to inputs at the cost of performing extra framebuffer reads and writes. We implement a variation of the multipass algorithm presented by Larsen and McAllister [LM01]. Our implementation packs 4 consecutive elements from a matrix column into a texel, and thus performs a small number of 4×4 matrix by 4×1 vector products in a shader, as described in detail by [Mor03]. This method makes 4× reuse of data in the texel obtained from matrix B, but only uses data fetched from A once. Six fetches are needed to retrieve the operands for 4 mad operations, so in total this approach requires 1.5× more texture fetches than NV Single. However, since only a few rows and columns of the input textures are accessed in a given rendering pass, cache hit frequency is expected to be higher. A shader performing a single matrix-vector product is given below.

C[i,j].xyzw = accum[i,j].xyzw +
              A[i,k  ].xyzw * B[k/4,j].xxxx +
              A[i,k+1].xyzw * B[k/4,j].yyyy +
              A[i,k+2].xyzw * B[k/4,j].zzzz +
              A[i,k+3].xyzw * B[k/4,j].wwww;

Our NVIDIA implementation, NV Multi, computes a single matrix-vector product per shader. We unroll this kernel 6 times (the maximum possible amount allowed by the driver) to construct our shader for the ATI hardware, ATI Multi. We tested various amounts of unrolling on each platform; the chosen values maximized algorithm performance.
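For comparison with NV Single, the multipass update can likewise be written in scalar form. The C sketch below (our notation again, with the same assumed texel struct) makes the reuse pattern explicit: each component of the B texel is used four times while each A texel is used once, which is the 6-fetch/4-mad ratio noted above:

typedef struct { float x, y, z, w; } texel;  /* 4 column elements per texel */

/* acc += (4x4 block of A) * (4x1 segment of a column of B). a[0..3]
 * stand for the texels A[i,k]..A[i,k+3]; b is B[k/4,j]. Components
 * b.x..b.w each appear four times, mirroring the .xxxx-style swizzles. */
texel colblock_madd(texel acc, const texel a[4], texel b)
{
    acc.x += a[0].x*b.x + a[1].x*b.y + a[2].x*b.z + a[3].x*b.w;
    acc.y += a[0].y*b.x + a[1].y*b.y + a[2].y*b.z + a[3].y*b.w;
    acc.z += a[0].z*b.x + a[1].z*b.y + a[2].z*b.z + a[3].z*b.w;
    acc.w += a[0].w*b.x + a[1].w*b.y + a[2].w*b.z + a[3].w*b.w;
    return acc;
}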
Figure 1: Performance of multiplying square matrices on a 3 GHz Pentium 4, NVIDIA GeForceFX 5900 Ultra, prerelease GeForce 6800 Ultra, ATI Radeon 9800XT, and prerelease Radeon X800XT. [The plot itself, GFLOPS versus matrix dimension (0 to 1024), is not recoverable from this extraction.]

Performance tuning of algorithms on graphics hardware is difficult because vendors do not disclose specific architectural details, such as cache parameters or the physical layout of texture data in memory. We tested a variety of techniques intended to increase cache efficiency and leverage the parallel fragment processing architecture of the GPU. We varied the packing of matrix elements into textures, the un[…]

[Figure: an efficiency plot; only its “% Efficiency” axis (0 to 100) survives the extraction.]
[…] of the total input bandwidth of NV Multi, and even less for ATI Multi. The benefit of adding hardware support for accumulation into floating-point buffers pales in comparison to the advantages of significantly widening the path from the texture fetch units to the arithmetic units, or the ability to do register-level blocking.

We realize that when evaluating non-rendering algorithms on GPUs, it is important to remember that the design of graphics architectures is optimized for graphics workloads. Graphics applications primarily use 8-bit fixed-point formats. Since four times as many values can be fetched per clock, compute-to-fetch ratios are much closer to those of conventional CPUs. Additionally, shaders often utilize interpolated values and constant parameters instead of texture memory for the majority of their inputs. Thus, graphics applications are more likely to perform many instructions per texture fetch. In contrast, practical numerical applications use full-precision floating-point values and depend upon iterating through input textures for access to data. Rebalancing GPUs to include wider or closer caches, or an increased ability to operate on and output larger amounts of data per shader, would make GPU architectures better suited for numerical workloads. Even streaming applications, already well suited for GPU architectures, would benefit from bandwidth improvements for regularly patterned texture accesses.
Although we investigate only matrix-matrix multiplication in detail in this paper, our arguments apply to many algorithms that feature data reuse. A wide variety of linear algebra operations rely upon blocking to amplify bandwidth, and similar concerns apply to many general algorithms that repeatedly compute on large 2D (or n-D) pieces of memory. So long as graphics architectures continue to provide a low-bandwidth connection between arithmetic units and nearby local memories, such algorithms will continue to suffer in efficiency when adapted to GPUs, despite increases in GPU clock rates and raw computational power, and despite the high levels of parallelism and numerical complexity inherent in the operations. Addressing these issues in future hardware designs will make GPUs a significantly more powerful platform for this broad class of applications.
5. Acknowledgments
We would like to thank ATI and NVIDIA for providing access to their latest hardware, and Sean Treichler of NVIDIA and Steve Morein of ATI for invaluable discussions about the design of graphics architectures. Support for this research was provided by the Rambus Stanford Graduate Fellowship and Stanford School of Engineering fellowship programs.
References [WPD01] W HALEY R. C., P ETITET A., D ONGARRA
[ABB∗99] Anderson E., Bai Z., Bischof C., Blackford S., Demmel J., Dongarra J., Du Croz J., Greenbaum A., Hammarling S., McKenney A., Sorensen D.: LAPACK Users' Guide, third ed. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999.

[BFGS03] Bolz J., Farmer I., Grinspun E., Schröder P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph. 22, 3 (2003), 917–924.

[BFH∗04] Buck I., Foley T., Horn D., Sugerman J., Fatahalian K., Houston M., Hanrahan P.: Brook for GPUs: Stream computing on graphics hardware. In Proceedings of ACM SIGGRAPH 2004 (to appear) (2004).

[HCH03] Hall J. D., Carr N. A., Hart J. C.: Cache and bandwidth aware matrix multiplication on the GPU. UIUC Technical Report UIUCDCS-R-2003-2328 (2003).

[HCSL02] Harris M. J., Coombe G., Scheuermann T., Lastra A.: Physically-based visual simulation on graphics hardware. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (2002), Eurographics Association, pp. 109–118.

[HSU∗01] Hinton G., Sager D., Upton M., Boggs D., Carmean D., Kyker A., Roussel P.: The microarchitecture of the Pentium 4 processor. Intel Technology Journal 5, 1 (2001).

[KW03] Krüger J., Westermann R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graph. 22, 3 (2003), 908–916.

[LM01] Larsen E. S., McAllister D.: Fast matrix multiplies using graphics hardware. In Proc. Supercomputing 2001 (2001).

[MA03] Moreland K., Angel E.: The FFT on a GPU. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (2003), Eurographics Association, pp. 112–119.

[Mor03] Moravánszky A.: Dense matrix algebra on the GPU, 2003. http://www.shaderx2.com/shaderx.PDF.

[TCL98] Thottethodi M., Chatterjee S., Lebeck A. R.: Tuning Strassen's matrix multiplication for memory efficiency. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM) (1998), IEEE Computer Society, pp. 1–14.

[WPD01] Whaley R. C., Petitet A., Dongarra J. J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 1–2 (2001), 3–35.