Understanding The Efficiency of GPU Algorithms For Matrix-Matrix Multiplication
Stanford University
Abstract
Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable
interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse
of input data, has been widely explored on GPUs. We relax the streaming model’s constraint on input reuse and
perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices
O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix
multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal
GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find
the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations
per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to
cached data will impair the performance of GPU implementations of any computation featuring significant input
reuse.
Categories and Subject Descriptors (according to ACM CCS): I.3.1 [Computer Graphics]: Graphics processors
[…] applications and must run efficiently if GPUs are to become a useful platform for numerical computing.

Cache-aware implementations for CPU architectures have been well studied [TCL98, WPD01], and several GPU algorithms have been developed. Larsen and McAllister [LM01] initially proposed an approach for computing matrix products on graphics hardware. They observed performance equaling that of CPUs, but their computations were limited to 8-bit fixed-point data. Both Hall et al. [HCH03] and Moravánszky [Mor03] describe improved algorithms for modern hardware. Moravánszky reports his implementation is outperformed by optimized CPU code. Despite these results, little work has been done to analyze the performance-limiting factors of these GPU implementations. We perform such an analysis and determine that the algorithms are bandwidth limited due to the inability of GPUs to provide high-bandwidth access to frequently accessed data. Surprisingly, we find that existing algorithms are near-optimal for current hardware. That is, better performance will require architectural changes.
2. Implementations of Matrix-Matrix Multiplication

We consider the problem of computing the product, C = AB, of two large, dense, N×N matrices. We quickly describe naive and optimized CPU algorithms and then delve more deeply into solutions for a GPU.
2.1. Matrix-Matrix Multiplication on CPUs
The following CPU algorithm for multiplying matrices exactly mimics computing the product by hand:

for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
        C[i,j] = 0;
        for (k=0; k<N; k++)
            C[i,j] += A[i,k] * B[k,j];
    }
Unfortunately, while simple, this algorithm suffers from poor locality. Elements of B are accessed columnwise, and therefore not in sequential order in memory (if matrices are stored in row-major order). While each iteration in j reuses row i of A, that row may have been evicted from the cache by the time the inner-most loop completes. As a result, the algorithm is bandwidth limited and displays poor performance and low efficiency (i.e., time spent doing computation versus loading data).
When sustaining cached accesses, the Pentium 4 can fetch a 128-bit SSE value (4 packed 32-bit floats) in a single cycle [HSU∗01]. The basic flaw of the naive triple-for-loop implementation is that the size of its per-loop working set prevents bandwidth amplification from the closest (and fastest) levels of the memory hierarchy.
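As a rough illustration of that fetch width, consider the following SSE-intrinsics sketch (ours, not code from this paper). It assumes B is stored transposed, a layout the next paragraph notes is common, and that N is a multiple of 4; each 128-bit load brings in four packed floats that feed a 4-wide multiply-add:

#include <xmmintrin.h>

/* Dot product of row i of A with row j of the transposed B (Bt).
 * Each _mm_loadu_ps pulls 4 packed 32-bit floats in one 128-bit access;
 * 16-byte-aligned data would permit _mm_load_ps instead. */
float dot4(int N, const float *Arow, const float *Btrow)
{
    __m128 acc = _mm_setzero_ps();
    for (int k = 0; k < N; k += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(Arow + k),
                                         _mm_loadu_ps(Btrow + k)));
    float lane[4];
    _mm_storeu_ps(lane, acc);           /* horizontal sum of the partials */
    return lane[0] + lane[1] + lane[2] + lane[3];
}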
A standard solution “blocks” the inputs into small submatrices that fit in the processor caches, and decomposes the computation of C into subcomputations on the submatrices. Additionally, B is often stored transposed to offer a more cache-friendly representation when walking its columns. The ATLAS [WPD01] software package, benchmarked in section 3, works even harder to achieve cache efficiency. ATLAS tests the host CPU to determine the sizes of system caches, then self-tunes its algorithms to use code snippets that are optimal for the detected cache sizes. Effective cache-aware implementations of matrix multiplication on CPUs achieve much higher effective bandwidth and hence numerical efficiency.
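A minimal C sketch of this blocking strategy follows (our illustration, not ATLAS's tuned code). The block size BS is a hypothetical tuning parameter that a system like ATLAS would derive from the measured cache sizes; for brevity the loops assume N is a multiple of BS and that C starts zeroed:

#define BS 32  /* hypothetical block size; ATLAS would tune this per machine */

/* Blocked C = A*B for row-major N x N matrices. Each (i0,j0,k0) step
 * multiplies a pair of BS x BS submatrices, so all three blocks remain
 * cache-resident while their elements are reused. Assumes C is zeroed. */
void matmul_blocked(int N, const float *A, const float *B, float *C)
{
    for (int i0 = 0; i0 < N; i0 += BS)
        for (int j0 = 0; j0 < N; j0 += BS)
            for (int k0 = 0; k0 < N; k0 += BS)
                for (int i = i0; i < i0 + BS; i++)
                    for (int j = j0; j < j0 + BS; j++) {
                        float acc = C[i*N + j];
                        for (int k = k0; k < k0 + BS; k++)
                            acc += A[i*N + k] * B[k*N + j];
                        C[i*N + j] = acc;
                    }
}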
2.2. Matrix-Matrix Multiplication on the GPU

A simple approach to compute the product of two matrices on a GPU, although feasible only on architectures that support sufficiently long shaders, is to compute elements of the resulting matrix in a single rendering pass. We refer to this approach as NV Single. We store 2×2 blocks of matrix elements in 4-component texels as described by [HCH03]. This packing allows each input fetched from memory to be used twice in a fragment program. Thus the access of data used in the second operation is performed at register speed. Our shader, given below, reads 2 rows from matrix A and 2 columns of B to compute 4 elements of the output matrix. The values of variables i and j are provided to the shader via interpolated texture coordinates.

for (k=0; k<N/2; k++)
    C[i,j].xyzw += A[i,k].xxzz * B[k,j].xyxy +
                   A[i,k].yyww * B[k,j].zwzw;

Each shader program performs 4 inner loop iterations of the CPU algorithm given in section 2.1. GPU texture caches are organized to capture 2D locality, so sequential accesses along either axis are cache coherent. Additionally, shaders executing in parallel on adjacent fragments will fetch the same data from a row or column of the inputs simultaneously. In practice, for sufficiently large matrices, instruction counts exceed the limits of all architectures and a small number of passes are required to complete the computation.
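Written out in scalar form, the swizzled madds above are an ordinary 2×2 block multiply-accumulate. The following C sketch (our notation; the texel struct is an assumption, with x,y holding a block's top row and z,w its bottom row) computes the same update:

typedef struct { float x, y, z, w; } texel;  /* one 2x2 matrix block */

/* c += a*b for 2x2 blocks; the component products line up with the
 * A.xxzz*B.xyxy + A.yyww*B.zwzw swizzles in the shader above. */
texel block2x2_madd(texel c, texel a, texel b)
{
    c.x += a.x * b.x + a.y * b.z;
    c.y += a.x * b.y + a.y * b.w;
    c.z += a.z * b.x + a.w * b.z;
    c.w += a.z * b.y + a.w * b.w;
    return c;
}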
In contrast, multipass algorithms are potentially more cache friendly with respect to inputs at the cost of performing extra framebuffer reads and writes. We implement a variation of the multipass algorithm presented by Larsen and McAllister [LM01]. Our implementation packs 4 consecutive elements from a matrix column into a texel, and thus performs a small number of 4×4 matrix by 4×1 vector products in a shader, as described in detail by [Mor03]. This method makes 4× reuse of data in the texel obtained from matrix B, but only uses data fetched from A once. Six fetches are needed to retrieve the operands for 4 mad operations, so in total this approach requires 1.5× more texture fetches than NV Single. However, since only a few rows and columns of the input textures are accessed in a given rendering pass, cache hit frequency is expected to be higher. A shader performing a single matrix-vector product is given below.

C[i,j].xyzw = accum[i,j].xyzw +
              A[i,k  ].xyzw * B[k/4,j].xxxx +
              A[i,k+1].xyzw * B[k/4,j].yyyy +
              A[i,k+2].xyzw * B[k/4,j].zzzz +
              A[i,k+3].xyzw * B[k/4,j].wwww;

Our NVIDIA implementation, NV Multi, computes a single matrix-vector product per shader. We unroll this kernel 6 times (the maximum possible amount allowed by the driver) to construct our shader for the ATI hardware, ATI Multi. We tested various amounts of unrolling on each platform; the chosen values maximized algorithm performance.
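For comparison with NV Single, the multipass update can likewise be written in scalar form. The C sketch below (our notation again, with the same assumed texel struct) makes the reuse pattern explicit: each component of the B texel is used four times while each A texel is used once, which is the 6-fetch/4-mad ratio noted above:

typedef struct { float x, y, z, w; } texel;  /* 4 column elements per texel */

/* acc += (4x4 block of A) * (4x1 segment of a column of B). a[0..3]
 * stand for the texels A[i,k]..A[i,k+3]; b is B[k/4,j]. Components
 * b.x..b.w each appear four times, mirroring the .xxxx-style swizzles. */
texel colblock_madd(texel acc, const texel a[4], texel b)
{
    acc.x += a[0].x*b.x + a[1].x*b.y + a[2].x*b.z + a[3].x*b.w;
    acc.y += a[0].y*b.x + a[1].y*b.y + a[2].y*b.z + a[3].y*b.w;
    acc.z += a[0].z*b.x + a[1].z*b.y + a[2].z*b.z + a[3].z*b.w;
    acc.w += a[0].w*b.x + a[1].w*b.y + a[2].w*b.z + a[3].w*b.w;
    return acc;
}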
Figure 1: Performance of multiplying square matrices on a 3 GHz Pentium 4, NVIDIA GeForceFX 5900 Ultra, prerelease GeForce 6800 Ultra, ATI Radeon 9800XT, and prerelease Radeon X800XT. [The plot itself, GFLOPS versus matrix dimension (0 to 1024), is not recoverable from this extraction.]

Performance tuning of algorithms on graphics hardware is difficult because vendors do not disclose specific architectural details, such as cache parameters or the physical layout of texture data in memory. We tested a variety of techniques intended to increase cache efficiency and leverage the parallel fragment processing architecture of the GPU. We varied the packing of matrix elements into textures, the un[…]

[Figure: an efficiency plot; only its “% Efficiency” axis (0 to 100) survives the extraction.]
[…] of the total input bandwidth of NV Multi, and even less for ATI Multi. The benefit of adding hardware support for accumulation into floating-point buffers pales in comparison to the advantages of significantly widening the path from the texture fetch units to the arithmetic units, or the ability to do register-level blocking.

We realize that when evaluating non-rendering algorithms on GPUs, it is important to remember that the design of graphics architectures is optimized for graphics workloads. Graphics applications primarily use 8-bit fixed-point formats. Since four times as many values can be fetched per clock, compute-to-fetch ratios are much closer to those of conventional CPUs. Additionally, shaders often utilize interpolated values and constant parameters instead of texture memory for the majority of their inputs. Thus, graphics applications are more likely to perform many instructions per texture fetch. In contrast, practical numerical applications use full-precision floating-point values and depend upon iterating through input textures for access to data. Rebalancing GPUs to include wider or closer caches, or an increased ability to operate on and output larger amounts of data per shader, would make GPU architectures better suited for numerical workloads. Even streaming applications, already well suited for GPU architectures, would benefit from bandwidth improvements for regularly patterned texture accesses.
Although we investigate only matrix-matrix multiplication in detail in this paper, our arguments apply to many algorithms that feature data reuse. A wide variety of linear algebra operations rely upon blocking to amplify bandwidth, and similar concerns apply to many general algorithms that repeatedly compute on large 2D (or n-D) pieces of memory. So long as graphics architectures continue to provide a low-bandwidth connection between arithmetic units and nearby local memories, such algorithms will continue to suffer in efficiency when adapted to GPUs, despite increases in GPU clock rates and raw computational power, and despite the high levels of parallelism and numerical complexity inherent in the operations. Addressing these issues in future hardware designs will make GPUs a significantly more powerful platform for this broad class of applications.
5. Acknowledgments
We would like to thank ATI and NVIDIA for providing access to their latest hardware, and Sean Treichler of NVIDIA and Steve Morein of ATI for invaluable discussions about the design of graphics architectures. Support for this research was provided by the Rambus Stanford Graduate Fellowship and Stanford School of Engineering fellowship programs.
References [WPD01] W HALEY R. C., P ETITET A., D ONGARRA
[ABB∗99] Anderson E., Bai Z., Bischof C., Blackford S., Demmel J., Dongarra J., Du Croz J., Greenbaum A., Hammarling S., McKenney A., Sorensen D.: LAPACK Users' Guide, third ed. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999.

[BFGS03] Bolz J., Farmer I., Grinspun E., Schröder P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph. 22, 3 (2003), 917–924.

[BFH∗04] Buck I., Foley T., Horn D., Sugerman J., Fatahalian K., Houston M., Hanrahan P.: Brook for GPUs: Stream computing on graphics hardware. In Proceedings of ACM SIGGRAPH 2004 (to appear) (2004).

[HCH03] Hall J. D., Carr N. A., Hart J. C.: Cache and bandwidth aware matrix multiplication on the GPU. UIUC Technical Report UIUCDCS-R-2003-2328 (2003).

[HCSL02] Harris M. J., Coombe G., Scheuermann T., Lastra A.: Physically-based visual simulation on graphics hardware. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (2002), Eurographics Association, pp. 109–118.

[HSU∗01] Hinton G., Sager D., Upton M., Boggs D., Carmean D., Kyker A., Roussel P.: The microarchitecture of the Pentium 4 processor. Intel Technology Journal 5, 1 (2001).

[KW03] Krüger J., Westermann R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graph. 22, 3 (2003), 908–916.

[LM01] Larsen E. S., McAllister D.: Fast matrix multiplies using graphics hardware. In Proc. Supercomputing 2001 (2001).

[MA03] Moreland K., Angel E.: The FFT on a GPU. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (2003), Eurographics Association, pp. 112–119.

[Mor03] Moravánszky A.: Dense matrix algebra on the GPU, 2003. http://www.shaderx2.com/shaderx.PDF.

[TCL98] Thottethodi M., Chatterjee S., Lebeck A. R.: Tuning Strassen's matrix multiplication for memory efficiency. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM) (1998), IEEE Computer Society, pp. 1–14.

[WPD01] Whaley R. C., Petitet A., Dongarra J. J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 1–2 (2001), 3–35.