LightSpMV: Faster CSR-based Sparse Matrix-Vector Multiplication on CUDA-enabled GPUs
Abstract—Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of several alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed through fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches have been investigated, at the vector and warp levels, with atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated LightSpMV using various sparse matrices and further compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. Performance evaluation reveals that on the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with a speedup of up to 2.60 and 2.63 over CUSP, and up to 1.93 and 1.79 over cuSPARSE, for single and double precision, respectively. LightSpMV is available at http://lightspmv.sourceforge.net.

…fine-grained parallelism as well as imposing sufficient regularity on execution paths and memory access patterns. Owing to the uniform regularity of dense matrices, high efficiency on GPUs has been achieved for dense matrix operations [5], [6]. However, the acceleration of sparse matrix operations such as SpMV is challenged by the unstructured characteristics of sparse matrices.

As the pioneering work on GPU-based SpMV, Bell and Garland [7] concentrated on the selection of appropriate storage formats, including CSR, diagonal (DIA), coordinate (COO) and ELL, and on the design of parallel kernels operating efficiently on the corresponding formats. For CSR, Bell and Garland implemented two parallel kernels: CSR-Scalar and CSR-Vector. CSR-Scalar statically distributes matrix rows over CUDA threads, with each thread processing one row. This kernel intends to exploit thread-level fine-grained parallelism, but suffers from uncoalesced memory accesses for threads within a warp. Furthermore, if consecutive rows assigned to a warp have different row lengths (the row length is defined as the number of non-zeros in a row), all of the other threads within the warp have to sit idle until the threads with the longest rows have completed.
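To make the CSR-Scalar strategy concrete, the following is a minimal CPU-side sketch of its one-row-per-thread logic, where the sequential loop over i stands in for the CUDA threads; all array and function names here are illustrative, not from the paper. CSR stores a matrix as three arrays: row offsets (size R+1), column indices and values (both size Nnz).

```cpp
#include <cassert>
#include <vector>

// Sketch of the CSR-Scalar strategy: logical "thread" i owns row i and walks
// that row's non-zeros. The row length rowPtr[i+1] - rowPtr[i] varies per row,
// which on a GPU causes threads of the same warp to finish at different times.
std::vector<double> csr_scalar_spmv(int R,
                                    const std::vector<int>& rowPtr,
                                    const std::vector<int>& colIdx,
                                    const std::vector<double>& vals,
                                    const std::vector<double>& x) {
    std::vector<double> y(R, 0.0);
    for (int i = 0; i < R; ++i) {          // one logical thread per row
        double dot = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            dot += vals[k] * x[colIdx[k]]; // gather from x via column index
        y[i] = dot;
    }
    return y;
}
```

On a GPU, consecutive threads here would read vals[rowPtr[i]], vals[rowPtr[i+1]], … — addresses far apart in memory — which is the uncoalesced access pattern described above.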
Authorized licensed use limited to: University of Warwick. Downloaded on January 12,2022 at 17:23:12 UTC from IEEE Xplore. Restrictions apply.
In this paper, we investigate the general SpMV operation y = αAx + βy, where A is a sparse matrix of size R × C with Nnz non-zeros, x of size C and y of size R are dense vectors, with x being the source and y the destination, and both α and β are scalars. Defining Ai,j (0 ≤ i < R and 0 ≤ j < C) to denote the element of A at position (i, j), and xi (yi) to denote the i-th element of vector x (y), SpMV can be algorithmically computed as

    yi = β · yi + α · Σ_{Ai,j ≠ 0} Ai,j · xj    (1)

From Equation (1), SpMV can be interpreted as a procedure that enumerates each non-zero of A and updates the corresponding element of y, for a total of 2(Nnz + R) floating point operations (FLOPs). Fig. 2 shows the pseudocode of the sequential CSR-based SpMV implementation.

Kepler adopts a new SMX processor architecture for SMs, with each SM comprising 192 CUDA cores. All of the CUDA cores in an SM share 64 KB of on-chip memory, which can be flexibly configured as 48 KB shared memory with 16 KB L1 cache, 32 KB shared memory with 32 KB L1 cache, or 16 KB shared memory with 48 KB L1 cache. Kepler offers an L1/L2 caching hierarchy, where the L1 cache resides within each SM and the L2 cache is a dedicated memory of size up to 1,536 KB. The L1 cache provides caching only for its corresponding SM, whereas the L2 cache provides unified caching for all SMs across the GPU. Local memory is cached by both the L1 and L2 caches, whereas how global memory is cached depends on whether it is writable or read-only. Writable global memory can only be cached by the L2 cache, because Kepler no longer allows L1 to cache global memory…
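A minimal C++ rendering of the sequential CSR-based SpMV of Equation (1) might look as follows (array and function names are illustrative, not taken from Fig. 2):

```cpp
#include <cassert>
#include <vector>

// Sequential CSR-based SpMV computing y = alpha*A*x + beta*y per Equation (1).
// Roughly 2*(Nnz + R) FLOPs: a multiply-add per non-zero, plus the scaling
// and update of each of the R elements of y.
void csr_spmv(int R, double alpha, double beta,
              const std::vector<int>& rowPtr,
              const std::vector<int>& colIdx,
              const std::vector<double>& vals,
              const std::vector<double>& x,
              std::vector<double>& y) {
    for (int i = 0; i < R; ++i) {
        double dot = 0.0;                  // sum over non-zeros: A[i][j] * x[j]
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            dot += vals[k] * x[colIdx[k]];
        y[i] = beta * y[i] + alpha * dot;  // Equation (1)
    }
}
```

The inner loop touches only the stored non-zeros of row i, which is what makes the CSR format attractive: the work is proportional to Nnz rather than to R × C.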
TABLE II: Performance of the vector- and warp-level kernels (each value pair: single / double precision)

Matrix            Vector-level    Warp-level     Speedup
webbase-1M        14.7 / 3.6      13.0 / 3.5     4.15 / 3.71
dblp-2010         11.4 / 5.1       9.6 / 4.8     2.25 / 1.98
in-2004           19.3 / 10.4     15.6 / 9.4     1.85 / 1.66
uk-2002           22.0 / 13.0     17.7 / 11.5    1.70 / 1.54
cop20k_A          22.6 / 13.4     16.2 / 11.6    1.69 / 1.40
eu-2005           24.1 / 15.5     18.9 / 13.4    1.55 / 1.41
indochina-2004    22.5 / 15.8     17.4 / 13.1    1.42 / 1.34
nlpkkt120         25.3 / 15.1     19.3 / 12.7    1.68 / 1.52
qcd5_4            31.9 / 21.4     23.8 / 17.8    1.49 / 1.34
rma10             28.0 / 22.5     21.4 / 18.0    1.24 / 1.19
pwtk              31.0 / 27.0     23.0 / 20.9    1.15 / 1.10
shipsec1          32.0 / 26.3     23.3 / 20.4    1.22 / 1.14

[Figure: Per-matrix speedups over the CSR-Scalar and CSR-Vector kernels, with a single-precision panel (up to 22.76) and a double-precision panel (up to 13.87).]
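Both the vector- and warp-level kernels compared above rest on the same idea: rows are claimed dynamically from a shared atomic counter rather than assigned statically, so execution units that draw short rows simply fetch more of them. The following is a CPU-side sketch of that scheduling pattern using std::atomic and std::thread, where each worker stands in for a CUDA vector or warp; all names are illustrative, and the GPU version would use atomicAdd and warp shuffle functions instead.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Dynamic row distribution sketch: workers repeatedly claim the next
// unprocessed row via an atomic fetch-and-add, balancing load across rows
// of uneven length. Each row index is claimed exactly once, so each y[i]
// is written by exactly one worker and no further synchronization is needed.
void dynamic_spmv(int R, int numWorkers,
                  const std::vector<int>& rowPtr,
                  const std::vector<int>& colIdx,
                  const std::vector<double>& vals,
                  const std::vector<double>& x,
                  std::vector<double>& y) {
    std::atomic<int> nextRow{0};           // shared row counter
    auto worker = [&]() {
        for (;;) {
            int i = nextRow.fetch_add(1);  // atomically claim one row
            if (i >= R) break;             // all rows distributed
            double dot = 0.0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
                dot += vals[k] * x[colIdx[k]];
            y[i] = dot;
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < numWorkers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```

The choice between claiming single rows (as here) or small batches of rows trades scheduling overhead against load-balancing granularity, which is essentially the trade-off between the vector- and warp-level variants measured in Table II.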