LightSpMV: Faster CSR-based Sparse Matrix-Vector Multiplication on CUDA-enabled GPUs

Yongchao Liu
School of Computational Science & Engineering
Georgia Institute of Technology, Atlanta, GA 30332, USA
Email: [email protected]

Bertil Schmidt
Institut für Informatik
Johannes Gutenberg Universität Mainz, Mainz 55128, Germany
Email: [email protected]

Abstract—Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches have been investigated at the vector and warp levels with atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated LightSpMV using various sparse matrices and further compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. Performance evaluation reveals that on the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with a speedup of up to 2.60 and 2.63 over CUSP, and up to 1.93 and 1.79 over cuSPARSE for single and double precision, respectively. LightSpMV is available at http://lightspmv.sourceforge.net.

I. INTRODUCTION

Sparse matrix-vector multiplication (SpMV) is an essential primitive in sparse linear algebra, and dominates the computational cost in many scientific computing applications such as iterative methods [1] for solving large linear systems and eigenvalue problems, data mining [2], and graph analytics [3], [4]. Hence, acceleration of SpMV operations is an important objective with wide-reaching applications. However, unlike dense matrices, a sparse matrix has only a very small proportion of non-zero elements. Furthermore, these non-zeros often disperse in irregular structures. The sparsity and irregularity of sparse matrices make the efficient implementation of SpMV challenging. Consequently, a wide variety of sparse matrix formats have been proposed, and many of them are tailored to particular applications, computing architectures or matrix structures. Among the existing formats, compressed sparse row (CSR) is predominantly used. Existing CPU-centric software heavily relies on CSR, owing to its efficient compression of both structured and unstructured sparse matrices as well as its good amenability to efficient algorithms designed for CPUs.

As graphics processing units (GPUs) are throughput-oriented many-core processors, harnessing them for efficient processing of matrix operations requires exposing substantial fine-grained parallelism as well as imposing sufficient regularity on execution paths and memory access patterns. Owing to the uniform regularity of dense matrices, high efficiency on GPUs has been achieved for dense matrix operations [5], [6]. However, the acceleration of sparse matrix operations such as SpMV is challenged by the unstructured characteristics of sparse matrices.

As the pioneering work on GPU-based SpMV, Bell and Garland [7] concentrated on the selection of appropriate storage formats, including CSR, diagonal (DIA), coordinate (COO) and ELL, and the design of parallel kernels operating efficiently on the corresponding formats. For CSR, Bell and Garland implemented two parallel kernels: CSR-Scalar and CSR-Vector. CSR-Scalar statically distributes matrix rows over CUDA threads, with each thread processing one row. This kernel intends to take advantage of thread-level fine-grained parallelism, but suffers from uncoalesced memory accesses for threads within a warp. Furthermore, if consecutive rows assigned to a warp have different row lengths (the row length is defined as the number of non-zeros in a row), all of the other threads within a warp have to stay idle until the threads with the longest rows have completed.

Unlike CSR-Scalar, CSR-Vector statically distributes matrix rows over a fixed number of warps (equal to the maximum number of resident warps on a GPU in [7]) in a round-robin fashion. In this kernel, consecutive threads in a warp access consecutive non-zeros in the corresponding row, thus allowing large segments of memory to be accessed with little or no additional cost. However, this kernel can cause under-utilization of hardware resources for short rows with lengths less than the warp size, resulting in low GPU occupancy. For short rows, the performance degradation on GPUs can be compensated by further splitting a warp into a set of equal-sized smaller vectors, where the number of lanes in a vector must be a power of two. For instance, Baskaran and Bordawekar [8] adopted a similar approach to CSR-Vector, but assigned one half-warp to each row. In contrast, the CUSP library [9] first computes the average row length of the whole matrix and then determines the vector size per row based on this average value.

However, none of these approaches has addressed the load imbalance problem, because all of them statically distribute rows over processing elements (threads, vectors or warps). To improve load balancing, recent work including [10], [11] and ours, described in this paper, has been proposed. Ashari et al. [10] designed an adaptive CSR-based SpMV approach using compute unified device architecture (CUDA), called ACSR.

ACSR partitions the whole execution of SpMV into three stages. Stage 1 classifies matrix rows into different bins on the host, based on row lengths, where the rows whose lengths are in the range [2^(i-1) + 1, 2^i) are stored in the i-th bin. This binning approach is inherently analogous to the matrix row slicing approach proposed in [12]. These bins are further partitioned into two separate groups G1 and G2. G1 contains the bins whose rows have large lengths, while G2 comprises the remaining bins. Stage 2 computes the rows in G2 using a bin-specific CUDA kernel similar to CSR-Vector, while Stage 3 computes the rows in G1 using a row-specific kernel that assigns one thread block to each row. Greathouse and Daga [11] also classified the rows based on row lengths, and implemented a CSR-Stream kernel and a CSR-Adaptive kernel using OpenCL [13] to deal with short and long rows, respectively. CSR-Stream statically determines the number of non-zeros assigned to each wavefront, streams all of these values into fast local scratch-pad memory, and performs the computation in parallel by all threads in a wavefront. Similar to ACSR, CSR-Stream also requires grouping matrix rows on the host before launching the kernels. CSR-Adaptive combines CSR-Stream with CSR-Vector, but still relies on host-side matrix row grouping (as of writing this paper, ViennaCL [14] has also incorporated the idea of CSR-Adaptive). The row binning/grouping approach enables ACSR, CSR-Stream and CSR-Adaptive to improve load balancing over variable row lengths using CSR. Nonetheless, note that these approaches intrinsically conduct a lightweight format transformation, albeit with the advantage that the transformation takes less runtime and less extra space than other alternatives to CSR.

In this paper, we present LightSpMV, a novel CSR-based SpMV algorithm on CUDA-enabled GPUs. Our algorithm uses the standard CSR format. Therefore, it does not need any host-side preprocessing of the CSR data structure, and performs the whole SpMV computation by launching only a single CUDA kernel. Technically, it achieves high speed by benefiting from the fine-grained dynamic distribution of matrix rows over vectors, where a warp is virtualized as a single instruction multiple data (SIMD) vector and can be further split into a set of equal-sized smaller vectors for finer-grained vector processing. Two dynamic row distribution approaches, i.e. vector-level distribution and warp-level distribution, have been investigated based on atomic operations and warp shuffle functions. The performance of our algorithm has been evaluated using a set of sparse matrices with variable average row lengths and standard deviations. Performance comparison to the CSR-based SpMV subprograms of the NVIDIA CUSP [9] and cuSPARSE [15] libraries demonstrated that on the same Kepler-based Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE for each matrix in terms of both single and double precision.

II. BACKGROUND

A. Compressed Sparse Row (CSR) Format

CSR is a widely used format for sparse matrix storage. Given a sparse matrix A of size R × C with Nnz non-zeros, CSR iterates over each non-zero element row-by-row in the matrix, left-to-right and top-to-bottom, and then stores all of the entries contiguously in a vector values of size Nnz. Likewise, the corresponding column indices are explicitly stored in a separate vector column_indices of size Nnz. In this case, there is no need to explicitly store the row index of each non-zero. Instead, we build a vector row_offsets of size R + 1 in order to maintain the position offset in vectors values and column_indices for the first non-zero element of each row. Hence, given a row index i, the values and column indices of all non-zeros in the row can be located by the position range [row_offsets[i], row_offsets[i+1]) in vectors values and column_indices, respectively. Fig. 1 illustrates an example.

        0.1 0.7 0   0          row_offsets    = 0 2 4 7 9
   A =  0   0.2 0.8 0          column_indices = 0 1 1 2 0 2 3 1 3
        0.5 0   0.3 0.9        values         = 0.1 0.7 0.2 0.8 0.5 0.3 0.9 0.6 0.4
        0   0.6 0   0.4

Fig. 1: CSR representation of an example sparse matrix
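To make the representation concrete, the following minimal C/C++ sketch (ours, not part of LightSpMV) hard-codes the CSR arrays of the matrix in Fig. 1 and locates the non-zeros of one row through the position range described above.

#include <cstdio>

// CSR arrays of the 4x4 example matrix shown in Fig. 1
static const int    row_offsets[]    = {0, 2, 4, 7, 9};
static const int    column_indices[] = {0, 1, 1, 2, 0, 2, 3, 1, 3};
static const double values[]         = {0.1, 0.7, 0.2, 0.8, 0.5, 0.3, 0.9, 0.6, 0.4};

// print the non-zeros of row i using the range [row_offsets[i], row_offsets[i+1])
static void printRow(int i)
{
    for (int j = row_offsets[i]; j < row_offsets[i + 1]; ++j)
        printf("A[%d][%d] = %g\n", i, column_indices[j], values[j]);
}

int main(void)
{
    printRow(2);   // prints the entries 0.5, 0.3 and 0.9 of the third row
    return 0;
}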
CSR enables good SpMV performance on CPUs, but shows relatively low performance on GPUs. Thus, some new GPU-centric storage formats have been proposed. Bell and Garland [7] introduced the hybrid (HYB) format, which combines ELL and COO. HYB re-organizes the matrix into two distinct parts, one in ELL and the other in COO, and separates the two parts by setting a threshold k on the maximum number of non-zeros per row in the ELL part. Choi et al. [16] extended both CSR and ELL with blocked structures, while Monakov et al. [17] sliced ELL with fixed or variable slice sizes in order to reduce the memory overhead of ELL. Dang and Schmidt [12] proposed a sliced COO format by decomposing the matrix into a number of slices, while Su and Keutzer [18] introduced the Cocktail format, a combination of many different sparse matrix formats, and further associated it with an OpenCL-based framework to recommend the best representation of a given sparse matrix on different platforms.

These alternatives may improve the performance of GPU-based SpMV. However, there are also a few associated drawbacks. Firstly, if completely replacing CSR with a new format, we would need to devote a large amount of engineering effort because many existing software programs are based on CSR. Secondly, if using the new format only for GPU-based SpMV by overriding the computation based on CSR, the dynamic conversion from CSR to the new format would incur additional computational and storage cost. In terms of computational overhead, the format conversion may take a runtime comparable to that of tens to hundreds of SpMV operations. In this case, even if SpMV is used in iterative methods, we still require a considerable number of SpMV iterations to amortize the cost incurred by format conversion. In terms of storage overhead, intermediate storage is a potential hurdle for certain formats. For instance, ELL is well suited to SIMD vector computing, but has a memory footprint directly proportional to the product of the number of rows and the largest row length. This makes ELL only applicable to small row lengths, and also more likely to encounter memory allocation failure compared to CSR. In this regard, even though ELL could demonstrate better performance on vector and SIMD architectures, it still cannot be used to fully replace CSR, especially for software demanding high reliability.


Even for HYB (which is more memory efficient than ELL), this risk still exists if k is not properly determined to distinguish the ELL part from the COO part.

B. Memory Overhead of Typical Sparse Matrix Formats

On a given computing architecture, besides speed, memory overhead is an important factor that should be taken into account. Given a matrix A, CSR uses (R + 1)⌈log2(Nnz)⌉ bits to store the vector row_offsets, and ⌈log2(C)⌉Nnz bits to store the vector column_indices. Assuming that each non-zero is represented by b bits (e.g. b = 32 for single precision and b = 64 for double precision), the memory footprint of the vector values is bNnz bits. This results in a total memory footprint of (R + 1)⌈log2(Nnz)⌉ + (⌈log2(C)⌉ + b)Nnz bits for the CSR-represented matrix A. COO is the simplest compact representation of sparse matrices, which consecutively stores the row index, the column index and the value of every non-zero in three separate vectors. Thus, its memory footprint can be easily computed as (⌈log2(R)⌉ + ⌈log2(C)⌉ + b)Nnz bits. ELL is similar to CSR, but merely stores non-zeros and column indices. For matrix A, ELL first computes the largest row length K, and then stores A as a dense matrix of size R × K by zero-padding the rows whose lengths are smaller than K. Hence, ELL requires RKb bits to store all non-zeros and RK⌈log2(C)⌉ bits to store the column indices, resulting in a total of RK(⌈log2(C)⌉ + b) bits. If most row lengths in A are much smaller than K, ELL results in a severe waste of memory space. As for HYB, its memory footprint is subject to the threshold k (0 ≤ k ≤ C) used to separate the ELL part from the COO part, as mentioned above. The ELL part is represented as a dense matrix of size R × k. Given a row, if its length is less than or equal to k, the whole row will be dispatched to the ELL part (zero-padded if applicable); otherwise, the first k non-zeros in the row will be stored in the ELL part, leaving the rest to the COO part. Assuming that N′nz (Nnz − R·k ≤ N′nz ≤ Nnz) non-zeros are stored in the COO part, the memory footprint of the HYB-represented matrix A can be computed as Rk(⌈log2(C)⌉ + b) + (⌈log2(R)⌉ + ⌈log2(C)⌉ + b)N′nz bits. To determine the value of k, Bell and Garland [7] suggested to ensure that at least max(4096, R/3) rows contain ≤ k non-zeros on CUDA-enabled GPUs.
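As a worked example of the bookkeeping above, the host-side sketch below (helper names are ours) evaluates the CSR and COO footprints from these formulas, using ⌈log2(·)⌉ for the index widths; the matrix dimensions are illustrative values roughly matching the webbase-1M matrix listed later in Table I.

#include <cmath>
#include <cstdio>

// bits needed to index a range of size n, i.e. ceil(log2(n))
static double indexBits(double n) { return ceil(log2(n)); }

// CSR footprint in bits: (R + 1)*ceil(log2(Nnz)) + (ceil(log2(C)) + b)*Nnz
static double csrBits(double R, double C, double Nnz, double b)
{
    return (R + 1) * indexBits(Nnz) + (indexBits(C) + b) * Nnz;
}

// COO footprint in bits: (ceil(log2(R)) + ceil(log2(C)) + b)*Nnz
static double cooBits(double R, double C, double Nnz, double b)
{
    return (indexBits(R) + indexBits(C) + b) * Nnz;
}

int main(void)
{
    // illustrative square matrix, single precision values (b = 32)
    double R = 1000005, C = 1000005, Nnz = 3105536, b = 32;
    printf("CSR: %.1f MB\n", csrBits(R, C, Nnz, b) / 8e6);
    printf("COO: %.1f MB\n", cooBits(R, C, Nnz, b) / 8e6);
    return 0;
}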
C. Sparse Matrix-Vector Multiplication

In this paper, we investigate the general SpMV equation y = αAx + βy, where A is a sparse matrix of size R × C with Nnz non-zeros, x of size C and y of size R are dense vectors with x being the source and y being the destination, and both α and β are scalars. Defining Ai,j (0 ≤ i < R and 0 ≤ j < C) to denote the element of A at position (i, j), and xi (yi) to denote the i-th element of vector x (y), SpMV can be algorithmically computed as

    yi = β · yi + Σ_{Ai,j ≠ 0} Ai,j · xj        (1)

From Equation (1), SpMV can be interpreted as a procedure of enumerating each non-zero of A and then updating the corresponding element in y, with a total of 2(Nnz + R) floating point operations (FLOPs). Fig. 2 shows the pseudocode of the sequential CSR-based SpMV implementation.

procedure sequentialCSRSpMV()
  // iterate each row
  for (i = 0; i < R; ++i) do
    // compute the dot product of two vectors
    sum = 0;
    for (j = row_offsets[i]; j < row_offsets[i + 1]; ++j) do
      sum += values[j] * x[column_indices[j]];
    end for
    // finalize and save the multiplication result
    y[i] = α * sum + β * y[i];
  end for
end procedure

Fig. 2: Pseudocode of the CSR-based SpMV
Interestingly, except for the proprietary cuSPARSE library [15], none of the GPU-accelerated SpMV implementations in the literature has implemented the general equation y = αAx + βy, to the best of our knowledge. Instead, they have used one of the following two simplified variants: y = Ax + y [7], [10], [12], [18] and y = Ax [11], [14], [19]. It should be noted that although CUSP is designed based on [7], it actually implements y = Ax, rather than y = Ax + y. The benefit of using these variants is that they isolate the sparse components of the computation from the dense ones, and thereby make the overall number of FLOPs independent of the number of matrix rows. This leads to a total of 2Nnz FLOPs for y = Ax + y and 2Nnz − 1 FLOPs for y = Ax. On the contrary, the drawback is that the GPU capability of accelerating SpMV cannot be truly reflected by the variants, since the multiplication β · yi may have a substantial effect on the overall performance, especially in cases where A has many rows but only few non-zeros per row. In addition, considering that each non-zero of A is accessed exactly once, it is reasonable to say that, given a storage format, SpMV performance highly depends on both the speed of loading all of the non-zeros in A and the efficiency of reusing data in x, or y, or both.

D. Kepler GPU Architecture

A CUDA-enabled GPU is a shared-memory processor comprising a fully configurable array of CUDA cores. These CUDA cores are further grouped into a set of multi-threaded streaming multiprocessors (SMs). The GPU architecture has evolved through three generations: Tesla, Fermi and Kepler [20], and we will focus only on the Kepler architecture.

Kepler adopts a new SMX processor architecture for SMs, with each SM comprising 192 CUDA cores. All of the CUDA cores in an SM share 64 KB of on-chip memory, and this on-chip memory can be flexibly configured as 48 KB shared memory with 16 KB L1 cache, or 32 KB shared memory with 32 KB L1 cache, or 16 KB shared memory with 48 KB L1 cache. Kepler offers an L1/L2 caching hierarchy, where the L1 cache resides within each SM and the L2 cache is a dedicated memory of size up to 1,536 KB. The L1 cache merely provides caching service for its corresponding SM, whereas the L2 cache provides unified caching for all SMs across the GPU. Local memory is cached by both the L1 and L2 caches, whereas how global memory is cached depends on whether the global memory is writable or read-only. Writable global memory can only be cached by the L2 cache. This is because Kepler no longer allows L1 to cache global memory, and instead reserves it for local memory accesses such as register spills and stack data.


Besides the L2 cache, read-only global memory can be optionally cached by the 48 KB read-only cache, and the use of this read-only cache is automatically managed by the compiler. To enable the read-only cache for read-only global memory, programmers must give the compiler a hint using the const __restrict keyword. As for texture memory, Kepler introduces texture objects (or bindless textures), allowing programs to map textures at any time and to pass texture handles as function arguments. With texture objects, there is no need to know at compile time which textures will be used at runtime, thus enabling much more dynamic execution and flexible programming.
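To illustrate the compiler hint mentioned above, a CSR kernel could qualify its read-only inputs as in the sketch below (a simplified one-thread-per-row kernel of our own, not the LightSpMV kernel); const __restrict__ pointers allow the corresponding loads to be routed through the 48 KB read-only cache on Kepler.

// minimal sketch: read-only CSR inputs qualified with const __restrict__
__global__ void spmvScalarSketch(const int*   __restrict__ row_offsets,
                                 const int*   __restrict__ column_indices,
                                 const float* __restrict__ values,
                                 const float* __restrict__ x,
                                 float* y, int R, float alpha, float beta)
{
    // one thread per row (CSR-Scalar style), for illustration only
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < R) {
        float sum = 0.0f;
        for (int j = row_offsets[row]; j < row_offsets[row + 1]; ++j)
            sum += values[j] * x[column_indices[j]];
        y[row] = alpha * sum + beta * y[row];
    }
}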
III. PARALLELIZATION USING CUDA

As mentioned above, the major drawback of CSR-Vector is its static distribution of matrix rows over a fixed number of vectors of size V lanes, where V ∈ {2, 4, 8, 16, 32}. This static distribution can yield good performance for matrices whose row lengths have low variation. However, for matrices with highly varied row lengths, some vectors may be assigned many more non-zeros than others. In this case, we might encounter severely unbalanced distributions of non-zeros over vectors, where some vectors finish much earlier than others. A promising approach to alleviating load imbalance is to dynamically distribute rows over vectors. In this paper, we propose two dynamic row distribution schemes: vector-level dynamic row distribution and warp-level dynamic row distribution, operating at the vector and warp levels, respectively.

A. Vector-level Dynamic Row Distribution

1) Row retrieval and broadcasting: The core of our vector-level dynamic row distribution is the dynamic row retrieval and broadcasting at the vector level, which works as follows. Initially, each vector obtains a row index i from a global row management (GRM) data structure, and computes y[i]. When a vector has completed its current row, it retrieves a new row from GRM. GRM contains an integer-type variable row_counter, which is stored in global memory and represents the lowest row index among all of the unprocessed rows. Upon receiving a new request from any vector, GRM increments row_counter and returns the old value as the response to the request. Since GRM is responsible for serving requests from any vector across the device, updates to row_counter must be mutually exclusive. In our implementation, we have used the atomic function atomicAdd to realize the mutually exclusive access. Note that row_counter must be reset to zero before launching the SpMV kernel (e.g. using the cudaMemset function).

In our kernel, the first thread within each vector is responsible for retrieving a new row index from GRM. After gaining the row index, the first thread needs to broadcast it to all of the other threads within the vector. One conventional approach is to write the row index into a variable stored in shared memory, which must be declared with the volatile keyword in order to allow for implicit synchronization between all threads within the vector, and then ask all of the other threads to get the row index by reading this variable. This approach requires separate store and load operations on shared memory. Meanwhile, each thread block also has to reserve some shared memory for the thread communication within each vector.

The Kepler architecture implements a new warp shuffle instruction, associated with a set of intrinsic functions (e.g. __shfl and __shfl_down), for communication between threads within a warp. Using the warp shuffle instruction, we can realize the store-and-load operation of the conventional approach in just a single step, with no need for shared memory to perform the data exchange. Furthermore, the warp shuffle instruction takes an optional width parameter which permits the sub-division of a warp into vectors, and allows each vector to behave as a separate entity with a starting logical lane ID of 0. Fig. 3(a) shows the pseudocode for the vector-level dynamic row index retrieval and broadcasting.

(a) vector-level row distribution
function getRowIndexVector()
  // compute the lane ID of each thread
  laneId = threadIdx.x % V;
  // get the row index
  if (laneId == 0) then
    row = atomicAdd(row_counter, 1);
  end if
  // broadcast the row index to all other threads within the vector
  return (row = __shfl(row, 0, V));
end function

(b) warp-level row distribution
function getRowIndexWarp()
  // compute the lane ID and vector ID of each thread within the warp
  warpLaneId = threadIdx.x & (warpSize - 1);
  warpVectorId = warpLaneId / V;
  // get the row index
  if (warpLaneId == 0) then
    row = atomicAdd(row_counter, warpSize / V);
  end if
  // broadcast the base row index to all other threads within the warp
  return (row = __shfl(row, 0, warpSize) + warpVectorId);
end function

Fig. 3: Pseudocode for dynamic row retrieval and broadcasting
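The pseudocode in Fig. 3 maps almost directly onto CUDA device functions. The sketch below shows one possible rendering, assuming V is a compile-time template parameter and rowCounter is a single integer in global memory that has been reset to zero before the launch; it uses the Kepler-era __shfl intrinsic of CUDA 6.5 (newer toolkits require the __shfl_sync variants) and is an illustration of the technique, not the exact LightSpMV source.

// dynamic row retrieval and broadcasting of Fig. 3 as CUDA device functions
template <int V>
__device__ int getRowIndexVector(int* rowCounter)
{
    int laneId = threadIdx.x % V;        // lane ID within the vector
    int row = 0;
    if (laneId == 0)                     // one atomic operation per vector
        row = atomicAdd(rowCounter, 1);
    return __shfl(row, 0, V);            // broadcast to the whole vector
}

template <int V>
__device__ int getRowIndexWarp(int* rowCounter)
{
    int warpLaneId = threadIdx.x & (warpSize - 1);   // lane ID within the warp
    int warpVectorId = warpLaneId / V;               // vector ID within the warp
    int row = 0;
    if (warpLaneId == 0)                 // one atomic operation per warp
        row = atomicAdd(rowCounter, warpSize / V);
    // broadcast the base row index across the warp, then offset by the vector ID
    return __shfl(row, 0, warpSize) + warpVectorId;
}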
2) Kernel configuration and launch: In our implementation, we launch a fixed number of vectors for the kernel. This is realized by configuring a fixed number of threads per block and a fixed number of thread blocks per grid. The number of threads per block, denoted as T, is set to the maximum allowable number of threads per block for a given GPU architecture (equal to 1,024 for Kepler GPUs). The number of thread blocks per grid, denoted as B, is calculated by dividing the product of the number of SMs and the maximum number of resident threads per SM by T. In this way, we intend to achieve high GPU occupancy by saturating the computation. We have conducted a few tests by varying B, and found that either increasing or decreasing B results in worse performance.

On the host side, our implementation incurs almost zero additional overhead, and merely needs a simple driver routine in charge of both the initialization of row_counter and the kernel launch. As mentioned above, the GPU hardware resources will be underutilized once the row length is less than the vector size. To address this discrepancy, inspired by [9], we calculate the average number of non-zeros per row (i.e. rint(Nnz/R)), and then determine the vector size at runtime. Fig. 4 shows the pseudocode for the host-side driver.

procedure spmvHostDriver(cudaDeviceProp& prop, ...)
  // set thread block and kernel grid configuration
  T = prop.maxThreadsPerBlock;
  B = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor / numThreadsPerBlock;
  // reset row_counter to zero
  cudaMemset(row_counter, 0, sizeof(int));
  // calculate the average row length
  mean = rint(Nnz / R);
  // launch the kernel
  if (mean ≤ 2) then            // set the vector size to 2
    spmvCudaKernel <<< B, T >>> (2, ...);
  else if (mean ≤ 4) then       // set the vector size to 4
    spmvCudaKernel <<< B, T >>> (4, ...);
  else if (mean ≤ 64) then      // set the vector size to 8
    spmvCudaKernel <<< B, T >>> (8, ...);
  else                          // set the vector size to 32
    spmvCudaKernel <<< B, T >>> (32, ...);
  end if
end procedure

Fig. 4: Pseudocode for the host-side driver

On the device, we have stored the CSR-based matrix A in global memory, represented by three vectors row_offsets, column_indices and values.


Since they are read-only, we have used the const __restrict keyword to hint to the compiler to use the 48 KB read-only cache on Kepler. Since vector y must be writable, it is deployed in global memory. As for vector x, it is read-only but often randomly accessed, due to the irregular dispersion of non-zeros within rows. In this regard, we have deployed x in texture memory using the texture object application programming interface. It is worth mentioning that we have attempted to deploy A in texture memory, but observed a slight performance drop. According to [20], this read-only cache is accessible only by the texture unit on the Fermi architecture, but has been further opened up by Kepler to make it directly accessible to SMs for general load operations. This implies that this read-only cache can be used for texture fetches, read-only global memory loads, or even both simultaneously. Thus, we would not expect to see any performance difference between texture fetches and read-only global memory loads. Nonetheless, since the use of texture fetches showed slightly worse performance, we have decided to deploy A in global memory as mentioned above. The pseudocode for the SpMV kernel based on the vector-level row distribution is given in Fig. 5.

procedure spmvCudaKernel()
  // get the lane ID and vector ID for the thread
  laneId = threadIdx.x % V;
  vectorId = threadIdx.x / V;
  shared volatile int space[NUM_VECTORS_PER_BLOCK][2];
  // get a row index
  row = getRowIndexVector();
  while (row < R) do
    // get the starting and end offsets for the row
    if (laneId < 2) then
      space[vectorId][laneId] = row_offsets[row + laneId];
    end if
    row_start = space[vectorId][0];
    row_end = space[vectorId][1];
    // compute the dot product of the vector
    sum = 0;
    if (V == warpSize) then
      i = row_start - (row_start & (V - 1)) + laneId;
      if (i ≥ row_start && i < row_end) then
        sum += values[i] * x[column_indices[i]];
      end if
      for (i += V; i < row_end; i += V) do
        sum += values[i] * x[column_indices[i]];
      end for
    else
      for (i = row_start + laneId; i < row_end; i += V) do
        sum += values[i] * x[column_indices[i]];
      end for
    end if
    // intra-vector reduction
    sum *= α;
    for (i = V >> 1; i > 0; i >>= 1) do
      sum += __shfl_down(sum, i, V);
    end for
    // save the result
    if (laneId == 0) then
      y[row] = sum + β * y[row];
    end if
    // get a new row
    row = getRowIndexVector();
  end while
end procedure

Fig. 5: Pseudocode for the vector-level distribution kernel
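The texture-object binding of x mentioned above can be set up on the host with a few runtime API calls. The sketch below (names ours, error checking omitted) creates a texture object over a device vector of doubles, declaring its elements as int2 so that the kernel can read them with tex1Dfetch<int2>() and reassemble doubles as shown later in Fig. 6(b).

#include <cuda_runtime.h>
#include <cstring>

// bind a device vector x_d of C doubles to a 1D texture object (read as int2)
cudaTextureObject_t makeTextureForX(const double* x_d, size_t C)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = const_cast<double*>(x_d);
    resDesc.res.linear.desc = cudaCreateChannelDesc<int2>();   // 2 x 32-bit channels
    resDesc.res.linear.sizeInBytes = C * sizeof(double);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t texX = 0;
    cudaCreateTextureObject(&texX, &resDesc, &texDesc, NULL);
    return texX;   // pass to the kernel; release with cudaDestroyTextureObject()
}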
B. Warp-level Dynamic Row Distribution

For the vector-level distribution, the number of atomic operations issued by a single warp is subject to the vector size and can be calculated by dividing the warp size by the vector size. As mentioned above (also see Fig. 4), the vector size is dynamically determined as per the average number of non-zeros per row. Hence, for very small vector sizes, the latency incurred by the atomic operations could overwhelmingly dominate the overall execution time, since each vector merely has very few non-zeros per row on average to compute. For example, if the vector size is set to 2, 16 atomic operations will be issued by a single warp, but only two non-zeros per row on average will be computed by each vector. It is well known that in the single instruction multiple thread (SIMT) architecture, a warp executes one common instruction at a time. This implies that if one atomic operation is issued by more than one of the threads within a warp, all of these atomic operations will be serialized, but will never be interrupted across the GPU prior to their completion. Although the order in which these atomic operations occur is undefined, it is ensured that all of the row indices assigned to the whole warp are consecutive. This has inspired the design of our warp-level dynamic row distribution, which significantly reduces the number of atomic operations per warp.

The warp-level distribution enables only one atomic operation to be issued by a single warp, distributing 32/V consecutive rows to all vectors within the warp at a time. In addition to the reduced number of atomic operations per warp, the warp-level distribution ensures that the row indices assigned to a vector are consecutive and that, within the vector, threads with larger lane IDs access larger row indices, which is generally preferred for GPUs. In contrast, the vector-level distribution is not able to guarantee this ordering. Meanwhile, it is also possible that the row indices assigned to a vector are not consecutive, due to the undefined execution order of atomic operations between threads within the warp to which the vector belongs.

In general, the procedure for row index retrieval and broadcasting works as follows. As for the vector-level distribution, the first thread in a warp takes charge of retrieving a new base row index from GRM, and broadcasts this base row index to all of the other threads within the warp. Subsequently, each thread calculates its own vector ID within the warp, and then obtains the row index assigned to itself by summing up the base row index and its vector ID. The pseudocode for the warp-level dynamic row index retrieval and broadcasting is shown in Fig. 3(b). To write the CUDA kernel for the warp-level distribution, we only need to replace the function getRowIndexVector in Fig. 5 with the function getRowIndexWarp, keeping the remaining code untouched.


C. Double Precision Support

To enable support for double precision, we need to make the following two code changes for both the vector-level and warp-level kernels. One concerns the intra-vector reduction based on warp shuffle functions, and the other the texture fetches of the values in vector x. This is because neither warp shuffle functions nor texture fetches provide support for double-precision floating point values. To address both problems, one commonly used solution is to re-interpret one double-type value as two integer-type values.

To implement the double-type __shfl_down for the intra-vector reduction, we first re-interpret the double-type value as an int2-vector-type value tmp, then exchange the two integer values in tmp by two runs of the integer-type __shfl_down, and finally re-interpret tmp back to the double-type value (see Fig. 6(a)) [21]. For the texture fetches, we bind the double-type vector x to an int2-vector-type texture object, invoke the texture function to re-interpret the double-type value as one int2-vector-type value tmp, and then convert tmp back to the double-type value using the intrinsic function __hiloint2double, as in CUSP [9] (see Fig. 6(b)).

(a) Warp shuffle function for double precision
function shfl_down(value, delta, vectorSize)
  int2 tmp = *reinterpret_cast<int2*>(&value);
  tmp.x = __shfl_down(tmp.x, delta, vectorSize);
  tmp.y = __shfl_down(tmp.y, delta, vectorSize);
  return *reinterpret_cast<double*>(&tmp);
end function

(b) Texture fetch for double precision
function texFetch(x, i)
  int2 tmp = tex1Dfetch<int2>(x, i);
  return __hiloint2double(tmp.y, tmp.x);
end function

Fig. 6: Warp shuffle and texture fetch for double precision
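To show how the two helpers in Fig. 6 plug into the double-precision kernels, the fragment below (a sketch following the structure of Fig. 5, not the complete LightSpMV kernel) lets each lane of a vector accumulate its share of one row, reading x through the texture object and reducing with the double-precision shuffle; as in Fig. 6, the pre-CUDA-9 __shfl_down intrinsic is used.

// double-precision helpers corresponding to Fig. 6
__device__ double shflDownDouble(double value, int delta, int vectorSize)
{
    int2 tmp = *reinterpret_cast<int2*>(&value);
    tmp.x = __shfl_down(tmp.x, delta, vectorSize);
    tmp.y = __shfl_down(tmp.y, delta, vectorSize);
    return *reinterpret_cast<double*>(&tmp);
}

__device__ double texFetchDouble(cudaTextureObject_t x, int i)
{
    int2 tmp = tex1Dfetch<int2>(x, i);
    return __hiloint2double(tmp.y, tmp.x);
}

// partial dot product of row [rowStart, rowEnd) followed by the intra-vector
// reduction; after the loops, lane 0 of the vector holds the row's dot product
template <int V>
__device__ double rowDotProductDouble(int rowStart, int rowEnd, int laneId,
                                      const int* column_indices,
                                      const double* values,
                                      cudaTextureObject_t x)
{
    double sum = 0.0;
    for (int i = rowStart + laneId; i < rowEnd; i += V)
        sum += values[i] * texFetchDouble(x, column_indices[i]);
    for (int i = V >> 1; i > 0; i >>= 1)
        sum += shflDownDouble(sum, i, V);
    return sum;
}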
IV. PERFORMANCE EVALUATION

A. Experimental Design

We have evaluated our algorithm from three perspectives: (i) performance of the two variants based on the vector-level and the warp-level distribution approaches; (ii) performance comparison to the CSR-based SpMV subprograms in the open-source CUSP (version 0.4); and (iii) performance comparison to the CSR-based SpMV subprograms in the closed-source cuSPARSE (in the CUDA 6.5 toolkit). A variety of sparse matrices have been used for the performance evaluation, among which one half are obtained from NVIDIA Research (http://www.nvidia.com/content/NV_Research/matrices.zip) and the other half from the University of Florida sparse matrix collection [23]. Table I provides the information of the sparse matrices used in this study, where all of the matrices are listed top-to-bottom in ascending order of average row length. This set of sparse matrices spans a broad spectrum of average row lengths ranging from 3 up to 2,633, with standard deviations varying from 0 up to 4,210.

TABLE I: Information of sparse matrices used

  Name               Rows / Cols         Nnz           μ / σ          Src
  webbase-1M         1,000,005           3,105,536     3 / 25         N
  dblp-2010          326,186             1,615,400     5 / 8          F
  in-2004            1,382,908           16,917,053    12 / 37        F
  uk-2002            18,520,486          298,113,762   16 / 28        F
  cop20k_A           121,192             2,624,331     22 / 14        N
  eu-2005            862,664             19,235,140    22 / 29        F
  indochina-2004     7,414,866           194,109,311   26 / 216       F
  nlpkkt120          3,542,400           96,845,792    27 / 3         F
  qcd5_4             49,152              1,916,928     39 / 0         N
  rma10              46,835              2,374,001     51 / 28        N
  pwtk               217,918             11,634,424    53 / 5         N
  shipsec1           140,874             7,813,404     55 / 11        N
  kron_g500-logn21   2,097,152           182,082,942   87 / 756       F
  rail4284           4,284 / 1,092,610   11,279,748    2,633 / 4,210  N

N and F denote NVIDIA Research and the University of Florida sparse matrix collection, respectively; μ is the average row length and σ is the standard deviation of row lengths.

To measure speed, we have used the billion FLOPs per second (GFLOPS) metric, which is computed as (2Nnz − 1)/(t × 10^9) for the equation y = Ax, as 2Nnz/(t × 10^9) for the equation y = Ax + y, and as 2(Nnz + R)/(t × 10^9) for the equation y = αAx + βy, where t is the wall-clock runtime in seconds. In our tests, we iteratively execute each evaluated algorithm 1,000 times and then average the total runtime to get the runtime t for a single run. It should be stressed that t does not include the host-side data preparation time, which is usually application specific and supposed to be the same for all CSR-based approaches, for both the sparse matrix and the two vectors. In the following, we refer to the equation y = αAx + βy, by default, for the performance evaluation.
__shfl_down, and finally re-interpret tmp back to the
double-type value (see Fig. 6(a)) [21]. For the texture fetches, All tests are conducted on a workstation with two hex-core
we bind the double-type vector x to an int2-vector-type texture Intel Xeon X5650 2.67 GHz CPUs and 96 GB RAM, running
object, invoke the texture function to re-interpret the double- the Linux operating system (Ubuntu 14.04). This workstation
type value as one int2-vector-type value tmp, and then convert has been further equipped with a Kepler-based Tesla K40c
tmp back to the double-type value using the intrinsic function GPU. This GPU comprises 15 SMs (a total of 2,880 CUDA
__hiloint2double as in CUSP [9] (see Fig. 6(b)). cores), works at a GPU clock rate of 745 MHz and a memory
clock rate of 3,004 MHz, and has 12 GB device memory
IV. P ERFORMANCE E VALUATION associated with a unified L2 cache of size 1.5 MB. All of the
CUDA-based programs evaluated are compiled using CUDA
A. Experimental Design 6.5 toolkit, along with the GNU GCC 4.8.2 compiler, where
We have evaluated our algorithm from three perspectives: the GPU architecture is specified as -arch sm_35 and the
(i) performance of the two variants based on the vector-level -O3 optimization level is used.
and the warp-level distribution approaches; (ii) performance
comparison to the CSR-based SpMV subprograms in the open- B. Evaluation of the Two Variants
source CUSP (version 0.4); and (iii) performance comparison
to the CSR-based SpMV subprograms in the closed-source Firstly, we have evaluated and compared the performance
cuSPARSE (in CUDA 6.5 toolkit). A variety of sparse matrices of our two kernel variants: the vector-level kernel and the
have been used for the performance evaluation, among which warp-level kernel (see Table II). From Table II, the vector-
one half are obtained from NVIDA Research (http://www. level kernel produces an average performance of 14.8 GFLOPS
nvidia.com/content/NV Research/matrices.zip) and the other with the maximum performance of 27.0 GFLOPS for single
half from the University of Florida sparse matrix collection precision, and an average performance of 12.2 GFLOPS with
[23]. Table I provides the information of the sparse matrices the maximum performance of 20.9 GFLOPS for double pre-
used in this study, where all of the matrices are listed top-to- cision. In contrast, the warp-level kernel yields an average
bottom in ascending order of average row length. This set of performance of 21.7 GFLOPS with the maximum performance
sparse matrices spans a broad spectrum of average row lengths of 32.0 GFLOPS for singe precision, and an average perfor-
ranging from 3 up to 2,633 with standard deviations varying mance of 16.6 GCUPS with the maximum performance of 23.8
from 0 up to 4,210. GFLOPS for double precision.


TABLE II: Performance of the vector- and warp-level kernels

                       Warp / Vector (GFLOPS)          Speedup
  Matrix               Single          Double          Single   Double
  webbase-1M           14.7 / 3.6      13.0 / 3.5      4.15     3.71
  dblp-2010            11.4 / 5.1      9.6 / 4.8       2.25     1.98
  in-2004              19.3 / 10.4     15.6 / 9.4      1.85     1.66
  uk-2002              22.0 / 13.0     17.7 / 11.5     1.70     1.54
  cop20k_A             22.6 / 13.4     16.2 / 11.6     1.69     1.40
  eu-2005              24.1 / 15.5     18.9 / 13.4     1.55     1.41
  indochina-2004       22.5 / 15.8     17.4 / 13.1     1.42     1.34
  nlpkkt120            25.3 / 15.1     19.3 / 12.7     1.68     1.52
  qcd5_4               31.9 / 21.4     23.8 / 17.8     1.49     1.34
  rma10                28.0 / 22.5     21.4 / 18.0     1.24     1.19
  pwtk                 31.0 / 27.0     23.0 / 20.9     1.15     1.10
  shipsec1             32.0 / 26.3     23.3 / 20.4     1.22     1.14
  kron_g500-logn21     4.8 / 4.8       4.0 / 4.0       1.00     1.00
  rail4284             13.5 / 13.5     9.3 / 9.3       1.00     1.00

Warp and Vector denote the warp-level and vector-level kernel, respectively; Single and Double denote single and double precision, respectively.

The warp-level kernel is always superior to the vector-level kernel for the first 12 sparse matrices with average row length ≤ 64, while for the remaining two matrices with average row length > 64, the two kernels demonstrate the same performance. This observation can be explained as follows. As indicated in Fig. 4, both kernels may set the vector size V to 2, 4 or 8 for average row length ≤ 64, and to 32 for average row length > 64. In this regard, for the first 12 matrices, the warp-level kernel can issue 16×, 8×, or 4× fewer atomic operations per warp than the vector-level kernel, leading to better performance. As for the rest, both kernels issue only one atomic operation per row, and in theory are supposed to have identical performance. Overall, the warp-level kernel achieves an average speedup of 1.67 and 1.52, with a maximum speedup of 4.15 and 3.71 for single and double precision, respectively, compared to the vector-level kernel.

As mentioned above, the number of atomic operations per warp in the vector-level kernel is inversely proportional to V. This implies that as V increases, the speedup of the warp-level kernel over the vector-level kernel is supposed to decrease and ultimately reach one. From the table, it can be observed that the reported speedups demonstrate a trend consistent with our expectation with respect to both single and double precision, where the largest speedups of 4.15 and 3.71 are achieved with V = 4 and the lowest speedups of 1.00 and 1.00 with V = 32, for single and double precision respectively. Moreover, for a fixed vector size V, the speedups of the warp-level kernel over the vector-level kernel are expected to decrease along with the increase of average row lengths. This is because larger row lengths lead to more arithmetic computation per warp, and are thus able to reduce the influence of atomic operations within a warp on the overall execution time. From Table II, the observed speedups for the middle 11 matrices (from the 2nd to the 12th matrix) with the same V = 8 show a roughly consistent trend with our expectation, albeit with slight fluctuations.

C. Comparison to CUSP

Secondly, we have compared LightSpMV to the two CSR-based SpMV template subprograms in CUSP: spmv_csr_scalar_tex (CSR-Scalar) and spmv_csr_vector_tex (CSR-Vector). As mentioned in Section II-C, both of these subprograms target the simplified equation y = Ax. For a fair comparison, we have implemented separate CUDA kernels for this simplified equation based on the warp-level distribution approach. Fig. 7 shows the speedups of our kernels over CSR-Scalar and CSR-Vector. From the figure, LightSpMV is far superior to CSR-Scalar, achieving average speedups of 10.76 and 8.73 with maximum speedups of 22.76 and 13.87 for single and double precision, respectively. CSR-Vector runs much faster than CSR-Scalar, but is still inferior to our algorithm in each case. Compared to CSR-Vector, the average speedups of our algorithm are 1.72 and 1.70, and the maximum speedups are 2.60 and 2.63 for single and double precision, respectively. In particular, for the matrix rail4284, both LightSpMV and CSR-Vector assign a warp to process one row, but the former is still able to run 1.30× faster than the latter in terms of single precision, and 1.08× faster in terms of double precision. This performance superiority reflects, to a great extent, the effectiveness of our warp-level dynamic row distribution approach.

[Figure: two bar charts (single and double precision) plotting, for each matrix, the speedup of LightSpMV over the CUSP CSR-Scalar and CSR-Vector kernels.]
Fig. 7: Performance comparison to CUSP

D. Comparison to cuSPARSE

Finally, we have compared our algorithm to the two CSR-based SpMV subprograms in cuSPARSE: cusparseScsrmv and cusparseDcsrmv for single and double precision, respectively, both of which implement the same equation y = αAx + βy as LightSpMV. Fig. 8 shows the speedups of our algorithm compared to cuSPARSE. From the figure, LightSpMV outperforms cuSPARSE in each case, where the former yields an average speedup of 1.47 with a maximum speedup of 1.93 for single precision, and an average speedup of 1.32 with a maximum speedup of 1.79 for double precision. The speedups are considerable, since we implement the true analog of the basic linear algebra subprograms (BLAS) gemv operation with no need of any a priori knowledge about the sparsity and irregularity of the matrices used.


[Figure: bar chart plotting, for each matrix, the speedup of LightSpMV over cuSPARSE in single and double precision.]
Fig. 8: Performance comparison to cuSPARSE

V. CONCLUSION

SpMV is a widely used kernel in scientific computing, and its performance is also critical to the overall performance of many applications. Typically, SpMV operations are bandwidth-limited and their performance is highly dependent on the distribution of non-zeros in a given matrix. Hence, to fully optimize SpMV, researchers prefer to take the structure of the matrix into account when choosing both the storage format used in memory and the corresponding computational kernel. CSR is the predominantly used storage format in existing CPU-centric software tools, but unfortunately does not exhibit high performance on CUDA-enabled GPUs.

In this paper, we have presented LightSpMV, a parallelized CSR-based SpMV implementation using the CUDA programming model. In general, the major advantages of LightSpMV can be summarized in two aspects. One is the convenient integration of our CUDA kernels into existing CPU-centric software programs, as we implement the kernels as CUDA C++ template functions, do not need any host-side pre-processing of the CSR data structure, and merely launch a single CUDA kernel to perform the whole SpMV computation. The other is high speed compared to existing CSR-based GPU-centric implementations. To achieve high speed on GPUs, we have investigated two parallelization approaches at the vector and warp levels, respectively, by introducing two dynamic row distribution schemes using atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated the performance of LightSpMV using a set of sparse matrices with variable average row lengths and standard deviations, and have further compared this performance to that of the corresponding subprograms in the top-performing CUSP and cuSPARSE libraries. In the tests, we take as input a sparse matrix file, represented in Matrix Market format [22], and parse the matrix file using the read_matrix_market_file subprogram in CUSP [9]. Performance evaluation shows that on the same Tesla K40c GPU, LightSpMV is superior to CUSP and cuSPARSE for each matrix in terms of single and double precision. Specifically, LightSpMV is able to run up to 2.60× and 2.63× faster than the CSR-Vector kernel in CUSP, and up to 1.93× and 1.79× faster than cuSPARSE for single and double precision, respectively.
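For completeness, loading a Matrix Market file into a device-resident CSR matrix with the CUSP subprogram mentioned above could look like the following minimal sketch (the file name is a placeholder).

#include <cusp/csr_matrix.h>
#include <cusp/io/matrix_market.h>

int main(void)
{
    // parse a Matrix Market file into a CSR matrix in device memory
    cusp::csr_matrix<int, float, cusp::device_memory> A;
    cusp::io::read_matrix_market_file(A, "example.mtx");
    return 0;
}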
ACKNOWLEDGMENT

We acknowledge funding by US National Science Foundation grant IIS-1416259, the Center for Computational Sciences (SRFN) Johannes Gutenberg University Mainz, and the Carl-Zeiss-Foundation.

REFERENCES

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, Society for Industrial Mathematics, 2003.
[2] E. J. Im, K. Yelick, Optimization of Sparse Matrix Kernels for Data Mining, Proc. of the Workshop on Text Mining, 2001.
[3] J. R. Gilbert, S. Reinhardt, V. B. Shah, High Performance Graph Algorithms from Parallel Sparse Matrices, Lecture Notes in Computer Science, vol. 4699, pp. 260-269, 2007.
[4] X. Yang, S. Parthasarathy, P. Sadayappan, Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining, Proc. VLDB Endow., vol. 4, no. 4, pp. 231-242, 2011.
[5] S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, E. S. Quintana-Ortí, Solving Dense Linear Systems on Graphics Processors, Lecture Notes in Computer Science, vol. 5168, pp. 739-748, 2008.
[6] V. Volkov, J. W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[7] N. Bell, M. Garland, Implementing Sparse Matrix-Vector Multiplication on Throughput-oriented Processors, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009.
[8] M. M. Baskaran, R. Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, IBM Research Report RC24704, IBM, 2009.
[9] N. Bell, M. Garland, CUSP: Generic Parallel Algorithms for Sparse Matrix and Graph Computations (v0.4), http://cusplibrary.github.io, 2014.
[10] A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, P. Sadayappan, Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 781-792, 2014.
[11] J. L. Greathouse, M. Daga, Efficient Sparse Matrix-Vector Multiplication on GPUs using the CSR Storage Format, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 769-780, 2014.
[12] H. V. Dang, B. Schmidt, CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using Atomic Operations, Parallel Computing, vol. 39, no. 1, pp. 737-750, 2013.
[13] Khronos Group, The Open Standard for Parallel Programming of Heterogeneous Systems, https://www.khronos.org/opencl, 2014.
[14] K. Rupp, F. Rudolf, J. Weinbub, ViennaCL - A High Level Linear Algebra Library for GPUs and Multi-Core CPUs, in Int'l Workshop on GPUs and Scientific Applications, 2010.
[15] NVIDIA, The NVIDIA CUDA Sparse Matrix Library (cuSPARSE), in CUDA 6.5 toolkit, 2014.
[16] J. W. Choi, A. Singh, R. W. Vuduc, Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs, ACM SIGPLAN Not., vol. 45, pp. 115-126, 2010.
[17] A. Monakov, A. Lokhmotov, A. Avetisyan, Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures, in Proc. HiPEAC, LNCS, pp. 111-125, 2010.
[18] B. Y. Su, K. Keutzer, clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs, in Proc. of the Int'l Conf. on Supercomputing, 2012.
[19] I. Reguly, M. Giles, Efficient Sparse Matrix-Vector Multiplication on Cache-based GPUs, Innovative Parallel Computing, pp. 1-12, 2012.
[20] NVIDIA, NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, NVIDIA White Paper, 2013.
[21] J. Luitjens, Faster Parallel Reductions on Kepler, http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler, 2014.
[22] R. F. Boisvert, R. Pozo, K. Remington, R. F. Barrett, J. J. Dongarra, Matrix Market: a Web Resource for Test Matrix Collections, Proceedings of the IFIP TC2/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement, pp. 125-137, 1997.
[23] T. A. Davis, Y. Hu, The University of Florida Sparse Matrix Collection, ACM Transactions on Mathematical Software, vol. 38, no. 1, 2011.
89
Authorized licensed use limited to: University of Warwick. Downloaded on January 12,2022 at 17:23:12 UTC from IEEE Xplore. Restrictions apply.

You might also like