Performance Evaluation of Multithreaded Sparse Matrix-Vector Multiplication Using OpenMP
The BCSR representation of the example matrix is shown as follows.
bcsrNz = {1,1,1,1,0,0,2,2,2,0,2,0,3,3,0,0,3,3,4,0,4,4,4,0,5,5,0,5,5,0};
bcsrCols = {0,6,3,3,6};
bcsrRowStart = {0,2,3,5}.
Taking a 2*2 block size as an example, the pseudo code in Figure 2 implements SpMV based on BCSR using OpenMP, where blockRows = m/r is the number of block rows in the matrix.
We can see from the code that x0, x1, y0 and y1 are each reused twice in registers. The data access locality of vector x is also improved, since four elements of the sparse matrix are processed each time.
//The following line is an OpenMP directive.
#pragma omp parallel for private(j, k, h, t)
//The loop below is the serial SpMV based on BCSR.
for (i = 0; i < blockRows; i++)
{
    t = i << 1;
    register double y0 = y[t];
    register double y1 = y[t+1];
    for (j = bcsrRowStart[i]; j < bcsrRowStart[i+1]; j++)
    {
        k = j << 2;
        h = bcsrCols[j];
        register double x0 = x[h];
        register double x1 = x[h+1];
        y0 += bcsrNz[k]   * x0;
        y0 += bcsrNz[k+1] * x1;
        y1 += bcsrNz[k+2] * x0;
        y1 += bcsrNz[k+3] * x1;
    }
    y[t]   = y0;
    y[t+1] = y1;
}
Figure 2. Pseudo code of multithreaded SpMV for BCSR using OpenMP
4. Load balancing of SpMV
Considering load balancing, scheduling overhead and other factors, the OpenMP API specifies four scheduling schemes: static, dynamic, guided and runtime [20]. The runtime scheme defers the choice of schedule type and chunk size to runtime, reading them from environment variables, so we mainly discuss the other three scheduling schemes.
The static scheme divides the iterations of a loop into pieces whose size is given by the input parameter chunk. The pieces are statically assigned to the threads in a round-robin fashion. By default, the iterations are divided into P evenly sized contiguous chunks, where P is the number of threads. Since the schedule can be determined statically, this method has the least runtime overhead.
The dynamic scheme also divides the iterations of a loop into pieces, but distributes the pieces dynamically: each thread obtains the next set of iterations only after it has finished its current piece. If not specified by the input parameter chunk, the default number of iterations in a piece is 1.
The guided scheme works in a fashion similar to the dynamic scheme. The partition size can be calculated by Formula (1), where LC_k is the count of unassigned iterations, P is the number of threads and λ is a scale factor (a value of 1 or 2 is recommended). The guided scheme also accepts an input parameter chunk, which specifies the minimum number of iterations to dispatch each time; when no chunk is specified, its default value is 1. The chunk sizes in guided scheduling begin large and decrease over time, resulting in fewer synchronizations than dynamic scheduling while still providing load balancing.
NC_k = LC_k / (λ · P)                                   (1)
As an example, Table 1 gives the partition sizes for a problem with N = 1000 iterations and P = 4 threads.
Table 1. Partition sizes for different schemes

Scheme               Partition sizes
static (default)     250, 250, 250, 250
dynamic (chunk=50)   50, 50, 50, ..., 50 (20 pieces of 50)
guided (chunk=50)    250, 188, 141, 106, 79, 59, 50, 50, 50, 27
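The guided partition sizes in Table 1 can be reproduced directly from Formula (1). The small C program below is a minimal sketch (not part of our measured implementation) assuming a scale factor λ = 1:

#include <stdio.h>

int main(void)
{
    int remaining = 1000;   /* N from Table 1 */
    int P = 4;              /* number of threads */
    int chunk = 50;         /* minimum chunk size */
    while (remaining > 0) {
        int piece = (remaining + P - 1) / P;   /* NC_k = ceil(LC_k / (1*P)) */
        if (piece < chunk) piece = chunk;      /* never dispatch fewer than chunk iterations */
        if (piece > remaining) piece = remaining;
        printf("%d ", piece);
        remaining -= piece;
    }
    printf("\n");
    return 0;
}

Its output, 250 188 141 106 79 59 50 50 50 27, matches the guided row of Table 1.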
In the case of the CSR storage format, a coarse-grain row partitioning scheme is commonly applied [17]. The matrix is partitioned into blocks of rows according to the number of threads, and each block is assigned to one thread. As described in the left part of Figure 3, each thread operates on its own parts of the csrNz, csrCols,
csrRowStart and y arrays. Their accesses to these data are therefore independent and can proceed in parallel. All threads access elements of the x array; since accesses to x are read only, the data can reside in each processor's cache without causing invalidation traffic due to the cache coherency protocol [16]. An advantage of row partitioning is that each thread operates on its own part of the y array, which allows better temporal locality on its elements in the case of distinct caches.
Figure 3. Matrix partition for SpMV
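As a minimal sketch of row partitioning (assuming the CSR arrays csrRowStart, csrCols and csrNz used throughout this paper), the CSR SpMV can be written as a single parallel loop over rows; with the default static schedule each thread receives one contiguous block of rows, as in the left part of Figure 3, and the schedule clause can be changed to dynamic or guided to select the schemes of Section 4:

#include <omp.h>

/* Row-partitioned SpMV on CSR data: y += A * x. */
void spmv_csr(int m, const int *csrRowStart, const int *csrCols,
              const double *csrNz, const double *x, double *y)
{
    int i, j;
    #pragma omp parallel for private(j) schedule(static)
    for (i = 0; i < m; i++) {
        double temp = y[i];
        for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++)
            temp += csrNz[j] * x[csrCols[j]];
        y[i] = temp;
    }
}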
There is a complementary approach to row partitioning, column partitioning, in which each thread is assigned a block of columns. Although column partitioning is more naturally applied to the CSC (Compressed Sparse Column) storage format, it can also be applied to the CSR format. A disadvantage of column partitioning is that all threads must write to all elements of the vector y. To solve this problem, the best method is to give each thread its own copy of y and perform a reduction at the end of the multiplication [18]. The reduction causes additional memory accesses, since it has to visit each thread's private copy of y, and it costs (P-1)*sizeof(y) extra space to store those copies, where P is the number of threads. On the other hand, since each thread only operates on its own part of vector x in column partitioning, it yields better temporal locality for x. To combine the advantages of the two partitioning methods, block partitioning can be used, where each thread is assigned a two-dimensional block, as shown in the right part of Figure 3. However, block partitioning is more complex to implement and needs more extra space. Its effects on the performance of SpMV are beyond the scope of this paper.
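As a minimal sketch of column partitioning (our illustration, not the implementation evaluated in this paper; the CSC arrays cscColStart, cscRows and cscNz are assumed), each thread can accumulate into a private copy of y and the copies can be reduced at the end:

#include <stdlib.h>
#include <omp.h>

/* Column-partitioned SpMV: the matrix is given in CSC form with n columns
   and m rows; y += A * x. */
void spmv_csc_colpart(int n, int m, const int *cscColStart,
                      const int *cscRows, const double *cscNz,
                      const double *x, double *y)
{
    int P = omp_get_max_threads();
    /* One private m-element copy of y per thread; this sketch allocates P
       copies for simplicity (the (P-1)*sizeof(y) figure in the text assumes
       one thread reuses the shared y). */
    double *ylocal = calloc((size_t)P * m, sizeof(double));

    #pragma omp parallel
    {
        double *yt = ylocal + (size_t)omp_get_thread_num() * m;
        int j, k;

        /* Each thread processes a block of columns and writes only its
           private copy of y. */
        #pragma omp for
        for (j = 0; j < n; j++)
            for (k = cscColStart[j]; k < cscColStart[j+1]; k++)
                yt[cscRows[k]] += cscNz[k] * x[j];

        /* Reduction: sum the private copies into the shared y, which costs
           the extra memory accesses mentioned above. */
        #pragma omp for
        for (k = 0; k < m; k++) {
            double s = y[k];
            for (int t = 0; t < P; t++)
                s += ylocal[(size_t)t * m + k];
            y[k] = s;
        }
    }
    free(ylocal);
}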
In our implementation of multithreaded SpMV, we used row partitioning with all three scheduling schemes provided by OpenMP. We also applied a static balancing scheme, non-zero scheduling [18], which is based on the number of non-zero elements: each thread is assigned approximately the same number of non-zero elements and thus a similar number of floating-point operations. To implement non-zero scheduling on top of the default static scheduling, we define an array nzStart with P+1 entries, where P is the number of threads, nzStart[i] is the index of the first row assigned to the i-th thread, and nzStart[i+1] - 1 is the index of the last row assigned to that thread. The cost of initializing nzStart is negligible, since the array csrRowStart only needs to be scanned once. The value of P is obtained at runtime through the environment variable OMP_NUM_THREADS. The implementation based on CSR with non-zero scheduling is shown in Figure 4.
//The following line is an OpenMP directive.
#pragma omp parallel for private(i, j)
for (int t = 0; t < P; t++) {
    for (i = nzStart[t]; i < nzStart[t+1]; i++) {
        double temp = y[i];
        for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++) {
            int index = csrCols[j];
            temp += csrNz[j] * x[index];
        }
        y[i] = temp;
    }
}
Figure 4. Pseudo code of multithreaded SpMV with the non-zero scheduling
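For example, nzStart can be initialized with the single scan of csrRowStart mentioned above; the following is a minimal sketch (the function name init_nzstart is illustrative, and thread t is started near the t-th share of the non-zeros):

/* Fill nzStart[0..P] so that thread t gets rows nzStart[t] .. nzStart[t+1]-1,
   each thread receiving roughly nnz/P non-zero elements. */
void init_nzstart(int m, const int *csrRowStart, int P, int *nzStart)
{
    int nnz = csrRowStart[m];   /* total number of non-zeros */
    int row = 0;
    nzStart[0] = 0;
    for (int t = 1; t < P; t++) {
        /* Boundary: first row whose starting non-zero index reaches t*nnz/P. */
        long long target = (long long)nnz * t / P;
        while (row < m && csrRowStart[row] < target)
            row++;
        nzStart[t] = row;
    }
    nzStart[P] = m;
}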
5. Experimental Evaluation
In this section, we evaluate the performance of multithreaded SpMV with different storage formats and scheduling schemes. The testing environment is a Dawning S4800A1 with the following configuration: four dual-core AMD Opteron 870 processors, 2 GHz, 1 MB L2 cache, 16 GB DDR2 RAM, and a 2 TB SCSI disk. The operating system is Turbo Linux 3.4.3-9.2. We compiled our code with the Intel compiler 9.1 using the optimization option -O3.
Considering the number of non-zero elements and the matrix size (small, medium, large), we collected 8 test sparse matrices from a popular sparse matrix collection, the UF Sparse Matrix Collection [21]. The details of the experimental matrices are shown in Table 2.
Table 2. Properties of the test matrices

Name           Rows     Columns   Non-zeros
bcsstk17.RSA   10974    10974     219812
bcsstk28.RSA   4410     4410      111717
af23560.rsa    23560    23560     484256
epb1.rua       14734    14734     95053
raefsky2.rb    3242     3242      294276
raefsky3.rb    21200    21200     1488768
twotone.rua    120750   120750    1224224
venkat01.rb    62424    62424     1717792
Our implementations use OpenMP with the default static scheduling scheme. For the BCSR format, we chose 2*2 as the block size. The number of iterations is 10000. To compare the CSR and BCSR formats, we chose four matrices: bcsstk17.RSA, raefsky3.rb, epb1.rua and bcsstk28.RSA, covering large, medium and small sizes.
As shown in Figures 5-8, BCSR performs better than CSR for most matrices, since the BCSR format improves data access locality and reuses data in registers. The improvement is about 28.09% on average over the four matrices.
Figure 5. Computing time with CSR and BCSR
for bcsstk17.RSA matrix
Figure 6. Computing time with CSR and BCSR
for raefsky3.rb matrix
Figure 7. Computing time with CSR and BCSR
for epb1.rua matrix
Figure 8. Computing time with CSR and BCSR
for bcsstk28.RSA matrix
The speedups obtained with CSR and BCSR are shown in Figure 9 and Figure 10. Most of the test matrices achieve a scalable, even super-linear, speedup for both CSR and BCSR as the number of threads increases. With eight threads, the speedup ranges from 1.82 to 14.68 for the CSR format and from 1.93 to 18.74 for the BCSR format; the average speedup with eight threads is 8.81 for CSR and 9.76 for BCSR. However, three matrices (raefsky3.rb, twotone.rua and venkat01.rb) obtain only a limited speedup of less than 2.0 with 8 threads, and their speedup does not improve once the number of threads exceeds 3. We observed that these are exactly the three largest matrices in the test set: their non-zero counts all exceed one million, and correspondingly the vector x to be multiplied with them is longer than for the other matrices. Since vector x is stored contiguously in memory, the longer it is, the more TLB misses occur when it is accessed irregularly. This is likely one reason why we obtained only a limited speedup for the large matrices.
Figure 9. Speedup of CSR using 8 matrices for
1 to 8 threads
Figure 10. Speedup of BCSR using 8 matrices for 1 to 8 threads
Figure 11. Different scheduling schemes with
CSR for bcsstk17.RSA matrix
Figure 12. Different scheduling schemes with
CSR for raefsky3.rb matrix
Figure 13. Different scheduling schemes with
CSR for epb1.rua matrix
Figure 14. Different scheduling schemes with
CSR for bcsstk28.RSA matrix
6. Conclusions and future work
In this paper, we implemented and evaluated multithreaded SpMV using OpenMP. We evaluated two storage formats for the sparse matrix, as well as the scheduling schemes provided by OpenMP and the non-zero scheduling scheme. In most cases, non-zero scheduling performs better than the other scheduling schemes. Our implementation obtained satisfactory scalability for most matrices, except for some large matrices. To address this problem, our future work will consider block partitioning: since a large matrix would be partitioned into several smaller blocks, this should give better data access locality for vectors x and y. We will also implement a hybrid parallelization using MPI and OpenMP for distributed memory parallel machines.
7. Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under contracts No. 60303020 and No. 60533020, and by the National 863 Plan of China under contracts No. 2006AA01A102 and No. 2006AA01A125. We thank the reviewers for their careful reviews and helpful suggestions.
8. References
[1] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23-29, Jul-Aug 1999.
[2] J. L. Hennessy and D. A. Patterson. Computer
Architecture: A Quantitative Approach; fourth edition.
Morgan Kaufmann, San Francisco, 2006.
[3] Eun-Jin Im, Katherine Yelick, Richard Vuduc. Sparsity: Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications, Vol. 18, No. 1, pp. 135-158, 2004.
[4] E.-J. Im and K. A. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 127-136, San Francisco, CA, May 2001. Springer.
[5] Berkeley Benchmarking and OPtimization (BeBOP) Project. http://bebop.cs.berkeley.edu.
[6] Richard Vuduc, James Demmel, Katherine Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Proceedings of SciDAC 2005, Journal of Physics: Conference Series, June 2005.
[7] R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.
[8] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD dissertation, Computer Science Division, U.C. Berkeley, December 2003.
[9] Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Report No. UCB/CSD-04-1335.
[10] Rajesh Nishtala, Richard Vuduc, James W. Demmel,
Katherine A. Yelick. When Cache Blocking of Sparse Matrix
Vector Multiply Works and Why.
[11] E.-J. Im. Optimizing the Performance of Sparse Matrix-
Vector Multiplication. PhD thesis, UC Berkeley,
Berkeley,CA, USA, 2000.
[12] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163-202, 1997.
[13] J. W. Willenbring, A. A. Anda, M. Heroux. Improving
sparse matrix-vector product kernel performance and
availability. In Proc. Midwest Instruction and Computing
Symposium, Mt. Pleasant, IA, 2006.
[14] E. Im, K. Yelick. Optimizing sparse matrix-vector
multiplication on SMPs. In 9th SIAM Conference on Parallel
Processing for Scientific Computing. SIAM, Mar. 1999.
[15] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, F. F. Rivera. Improving the locality of the sparse matrix-vector product on shared memory multiprocessors. In PDP, pages 66-71. IEEE Computer Society, 2004.
[16] U. V. Catalyuerek, C. Aykanat. Decomposing irregularly sparse matrices for parallel matrix-vector multiplication. Lecture Notes in Computer Science, 1117:75-86, 1996.
[17] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick,
and J. Demmel. Optimization of sparse matrix-vector
multiplication on emerging multicore platforms. In
Proceedings of Supercomputing, November 2007.
[18] K. Kourtis, G. Goumas, N. Koziris. Improving the
Performance of Multithreaded Sparse Matrix-Vector
Multiplication Using Index and Value Compression. In
Proceedings of the 37th International Conference on Parallel
Processing, Washington, DC, USA, 2008. pp: 511-519.
[19] Ankit Jain. Masters Report: pOSKI: An Extensible
Autotuning Framework to Perform Optimized SpMVs on
Multicore Architectures
[20] LAI Jianxin, HU Changju. Analysis of Task Schedule Overhead and Load Balance in OpenMP. Computer Engineering, 2006, 32(18): 58-60.
[21] Matrix Market. http://math.nist.gov/MatrixMarket/