Performance Evaluation of Multithreaded Sparse Matrix-Vector Multiplication Using OpenMP
The BCSR representation of the example matrix is shown as follows.
bcsrNz = {1,1,1,1,0,0,2,2,2,0,2,0,3,3,0,0,3,3,4,0,4,4,4,0,5,5,0,5,5,0};
bcsrCols = {0,6,3,3,6};
bcsrRowStart = {0,2,3,5}.
Taking a 2*2 block size as an example, the pseudo code in Figure 2 implements SpMV based on BCSR using OpenMP, where blockRows = m/r is the number of block rows in the matrix.
We can see from the code that x0, x1, y0 and y1 are each reused twice in registers. The data access locality of vector x is also improved, since four elements of the sparse matrix are processed each time.
//The following line is an OpenMP directive.
#pragma omp parallel for private(j, k, h, t)
//The loop below is the serial SpMV based on BCSR.
for (i = 0; i < blockRows; i++)
{
    t = i << 1;
    register double y0 = y[t];
    register double y1 = y[t+1];
    for (j = bcsrRowStart[i]; j < bcsrRowStart[i+1]; j++)
    {
        k = j << 2;
        h = bcsrCols[j];
        register double x0 = x[h];
        register double x1 = x[h+1];
        y0 += bcsrNz[k]   * x0;
        y0 += bcsrNz[k+1] * x1;
        y1 += bcsrNz[k+2] * x0;
        y1 += bcsrNz[k+3] * x1;
    }
    y[t]   = y0;
    y[t+1] = y1;
}
Figure 2. Pseudo code of multithreaded SpMV for BCSR using OpenMP
4. Load balancing of SpMV
Considering load balancing, scheduling overhead and other factors, the OpenMP API specifies four scheduling schemes: static, dynamic, guided and runtime [20]. The runtime scheme defers the choice of schedule type and chunk size to runtime, reading them from environment variables, so we mainly discuss the other three scheduling schemes.
The static scheme divides the iterations of a loop into pieces whose size is given by the input parameter chunk. The pieces are statically assigned to the threads in a round-robin fashion. By default, the iterations are divided into P evenly sized contiguous chunks, where P is the number of threads. Since the schedule can be determined statically, this method has the least runtime overhead.
The dynamic scheme also divides the iterations of a loop into pieces, but distributes the pieces dynamically: each thread obtains the next set of iterations only after it has finished its current piece. If not specified by the input parameter chunk, the default number of iterations in a piece is 1.
The guided scheme works in a fashion similar to the dynamic scheme. The partition size can be calculated by Formula (1), where LC_k is the count of unassigned iterations, P is the number of threads and λ is a scale factor (a value of 1 or 2 is recommended). The guided scheme also accepts an input parameter chunk, which specifies the minimum number of iterations to dispatch each time; when no chunk is specified, its default value is 1. The chunk sizes in guided scheduling begin large and decrease over time, resulting in fewer synchronizations than dynamic scheduling while still providing load balancing.
NC_k = LC_k / (λ · P)                                   (1)
As an example, Table 1 gives the partition sizes for a problem with N = 1000 iterations and P = 4 threads.
Table 1. Partition sizes for different schemes

Scheme               Partition sizes
static (default)     250, 250, 250, 250
dynamic (chunk=50)   50, 50, 50, ..., 50 (20 pieces of 50)
guided (chunk=50)    250, 188, 141, 106, 79, 59, 50, 50, 50, 27
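The guided partition sizes in Table 1 can be reproduced directly from Formula (1). The small C program below is a minimal sketch (not part of our measured implementation) assuming a scale factor λ = 1:

#include <stdio.h>

int main(void)
{
    int remaining = 1000;   /* N from Table 1 */
    int P = 4;              /* number of threads */
    int chunk = 50;         /* minimum chunk size */
    while (remaining > 0) {
        int piece = (remaining + P - 1) / P;   /* NC_k = ceil(LC_k / (1*P)) */
        if (piece < chunk) piece = chunk;      /* never dispatch fewer than chunk iterations */
        if (piece > remaining) piece = remaining;
        printf("%d ", piece);
        remaining -= piece;
    }
    printf("\n");
    return 0;
}

Its output, 250 188 141 106 79 59 50 50 50 27, matches the guided row of Table 1.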
In the case of the CSR storage format, a coarse-grain row partitioning scheme is commonly applied [17]. The matrix is partitioned into blocks of rows according to the number of threads, and each block is assigned to one thread. As described in the left part of Figure 3, each thread operates on its own parts of the csrNz, csrCols,
csrRowStart and y arrays. Their accesses to these data are therefore independent and can proceed in parallel. All threads access elements of the x array; since accesses to x are read only, the data can reside in each processor's cache without causing invalidation traffic due to the cache coherency protocol [16]. An advantage of row partitioning is that each thread operates on its own part of the y array, which allows better temporal locality on its elements in the case of distinct caches.
Figure 3. Matrix partition for SpMV
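As a minimal sketch of row partitioning (assuming the CSR arrays csrRowStart, csrCols and csrNz used throughout this paper), the CSR SpMV can be written as a single parallel loop over rows; with the default static schedule each thread receives one contiguous block of rows, as in the left part of Figure 3, and the schedule clause can be changed to dynamic or guided to select the schemes of Section 4:

#include <omp.h>

/* Row-partitioned SpMV on CSR data: y += A * x. */
void spmv_csr(int m, const int *csrRowStart, const int *csrCols,
              const double *csrNz, const double *x, double *y)
{
    int i, j;
    #pragma omp parallel for private(j) schedule(static)
    for (i = 0; i < m; i++) {
        double temp = y[i];
        for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++)
            temp += csrNz[j] * x[csrCols[j]];
        y[i] = temp;
    }
}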
There is a complementary approach to row partitioning, column partitioning, in which each thread is assigned a block of columns. Although column partitioning is more naturally applied to the CSC (Compressed Sparse Column) storage format, it can also be applied to the CSR format. A disadvantage of column partitioning is that all threads must write to all elements of the vector y. To solve this problem, the best method is to give each thread its own copy of y and perform a reduction at the end of the multiplication [18]. The reduction causes additional memory accesses, since it has to visit each thread's private copy of y, and it costs (P-1)*sizeof(y) extra space to store those copies, where P is the number of threads. On the other hand, since each thread only operates on its own part of vector x in column partitioning, it yields better temporal locality for x. To combine the advantages of the two partitioning methods, block partitioning can be used, where each thread is assigned a two-dimensional block, as shown in the right part of Figure 3. However, block partitioning is more complex to implement and needs more extra space. Its effects on the performance of SpMV are beyond the scope of this paper.
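As a minimal sketch of column partitioning (our illustration, not the implementation evaluated in this paper; the CSC arrays cscColStart, cscRows and cscNz are assumed), each thread can accumulate into a private copy of y and the copies can be reduced at the end:

#include <stdlib.h>
#include <omp.h>

/* Column-partitioned SpMV: the matrix is given in CSC form with n columns
   and m rows; y += A * x. */
void spmv_csc_colpart(int n, int m, const int *cscColStart,
                      const int *cscRows, const double *cscNz,
                      const double *x, double *y)
{
    int P = omp_get_max_threads();
    /* One private m-element copy of y per thread; this sketch allocates P
       copies for simplicity (the (P-1)*sizeof(y) figure in the text assumes
       one thread reuses the shared y). */
    double *ylocal = calloc((size_t)P * m, sizeof(double));

    #pragma omp parallel
    {
        double *yt = ylocal + (size_t)omp_get_thread_num() * m;
        int j, k;

        /* Each thread processes a block of columns and writes only its
           private copy of y. */
        #pragma omp for
        for (j = 0; j < n; j++)
            for (k = cscColStart[j]; k < cscColStart[j+1]; k++)
                yt[cscRows[k]] += cscNz[k] * x[j];

        /* Reduction: sum the private copies into the shared y, which costs
           the extra memory accesses mentioned above. */
        #pragma omp for
        for (k = 0; k < m; k++) {
            double s = y[k];
            for (int t = 0; t < P; t++)
                s += ylocal[(size_t)t * m + k];
            y[k] = s;
        }
    }
    free(ylocal);
}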
In our implementation of multithreaded SpMV, we used row partitioning with all three scheduling schemes provided by OpenMP. We also applied a static balancing scheme, non-zero scheduling [18], which is based on the number of non-zero elements: each thread is assigned approximately the same number of non-zero elements and thus a similar number of floating-point operations. To implement non-zero scheduling on top of the default static scheduling, we define an array nzStart with P+1 entries, where P is the number of threads, nzStart[i] is the index of the first row assigned to the i-th thread, and nzStart[i+1] - 1 is the index of the last row assigned to that thread. The cost of initializing nzStart is negligible, since the array csrRowStart only needs to be scanned once. The value of P is obtained at runtime through the environment variable OMP_NUM_THREADS. The implementation based on CSR with non-zero scheduling is shown in Figure 4.
//The following line is an OpenMP directive.
#pragma omp parallel for private(i, j)
for (int t = 0; t < P; t++) {
    for (i = nzStart[t]; i < nzStart[t+1]; i++) {
        double temp = y[i];
        for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++) {
            int index = csrCols[j];
            temp += csrNz[j] * x[index];
        }
        y[i] = temp;
    }
}
Figure 4. Pseudo code of multithreaded SpMV with the non-zero scheduling
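For example, nzStart can be initialized with the single scan of csrRowStart mentioned above; the following is a minimal sketch (the function name init_nzstart is illustrative, and thread t is started near the t-th share of the non-zeros):

/* Fill nzStart[0..P] so that thread t gets rows nzStart[t] .. nzStart[t+1]-1,
   each thread receiving roughly nnz/P non-zero elements. */
void init_nzstart(int m, const int *csrRowStart, int P, int *nzStart)
{
    int nnz = csrRowStart[m];   /* total number of non-zeros */
    int row = 0;
    nzStart[0] = 0;
    for (int t = 1; t < P; t++) {
        /* Boundary: first row whose starting non-zero index reaches t*nnz/P. */
        long long target = (long long)nnz * t / P;
        while (row < m && csrRowStart[row] < target)
            row++;
        nzStart[t] = row;
    }
    nzStart[P] = m;
}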
5. Experimental Evaluation
In this section, we evaluate the performance of multithreaded SpMV with different storage formats and scheduling schemes. The testing environment is a Dawning S4800A1 with the following configuration: four dual-core AMD Opteron 870 processors, 2 GHz, 1 MB L2 cache, 16 GB DDR2 RAM, and a 2 TB SCSI disk. The operating system is Turbo Linux 3.4.3-9.2. We compiled our code with the Intel compiler 9.1 using the optimization option -O3.
Considering the number of non-zero elements and the matrix size (small, medium, large), we collected 8 test sparse matrices from a popular sparse matrix collection, the UF Sparse Matrix Collection [21]. The details of the experimental matrices are shown in Table 2.
Table 2. Properties of the test matrices

Name           Rows     Columns   Non-zeros
bcsstk17.RSA   10974    10974     219812
bcsstk28.RSA   4410     4410      111717
af23560.rsa    23560    23560     484256
epb1.rua       14734    14734     95053
raefsky2.rb    3242     3242      294276
raefsky3.rb    21200    21200     1488768
twotone.rua    120750   120750    1224224
venkat01.rb    62424    62424     1717792
Our implementations use OpenMP with the default static scheduling scheme. For the BCSR format, we chose 2*2 as the block size. The number of iterations is 10000. To compare the CSR and BCSR formats, we chose four matrices: bcsstk17.RSA, raefsky3.rb, epb1.rua and bcsstk28.RSA, covering large, medium and small sizes.
As shown in Figures 5-8, BCSR performs better than CSR for most matrices, since the BCSR format improves data access locality and reuses data in registers. The improvement is about 28.09% on average over the four matrices.
Figure 5. Computing time with CSR and BCSR
for bcsstk17.RSA matrix
Figure 6. Computing time with CSR and BCSR
for raefsky3.rb matrix
Figure 7. Computing time with CSR and BCSR
for epb1.rua matrix
Figure 8. Computing time with CSR and BCSR
for bcsstk28.RSA matrix
The speedups obtained with CSR and BCSR are shown in Figure 9 and Figure 10. Most of the test matrices achieve a scalable, even super-linear, speedup for both CSR and BCSR as the number of threads increases. With eight threads, the speedup ranges from 1.82 to 14.68 for the CSR format and from 1.93 to 18.74 for the BCSR format; the average speedup with eight threads is 8.81 for CSR and 9.76 for BCSR. However, three matrices (raefsky3.rb, twotone.rua and venkat01.rb) obtain only a limited speedup of less than 2.0 with 8 threads, and their speedup does not improve once the number of threads exceeds 3. We observed that these are exactly the three largest matrices in the test set: their non-zero counts all exceed one million, and correspondingly the vector x to be multiplied with them is longer than for the other matrices. Since vector x is stored contiguously in memory, the longer it is, the more TLB misses occur when it is accessed irregularly. This is likely one reason why we obtained only a limited speedup for the large matrices.
Figure 9. Speedup of CSR using 8 matrices for
1 to 8 threads
Figure 10. Speedup of BCSR using 8 matrices for 1 to 8 threads
Figure 11. Different scheduling schemes with
CSR for bcsstk17.RSA matrix
Figure 12. Different scheduling schemes with
CSR for raefsky3.rb matrix
Figure 13. Different scheduling schemes with
CSR for epb1.rua matrix
Figure 14. Different scheduling schemes with
CSR for bcsstk28.RSA matrix
6. Conclusions and future work
In this paper, we implemented and evaluated multithreaded SpMV using OpenMP. We evaluated two storage formats for the sparse matrix, as well as the scheduling schemes provided by OpenMP and the non-zero scheduling scheme. In most cases, non-zero scheduling performs better than the other scheduling schemes. Our implementation obtained satisfactory scalability for most matrices, except for some large matrices. To address this problem, our future work will consider block partitioning: since a large matrix would be partitioned into several smaller blocks, this should give better data access locality for vectors x and y. We will also implement a hybrid parallelization using MPI and OpenMP for distributed memory parallel machines.
7. Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under contracts No. 60303020 and No. 60533020, and by the National 863 Plan of China under contracts No. 2006AA01A102 and No. 2006AA01A125. We thank the reviewers for their careful reviews and helpful suggestions.
8. References
[1] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23-29, Jul-Aug 1999.
[2] J. L. Hennessy and D. A. Patterson. Computer
Architecture: A Quantitative Approach; fourth edition.
Morgan Kaufmann, San Francisco, 2006.
[3] Eun-Jin Im, Katherine Yelick, Richard Vuduc. Sparsity: Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications, Vol. 18, No. 1, pp. 135-158, 2004.
[4] E.-J. Im and K. A. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 127-136, San Francisco, CA, May 2001. Springer.
[5] Berkeley Benchmarking and OPtimization (BeBOP) Project. http://bebop.cs.berkeley.edu.
[6] Richard Vuduc, James Demmel, Katherine Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Proceedings of SciDAC 2005, Journal of Physics: Conference Series, June 2005.
[7] R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.
[8] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD dissertation, Computer Science Division, U.C. Berkeley, December 2003.
[9] Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Report No. UCB/CSD-04-1335.
[10] Rajesh Nishtala, Richard Vuduc, James W. Demmel,
Katherine A. Yelick. When Cache Blocking of Sparse Matrix
Vector Multiply Works and Why.
[11] E.-J. Im. Optimizing the Performance of Sparse Matrix-
Vector Multiplication. PhD thesis, UC Berkeley,
Berkeley,CA, USA, 2000.
[12] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163-202, 1997.
[13] J. W. Willenbring, A. A. Anda, M. Heroux. Improving
sparse matrix-vector product kernel performance and
availability. In Proc. Midwest Instruction and Computing
Symposium, Mt. Pleasant, IA, 2006.
[14] E. Im, K. Yelick. Optimizing sparse matrix-vector
multiplication on SMPs. In 9th SIAM Conference on Parallel
Processing for Scientific Computing. SIAM, Mar. 1999.
[15] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, F. F. Rivera. Improving the locality of the sparse matrix-vector product on shared memory multiprocessors. In PDP, pages 66-71. IEEE Computer Society, 2004.
[16] U. V. Catalyuerek, C. Aykanat. Decomposing irregularly sparse matrices for parallel matrix-vector multiplication. Lecture Notes in Computer Science, 1117:75-86, 1996.
[17] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick,
and J. Demmel. Optimization of sparse matrix-vector
multiplication on emerging multicore platforms. In
Proceedings of Supercomputing, November 2007.
[18] K. Kourtis, G. Goumas, N. Koziris. Improving the
Performance of Multithreaded Sparse Matrix-Vector
Multiplication Using Index and Value Compression. In
Proceedings of the 37th International Conference on Parallel
Processing, Washington, DC, USA, 2008. pp: 511-519.
[19] Ankit Jain. Masters Report: pOSKI: An Extensible
Autotuning Framework to Perform Optimized SpMVs on
Multicore Architectures
[20] LAI Jianxin, HU Changju. Analysis of Task Schedule Overhead and Load Balance in OpenMP. Computer Engineering, 2006, 32(18): 58-60.
[21] Matrix Market. http://math.nist.gov/MatrixMarket/