
Computers and Mathematics with Applications 57 (2009) 1369–1376


A distributed memory parallel Gauss–Seidel algorithm for linear algebraic systems✩

Yueqiang Shang ∗

Faculty of Science, Xi’an Jiaotong University, Xi’an 710049, PR China
School of Mathematics and Computer Science, Guizhou Normal University, Guiyang 550001, PR China

✩ This work was supported by the Science and Technology Foundation of Guizhou Province, China [2008] 2123.
∗ Corresponding address: Faculty of Science, Xi’an Jiaotong University, P.O. Box 2325, Xi’an 710049, PR China. E-mail address: [email protected].

Article history:
Received 26 June 2008
Received in revised form 17 December 2008
Accepted 12 January 2009

Keywords:
Parallel computing
Linear algebraic system
Gauss–Seidel method
Distributed memory system
Parallel algorithm

Abstract

A distributed memory parallel Gauss–Seidel algorithm for linear algebraic systems is presented, in which a parameter is introduced to adapt the algorithm to different distributed memory parallel architectures. In this algorithm, the coefficient matrix and the right-hand side of the linear algebraic system are first divided into row-blocks in the natural rowwise order according to the performance of the parallel architecture in use. These row-blocks are then distributed among the local memories of all processors through a torus-wrap mapping technique. The solution iteration vector is cyclically conveyed among processors at each iteration so as to decrease the communication. The algorithm is a true Gauss–Seidel algorithm which maintains the convergence rate of the serial Gauss–Seidel algorithm and allows existing sequential codes to run in a parallel environment with a little investment in recoding. Numerical results are also given which show that the algorithm is of relatively high efficiency.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

The solution of linear algebraic systems lies at the core of many scientific and engineering simulations, and many practical problems reduce to a large scale linear algebraic system. The parallel solution of large scale linear algebraic systems is therefore of great importance in scientific computing. Methods for linear algebraic systems generally fall into two categories: direct methods and iterative methods. Direct methods reach the solution within a finite number of steps. However, they may be impractical if the coefficient matrix of the linear algebraic system to be solved is large and sparse, because the sought-after factor can be dense [1]. This is why iterative methods, which are able to exploit sparsity, are often preferred to direct methods in engineering fields. With an iterative method, an approximate solution within the error tolerance can be obtained, provided the iteration converges. Among classical iterative methods, the Gauss–Seidel method has several interesting properties. In general, if the Jacobi method converges, the Gauss–Seidel method converges faster. Although the SOR method with the optimal relaxation parameter is faster than the Gauss–Seidel method, choosing an optimal SOR relaxation parameter can be difficult for many problems of practical interest [2]. Therefore, the Gauss–Seidel method is very attractive in practice, and it is widely used as the smoother in multigrid methods for partial differential equations, where it typically yields good multigrid convergence properties.
Parallel implementations of the Gauss–Seidel method have generally been developed for regular problems, for example the solution of Laplace’s equation by finite differences [3,4], where a red-black coloring scheme is used to provide independence
in the calculations and some parallelism. This scheme has been extended to multi-coloring for additional parallelism in more
complicated regular problems in [5]. In [6], Koester, Ranka and Fox have presented a parallel Gauss–Seidel algorithm for
sparse power system matrices, in which a two-part matrix ordering technique has been developed—first to partition the
matrix into block-diagonal bordered form using diakoptic techniques and then to multi-color the data in the last diagonal
block using graph-coloring techniques. Unfortunately, the elegant simplicity of structured grid multi-color Gauss–Seidel is
lost on 3D unstructured finite element applications as the number of required colors increases dramatically. Motivated
by the above observation, Adams proposed in [7] an efficient parallel Gauss–Seidel algorithm for unstructured problems suited to distributed memory computers. This algorithm takes advantage of the domain decomposition provided by the distribution of the stiffness matrix that is common in parallel computing: it first colors the processors and uses an ordering of the colors to define a processor inequality operator. A weakness of this parallel block multi-color Gauss–Seidel algorithm is that it requires different orderings depending on the number of processors. In [8,9], Amodio and Mazzia developed a parallel Gauss–Seidel method for block-banded systems, using a parallel structure consisting of one-dimensional logically connected processors, whereas Kim and Lee in [10] used a parallel structure consisting of two-dimensional logically connected processors. On the other hand, to avoid parallelization difficulties, a processor-localized Gauss–Seidel is often employed instead of a true Gauss–Seidel method, in which each processor performs the Gauss–Seidel method as a subdomain solver for a block Jacobi method (see, e.g., [11]). While a processor-localized Gauss–Seidel is easy to parallelize, its convergence rate usually suffers as the number of processors increases and, as shown in [11], the iteration can even diverge.
In this paper, we present a distributed memory parallel Gauss–Seidel algorithm for general linear algebraic systems,
in which a parameter is introduced to adapt the algorithm to different parallel architectures. The general idea of this
algorithm is to divide the coefficient matrix and the right-hand side of the linear algebraic system into row-blocks in
the natural rowwise-order, then to distribute these blocks among local memories of all processors through a torus-wrap
mapping technique and cyclically convey the solution iteration vector at each iteration to decrease the communication. It
is a true parallel Gauss–Seidel method and hence the convergence rate of the serial Gauss–Seidel algorithm is maintained.
Furthermore, when the linear algebraic system arises from the discretization of a partial differential equation, the algorithm does not depend on the domain decomposition, the topological structure of the grid, the ordering of the elements or other implementation-specific techniques; hence it allows existing sequential codes to run in a parallel environment with little investment in recoding.
The remainder of the paper is organized as follows. In Section 2, the Gauss–Seidel method and its convergence properties
are recalled. In Section 3, the parallel implementation of Gauss–Seidel method is discussed and the corresponding parallel
Gauss–Seidel algorithm is described and analyzed. A numerical experiment and analysis on numerical results are given in
Section 4. Finally, some conclusions are drawn in Section 5.

2. The Gauss–Seidel method

Consider the following real linear algebraic system Ax = b:

\[
\begin{pmatrix}
a_{00} & a_{01} & \cdots & a_{0\,n-1}\\
a_{10} & a_{11} & \cdots & a_{1\,n-1}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n-1\,0} & a_{n-1\,1} & \cdots & a_{n-1\,n-1}
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{pmatrix},
\qquad (1)
\]

where A = (a_{ij})_{n×n} is a known nonsingular n × n matrix with nonzero diagonal entries, b = (b_0, b_1, ..., b_{n-1})^T is the right-hand side and x = (x_0, x_1, ..., x_{n-1})^T is the vector of unknowns.
The Gauss–Seidel method for the above linear algebraic system reads: given an initial guess x^{(0)} = (x_0^{(0)}, x_1^{(0)}, ..., x_{n-1}^{(0)})^T, the iterate x^{(k+1)} = (x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{n-1}^{(k+1)})^T is obtained by the following iterative procedure

\[
x_i^{(k+1)} = \frac{1}{a_{ii}}\left( b_i - \sum_{j=0}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n-1} a_{ij} x_j^{(k)} \right), \qquad i = 0, 1, \ldots, n-1. \qquad (2)
\]
By introducing vectors z = (z_0, z_1, ..., z_{n-1})^T and t = (t_0, t_1, ..., t_{n-1})^T, where

\[
z_i = \begin{cases} 0, & i = 0,\\[2pt] -\displaystyle\sum_{j=0}^{i-1} a_{ij} x_j, & i = 1, 2, \ldots, n-1, \end{cases} \qquad (3)
\]

\[
t_i = b_i - \sum_{j=i+1}^{n-1} a_{ij} x_j, \qquad i = 0, 1, \ldots, n-1, \qquad (4)
\]

scheme (2) can be written as

\[
x_i^{(k+1)} = \frac{1}{a_{ii}}\left( z_i^{(k+1)} + t_i^{(k)} \right). \qquad (5)
\]
As for the convergence of scheme (2) or (5), we have the following well-known result that if the coefficient matrix A is
strictly diagonally dominant or irreducibly diagonally dominant, scheme (2) or (5) converges for any initial guess x(0) (see,
e.g., [1]).
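To make scheme (2) concrete, a minimal serial C++ sketch of one possible implementation is given below; the dense two-dimensional storage, the componentwise convergence test and the iteration cap are our illustrative choices, not details taken from the paper.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Serial Gauss-Seidel iteration for Ax = b following scheme (2): the newer components
// x_j^{(k+1)} (j < i) are used as soon as they are available.  Returns the number of
// iterations performed; eps is the componentwise error tolerance
// |x_i^{(k+1)} - x_i^{(k)}| < eps.
std::size_t gauss_seidel(const std::vector<std::vector<double>>& A,
                         const std::vector<double>& b,
                         std::vector<double>& x,     // initial guess on entry, result on exit
                         double eps, std::size_t max_iter) {
  const std::size_t n = b.size();
  for (std::size_t k = 0; k < max_iter; ++k) {
    bool converged = true;
    for (std::size_t i = 0; i < n; ++i) {
      double s = b[i];
      for (std::size_t j = 0; j < n; ++j)
        if (j != i) s -= A[i][j] * x[j];   // x[j] already holds x_j^{(k+1)} for j < i
      const double x_new = s / A[i][i];
      if (std::fabs(x_new - x[i]) >= eps) converged = false;
      x[i] = x_new;
    }
    if (converged) return k + 1;
  }
  return max_iter;
}
```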

3. Parallel Gauss–Seidel algorithm

In this section, we discuss in detail the parallel implementation of the Gauss–Seidel method described in Section 2 on distributed memory parallel architectures.

3.1. Strategy for data distribution and storage

Firstly, we divide the linear algebraic system (1) into m = n/g row-blocks in the natural rowwise order (for simplicity
of presentation, we assume n can be divided exactly by g) as follows.
\[
\begin{pmatrix} A_0 \\ A_1 \\ \vdots \\ A_{m-1} \end{pmatrix}
\begin{pmatrix} X_0 \\ X_1 \\ \vdots \\ X_{m-1} \end{pmatrix}
=
\begin{pmatrix} B_0 \\ B_1 \\ \vdots \\ B_{m-1} \end{pmatrix},
\qquad (6)
\]
where A_i (i = 0, 1, ..., m − 1) are g × n matrices, each consisting of g successive rows of the coefficient matrix A, and X_i and B_i (i = 0, 1, ..., m − 1) are g-dimensional vectors consisting of the corresponding successive components of the unknown vector x and of the right-hand side vector b, respectively. Then, we distribute and store these row-blocks of (6) among the local memories of all processors in the parallel system through a torus-wrap mapping technique. Specifically, assuming p is the number of processors in the parallel system, the 0th processor stores the 0th, pth, 2pth, ... row-blocks of (6), the 1st processor stores the 1st, (p + 1)th, (2p + 1)th, ... row-blocks of (6), the 2nd processor stores the 2nd, (p + 2)th, (2p + 2)th, ... row-blocks of (6), and so on. If p = 3 and m = 8, for example, then the 0th processor stores A_0, A_3, A_6 and B_0, B_3, B_6 of (6), the 1st processor stores A_1, A_4, A_7 and B_1, B_4, B_7, and the 2nd processor stores A_2, A_5 and B_2, B_5.
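The torus-wrap (cyclic) mapping described above can be summarized by a small helper, sketched below in C++; the function and type names are ours, introduced only for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Row-block indices of (6) stored by processor 'rank' when the m = n/g blocks are
// dealt out cyclically (torus-wrap): rank, rank + p, rank + 2p, ...
std::vector<std::size_t> local_blocks(std::size_t rank, std::size_t p, std::size_t m) {
  std::vector<std::size_t> blocks;
  for (std::size_t blk = rank; blk < m; blk += p) blocks.push_back(blk);
  return blocks;
}

// Global row range [first, last) covered by row-block 'blk' when each block has g rows.
struct RowRange { std::size_t first, last; };
RowRange block_rows(std::size_t blk, std::size_t g) { return {blk * g, (blk + 1) * g}; }

// Example from the text: with p = 3 and m = 8, local_blocks gives {0, 3, 6} for
// processor 0, {1, 4, 7} for processor 1 and {2, 5} for processor 2.
```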

3.2. Parallel implementation

From scheme (2) or (5), we can see that the Gauss–Seidel iterative process has a very strong sequential character, in the sense that x_{j+1} cannot be computed until x_j is known. To obtain good load balancing, we adopt the above strategy to distribute and store the coefficient matrix and the right-hand side of the linear algebraic system (1). A master/slave model is employed, in which the master processor (which also performs the task of a slave processor) partitions the computation task, initializes the iteration vector, and collects and prints the computed results, while the slave processors perform the actual computation involved. Meanwhile, we cyclically convey the solution iteration vector at each iteration to decrease the communication, and we adopt the technique of overlapping communication and computation to increase the parallel efficiency. Concretely speaking, the (k + 1)th iteration is performed as follows, where ε is the error tolerance.
(1) All the slave processors sum, in parallel, the data to the right of the main diagonal of A, i.e., they compute t_i^{(k)} (i = 0, 1, ..., n − 1) in parallel, and set sign = 0.
(2) The 0th slave processor first computes x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{g−1}^{(k+1)} and checks whether the conditions |x_i^{(k+1)} − x_i^{(k)}| < ε (i = 0, 1, ..., g − 1) hold. If any of these inequalities fails, it sets sign = 1. Then the 0th slave processor sends x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{g−1}^{(k+1)} and sign to the 1st slave processor. After that, the 0th and 1st slave processors perform the following tasks in parallel: the 0th slave processor uses the newly computed x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{g−1}^{(k+1)} to compute the other relevant items of the remaining local rows, i.e., the partial sums in z_i^{(k+1)} involving x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{g−1}^{(k+1)}, while the 1st slave processor receives x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{g−1}^{(k+1)} and sign from the 0th slave processor, computes x_g^{(k+1)}, x_{g+1}^{(k+1)}, ..., x_{2g−1}^{(k+1)}, and checks whether the conditions |x_i^{(k+1)} − x_i^{(k)}| < ε (i = g, g + 1, ..., 2g − 1) hold. Failure of any of these inequalities leads to sign = 1.
(3) After computing x_g^{(k+1)}, x_{g+1}^{(k+1)}, ..., x_{2g−1}^{(k+1)} and sign, the 1st slave processor sends x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{2g−1}^{(k+1)} and sign to the 2nd slave processor. Then the 1st and 2nd slave processors carry out the following tasks in parallel: the 1st slave processor uses x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{g−1}^{(k+1)} and the newly computed x_g^{(k+1)}, x_{g+1}^{(k+1)}, ..., x_{2g−1}^{(k+1)} to compute the other relevant items of the remaining local rows, while the 2nd slave processor receives x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{2g−1}^{(k+1)} and sign from the 1st slave processor, computes x_{2g}^{(k+1)}, x_{2g+1}^{(k+1)}, ..., x_{3g−1}^{(k+1)}, and checks whether the conditions |x_i^{(k+1)} − x_i^{(k)}| < ε (i = 2g, 2g + 1, ..., 3g − 1) hold. If any of these inequalities is not satisfied, it sets sign = 1.
(4) The process continues in this way until the vth (0 ≤ v ≤ p − 1) slave processor, which stores the last row-block of (6), receives x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{n−g−1}^{(k+1)} and sign, computes x_{n−g}^{(k+1)}, x_{n−g+1}^{(k+1)}, ..., x_{n−1}^{(k+1)}, and then checks whether |x_i^{(k+1)} − x_i^{(k)}| < ε (i = n − g, n − g + 1, ..., n − 1) hold; if any of these inequalities is not valid, it sets sign = 1.
(5) The vth slave processor checks whether sign = 0. If so, it sends x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{n−1}^{(k+1)} to the master processor and stops the iteration; otherwise, it broadcasts x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{n−g−1}^{(k+1)} to the other slave processors and the next iteration starts.

3.3. Parallel algorithm

Based on the above parallel implementation of Gauss–Seidel method, we describe our parallel Gauss–Seidel algorithm
as follows, in which total is the number of the row-blocks of (6) stored on the processor.

Algorithm. Parallel Gauss–Seidel algorithm.


Sub-algorithm for master processor.

(1) Initialize the iteration vector and the parameter g according to the size of the problem and the performance of the
parallel architecture in use.
(2) Broadcast the parameter g, initial iteration vector and other information about partitioning the computation task to
slave processors.
(3) Receive the approximate solution vector and print it.
Sub-algorithm for the jth (j > 0) slave processor.

(1) Receive the parameter g, initial iteration vector and other information about partitioning the computation task from
master processor, read my local data.
(2) For k = 1, 2, . . ., until convergence:
(i) Set sign = 0 and compute my local t_i^{(k)} in parallel with the other slave processors.
(ii) For r = 0, 1, ..., total − 1, do:
(a) Receive x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{(r·p+j)·g−1}^{(k+1)} and sign from the (j − 1)th processor; compute x_{(r·p+j)·g}^{(k+1)}, x_{(r·p+j)·g+1}^{(k+1)}, ..., x_{(r·p+j+1)·g−1}^{(k+1)} and check whether they satisfy the accuracy requirement. If the accuracy requirement is not satisfied, set sign = 1.
(b) Send x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{(r·p+j+1)·g−1}^{(k+1)} and sign to the (j + 1)th processor, then compute the other relevant items of the remaining local rows.
(iii) Broadcast or receive x_0^{(k+1)}, x_1^{(k+1)}, ..., x_{n−1}^{(k+1)}.
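The sketch below shows, in C++, how the slave sub-algorithm might be organized. It is only a sketch under stated assumptions: the message-passing helpers send_to and recv_from are hypothetical placeholders for the PVM routines used in the paper's actual code, the local storage layout is our own choice, and the overlap of communication with the update of the remaining local rows, as well as step (iii), are only indicated in comments.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical message-passing helpers standing in for the PVM routines used in the
// paper's implementation; they are placeholders, not a real API.
void send_to(std::size_t dest, const double* x, std::size_t count, int sign);
void recv_from(std::size_t src, double* x, std::size_t count, int& sign);

// One iteration of the slave sub-algorithm on processor j (0 <= j < p).  The r-th
// locally stored row-block is block r*p + j of (6) and covers the global rows
// [(r*p + j)*g, (r*p + j + 1)*g).  A_local[r] holds that g x n block and b_local[r]
// the matching piece of the right-hand side.  On entry x holds x^{(k)}; on exit the
// locally computed components hold x^{(k+1)}.  Returns sign (0 if every locally
// checked component met the tolerance eps).
int slave_iteration(std::size_t j, std::size_t p, std::size_t n, std::size_t g,
                    std::size_t total,
                    const std::vector<std::vector<std::vector<double>>>& A_local,
                    const std::vector<std::vector<double>>& b_local,
                    std::vector<double>& x, double eps) {
  int sign = 0;
  const std::vector<double> x_old = x;           // x^{(k)}, kept for the convergence test
  for (std::size_t r = 0; r < total; ++r) {
    const std::size_t row0 = (r * p + j) * g;    // first global row of this block
    if (row0 > 0) {                              // (ii)(a): receive x_0 .. x_{row0-1} and sign
      int incoming = 0;
      recv_from((j + p - 1) % p, x.data(), row0, incoming);
      if (incoming) sign = 1;
    }
    for (std::size_t i = row0; i < row0 + g; ++i) {  // update the g rows of this block
      double s = b_local[r][i - row0];
      for (std::size_t col = 0; col < n; ++col)
        if (col != i) s -= A_local[r][i - row0][col] * x[col];
      const double x_new = s / A_local[r][i - row0][i];
      if (std::fabs(x_new - x_old[i]) >= eps) sign = 1;  // accuracy requirement not met
      x[i] = x_new;
    }
    // (ii)(b): pass the grown prefix (and sign) to the successor; in the paper this send
    // is overlapped with updating the partial sums z_i of the remaining local rows.
    if (row0 + g < n) send_to((j + 1) % p, x.data(), row0 + g, sign);
  }
  // Step (iii), the broadcast/receive of the full new iterate, is not shown here.
  return sign;
}
```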

3.4. Speedup and parallel efficiency

The performance of parallel algorithms in a homogeneous distributed memory parallel environment is measured by
speedup and parallel efficiency, defined as

\[
S_{par}(p) = \frac{T_s}{T(p)}, \qquad E_{par}(p) = \frac{S_{par}(p)}{p}, \qquad (7)
\]
where T_s is the wall (elapsed) time of the best sequential algorithm, T(p) stands for the wall (elapsed) time of the parallel algorithm using p processors, and S_{par}(p) and E_{par}(p) denote the parallel speedup and parallel efficiency, respectively. Because the best sequential algorithm is not known in general, in practical applications T_s is usually replaced by T(1), the wall time of the same parallel program on one processor (i.e., with both master and slave tasks performed on one processor). Thus, the speedup S_{par}(p) and parallel efficiency E_{par}(p) are commonly computed as follows:

\[
S_{par}(p) = \frac{T(1)}{T(p)}, \qquad E_{par}(p) = \frac{S_{par}(p)}{p}. \qquad (8)
\]
The parallel efficiency takes into account the loss of efficiency due to data communication and other extra costs introduced by the parallelization, such as partitioning the computation task, scheduling processes and managing data. The wall time for parallel computation with p processors can be decomposed into the following form

\[
T(p) = T_{par}(p) + T_{seq}(p) + T_{comu}(p) + T_{extra}(p), \qquad (9)
\]

where T_{par}(p) is the average CPU time spent on computation on the slave processors, satisfying T_{par}(p) = T_{par}(1)/p for our algorithm, T_{seq}(p) is the CPU time taken by the serial components of the computation, satisfying T_{seq}(p) = T_{seq}(1), T_{comu}(p) is the

communication time with p processors and T_{extra}(p) denotes other extra time (for example, idle time) introduced by the parallelization. Therefore, the speedup S_{par}(p) can be written as

\[
S_{par}(p) = \frac{T(1)}{T(p)}
           = \frac{T_{par}(1) + T_{seq}(1) + T_{comu}(1) + T_{extra}(1)}{T_{par}(1)/p + T_{seq}(1) + T_{comu}(p) + T_{extra}(p)}
           < \frac{T_{par}(1) + T_{seq}(1) + T_{comu}(1) + T_{extra}(1)}{T_{seq}(1) + T_{comu}(p) + T_{extra}(p)}. \qquad (10)
\]
This is called Amdahl’s law.
At each iteration of our parallel algorithm, there are m communications: m − 1 local communications and one global communication. The amount of data transferred in the kth (k = 0, 1, ..., m − 2) local communication is (k + 1)g units (float or double) plus an integer sign, while the global communication transfers n units (float or double). Therefore, assuming the network bandwidth is large enough to transfer n units of data at a time, the total communication time with p (p > 1) processors for our parallel algorithm is approximately

\[
T_{comu}(p) = \log_2(p)\,t_w + (n+1)\log_2(p)\,t_c
            + \left\{ (m-1)\,t_w + \left[ \frac{gm(m-1)}{2} + m - 1 \right] t_c \right\}\times IC
            + \left\{ \log_2(p)\,t_w + n\log_2(p)\,t_c \right\}\times (IC - 1) + t_w + n\,t_c
\]
\[
\phantom{T_{comu}(p)} = \left\{ \left[ \log_2(p) + m - 1 \right]\times IC + 1 \right\} t_w
            + \left\{ \left[ n\log_2(p) + \frac{(m-1)(gm+2)}{2} \right]\times IC + \log_2(p) + n \right\} t_c, \qquad (11)
\]
where IC is the iteration count needed to satisfy the stopping criterion, t_c is the transfer time of a unit (float or double) datum and t_w is the communication startup time. Assume that t_a is the time of one (float or double) arithmetic operation of type multiplication, division, addition or subtraction, and that the time of reading, assigning or printing a unit datum is also equal to t_a; then for our algorithm

\[
T_{par}(1) = (n^2 + n)\,t_a + 2n^2 t_a \times IC, \qquad T_{seq}(1) = 2n\,t_a,
\]
\[
T_{extra}(p) = \left[ \frac{g^2(p-1)^2\,t_a}{2} + O\big(g(p-1)\,t_a\big) \right]\times IC. \qquad (12)
\]
Therefore, noting n = mg and neglecting the communication time T_{comu}(1) when only one processor is employed, the speedup and parallel efficiency of our parallel algorithm can be written as

\[
S_{par}(p) = \Big[ (n^2+n)t_a + 2n^2 t_a \times IC + 2n t_a \Big] \times \bigg\{ \frac{(n^2+n)t_a + 2n^2 t_a \times IC}{p} + 2n t_a
 + \big[ (\log_2(p) + m - 1)\times IC + 1 \big] t_w
 + \Big[ \Big( n\log_2(p) + \frac{(m-1)(gm+2)}{2} \Big)\times IC + \log_2(p) + n \Big] t_c
 + \Big[ \frac{g^2(p-1)^2}{2} + O\big(g(p-1)\big) \Big] t_a \times IC \bigg\}^{-1}
\]
\[
\phantom{S_{par}(p)} = \big( 2n^2\times IC + n^2 + 3n \big) \times \bigg\{ \frac{2n^2\times IC + n^2 + n}{p} + 2n
 + \Big[ \Big( \log_2(p) + \frac{n}{g} - 1 \Big)\times IC + 1 \Big] \frac{t_w}{t_a}
 + \bigg[ \Big( n\log_2(p) + \frac{\big(\frac{n}{g}-1\big)(n+2)}{2} \Big)\times IC + \log_2(p) + n \bigg] \frac{t_c}{t_a}
 + \Big[ \frac{g^2(p-1)^2}{2} + O\big(g(p-1)\big) \Big]\times IC \bigg\}^{-1}, \qquad (13)
\]


\[
E_{par}(p) = \big( 2n^2\times IC + n^2 + 3n \big) \times \bigg\{ 2n^2\times IC + n^2 + n + 2np
 + \Big[ \Big( \log_2(p) + \frac{n}{g} - 1 \Big)\times IC + 1 \Big] \frac{t_w}{t_a}\, p
 + \bigg[ \Big( n\log_2(p) + \frac{\big(\frac{n}{g}-1\big)(n+2)}{2} \Big)\times IC + \log_2(p) + n \bigg] \frac{t_c}{t_a}\, p
 + \Big[ \frac{g^2(p-1)^2}{2} + O\big(g(p-1)\big) \Big] p\times IC \bigg\}^{-1}. \qquad (14)
\]

Remark. The above analysis is for the case where the coefficient matrix A is dense. When A is sparse, the corresponding
speedup and parallel efficiency can be easily obtained by a similar procedure if we know the number of nonzero entries of A.

From (13) and (14), we can see that for a size-fixed problem and a given parallel system, the speedup of the algorithm varies with the row-block-partitioning parameter g. In general, the bigger the parameter g, the smaller the number of communications per iteration (which equals n/g) and hence the smaller the communication startup time, leading to a decreased total communication time. At the same time, as the parameter g grows, the load balancing becomes worse, leading to more idle time among the processors and thus to a degraded parallel efficiency. Therefore, to obtain good parallel performance, the parameter g should be suitably chosen.
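To explore this trade-off numerically, the dominant terms of the speedup model (13) can be evaluated as a function of g. The routine below is our own sketch: it relies on the reconstruction of (11)–(13) given above and drops the O(g(p − 1)) part of T_extra, and it is intended only for experimenting with measured values of t_a, t_c and t_w.

```cpp
#include <cmath>

// Leading-term evaluation of the speedup model (13): the parallel computation time,
// the serial part, the communication model (11) and the dominant idle-time term of
// (12).  The O(g(p-1)) contribution to T_extra is neglected, so this approximates the
// analytical model, not the actual code.  ta, tc, tw are the measured unit operation,
// unit transfer and communication startup times; IC is the iteration count.
double predicted_speedup(double n, double g, double p, double IC,
                         double ta, double tc, double tw) {
  const double m    = n / g;                                 // number of row-blocks
  const double T1   = (n * n + n) * ta + 2.0 * n * n * ta * IC + 2.0 * n * ta;
  const double Tpar = ((n * n + n) * ta + 2.0 * n * n * ta * IC) / p;
  const double Tseq = 2.0 * n * ta;
  const double Tcom = ((std::log2(p) + m - 1.0) * IC + 1.0) * tw
                    + ((n * std::log2(p) + (m - 1.0) * (g * m + 2.0) / 2.0) * IC
                       + std::log2(p) + n) * tc;
  const double Text = 0.5 * g * g * (p - 1.0) * (p - 1.0) * ta * IC;
  return T1 / (Tpar + Tseq + Tcom + Text);
}
```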

3.5. The choice of parameter g

In parallel computing, due to the different structures and performance of parallel architectures, an algorithm achieving
good performance on a parallel architecture does not necessarily suit and perform well on another one. Therefore, portability
and flexibility are very important for a parallel algorithm. Among the attractive features of our parallel algorithm is the great
flexibility in the choice of the parameter g to adapt the algorithm to different parallel architectures. From the above analysis,
we can see that for a size-fixed problem, the speedup and then the parallel efficiency are the functions of the row-block-
partitioning parameter g which can be suitably valued according to the performance of the parallel system in use. For an
application of our parallel Gauss–Seidel algorithm, one may obtain good parallel performance by suitably choosing the value
of the row-block-partitioning parameter g according to the size of the problem and the performance (mainly, the values of t_c/t_a and t_w/t_a) of the parallel system in use. Specifically speaking, for a given parallel system, one can first use some standard routines (for example, in the PVM and MPI software) to measure its communication startup time t_w, the transfer time t_c of a unit datum and the unit arithmetic operation time t_a. Then, from (13) and through a calculus approach, good speedups may be obtained by choosing the parameter g as

\[
g \approx \sqrt[3]{ \frac{ \dfrac{n(n+2)}{2}\cdot\dfrac{t_c}{t_a} + n\cdot\dfrac{t_w}{t_a} }{ (p-1)^2 } }. \qquad (15)
\]
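Formula (15) translates directly into code; the small helper below is ours and simply rounds the result to the nearest integer block size.

```cpp
#include <cmath>
#include <cstddef>

// Row-block-partitioning parameter suggested by (15), given the problem size n, the
// number of processors p and the measured machine constants ta, tc and tw.
std::size_t suggested_g(double n, double p, double ta, double tc, double tw) {
  const double numerator = 0.5 * n * (n + 2.0) * (tc / ta) + n * (tw / ta);
  const double g = std::cbrt(numerator / ((p - 1.0) * (p - 1.0)));
  return static_cast<std::size_t>(g + 0.5);   // round to the nearest integer block size
}
```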

4. Numerical experiment

4.1. Numerical results

In our experiment, the parallel platform is made up of 4 PCs, each with a Pentium IV 2.4 GHz CPU, an 80 GB hard disk and 512 MB of DRAM (the master processor has 2 GB of DRAM), running the Microsoft Windows 2000 operating system with Microsoft VC 6.0, and connected by 100 Mbps Ethernet. Message-passing is supported through PVM 3.4, which is a public-domain
software from Oak Ridge National Laboratory [12]. PVM is a software system that enables a collection of homogeneous or
heterogeneous computers to be used as a coherent and flexible concurrent computational resource, or a ‘‘parallel virtual
machine’’. PVM consists of two parts: a daemon process that any user can install on a machine, and a user library that
contains routines for initiating processes on other machines, for communicating between processes, and for changing the
configuration of machines. User’s programs written in C, C++ or Fortran can access PVM through provided library routines
for functions such as process initiation, message transmission, synchronization and coordination via barriers or rendezvous.
Our codes for experiment were written in C++ making use of a PVM message-passing library.
For simplicity, the linear algebraic system Ax = b in our experiment is chosen in the following special form:

\[
A = (a_{i,j})_{n\times n}, \qquad a_{i,j} = \begin{cases} n, & i = j,\\ 0.8^{|i-j|}, & 0 < |i-j| \le n/4,\\ 0, & \text{otherwise}, \end{cases} \qquad (16)
\]

i, j = 0, 1, ..., n − 1. The right-hand side vector b is randomly valued. The processors compute with double precision. The initial iteration vector is chosen as the zero vector and the stopping criterion is

\[
|x_i^{(k+1)} - x_i^{(k)}| < 10^{-6} \quad (i = 0, 1, \ldots, n-1).
\]
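For reference, the test system (16) together with a randomly valued right-hand side can be generated as follows; the dense storage and the use of std::rand are our illustrative choices.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Build the test system (16): a_ii = n, a_ij = 0.8^|i-j| for 0 < |i-j| <= n/4 and
// a_ij = 0 otherwise, together with a randomly valued right-hand side b.
void build_test_system(std::size_t n,
                       std::vector<std::vector<double>>& A,
                       std::vector<double>& b) {
  A.assign(n, std::vector<double>(n, 0.0));
  b.resize(n);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      const std::size_t d = (i > j) ? i - j : j - i;
      if (d == 0)          A[i][j] = static_cast<double>(n);
      else if (d <= n / 4) A[i][j] = std::pow(0.8, static_cast<double>(d));
    }
    b[i] = static_cast<double>(std::rand()) / RAND_MAX;   // random right-hand side entry
  }
}
```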


The numerical results are listed in Tables 1 and 2.

4.2. Analysis based on numerical results

From Table 2, we can see that our parallel Gauss–Seidel algorithm is of relatively high efficiency. Good speedups were
obtained with a certain row-block-partitioning parameter g in the experiment.
As predicted by the previous analysis, Tables 1 and 2 show that for a size-fixed problem and a given parallel system, the wall (elapsed) time and the speedup of the algorithm are closely correlated with the row-block-partitioning parameter g. When the row-block-partitioning parameter g is smaller than a certain value, the communication time grows faster than the computation time shrinks as the number of processors increases, and hence the wall time of the algorithm increases rather than decreases.

Table 1
The wall (elapsed) time in seconds of the parallel Gauss–Seidel algorithm with double precision, where p denotes the number of processors, n the order of the coefficient matrix and g the row-block-partitioning parameter.

        n = 16 000                              n = 24 000
p   g = 100   g = 500   g = 1000   g = 2000     g = 100   g = 500   g = 1000   g = 2000
1   59.093    50.609    50.109     49.109       132.938   114.110   112.500    112.438
2   59.547    30.656    28.953     29.204       124.250   68.797    63.422     62.344
3   66.922    34.094    27.281     31.360       130.516   66.407    55.125     53.969
4   79.938    36.953    30.391     33.641       150.797   68.235    56.094     55.954

Table 2
The speedup of the parallel Gauss–Seidel algorithm.

        n = 16 000                              n = 24 000
p   g = 100   g = 500   g = 1000   g = 2000     g = 100   g = 500   g = 1000   g = 2000
2   0.992     1.651     1.731      1.711        1.070     1.659     1.774      1.804
3   0.883     1.484     1.837      1.578        1.019     1.718     2.041      2.083
4   0.739     1.370     1.649      1.485        0.881     1.672     2.006      2.009

Fig. 1. Speedup versus (a) the row-block-partitioning parameter g and (b) number of processors.

Consequently, the speedups are smaller than one and decrease as the number of processors increases; see the second columns of Tables 1 and 2 for the case n = 16 000, g = 100. Only within a certain range of the row-block-partitioning parameter g does the speedup increase with
growing g. However, when the row-block-partitioning parameter g exceeds another certain value, the speedup decreases
instead of increasing as g further increases; see Fig. 1(a) for the case p = 3. A possible reason may be that with the
growing row-block-partitioning parameter g, the time of communication decreases; at the same time, however, the load
imbalance among processors becomes worse, i.e., there is more waiting time among processors. Therefore, when the increase
in the waiting time is larger than the decrease in the communication time, decreased speedups and then decreased parallel
efficiency are observed. As shown in Table 2 and Fig. 1(a), the ‘‘optimal’’ values of g for the cases n = 16 000 and n = 24 000 are 1000 and 2000, respectively. Here and hereafter, an ‘‘optimal’’ value is the value beyond which the speedup decreases rather than increases as the corresponding parameter grows in our experiment.
It is also found that for a certain value of the parameter g, when the size of the problem is fixed and the number of
processors is small, the speedup increases with the number of processors; nevertheless, when the number of processors
exceeds a certain value, the speedup decreases instead of increasing as the number of processors increases; see Fig. 1(b) for
the case g = 1000. The reason may be that, on the one hand, as the number of processors grows, the communication time and the idle time among processors increase sharply; on the other hand, with more processors, the granularity of the computation task on each processor becomes smaller, making the ratio of the time spent on communication and waiting to the time spent on computation higher. When the decrease in computation time is smaller than the increase in communication and waiting time introduced by the additional processors, a declining speedup, and hence a decreasing parallel efficiency, are observed.
On the other hand, when the parameter g and the number of processors are fixed, the speedup increases as the size of
the problem increases; see Table 2. A possible explanation is that the relative ratio of time spent on computation to that
spent on communication and waiting becomes higher as the size of the problem increases when the parameter g and the
number of processors are fixed and consequently, the speedups increase.

5. Conclusion

In this paper, we have discussed a distributed memory parallel Gauss–Seidel algorithm for general linear algebraic systems. A numerical experiment performed on a local area network of 4 PCs, with PVM used for the message-passing, is also given. The conclusions are as follows.
• By introducing a row-block-partitioning parameter g, the algorithm is flexible and can adapt to different distributed memory parallel systems. With a suitable value of the parameter g, depending on the performance (mainly, the relative ratio of communication to computation speed) of the parallel system in use and the size of the problem, good parallel performance may be obtained.
• The technique of cyclically conveying the solution iteration vector leads to a large decrease in the communication time
and hence results in a high speedup and parallel efficiency of the algorithm with a certain value of the parameter g.
• For a problem with given parameters, there is an ‘‘optimal’’ number of processors which maximizes the parallel efficiency. The larger the size of the problem, the larger this ‘‘optimal’’ number of processors.

Acknowledgements

The author would like to thank the editor and referees for their valuable comments and suggestions which helped to
improve the results of this paper. The free software PVM was also extensively used in this study.

References

[1] G.H. Golub, C.F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, 1983.
[2] D.M. Young, Iterative Solution of Large Linear Systems, Academic Press, New York, 1971.
[3] L.M. Adams, H.F. Jordan, Is SOR color-blind? SIAM J. Sci. Statist. Comput. 7 (2) (1986) 490–506.
[4] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, D. Walker, Solving Problems on Concurrent Processors, Prentice Hall, 1988.
[5] G. Golub, J.M. Ortega, Scientific Computing with an Introduction to Parallel Computing, Academic Press, Boston, MA, 1993.
[6] D.P. Koester, S. Ranka, G.C. Fox, A parallel Gauss–Seidel algorithm for sparse power system matrices, in: Supercomputing ’94 Proceedings, 14–18 November 1994, pp. 184–193.
[7] M.F. Adams, A distributed memory unstructured Gauss–Seidel algorithm for multigrid smoothers, in: ACM/IEEE Proceedings of SC01: High
Performance Networking and Computing, 2001.
[8] P. Amodio, F. Mazzia, A parallel Gauss–Seidel method for block tridiagonal linear systems, SIAM J. Sci. Comput. 6 (1995) 1451–1461.
[9] P. Amodio, F. Mazzia, Parallel iterative solvers for boundary value methods, Math. Comput. Modelling 23 (1996) 29–43.
[10] T. Kim, C.O. Lee, A parallel Gauss–Seidel method using NR data flow ordering, Appl. Math. Comput. 99 (1999) 209–220.
[11] M. Adams, M. Brezina, J. Hu, R. Tuminaro, Parallel multigrid smoothing: Polynomial versus Gauss–Seidel, J. Comput. Phys. 188 (2003) 593–610.
[12] A. Geist, A. Beguelin, J. Dongarra, et al., PVM: Parallel Virtual Machine—A Users’ Guide and Tutorial for Networked Parallel Computing, MIT Press,
Cambridge, 1994.
