
Solution of Linear Systems of Ordinary Differential Equations on an INTEL Hypercube

L. Lustman, B. Neta, Naval Postgraduate School, Department of Mathematics, Monterey, CA 93943, and C. P. Katti, Jawaharlal Nehru University, School of Computer and Systems Sciences, New Delhi 110067, India

Keywords: initial value problems, parallel processing, hypercube, box scheme, recursive doubling. Subject classification: 65L, 65W

Abstract

Here we develop and test a parallel scheme for the solution of linear systems of ordinary initial value problems, based on the box scheme and a modified recursive doubling technique. The box scheme may be replaced by any stable integrator. The algorithm can be modified to solve boundary value problems. Software for both problems is available upon request.

1 Introduction
We consider the solution of linear problems on a hypercube. By a hypercube we mean `a distributed memory MIMD computer with communication between processors ... via a network having the topology of a p-dimensional cube, with the vertices considered as processors and the edges as communication links' [4]. See also Fox [1, 2] and Fox and Otto [3]. Our method of solution is based on the box scheme to discretize the system of initial value problems

y' = Ay + f(x),   y(a) = y_0,

where y and f are n-dimensional vectors and A is an n × n matrix. We obtain, in parallel, fundamental solutions on subintervals. The resulting system of equations is solved by a modified version of the recursive doubling technique (see Stone [7]). Another technique which parallelizes the solution by subinterval decomposition has been proposed by Skeel [6].

In the next section, the general problem is stated and some information on the INTEL Hypercube is given. The algorithm for initial value problems is described in section 3, and the efficiency of the algorithm is discussed in section 4, where we detail the numerical experiments performed with our algorithm. In the last section we present our conclusions.

2 The general problem


The numerical solution of ordinary differential systems is an intrinsically sequential procedure: given the data at a point x (or at several points x, x - h, ..., x - Kh), one advances to the following point x + h. In order to parallelize this procedure, we make the basic remark that for a linear system y' = A(x)y on an interval [a, b], the solution at the right endpoint is a linear function of the values at the left endpoint:

y(b) = Y[a, b] y(a).

Here Y[a, b] is the value at x = b of Y, the fundamental solution on the interval, which is defined by

Y' = A(x)Y,   Y(a) = I, the identity matrix.
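To make this concrete, the following sketch (ours, not from the paper; it uses NumPy and a simple centered step in place of the authors' production code, and all function names are our own) computes Y[a, b] by marching the matrix equation, and checks the linearity relation y(b) = Y[a, b] y(a):

```python
import numpy as np

def box_step_matrix(A, x, h):
    """One centered (box / implicit-midpoint) step for Y' = A(x)Y:
    (I - h/2 A_m) Y_new = (I + h/2 A_m) Y_old, with A_m = A(x + h/2).
    Returns the one-step propagator matrix."""
    n = A(x).shape[0]
    Am = A(x + h / 2.0)
    I = np.eye(n)
    return np.linalg.solve(I - 0.5 * h * Am, I + 0.5 * h * Am)

def fundamental_matrix(A, a, b, steps=1000):
    """Y[a, b]: value at x = b of the solution of Y' = A(x)Y, Y(a) = I."""
    h = (b - a) / steps
    Y = np.eye(A(a).shape[0])
    x = a
    for _ in range(steps):
        Y = box_step_matrix(A, x, h) @ Y  # advance the whole matrix at once
        x += h
    return Y

# Verify y(b) = Y[a, b] y(a) on a small variable-coefficient system.
A = lambda x: np.array([[0.0, 1.0], [-1.0, 0.1 * x]])
Y_ab = fundamental_matrix(A, 0.0, 1.0)
y0 = np.array([1.0, -2.0])
# Integrate y directly with the same steps for comparison.
y, x, h = y0.copy(), 0.0, 1.0 / 1000
for _ in range(1000):
    y = box_step_matrix(A, x, h) @ y
    x += h
print(np.allclose(Y_ab @ y0, y))  # True: the discrete flow is linear
```

Because the flow of a linear system is itself linear, the whole matrix can be advanced column by column at the cost of n right-hand-side evaluations per step; this is the observation the parallel algorithm exploits.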

To solve a problem on the interval [xmin, xmax], we propose to assign contiguous subintervals [x_0, x_1], [x_1, x_2], ..., [x_{N-1}, x_N], with x_0 = xmin and x_N = xmax, to the N processors, and let each of them compute in parallel the corresponding fundamental solution. This is a task requiring a possibly large number of sequential steps, for the numerical evaluation of Y[x_i, x_{i+1}]. After these quantities are ready, one may obtain y(x_i) from the initial data y(xmin) by matrix multiplication, since obviously

Y[b, c] Y[a, b] = Y[a, c].

Let us remark at the outset that this elementary procedure may be extended to inhomogeneous equations, with initial data

y' = A(x)y + f(x),   y(xmin) = given,

or two-point boundary data

y' = A(x)y + f(x),   xmin ≤ x ≤ xmax,   B_1 y(xmin) + B_2 y(xmax) = given.

Such extensions necessitate only the linearity of the equations and of the initial or boundary conditions. We shall discuss these general algorithms, as well as the steps necessary to obtain computational efficiency.

In order to address efficiency matters, we must also briefly present the machine on which the algorithm is run. The iPSC/2 Intel Hypercube is a MIMD (multiple instruction, multiple data) machine, consisting of several processors in hypercube connection. Each such processor, also called a node, executes its own program, on data in its own memory. The nodes are controlled by another processor, the host, which loads the programs into the nodes and starts them running. Host and nodes communicate by message passing; these messages are strings of arbitrary length, with an arbitrary `message type' (an integer), which may be sent from any processor to any other processor. At any moment a processor may send a message, find whether messages of a certain type are pending, or receive messages. The communication may be performed synchronously, i.e. the processing stops until a message is sent or received, or asynchronously, where processing and communication overlap. It is seen, therefore, that an algorithm is optimal on such a machine if it may be set up as several parallel processes, each working on its own data, with a minimum of inter-process communication. We shall see that our ordinary differential equation solvers fit the Intel architecture very well.
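Continuing the illustration above (again ours), the composition property can be verified numerically with the hypothetical fundamental_matrix helper from the previous sketch:

```python
# Y[b, c] Y[a, b] = Y[a, c]: chaining two subintervals equals
# integrating across the whole interval in one piece.
a, b, c = 0.0, 0.5, 1.0
lhs = fundamental_matrix(A, b, c) @ fundamental_matrix(A, a, b)
rhs = fundamental_matrix(A, a, c)
print(np.allclose(lhs, rhs, atol=1e-5))  # True up to O(h^2) discretization error
```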

3 The algorithm for the initial value problem


Step 1. Using N processors to solve the linear inhomogeneous system with initial conditions

y' = A(x)y + f(x),   y(xmin) = g,

divide the required interval into N subintervals [x_0, x_1], [x_1, x_2], ..., [x_{N-1}, x_N], with x_0 = xmin and x_N = xmax. The algorithm will produce numerical approximations for y(x_j), j = 1, ..., N.

Step 2. Do in parallel: processor j, working on the interval [x_{j-1}, x_j], solves numerically the following two systems:

Y_j' = A(x) Y_j,   Y_j(x_{j-1}) = I, the identity matrix,

and

φ_j' = A(x) φ_j + f(x),   φ_j(x_{j-1}) = 0.

In our program this is done using the box scheme (see, e.g., Neta and Katti [5]). The matrix Y_j is the fundamental solution on the subinterval, whereas φ_j incorporates the inhomogeneous effect of the forcing function f. When this step is completed, one may recursively compute y(x_j) from

y(x_1) = Y_1(x_1) g + φ_1(x_1)
y(x_2) = Y_2(x_2) y(x_1) + φ_2(x_2)
...
y(x_N) = Y_N(x_N) y(x_{N-1}) + φ_N(x_N).
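As a concrete reading of Step 2 (our sketch; the box scheme is taken here as the centered implicit-midpoint discretization, and all names are ours), one processor's work on its subinterval looks like this, with Y_j and φ_j marched together:

```python
import numpy as np

def subinterval_solve(A, f, x_lo, x_hi, steps):
    """Box-scheme march on one subinterval [x_lo, x_hi]:
    Y'   = A(x) Y,         Y(x_lo)   = I  (fundamental solution), and
    phi' = A(x) phi + f(x), phi(x_lo) = 0  (inhomogeneous part)."""
    n = A(x_lo).shape[0]
    h = (x_hi - x_lo) / steps
    I = np.eye(n)
    Y, phi = np.eye(n), np.zeros(n)
    x = x_lo
    for _ in range(steps):
        Am, fm = A(x + h / 2), f(x + h / 2)  # data at the cell midpoint
        L = I - 0.5 * h * Am                 # implicit (left) operator
        R = I + 0.5 * h * Am                 # explicit (right) operator
        Y = np.linalg.solve(L, R @ Y)
        phi = np.linalg.solve(L, R @ phi + h * fm)
        x += h
    return Y, phi  # M_j = Y_j(x_j) and phi_j(x_j) in the notation above
```

No communication is needed during this step; each processor works entirely on its own subinterval.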

The last step of the algorithm is an efficient performance of the recursion above, assuming that N = 2^m, as usual on a hypercube.

Step 3. (This is a modification of the recursive doubling due to Stone [7].)

3a) For 1 ≤ j ≤ N initialize:
y_j = φ_j(x_j),   M_j = Y_j(x_j).
Also initialize y_1 = M_1 g + y_1 and k = 1.

3b) For all j > k compute:
ỹ_j = y_j + M_j y_{j-k},   M̃_j = M_j M_{j-k}.

3c) For all j > k replace y, M by ỹ, M̃:
y_j = ỹ_j,   M_j = M̃_j.

3d) Set k = 2k. If k < N, repeat steps (3b)-(3c) above. Otherwise the algorithm ends with y_j the numerical approximations to the solution at x_j.
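The following serial emulation of Step 3 (our sketch, with our own names) may clarify the bookkeeping; on the hypercube, each index j is a processor, and the reads of y[j - k] and M[j - k] become messages:

```python
import numpy as np

def recursive_doubling(M, phi, g):
    """Modified recursive doubling (Step 3), emulated serially.
    M[j-1], phi[j-1] hold M_j = Y_j(x_j) and phi_j(x_j) for j = 1..N,
    N a power of 2; returns the approximations to y(x_j) for all j."""
    N = len(M)
    y = [p.copy() for p in phi]            # 3a) y_j = phi_j(x_j)
    y[0] = M[0] @ g + y[0]                 # 3a) y_1 = M_1 g + y_1
    Mloc = [m.copy() for m in M]
    k = 1
    while k < N:                           # 3d) stride k doubles each round
        new_y, new_M = {}, {}
        for j in range(k, N):              # 3b) all (1-based) j > k
            new_y[j] = y[j] + Mloc[j] @ y[j - k]
            new_M[j] = Mloc[j] @ Mloc[j - k]
        for j in range(k, N):              # 3c) commit after the whole round
            y[j], Mloc[j] = new_y[j], new_M[j]
        k *= 2
    return y
```

The separate commit in (3c) matters: every update in (3b) must read values from the previous round, which is exactly why the parallel version needs the buffering discussed in the next section.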

4 The efficiency of the algorithm


We begin our discussion with an investigation of the communication overhead. In Step 3 there will be interprocessor communication, as processor j obtains data from processor j - k. The algorithm requires only one additional buffer per processor, to hold M and y, under the assumption that the matrix multiplications are performed in the order shown. It is also possible to perform steps (3b)-(3c) in parallel, but then care must be exercised to avoid data corruption by message passing.

One option is to use just one buffer, but accept data only when ready. We shall call this the `send on request' scheme. The other option is to broadcast data as soon as it is ready. This we shall call the `multiple buffer' scheme. Yet another possibility is to use as temporary buffers the memory provided by the hypercube communication technology. For example, processor 1 sends data to processor 5, in a message with message type 1 (the identity of the sender). Processor 5, executing step (3b) with k = 1, expects data from processor 4, so it will accept only a message with message type 4. The data from processor 1 are left in the communication buffers, to be read when processor 5 reaches the stage k = 4. This version, the `message type' scheme, is clearly the simplest to program.

We have implemented all three versions mentioned above. As expected, the `send on request' program has a higher communication overhead, because about twice as many messages are passed as in the other schemes. The multiple buffer scheme and the message type scheme run at substantially the same speed, although the messages arrive in a different order. The test problems show that the message type scheme is preferable, unless the data to be transferred are so bulky as to slow down communication. This certainly does not happen in this program, which transfers matrices of moderate size. Moreover, as the size of the problem (i.e. the dimension of the vector y) increases, more and more work is done on actually solving the differential equations, and the communication overhead becomes less significant.
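As a rough modern rendering of the `message type' scheme (our sketch; mpi4py stands in for the iPSC/2 send/receive primitives, the host/node split is not reproduced, and the per-processor data are stand-ins), each processor tags outgoing data with its own rank, and a receiver asks only for the tag its current round needs, so early arrivals simply wait in the MPI buffers:

```python
# Sketch only: run with, e.g., mpiexec -n 8 python this_file.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()           # plays the role of processor j - 1
N = comm.Get_size()              # assumed a power of 2, as in the paper

# Stand-in per-processor data; in the real algorithm these would be
# M_j = Y_j(x_j) and y_j = phi_j(x_j) from Step 2 (plus M_1 g on rank 0).
n = 2
M = np.eye(n) * (1.0 + 0.1 * rank)
y = np.full(n, float(rank))

k = 1
while k < N:
    if rank + k < N:
        # Forward (M, y), tagged with the sender's identity.
        comm.send((M.copy(), y.copy()), dest=rank + k, tag=rank)
    if rank >= k:
        # Accept only the message tagged rank - k; a message meant for a
        # later round stays in the MPI buffers until its round comes.
        M_prev, y_prev = comm.recv(source=rank - k, tag=rank - k)
        y = y + M @ y_prev
        M = M @ M_prev
    k *= 2
```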

An idea of the magnitude of the communication overhead may be obtained from the data in the following tables, which summarize several numerical experiments in solving the system

y_i' = y_i + x y_{i+1} + f_i,   0 < i < 10,
y_10' = -y_10 + f_10,

where f_i is adjusted so that the exact solution is

y = (1, e^x, e^{-x}, e^{2x}, e^{-2x}, e^{3x}, e^{-3x}, x, sin(x), cos(x)).

The first table shows the total time spent by each processor in solving the problem, as obtained from the mclock system call. Most of the work is done in computing the fundamental solutions, and communication is a relatively small quantity. Even the `send on request' scheme, which has a large number of messages transmitted, does not strongly influence the run times, which are nearly constant across the various processors.

Table 1: System of order 10, `send on request'; total busy time in msec.

Steps per processor | Busy time on each of the 8 processors
10                  |  871   856   847   827   873   852   834   825
20                  | 1641  1630  1621  1600  1637  1617  1608  1597
40                  | 3183  3163  3162  3139  3188  3159  3153  3138
80                  | 6265  6244  6244  6221  6270  6234  6240  6220

Another efficiency measure, critical for comparing single processor and multiprocessor versions of the same mathematical procedure, is the total running time needed for the complete solution. We can roughly estimate this quantity as follows: let the unknown vector y be of dimension n, and assume that the numerical solution involves s steps (of size h) to reach xmax from xmin. A single processor algorithm will need a time proportional to

s n,

as it evaluates n right hand sides s times (we assume that most of the computational work is spent on obtaining the right hand sides of the differential equations, and we ignore matrix-vector or matrix-matrix multiplications). Our parallel algorithm, using N processors, will have a running time proportional to

s n^2 / N,

because each processor executes only s/N steps, but the quantity computed is the fundamental solution, an n × n matrix. Thus, it appears that there will be a gain only if n < N, i.e. the order of the differential system is less than the number of processors.

Even if there is no obvious gain in parallelization when all one needs is the solution of one differential problem, the proposed algorithm may become efficient when used as the first step of an inverse problem or a distributed parameter problem. In such a case, the same system is solved repeatedly with different initial conditions (say); then, after obtaining the quantities M_j, φ_j in the processors, one may use Step 3 of the algorithm to obtain sets of values y_j from sets of initial conditions.

The discussion and numerical experiment data of this section have been concerned only with initial value problems, but they are equally valid for the parallel solution of boundary value problems.
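As a worked instance of this estimate (ours): the predicted speedup is (s n) / (s n^2 / N) = N / n, so for the order-10 test system above on 8 processors the ratio is 8/10 < 1, and no gain can be expected from a single solve, whereas 16 processors would give a modest factor of 1.6. This is why the repeated-solve setting described above, in which the Step 2 work is amortized over many initial conditions, is the natural application.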

5 Conclusions
We have presented a parallel algorithm for solving ordinary initial value problems. We have shown that this algorithm is easy to program, and that machine-dependent optimization is readily achievable. Moreover, the algorithm is very flexible: as the equations are solved independently on each subinterval, it is possible to use different subinterval sizes, or different solution strategies in each subinterval, in order to control the error or to balance the work among processors. The algorithm can be modified slightly to solve boundary value problems. We have identified certain classes of practical mathematical procedures for which our methods will be useful; these include various forms of inverse problems. The basic limitation of our algorithm is that it applies only to linear problems. We are currently working on a method of parallelizing the solution of general, nonlinear ordinary differential systems.

Acknowledgements
This research was conducted for the Office of Naval Research and was partially funded by the Naval Postgraduate School. Preliminary results were obtained by the last two authors, supported partially by the National Science Foundation through grants No. INT-8519159 and INT-8613396. The authors gratefully acknowledge comments and corrections by the editor and referees.

References
[1] G. C. Fox, Concurrent Processing for Scientific Calculations, Proceedings COMPCON 1984 Conference, (1984), 70-73.
[2] G. C. Fox, Are Concurrent Processors General Purpose Computers?, IEEE Trans. Nucl. Sci., NS-32, (1984), 182-186.
[3] G. C. Fox and S. W. Otto, Algorithm for Concurrent Processors, Physics Today, 37, No. 5, (1984), 50-59.
[4] H. B. Keller and P. Nelson, Hypercube Implementations of Parallel Shooting, Appl. Math. Comp., 31, (1986).
[5] B. Neta and C. P. Katti, Solution of Linear Initial Value Problems on a Hypercube, Technical Report NPS-53-89-001, Naval Postgraduate School, Monterey, CA, (1988).
[6] R. D. Skeel, Waveform Iteration and the Shifted Picard Splitting, SIAM J. Sci. Stat. Comput., 10, (1989), 756-776.
[7] H. S. Stone, An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations, J. Assoc. Comput. Mach., 20, (1973), 27-38.
