formance potential of a parallel architecture, and the level of performance that can be expected on real applications (as opposed to highly tuned benchmarks). The eight benchmarks are representative of a range of application programs in computational aeronautics, and are denoted herein and elsewhere by the following two-letter abbreviations:

BT - block tridiagonal
SP - scalar pentadiagonal
LU - incomplete LU factorization
EP - embarrassingly parallel
FT - fast Fourier transform
MG - multi-grid
CG - conjugate gradient
IS - integer sort
The first three are known as pseudo-applications and the remaining five benchmarks are known as kernels. As befits these descriptions, the pseudo-applications are considerably more complicated than the kernels. We therefore focus this discussion on two of the pseudo-applications, BT and SP.
The BT and SP benchmarks are very similar in structure: they model two different ways of solving the Navier-Stokes equations by an ADI (alternating-direction implicit) method. Both schemes are embedded in the well-known ARC3D code for modeling three-dimensional fluid dynamics. From a parallel programming standpoint, the most significant difference between BT and SP is in the ratio of local computation to inter-processor communication (remote memory references); SP requires roughly three times as much inter-processor communication per unit computation as does BT [11].

The KSR1 performance results presented in this paper, together with data for other computers, are also presented in [10].

Figure 1 shows the scalability of the KSR1 performance on the BT and SP benchmarks. In keeping with the practice in the NAS reports (e.g., [10]), the performance is presented as a ratio to the Cray Y-MP/1, which is regarded by many as a supercomputing guide post. Results from the Intel iPSC860 and Thinking Machines CM-5 are also presented in Figure 1, but there is insufficient data to ascertain their scalability on these benchmarks. The performance level of the KSR1-128 on the BT benchmark (5.88 Y-MP/1 equivalents) is the highest attained so far by any highly parallel computer on any of the NAS benchmarks except EP. Furthermore, on the BT benchmark, a KSR1-64 delivers 30% of peak performance, as compared with 10% on the CM-5 and 7% on the iPSC860.

Figure 1: Performance results on the NAS BT and SP benchmarks. (Two panels, BT and SP, plotting performance relative to the Cray Y-MP/1 against the number of processors, 0 to 128, for the KSR1, iPSC860, and CM-5.)

The prefetch capability of the KSR1 architecture makes it possible to overlap computations and interprocessor communications. This capability was exploited in the implementations of both the BT and SP benchmarks. The KSR1 performance on SP is lower than on BT primarily
because the single-processor performance is lower (due to less reuse of subcached data), not because of the lower computation-to-communication ratio of the benchmark.

Table 1: KSR1-32 performance on the eight NAS benchmarks.

Benchmark   User time (s)   MFLOPS   Performance (ratio to Cray Y-MP/1)
BT          439.0           412      1.80
SP          377.7           270      1.25
LU          1040.0          62       0.32
EP          69.8            381      1.81
FT          13.6            415      2.12
MG          20.6            190      1.06
CG          21.7            69       0.54
IS          40.2            N/A      0.28

Table 1 gives the performance of a KSR1-32 on all eight benchmarks. On average, the performance of a KSR1-32 is roughly 1.15 times that of a Cray Y-MP/1, and the delivered performance is roughly 20% of its peak performance. All of the KSR1 implementations, with the exception of FT, started from the sample implementations that are distributed by NAS. The programs were incrementally modified to exploit the scalability of the KSR1. We believe that the performance of several of the benchmarks, particularly LU and CG, can be substantially improved with further tuning.

3: Ease of Use

Having established that the KSR1 delivers high performance, both in relative and absolute terms, we now turn to ease-of-use issues. Many attributes of the KSR1 system contribute to ease-of-use; they have been categorized into architecture- and software-related aspects [1]. In the former category we especially mention:

- hardware-based access to a global address space,
- automatic memory management and data transfer,
- hardware-based packet routing and cache coherency management, and
- automatic enhancement of data locality.

These features of the ALLCACHE memory system, in turn, enable software features which make the KSR1 easier to use [2]:

- a familiar, shared-memory programming paradigm, and
- an incremental approach to parallelization.

Here we will focus on how the ALLCACHE architecture naturally leads to a key conceptual difference between the approach to programming the KSR1 and MPP's, regardless of whether they support shared-memory, message-passing, or data-parallel programming models.

On MPP's, the software must explicitly define three things:

- the distribution of the data among the memories of a specified processor set,
- how the computations are to be partitioned into tasks, and
- the execution schedule of the tasks.

The key difference between the KSR1 and MPP's is that the ALLCACHE memory system obviates the specification of data distribution, leaving only the last two items to be specified by the software. Thus, it is most natural to provide language constructs which specify how computations are to be partitioned and scheduled rather than how the data is to be distributed.

3.1: Parallelization directives

There are three major parallel constructs in KSR Fortran (see Burke [2] for more details on KSR Fortran and equivalent support for applications written in C):

TILE - The iteration space defined by a Fortran do loop nest is partitioned into tiles, or groups of loop iterations, which are executed in parallel.

PARALLEL REGION - Multiple instantiations of the same code segment are executed in parallel.

PARALLEL SECTIONS - Multiple unique code segments are executed in parallel.

Of the three, the tile directive is the most sophisticated and the most commonly used. The tile directive specifies the loop indices over which tiling is to occur. These indices define an iteration space. For example, in Figure 2, the indices i, j, k define the iteration space. A point in this iteration space corresponds to unique values of the loop indices i, j, k. The tile directive causes the iteration space to be
partitioned into rectilinear sub-spaces called tiles, each of which contains enough loop iterations to justify the parallelization overhead. The tile size and the order in which tiles are executed are collectively referred to as the tiling strategy. The strategy is automatically determined by the runtime system, or it can be specified in the application program.

Figure 2: Iteration space and data space. (The three-dimensional iteration space, with axes i, j, k, is partitioned into tiles, which are then mapped onto the data space.)

3.2: An example: a tiled loop nest

Here we present a simple example which illustrates how the ALLCACHE architecture manifests itself in software. The code fragment below performs the matrix operation C = aAB, where a is a scalar and A, B, C are matrices:
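A minimal sketch of such a tiled loop nest (not the paper's original fragment) follows: a standard Fortran 77 triple loop computing C = aAB, preceded by a tile directive written as a comment. The subroutine name scalmm and the exact spelling of the TILE directive are illustrative assumptions; only the C*KSR* comment prefix appears verbatim elsewhere in this paper.

c     Illustrative sketch: tiled loop nest for C = alpha*A*B.
c     The directive line below is an assumed spelling of the KSR
c     Fortran tile directive; only the C*KSR* prefix is taken from
c     this paper.
      subroutine scalmm(n, alpha, a, b, c)
      integer n, i, j, k
      real*8 alpha, a(n,n), b(n,n), c(n,n)
      real*8 s
C*KSR* TILE (i, j)
      do 30 j = 1, n
         do 20 i = 1, n
            s = 0.0d0
            do 10 k = 1, n
               s = s + a(i,k)*b(k,j)
   10       continue
            c(i,j) = alpha*s
   20    continue
   30 continue
      return
      end

Here each tile is a rectangular block of (i, j) iterations executed on one processor; the tile size and execution order are chosen by the runtime system unless the program specifies them, and the ALLCACHE system brings the referenced portions of A, B, and C into that processor's local cache automatically. Tiling the k loop as well would make the inner sum a parallel reduction and is omitted from this sketch.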
The ALLCACHE memory system enables users to run an application on a single processor even when the size of the data set exceeds the capacity of the local cache. In this situation, ALLCACHE automatically migrates data which has not been accessed recently to the local caches of other cells (if they have excess capacity) rather than to disks, as in conventional virtual memory systems. Transfers between local caches occur at substantially higher rates than transfers to external devices. The benefit is that an application can utilize the entire memory of the KSR1 system without making a single modification.

The next step, of course, is to scale the performance. To achieve this, a programmer can incrementally add parallel constructs which distribute the computations among available processors until the desired level of performance is attained. Ideally, the performance of an application should scale linearly with the number of processors that participate in the computations.

By contrast, on MPP's, which do not automatically manage data distribution and transfer, the user must reduce the size of a data set to fit within the memory of a single processor. Only after extensive program modifications to distribute data and partition the computational tasks can the application exploit the scalability of the distributed-memory system.

The benefits of an incremental approach to parallelism also apply to new software development. A programmer does not have to first master a new programming paradigm, such as a message-passing or data-parallel programming style. A program can be developed on a single processor and then parallelized. If the initial approach to parallelization proves to be inefficient, the programmer can easily try different approaches without rewriting the entire application in order to change the distribution of global data structures, as is required on other MPP's.

In our experience with many benchmarks and applications, the incremental approach to parallelization requires substantially less programming effort than partially or even totally rewriting a program to exploit the scalability of a distributed-memory architecture. The time required to port and parallelize codes can be measured in days or weeks rather than months or years. Our experience with the Amber application code on the KSR1 is a typical example of incremental parallelization in practice.

4: An example: parallelization of AMBER

This example illustrates how the KSR1's ALLCACHE memory system and software environment enabled the incremental parallelization and optimization of AMBER, a sizable computational chemistry application [12]. AMBER is written entirely in Fortran 77 and contains over 110,000 lines of source code. This large program was implemented on the KSR1 by first porting it to a single processor, optimizing its performance on a single processor, and then parallelizing key portions of the program.

AMBER is in fact not a single code, but a suite of programs developed at the University of California, San Francisco, by Peter Kollman and associates. The major computational modules are MINMD, GIBBS and SANDER [13]. MINMD performs either energy minimization or molecular dynamics simulations. GIBBS computes free-energy differences between two similar states of a molecule using either a perturbation or a thermodynamic integration approach. SANDER performs molecular refinement using NMR data as input. All three modules have been ported and incrementally parallelized on the KSR1; this example focuses on the MINMD and GIBBS modules.

4.1: Porting the MINMD module to a single processor

The single-processor port of MINMD was straightforward and involved only minor changes to the Fortran 77 source code. Improvements in the single-processor performance were realized largely by eliminating unnecessary temporary arrays and redundant loops. The superscalar nature of the KSR1 processor eliminates the need for the temporary storage arrays that enhance performance on vector machines.

For example, in the original code, the components of the interatomic distances were first stored in arrays XWIJ(I), YWIJ(I) and ZWIJ(I), where I is the loop index (atom number). The resultant distance was stored in an additional array, RWIJ(I). These temporary arrays were then used in another loop to compute

RWIJ(I) = XWIJ(I)**2 + YWIJ(I)**2 + ZWIJ(I)**2,

and subsequently, in yet another loop, to compute

RWIJ(I) = 1.0/SQRT(RWIJ(I)).

These steps were combined into a scalar (i.e., non-array) operation and the multiple loops were coalesced into a single loop, thus eliminating four temporary storage arrays.

The single-processor port and optimization of MINMD involved modifying fewer than 50 out of 20,000 lines of source code, and was accomplished in less than a week. The single-processor performance on the minimization of a cg6+cions+water data set (7682 atoms, 274 dna) is compared with several other systems in Table 2.

4.2: Parallelization of the MINMD module

Single-processor performance profiling indicated that over 90% of the run time was spent on the nonbonded
calculation, and 5-10% each in the dihedral and pairlist generator routines. Therefore, these routines were the focus of the parallelization effort. It was recognized that the contributions to the resultant forces could be computed independently (i.e., with no dependency on one another); therefore the overall time for the force calculation would depend on the time taken by the most time-consuming of the independent computations.

The following code fragment indicates how the force computation was parallelized by nesting parallel constructs:

      subroutine force(
      ...
C*KSR* Parallel Sections
      return
      end

...might be assigned to the nonbonded computation, one to the dihedral forces, and one to the angular + bonded force components.

Fewer than 400 lines of the 20,000 lines of source code had to be modified to enable a fully functional, high-performance implementation of MINMD on the KSR1. The parallelization effort was accomplished in less than a month. A total of 12 KSR1-specific parallel directives were inserted in the source code.

Table 2 shows the KSR1 performance, along with the performance of some other systems, on energy minimization of the cg6+cions+water data set (7682 atoms, 274 dna). The results show that the performance on the nonbonded portion of the calculation scales well.
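A sketch of the decomposition described in the force fragment above follows; it is not the authors' code. One parallel section is devoted to each independent force contribution, and the results are summed once all sections complete. Only the C*KSR* Parallel Sections line is taken from the paper; the helper routines nbforce, dihforce and angbforce are hypothetical placeholders, and the section-marker comments stand in for the actual KSR Fortran section delimiters, which are not shown in the fragment.

c     Schematic sketch only (not the authors' code). Routine names
c     are hypothetical placeholders, and the section markers are
c     illustrative rather than exact KSR Fortran syntax.
      subroutine force(natom, x, f, fnb, fdih, fang)
      integer natom, i, j
      real*8 x(3,natom), f(3,natom)
      real*8 fnb(3,natom), fdih(3,natom), fang(3,natom)
C*KSR* Parallel Sections
c     -- section 1: nonbonded forces (dominant cost; this section
c        could itself contain a tiled loop over atom pairs) --
      call nbforce(natom, x, fnb)
c     -- section 2: dihedral forces --
      call dihforce(natom, x, fdih)
c     -- section 3: angle + bond forces --
      call angbforce(natom, x, fang)
c     -- end of parallel sections --
c     Sum the independent contributions once all sections complete.
      do 20 i = 1, natom
         do 10 j = 1, 3
            f(j,i) = fnb(j,i) + fdih(j,i) + fang(j,i)
   10    continue
   20 continue
      return
      end

Because the contributions are independent, the elapsed time of the parallel-sections block is set by its slowest section, which is why most of the available processors would be devoted to the nonbonded computation, for example by nesting a tiled loop inside that section.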
To measure the KSR1 performance on the molecular dynamics component of MINMD, a 4400-atom data set was run for 1000 iterations. The results of running this simulation on 16 and 30 KSR1 processors are presented in Table 3.

4.3: Results for the GIBBS module

The GIBBS module was ported and parallelized in a similar fashion to MINMD. The initial port and single-processor optimizations were completed in a day and involved changing fewer than 20 lines of the 30,000 lines of source code. The parallelization scheme is essentially the same as is used in MINMD, but involves additional parallelization of routines used to constrain various forces. The overall effort was completed in less than a week and involved modifying and adding a total of 400 lines of code.

Table 4 contains performance data for a 1000-iteration run using the same 4400-atom data set as was used to produce Table 3. The overall performance on 25 processors is approximately ten times that of a Convex C210.

Table 4: GIBBS performance for the 4400-atom data set, 1000 iterations

Subroutine     15 processors   25 processors
Pairlist       300 sec         220 sec
Nonbonded      1490 sec        908 sec
Total time     1969 sec        1360 sec
Iter/sec       0.51            0.74
Speedup        1.0             1.44
Efficiency*    NA              87%

5: Concluding Remarks

We have shown that the KSR1 truly delivers high performance, both in absolute terms and in an unprecedented delivered-to-peak ratio for a highly parallel machine. Data will continue to accumulate, and performance results will improve as programmers and compilers gain experience with the KSR1 architecture.

The approach to programming the KSR1 is conceptually different from that of MPP's and permits an incremental approach to parallelization and performance tuning. The discussion of Amber illustrates how this incremental approach translates into ease of use. The effort involved in parallelizing the three major modules of Amber was less than two person-months and involved changing fewer than 1000 out of 110,000 lines of source code. Of course, the ultimate proof of the ease of use of the KSR1 is in a growing catalog of third-party applications.

Acknowledgements

The Linpack 1000 and NxN results were obtained by Dr. Nick Camp of Kendall Square Research.

References

1. S. Frank, H. Burkhardt and J. Rothnie. "The KSR1: Bridging the Gap between Shared Memory and MPPs." Compcon '93 Proceedings.
2. E. Burke. "An Overview of System Software for the KSR1." Compcon '93 Proceedings.
3. G. Zorpette. "The Power of Parallelism." IEEE Spectrum 29, No. 9, September 1992, pp. 28-33.
4. G. Cybenko and D.J. Kuck. "Revolution or Evolution?" IEEE Spectrum 29, No. 9, September 1992, pp. 39-41.
5. O.M. Lubeck, M.L. Simmons and H.J. Wasserman. "The Performance Realities of Massively Parallel Processors: A Case Study." Proceedings of Supercomputing '92, IEEE Computer Society Press, Minneapolis, November 16-20, 1992.
6. S. Hiranandani, K. Kennedy and C.-W. Tseng. "Compiler Optimizations for Fortran D on MIMD Distributed-Memory Machines." Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.
7. J.J. Dongarra. "Performance of Various Computers Using Standard Linear Equations Software." [email protected], September 28, 1992.
8. D. Bailey, J. Barton, T. Lasinski and H. Simon, eds. "The NAS Parallel Benchmarks." Technical Report RNR-91-02, NASA Ames Research Center, Moffett Field, CA 94035, January 1991.
9. D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S. Schreiber, H.D. Simon, V. Venkatakrishnan and S.K. Weeratunga. "The NAS Parallel Benchmarks." International Journal of Supercomputer Applications 5(3), Fall 1991, pp. 63-73.
10. D.H. Bailey, E. Barszcz, L. Dagum and H.D. Simon. "NAS Parallel Benchmark Results." Technical Report RNR-92-002, December 7, 1992.
11. S.R. Breit, W. Celmaster, W. Coney, R. Foster, B. Gaiman, G. Montry and C. Selvidge. "The Role of Architectural Balance in the Implementation of the NAS Parallel Benchmarks on the BBN TC2000 Computer." Submitted to the ASME Symposium on CFD Algorithms and Applications for Parallel Processors, Spring 1993.
12. D.A. Pearlman, D.A. Case, J.C. Caldwell, G.L. Seibel, U.C. Singh, P.A. Weiner and P.A. Kollman. Amber 4.0. UCSF, 1991.
13. Amber 4.0 User's Guide. 1991, p. 7.