
Technical Applications on the KSR1:

High Performance and Ease of Use

Stephen R. Breit, Chani Pangali and David M. Zirl


[email protected]

Kendall Square Research


170 Tracer Lane, Waltham, MA 02154

Abstract

High performance is the chief motivation for an industry-wide move towards highly parallel computer architectures. Yet, the difficulty of programming this new generation of computers has inhibited their widespread use for production computing. This paper shows that the KSR1 delivers both high performance and ease of use. KSR1 performance results on standard benchmark suites, including the NAS parallel benchmarks, are presented and compared with other computers to demonstrate its high delivered performance. It is then shown, through a simple example, how the KSR1's ALLCACHE memory architecture enables an incremental approach to parallel programming which makes it easy to use. This incremental approach is demonstrated using a real technical application in the field of computational chemistry.

1: Introduction

Companion papers to this one have discussed the KSR1's ALLCACHE memory architecture and systems software environment [1] [2]. In this paper, we show how these features of the KSR1 enable both high performance and ease of use.

Most computer manufacturers now recognize that large-scale parallelism is the key to high-performance, low-cost computing [3]. While numerous vendors have introduced parallel computers which promise high performance based on their peak performance rating, few have delivered on the promise [4] [5]. This has led to general recognition that performance comparisons should be based on delivered performance (on standard benchmark suites and real applications) rather than peak performance.

It is argued in Frank et al. [1] that high delivered performance is the chief benefit of the KSR1's shared-memory programming model. This claim is supported herein by performance results on the Linpack and NAS parallel benchmarks; the KSR1 delivers higher performance, both in absolute terms and as a percentage of peak, than competing MPPs (massively parallel processors).

Just as there has been a general move toward large-scale parallel computers, there is also a move towards providing some form of support for the shared-memory programming model, whether in hardware [3] or software [6]. The poor usability of earlier generations of large-scale parallel systems has hindered their acceptance for production computing. The KSR1 is the first scalable parallel architecture which fully supports the shared-memory programming model in hardware.

We begin with a comprehensive discussion of KSR1 performance results on standard benchmark suites. Next, we discuss how the ALLCACHE memory architecture impacts the programming environment, and show through a simple example how this permits an incremental approach to parallelization. The typical starting point for a parallel application program on the KSR1 is a version that runs on a workstation or a shared-memory multiprocessor such as the SGI 4D/480 or the Cray Y-MP/8. The ease of use which results from this incremental approach to parallelization is demonstrated through an extended discussion of the parallelization of AMBER 4.0, a program for simulating biological molecules.

2: Performance results

Presentation of performance measurements by a computer manufacturer has been characterized as "putting one's best flop forward." Nevertheless, there is considerable value in examining performance results on standard benchmarks and application codes. The KSR1 performance data are presented here for:

  a single processor,
  an entry-level 32-processor configuration, and
  a mid-range 128-processor configuration.



The single-processor performance, as seen from the results on the Livermore Fortran Kernels and the Linpack 100x100 benchmark, is comparable to that of other superscalar RISC microprocessors of similar clock speed (e.g., the IBM RS/6000 Model 320). The single-processor performance is also close to the performance of a traditional "minisupercomputer" such as the Convex C210.

For many users, the availability of a full-function UNIX operating system, combined with a robust implementation of NQS, makes the KSR1-32 an attractive system for processing hundreds of user jobs (e.g., an effective throughput system). As such, if we consider the average user workload to be represented by the Livermore Fortran Kernels, the throughput capacity of a KSR1-32 is roughly thirty times that of a single-processor Convex C210.

As a multiprocessor, we find a KSR1-32 to have one to four times the capacity of one processor of a Cray Y-MP. See, for instance, the results on the NAS parallel benchmarks, Linpack 1000x1000, or AMBER 4.0.

Tests on a KSR1-128 demonstrate that the KSR1 architecture does scale to a large number of processors (see, e.g., the Linpack NxN and NAS parallel benchmark results). In fact, both the single- and multiprocessor performance is superior to that of both the Intel iPSC860 and the Intel Delta systems. This is noteworthy since the Intel processors have a clock rate of 40 MHz, in contrast to the 20 MHz clock rate of the KSR1.

2.1: Livermore Fortran kernels

Lawrence Livermore National Laboratory (LLNL) has long been at the forefront of scientific computing. Given their need to continually evaluate multiple computers, Frank McMahon and others at LLNL developed a portable benchmark. Originally known as the 14 Livermore loops, the benchmark was expanded to 24 kernels in 1987 and has since been known under the title Livermore Fortran Kernels. The kernels were extracted from actual production codes at the lab. Typically, results are reported for the long-vector length of L=471.

The harmonic mean for the 24 kernels when run on a single processor of the KSR1 is 6.6 MFLOPS. This compares to 6.54 MFLOPS for the Convex C210, 36 MFLOPS for the Cray Y-MP/1, and 5.6 MFLOPS for the IBM RS/6000 Model 320.

The LFK test is a measure of the robustness of a system at handling a variety of codes, ranging from fully vectorizable to fully scalar. The KSR1 processor design enables both scalar and vectorized codes to execute with multiple operations being scheduled during every clock cycle. It is worth noting that the KSR1 processor delivers a higher percentage of peak performance (16%) than the Cray Y-MP/1 (11%) on these kernels.
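These percentages imply per-processor peak rates that the paper does not state explicitly; as a rough back-calculation from the figures above,

  6.6 MFLOPS / 0.16 ≈ 41 MFLOPS for the KSR1 processor, and
  36 MFLOPS / 0.11 ≈ 327 MFLOPS for the Cray Y-MP/1.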
2.2: Linpack benchmark

There are now three standard benchmarks that comprise the "Linpack" benchmark suite:

  Linpack 100 - the solution of a 100 x 100 set of simultaneous equations using the standard, unmodified Fortran source code.

  Linpack 1000 - the solution of a 1000 x 1000 set of simultaneous equations using any algorithm or coding language, i.e., "no holds barred."

  Linpack NxN - for parallel systems, the vendor may use any algorithm or coding language to solve an NxN set of simultaneous equations. The vendor is free to select any value of N.

The Linpack 100 benchmark is a good measure of compiled performance on a vectorizable program. The KSR1 single-processor performance of 15 MFLOPS is comparable to the Convex C210 (at 17 MFLOPS), and significantly better than that of the IBM RS/6000 Model 320 (at 9 MFLOPS).

On the Linpack 1000 benchmark, the KSR1-32 performance of 513 MFLOPS is comparable to that of a Cray Y-MP. On this moderately sized problem, a KSR1-32 outstrips conventional MPP designs such as the iPSC860/128 (which achieves 219 MFLOPS with 128 processors) or the Intel Delta (which achieves 446 MFLOPS with 512 processors) [7]. The superior performance of the KSR1 is attributable to the low-latency, high-bandwidth communication mechanism so necessary to supporting shared-memory programming.

On the Linpack NxN benchmark, a KSR1-128 delivers 3.4 GFLOPS with N=9216.
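As a rough cross-check, assuming the standard Linpack operation count of (2/3)n**3 + 2n**2 floating-point operations (a convention of the benchmark, not stated in this paper), the 513 MFLOPS figure corresponds to a Linpack 1000 solution time of about

  ((2/3)*1000**3 + 2*1000**2) / (513 x 10**6 flop/s) ≈ 1.3 seconds.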

2.3: NAS parallel benchmarks

The NAS parallel benchmarks are a suite of eight benchmarks that were developed at NASA Ames Research Center for the purpose of comparing the performance of parallel supercomputers. The benchmarks are completely specified with "pencil and paper" so that programmers are free, within some general rules, to tune their implementations to a particular parallel architecture [8] [9]. In practice, most implementors start with the simple, single-processor Fortran implementations that are available from the NAS (Numerical Aerodynamic Simulation) Systems Division of NASA Ames Research Center [10]. These sample implementations together comprise more than 15,000 lines of source code, so it is a substantial undertaking to parallelize, not to mention optimize, the entire suite. For this reason, the NAS parallel benchmark suite is considered to be one of the best ways of showing both the performance potential of a parallel architecture and the level of performance that can be expected on real applications (as opposed to highly tuned benchmarks).

The eight benchmarks are representative of a range of application programs in computational aeronautics, and are denoted herein and elsewhere by the following two-letter abbreviations:

  BT - block tridiagonal
  SP - scalar pentadiagonal
  LU - incomplete LU factorization
  EP - embarrassingly parallel
  FT - fast Fourier transform
  MG - multi-grid
  CG - conjugate gradient
  IS - integer sort

The first three are known as pseudo-applications and the remaining five benchmarks are known as kernels. As befits these descriptions, the pseudo-applications are considerably more complicated than the kernels. We therefore focus this discussion on two of the pseudo-applications, BT and SP.

The BT and SP benchmarks are very similar in structure: they model two different ways of solving the Navier-Stokes equations by an ADI (alternating-direction implicit) method. Both schemes are embedded in the well-known ARC3D code for modeling three-dimensional fluid dynamics. From a parallel programming standpoint, the most significant difference between BT and SP is in the ratio of local computation to inter-processor communication (remote memory references); SP requires roughly three times as much inter-processor communication per unit of computation as does BT [11].

The KSR1 performance results presented in this paper were obtained by one of the authors (S. Breit), Gautam Shah of the Georgia Institute of Technology, and other members of the Kendall Square technical staff. These results, together with data for other computers, are presented in [10].

[Figure 1: Performance results on the NAS BT and SP benchmarks. Two panels (BT benchmark and SP benchmark) plot performance as a ratio to the Cray Y-MP/1 against the number of processors (0 to 128) for the KSR1, the iPSC860, and the CM-5.]

Figure 1 shows the scalability of the KSR1 performance on the BT and SP benchmarks. In keeping with the practice in the NAS reports (e.g., [10]), the performance is presented as a ratio to the Cray Y-MP/1, which is regarded by many as a supercomputing guide post. Results from the Intel iPSC860 and Thinking Machines CM-5 are also presented in Figure 1, but there is insufficient data to ascertain their scalability on these benchmarks. The performance level of the KSR1-128 on the BT benchmark (5.88 Y-MP/1 equivalents) is the highest attained so far by any highly parallel computer on any of the NAS benchmarks except EP. Furthermore, on the BT benchmark, a KSR1-64 delivers 30% of peak performance, as compared with 10% on the CM-5 and 7% on the iPSC860.
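The 30% figure can be turned into an approximate absolute rate using the per-processor peak implied by the Livermore Fortran Kernel results in Section 2.1 (about 41 MFLOPS); this is our back-of-the-envelope estimate, not a measurement reported by NAS:

  0.30 x 64 processors x 41 MFLOPS ≈ 790 MFLOPS on BT for a KSR1-64,

which is consistent with roughly doubling the 412 MFLOPS that Table 1 reports for a KSR1-32 on the same benchmark.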
The prefetch capability of the KSR1 architecture makes it possible to overlap computations and interprocessor communications. This capability was exploited in the implementations of both the BT and SP benchmarks. The KSR1 performance on SP is lower than on BT primarily because the single-processor performance is lower (due to less reuse of subcached data), not because of the lower computation-to-communication ratio of the benchmark.

Table 1: KSR1-32 performance results on the NAS parallel benchmarks

  Benchmark   User time (s)   MFLOPS   Performance (ratio to Cray Y-MP/1)
  BT          439.0           412      1.80
  SP          377.7           270      1.25
  LU          1040.0          62       0.32
  EP          69.8            381      1.81
  FT          13.6            415      2.12
  MG          20.6            190      1.06
  CG          21.7            69       0.54
  IS          40.2            N/A      0.28

Table 1 gives the performance of a KSR1-32 on all eight benchmarks. On average, the performance of a KSR1-32 is roughly 1.15 times that of a Cray Y-MP/1, and the delivered performance is roughly 20% of its peak performance. All of the KSR1 implementations, with the exception of FT, started from the sample implementations that are distributed by NAS. The programs were incrementally modified to exploit the scalability of the KSR1. We believe that the performance of several of the benchmarks, particularly LU and CG, can be substantially improved with further tuning.
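The 1.15 figure quoted above is simply the arithmetic mean of the Y-MP/1 ratios in the last column of Table 1:

  (1.80 + 1.25 + 0.32 + 1.81 + 2.12 + 1.06 + 0.54 + 0.28) / 8 = 9.18 / 8 ≈ 1.15.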
3: Ease of Use

Having established that the KSR1 delivers high performance, both in relative and absolute terms, we now turn to ease-of-use issues. Many attributes of the KSR1 system contribute to ease of use; they have been categorized into architecture- and software-related aspects [1]. In the former category we especially mention:

  hardware-based access to a global address space,
  automatic memory management and data transfer,
  hardware-based packet routing and cache coherency management, and
  automatic enhancement of data locality.

These features of the ALLCACHE memory system, in turn, enable software features which make the KSR1 easier to use [2]:

  an industry-standard software environment (OSF/1 Unix and all the functionality it implies),
  a familiar, shared-memory programming paradigm, and
  an incremental approach to parallelization.

Here we will focus on how the ALLCACHE architecture naturally leads to a key conceptual difference between the approach to programming the KSR1 and MPPs, regardless of whether they support shared-memory, message-passing, or data-parallel programming models. On MPPs, the software must explicitly define three things:

  the distribution of the data among the memories of a specified processor set,
  how the computations are to be partitioned into tasks, and
  the execution schedule of the tasks.

The key difference between the KSR1 and MPPs is that the ALLCACHE memory system obviates the specification of data distribution, leaving only the last two items to be specified by the software. Thus, it is most natural to provide language constructs which specify how computations are to be partitioned and scheduled rather than how the data is to be distributed.

3.1: Parallelization directives

There are three major parallel constructs in KSR Fortran (see Burke [2] for more details on KSR Fortran and equivalent support for applications written in C):

  TILE - The iteration space defined by a Fortran do loop nest is partitioned into tiles, or groups of loop iterations, which are executed in parallel.

  PARALLEL REGION - Multiple instantiations of the same code segment are executed in parallel.

  PARALLEL SECTIONS - Multiple unique code segments are executed in parallel.

Of the three, the tile directive is the most sophisticated and the most commonly used. The tile directive specifies the loop indices over which tiling is to occur. These indices define an iteration space. For example, in Figure 2, the indices i, j, k define the iteration space. A point in this iteration space corresponds to unique values of the loop indices i, j, k. The tile directive causes the iteration space to be partitioned into rectilinear sub-spaces called tiles, each of which contains enough loop iterations to justify the parallelization overhead. The tile size and the order in which tiles are executed are collectively referred to as the tiling strategy. The strategy is automatically determined by the runtime system, or it can be specified in the application program.
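To make the idea of tiles concrete, the following is a minimal sketch, in plain Fortran with no KSR directives, of the loop blocking that the tile directive and runtime system perform automatically. The routine name, the array update, and the 32 x 32 tile sizes are illustrative assumptions only; on the KSR1 each tile would be executed in parallel by some processor, and the tiling strategy would normally be chosen by the runtime system.

      subroutine tilupd(a, n, m)
c     Hand-coded blocking of a 2-D (i, j) iteration space into
c     rectangular tiles.  Illustrative only; not KSR Fortran.
      integer n, m, ti, tj, i0, j0, i, j
      real a(n,m)
      parameter (ti = 32, tj = 32)
      do j0 = 1, m, tj
         do i0 = 1, n, ti
c           each (i0, j0) pair identifies one tile of at most
c           ti x tj loop iterations
            do j = j0, min(j0 + tj - 1, m)
               do i = i0, min(i0 + ti - 1, n)
                  a(i,j) = 2.0*a(i,j)
               end do
            end do
         end do
      end do
      return
      end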

3.2: An example: a tiled loop nest

Here we present a simple example which illustrates how the ALLCACHE architecture manifests itself in software. The code fragment below performs the matrix operation C = alpha*AB, where alpha is a scalar and A, B, C are matrices:

      subroutine matmult (alpha, a, b, c, n, m, p)
      integer n, m, p
      real alpha, a(n,p), b(p,m), c(n,m)
      real ctemp
c*ksr* tile ( i, j, private=(ctemp) )
      do i = 1, n
        do j = 1, m
          ctemp = 0.
          do k = 1, p
            ctemp = ctemp + a(i,k)*b(k,j)
          end do
          c(i,j) = alpha*ctemp
        end do
      end do
c*ksr* end tile
      end

We especially note that there are no declarations or directives in the program that indicate how the arrays a, b, c are to be distributed in physical memory, or even that they are to be shared among the processors. These arrays are implicitly shared because they are within the scope of the subroutine matmult. In fact, data which are within the scope of a subprogram are shared by default unless they are explicitly declared private, as is necessary for the scalar temporary ctemp.

Figure 2 shows how tiling the iteration space implies a distribution of the a, b, and c arrays. However, this distribution is automatically handled by the ALLCACHE hardware, and not by the programmer, compiler, or runtime software. Similarly, the ALLCACHE hardware automatically copies the scalar alpha upon the first reference by a processor; all subsequent references will be local.

[Figure 2: Iteration space and data space. The upper diagram shows the (i, j, k) iteration space partitioned into tiles; the lower diagram shows the mapping of tiles to the data space of the a, b, and c arrays.]

3.3: Incremental parallelization

The real benefit of freeing programmers from having to explicitly distribute data in the memory system is that it enables them to take an incremental approach to parallelizing an application program. The definition of incremental parallelization is intimately related to scalability. To explain this, we advance the notion that there are two aspects to scalability: scalability of the memory size, and scalability of performance. A KSR1 programmer can immediately exploit the scalability of the memory system, and then incrementally exploit the scalability of performance.

To be more specific, the ALLCACHE memory system enables users to run an application on a single processor even when the size of the data set exceeds the capacity of the local cache. In this situation, ALLCACHE automatically migrates data which has not been accessed recently to the local caches of other cells (if they have excess capacity) rather than to disks as in conventional virtual memory systems. Transfers between local caches occur at substantially higher rates than to external devices. The benefit is that an application can utilize the entire memory of the KSR1 system without making a single modification.

The next step, of course, is to scale the performance. To achieve this, a programmer can incrementally add parallel constructs which distribute the computations among available processors until the desired level of performance is attained. Ideally, the performance of an application should scale linearly with the number of processors that participate in the computations.

By contrast, on MPPs which do not automatically manage data distribution and transfer, the user must reduce the size of a data set to fit within the memory of a single processor. Only after extensive program modifications to distribute data and partition the computational tasks can the application exploit the scalability of the distributed memory system.

The benefits of an incremental approach to parallelism also apply to new software development. A programmer does not have to first master a new programming paradigm, such as message-passing or data-parallel programming styles. A program can be developed on a single processor and then parallelized. If the initial approach to parallelization proves to be inefficient, the programmer can easily try different approaches without rewriting the entire application in order to change the distribution of global data structures, as is required on other MPPs.

In our experience with many benchmarks and applications, the incremental approach to parallelization requires substantially less programming effort than partially or even totally rewriting a program to exploit the scalability of a distributed-memory architecture. The time required to port and parallelize codes can be measured in days or weeks rather than months or years. Our experience with the AMBER application code on the KSR1 is a typical example of incremental parallelization in practice.

4: An example: parallelization of AMBER

This example illustrates how the KSR1's ALLCACHE memory system and software environment enabled the incremental parallelization and optimization of AMBER, a sizable computational chemistry application [12]. AMBER is written entirely in Fortran 77 and contains over 110,000 lines of source code. This large program was implemented on the KSR1 by first porting it to a single processor, optimizing its performance on a single processor, and then parallelizing key portions of the program.

AMBER is in fact not a single code, but a suite of programs developed at the University of California, San Francisco, by Peter Kollman and associates. The major computational modules are MINMD, GIBBS and SANDER [13]. MINMD performs either energy minimization or molecular dynamics simulations. GIBBS computes free-energy differences between two similar states of a molecule using either a perturbation or a thermodynamic integration approach. SANDER performs molecular refinement using NMR data as input. All three modules have been ported and incrementally parallelized on the KSR1; this example focuses on the MINMD and GIBBS modules.

4.1: Porting the MINMD module to a single processor

The single-processor port of MINMD was straightforward and involved only minor changes to the Fortran 77 source code. Improvements in the single-processor performance were realized largely by reducing the storage of unnecessary temporary arrays and by eliminating redundant loops. The superscalar nature of the KSR1 processor eliminates the need for the temporary storage arrays that enhance performance on vector machines.

For example, in the original code, the components of the interatomic distances were first stored in arrays XWIJ(I), YWIJ(I) and ZWIJ(I), where I is the loop index (atom number). The resultant distance was stored in an additional array, RWIJ(I). These temporary arrays then were used in another loop to compute

  RWIJ(I) = XWIJ(I)**2 + YWIJ(I)**2 + ZWIJ(I)**2,

and subsequently, in yet another loop, to compute

  RWIJ(I) = 1.0/SQRT(RWIJ(I)).

These steps were combined into a scalar operation (i.e., non-array) and the multiple loops were coalesced into a single loop, thus eliminating four temporary storage arrays.
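The following is a minimal sketch of the coalesced, scalar form described above. Only the XWIJ/YWIJ/ZWIJ/RWIJ pattern being replaced comes from the text; the routine name, argument list, loop bound NPAIR, and the placeholder use made of the reciprocal distance are hypothetical, not AMBER's actual code.

      subroutine pairlp(x, y, z, xj, yj, zj, npair, esum)
c     Coalesced single loop: the three original loops are fused and
c     the XWIJ/YWIJ/ZWIJ/RWIJ temporary arrays become scalars.
      integer npair, i
      real x(npair), y(npair), z(npair)
      real xj(npair), yj(npair), zj(npair)
      real xw, yw, zw, rinv, esum
      esum = 0.0
      do i = 1, npair
         xw = x(i) - xj(i)
         yw = y(i) - yj(i)
         zw = z(i) - zj(i)
c        the reciprocal distance is computed and consumed
c        immediately instead of being staged through RWIJ(I)
         rinv = 1.0/sqrt(xw**2 + yw**2 + zw**2)
         esum = esum + rinv
      end do
      return
      end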
The single-processor port and optimization of MINMD involved modifying fewer than 50 out of 20,000 lines of source code, and was accomplished in less than a week. The single-processor performance on the minimization of a cg6+cions+water data set (7682 atoms, 274 dna) is compared with several other systems in Table 2.

4.2: Parallelization of the MINMD module

Single-processor performance profiling indicated that over 90% of the run time was spent on the nonbonded calculation, and 5-10% each in the dihedral and pairlist generator routines. Therefore, these routines were the focus of the parallelization effort. It was recognized that the contributions to the resultant forces could be computed independently (i.e., with no dependency on one another); therefore the overall time for the force calculation would depend on the time taken by the most time-consuming of the independent computations.

The following code fragment indicates how the force computation was parallelized by nesting parallel constructs:

      subroutine force(...)
      ...
C*KSR* Parallel Sections
C*KSR* Section  ! Section 1: nonbonded forces
C*KSR* Parallel Region(numthreads=inb_procs)
      call nnbond()
C*KSR* End Parallel Region
C*KSR* Section  ! Section 2: dihedral forces
C*KSR* Parallel Region(numthreads=iphi_procs)
      call dihedral()
C*KSR* End Parallel Region
C*KSR* Section  ! Section 3: bonded + angular forces
      call bond()
      call angl()
C*KSR* End Parallel Sections
      call combine_forces_and_energy()
      return
      end

At the top level, the PARALLEL SECTIONS construct was used to parallelize the three primary components of the force computation, i.e., one section each for the bonded plus angular forces, the dihedral forces, and the nonbonded forces. At the second level, the PARALLEL REGION construct was employed to further distribute the nonbonded and dihedral computations. In addition, the creation of the nonbonded pairlist was parallelized (not shown in the fragment above).

For the nonbonded force computation, the list of nonbonded atomic pairs is distributed evenly among the processors, while for the dihedral calculation, the number of dihedrals is split evenly among the processors. This ensures good load balance between the processors. The overall distribution of processors is chosen to ensure that each of the three sections indicated by the PARALLEL SECTIONS directive will take about the same execution time. For example, with 10 processors available, eight might be assigned to the nonbonded computation, one to the dihedral forces, and one to the angular plus bonded force components.
overall distribution of processors is chosen to ensure that
each of the three sections indicated by the PARALLEL
'. 16 cells: 13 for nonbonded calculation, 2 for
dihedral, 1 for bond+angle
SECTIONS directive will take about the same execution 2. 30 cells: 25 for nonbonded, 4 for dihedral,
time. For example, with 10 processors available, eight 1 for bond+angle

To measure the KSR1 performance on the molecular dynamics component of MINMD, a 4400-atom data set was run for 1000 iterations. The results of running this simulation on 16 and 30 KSR1 processors are presented in Table 3.

Table 3: MINMD molecular dynamics of the 4400-atom data set for 1000 iterations

  Subroutine   16 processors (1)   30 processors (2)
  Pairlist     195 sec             130 sec
  Nonbonded    875 sec             425 sec
  Total time   1398 sec            852 sec
  Iter/sec     0.715               1.173
  Speedup      1.0                 1.64
  Efficiency   NA                  87.5%

  (1) 16 cells: 13 for nonbonded calculation, 2 for dihedral, 1 for bond+angle.
  (2) 30 cells: 25 for nonbonded, 4 for dihedral, 1 for bond+angle.
Computer Society Press, Minneapolis, November 16-20,
Table 4: GIBBS performancefor 4400 atom 1992.
data set for 1000 iterations
6. S . Hiranandani, K. Kennedy, and C.-W. Tseng. Compiler
I Subroutine I 15processors 1 25processors I Optimizations for Fortran D on MIMD Distributed-
Memory Machines. Proceedings of Supercomputing ‘91,
I Pairlist I 300sec I 220sec I Albuquerque, NM, November 1991.
7. J.J. Dongarra. Performance of Various Computers Using
Nonbonded 1490 sec 908 sec Standard Linear Equations Software. [email protected],
Sept. 28 1992.
Total time 1969 sec 1360 sec 8. D. Bailey, J. Barton, T. Lasinski, and H. Simon, eds. The
I Iter/Sec I 0.51 I 0.74 I NAS Parallel Benchmarks. Technical Report RNR-91-
02, NASA Ames Research Center, Moffett Field, CA
Speedup 1.0 1.44 94035, January 1991.
9. D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning,
Efficiency* NA 87% R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frderickson,
T.A. Lasinski, R.S. Schreiber, H.D. Simon, V.
Venkatakrishnan, and S.K. Weeratunga. The NAS
5: Concluding Remarks Parallel Benchmarks. I d J . of Supercomp. Appl. 5(3),
Fall 1991, pp.63-73.
We have shown that the KSRl truly delivers high per-
formance in both absolute terms and an unprecedented 10. D.H. Bailey, E. Barszcz, L. Dagum, and H.D. Simon.
NAS Parallel Benchmark Results. Technical Report
delivered-to-peak ratio on a highly parallel machine. Data
RNR-92-002, December 7, 1992.
will continue to accumulate, and performance results will
improve as programmers and compilers gain experience 11. S.R. Breit, W. Celmaster, W. Coney, R. Foster, B.
with the KSRl architecture Gaiman, G. Montry and C. Selvidge. The Role of
Architectural Balance in the Implementation of the NAS
The approach to programming the KSRl is conceptu- Parallel Benchmarks on the BBN TC2000 Computer.
ally different from MPP’s and permits an incremental Submitted to ASME Symposium on CFD Algorithms and
approach to parallelization and performance tuning. The Applications for Parallel Processors, Spring 1993.
discussion of Amber illustrates how this incremental 12. D. A. Pearlman, D. A. Case, J. C. Caldwell, G. L. Seibel,
approach translates into ease of use. The effort involved in U. C. Singh, P. A. Weiner and P. A. KolIman. Amber 4.0.
parallelizing the three major modules of Amber was less UCSF, 1991.
than two person months and involved changing fewer than 13. Amber 4.0, User’s Guide. 1991, page 7.
