A Survey of Parallel Computer Architectures

Ralph Duncan, Control Data Corporation
The diversity of parallel computer architectures can bewilder the nonspecialist. This tutorial reviews alternative approaches to parallel processing within the framework of a high-level taxonomy.

This decade has witnessed the introduction of a wide variety of new computer architectures for parallel processing that complement and extend the major approaches to parallel computing developed in the 1960s and 1970s. The recent proliferation of parallel processing technologies has included new parallel hardware architectures (systolic and hypercube), interconnection technologies (multistage switching topologies), and programming paradigms (applicative programming). The sheer diversity of the field poses a substantial obstacle to the nonspecialist who wishes to comprehend what kinds of parallel architectures exist and how their relationship to one another defines an orderly schema.

This discussion attempts to place recent architectural innovations in the broader context of parallel architecture development by surveying the fundamentals of both newer and more established parallel computer architectures and by placing these architectural alternatives in a coherent framework. The survey's primary emphasis concerns architectural constructs rather than specific parallel machines.

Terminology and taxonomy

Problems. Diverse definitions have been proposed for parallel architectures. The difficulty in precisely defining the term is intertwined with the problem of specifying a parallel architecture taxonomy. A central problem for specifying a definition and consequent taxonomy for modern parallel architectures is to satisfy the following set of imperatives:

• Exclude architectures incorporating only low-level parallel mechanisms that have become commonplace features of modern computers.
• Maintain elements of Flynn's useful taxonomy [1] based on instruction and data streams.
• Include pipelined vector processors and other architectures that intuitively seem to merit inclusion as parallel architectures, but which are difficult to gracefully accommodate within Flynn's scheme.

We will examine each of these imperatives as we seek a definition that satisfies all of them and provides the basis for a reasonable taxonomy.

Low-level parallelism. There are two reasons to exclude machines that employ only low-level parallel mechanisms from the set of parallel architectures. First, failure to adopt a more rigorous standard might make the majority of modern computers "parallel architectures," negating the term's usefulness. Second, architectures having only the features listed below do not offer an explicit, coherent framework for developing high-level parallel solutions:

• Instruction pipelining - the decomposition of instruction execution into a linear series of autonomous stages, allowing each stage to simultaneously perform a portion of the execution process (such as decode, calculate effective address, fetch operand, execute, and store).
• Multiple CPU functional units - providing independent functional units for arithmetic and Boolean operations that execute concurrently.
• Separate CPU and I/O processors - freeing the CPU from I/O control responsibilities by using dedicated I/O processors; solutions range from relatively simple I/O controllers to complex peripheral processing units.

Although these features contribute significantly to performance engineering, their presence does not make a computer a parallel architecture.

Flynn's taxonomy. Flynn's taxonomy classifies architectures on the presence of single or multiple streams of instructions and data. This yields the four categories below:

• SISD (single instruction, single data stream) - defines serial computers.
• MISD (multiple instruction, single data stream) - would involve multiple processors applying different instructions to a single datum; this hypothetical possibility is generally deemed impractical.
• SIMD (single instruction, multiple data streams) - involves multiple processors simultaneously executing the same instruction on different data (this definition is discussed further prior to examining array processors below).
• MIMD (multiple instruction, multiple data streams) - involves multiple processors autonomously executing diverse instructions on diverse data.

Although these distinctions provide a useful shorthand for characterizing architectures, they are insufficient for classifying various modern computers. For example, pipelined vector processors merit inclusion as parallel architectures, since they exhibit substantial concurrent arithmetic execution and can manipulate hundreds of vector elements in parallel. However, they are difficult to accommodate within Flynn's taxonomy, because they lack processors executing the same instruction in SIMD lockstep and lack the asynchronous autonomy of the MIMD category.

Definition and taxonomy. A first step to providing a satisfactory taxonomy is to articulate a definition of parallel architecture. The definition should include appropriate computers that the Flynn schema cannot handle and exclude architectures incorporating only low-level parallelism. Therefore, a parallel architecture provides an explicit, high-level framework for the development of parallel programming solutions by providing multiple processors, whether simple or complex, that cooperate to solve problems through concurrent execution.
Figure 1 shows a taxonomy based on the imperatives discussed earlier and the proposed definition. This informal taxonomy uses high-level categories to delineate the principal approaches to parallel computer architectures and to show that these approaches define a coherent spectrum of architectural alternatives. Definitions for each category are provided below.

This taxonomy is not intended to supplant efforts to construct more fully articulated taxonomies. Such taxonomies provide comprehensive subcategories to reflect permutations of architectural characteristics and to cover lower level features. The "Further reading" section at the end references several thoughtful taxonomic studies that address these goals.

Figure 1. High-level taxonomy of parallel computer architectures (synchronous architectures: pipelined vector, processor array, associative memory, systolic; MIMD architectures: distributed memory, shared memory; MIMD-based paradigms: MIMD/SIMD, dataflow, reduction, wavefront).

Synchronous architectures

Synchronous parallel architectures coordinate concurrent operations in lockstep through global clocks, central control units, or vector unit controllers.

Pipelined vector processors. The first vector processor architectures were developed in the late 1960s and early 1970s [2,3] to directly support massive vector and matrix calculations. Vector processors [4] are characterized by multiple, pipelined functional units, which implement arithmetic and Boolean operations for both vectors and scalars and which can operate concurrently. Such architectures provide parallel vector processing by sequentially streaming vector elements through a functional unit pipeline and by streaming the output results of one unit into the pipeline of another as input (a process known as "chaining").

A representative architecture might have a vector addition unit consisting of six pipeline stages (see Figure 2). If each pipeline stage in the hypothetical architecture shown in the figure has a cycle time of 20 nanoseconds, then 120 ns elapse from the time operands a1 and b1 enter stage 1 until result c1 is available. When the pipeline is filled, however, a result is available every 20 ns. Thus, start-up overhead of pipelined vector units has significant performance implications. In the case of the register-to-register architecture depicted, special high-speed vector registers hold operands and results. Efficient performance for such architectures (for example, the Cray-1 and Fujitsu VP-200) is obtained when vector operand lengths are multiples of the vector register size. Memory-to-memory architectures (such as the Control Data Cyber 205 and Texas Instruments Advanced Scientific Computer) use special memory buffers instead of vector registers.

Figure 2. Pipelined vector addition: elements of vector registers A and B stream through a six-stage pipeline (ST1-ST6) into vector register C.
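The start-up arithmetic above generalizes to a simple formula: an s-stage pipeline with cycle time c delivers n results in (s + n - 1) * c time units. The short C sketch below works through the numbers using the six-stage, 20-ns unit of the hypothetical example; the vector lengths are arbitrary choices for illustration.

/* Back-of-the-envelope model of the pipelined vector unit described
 * above: an s-stage pipeline with cycle time c needs s*c ns to deliver
 * its first result and one result per cycle thereafter, so n results
 * take (s + n - 1) * c ns.  The 6-stage/20-ns figures come from the
 * hypothetical example in the text; the vector lengths are arbitrary. */
#include <stdio.h>

static double pipeline_time_ns(int stages, double cycle_ns, int n)
{
    return (stages + n - 1) * cycle_ns;
}

int main(void)
{
    const int    stages   = 6;     /* ST1..ST6 in Figure 2 */
    const double cycle_ns = 20.0;  /* per-stage cycle time */
    int lengths[] = { 1, 8, 64, 1024 };

    for (int i = 0; i < 4; i++) {
        int n = lengths[i];
        double t = pipeline_time_ns(stages, cycle_ns, n);
        printf("n = %4d: %8.0f ns total, %6.2f ns per result\n",
               n, t, t / n);
    }
    return 0;   /* n = 1 gives 120 ns; large n approaches 20 ns per result */
}

For n = 1 this reproduces the 120-ns latency quoted above, and for long vectors the per-result time approaches the 20-ns cycle time, which is why amortizing start-up cost over long vector operands matters so much.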
Recent vector processing supercomputers (such as the Cray X-MP/4 and ETA-10) unite four to 10 vector processors through a large shared memory. Since such architectures can support task-level parallelism, they could arguably be termed MIMD architectures, although vector processing capabilities are the fundamental aspect of their design.

Processor array architectures. Processor arrays that operate in SIMD fashion under a central controller are often used for large-scale scientific calculations, such as image processing and nuclear energy modeling.
Processor arrays developed in the late 1960s (such as the Illiac-IV) and more recent successors (such as the Burroughs Scientific Processor) utilize processors that accommodate word-sized operands. Operands are usually floating-point (or complex) values and typically range in size from 32 to 64 bits. Various interconnection network schemes have been used to provide processor-to-processor or processor-to-memory communications, with mesh and crossbar approaches being among the most popular.

Figure 3. SIMD execution.

One variant of processor array architectures involves using a large number of one-bit processors. In bit-plane architectures, the array of processors is arranged in a symmetrical grid (such as 64x64) and associated with multiple "planes" of memory bits that correspond to the dimensions of the processor grid (see Figure 4). Processor n (Pn), situated in the processor grid at location (x, y), operates on the memory bits at location (x, y) in all the associated memory planes. Usually, operations are provided to copy, mask, and perform arithmetic operations on entire memory planes, as well as on columns and rows within a plane.

Figure 4. Bit-plane array processing: a grid of 1-bit serial processors operates on memory bit planes (Plane 1, Plane 2, ..., Plane n).
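To make the bit-plane organization concrete, the sketch below (plain C) models a small grid in which plane k holds bit k of every pixel and each grid position conceptually has its own one-bit processor; the routine performs a bit-serial "add 1" to every pixel by rippling a carry through the planes. The 8x8 grid, the eight planes, and the sample pixel values are illustrative assumptions, not details taken from the article.

/* Illustrative bit-plane memory: plane[k][x][y] is bit k of the pixel at
 * grid position (x, y).  Arithmetic is done a plane at a time, the way
 * one-bit processor arrays operate on whole memory planes in lockstep. */
#include <stdio.h>

#define DIM    8          /* processor grid is DIM x DIM (64x64 in the text) */
#define PLANES 8          /* eight memory planes => 8-bit pixels             */

static unsigned char plane[PLANES][DIM][DIM];

static void increment_all_pixels(void)
{
    unsigned char carry[DIM][DIM];

    /* every position starts with carry = 1 (the value being added) */
    for (int x = 0; x < DIM; x++)
        for (int y = 0; y < DIM; y++)
            carry[x][y] = 1;

    /* one pass per plane: every grid position updates the same plane */
    for (int k = 0; k < PLANES; k++)
        for (int x = 0; x < DIM; x++)
            for (int y = 0; y < DIM; y++) {
                unsigned char sum = plane[k][x][y] ^ carry[x][y];
                carry[x][y]       = plane[k][x][y] & carry[x][y];
                plane[k][x][y]    = sum;
            }
}

static unsigned pixel(int x, int y)        /* gather a pixel's bits for printing */
{
    unsigned v = 0;
    for (int k = 0; k < PLANES; k++)
        v |= (unsigned)plane[k][x][y] << k;
    return v;
}

int main(void)
{
    plane[0][3][5] = 1; plane[1][3][5] = 1;   /* pixel (3,5) starts at 3 */
    increment_all_pixels();
    printf("pixel(3,5) = %u, pixel(0,0) = %u\n", pixel(3, 5), pixel(0, 0));
    return 0;                                  /* prints 4 and 1 */
}

A real bit-plane machine performs each plane update across the whole grid in a single lockstep step; the nested x/y loops here merely simulate that simultaneity.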
Loral's Massively Parallel Processor [6] and ICL's Distributed Array Processor exemplify this kind of architecture, which is often used for image processing applications by mapping pixels to the memory's planar structure. Thinking Machines' Connection Machine organizes as many as 65,536 one-bit processors as sets of four-processor meshes united in a hypercube topology.

Associative memory processor architectures. Computers built around an associative memory [7] constitute a distinctive type of SIMD architecture that uses special comparison logic to access stored data in parallel according to its contents. Research in constructing associative memories began in the late 1950s with the obvious goal of being able to search memory in parallel for data that matched some specified datum. "Modern" associative memory processors developed in the early 1970s (for example, Bell Laboratories' Parallel Element Processing Ensemble, or PEPE) and recent architectures (for example, Loral's Associative Processor, or Aspro) have naturally been geared to database-oriented applications, such as tracking and surveillance.

Figure 5 shows the characteristic functional units of an associative memory processor. A program controller (serial computer) reads and executes instructions, invoking a specialized array controller when associative memory instructions are encountered. Special registers enable the program controller and associative memory to share data.

Figure 5. Associative memory processing organization.

Most current associative memory processors use a bit-serial organization, which involves concurrent operations on a single bit-slice (bit-column) of all the words in the associative memory. Each associative memory word, which usually has a very large number of bits (for example, 32,768), is associated with special registers and comparison logic that functionally constitute a processor. Hence, an associative processor with 4,096 words effectively has 4,096 processing elements.

Figure 6 depicts a row-oriented comparison operation for a generic bit-serial architecture. A portion of the comparison register contains the value to be matched. All of the associative processing elements start at a specified memory column and compare the contents of four consecutive bits in their row against the comparison register contents, setting a bit in the A register to indicate whether or not their row contains a match.

Figure 6. Associative memory comparison operation.

In Figure 7 a logical OR operation is performed on a bit-column and the bit-vector in register A, with register B receiving the results. A zero in the mask register indicates that the associated word is not to be included in the current operation.

Figure 7. Associative memory logical OR operation.
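A minimal sketch of these two operations follows, assuming a toy memory of eight 16-bit words and an invented 4-bit search pattern (none of these sizes or values come from the article). The compare routine sweeps one bit-column at a time across all words, and the OR routine combines one column with the A bits into B, honoring the mask.

/* Toy bit-serial associative memory: mem[w][b] is bit b of word w, and
 * A, B, and mask are the per-word associative registers described above. */
#include <stdio.h>

#define WORDS 8
#define BITS  16

static unsigned char mem[WORDS][BITS];
static unsigned char A[WORDS], B[WORDS];
static unsigned char mask[WORDS];          /* 0 = word excluded from the operation */

/* (1) row-oriented comparison of a 'width'-bit field starting at 'col' */
static void compare(const unsigned char *pattern, int col, int width)
{
    for (int w = 0; w < WORDS; w++)
        A[w] = 1;                          /* assume a match until a bit disagrees */
    for (int b = 0; b < width; b++)        /* one bit-slice ...                    */
        for (int w = 0; w < WORDS; w++)    /* ... examined across ALL words        */
            if (mem[w][col + b] != pattern[b])
                A[w] = 0;
}

/* (2) logical OR of one bit-column with A, result to B, under the mask */
static void or_column_into_B(int col)
{
    for (int w = 0; w < WORDS; w++)
        if (mask[w])
            B[w] = (unsigned char)(A[w] | mem[w][col]);
}

int main(void)
{
    unsigned char pattern[4] = { 0, 0, 1, 1 };     /* value to be matched */
    for (int w = 0; w < WORDS; w++) mask[w] = 1;   /* include every word  */
    mem[2][5] = 0; mem[2][6] = 0; mem[2][7] = 1; mem[2][8] = 1;  /* word 2 matches */

    compare(pattern, 5, 4);
    or_column_into_B(0);
    printf("A[2] = %d, B[2] = %d\n", A[2], B[2]);  /* prints 1 and 1 */
    return 0;
}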
Systolic architectures. In the early 1980s H.T. Kung of Carnegie Mellon University proposed systolic architectures to solve the problems of special-purpose systems that must often balance intensive computations with demanding I/O bandwidths [8]. Systolic architectures (systolic arrays) are pipelined multiprocessors in which data is pulsed in rhythmic fashion from memory through a regular network of processors before being returned to memory. [...]
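As an illustration of rhythmic data pulsing, the following C sketch simulates a small systolic array of multiply-accumulate cells computing C = A x B: A's rows enter from the left and B's columns from the top, each row and column staggered by one beat, and every cell passes its operands to its right and lower neighbors on the next beat. The 2x2 size, the matrix values, and this particular cell design are illustrative assumptions rather than necessarily the arrangement shown in the article's Figure 9.

/* Cycle-by-cycle simulation of an N x N systolic multiply-accumulate grid. */
#include <stdio.h>
#include <string.h>

#define N 2

static double A[N][N] = { {1, 2}, {3, 4} };
static double B[N][N] = { {5, 6}, {7, 8} };
static double C[N][N];

int main(void)
{
    double a_reg[N][N] = {{0}}, b_reg[N][N] = {{0}};  /* operands latched last beat */

    for (int t = 0; t < 3 * N - 2; t++) {
        double a_new[N][N], b_new[N][N];

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                /* operand arriving from the left (or from the staggered A feed) */
                double a = (j == 0)
                    ? ((t - i >= 0 && t - i < N) ? A[i][t - i] : 0.0)
                    : a_reg[i][j - 1];
                /* operand arriving from above (or from the staggered B feed) */
                double b = (i == 0)
                    ? ((t - j >= 0 && t - j < N) ? B[t - j][j] : 0.0)
                    : b_reg[i - 1][j];

                C[i][j] += a * b;          /* multiply-accumulate this beat        */
                a_new[i][j] = a;           /* latch for the neighbors' next beat   */
                b_new[i][j] = b;
            }
        memcpy(a_reg, a_new, sizeof a_reg);
        memcpy(b_reg, b_new, sizeof b_reg);
    }

    for (int i = 0; i < N; i++)
        printf("%6.1f %6.1f\n", C[i][0], C[i][1]);    /* prints 19 22 / 43 50 */
    return 0;
}

Every cell does the same simple step on every beat, and operands reach each cell exactly when the partial sum that needs them does; that fixed, clock-driven timing is what the wavefront arrays discussed later replace with handshaking.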
MIMD architectures

[...] by "divide and conquer" algorithms [...] MIMD architectures offer an alternative to depending on further implementation refinements in pipelined vector computers to provide the significant performance increases needed to make some scientific applications tractable (such as three-dimensional fluid modeling). Finally, the cost-effectiveness of n-processor systems over n single-processor systems encourages MIMD experimentation.

Distributed memory architectures. Distributed memory architectures (Figure 10) connect processing nodes (consisting of an autonomous processor and its local memory) with a processor-to-processor interconnection network. Nodes share data by explicitly passing messages through the interconnection network, since there is no shared memory. [...] These architectures have principally been constructed in an effort to provide a multiprocessor architecture that will "scale" (accommodate a significant increase in processors) and will satisfy the performance requirements of large scientific applications characterized by local data references.

Figure 10. MIMD distributed memory architecture structure.
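A minimal sketch of this message-passing style of data sharing, assuming a four-node ring (the topology, node count, and payload are invented for illustration): each node owns only its own mailbox, and a message reaches a non-adjacent destination by being queued and forwarded by the intermediate nodes.

/* Each node has a private mailbox; there is no shared data structure
 * between nodes.  A message hops neighbor to neighbor around the ring
 * until it reaches its destination. */
#include <stdio.h>

#define NODES 4
#define QLEN  8

struct message { int src, dest; double payload; };

static struct message mailbox[NODES][QLEN];
static int count[NODES];

static void enqueue(int node, struct message m)
{
    if (count[node] < QLEN)
        mailbox[node][count[node]++] = m;
}

int main(void)
{
    struct message m = { 0, 2, 3.14 };        /* node 0 sends a value to node 2 */
    enqueue((m.src + 1) % NODES, m);          /* first hop onto the ring        */

    for (int step = 0; step < NODES; step++)  /* let the network drain          */
        for (int n = 0; n < NODES; n++)
            while (count[n] > 0) {
                struct message cur = mailbox[n][--count[n]];
                if (cur.dest == n)
                    printf("node %d received %.2f from node %d after forwarding\n",
                           n, cur.payload, cur.src);
                else
                    enqueue((n + 1) % NODES, cur);   /* forward to the next neighbor */
            }
    return 0;
}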
Various interconnection network topologies have been proposed to support architectural expandability and provide efficient performance for parallel programs with differing interprocessor communication patterns. Figure 11a-e depicts the topologies discussed below.
[...] Example solutions include adding additional interconnection network pathways to unite all nodes at the same tree level.

Hypercube topology architectures. A Boolean n-cube or "hypercube" topology uses N = 2^n processors arranged in an n-dimensional cube, where each node has n = log2 N bidirectional links to adjacent nodes. Individual nodes are uniquely identified by n-bit numeric values ranging from 0 to N-1 and assigned in a manner that ensures adjacent nodes' values differ by a single bit. The communication diameter of such a hypercube topology architecture is n = log2 N.

Hypercube architecture research has been strongly influenced by the desire to develop a "scalable" architecture that supports the performance requirements of 3D scientific applications. Extant hypercube architectures include the Cosmic Cube, Ametek Series 2010, Intel Personal Supercomputer, and Ncube/10.
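The single-bit-difference numbering gives each node a cheap way to find its neighbors and to route messages. The sketch below (plain C) lists a node's neighbors by flipping one bit at a time and then routes between two nodes by correcting the differing bits in a fixed order, one common dimension-order scheme; the dimension and the node labels are arbitrary, and this particular routing rule is an illustrative choice rather than something specified in the article.

/* Hypercube node labels: neighbors differ in exactly one bit, so routing
 * can fix one differing bit per hop and never needs more than n hops. */
#include <stdio.h>

int main(void)
{
    const unsigned n = 4;              /* dimension: N = 2^4 = 16 nodes */
    unsigned node = 0x5, dest = 0xA;   /* two example node labels       */

    printf("neighbors of node %u:", node);
    for (unsigned bit = 0; bit < n; bit++)
        printf(" %u", node ^ (1u << bit));   /* flip one bit per link */
    printf("\n");

    /* dimension-order routing: correct differing bits from low to high */
    printf("route %u -> %u:", node, dest);
    for (unsigned bit = 0, cur = node; bit < n; bit++)
        if ((cur ^ dest) & (1u << bit)) {
            cur ^= 1u << bit;
            printf(" %u", cur);
        }
    printf("   (hops = number of differing bits, at most %u)\n", n);
    return 0;
}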
Shared-memory architectures. Shared memory architectures accomplish interprocessor coordination by providing a global, shared memory that each processor can address. Commercial shared-memory architectures, such as Flexible Corporation's Flex/32 and Encore Computer's Multimax, were introduced during the 1980s. These architectures involve multiple general-purpose processors sharing memory, rather than a CPU and peripheral I/O processors. Shared memory computers do not have some of the problems encountered by message-passing architectures, such as message sending latency as data is queued and forwarded by intermediate nodes. However, other problems, such as data access synchronization and cache coherency, must be solved.

Coordinating processors with shared variables requires atomic synchronizing mechanisms to prevent one process from accessing a datum before another finishes updating it. These mechanisms provide an atomic operation that subjects a "key" to a comparison test before allowing either the key or associated data to be updated. The "test-and-set" mechanism, for example, is an atomic operation for testing the key's value and, if the test result is true, updating the key value.
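C11's atomic_flag_test_and_set provides exactly this kind of atomic key test, and a spinlock built on it is the textbook use. The sketch below protects a shared counter with such a lock; the two-thread workload is an invented example (compile with -pthread), not code from the article.

/* A test-and-set spinlock: atomic_flag_test_and_set atomically reads the
 * "key" and sets it, so only one thread at a time can pass from unlocked
 * to locked. */
#include <stdio.h>
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* the shared "key"        */
static long counter = 0;                      /* shared data it protects */

static void acquire(void) { while (atomic_flag_test_and_set(&lock)) ; }
static void release(void) { atomic_flag_clear(&lock); }

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        acquire();          /* spin until the test-and-set finds the key clear */
        counter++;          /* safe: no other processor is past acquire()      */
        release();
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 200000 with the lock in place */
    return 0;
}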
d o not have some of the problems encoun- tion; (b) 2x2 crossbar; (c) 8x8 omega MIN routing a P, request to M,.
February 1990 11
MIMD-based architectural paradigms

[...] Each of these architectural types is predicated on MIMD principles of asynchronous operation and concurrent manipulation of multiple instruction and data streams. However, each of these architectures is also based on a distinctive organizing principle as fundamental to its overall design as MIMD characteristics. These architectures, therefore, are described under the category "MIMD-based architectural paradigms" to highlight their distinctive foundations as well as the MIMD characteristics they have in common.
Dataflow architectures. [...] Graph nodes represent asynchronous tasks, although they are often single instructions. Graph arcs represent communications paths that carry execution results needed as operands in subsequent instructions.

Some of the diverse mechanisms used to implement dataflow computers (such as the Manchester Data Flow Computer, MIT Tagged Token Data Flow architecture, and Toulouse LAU System) [11] are outlined below. Static implementations load all graph nodes into memory during initialization and allow only one instance of a node to be executed at a time; dynamic architectures allow the creation of node instances at runtime and multiple instances of a node to be executed concurrently.

Some architectures directly store "tokens" containing instruction results into a template for the instruction that will use them as operands. Other architectures use token-matching schemes, in which a matching unit stores tokens and tries to match them with instructions. When a complete set of tokens (all required operands) is assembled for an instruction, an instruction template containing the relevant operands is created and queued for execution.

Figure 14. Dataflow graph-program fragment.

Figure 15 shows how a simplified token-matching architecture might process the program fragment shown in Figure 14. At step 1, the execution of (3*a) results in the creation of a token that contains the result (15) and an indication that the instruction at node 3 requires this as an operand. Step 2 shows the matching unit that will match this token and the result token of (5*b) with the node 3 instruction. The matching unit creates the instruction token (template) shown at step 3. At step 4, the node store unit obtains the relevant instruction opcode from memory. The node store unit then fills in the relevant token fields (step 5), and assigns the instruction to a processor. The execution of the instruction will create a new result token to be used as input to the node 4 instruction.

Figure 15. Dataflow token-matching example.
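A minimal sketch of token matching follows. Each result token names the node that needs it; the matching unit holds a token until its partner arrives, then builds an executable template and "fires" it, which in turn emits a new result token. The three-node program is modeled on the Figure 14 fragment, but the value of b and the opcode of node 3 are assumptions (a = 5 reproduces the result 15 quoted above).

/* Toy token-matching unit: one waiting-operand slot per destination node. */
#include <stdio.h>

#define MAX_NODES 8

struct token { int dest; double value; };

static double waiting_value[MAX_NODES];   /* matching store */
static int    waiting[MAX_NODES];

static void emit(struct token t)
{
    if (t.dest == 4) {                    /* node 4 is treated as a sink in this sketch */
        printf("result token delivered to node 4: %g\n", t.value);
        return;
    }
    if (!waiting[t.dest]) {               /* no partner yet: store the token */
        waiting_value[t.dest] = t.value;
        waiting[t.dest] = 1;
        return;
    }
    /* partner found: build the instruction template and execute it */
    waiting[t.dest] = 0;
    double left = waiting_value[t.dest], right = t.value;
    double result = left + right;         /* node 3's opcode assumed to be "+"  */
    emit((struct token){ 4, result });    /* node 3's output feeds node 4       */
}

int main(void)
{
    double a = 5.0, b = 2.0;              /* a = 5 gives 3*a = 15 as in the text; b is assumed */
    emit((struct token){ 3, 3.0 * a });   /* step 1: (3*a) produces a token for node 3 */
    emit((struct token){ 3, 5.0 * b });   /* step 2: its partner arrives and node 3 fires */
    return 0;                             /* prints: result token delivered to node 4: 25 */
}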
Reduction architectures. Reduction, or demand-driven, architectures [12] implement an execution paradigm in which an instruction is enabled for execution when its results are required as operands for another instruction already enabled for execution. Most reduction architecture research began in the late 1970s to explore new parallel execution paradigms and to provide architectural support for applicative (functional) programming languages.

Reduction architectures execute programs that consist of nested expressions. Expressions are recursively defined as literals or function applications on arguments that may be literals or expressions. Programs may "reference" named expressions, which always return the same value (the referential transparency property). Reduction programs are function applications constructed from primitive functions.

Reduction program execution consists of recognizing reducible expressions, then replacing them with their calculated values. Thus, an entire reduction program is ultimately reduced to its result. Since the general execution paradigm only enables an instruction for execution when its results are needed by a previously enabled instruction, some additional rule is needed to enable the first instruction(s) and begin computation.

Practical challenges for implementing reduction architectures include synchronizing demands for an instruction's results (since preserving referential transparency requires calculating an expression's results once only) and maintaining copies of expression evaluation results (since an expression result could be referenced more than once but could be consumed by subsequent reductions upon first being delivered).
Figure 16. Reduction architecture demand token production.

Figure 17. Reduction architecture result token production.
Reduction architectures employ either string reduction or graph reduction to implement demand-driven paradigms. String reduction involves manipulating literals and copies of values, which are represented as strings that can be dynamically expanded and contracted. Graph reduction involves manipulating literals and references (pointers) to values. Thus, a program is represented as a graph, and garbage collection reclaims dynamically allocated memory as the reduction proceeds.

Figures 16 and 17 show a simplified version of a graph-reduction architecture that maps the program below onto tree-structured processors and passes tokens that demand or return results. Figure 16 depicts all the demand tokens produced by the program, as demands for the values of references propagate down the tree. In Figure 17, the last two result tokens produced are shown as they are passed to the root node.

a = + b c;
b = + d e;
c = * f g;
d = 1; e = 3; f = 5; g = 7.

Rather dissimilar architectures (such as the Newcastle Reduction Machine, North Carolina Cellular Tree Machine, and Utah Applicative Multiprocessing System) have been proposed to support both string- and graph-reduction approaches.
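Reading the program's prefix notation as b + c, d + e, and f * g, demanding a reduces b to 4, c to 35, and a to 39. The C sketch below imitates the demand-driven order and the compute-once rule; the table encoding and the memo flags are illustrative choices, not the article's machine organization.

/* Demand-driven (graph-reduction style) evaluation of the program above.
 * Each named expression is reduced at most once and its value is cached,
 * mirroring the referential-transparency property. */
#include <stdio.h>

enum op { LIT, ADD, MUL };

struct defn { enum op op; int left, right; double lit; };

/* indices: 0=a 1=b 2=c 3=d 4=e 5=f 6=g */
static struct defn prog[] = {
    { ADD, 1, 2, 0 },   /* a = + b c */
    { ADD, 3, 4, 0 },   /* b = + d e */
    { MUL, 5, 6, 0 },   /* c = * f g */
    { LIT, 0, 0, 1 },   /* d = 1 */
    { LIT, 0, 0, 3 },   /* e = 3 */
    { LIT, 0, 0, 5 },   /* f = 5 */
    { LIT, 0, 0, 7 },   /* g = 7 */
};

static double value[7];
static int    reduced[7];

static double demand(int name)            /* a demand travelling down the graph */
{
    if (!reduced[name]) {
        struct defn d = prog[name];
        double v;
        if (d.op == LIT)      v = d.lit;
        else if (d.op == ADD) v = demand(d.left) + demand(d.right);
        else                  v = demand(d.left) * demand(d.right);
        value[name] = v;                  /* result computed once only ...       */
        reduced[name] = 1;
    }
    return value[name];                   /* ... and returned to every reference */
}

int main(void)
{
    printf("c reduces to %g\n", demand(2));   /* prints 35 */
    printf("a reduces to %g\n", demand(0));   /* prints 39, reusing the cached c */
    return 0;
}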
Wavefront array architectures. Wavefront array processors [13] combine systolic data pipelining with an asynchronous dataflow execution paradigm. S-Y. Kung developed wavefront array concepts in the early 1980s to address the same kind of problems that stimulated systolic array research - producing efficient, cost-effective architectures for special-purpose systems that balance intensive computations with high I/O bandwidth (see the systolic array section above).

Wavefront and systolic architectures are both characterized by modular processors and regular, local interconnection networks. However, wavefront arrays replace the global clock and explicit time delays used for synchronizing systolic data pipelining with asynchronous handshaking as the mechanism for coordinating interprocessor data movement. Thus, when a processor has performed its computations and is ready to pass data to its successor, it informs the successor, sends data when the successor indicates it is ready, and receives an acknowledgment from the successor. The handshaking mechanism makes computational wavefronts pass smoothly through the array without intersecting, as the array's processors act as a wave propagating medium. In this manner, correct sequencing of computations replaces the correct timing of systolic architectures.
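A minimal sketch of that handshake, reduced to one producer, one consumer, and a single channel: the producer raises a request when it has a result, the consumer signals when its one-operand buffer is free, data moves only when both are up, and an acknowledgment lets the producer proceed. The step loop, flag names, and data values are invented for illustration.

/* One channel between two PEs, with request/ready/ack handshake wires. */
#include <stdio.h>

struct channel {
    int request, ready, ack;   /* handshake wires    */
    double data;               /* one-operand buffer */
};

int main(void)
{
    struct channel ch = { 0, 0, 0, 0.0 };
    double results[3] = { 1.5, 2.5, 3.5 };   /* values the producer wants to pass */
    int sent = 0, received = 0;

    for (int step = 0; received < 3; step++) {
        /* producer: announce a result when one is pending and none is in flight */
        if (sent < 3 && !ch.request && !ch.ack) {
            ch.data = results[sent];
            ch.request = 1;                 /* "I have data for you"         */
        }
        /* consumer: signal readiness whenever its input buffer is free */
        if (!ch.ready)
            ch.ready = 1;                   /* "my input buffer is empty"    */
        /* the transfer happens only when both sides agree */
        if (ch.request && ch.ready) {
            printf("step %d: consumer accepted %.1f\n", step, ch.data);
            received++;
            ch.request = 0;
            ch.ready = 0;
            ch.ack = 1;                     /* "got it" back to the producer */
        }
        /* producer: on acknowledgment, move on to the next result */
        if (ch.ack) {
            ch.ack = 0;
            sent++;
        }
    }
    return 0;
}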
Figure 18a-c depicts wavefront array concepts, using the matrix multiplication example used earlier to illustrate systolic operation (Figure 9). The example architecture consists of processing elements (PEs) that have a one-operand buffer for each input source. Whenever the buffer for a memory input source is empty and the associated memory contains another operand, that available operand is immediately read. Operands from other PEs are obtained using a handshaking protocol.

Figure 18a shows the situation after memory input buffers are initially filled. In Figure 18b PE(1,1) adds the product ae to its accumulator and transmits operands a and e to neighbors; thus, the first computational wavefront is shown propagating from PE(1,1) to PE(1,2) and PE(2,1). Figure 18c shows the first computational wavefront continuing to propagate, while a second wavefront is propagated by PE(1,1).

Figure 18. Wavefront array matrix multiplication.

Kung argued [13] that wavefront arrays enjoy several advantages over systolic arrays, including greater scalability, simpler programming, and greater fault tolerance. Wavefront arrays constructed at Johns Hopkins University and at the Standard Telecommunications Company and Royal Signals and Radar Establishment (in the United Kingdom) should facilitate further assessment of wavefront arrays' proposed advantages.
Acknowledgments

Particular thanks are due Lawrence Snyder and the referees for constructive comments. Thanks go to the following individuals for providing research materials, descriptions of parallel architectures, and insights: Theodore Bashkow, Laxmi Bhuyan, Jack Dongarra, Paul Edmonds, Scott Fahlman, Dennis Gannon, E. Allen Garrard, H.T. Kung, G.J. Lipovski, David Lugowski, Miroslaw Malek, Robert Masson, Susan Miller, James Peitrocini, Malcolm Rimmer, Howard J. Siegel, Charles Seitz, Vason Srini, Salvatore Stolfo, David Waltz, and Jon Webb.

This research was supported by the Rome Air Development Center under contract F30602-87-D-0092.

References

6. K.E. Batcher, "Design of a Massively Parallel Processor," IEEE Trans. Computers, Vol. C-29, Sept. 1980, pp. 836-844.

7. T. Kohonen, Content-Addressable Memories, 2nd edition, Springer-Verlag, New York, 1987.

Further reading

L. Snyder, "A Taxonomy of Synchronous Parallel Machines," Proc. 17th Int'l Conf. Parallel Processing, University Park, Penn., Aug. 1988.
Ralph Duncan is a system software design consultant with Control Data's Government Systems Group. His recent technical work has involved fault-tolerant operating systems, parallel architectures, and automated code generation.

Duncan holds an MS degree in information and computer science from the Georgia Institute of Technology, an MA from the University of California at Berkeley, and a BA from the University of Michigan. He is a member of the IEEE and the Computer Society.
Readers may contact the author at Control Data Corp., Government Systems, 300 Embassy Row, Atlanta, GA 30328.