
SURVEY & TUTORIAL SERIES

A Survey of Parallel Computer Architectures
Ralph Duncan, Control Data Corporation

The diversity of parallel computer architectures can bewilder the nonspecialist. This tutorial reviews alternative approaches to parallel processing within the framework of a high-level taxonomy.

This decade has witnessed the introduction of a wide variety of new computer architectures for parallel processing that complement and extend the major approaches to parallel computing developed in the 1960s and 1970s. The recent proliferation of parallel processing technologies has included new parallel hardware architectures (systolic and hypercube), interconnection technologies (multistage switching topologies), and programming paradigms (applicative programming). The sheer diversity of the field poses a substantial obstacle to the nonspecialist who wishes to comprehend what kinds of parallel architectures exist and how their relationship to one another defines an orderly schema.

This discussion attempts to place recent architectural innovations in the broader context of parallel architecture development by surveying the fundamentals of both newer and more established parallel computer architectures and by placing these architectural alternatives in a coherent framework. The survey's primary emphasis concerns architectural constructs rather than specific parallel machines.

Terminology and taxonomy

Problems. Diverse definitions have been proposed for parallel architectures. The difficulty in precisely defining the term is intertwined with the problem of specifying a parallel architecture taxonomy. A central problem for specifying a definition and consequent taxonomy for modern parallel architectures is to satisfy the following set of imperatives:

• Exclude architectures incorporating only low-level parallel mechanisms that have become commonplace features of modern computers.

• Maintain elements of Flynn's useful taxonomy[1] based on instruction and data streams.

• Include pipelined vector processors and other architectures that intuitively seem to merit inclusion as parallel architectures, but which are difficult to gracefully accommodate within Flynn's scheme.

We will examine each of these imperatives as we seek a definition that satisfies all of them and provides the basis for a reasonable taxonomy.

Low-level parallelism. There are two reasons to exclude machines that employ only low-level parallel mechanisms from the set of parallel architectures. First, failure to adopt a more rigorous standard might make the majority of modern computers "parallel architectures," negating the term's usefulness. Second, architectures having only the features listed below do not offer an explicit, coherent framework for developing high-level parallel solutions:

• Instruction pipelining - the decomposition of instruction execution into a linear series of autonomous stages, allowing each stage to simultaneously perform a portion of the execution process (such as decode, calculate effective address, fetch operand, execute, and store).

• Multiple CPU functional units - providing independent functional units for arithmetic and Boolean operations that execute concurrently.

• Separate CPU and I/O processors - freeing the CPU from I/O control responsibilities by using dedicated I/O processors; solutions range from relatively simple I/O controllers to complex peripheral processing units.

Although these features contribute significantly to performance engineering, their presence does not make a computer a parallel architecture.

Flynn's taxonomy. Flynn's taxonomy classifies architectures on the presence of single or multiple streams of instructions and data. This yields the four categories below:

• SISD (single instruction, single data stream) - defines serial computers.

• MISD (multiple instruction, single data stream) - would involve multiple processors applying different instructions to a single datum; this hypothetical possibility is generally deemed impractical.

• SIMD (single instruction, multiple data streams) - involves multiple processors simultaneously executing the same instruction on different data (this definition is discussed further prior to examining array processors below).

• MIMD (multiple instruction, multiple data streams) - involves multiple processors autonomously executing diverse instructions on diverse data.

Although these distinctions provide a useful shorthand for characterizing architectures, they are insufficient for classifying various modern computers. For example, pipelined vector processors merit inclusion as parallel architectures, since they exhibit substantial concurrent arithmetic execution and can manipulate hundreds of vector elements in parallel. However, they are difficult to accommodate within Flynn's taxonomy, because they lack processors executing the same instruction in SIMD lockstep and lack the asynchronous autonomy of the MIMD category.

Definition and taxonomy. A first step to providing a satisfactory taxonomy is to articulate a definition of parallel architecture. The definition should include appropriate computers that the Flynn schema cannot handle and exclude architectures incorporating only low-level parallelism. Therefore, a parallel architecture provides an explicit, high-level framework for the development of parallel programming solutions by providing multiple processors, whether simple or complex, that cooperate to solve problems through concurrent execution.

Figure 1 shows a taxonomy based on the imperatives discussed earlier and the proposed definition. This informal taxonomy uses high-level categories to delineate the principal approaches to parallel computer architectures and to show that these approaches define a coherent spectrum of architectural alternatives. Definitions for each category are provided below.

This taxonomy is not intended to supplant efforts to construct more fully articulated taxonomies. Such taxonomies provide comprehensive subcategories to reflect permutations of architectural characteristics and to cover lower level features. The "Further reading" section at the end references several thoughtful taxonomic studies that address these goals.

Figure 1. High-level taxonomy of parallel computer architectures: synchronous (vector, processor-array SIMD, associative-memory SIMD, systolic); MIMD (distributed memory, shared memory); MIMD-based paradigms (MIMD/SIMD, dataflow, reduction, wavefront).

Synchronous architectures

Synchronous parallel architectures coordinate concurrent operations in lockstep through global clocks, central control units, or vector unit controllers.

Pipelined vector processors. The first vector processor architectures were developed in the late 1960s and early 1970s[2,3] to directly support massive vector and matrix calculations. Vector processors[4] are characterized by multiple, pipelined functional units, which implement arithmetic and Boolean operations for both vectors and scalars and which can operate concurrently. Such architectures provide parallel vector processing by sequentially streaming vector elements through a functional unit pipeline and by streaming the output results of one unit into the pipeline of another as input (a process known as "chaining").
A representative architecture might have a vector addition unit consisting of six pipeline stages (see Figure 2). If each pipeline stage in the hypothetical architecture shown in the figure has a cycle time of 20 nanoseconds, then 120 ns elapse from the time operands a1 and b1 enter stage 1 until result c1 is available. When the pipeline is filled, however, a result is available every 20 ns. Thus, start-up overhead of pipelined vector units has significant performance implications. In the case of the register-to-register architecture depicted, special high-speed vector registers hold operands and results. Efficient performance for such architectures (for example, the Cray-1 and Fujitsu VP-200) is obtained when vector operand lengths are multiples of the vector register size. Memory-to-memory architectures (such as the Control Data Cyber 205 and Texas Instruments Advanced Scientific Computer) use special memory buffers instead of vector registers.

Figure 2. Register-to-register vector architecture operation.

Recent vector processing supercomputers (such as the Cray X-MP/4 and ETA-10) unite four to 10 vector processors through a large shared memory. Since such architectures can support task-level parallelism, they could arguably be termed MIMD architectures, although vector processing capabilities are the fundamental aspect of their design.
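The pipeline timing above can be made concrete with a short sketch. The following Python fragment is illustrative only; the six-stage unit and 20-ns stage time come from the hypothetical example in the text, not from any actual machine.

    # Sketch of pipelined vector addition timing for a hypothetical six-stage
    # unit with a 20-ns stage time. Illustrative only.
    STAGES = 6
    CYCLE_NS = 20

    def result_ready_times(vector_length):
        """Time (in ns) at which each result element leaves the pipeline."""
        # Element i enters stage 1 at cycle i and emerges STAGES cycles later.
        return [(i + STAGES) * CYCLE_NS for i in range(vector_length)]

    times = result_ready_times(8)
    print("c1 available after", times[0], "ns")                     # 120 ns start-up latency
    print("later results arrive every", times[1] - times[0], "ns")  # 20 ns apart once filled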

SIMD architectures. SIMD architectures (see Figure 3) typically employ a central control unit, multiple processors, and an interconnection network (IN) for either processor-to-processor or processor-to-memory communications. The control unit broadcasts a single instruction to all processors, which execute the instruction in lockstep fashion on local data. The interconnection network allows instruction results calculated at one processor to be communicated to another processor for use as operands in a subsequent instruction. Individual processors may be allowed to disable the current instruction.

Figure 3. SIMD execution.

Processor array architectures. Processor arrays[5] structured for numerical SIMD execution have often been employed for large-scale scientific calculations, such as image processing and nuclear energy modeling. Processor arrays developed in the late 1960s (such as the Illiac-IV) and more recent successors (such as the Burroughs Scientific Processor) utilize processors that accommodate word-sized operands. Operands are usually floating-point (or complex) values and typically range in size from 32 to 64 bits. Various IN schemes have been used to provide processor-to-processor or processor-to-memory communications, with mesh and crossbar approaches being among the most popular.

One variant of processor array architectures involves using a large number of one-bit processors. In bit-plane architectures, the array of processors is arranged in a symmetrical grid (such as 64x64) and associated with multiple "planes" of memory bits that correspond to the dimensions of the processor grid (see Figure 4). Processor n (Pn), situated in the processor grid at location (x, y), operates on the memory bits at location (x, y) in all the associated memory planes. Usually, operations are provided to copy, mask, and perform arithmetic operations on entire memory planes, as well as on columns and rows within a plane.

Figure 4. Bit-plane array processing.
Loral's Massively Parallel Processor[6] and ICL's Distributed Array Processor exemplify this kind of architecture, which is often used for image processing applications by mapping pixels to the memory's planar structure. Thinking Machines' Connection Machine organizes as many as 65,536 one-bit processors as sets of four-processor meshes united in a hypercube topology.
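A minimal sketch of the bit-plane idea, assuming a small hypothetical grid and a handful of planes (real machines use far larger grids), shows how one broadcast instruction is carried out at every grid position in lockstep.

    # Illustrative sketch of SIMD bit-plane execution; not modeled on any
    # specific machine. Each grid position (x, y) owns one bit per memory
    # plane, and a broadcast instruction is applied at every position.
    GRID = 4                                   # assumed grid size for illustration
    PLANES = 3
    # memory[p][x][y] holds the bit at position (x, y) of plane p
    memory = [[[0] * GRID for _ in range(GRID)] for _ in range(PLANES)]

    def broadcast_or(dest_plane, src_a, src_b):
        """Every processor ORs its bits from two planes into a third, in lockstep."""
        for x in range(GRID):
            for y in range(GRID):
                memory[dest_plane][x][y] = memory[src_a][x][y] | memory[src_b][x][y]

    memory[0][1][2] = 1
    memory[1][3][0] = 1
    broadcast_or(2, 0, 1)                      # one instruction, applied at all (x, y)
    print(memory[2][1][2], memory[2][3][0])    # 1 1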
Associative memory processor architectures. Computers built around an associative memory[7] constitute a distinctive type of SIMD architecture that uses special comparison logic to access stored data in parallel according to its contents. Research in constructing associative memories began in the late 1950s with the obvious goal of being able to search memory in parallel for data that matched some specified datum. "Modern" associative memory processors developed in the early 1970s (for example, Bell Laboratories' Parallel Element Processing Ensemble, or PEPE) and recent architectures (for example, Loral's Associative Processor, or Aspro) have naturally been geared to database-oriented applications, such as tracking and surveillance.

Figure 5 shows the characteristic functional units of an associative memory processor. A program controller (serial computer) reads and executes instructions, invoking a specialized array controller when associative memory instructions are encountered. Special registers enable the program controller and associative memory to share data.

Figure 5. Associative memory processing organization.

Most current associative memory processors use a bit-serial organization, which involves concurrent operations on a single bit-slice (bit-column) of all the words in the associative memory. Each associative memory word, which usually has a very large number of bits (for example, 32,768), is associated with special registers and comparison logic that functionally constitute a processor. Hence, an associative processor with 4,096 words effectively has 4,096 processing elements.

Figure 6 depicts a row-oriented comparison operation for a generic bit-serial architecture. A portion of the comparison register contains the value to be matched. All of the associative processing elements start at a specified memory column and compare the contents of four consecutive bits in their row against the comparison register contents, setting a bit in the A register to indicate whether or not their row contains a match.

In Figure 7 a logical OR operation is performed on a bit-column and the bit-vector in register A, with register B receiving the results. A zero in the mask register indicates that the associated word is not to be included in the current operation.

Figure 6. Associative memory comparison operation.

Figure 7. Associative memory logical OR operation.
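The comparison of Figure 6 can be sketched as follows. The memory contents, field width, and starting column are made-up values for illustration, and the hardware's column-at-a-time bit-serial operation is compressed into a single field comparison per word.

    # Sketch of an associative comparison in the style of Figure 6. Each word
    # has its own one-bit A register; all words compare the same bit field
    # against the comparison register at once. Purely illustrative.
    words = [                       # associative memory, one bit-vector per word
        [0, 0, 1, 1, 0, 1, 0],
        [1, 0, 1, 1, 0, 0, 1],
        [0, 1, 0, 1, 0, 1, 0],
    ]
    comparison = [1, 1, 0, 1]       # value to be matched
    start_col = 2                   # search window begins at this bit column

    # Every processing element performs the same field comparison in lockstep.
    a_register = [
        1 if word[start_col:start_col + len(comparison)] == comparison else 0
        for word in words
    ]
    print(a_register)               # [1, 0, 0]: only word 0 matches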
Systolic architectures. In the early 1980s H.T. Kung of Carnegie Mellon University proposed systolic architectures to solve the problems of special-purpose systems that must often balance intensive computations with demanding I/O bandwidths.[8] Systolic architectures (systolic arrays) are pipelined multiprocessors in which data is pulsed in rhythmic fashion from memory and through a network of processors before returning to memory (see Figure 8). A global clock and explicit timing delays synchronize this pipelined data flow, which consists of operands obtained from memory and partial results to be used by each processor. Modular processors united by regular, local interconnections provide basic building blocks for a variety of special-purpose systems. During each time interval, these processors execute a short, invariant sequence of instructions.

Figure 8. Systolic flow of data from and to memory.

Systolic arrays address the performance requirements of special-purpose systems by achieving significant parallel computation and by avoiding I/O and memory bandwidth bottlenecks. A high degree of parallelism is obtained by pipelining data through multiple processors, typically in two-dimensional fashion. Systolic architectures maximize the computations performed on a datum once it has been obtained from memory or an external device. Hence, once a datum enters the systolic array, it is passed to any processor that needs it, without an intervening store to memory. Only processors at the topological boundaries of the array perform I/O to and from memory.

Figure 9a-e shows how a simple systolic array could calculate the outer product of two matrices,

    A = | a b |    and    B = | e f |
        | c d |               | g h |

The zero inputs shown moving through the array are used for synchronization. Each processor begins with an accumulator set to zero and, during each cycle, adds the product of its two inputs to the accumulator. After five cycles the matrix product is complete.

Figure 9. Systolic matrix multiplication.

A growing number of special-purpose systems use systolic organization for algorithm-specific architectures, particularly for signal processing. In addition, programmable (reconfigurable) systolic architectures (such as Carnegie Mellon's Warp and Saxpy's Matrix-1) have been constructed that are not limited to implementing a single algorithm. Although systolic concepts were originally proposed for VLSI-based systems to be implemented at the chip level, systolic architectures have been implemented at a variety of physical levels.
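The accumulate-and-pass behavior can be sketched in software. The fragment below uses the common row/column skewing of operands rather than the explicit interleaved zeros of Figure 9, so its cycle count differs slightly from the five cycles cited in the text; it is a behavioral sketch, not a model of any particular systolic design.

    # Behavioral sketch of a systolic matrix product C = A x B on an n x n
    # grid of processing elements. Operands arrive as skewed, zero-padded
    # streams; each PE multiplies its two inputs, adds the product to its
    # accumulator, and forwards the operands right and down next cycle.
    def systolic_matmul(A, B):
        n = len(A)
        acc = [[0] * n for _ in range(n)]        # one accumulator per PE
        a_reg = [[0] * n for _ in range(n)]      # operand each PE passes rightward
        b_reg = [[0] * n for _ in range(n)]      # operand each PE passes downward
        for t in range(3 * n - 2):               # enough cycles to drain the array
            new_a = [[0] * n for _ in range(n)]
            new_b = [[0] * n for _ in range(n)]
            for i in range(n):
                for j in range(n):
                    # Boundary PEs read skewed operands from memory; interior
                    # PEs read what a neighbor forwarded on the previous cycle.
                    if j == 0:
                        a_in = A[i][t - i] if 0 <= t - i < n else 0
                    else:
                        a_in = a_reg[i][j - 1]
                    if i == 0:
                        b_in = B[t - j][j] if 0 <= t - j < n else 0
                    else:
                        b_in = b_reg[i - 1][j]
                    acc[i][j] += a_in * b_in
                    new_a[i][j], new_b[i][j] = a_in, b_in
            a_reg, b_reg = new_a, new_b
        return acc

    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]

The same loop structure applies for larger n; only the number of drain cycles grows.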
MIMD architectures

MIMD architectures employ multiple processors that can execute independent instruction streams, using local data. Thus, MIMD computers support parallel solutions that require processors to operate in a largely autonomous manner. Although software processes executing on MIMD architectures are synchronized by passing messages through an interconnection network or by accessing data in shared memory units, MIMD architectures are asynchronous computers, characterized by decentralized hardware control.
The impetus for developing MIMD architectures can be ascribed to several interrelated factors. MIMD computers support higher level parallelism (subprogram and task levels) that can be exploited by "divide and conquer" algorithms organized as largely independent subcalculations (for example, searching and sorting). MIMD architectures may provide an alternative to depending on further implementation refinements in pipelined vector computers to provide the significant performance increases needed to make some scientific applications tractable (such as three-dimensional fluid modeling). Finally, the cost-effectiveness of n-processor systems over n single-processor systems encourages MIMD experimentation.

Distributed memory architectures. Distributed memory architectures (Figure 10) connect processing nodes (consisting of an autonomous processor and its local memory) with a processor-to-processor interconnection network. Nodes share data by explicitly passing messages through the interconnection network, since there is no shared memory. A product of 1980s research, these architectures have principally been constructed in an effort to provide a multiprocessor architecture that will "scale" (accommodate a significant increase in processors) and will satisfy the performance requirements of large scientific applications characterized by local data references.

Figure 10. MIMD distributed memory architecture structure.

Various interconnection network topologies have been proposed to support architectural expandability and provide efficient performance for parallel programs with differing interprocessor communication patterns. Figure 11a-e depicts the topologies discussed below.

Figure 11. MIMD interconnection network topologies: (a) ring; (b) mesh; (c) tree; (d) hypercube; (e) tree mapped to a reconfigurable mesh.
Ring topology architectures. The communication diameter (N/2) of ring topology architectures can be reduced by adding chordal connections. Using chordal connections or multiple rings can increase a ring-based architecture's fault tolerance. Typically, fixed-size message packets are used that include a node destination field. Ring topologies are most appropriate for a small number of processors executing algorithms not dominated by data communications.

Mesh topology architectures. A two-dimensional mesh, or lattice, topology has n^2 nodes, each connected to its four immediate neighbors. Wraparound connections at the edges are sometimes provided to reduce the communication diameter from 2(n-1) to 2 * (integer part of n/2). Communications may be augmented by providing additional diagonal links or by using buses to connect nodes by rows and columns. The topological correspondence between meshes and matrix-oriented algorithms encourages mesh-based architecture research.

Tree topology architectures. Tree topology architectures, such as Columbia University's DADO2 and Non-Von, have been constructed to support divide-and-conquer algorithms for searching and sorting, image processing algorithms, and dataflow and reduction programming paradigms. Although a variety of tree-structured topologies have been suggested, complete binary trees are the most analyzed variant. Several strategies have been employed to reduce the communication diameter of tree topologies (2(n-1) for a complete binary tree with n levels and 2^n - 1 processors). Example solutions include adding additional interconnection network pathways to unite all nodes at the same tree level.
Hypercube topology architectures. A Boolean n-cube or "hypercube" topology uses N = 2^n processors arranged in an n-dimensional cube, where each node has n = log2 N bidirectional links to adjacent nodes. Individual nodes are uniquely identified by n-bit numeric values ranging from 0 to N-1 and assigned in a manner that ensures adjacent nodes' values differ by a single bit. The communication diameter of such a hypercube topology architecture is n = log2 N.

Hypercube architecture research has been strongly influenced by the desire to develop a "scalable" architecture that supports the performance requirements of 3D scientific applications. Extant hypercube architectures include the Cosmic Cube, Ametek Series 2010, Intel Personal Supercomputer, and Ncube/10.
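A brief sketch illustrates the addressing properties just described: flipping any one of the n address bits yields a neighbor, and dimension-order routing needs at most n = log2 N hops. The routine names and the dimension chosen are illustrative only.

    # Sketch of hypercube addressing: with N = 2**n nodes, each node's label
    # differs from each neighbor's in exactly one bit, and a message needs at
    # most n hops. No specific machine's router is modeled.
    n = 4                                     # hypercube dimension: 2**n = 16 nodes

    def neighbors(node):
        """Nodes adjacent to `node`: flip each of the n address bits in turn."""
        return [node ^ (1 << bit) for bit in range(n)]

    def route(source, dest):
        """Dimension-order route; its length equals the number of differing bits."""
        path, current = [source], source
        for bit in range(n):
            if (current ^ dest) & (1 << bit):     # this address bit still differs
                current ^= 1 << bit
                path.append(current)
        return path

    print(neighbors(0b0101))                  # the four neighbors of node 5
    print(route(0b0000, 0b1111))              # worst case: n hops corner to corner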

Reconfigurable topology architectures. Although distributed memory architectures possess an underlying physical topology, reconfigurable topology architectures provide programmable switches that allow users to select a logical topology matching application communication patterns. The functional reconfigurability available in research prototypes ranges from specifying different topologies (such as Lawrence Snyder's Configurable Highly Parallel Computer, or Chip) to partitioning a base topology into multiple interconnection topologies of the same type (such as Howard J. Siegel's Partitionable SIMD/MIMD System, or Pasm). A significant motivation for constructing reconfigurable topology architectures is that a single architecture can act as many special-purpose architectures that efficiently support the communications patterns of particular algorithms or algorithm steps.

Shared-memory architectures. Shared memory architectures accomplish interprocessor coordination by providing a global, shared memory that each processor can address. Commercial shared-memory architectures, such as Flexible Corporation's Flex/32 and Encore Computer's Multimax, were introduced during the 1980s. These architectures involve multiple general-purpose processors sharing memory, rather than a CPU and peripheral I/O processors. Shared memory computers do not have some of the problems encountered by message-passing architectures, such as message sending latency as data is queued and forwarded by intermediate nodes. However, other problems, such as data access synchronization and cache coherency, must be solved.

Coordinating processors with shared variables requires atomic synchronizing mechanisms to prevent one process from accessing a datum before another finishes updating it. These mechanisms provide an atomic operation that subjects a "key" to a comparison test before allowing either the key or associated data to be updated. The "test-and-set" mechanism, for example, is an atomic operation for testing the key's value and, if the test result is true, updating the key value.
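The following sketch shows how a test-and-set style primitive can guard a shared variable. Real hardware makes the read-and-set indivisible; here a Python lock merely stands in for that atomicity, so the code is an illustration rather than a model of any machine's synchronization instruction.

    # Sketch of shared-variable synchronization built on a test-and-set style
    # primitive; the hardware's atomicity is emulated with a lock.
    import threading

    _key = 0
    _atomic = threading.Lock()          # stands in for the hardware's atomicity

    def test_and_set():
        """Atomically read the key and set it to 1; return the previous value."""
        global _key
        with _atomic:
            previous = _key
            _key = 1
            return previous

    def release():
        global _key
        _key = 0

    counter = 0                         # shared data protected by the key

    def add_many(times):
        global counter
        for _ in range(times):
            while test_and_set() == 1:  # spin until the key was found clear
                pass
            counter += 1                # critical section
            release()

    workers = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter)                      # 40000: no updates are lost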
Typically, each processor in a shared memory architecture also has a local memory used as a cache. Multiple copies of the same shared memory data, therefore, may exist in various processors' caches at a given time. Maintaining a consistent version of such data is the cache coherency problem, which concerns providing new versions of the cached data to each involved processor whenever a processor updates its copy. Although systems with a small number of processors can use hardware "snooping" mechanisms to determine when shared memory data has been updated, larger systems usually rely on software solutions to minimize performance impact.

Figure 12a-c illustrates some major alternatives for connecting multiple processors to shared memory (outlined below).

Figure 12. MIMD shared-memory interconnection schemes: (a) bus interconnection; (b) 2x2 crossbar; (c) 8x8 omega MIN routing a processor's request to a memory module.

Bus interconnections. Time-shared buses (Figure 12a) offer a fairly simple way to give multiple processors access to a shared memory. A single, time-shared bus effectively accommodates a moderate number of processors (from four to 20), since only one processor accesses the bus at a given time. Some bus-based architectures, such as the experimental Cm* architecture, employ two kinds of buses - a local bus linking a cluster of processors and a higher level system bus linking dedicated service processors associated with each cluster.

Crossbar interconnections. Crossbar interconnection technology uses a crossbar switch of n^2 crosspoints to connect n processors to n memories (see Figure 12b). Processors may contend for access to a memory location, but crossbars prevent contention for communication links by providing a dedicated pathway between each possible processor/memory pairing. Power, pinout, and size considerations have limited crossbar architectures to a small number of processors (from four to 16). The Alliant FX/8 is a commercial architecture that uses a crossbar scheme to connect processors and cache memories.

Multistage interconnection networks. Multistage interconnection networks (MINs)[9] strike a compromise between the price/performance alternatives offered by crossbars and buses. An N x N MIN connects N processors to N memories by deploying multiple "stages" or banks of switches in the interconnection network pathway. When N is a power of 2, one approach is to employ log2 N stages of N/2 switches, using 2x2 switches. A processor making a memory access request specifies the desired destination (and pathway) by issuing a bit-value that contains a control bit for each stage. The switch at stage i examines the ith bit to determine whether the input (request) is to be connected to the upper or lower output.

Figure 12c shows an omega network connecting eight processors and memories, where a control bit equal to zero indicates a connection to the upper output. Expandability is a significant feature of such a MIN, since its communication diameter is proportional to log2 N. The BBN (Bolt, Beranek, and Newman) Butterfly, for example, can be configured with as many as 256 processors.
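Self-routing through an omega network can be sketched as below. The sketch assumes one common convention (destination bits consumed most significant bit first, with the perfect-shuffle permutation between stages); actual MIN designs differ in numbering details.

    # Sketch of self-routing through an 8x8 omega MIN built from 2x2 switches.
    # At each stage a destination bit selects the switch output: 0 = upper,
    # 1 = lower. Illustrative only.
    n = 3                                    # log2(N): 8 processors and 8 memories
    N = 1 << n

    def rotate_left(x):
        """Perfect-shuffle wiring between stages: rotate the n address bits left."""
        return ((x << 1) | (x >> (n - 1))) & (N - 1)

    def omega_route(source, dest):
        """Return the link position occupied after each of the n switch stages."""
        position, trace = source, []
        for stage in range(n):
            position = rotate_left(position)
            control = (dest >> (n - 1 - stage)) & 1   # destination bit for this stage
            position = (position & ~1) | control      # 0 selects upper output, 1 lower
            trace.append(position)
        return trace

    print(omega_route(0b010, 0b110))        # the final entry is the requested memory, 6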
MIMD-based architectural paradigms

MIMD/SIMD hybrids, dataflow architectures, reduction machines, and wavefront arrays all pose a similar difficulty for an orderly taxonomy of parallel architectures. Each of these architectural types is predicated on MIMD principles of asynchronous operation and concurrent manipulation of multiple instruction and data streams. However, each of these architectures is also based on a distinctive organizing principle as fundamental to its overall design as MIMD characteristics. These architectures, therefore, are described under the category "MIMD-based architectural paradigms" to highlight their distinctive foundations as well as the MIMD characteristics they have in common.

MIMD/SIMD architectures. A variety of experimental hybrid architectures constructed during the 1980s allow selected portions of a MIMD architecture to be controlled in SIMD fashion (for example, DADO, Non-Von, Pasm, and Texas Reconfigurable Array Computer, or TRAC).[14] The implementation mechanisms explored for reconfiguring architectures and controlling SIMD execution are quite diverse. Using a tree-structured, message-passing computer[10] as the base architecture for a MIMD/SIMD architecture helps illustrate the general concept. The master/slaves relation of a SIMD architecture's controller and processors can be mapped onto the node/descendents relation of a subtree (see Figure 13). When the root processor node of a subtree operates as a SIMD controller, it transmits instructions to descendent nodes that execute the instructions on local memory data.

Figure 13. MIMD/SIMD operation.

The flexibility of MIMD/SIMD architectures obviously makes them attractive candidates for further research. Specific incentives for recent development efforts include supporting parallel image processing and expert system applications.
Dataflow architectures. The fundamental feature of dataflow architectures is an execution paradigm in which instructions are enabled for execution as soon as all of their operands become available. Thus, the sequence of executed instructions is based on data dependencies, allowing dataflow architectures to exploit concurrency at the task, routine, and instruction levels. A major incentive for dataflow architecture research, which dates from J.B. Dennis' pioneering work in the mid-1970s, is to explore new computational models and languages that can be effectively exploited to achieve large-scale parallelism.

Dataflow architectures execute dataflow graphs, such as the program fragment depicted in Figure 14. We can think of graph nodes as representing asynchronous tasks, although they are often single instructions. Graph arcs represent communications paths that carry execution results needed as operands in subsequent instructions.

Figure 14. Dataflow graph-program fragment.

Some of the diverse mechanisms used to implement dataflow computers (such as the Manchester Data Flow Computer, MIT Tagged Token Data Flow architecture, and Toulouse LAU System)[11] are outlined below. Static implementations load all graph nodes into memory during initialization and allow only one instance of a node to be executed at a time; dynamic architectures allow the creation of node instances at runtime and multiple instances of a node to be executed concurrently.

Some architectures directly store "tokens" containing instruction results into a template for the instruction that will use them as operands. Other architectures use token-matching schemes, in which a matching unit stores tokens and tries to match them with instructions. When a complete set of tokens (all required operands) is assembled for an instruction, an instruction template containing the relevant operands is created and queued for execution.

Figure 15 shows how a simplified token-matching architecture might process the program fragment shown in Figure 14. At step 1, the execution of (3*a) results in the creation of a token that contains the result (15) and an indication that the instruction at node 3 requires this as an operand. Step 2 shows the matching unit that will match this token and the result token of (5*b) with the node 3 instruction. The matching unit creates the instruction token (template) shown at step 3. At step 4, the node store unit obtains the relevant instruction opcode from memory. The node store unit then fills in the relevant token fields (step 5), and assigns the instruction to a processor. The execution of the instruction will create a new result token to be used as input to the node 4 instruction.

Figure 15. Dataflow token-matching example.
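A token-matching dataflow interpreter can be sketched in a few lines. The tiny graph below is hypothetical and only loosely mirrors the fragment of Figure 14 (two multiplications feeding an addition whose result is consumed by a fourth node); the node numbering and operand values are assumptions for illustration.

    # Sketch of token matching: an instruction fires as soon as tokens for all
    # of its operands have arrived. Hypothetical graph, illustrative only.
    import operator

    # node id -> (operation, destination node or None, number of operands)
    graph = {
        1: (operator.mul, 3, 2),        # computes 3 * a
        2: (operator.mul, 3, 2),        # computes 5 * b
        3: (operator.add, 4, 2),        # adds the two products
        4: (print, None, 1),            # consumes the final result
    }
    matching_unit = {}                  # node id -> operand tokens received so far

    def send_token(node, value):
        """Store a token; fire the node once a complete operand set is present."""
        operands = matching_unit.setdefault(node, [])
        operands.append(value)
        operation, destination, arity = graph[node]
        if len(operands) == arity:      # all required operands have arrived
            result = operation(*matching_unit.pop(node))
            if destination is not None:
                send_token(destination, result)

    # Initial tokens: the literals 3 and 5 plus assumed values a = 5 and b = 4.
    send_token(1, 3)
    send_token(1, 5)                    # node 1 fires and emits the token 15
    send_token(2, 5)
    send_token(2, 4)                    # node 2 fires; nodes 3 and 4 then fire in turn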
Reduction architectures. Reduction, or demand-driven, architectures[12] implement an execution paradigm in which an instruction is enabled for execution when its results are required as operands for another instruction already enabled for execution. Most reduction architecture research began in the late 1970s to explore new parallel execution paradigms and to provide architectural support for applicative (functional) programming languages.

Reduction architectures execute programs that consist of nested expressions. Expressions are recursively defined as literals or function applications on arguments that may be literals or expressions. Programs may "reference" named expressions, which always return the same value (the referential transparency property). Reduction programs are function applications constructed from primitive functions.

Reduction program execution consists of recognizing reducible expressions, then replacing them with their calculated values. Thus, an entire reduction program is ultimately reduced to its result. Since the general execution paradigm only enables an instruction for execution when its results are needed by a previously enabled instruction, some additional rule is needed to enable the first instruction(s) and begin computation.

Practical challenges for implementing reduction architectures include synchronizing demands for an instruction's results (since preserving referential transparency requires calculating an expression's results once only) and maintaining copies of expression evaluation results (since an expression result could be referenced more than once but could be consumed by subsequent reductions upon first being delivered).

Reduction architectures employ either string reduction or graph reduction to implement demand-driven paradigms. String reduction involves manipulating literals and copies of values, which are represented as strings that can be dynamically expanded and contracted. Graph reduction involves manipulating literals and references (pointers) to values. Thus, a program is represented as a graph, and garbage collection reclaims dynamically allocated memory as the reduction proceeds.

Figures 16 and 17 show a simplified version of a graph-reduction architecture that maps the program below onto tree-structured processors and passes tokens that demand or return results. Figure 16 depicts all the demand tokens produced by the program, as demands for the values of references propagate down the tree. In Figure 17, the last two result tokens produced are shown as they are passed to the root node.

    a = +bc;
    b = +de;
    c = *fg;
    d = 1; e = 3; f = 5; g = 7.

Figure 16. Reduction architecture demand token production.

Figure 17. Reduction architecture result token production.

Rather dissimilar architectures (such as the Newcastle Reduction Machine, North Carolina Cellular Tree Machine, and Utah Applicative Multiprocessing System) have been proposed to support both string- and graph-reduction approaches.
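Demand-driven evaluation of this program can be sketched as follows. The sketch abstracts graph reduction into a recursive demand function with a result cache, so each named expression is reduced only once; it illustrates the execution order rather than the token formats of Figures 16 and 17.

    # Sketch of demand-driven (reduction) evaluation of the program above.
    # Demanding "a" propagates demands for the expressions it references;
    # results flow back up and are cached (referential transparency).
    program = {                      # name -> literal or (operator, ref, ref)
        "a": ("+", "b", "c"),
        "b": ("+", "d", "e"),
        "c": ("*", "f", "g"),
        "d": 1, "e": 3, "f": 5, "g": 7,
    }
    apply_op = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}
    results = {}                     # expressions already reduced

    def demand(name):
        """Reduce the named expression, demanding its operands as needed."""
        if name in results:
            return results[name]             # already reduced: reuse the result
        expr = program[name]
        if isinstance(expr, tuple):
            op, left, right = expr
            value = apply_op[op](demand(left), demand(right))
        else:
            value = expr                     # literal
        results[name] = value
        return value

    print(demand("a"))               # 39: b reduces to 4, c to 35, then a to 39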
Wavefront array architectures. Wavefront array processors[13] combine systolic data pipelining with an asynchronous dataflow execution paradigm. S-Y. Kung developed wavefront array concepts in the early 1980s to address the same kind of problems that stimulated systolic array research - producing efficient, cost-effective architectures for special-purpose systems that balance intensive computations with high I/O bandwidth (see the systolic array section above).

Wavefront and systolic architectures are both characterized by modular processors and regular, local interconnection networks. However, wavefront arrays replace the global clock and explicit time delays used for synchronizing systolic data pipelining with asynchronous handshaking as the mechanism for coordinating interprocessor data movement. Thus, when a processor has performed its computations and is ready to pass data to its successor, it informs the successor, sends data when the successor indicates it is ready, and receives an acknowledgment from the successor. The handshaking mechanism makes computational wavefronts pass smoothly through the array without intersecting, as the array's processors act as a wave propagating medium. In this manner, correct sequencing of computations replaces the correct timing of systolic architectures.

Figure 18a-c depicts wavefront array concepts, using the matrix multiplication example used earlier to illustrate systolic operation (Figure 9). The example architecture consists of processing elements (PEs) that have a one-operand buffer for each input source. Whenever the buffer for a memory input source is empty and the associated memory contains another operand, that available operand is immediately read. Operands from other PEs are obtained using a handshaking protocol.

Figure 18a shows the situation after memory input buffers are initially filled. In Figure 18b PE(1,1) adds the product ae to its accumulator and transmits operands a and e to neighbors; thus, the first computational wavefront is shown propagating from PE(1,1) to PE(1,2) and PE(2,1). Figure 18c shows the first computational wavefront continuing to propagate, while a second wavefront is propagated by PE(1,1).

Figure 18. Wavefront array matrix multiplication.

Kung argued[13] that wavefront arrays enjoy several advantages over systolic arrays, including greater scalability, simpler programming, and greater fault tolerance. Wavefront arrays constructed at Johns Hopkins University and at the Standard Telecommunications Company and Royal Signals and Radar Establishment (in the United Kingdom) should facilitate further assessment of wavefront arrays' proposed advantages.
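The handshaking idea can be sketched with two communicating threads. The ready/send/acknowledge exchange is abstracted as a one-slot blocking buffer between neighboring PEs, which is only an approximation of the hardware protocol described above.

    # Sketch of asynchronous handshaking between two neighboring wavefront PEs.
    # The producer cannot send a new operand until the consumer has taken the
    # previous one. Illustrative only.
    import threading
    import queue

    link = queue.Queue(maxsize=1)            # one-operand buffer between neighbors

    def producer_pe(operands):
        for value in operands:
            link.put(value)                  # blocks until the successor is ready
        link.put(None)                       # end-of-stream marker

    def consumer_pe():
        accumulator = 0
        while True:
            value = link.get()               # taking the operand frees the buffer,
            if value is None:                # implicitly acknowledging the sender
                break
            accumulator += value
        print("accumulated", accumulator)    # 1 + 2 + 3 + 4 = 10

    pes = [threading.Thread(target=producer_pe, args=([1, 2, 3, 4],)),
           threading.Thread(target=consumer_pe)]
    for pe in pes:
        pe.start()
    for pe in pes:
        pe.join()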

The diversity of recently introduced parallel computer architectures confronts the interested observer with what R.W. Hockney has felicitously termed "a confusing menagerie of computer designs."

This discussion has tried to address the difficulty of understanding these diverse parallel architecture designs. An underlying goal was to explain how the principal types of parallel architectures work. The informal taxonomy of parallel architecture types proposed here is meant to show that the parallel architectures reviewed define a coherent spectrum of architectural alternatives. The discussion shows that parallel architectures embody fundamental organizing principles for concurrent execution, rather than disparate collections of hardware and software features.

Acknowledgments
Particular thanks are due Lawrence Snyder and the referees for constructive comments. Thanks go to the following individuals for providing research materials, descriptions of parallel architectures, and insights: Theodore Bashkow, Laxmi Bhuyan, Jack Dongarra, Paul Edmonds, Scott Fahlman, Dennis Gannon, E. Allen Garrard, H.T. Kung, G.J. Lipovski, David Lugowski, Miroslaw Malek, Robert Masson, Susan Miller, James Peitrocini, Malcolm Rimmer, Howard J. Siegel, Charles Seitz, Vason Srini, Salvatore Stolfo, David Waltz, and Jon Webb.

This research was supported by the Rome Air Development Center under contract F30602-87-D-0092.

References

1. M.J. Flynn, "Very High Speed Computing Systems," Proc. IEEE, Vol. 54, 1966, pp. 1901-1909.

2. W.J. Watson, "The ASC - A Highly Modular Flexible Super Computer Architecture," Proc. AFIPS FJCC, 1972, pp. 221-228.

3. N.R. Lincoln, "A Safari through the Control Data Star-100 with Gun and Camera," Proc. AFIPS NCC, June 1978.

4. K. Hwang, ed., Tutorial Supercomputers: Design and Applications, Computer Society Press, Los Alamitos, Calif., 1984. (Chapters 1 and 2 contain salient articles on vector architectures.)

5. K. Hwang and F. Briggs, Computer Architectures and Parallel Processing, McGraw-Hill, New York, 1984.

6. K.E. Batcher, "Design of a Massively Parallel Processor," IEEE Trans. Computers, Vol. C-29, Sept. 1980, pp. 836-844.

7. T. Kohonen, Content-Addressable Memories, 2nd edition, Springer-Verlag, New York, 1987.

8. H.T. Kung, "Why Systolic Architectures?," Computer, Vol. 15, No. 1, Jan. 1982, pp. 37-46.

9. H.J. Siegel, Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies, Lexington Books, Lexington, Mass., 1985.

10. S.J. Stolfo and D.P. Miranker, "The DADO Production System Machine," J. Parallel and Distributed Computing, Vol. 3, No. 2, June 1986, pp. 269-296.

11. V. Srini, "An Architectural Comparison of Dataflow Systems," Computer, Vol. 19, No. 3, Mar. 1986, pp. 68-88.

12. P.C. Treleaven, D.R. Brownbridge, and R.P. Hopkins, "Data-Driven and Demand-Driven Computer Architecture," ACM Computing Surveys, Vol. 14, No. 1, Mar. 1982, pp. 93-143.

13. S-Y. Kung et al., "Wavefront Array Processors - Concept to Implementation," Computer, Vol. 20, No. 7, July 1987, pp. 18-33.

14. G.J. Lipovski and M. Malek, Parallel Computing: Theory and Comparisons, Wiley and Sons, New York, 1987. (Includes reprints of original papers describing recent parallel architectures.)

Further reading

J.J. Dongarra, ed., Experimental Parallel Computing Architectures, North-Holland, Amsterdam, 1987.

R.W. Hockney, "Classification and Evaluation of Parallel Computer Systems," in Springer-Verlag Lecture Notes in Computer Science, No. 295, 1987, pp. 13-25.

D.J. Kuck, "High-speed Machines and Their Compilers," in Parallel Processing Systems, D. Evans, ed., Cambridge Univ. Press, 1982.

J. Schwartz, "A Taxonomic Table of Parallel Computers, Based on 55 Designs," Courant Institute, NYU, New York, Nov. 1983.

D.B. Skillicorn, "A Taxonomy for Computer Architectures," Computer, Vol. 21, No. 11, Nov. 1988, pp. 46-57.

L. Snyder, "A Taxonomy of Synchronous Parallel Machines," Proc. 17th Int'l Conf. Parallel Processing, University Park, Penn., Aug. 1988.
Ralph Duncan is a system software design consultant with Control Data's Government Systems Group. His recent technical work has involved fault-tolerant operating systems, parallel architectures, and automated code generation.

Duncan holds an MS degree in information and computer science from the Georgia Institute of Technology, an MA from the University of California at Berkeley, and a BA from the University of Michigan. He is a member of the IEEE and the Computer Society.

Readers may contact the author at Control Data Corp., Government Systems, 300 Embassy Row, Atlanta, GA 30328.

