
MODULE-1

Parallel Computer Models


Over the last two decades, computer and communication technologies have literally transformed the world we live in. Parallel processing has emerged as the key enabling technology in modern computers, driven by the ever-increasing demand for higher performance, lower costs, and sustained productivity in real-life applications.
Parallelism appears in various forms, such as lookahead, pipelining, vectorization, concurrency, simultaneity, data parallelism, partitioning, interleaving, overlapping, multiplicity, replication, time sharing, space sharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels.
In this chapter, we model physical architectures of parallel computers, vector supercomputers, multiprocessors, multicomputers, and massively parallel processors. Theoretical machine models are also presented, including the parallel random-access machine (PRAM) and the complexity model of VLSI (very large-scale integration) circuits. Architectural development tracks are identified with case studies in the book. Hardware and software subsystems are introduced to pave the way for detailed studies in subsequent chapters.

1.1 THE STATE OF COMPUTING

Modern computers are equipped with powerful hardware facilities driven by extensive software packages. To assess state-of-the-art computing, we first review historical milestones in the development of computers. Then we take a grand tour of the crucial hardware and software elements built into modern computer systems. We then examine the evolutionary relations in milestone architectural development. Basic hardware and software factors are identified in analyzing the performance of computers.

1.1.1 Computer Development Milestones


Prior to 1945, computers were made with mechanical or electromechanical parts. The earliest mechanical computer can be traced back to 500 BC in the form of the abacus used in China. The abacus is manually operated to perform decimal arithmetic with carry propagation digit by digit.
Blaise Pascal built a mechanical adder/subtractor in France in 1642. Charles Babbage designed a difference engine in England for polynomial evaluation in 1827. Konrad Zuse built the first binary mechanical computer in Germany in 1941. Howard Aiken proposed the very first electromechanical decimal computer, which was built as the Harvard Mark I by IBM in 1944. Both Zuse's and Aiken's machines were designed for general-purpose computations.

Obviously, the fact that computing and communication were carried out with moving mechanical parts greatly limited the computing speed and reliability of mechanical computers. Modern computers were marked by the introduction of electronic components. The moving parts in mechanical computers were replaced by high-mobility electrons in electronic computers. Information transmission by mechanical gears or levers was replaced by electric signals traveling almost at the speed of light.

Computer Generations Over the past several decades, electronic computers have gone through roughly five generations of development. Table 1.1 provides a summary of the five generations of electronic computer development. Each of the first three generations lasted about 10 years. The fourth generation covered a time span of 15 years. The fifth generation today has processors and memory devices with more than 1 billion transistors on a single silicon chip.
The division of generations is marked primarily by major changes in hardware and software technologies. The entries in Table 1.1 indicate the new hardware and software features introduced with each generation. Most features introduced in earlier generations have been passed to later generations.

Table 1.1 Five Generations of Electronic Computers

First generation (1945-54)
  Technology and architecture: vacuum tubes and relay memories; CPU driven by PC and accumulator; fixed-point arithmetic.
  Software and applications: machine/assembly languages; single user; no subroutine linkage; programmed I/O using CPU.
  Representative systems: ENIAC, Princeton IAS, IBM 701.

Second generation (1955-64)
  Technology and architecture: discrete transistors and core memories; floating-point arithmetic; I/O processors; multiplexed memory access.
  Software and applications: HLL used with compilers; subroutine libraries; batch processing monitor.
  Representative systems: IBM 7090, CDC 1604, Univac LARC.

Third generation (1965-74)
  Technology and architecture: integrated circuits (SSI/MSI); microprogramming; pipelining, cache, and lookahead processors.
  Software and applications: multiprogramming and time-sharing OS; multiuser applications.
  Representative systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8.

Fourth generation (1975-90)
  Technology and architecture: LSI/VLSI and semiconductor memory; multiprocessors; vector supercomputers; multicomputers.
  Software and applications: multiprocessor OS; languages, compilers, and environments for parallel processing.
  Representative systems: VAX 9000, Cray X-MP, IBM 3090, BBN TC2000.

Fifth generation (1991-present)
  Technology and architecture: advanced VLSI processors, memory, and switches; high-density packaging; scalable architectures.
  Software and applications: superscalar processors on a chip; massively parallel processing; grand challenge applications; heterogeneous processing.
  Representative systems: see Tables 1.3 and 1.4, and Chapter 13.

Progress in Hardware As far as hardware technology is concerned, the first generation (1945-1954) used vacuum tubes and relay memories interconnected by insulated wires. The second generation (1955-1964) was marked by the use of discrete transistors, diodes, and magnetic ferrite cores, interconnected by printed circuits.
The third generation (1965-1974) began to use integrated circuits (ICs) for both logic and memory in small-scale or medium-scale integration (SSI or MSI) and multilayered printed circuits. The fourth generation (1974-1991) used large-scale or very-large-scale integration (LSI or VLSI). Semiconductor memory replaced core memory as computers moved from the third to the fourth generation.
The fifth generation (1991-present) is highlighted by the use of high-density and high-speed processor and memory chips based on advanced VLSI technology. For example, 64-bit GHz-range processors are now available on a single chip with over one billion transistors.

The First Generation From the architectural and software points of view, first-generation computers were built with a single central processing unit (CPU) which performed serial fixed-point arithmetic using a program counter, branch instructions, and an accumulator. The CPU must be involved in all memory access and input/output (I/O) operations. Machine or assembly languages were used.
Representative systems include the ENIAC (Electronic Numerical Integrator and Calculator) built at the Moore School of the University of Pennsylvania in 1950; the IAS (Institute for Advanced Studies) computer based on a design proposed by John von Neumann, Arthur Burks, and Herman Goldstine at Princeton in 1946; and the IBM 701, the first electronic stored-program commercial computer built by IBM in 1953. Subroutine linkage was not implemented in early computers.

The Second Generation Index registers, floating-point arithmetic, multiplexed memory, and I/O processors were introduced with second-generation computers. High-level languages (HLLs), such as Fortran, Algol, and Cobol, were introduced along with compilers, subroutine libraries, and batch processing monitors. Register transfer language was developed by Irving Reed (1957) for the systematic design of digital computers.
Representative systems include the IBM 7030 (the Stretch computer) featuring instruction lookahead and error-correcting memories built in 1962, the Univac LARC (Livermore Atomic Research Computer) built in 1959, and the CDC 1604 built in the 1960s.

The Third Generation The third generation was represented by the IBM 360/370 Series, the CDC 6600/7600 Series, the Texas Instruments ASC (Advanced Scientific Computer), and Digital Equipment's PDP-8 Series from the mid-1960s to the mid-1970s.
Microprogrammed control became popular with this generation. Pipelining and cache memory were introduced to close up the speed gap between the CPU and main memory. The idea of multiprogramming was implemented to interleave CPU and I/O activities across multiple user programs. This led to the development of time-sharing operating systems (OS) using virtual memory with greater sharing or multiplexing of resources.

The Fourth Generation Parallel computers in various architectures appeared in the fourth generation of computers using shared or distributed memory or optional vector hardware. Multiprocessing OS, special languages, and compilers were developed for parallelism. Software tools and environments were created for parallel processing or distributed computing.
Representative systems include the VAX 9000, Cray X-MP, IBM 3090/VF, BBN TC-2000, etc. During these 15 years (1975-1990), the technology of parallel processing gradually became mature and entered the production mainstream.

The Fifth Generation These systems emphasize superscalar processors, cluster computers, and massively parallel processing (MPP). Scalable and latency-tolerant architectures are being adopted in MPP systems using advanced VLSI technologies, high-density packaging, and optical technologies.
Fifth-generation computers achieved Teraflops (10^12 floating-point operations per second) performance by the mid-1990s, and have now crossed the Petaflop (10^15 floating-point operations per second) range. Heterogeneous processing is emerging to solve large-scale problems using a network of heterogeneous computers. Early fifth-generation MPP systems were represented by several projects at Fujitsu (the VPP500), Cray Research (the MPP), Thinking Machines Corporation (the CM-5), and Intel (the Paragon). For present-day examples of advanced processors and systems, see Chapter 13.

1.1.2 Elements of Modern Computers


Hardware, software, and programming elements of a modern computer system are briefly introduced below in the context of parallel processing.

Computing Problems It has long been recognized that the concept of computer architecture is no longer restricted to the structure of the bare machine hardware. A modern computer is an integrated system consisting of machine hardware, an instruction set, system software, application programs, and user interfaces. These system elements are depicted in Fig. 1.1. The use of a computer is driven by real-life problems demanding cost-effective solutions. Depending on the nature of the problems, the solutions may require different computing resources.
[Figure 1.1 shows the elements of a modern computer system: computing problems; algorithms and data structures; mapping; programming; binding (compile, load); high-level languages; applications software; and a hardware core of hardware resources, architecture, and operating system, with performance evaluation tying the elements together.]

Fig. 1.1 Elements of a modern computer system

For numerical problems in science and technology, the solutions demand complex mathematical formulations and intensive integer or floating-point computations. For alphanumerical problems in business and government, the solutions demand efficient transaction processing, large database management, and information retrieval operations.
For artificial intelligence (AI) problems, the solutions demand logic inferences and symbolic manipulations. These computing problems have been labeled numerical computing, transaction processing, and logical reasoning. Some complex problems may demand a combination of these processing modes.

Algorithms and Data Structures Special algorithms and data structures are needed to specify the computations and communications involved in computing problems. Most numerical algorithms are deterministic, using regularly structured data. Symbolic processing may use heuristics or nondeterministic searches over large knowledge bases.
Problem formulation and the development of parallel algorithms often require interdisciplinary interactions among theoreticians, experimentalists, and computer programmers. There are many books dealing with the design and mapping of algorithms or heuristics onto parallel computers. In this book, we are more concerned with the resource mapping problem than with the design and analysis of parallel algorithms.

Hardware Resources The system architecture of a computer is represented by three nested circles on the right in Fig. 1.1. A modern computer system demonstrates its power through coordinated efforts by hardware resources, an operating system, and application software. Processors, memory, and peripheral devices form the hardware core of a computer system. We will study instruction-set processors, memory organization, multiprocessors, supercomputers, multicomputers, and massively parallel computers.
Special hardware interfaces are often built into I/O devices such as display terminals, workstations, optical page scanners, magnetic ink character recognizers, modems, network adaptors, voice data entry, printers, and plotters. These peripherals are connected to mainframe computers directly or through local or wide-area networks.
In addition, software interface programs are needed. These software interfaces include file transfer systems, editors, word processors, device drivers, interrupt handlers, network communication programs, etc. These programs greatly facilitate the portability of user programs on different machine architectures.

Operating System An effective operating system manages the allocation and deallocation of resources during the execution of user programs. Beyond the OS, application software must be developed to benefit the users. Standard benchmark programs are needed for performance evaluation.
Mapping is a bidirectional process matching algorithmic structure with hardware architecture, and vice versa. Efficient mapping will benefit the programmer and produce better source codes. The mapping of algorithmic and data structures onto the machine architecture includes processor scheduling, memory maps, interprocessor communications, etc. These activities are usually architecture-dependent.
Optimal mappings are sought for various computer architectures. The implementation of these mappings relies on efficient compiler and operating system support. Parallelism can be exploited at algorithm design time, at program time, at compile time, and at run time. Techniques for exploiting parallelism at these levels form the core of parallel processing technology.

System Software Support Software support is needed for the development of efficient programs in high-level languages. The source code written in an HLL must first be translated into object code by an optimizing compiler. The compiler assigns variables to registers or to memory words, and generates machine operations corresponding to HLL operators, to produce machine code which can be recognized by the machine hardware. A loader is used to initiate the program execution through the OS kernel.

Resource binding demands the use of the compiler, assembler, loader, and OS kernel to commit physical machine resources to program execution. The effectiveness of this process determines the efficiency of hardware utilization and the programmability of the computer. Today, programming parallelism is still difficult for most programmers due to the fact that existing languages were originally developed for sequential computers. Programmers are sometimes forced to program hardware-dependent features instead of programming parallelism in a generic and portable way. Ideally, we need to develop a parallel programming environment with architecture-independent languages, compilers, and software tools.
To develop a parallel language, we aim for efficiency in its implementation, portability across different machines, compatibility with existing sequential languages, expressiveness of parallelism, and ease of programming. One can attempt a new language approach or try to extend existing sequential languages gradually. A new language approach has the advantage of using explicit high-level constructs for specifying parallelism. However, new languages are often incompatible with existing languages and require new compilers or new passes to existing compilers. Most systems choose the language extension approach; one way to achieve this is by providing appropriate function libraries.

Compiler Support There are three compiler upgrade approaches: preprocessor, precompiler, and parallelizing compiler. A preprocessor uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs. The precompiler approach requires some program flow analysis, dependence checking, and limited optimizations toward parallelism detection. The third approach demands a fully developed parallelizing or vectorizing compiler which can automatically detect parallelism in source code and transform sequential codes into parallel constructs. These approaches will be studied in Chapter 10.
The efficiency of the binding process depends on the effectiveness of the preprocessor, the precompiler, the parallelizing compiler, the loader, and the OS support. Due to unpredictable program behavior, none of the existing compilers can be considered fully automatic or fully intelligent in detecting all types of parallelism. Very often compiler directives are inserted into the source code to help the compiler do a better job, as illustrated below. Users may interact with the compiler to restructure the programs. This has proven useful in enhancing the performance of parallel computers.
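As a hedged illustration of compiler directives (not taken from the original text), the C fragment below uses an OpenMP pragma, one widely used directive style, to tell the compiler that the loop iterations are independent and may be parallelized. The array names and sizes are hypothetical, chosen only for the sketch.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    /* Initialize the operand arrays (sequential). */
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* The directive below asserts that the iterations are independent,
       so the compiler/runtime may distribute them across processors. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

Compiled without OpenMP support the pragma is simply ignored and the loop runs sequentially, which is exactly the portability property a directive-based extension aims for.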

1.1.3 Evolution of Computer Architecture


The study of computer architecture involves both hardware organization and programming/software requirements. As seen by an assembly language programmer, computer architecture is abstracted by its instruction set, which includes opcodes (operation codes), addressing modes, registers, virtual memory, etc.
From the hardware implementation point of view, the abstract machine is organized with CPUs, caches, buses, microcode, pipelines, physical memory, etc. Therefore, the study of architecture covers both instruction-set architectures and machine implementation organizations.
Over the past decades, computer architecture has gone through evolutionary rather than revolutionary changes. Sustaining features are those that were proven performance deliverers. As depicted in Fig. 1.2, we started with the von Neumann architecture built as a sequential machine executing scalar data. The sequential computer was improved from bit-serial to word-parallel operations, and from fixed-point to floating-point operations. The von Neumann architecture is slow due to sequential execution of instructions in programs.

[Figure 1.2 is a tree diagram tracing the architectural evolution from the sequential scalar von Neumann machine through lookahead, functional parallelism, multiple functional units, and pipelining, branching into memory-to-memory and register-to-register vector architectures on one side and SIMD machines (associative processors, processor arrays) and MIMD machines (multiprocessors, multicomputers), including massively parallel processors (MPP), on the other. Legends: I/E: instruction fetch and execute; SIMD: single instruction stream and multiple data streams; MIMD: multiple instruction streams and multiple data streams.]

Fig. 1.2 Tree showing architectural evolution from sequential scalar computers to vector processors and parallel computers

Lookahead, Parallelism, and Pipelining Lookahead techniques were introduced to prefetch instructions in order to overlap I/E (instruction fetch/decode and execution) operations and to enable functional parallelism. Functional parallelism was supported by two approaches: one is to use multiple functional units simultaneously, and the other is to practice pipelining at various processing levels.
The latter includes pipelined instruction execution, pipelined arithmetic computations, and memory-access operations. Pipelining has proven especially attractive in performing identical operations repeatedly over vector data strings. Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors; a sketch of such a loop is given below.
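As a minimal sketch (not from the original text), the C loop below expresses a vector operation implicitly: every iteration applies the same operation to successive elements, which is the pattern a scalar pipeline executes by software-controlled looping and a vector or SIMD machine can execute with explicit vector instructions. The function and array names are hypothetical.

#include <stddef.h>

/* Element-wise vector addition expressed as an ordinary scalar loop.
   On a scalar pipeline the iterations stream one after another through
   the arithmetic pipeline; a vectorizing compiler could instead map the
   whole loop onto a single vector (or SIMD) operation. */
void vector_add(double *a, const double *b, const double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}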

Flynn's Classification Michael Flynn (1972) introduced a classification of computer architectures based on notions of instruction and data streams. As illustrated in Fig. 1.3a, conventional sequential machines are called SISD (single instruction stream over a single data stream) computers. Vector computers are equipped with scalar and vector hardware or appear as SIMD (single instruction stream over multiple data streams) machines (Fig. 1.3b). Parallel computers are reserved for MIMD (multiple instruction streams over multiple data streams) machines.
An MISD (multiple instruction streams and a single data stream) machine is modeled in Fig. 1.3d. The same data stream flows through a linear array of processors executing different instruction streams. This architecture is also known as systolic arrays (Kung and Leiserson, 1978) for pipelined execution of specific algorithms.

[Figure 1.3 sketches the four machine organizations: (a) SISD uniprocessor architecture; (b) SIMD architecture (with distributed memory); (c) MIMD architecture (with shared memory); (d) MISD architecture (the systolic array). Legends: CU = control unit; PU = processing unit; MU = memory unit; IS = instruction stream; DS = data stream; PE = processing element; LM = local memory.]

Fig. 1.3 Flynn's classification of computer architectures (Derived from Michael Flynn, 1972)

Of the four machine models, most parallel computers built in the past assumed the MIMD model for general-purpose computations. The SIMD and MISD models are more suitable for special-purpose computations. For this reason, MIMD is the most popular model, SIMD next, and MISD the least popular model being applied in commercial machines.

Parallel/Vector Computers Intrinsic parallel computers are those that execute programs in MIMD mode. There are two major classes of parallel computers, namely, shared-memory multiprocessors and message-passing multicomputers. The major distinction between multiprocessors and multicomputers lies in memory sharing and the mechanisms used for interprocessor communication.

The processors in a multiprocessor system communicate with each other through shared variables in a common memory. Each computer node in a multicomputer system has a local memory, unshared with other nodes. Interprocessor communication is done through message passing among the nodes.
Explicit vector instructions were introduced with the appearance of vector processors. A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control. There are two families of pipelined vector processors:
Memory-to-memory architecture supports the pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory. Register-to-register architecture uses vector registers to interface between the memory and the functional pipelines. Vector processor architectures will be studied in Chapter 8.
Another important branch of the architecture tree consists of the SIMD computers for synchronized vector processing. An SIMD computer exploits spatial parallelism rather than temporal parallelism as in a pipelined computer. SIMD computing is achieved through the use of an array of processing elements (PEs) synchronized by the same controller. Associative memory can be used to build SIMD associative processors. SIMD machines will be treated in Chapter 8 along with pipelined vector computers.

Development Layers A layered development of parallel computers is illustrated in Fig. 1.4, based on a classification by Lionel Ni (1990). Hardware configurations differ from machine to machine, even those of the same model. The address space of a processor in a computer system varies among different architectures. It depends on the memory organization, which is machine-dependent. These features are up to the designer and should match the target application domains.

[Figure 1.4 shows six layers for computer system development, from bottom to top: hardware architecture, addressing space, communication model, languages supported, programming environment, and applications. The lower layers are machine-dependent; the upper layers are machine-independent.]

Fig. 1.4 Six layers for computer system development (Courtesy of Lionel Ni, 1990)

On the other hand, we want to develop application programs and programming environments which are machine-independent. Independent of machine architecture, the user programs can be ported to many computers with minimum conversion costs. High-level languages and communication models depend on the architectural choices made in a computer system. From a programmer's viewpoint, these two layers should be architecture-transparent.
Programming languages such as Fortran, C, C++, Pascal, Ada, Lisp and others can be supported by most computers. However, the communication models, shared variables versus message passing, are mostly machine-dependent. The Linda approach using tuple spaces offers an architecture-transparent communication model for parallel computers. These language features will be studied in Chapter 10.
Application programmers prefer more architectural transparency. However, kernel programmers have to explore the opportunities supported by hardware. A good computer architect has to approach the problem from both ends. The compilers and OS support should be designed to remove as many architectural constraints as possible from the programmer.

New Challenges The technology of parallel processing is the outgrowth of several decades of research and industrial advances in microelectronics, printed circuits, high-density packaging, advanced processors, memory systems, peripheral devices, communication channels, language evolution, compiler sophistication, operating systems, programming environments, and application challenges.
The rapid progress made in hardware technology has significantly increased the economic feasibility of building a new generation of computers adopting parallel processing. However, the major barrier preventing parallel processing from entering the production mainstream is on the software and application side.
To date, it is still fairly difficult to program parallel and vector computers. We need to strive for major progress in the software area in order to create a user-friendly environment for high-power computers. A whole new generation of programmers needs to be trained to program parallelism effectively. High-performance computers provide fast and accurate solutions to scientific, engineering, business, social, and defense problems.
Representative real-life problems include weather forecast modeling, modeling of physical, chemical and biological processes, computer-aided design, large-scale database management, artificial intelligence, crime control, and strategic defense initiatives, just to name a few. The application domains of parallel processing computers are expanding steadily. With a good understanding of scalable computer architectures and mastery of parallel programming techniques, the reader will be better prepared to face future computing challenges.

1.1.4 System Attributes to Performance


The ideal performance of a computer system demands a perfect match between machine capability and program behavior. Machine capability can be enhanced with better hardware technology, innovative architectural features, and efficient resource management. However, program behavior is difficult to predict due to its heavy dependence on application and run-time conditions.
There are also many other factors affecting program behavior, including algorithm design, data structures, language efficiency, programmer skill, and compiler technology. It is impossible to achieve a perfect match between hardware and software by merely improving only a few factors without touching other factors. Besides, machine performance may vary from program to program. This makes peak performance an impossible target to achieve in real-life applications. On the other hand, a machine cannot be said to have an average performance either. All performance indices or benchmarking results must be tied to a program mix. For this reason, performance should be described as a range or a distribution.
We introduce below fundamental factors for projecting the performance of a computer. These performance indicators are by no means conclusive in all applications. However, they can be used to guide system architects in designing better machines or to educate programmers or compiler writers in optimizing the codes for more efficient execution by the hardware.
Consider the execution of a given program on a given computer. The simplest measure of program performance is the turnaround time, which includes disk and memory accesses, input and output activities, compilation time, OS overhead, and CPU time. In order to shorten the turnaround time, one must reduce all these time factors.
In a multiprogrammed computer, the I/O and system overheads of a given program may overlap with the CPU times required in other programs. Therefore, it is fair to compare just the total CPU time needed for program execution. The CPU is used to execute both system programs and user programs, although often it is the user CPU time that concerns the user most.

Clock Rate and CPI The CPU (or simply the processor) of today's digital computer is driven by a clock with a constant cycle time τ. The inverse of the cycle time is the clock rate (f = 1/τ). The size of a program is determined by its instruction count (Ic), in terms of the number of machine instructions to be executed in the program. Different machine instructions may require different numbers of clock cycles to execute. Therefore, the cycles per instruction (CPI) becomes an important parameter for measuring the time needed to execute each instruction.
For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program. An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time. Unless specifically focusing on a single instruction type, we simply use the term CPI to mean the average value with respect to a given instruction set and a given program mix. A sketch of such a frequency-weighted average appears below.
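As a minimal illustration (not part of the original text), the C fragment below computes a frequency-weighted average CPI from a hypothetical instruction mix; the instruction classes, frequencies, and cycle counts are assumptions chosen only to show the calculation.

#include <stdio.h>

/* Hypothetical instruction mix: class frequency (fraction of Ic)
   and cycles needed per instruction of that class. */
struct mix { const char *class_name; double freq; double cycles; };

int main(void) {
    struct mix m[] = {
        { "ALU",    0.50, 1.0 },
        { "load",   0.20, 2.0 },
        { "store",  0.10, 2.0 },
        { "branch", 0.20, 3.0 },
    };
    double cpi = 0.0;
    /* Average CPI = sum over classes of (frequency x cycles). */
    for (int i = 0; i < 4; i++)
        cpi += m[i].freq * m[i].cycles;
    printf("average CPI = %.2f\n", cpi);   /* 0.5 + 0.4 + 0.2 + 0.6 = 1.70 */
    return 0;
}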

Performance Factors Let Ic be the number of instructions in a given program, or the instruction count. The CPU time (T, in seconds/program) needed to execute the program is estimated by finding the product of three contributing factors:

    T = Ic × CPI × τ    (1.1)

The execution of an instruction requires going through a cycle of events involving the instruction fetch, decode, operand(s) fetch, execution, and the storing of results. In this cycle, only the instruction decode and execution phases are carried out in the CPU. The remaining three operations may require access to the memory. We define a memory cycle as the time needed to complete one memory reference. Usually, a memory cycle is k times the processor cycle τ. The value of k depends on the speed of the cache and memory technology and the processor-memory interconnection scheme used.
The CPI of an instruction type can be divided into two component terms corresponding to the total processor cycles and memory cycles needed to complete the execution of the instruction. Depending on the instruction type, the complete instruction cycle may involve one to as many as four memory references (one for instruction fetch, two for operand fetch, and one for storing results). Therefore we can rewrite Eq. 1.1 as follows:

    T = Ic × (p + m × k) × τ    (1.2)

where p is the number of processor cycles needed for the instruction decode and execution, m is the number of memory references needed, k is the ratio between memory cycle and processor cycle, Ic is the instruction count, and τ is the processor cycle time. Equation 1.2 can be further refined once the CPI components (p, m, k) are weighted over the entire instruction set.
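The following short C sketch (not from the original text) evaluates Eq. 1.2 for a hypothetical program; the values of Ic, p, m, k, and the clock rate are assumptions used only to show how the factors combine.

#include <stdio.h>

int main(void) {
    double Ic  = 200e6;    /* instruction count (assumed)            */
    double p   = 4.0;      /* processor cycles per instruction       */
    double m   = 1.2;      /* memory references per instruction      */
    double k   = 3.0;      /* memory cycle / processor cycle ratio   */
    double f   = 2.0e9;    /* clock rate in Hz, so tau = 1/f seconds */
    double tau = 1.0 / f;

    /* Eq. 1.2: T = Ic * (p + m*k) * tau */
    double T = Ic * (p + m * k) * tau;
    printf("CPI = %.2f, CPU time = %.4f s\n", p + m * k, T);   /* 7.60, 0.7600 s */
    return 0;
}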
System Attributes The above five performance factors (Ic, p, m, k, τ) are influenced by four system attributes: instruction-set architecture, compiler technology, CPU implementation and control, and cache and memory hierarchy, as specified in Table 1.2.
The instruction-set architecture affects the program length (Ic) and the processor cycles needed (p). The compiler technology affects the values of Ic, p, and the memory reference count (m). The CPU implementation and control determine the total processor time (p × τ) needed. Finally, the memory technology and hierarchy design affect the memory access latency (k × τ). The above CPU time can be used as a basis for estimating the execution rate of a processor.

Table 1.2 Performance Factors versus System Attributes

Performance factors: instruction count (Ic); average cycles per instruction (CPI), made up of processor cycles per instruction (p), memory references per instruction (m), and memory-access latency (k); and processor cycle time (τ).

System attributes and the performance factors they influence:
  Instruction-set architecture: Ic, p.
  Compiler technology: Ic, p, m.
  Processor implementation and control: p, τ.
  Cache and memory hierarchy: k, τ.

MIPS Rate Let C be the total number of clock cycles needed to execute a given program. Then the CPU time in Eq. 1.2 can be estimated as T = C × τ = C/f. Furthermore, CPI = C/Ic, and T = Ic × CPI × τ = Ic × CPI/f. The processor speed is often measured in terms of million instructions per second (MIPS). We simply call it the MIPS rate of a given processor. It should be emphasized that the MIPS rate varies with respect to a number of factors, including the clock rate (f), the instruction count (Ic), and the CPI of a given machine, as defined below:

    MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)    (1.3)

Based on Eq. 1.3, the CPU time in Eq. 1.2 can also be written as T = Ic × 10^-6 / MIPS. Based on the system attributes identified in Table 1.2 and the above derived expressions, we conclude by indicating the fact that the MIPS rate of a given computer is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes, instruction set, compiler, processor, and memory technologies, affect the MIPS rate, which also varies from program to program because of variations in the instruction mix.

Floating Point Operations per Second Most compute-intensive applications in science and engineering make heavy use of floating point operations. Compared to instructions per second, for such applications a more relevant measure of performance is floating point operations per second, which is abbreviated as flops. With prefix mega (10^6), giga (10^9), tera (10^12) or peta (10^15), this is written as megaflops (mflops), gigaflops (gflops), teraflops or petaflops.

Throughput Rate Another important concept is related to how many programs a system can execute per unit time, called the system throughput Ws (in programs/second). In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp, defined by:

    Wp = f / (Ic × CPI)    (1.4)

Note that Wp = (MIPS) × 10^6 / Ic from Eq. 1.3. The unit for Wp is also programs/second. The CPU throughput is a measure of how many programs can be executed per second, based only on the MIPS rate and the average program length (Ic). Usually Ws < Wp due to the additional system overheads caused by the I/O, compiler, and OS when multiple programs are interleaved for CPU execution by multiprogramming or time-sharing operations. If the CPU is kept busy in a perfect program-interleaving fashion, then Ws = Wp. This will probably never happen, since the system overhead often causes an extra delay and the CPU may be left idle for some cycles. A short numerical sketch of these rates follows.
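As a hedged numerical sketch (not from the original text), the C program below evaluates Eq. 1.3 and Eq. 1.4 for a hypothetical machine; the clock rate, CPI, and instruction count are assumptions.

#include <stdio.h>

int main(void) {
    double f   = 500e6;   /* clock rate in Hz (assumed)               */
    double cpi = 5.0;     /* average cycles per instruction (assumed) */
    double Ic  = 50e6;    /* instructions per program (assumed)       */

    /* Eq. 1.3: MIPS rate = f / (CPI x 10^6) */
    double mips = f / (cpi * 1e6);

    /* Eq. 1.4: CPU throughput Wp = f / (Ic x CPI) in programs/second */
    double wp = f / (Ic * cpi);

    printf("MIPS rate = %.1f\n", mips);          /* 100.0 MIPS */
    printf("Wp = %.2f programs/second\n", wp);   /* 2.00       */
    return 0;
}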

Example 1.1 MIPS ratings and performance measurement

Consider the use of two systems S1 and S2 to execute a hypothetical benchmark program. Machine characteristics and claimed performance are given below:

Machine    Clock      Performance    CPU Time
S1         500 MHz    100 MIPS       12x seconds
S2         2.5 GHz    1800 MIPS      x seconds

These data indicate that the measured CPU time on S1 is 12 times longer than that measured on S2. The object codes running on the two machines have different lengths due to the differences in the machines and compilers used. All other overhead times are ignored.
Based on Eq. 1.3, we can see that the instruction count of the object code running on S2 must be 1.5 times longer than that of the code running on S1. Furthermore, the average CPI on S1 is seen to be 5, while that on S2 is 1.39 executing the same benchmark program.
S1 has a typical CISC (complex instruction set computing) architecture, while S2 has a typical RISC (reduced instruction set computing) architecture, to be characterized in Chapter 4. This example offers a simple comparison between the two types of computers based on a single program run. When a different program is run, the conclusion may not be the same.
We cannot calculate the CPU throughput Wp unless we know the program length and the average CPI of each code. The system throughput Ws should be measured across a large number of programs over a long observation period. The message being conveyed is that one should not draw a sweeping conclusion about the performance of a machine based on one or a few program runs.
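A minimal C sketch (not part of the original example) that reproduces the derivation above: it recovers the CPI and instruction count of each machine from the clock rate, MIPS rating, and CPU time using Eq. 1.3; the benchmark time x is an arbitrary assumption, since only ratios matter.

#include <stdio.h>

int main(void) {
    double x = 1.0;                                      /* S2 CPU time in seconds (arbitrary) */

    double f1 = 500e6, mips1 = 100.0,  t1 = 12.0 * x;    /* machine S1 */
    double f2 = 2.5e9, mips2 = 1800.0, t2 = x;           /* machine S2 */

    /* From Eq. 1.3: CPI = f / (MIPS x 10^6) and Ic = MIPS x 10^6 x T. */
    double cpi1 = f1 / (mips1 * 1e6), Ic1 = mips1 * 1e6 * t1;
    double cpi2 = f2 / (mips2 * 1e6), Ic2 = mips2 * 1e6 * t2;

    printf("CPI on S1 = %.2f, CPI on S2 = %.2f\n", cpi1, cpi2);  /* 5.00, 1.39 */
    printf("Ic2 / Ic1 = %.2f\n", Ic2 / Ic1);                     /* 1.50       */
    return 0;
}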

Programming Environments The programmability of a computer depends on the programming environment provided to the users. In fact, the marketability of any new computer system depends on the creation of a user-friendly environment in which programming becomes a productive undertaking rather than a challenge. We briefly introduce below the environmental features desired in modern computers.
Conventional uniprocessor computers are programmed in a sequential environment in which instructions are executed one after another in a sequential manner. In fact, the original UNIX/OS kernel was designed to respond to one system call from the user process at a time. Successive system calls must be serialized through the kernel.
rm‘ MIGIELH Hill l!'m'rIq|r_.\.I|n*\ ‘I _

I6 i Advanced Celnpmerfirehlteetu-.re

When using a parallel computer, one desires a parallel environment where parallelism is automatically exploited. Language extensions or new constructs must be developed to specify parallelism or to facilitate easy detection of parallelism at various granularity levels by more intelligent compilers.
Besides parallel languages and compilers, the operating systems must also be extended to support parallel processing. The OS must be able to manage the resources behind parallelism. Important issues include parallel scheduling of concurrent processes, inter-process communication and synchronization, shared memory allocation, and shared peripheral and communication links.
Implicit Parallelism An implicit approach uses a conventional language, such as C, C++, Fortran, or Pascal, to write the source program. The sequentially coded source program is translated into parallel object code by a parallelizing compiler. As illustrated in Fig. 1.5a, this compiler must be able to detect parallelism and assign target machine resources. This compiler approach has been applied in programming shared-memory multiprocessors.
With parallelism being implicit, success relies heavily on the "intelligence" of a parallelizing compiler. This approach requires less effort on the part of the programmer.

Explicit Parallelism The second approach (Fig. 1.5b) requires more effort by the programmer to develop a source program using parallel dialects of C, C++, Fortran, or Pascal. Parallelism is explicitly specified in the user programs. This reduces the burden on the compiler to detect parallelism. Instead, the compiler needs to preserve parallelism and, where possible, assign target machine resources. The newer programming language Chapel (see Chapter 13) is in this category.

[Figure 1.5 contrasts the two approaches: (a) implicit parallelism, in which source code written in sequential languages (C, C++, Fortran, or Pascal) is translated by a parallelizing compiler into parallel object code executed by the runtime system; and (b) explicit parallelism, in which source code written in concurrent dialects of C, C++, Fortran, or Pascal is translated by a concurrency-preserving compiler into concurrent object code executed by the runtime system.]

Fig. 1.5 Two approaches to parallel programming (Courtesy of Charles Seitz; adapted with permission from "Concurrent Architectures", p. 51 and p. 53, VLSI and Parallel Computation, edited by Suaya and Birtwistle, Morgan Kaufmann Publishers, 1990)
HM‘ If J11!!!‘ I'mi!I;|r1rHt\
Rrrellel Cunputaw Models _ T | -y

Special software tools are needed to make an environment more friendly to user groups. Some of the tools are parallel extensions of conventional high-level languages. Others are integrated environments which include tools providing different levels of program abstraction, validation, testing, debugging, and tuning; performance prediction and monitoring; and visualization support to aid program development, performance measurement, and graphics display and animation of computational results.

1.2 MULTIPROCESSORS AND MULTICOMPUTERS

Two categories of parallel computers are architecturally modeled below. These physical models are distinguished by having a shared common memory or unshared distributed memories. Only architectural organization models are described in Sections 1.2 and 1.3. Theoretical and complexity models for parallel computers are presented in Section 1.4.

1.2.1 Shared-Memory Multiprocessors


We describe below three shared-memory multiprocessor models: the uniform memory-access (UMA) model, the nonuniform-memory-access (NUMA) model, and the cache-only memory architecture (COMA) model. These models differ in how the memory and peripheral resources are shared or distributed.

The UMA Model In a UMA multiprocessor model (Fig. 1.6), the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. Peripherals are also shared in some fashion.
[Figure 1.6 shows processors P1 through Pn connected through a system interconnect (bus, crossbar, or multistage network) to shared-memory modules SM1 through SMm and to shared I/O devices.]

Fig. 1.6 The UMA multiprocessor model

Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network, to be studied in Chapter 7.
Some computer manufacturers have multiprocessor (MP) extensions of their uniprocessor (UP) product line. The UMA model is suitable for general-purpose and time-sharing applications by multiple users. It can be used to speed up the execution of a single large program in time-critical applications. To coordinate parallel events, synchronization and communication among processors are done through shared variables in the common memory.
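As a hedged sketch (not from the original text), the C program below shows the shared-variable style of coordination on a shared-memory system: two threads stand in for processors, and a mutex-protected counter in the common address space is the shared variable. The thread count and variable names are assumptions chosen only for illustration.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define NITERS   1000000

static long shared_counter = 0;                   /* shared variable in common memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);                /* synchronization via shared memory */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_counter = %ld\n", shared_counter);   /* NTHREADS * NITERS */
    return 0;
}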

When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In this case, all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines.
In an asymmetric multiprocessor, only one or a subset of processors are executive-capable. An executive or a master processor can execute the operating system and handle I/O. The remaining processors have no I/O capability and thus are called attached processors (APs). Attached processors execute user codes under the supervision of the master processor. In both MP and AP configurations, memory sharing among master and attached processors is still in place.

Example 1.2 Approximated performance of a multiprocessor

This example exposes the reader to parallel program execution on a shared-memory multiprocessor system. Consider the following Fortran program written for sequential execution on a uniprocessor system. All the arrays, A(1), B(1), and C(1), are assumed to have N elements.

L1:      Do 10 I = 1, N
L2:          A(I) = B(I) + C(I)
L3:  10  Continue
L4:      SUM = 0
L5:      Do 20 J = 1, N
L6:          SUM = SUM + A(J)
L7:  20  Continue

Suppose each line of code L2, L4, and L6 takes 1 machine cycle to execute. The time required to execute the program control statements L1, L3, L5, and L7 is ignored to simplify the analysis. Assume that k cycles are needed for each interprocessor communication operation via the shared memory.
Initially, all arrays are assumed already loaded in the main memory and the short program fragment already loaded in the instruction cache. In other words, instruction fetch and data loading overhead is ignored. Also, we ignore bus contention or memory access conflict problems. In this way, we can concentrate on the analysis of CPU demand.
The above program can be executed on a sequential machine in 2N cycles under the above assumptions. N cycles are needed to execute the N independent iterations in the I loop. Similarly, N cycles are needed for the J loop, which contains N recursive iterations.
To execute the program on an M-processor system, we partition the looping operations into M sections with L = N/M elements per section. In the following parallel code, Doall declares that all M sections be executed by M processors in parallel.
For M-way parallel execution, the sectioned I loop can be done in L cycles.
The sectioned J loop produces M partial sums in L cycles. Thus 2L cycles are consumed to produce all M partial sums. Still, we need to merge these M partial sums to produce the final sum of N elements.
rm‘ MIGIELH H“ l'm'rIq|r_.r.I|n*\ _
Rrrullel Ccrnpuur Models 1 |9

        Doall K = 1, M
            Do 10 I = (K-1)*L + 1, K*L
                A(I) = B(I) + C(I)
    10      Continue
            SUM(K) = 0
            Do 20 J = 1, L
                SUM(K) = SUM(K) + A((K-1)*L + J)
    20      Continue
        Endall
The addition of each pair of partial sums requires k cycles through the shared memory. An l-level binary adder tree can be constructed to merge all the partial sums, where l = log2 M. The adder tree takes l(k + 1) cycles to merge the M partial sums sequentially from the leaves to the root of the tree. Therefore, the multiprocessor requires 2L + l(k + 1) = 2N/M + (k + 1)log2 M cycles to produce the final sum.
Suppose there are N = 2^20 elements in the array. Sequential execution of the original program takes 2N = 2^21 machine cycles. Assume that each IPC synchronization overhead has an average value of k = 200 cycles. Parallel execution on M = 256 processors requires 2^13 + 1608 = 9800 machine cycles.
Comparing the above timing results, the multiprocessor shows a speedup factor of 214 out of the maximum value of 256. Therefore, an efficiency of 214/256 = 83.6% has been achieved. We will study the speedup and efficiency issues in Chapter 3.
The above result was obtained under favorable assumptions about overhead. In reality, the resulting speedup might be lower after considering all software overhead and potential resource conflicts. Nevertheless, the example shows the promising side of parallel processing if the interprocessor communication overhead can be maintained at a sufficiently low level, represented here in the value of k.
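A minimal C check (not part of the original text) of the timing model used in this example: it evaluates the sequential cycle count 2N, the parallel cycle count 2N/M + (k + 1)·log2(M), and the resulting speedup and efficiency, using the example's values N = 2^20, M = 256, and k = 200.

#include <math.h>
#include <stdio.h>

int main(void) {
    double N = 1048576.0;   /* 2^20 array elements          */
    double M = 256.0;       /* number of processors         */
    double k = 200.0;       /* IPC synchronization overhead */

    double t_seq = 2.0 * N;                              /* 2^21 = 2097152 cycles */
    double t_par = 2.0 * N / M + (k + 1.0) * log2(M);    /* 8192 + 1608 = 9800    */

    printf("sequential cycles = %.0f\n", t_seq);
    printf("parallel cycles   = %.0f\n", t_par);
    printf("speedup = %.1f, efficiency = %.1f%%\n",
           t_seq / t_par, 100.0 * (t_seq / t_par) / M);  /* ~214.0, ~83.6%        */
    return 0;
}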

The NUMA Model A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word. Two NUMA machine models are depicted in Fig. 1.7. The shared memory is physically distributed to all processors, called local memories. The collection of all local memories forms a global address space accessible by all processors.
It is faster to access a local memory with a local processor. The access of remote memory attached to other processors takes longer due to the added delay through the interconnection network. The BBN TC-2000 Butterfly multiprocessor had the configuration shown in Fig. 1.7a.
Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case, there are three memory-access patterns: the fastest is local memory access; the next is global memory access; the slowest is access of remote memory, as illustrated in Fig. 1.7b. As a matter of fact, the models shown in Figs. 1.6 and 1.7 can be easily modified to allow a mixture of shared memory and private memory with prespecified access rights.
A hierarchically structured multiprocessor is modeled in Fig. 1.7b. The processors are divided into several clusters*. Each cluster is itself a UMA or a NUMA multiprocessor. The clusters are connected to global shared-memory modules. The entire system is considered a NUMA multiprocessor. All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules.

* The word 'cluster' is used in a different sense in cluster computing, as we shall see later.

All clusters have equal access to the global memory. However, the access time to the cluster memory is shorter than that to the global memory. One can specify the access rights among intercluster memories in various ways. The Cedar multiprocessor, built at the University of Illinois, had such a structure in which each cluster was an Alliant FX/80 multiprocessor.

[Figure 1.7 depicts (a) shared local memories (e.g. the BBN Butterfly), in which processors reach distributed local memories through an interconnection network, and (b) a hierarchical cluster model (e.g. the Cedar system at the University of Illinois), in which clusters of processors and cluster shared memories (CSM) are tied by cluster interconnection networks (CIN) to a global interconnection network and global shared memories (GSM). Legends: P: processor; CSM: cluster shared memory; GSM: global shared memory; CIN: cluster interconnection network.]

Fig. 1.7 Two NUMA models for multiprocessor systems

The COMA Model A multiprocessor using cache-only memory assumes the COMA model. Early examples of COMA machines include the Swedish Institute of Computer Science's Data Diffusion Machine (DDM, Hagersten et al., 1990) and Kendall Square Research's KSR-1 machine (Burkhardt et al., 1992). The COMA model is depicted in Fig. 1.8. Details of the KSR-1 are given in Chapter 9.

[Figure 1.8 shows processor nodes, each consisting of a processor (P), a cache (C), and a cache directory (D), connected by an interconnection network.]

Fig. 1.8 The COMA model of a multiprocessor (P: Processor, C: Cache, D: Directory; e.g. the KSR-1)

The COMA model is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node. All the caches form a global address space. Remote cache access is assisted by the distributed cache directories (D in Fig. 1.8). Depending on the interconnection network used, sometimes hierarchical directories may be used to help locate copies of cache blocks. Initial data placement is not critical because data will eventually migrate to where they will be used.
Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent non-uniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories. Early examples of the CC-NUMA model include the Stanford Dash (Lenoski et al., 1990) and the MIT Alewife (Agarwal et al., 1990), to be studied in Chapter 9. A cache-coherent COMA machine is one in which all cache copies must be kept consistent.

Representative Multiprocessors Several early commercially available multiprocessors are summarized in Table 1.3. They represent four classes of multiprocessors. The Sequent Symmetry S-81 belonged to a class called minisupercomputers. The IBM System/390 models were high-end mainframes, sometimes called near-supercomputers. The BBN TC-2000 represented the MPP class.

Table 1.3 Some Early Commercial Multiprocessor Systems

Sequent Symmetry S-81
  Hardware and architecture: bus-connected with 30 i386 processors, IPC via SLIC bus; Weitek floating-point accelerator.
  Software and applications: DYNIX/OS, KAP/Sequent preprocessor, transaction multiprocessing.
  Remarks: later models designed with faster processors of the family.

IBM ES/9000 Model 900/VF
  Hardware and architecture: 6 ES/9000 processors with vector facilities, crossbar connected to I/O channels and shared memory.
  Software and applications: OS support: MVS, VM KMS, AIX/370; parallel Fortran; VSF V2.5 compiler.
  Remarks: fiber optic channels, integrated cryptographic architecture.

BBN TC-2000
  Hardware and architecture: 512 M88100 processors with local memory connected by a Butterfly switch; a NUMA machine.
  Software and applications: ported Mach/OS with multiclustering, parallel Fortran, time-critical applications.
  Remarks: later models designed with faster processors of the family.

The S-81 was a transaction processing multiprocessor consisting of 30 386/i486 microprocessors tied to a common backplane bus. The IBM ES/9000 models were the latest IBM mainframes having up to 6 processors with attached vector facilities. The TC-2000 could be configured to have 512 M88100 processors interconnected by a multistage Butterfly network. This was designed as a NUMA machine for real-time or time-critical applications.

Multiprocessor systems are suitable for general-purpose multiuser applications where programmability is the major concern. A major shortcoming of multiprocessors is the lack of scalability. It is rather difficult to build MPP machines using a centralized shared-memory model. Latency tolerance for remote memory access is also a major limitation.
Packaging and cooling impose additional constraints on scalability. We will study scalability and programmability in subsequent chapters.

1.2.2 Distributed-Memory Multicomputers


A distributed-memory multicomputer system is modeled in Fig. 1.9. The system consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.

[Figure 1.9 shows nodes, each with a processor (P) and local memory (M), connected by a message-passing interconnection network (mesh, ring, torus, hypercube, cube-connected cycles, etc.).]

Fig. 1.9 Generic model of a message-passing multicomputer

The message-passing network provides point-to-point static connections among the nodes. All local memories are private and are accessible only by local processors. For this reason, traditional multicomputers have also been called no-remote-memory-access (NORMA) machines. Internode communication is carried out by passing messages through the static connection network. With advances in interconnection and network technologies, this model of computing has gained importance, because of its suitability for certain applications, its scalability, and its fault tolerance.

Multicomputer Generations Modern multicomputers use hardware routers to pass messages. A computer node is attached to each router. The boundary router may be connected to I/O and peripheral devices. Message passing between any two nodes involves a sequence of routers and channels. Mixed types of nodes are allowed in a heterogeneous multicomputer. The internode communication in a heterogeneous multicomputer is achieved through compatible data representations and message-passing protocols.
Early message-passing multicomputers were based on processor board technology using hypercube architecture and software-controlled message switching. The Caltech Cosmic and Intel iPSC/1 represented this early development.
The second generation was implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grain distributed computing, as represented by the Intel Paragon and the Parsys SuperNode 1000.

Subsequent systems of this type are fine-grain multicomputers, early examples being the MIT J-Machine and Caltech Mosaic, implemented with both processor and communication gears on the same VLSI chip. For further discussion, see Chapter 13.
In Section 2.4, we will study various static network topologies used to construct multicomputers. Commonly used topologies include the ring, tree, mesh, torus, hypercube, cube-connected cycle, etc. Various communication patterns are demanded among the nodes, such as one-to-one, broadcasting, permutations, and multicast patterns.
Important issues for multicomputers include message-routing schemes, network flow control strategies, deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques.
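To make the message-passing programming model concrete, here is a minimal sketch in Python, assuming nothing beyond the standard library: two autonomous "nodes" hold only local state and communicate exclusively through explicit point-to-point channels, loosely mimicking NORMA-style internode communication. The node function and channel names are illustrative conventions, not part of any real multicomputer API.

# Hypothetical sketch: NORMA-style message passing between two autonomous nodes.
# Each node has only local state; all sharing happens through explicit messages.
from multiprocessing import Process, Queue

def node(rank, in_ch, out_ch):
    local_memory = {"rank": rank}          # private to this node
    if rank == 0:
        out_ch.put(("token", 42))          # send a message to the neighbor
        tag, value = in_ch.get()           # wait for the reply
        local_memory["reply"] = value
        print(f"node 0 received {tag}={value}")
    else:
        tag, value = in_ch.get()           # receive, compute locally, reply
        out_ch.put(("ack", value + 1))

if __name__ == "__main__":
    ch01, ch10 = Queue(), Queue()          # one channel per direction
    p0 = Process(target=node, args=(0, ch10, ch01))
    p1 = Process(target=node, args=(1, ch01, ch10))
    p0.start(); p1.start(); p0.join(); p1.join()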

Representative Multicomputers Three early message-passing multicomputers are summarized in Table 1.4. With distributed processor/memory nodes, such machines are better in achieving a scalable performance. However, message passing imposes a requirement on programmers to distribute the computations and data sets over the nodes or to establish efficient communication among nodes.

Table 1.4 Some Early Commercial Multicomputer Systems

Intel Paragon XP/S
  Node Types and Memory: 50 MHz i860 XP computing nodes with 16-128 Mbytes per node, special I/O service nodes.
  Network and I/O: 2-D mesh with SCSI, HIPPI, VME, Ethernet, and custom I/O.
  OS and Software Task Parallelism Support: OSF conformance with 4.3 BSD, visualization and programming support.
  Application Drivers: General sparse matrix methods, parallel data manipulation, strategic computing.
  Performance Remarks: 5-300 Gflops peak, 64-bit results, 2.8-160 GIPS peak integer performance.

nCUBE/2 6480
  Node Types and Memory: Each node contains a CISC 64-bit CPU, with FPU, 14 DMA ports, with 1-64 Mbytes/node.
  Network and I/O: 13-dimensional hypercube of 8192 nodes, 512-Gbyte memory, 64 I/O boards.
  OS and Software Task Parallelism Support: Vertex/OS or UNIX supporting message passing using wormhole routing.
  Application Drivers: Scientific number crunching with scalar nodes, database processing.
  Performance Remarks: 27 Gflops peak, 36 Gbytes/s I/O.

Parsys Ltd. SuperNode1000
  Node Types and Memory: EC-funded Esprit supernode built with multiple T-800 Transputers per node.
  Network and I/O: Reconfigurable interconnect, expandable to have 1024 processors.
  OS and Software Task Parallelism Support: IDRIS/OS, UNIX-compatible.
  Application Drivers: Scientific and academic applications.
  Performance Remarks: 200 MIPS to 13 GIPS peak.

The Paragon system had a mesh architecture, and the nCUBE/2 had a hypercube architecture. The Intel i860s and some custom-designed VLSI processors were used as building blocks in these machines. All three OSs were UNIX-compatible with extended functions to support message passing.
Most multicomputers can be upgraded to yield a higher degree of parallelism with enhanced processors. We will study various massively parallel systems in Part III where the tradeoffs between scalability and programmability are analyzed.

1.2.3 A Taxonomy of MIMD Computers


Parallel computers appear as either SIMD or MIMD configurations. The SIMDs appeal more to special-purpose applications. It is clear that SIMDs are not size-scalable, but unclear whether large SIMDs are generation-scalable. The fact that CM-5 had an MIMD architecture, away from the SIMD architecture in CM-2, represents the architectural trend (see Chapter 8). Furthermore, the boundary between multiprocessors and multicomputers has become blurred in recent years.
The architectural trend for general-purpose parallel computers is in favor of MIMD configurations with various memory configurations (see Chapter 13). Gordon Bell (1992) has provided a taxonomy of MIMD machines, reprinted in Fig. 1.10. He considers shared-memory multiprocessors as having a single address space. Scalable multiprocessors or multicomputers must use distributed memory. Multiprocessors using centrally shared memory have limited scalability.
(Figure: Bell's taxonomy tree. MIMD machines divide into single-address-space shared-memory multiprocessors, either with distributed memory (scalable, e.g. Dash, KSR) or with central memory (not scalable, e.g. Cray, IBM, DEC, Encore, Sequent, SGI buses), and multiple-address-space message-passing multicomputers, either distributed (scalable; mesh-connected Intel machines, Butterfly/fat-tree CM5, hypercube nCUBE), LAN-based workstation clusters for high availability, or central multicomputers.)
Fig. 1.10 Bell's taxonomy of MIMD computers (Courtesy of Gordon Bell; reprinted with permission from the Communications of the ACM, August 1992)

Multicomputers use distributed memories with multiple address spaces. They are scalable with distributed memory. The evolution of fast LAN (local area network)-connected workstations has created "commodity supercomputing". Bell was the first to advocate high-speed workstation clusters interconnected by high-speed switches in lieu of special-purpose multicomputers. The CM-5 development was an early move in that direction.
The scalability of MIMD computers will be further studied in Section 3.4 and Chapter 9. In Part III, we will study distributed-memory multiprocessors (KSR-1, SCI, etc.); central-memory multiprocessors (Cray, IBM, DEC, Fujitsu, Encore, etc.); multicomputers by Intel, TMC, and nCUBE; fast LAN-based workstation clusters; and other exploratory research systems.

1.3 MULTIVECTOR AND SIMD COMPUTERS

In this section, we introduce supercomputers and parallel processors for vector processing and data parallelism. We classify supercomputers either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.

1.3.1 Vector Supercomputers


A vector computer is often built on top of a scalar processor. As shown in Fig. 1.11, the vector processor is attached to the scalar processor as an optional feature. Program and data are first loaded into the main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a program control operation, it will be directly executed by the scalar processor using the scalar functional pipelines.

(Figure: a scalar processor with scalar functional pipelines and a vector processor with a vector control unit, vector registers, and vector functional pipelines; scalar and vector instructions are fetched from the main memory, which also connects to a host computer, mass storage, and user I/O.)
Fig. 1.11 The architecture of a vector supercomputer



If the instruction is decoded as a vector operation, it will be sent to the vector control unit. This control unit will supervise the flow of vector data between the main memory and vector functional pipelines. The vector data flow is coordinated by the control unit. A number of vector functional pipelines may be built into a vector processor. Two pipelined vector supercomputer models are described below.

Vector Processor Models Figure 1.11 shows a register-to-register architecture. Vector registers are used to hold the vector operands, intermediate and final vector results. The vector functional pipelines retrieve operands from and put results into the vector registers. All vector registers are programmable in user instructions. Each vector register is equipped with a component counter which keeps track of the component registers used in successive pipeline cycles.
The length of each vector register is usually fixed, say, sixty-four 64-bit component registers in a vector register in a Cray Series supercomputer. Other machines, like the Fujitsu VP2000 Series, use reconfigurable vector registers to dynamically match the register length with that of the vector operands.
In general, there are fixed numbers of vector registers and functional pipelines in a vector processor. Therefore, both resources must be reserved in advance to avoid resource conflicts between different vector operations. Some early vector-register based supercomputers are summarized in Table 1.5.

Table 1.5 Some Early Commercial Vector Supercomputers

Convex C3800 family
  Vector Hardware Architecture and Capabilities: GaAs-based multiprocessor with 8 processors and 500-Mbyte/s access port; 4 Gbytes main memory; 2 Gflops peak performance with concurrent scalar/vector operations.
  Compiler and Software Support: Advanced C, Fortran, and Ada vectorizing and parallelizing compilers; also supported interprocedural optimization; POSIX 1003.1/OS plus I/O interfaces and visualization system.

Digital VAX 9000 System
  Vector Hardware Architecture and Capabilities: Integrated vector processing in the VAX environment; 125-500 Mflops peak performance; 63 vector instructions; 16 x 64 x 64 vector registers; pipeline chaining possible.
  Compiler and Software Support: VMS or ULTRIX/OS, VAX Fortran and VAX Vector Instruction Emulator (VVIEF) for vectorized program debugging.

Cray Research Y-MP and C-90
  Vector Hardware Architecture and Capabilities: Y-MP ran with 2, 4, or 8 processors, 2.67 Gflops peak with Y-MP8256; C-90 had 2 vector pipes/CPU, built with 10K gate ECL, with 16 Gflops peak performance.
  Compiler and Software Support: CF77 compiler for automatic vectorization, scalar optimization, and parallel processing; UNICOS improved from UNIX/V and Berkeley BSD/OS.

A memory-to-memory architecture differs from a register-to-register architecture in the use of a vector stream unit to replace the vector registers. Vector operands and results are directly retrieved from and stored into the main memory in superwords, say, 512 bits as in the Cyber 205.
Pipelined vector supercomputers started with uniprocessor models such as the Cray 1 in 1976. Subsequent supercomputer systems offered both uniprocessor and multiprocessor models such as the Cray Y-MP Series.
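As a rough software analogy of the register-to-register style just described, the sketch below strip-mines a long SAXPY-like operation into sections that fit a fixed 64-element vector register, the way a vectorizing compiler would section a loop for a Cray-class machine. The constant VLEN and the list-based "registers" are assumptions made only for illustration, not a model of any specific instruction set.

# Sketch: strip-mining Y = a*X + Y over 64-element "vector registers"
# (a software analogy of register-to-register vector processing).
VLEN = 64                                  # assumed vector register length

def vector_saxpy(a, x, y):
    n = len(x)
    for start in range(0, n, VLEN):        # one strip per pass
        end = min(start + VLEN, n)
        vx = x[start:end]                  # "load" a section into vector registers
        vy = y[start:end]
        vr = [a * xi + yi for xi, yi in zip(vx, vy)]   # pipelined multiply-add
        y[start:end] = vr                  # "store" the results back to memory
    return y

print(vector_saxpy(2.0, list(range(130)), [1.0] * 130)[:5])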

Representative Supercomputers Over a dozen pipelined vector computers have been manufactured, ranging from workstations to mini- and supercomputers. Notable early examples include the Stardent 3000 multiprocessor equipped with vector pipelines, the Convex C3 Series, the DEC VAX 9000, the IBM 390/VF, the Cray Research Y-MP family, the NEC SX Series, the Fujitsu VP2000, and the Hitachi S-810/20. For further discussion, see Chapters 8 and 13.
The Convex C1 and C2 Series were made with ECL/CMOS technologies. The latter C3 Series was based on GaAs technology.
The DEC VAX 9000 was Digital's largest mainframe system providing concurrent scalar/vector and multiprocessing capabilities. The VAX 9000 processors used a hybrid architecture. The vector unit was an optional feature attached to the VAX 9000 CPU. The Cray Y-MP family offered both vector and multiprocessing capabilities.

1.3.2 SIMD Supercomputers


In Fig. 1.3b, we have shown an abstract model of SIMD computers having a single instruction stream over multiple data streams. An operational model of SIMD computers is presented below (Fig. 1.12), based on the work of H. J. Siegel (1979). Implementation models and case studies of SIMD machines are given in Chapter 8.

(Figure: a control unit driving an array of processing elements PE0 through PE N-1, each with its own processor and memory, linked by an interconnection network.)
Fig. 1.12 Operational model of SIMD computers

SIMD Machine Model An operational model of an SIMD computer is specified by a 5-tuple:

M = (N, C, I, M, R)    (1.5)

where

(1) N is the number of processing elements (PEs) in the machine. For example, the Illiac IV had 64 PEs and the Connection Machine CM-2 had 65,536 PEs.
(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.
(4) M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
(5) R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.

One can describe a particular SIMD machine architecture by specifying the 5-tuple. An example SIMD machine is partially specified below.

Example 1.3 Operational specification of the MasPar MP-1 computer
We will study the detailed architecture of the MasPar MP-1 in Chapter 7. Listed below is a partial specification of the 5-tuple for this machine:

(1) The MP-1 was an SIMD machine with N = 1024 to 16,384 PEs, depending on which configuration is considered.
(2) The CU executed scalar instructions, broadcast decoded vector instructions to the PE array, and controlled inter-PE communications.
(3) Each PE was a register-based load/store RISC processor capable of executing integer operations over various data sizes and standard floating-point operations. The PEs received instructions from the CU.
(4) The masking scheme was built within each PE and continuously monitored by the CU which could set and reset the status of each PE dynamically at run time.
(5) The MP-1 had an X-Net mesh network plus a global multistage crossbar router for inter-CU-PE, X-Net nearest 8-neighbor, and global router communications.
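For readers who find it helpful to see the notation in executable form, the following sketch records the 5-tuple M = (N, C, I, M, R) and the partial MP-1 specification above as a plain Python data structure. The class and field names are illustrative conventions; only the field contents come from the text.

# Sketch: the SIMD machine 5-tuple M = (N, C, I, M, R) as a record.
from dataclasses import dataclass

@dataclass
class SIMDMachine:
    N: int          # number of processing elements
    C: list         # instructions executed by the control unit
    I: list         # instructions broadcast to all PEs
    M: list         # masking schemes
    R: list         # data-routing functions

maspar_mp1 = SIMDMachine(
    N=16384,                                   # up to 16,384 PEs
    C=["scalar ops", "program flow control", "broadcast decoded vector ops"],
    I=["integer ops", "floating-point ops", "load/store", "masking", "routing"],
    M=["per-PE mask bits set/reset by the CU at run time"],
    R=["X-Net nearest 8-neighbor", "global multistage crossbar router"],
)
print(maspar_mp1.N)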

Representative SIMD Computers Three early commercial SIMD supercomputers are summarized in Table 1.6. The number of PEs in these systems ranges from 4096 in the DAP610 to 16,384 in the MasPar MP-1 and 65,536 in the CM-2. Both the CM-2 and DAP610 were fine-grain, bit-slice SIMD computers with attached floating-point accelerators for blocks of PEs*.
Each PE of the MP-1 was equipped with a 1-bit logic unit, 4-bit integer ALU, 64-bit mantissa unit, and 16-bit exponent unit. Multiple PEs could be built on a single chip due to the simplicity of each PE.
" Witlt rapid advarroes in ‘t-“L5! teelutology, use ofbit-slice processors in systems has dtsapp-eured.

The MP-1 implemented 32 PEs per chip with forty 32-bit registers per PE. The 32 PEs were interconnected by an X-Net mesh, which was a 4-neighbor mesh augmented with diagonal dual-stage links.
The CM-2 implemented 16 PEs as a mesh on a single chip. Each 16-PE mesh chip was placed at one vertex of a 12-dimensional hypercube. Thus 16 × 2¹² = 2¹⁶ = 65,536 PEs formed the entire SIMD array.
The DAP610 implemented 64 PEs as a mesh on a chip. Globally, a large mesh (64 × 64) was formed by interconnecting these small meshes on chips. Fortran 90 and modified versions of C, Lisp, and other sequential programming languages have been developed to program SIMD machines.

Table 1.6 Some Early Commercial SIMD Supercomputers

MasPar Computer Corporation MP-1 Family
  SIMD Machine Architecture and Capabilities: Designed for configurations from 1024 to 16,384 processors with 26,000 MIPS or 1.3 Gflops. Each PE was a RISC processor, with 16 Kbytes local memory. An X-Net mesh plus a multistage crossbar interconnect.
  Languages, Compilers and Software Support: Fortran 77, MasPar Fortran (MPF), and MasPar Parallel Application Language; UNIX/OS with X-window, symbolic debugger, visualizers and animators.

Thinking Machines Corporation CM-2
  SIMD Machine Architecture and Capabilities: A bit-slice array of up to 65,536 PEs arranged as a 12-dimensional hypercube with a 4 × 4 mesh on each vertex, up to 1M bits of memory per PE, with optional FPU shared between blocks of 32 PEs. 28 Gflops peak and 5.6 Gflops sustained.
  Languages, Compilers and Software Support: Driven by a host VAX, Sun, or Symbolics 3600; Lisp compiler, Fortran 90, C*, and *Lisp supported by PARIS.

Active Memory Technology DAP600 Family
  SIMD Machine Architecture and Capabilities: A fine-grain, bit-slice SIMD array of up to 4096 PEs interconnected by a square mesh with 1K bits per PE, orthogonal and 4-neighbor links, 20 GIPS and 560 Mflops peak performance.
  Languages, Compilers and Software Support: Provided by host VAX/VMS or UNIX; Fortran-plus or APAL on DAP, Fortran 77 or C on host.

1.4 PRAM AND VLSI MODELS

Theoretical models of parallel computers are abstracted from the physical models studied in previous sections. These models are often used by algorithm designers and VLSI device/chip developers. The ideal models provide a convenient framework for developing parallel algorithms without worrying about the implementation details or physical constraints.
The models can be applied to obtain theoretical performance bounds on parallel computers or to estimate VLSI complexity on chip area and execution time before the chip is fabricated. The abstract models are

also useful in scalability and programmability analysis, when real machines are compared with an idealized parallel machine without worrying about communication overhead among processing nodes.

1.4.1 Parallel Random-Access Machines


Theoretical models of parallel computers are presented below. We define first the time and space complexities. Computational tractability is reviewed for solving difficult problems on computers. Then we introduce the random-access machine (RAM), parallel random-access machine (PRAM), and variants of PRAMs. These complexity models facilitate the study of asymptotic behavior of algorithms implementable on parallel computers.
Time and Space Complexities The complexity of an algorithm for solving a problem of size s on a computer is determined by the execution time and the storage space required. The time complexity is a function of the problem size. The time complexity function in order notation is the asymptotic time complexity of the algorithm. Usually, the worst-case time complexity is considered. For example, a time complexity g(s) is said to be O(f(s)), read "order f(s)", if there exist positive constants c₁, c₂, and s₀ such that c₁ f(s) ≤ g(s) ≤ c₂ f(s) for all nonnegative values of s > s₀.
The space complexity can be similarly defined as a function of the problem size s. The asymptotic space complexity refers to the data storage of large problems. Note that the program (code) storage requirement and the storage for input data are not considered in this.
The time complexity of a serial algorithm is simply called serial complexity. The time complexity of a parallel algorithm is called parallel complexity. Intuitively, the parallel complexity should be lower than the serial complexity, at least asymptotically. We consider only deterministic algorithms, in which every operational step is uniquely defined in agreement with the way programs are executed on real computers.
A nondeterministic algorithm contains operations resulting in one outcome from a set of possible outcomes. There exist no real computers that can execute nondeterministic algorithms. Therefore, all algorithms (or machines) considered in this book are deterministic, unless otherwise noted.

NP-Completeness An algorithm has a polynomial complexity if there exists a polynomial p(s) such that the time complexity is O(p(s)) for problem size s. The set of problems having polynomial-complexity algorithms is called P-class (for polynomial class). The set of problems solvable by nondeterministic algorithms in polynomial time is called NP-class (for nondeterministic polynomial class).
Since deterministic algorithms are special cases of the nondeterministic ones, we know that P ⊂ NP. The P-class problems are computationally tractable, while the NP − P-class problems are intractable. But we do not know whether P = NP or P ≠ NP. This is still an open problem in computer science.
To simulate a nondeterministic algorithm with a deterministic algorithm may require exponential time. Therefore, intractable NP-class problems are also said to have exponential-time complexity.

Example 1.4 Polynomial- and exponential-complexity algorithms

Polynomial-complexity algorithms are known for sorting n numbers in O(n log n) time and for multiplication of two n × n matrices in O(n³) time. Therefore, both problems belong to the P-class.

Nonpolynomial algorithms have been developed for the traveling salesperson problem with complexity O(n²2ⁿ) and for the knapsack problem with complexity O(2^(n/2)). These complexities are exponential, greater than the polynomial complexities. So far, deterministic polynomial algorithms have not been found for these problems. Therefore, these exponential-complexity problems belong to the NP-class.
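A quick numerical tabulation, sketched below with no assumptions beyond the formulas themselves, shows how sharply the exponential complexities outgrow the polynomial ones even at modest problem sizes.

# Sketch: growth of the complexity functions cited in Example 1.4.
import math

def growth_table(sizes=(10, 20, 40)):
    for n in sizes:
        print(f"n={n:>3}  n log n={n * math.log2(n):>10.0f}  n^3={n ** 3:>10}"
              f"  n^2*2^n={n * n * 2 ** n:>18}  2^(n/2)={2 ** (n / 2):>12.0f}")

growth_table()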

Most computer scientists believe that P ≠ NP. This leads to the conjecture that there exists a subclass, called NP-complete (NPC) problems, such that NPC ⊂ NP but NPC ∩ P = ∅ (Fig. 1.13). In fact, it has been proved that if any NP-complete problem is polynomial-time solvable, then one can conclude P = NP. Thus NP-complete problems are considered the hardest ones to solve. Only approximation algorithms can be derived for solving the NP-complete problems in polynomial time.

(Figure: a Venn diagram with the NPC class contained in NP, and P contained in NP but disjoint from NPC. Legend: NP, nondeterministic polynomial-time class; P, polynomial-time class; NPC, NP-complete class.)
Fig. 1.13 The relationships conjectured among the P, NP, and NPC classes of computational problems

PRAM Models Conventional uniprocessor computers have been modeled as random-access machines (RAM) by Sheperdson and Sturgis (1963). A parallel random-access machine (PRAM) model has been developed by Fortune and Wyllie (1978) for modeling idealized parallel computers with zero synchronization or memory access overhead. This PRAM model will be used for parallel algorithm development and for scalability and complexity analysis.
An n-processor PRAM (Fig. 1.14) has a globally addressable memory. The shared memory can be distributed among the processors or centralized in one place. The n processors, also called processing elements (PEs), operate on a synchronized read-memory, compute, and write-memory cycle. With shared memory, the model must specify how concurrent read and concurrent write of memory are handled. Four memory-update options are possible:

(Figure: n tightly synchronized processors P₁ through Pₙ, all connected to a shared memory.)
Fig. 1.14 PRAM model of a multiprocessor system with shared memory, on which all n processors operate in lockstep in memory access and program execution operations. Each processor can access any memory location in unit time

• Exclusive read (ER)—This allows at most one processor to read from any memory location in each cycle, a rather restrictive policy.
• Exclusive write (EW)—This allows at most one processor to write into a memory location at a time.
• Concurrent read (CR)—This allows multiple processors to read the same information from the same memory cell in the same cycle.
• Concurrent write (CW)—This allows simultaneous writes to the same memory location. In order to avoid confusion, some policy must be set up to resolve the write conflicts.

Various combinations of the above options lead to several variants of the PRAM model as specified below.
Since CR does not create a conflict problem, variants differ mainly in how they handle the CW conflicts.

PRAM Variants Described below are four variants of the PRAM model, depending on how the memory reads and writes are handled.

(1) EREW-PRAM model—This model forbids more than one processor from reading or writing the same memory cell simultaneously (Snir, 1982; Karp and Ramachandran, 1988). This is the most restrictive PRAM model proposed.
(2) CREW-PRAM model—The write conflicts are avoided by mutual exclusion. Concurrent reads to the same memory location are allowed.
(3) ERCW-PRAM model—This allows exclusive read or concurrent writes to the same memory location.
(4) CRCW-PRAM model—This model allows either concurrent reads or concurrent writes to the same memory location.
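The sketch below illustrates what the most restrictive variant rules out: given one synchronous cycle's worth of memory requests, it reports the concurrent reads and concurrent writes that an EREW machine forbids but the CR/CW options tolerate. The (processor, operation, address) request format is an assumption made only for this example.

# Sketch: checking one PRAM cycle for EREW violations.
# A request is (processor_id, 'R' or 'W', address).
from collections import defaultdict

def erew_conflicts(requests):
    readers, writers = defaultdict(list), defaultdict(list)
    for pid, op, addr in requests:
        (readers if op == 'R' else writers)[addr].append(pid)
    conflicts = []
    for addr, pids in readers.items():
        if len(pids) > 1:
            conflicts.append(("ER violated", addr, pids))   # CR would allow this
    for addr, pids in writers.items():
        if len(pids) > 1:
            conflicts.append(("EW violated", addr, pids))   # CW needs a resolution policy
    return conflicts          # an empty list means the cycle is legal on EREW

print(erew_conflicts([(0, 'R', 5), (1, 'R', 5), (2, 'W', 9)]))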

Example 1.5 Multiplication of two n × n matrices in O(log n) time on a PRAM with n³/log n processors (Viktor Prasanna, 1992)

Let A and B be the input matrices. Assume n³ PEs are available initially. We later reduce the number of PEs to n³/log n. To visualize the algorithm, assume the memory is organized as a three-dimensional array with inputs A and B stored in two planes. Also, for the sake of explanation, assume a three-dimensional indexing of the PEs. PE(i, j, k), 0 ≤ k ≤ n − 1, are used for computing the (i, j)th entry of the output matrix, 0 ≤ i, j ≤ n − 1, and n is a power of 2.
In step 1, n product terms corresponding to each output are computed using n PEs in O(1) time. In step 2, these are added to produce an output in O(log n) time.
The total number of PEs used is n³. The result is available in C(i, j, 0), 0 ≤ i, j ≤ n − 1. Listed below are programs for each PE(i, j, k) to execute. All n³ PEs operate in parallel for n³ multiplications. But at most n³/2 PEs are busy for (n³ − n²) additions. Also, the PRAM is assumed to be CREW.
Step 1
1. Read A(i, k)
2. Read B(k, j)

3. Compute A(i, k) × B(k, j)
4. Store in C(i, j, k)

Step 2
1. ℓ ← n
2. Repeat
     ℓ ← ℓ/2
     if (k < ℓ) then
       begin
         Read C(i, j, k)
         Read C(i, j, k + ℓ)
         Compute C(i, j, k) + C(i, j, k + ℓ)
         Store in C(i, j, k)
       end
   until (ℓ = 1)

To reduce the number of PEs to n³/log n, use a PE array of size n × n × n/log n. Each PE is responsible for computing log n product terms and summing them up. Step 1 can be easily modified to produce n/log n partial sums, each consisting of log n multiplications and (log n − 1) additions. Now we have an array C(i, j, k), 0 ≤ i, j ≤ n − 1, 0 ≤ k ≤ n/log n − 1, which can be summed up in log(n/log n) time. Combining the time spent in step 1 and step 2, we have a total execution time 2 log n − 1 + log(n/log n) = O(log n) for large n.
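The following Python sketch is a serial simulation of the two steps above, written only to make the pairwise (logarithmic) summation of step 2 visible; it does not attempt to model the PEs or the CREW memory discipline.

# Sketch: serial simulation of the PRAM matrix-multiplication algorithm above.
def pram_matmul(A, B):
    n = len(A)                                   # n is assumed to be a power of 2
    # Step 1: PE(i, j, k) computes one product term A(i, k) * B(k, j).
    C = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)] for i in range(n)]
    # Step 2: pairwise summation along the k dimension in log n rounds.
    width = n
    while width > 1:
        width //= 2
        for i in range(n):
            for j in range(n):
                for k in range(width):           # only PEs with k < width are active
                    C[i][j][k] += C[i][j][k + width]
    return [[C[i][j][0] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(pram_matmul(A, B))                          # [[19, 22], [43, 50]]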

Discrepancy with Physical Models PRAM models idealize parallel computers, in which all memory references and program executions by multiple processors are synchronized without extra cost. In reality, such parallel machines do not exist. An SIMD machine with shared memory is the closest architecture modeled by PRAM. However, PRAM allows different instructions to be executed on different processors simultaneously. Therefore, PRAM really operates in synchronized MIMD mode with a shared memory.
Among the four PRAM variants, the EREW and CREW are the most popular models used. In fact, every CREW algorithm can be simulated by an EREW algorithm. The CREW algorithm runs faster than an equivalent EREW algorithm. It has been proved that the best n-processor EREW algorithm can be no more than O(log n) times slower than any n-processor CRCW algorithm.
The CREW model has received more attention in the literature than the ERCW model. For our purposes, we will use the CRCW-PRAM model unless otherwise stated. This particular model will be used in defining scalability in Chapter 3.
For complexity analysis or performance comparison, various PRAM variants offer an ideal model of parallel computers. Therefore, computer scientists use the PRAM model more often than computer engineers. In this book, we design parallel/vector computers using physical architectural models rather than PRAM models. The PRAM model will be used for scalability and performance studies in Chapter 3 as a theoretical reference machine. PRAM models can indicate upper and lower bounds on the performance of real parallel computers.

1.4.2 VLSI Complexity Model


Parallel computers rely on the use of VLSI chips to fabricate the major components such as processor arrays, memory arrays, and large-scale switching networks. An AT² model for two-dimensional VLSI chips

is presented below, based on the work of Clark Thompson (1980). Three lower bounds on VLSI circuits are interpreted by Jeffrey Ullman (1984). The bounds are obtained by setting limits on memory, I/O, and communication for implementing parallel algorithms with VLSI chips.

The AT² Model Let A be the chip area and T be the latency for completing a given computation using a VLSI circuit chip. Let s be the problem size involved in the computation. Thompson stated in his doctoral thesis that for certain computations, there exists a lower bound f(s) such that

A × T² ≥ O(f(s))    (1.6)


The chip area A is a meas ure of the chip's complexity. The latency Tis the time required irom when inputs
are applied until all outputs are produced for a single problem instance. Figure l.l 5 shows how to interpret
the AT: complexity results in VLSI chip development. The chip is represented by the base area in the two
horizontal dimensions. The vertical dimension corresponds to time. Therefore, the three-dimensional solid
represents the history ofthe computation performed by the chip.

Memory Bound on Chip Area There are many computations which are memory-bound, due to the need to process large data sets. To implement this type of computation in silicon, one is limited by how densely information (bit cells) can be placed on the chip. As depicted in Fig. 1.15a, the memory requirement of a computation sets a lower bound on the chip area A.
The amount of information processed by the chip can be visualized as information flow upward across the chip area. Each bit can flow through a unit area of the horizontal chip slice. Thus, the chip area bounds the amount of memory bits stored on the chip.

I/O Bound on Volume AT The volume of the rectangular cube is represented by the product AT. As information flows through the chip for a period of time T, the number of input bits cannot exceed the volume AT, as demonstrated in Fig. 1.15a.

(Figure: two three-dimensional solids with the chip area as the base and time as the vertical axis. (a) Memory-limited bound on chip area A and I/O-limited bound on chip history represented by the volume AT. (b) Communication-limited bound on the bisection √A T.)
Fig. 1.15 The AT² complexity model of two-dimensional VLSI chips



The area A corresponds to data into and out of the entire surface of the silicon chip. This areal measure sets the maximum I/O limit rather than using the peripheral I/O pads as seen in conventional chips. The height T of the volume can be visualized as a number of snapshots on the chip, as computing time elapses. The volume represents the amount of information flowing through the chip during the entire course of the computation.

Bisection Communication Bound, √A T Figure 1.15b depicts a communication-limited lower bound on the bisection area √A T. The bisection is represented by the vertical slice cutting across the shorter dimension of the chip area. The distance of this dimension is √A for a square chip. The height of the cross section is T.
The bisection area √A T represents the maximum amount of information exchange between the two halves of the chip circuit during the time period T. The cross-section area √A T limits the communication bandwidth of a computation. VLSI complexity theoreticians have used the square of this measure, AT², to which the lower bound applies, as seen in Eq. 1.6.

Example 1.6 VLSI chip implementation of a matrix multiplication algorithm (Viktor Prasanna, 1992)

This example shows how to estimate the chip area A and compute time T for n × n matrix multiplication C = A × B on a mesh of processing elements (PEs) with a broadcast bus on each row and each column. The 2-D mesh architecture is shown in Fig. 1.16. Inter-PE communication is done through the broadcast buses. We want to prove the bound AT² = O(n⁴) by developing a parallel matrix multiplication algorithm with time T = O(n) in using the mesh with broadcast buses. Therefore, we need to prove that the chip area is bounded by A = O(n²).
Each PE occupies a unit area, and the broadcast buses require O(n²) wire area. Thus the total chip area needed is O(n²) for an n × n mesh with broadcast buses. We show next that the n × n matrix multiplication can be performed on this mesh chip in T = O(n) time. Denote the PEs as PE(i, j), 0 ≤ i, j ≤ n − 1.
Initially the input matrix elements A(i, j) and B(i, j) are stored in PE(i, j) with no duplicated data. The memory is distributed among all the PEs. Each PE can access only its own local memory. The following parallel algorithm shows how to perform the dot-product operations in generating all the output elements C(i, j) = Σₖ A(i, k) × B(k, j), summed over k = 0 to n − 1, for 0 ≤ i, j ≤ n − 1.

(Figure: a 4 × 4 grid of PEs labeled 00 through 33, with a broadcast bus along each row and along each column.)
Fig. 1.16 A 4 × 4 mesh of processing elements (PEs) with broadcast buses on each row and on each column (Courtesy of Prasanna Kumar and Raghavendra; reprinted from Journal of Parallel and Distributed Computing, April 1987)
Doall 10 for 0 ≤ i, j ≤ n − 1
10   PE(i, j) sets C(i, j) to 0       /Initialization/
Do 50 for 0 ≤ k ≤ n − 1
     Doall 20 for 0 ≤ i ≤ n − 1
20     PE(i, k) broadcasts A(i, k) along its row bus
     Doall 30 for 0 ≤ j ≤ n − 1
30     PE(k, j) broadcasts B(k, j) along its column bus
       /PE(i, j) now has A(i, k) and B(k, j), 0 ≤ i, j ≤ n − 1/
     Doall 40 for 0 ≤ i, j ≤ n − 1
40     PE(i, j) computes C(i, j) ← C(i, j) + A(i, k) × B(k, j)
50 Continue

The above algorithm has a sequential loop along the dimension indexed by k. It takes n time units (iterations) in this k-loop. Thus, we have T = O(n). Therefore, AT² = O(n²) · (O(n))² = O(n⁴).
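The sketch below mirrors the Doall/Do structure of the algorithm in plain Python, with the row and column broadcasts of iteration k modeled as shared lists; it is a serial simulation meant only to show why the k-loop gives T = O(n).

# Sketch: serial simulation of the mesh-with-broadcast-buses algorithm above.
def mesh_broadcast_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]              # step 10: initialization
    for k in range(n):                           # the sequential k-loop (T = O(n))
        row_bus = [A[i][k] for i in range(n)]    # step 20: PE(i,k) broadcasts A(i,k)
        col_bus = [B[k][j] for j in range(n)]    # step 30: PE(k,j) broadcasts B(k,j)
        for i in range(n):                       # step 40: every PE(i,j) accumulates
            for j in range(n):
                C[i][j] += row_bus[i] * col_bus[j]
    return C

print(mesh_broadcast_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]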

1.5 ARCHITECTURAL DEVELOPMENT TRACKS

The architectures of most existing computers follow certain development tracks. Understanding features of various tracks provides insights for new architectural development. We look into six tracks to be studied in later chapters. These tracks are distinguished by similarity in computational models and technological bases. We also review a few early representative systems in each track.

1.5.1 Multiple-Processor Tracks
Generally speaking, a multiple-processor system can be either a shared-memory multiprocessor or a distributed-memory multicomputer as modeled in Section 1.2. Bell listed these machines at the leaf nodes of

the taxonomy tree (Fig. 1.10). Instead of a horizontal listing, we show a historical development along each important track of the taxonomy.

Shared-Memory Track Figure 1.17a shows a track of multiprocessor development employing a single address space in the entire system. The track started with the C.mmp system developed at Carnegie-Mellon University (Wulf and Bell, 1972). The C.mmp was a UMA multiprocessor. Sixteen PDP 11/40 processors were interconnected to 16 shared-memory modules via a crossbar switch. A special interprocessor interrupt bus was provided for fast interprocess communication, besides the shared memory. The C.mmp project pioneered shared-memory multiprocessor development, not only in the crossbar architecture but also in the multiprocessor operating system (Hydra) development.

(Figure: (a) Shared-memory track: C.mmp (Wulf and Bell, 1972) leading to the NYU Ultracomputer (Gottlieb et al., 1983), IBM RP3 (Pfister et al., 1985), BBN Butterfly (BBN, 1989), Illinois Cedar (Kuck et al., 1987), Stanford/Dash (Lenoski, Hennessy et al., 1992), Fujitsu VPP500 (Fujitsu, Inc., 1992), and KSR1 (Kendall Square Research, 1990). (b) Message-passing track: Cosmic Cube (Seitz, 1981) leading to the Intel iPSCs (Intel Scientific Computers, 1983), Intel Paragon (Intel Supercomputer Systems, 1992), nCUBE-2/6400 (nCUBE Corp., 1990), Mosaic (Seitz, 1992), and MIT J-Machine (Dally et al., 1992).)
Fig. 1.17 Two multiple-processor tracks with and without shared memory

Both the NYU Ultracomputer project (Gottlieb et al., 1983) and the Illinois Cedar project (Kuck et al., 1987) were developed with a single address space. Both systems used multistage networks as a system interconnect. The major achievements in the Cedar project were in parallel compilers and performance benchmarking experiments. The Ultracomputer developed the combining network for fast synchronization among multiple processors, to be studied in Chapter 7.

The Stanford Dash (Lenoski, Hennessy et al., 1992) was a NUMA multiprocessor with distributed memories forming a global address space. Cache coherence was enforced with distributed directories. The KSR-1 was a typical COMA model. The Fujitsu VPP500 was a 222-processor system with a crossbar interconnect. The shared memories were distributed to all processor nodes. We will study the Dash and the KSR-1 in Chapter 9 and the VPP500 in Chapter 8.
Following the Ultracomputer are two large-scale multiprocessors, both using multistage networks but with different interstage connections to be studied in Chapters 2 and 7. Among the systems listed in Fig. 1.17a, only the KSR-1, VPP500, and BBN Butterfly (BBN Advanced Computers, 1989) were commercial products. The rest were research systems; only prototypes were built in laboratories, with a view to validate specific architectural concepts.
Message-Passing Track The Cosmic Cube (Seitz et al., 1981) pioneered the development of message-passing multicomputers (Fig. 1.17b). Since then, Intel produced a series of medium-grain hypercube computers (the iPSCs). The nCUBE 2 also assumed a hypercube configuration. A subsequent Intel system was the Paragon (1992) to be studied in Chapter 7. On the research track, the Mosaic C (Seitz, 1992) and the MIT J-Machine (Dally et al., 1992) were two fine-grain multicomputers, to be studied in Chapter 9.

1.5.2 Multivector and SIMD Tracks


The multivector track is shown in Fig. 1.18a, and the SIMD track in Fig. 1.18b, with corresponding early representative systems of each type.

(Figure: (a) Multivector track: the CDC 7600 (CDC, 1970) leading to the Cray 1 (Russell, 1978), Cray Y-MP (Cray Research, 1989), and Cray/MPP (Cray Research, 1993), along with the Fujitsu, NEC, and Hitachi models; on the memory-to-memory subtrack, the CDC Cyber205 (Levine, 1982) and ETA 10 (ETA, Inc., 1989). (b) SIMD track: the Illiac IV (Barnes et al., 1968) leading to the Goodyear MPP (Batcher, 1980), DAP 610 (AMT, Inc., 1987), CM2 (TMC, 1990) and CM5 (TMC, 1991), and MasPar MP1 (Nickolls, 1990), as well as the BSP (Kuck and Stokes, 1982) and IBM GF11 (Beetem et al., 1985).)
Fig. 1.18 Multivector and SIMD tracks


Par MI J11!!!‘ l'mrJI||r_.u|r¢\
Model Cornprmzr Models 1 39

Both tracks are useful for concurrent scalar/vector processing. Detailed studies can be found in Chapter 8, with further discussion in Chapter 13.

Multivector Track These are traditional vector supercomputers. The CDC 7600 was the first vector dual-processor system. Two subtracks were derived from the CDC 7600. The Cray and Japanese supercomputers all followed the register-to-register architecture. Cray 1 pioneered the multivector development in 1978.
The Cray/MPP was a massively parallel system with distributed shared memory, to work as a back-end accelerator engine compatible with the Cray Y-MP Series.
The other subtrack used memory-to-memory architecture in building vector supercomputers. We have identified only the CDC Cyber 205 and its successor the ETA10 here, for completeness in tracking different supercomputer architectures.

The SIMD Track The Illiac IV pioneered the construction of SIMD computers, although the array processor concept can be traced back far earlier to the 1960s. The subtrack, consisting of the Goodyear MPP, the AMT/DAP610, and the TMC/CM-2, were all SIMD machines built with bit-slice PEs. The CM-5 was a synchronized MIMD machine executing in a multiple-SIMD mode.
The other subtrack corresponds to medium-grain SIMD computers using word-wide PEs. The BSP (Kuck and Stokes, 1982) was a shared-memory SIMD machine built with 16 processors updating a group of 17 memory modules synchronously. The GF11 (Beetem et al., 1985) was developed at the IBM Watson Laboratory for scientific simulation research use. The MasPar MP1 was the only medium-grain SIMD computer to achieve production use in that time period. We will describe the CM-2, MasPar MP1, and CM-5 in Chapter 8.

1.5.3 Multithreaded and Dataflow Tracks

These two architectural tracks (Fig. 1.19) will be studied in Chapter 9. The following introduction covers only basic definitions and milestone systems built in the early years.
(Figure: (a) Multithreaded track: the CDC 6600 (CDC, 1964) leading to the HEP (Smith, 1978), the Tera (Alverson, Smith, et al., 1990), and the MIT/Alewife (Agarwal et al., 1989). (b) Dataflow track: Static Dataflow (Dennis, 1974) leading on one subtrack to the MIT Tagged Token architecture (Arvind et al.), Monsoon, and *T (Nikhil et al., 1991), and on the other to the Manchester machine (Gurd and Watson, 1982), Sigma 1 (Shimada et al., 1987), and EM5 (Sakai et al., 1989).)
Fig. 1.19 Multithreaded and dataflow tracks



The conventional von Neumann machines are built with processors that execute a single context by each processor at a time. In other words, each processor maintains a single thread of control with its hardware resources. In a multithreaded architecture, each processor can execute multiple contexts at the same time. The term multithreading implies that there are multiple threads of control in each processor. Multithreading offers an effective mechanism for hiding long latency in building large-scale multiprocessors and is today a mature technology.
As shown in Fig. 1.19a, the multithreading idea was pioneered by Burton Smith (1978) in the HEP system which extended the concept of scoreboarding of multiple functional units in the CDC 6600. Subsequent multithreaded multiprocessor projects were the Tera computer (Alverson, Smith et al., 1990) and the MIT Alewife (Agarwal et al., 1989) to be studied in Section 9.4. In Chapters 12 and 13, we shall discuss the present technological factors which have led to the design of multi-threaded processors.
The Dataflow Track We will introduce the basic concepts of dataflow computers in Section 2.3. Some experimental dataflow systems are described in Section 9.5. The key idea is to use a dataflow mechanism, instead of a control-flow mechanism as in von Neumann machines, to direct the program flow. Fine-grain, instruction-level parallelism is exploited in dataflow computers.
As listed in Fig. 1.19b, the dataflow concept was pioneered by Jack Dennis (1974) with a "static" architecture. The concept later inspired the development of "dynamic" dataflow computers. A series of tagged-token architectures was developed at MIT by Arvind and coworkers. We will describe the tagged-token architecture in Section 2.3.1 and then the *T prototype (Nikhil et al., 1991) in Section 9.5.3.
Another subtrack of dynamic dataflow computer was represented by the Manchester machine (Gurd and Watson, 1982). The ETL Sigma 1 (Shimada et al., 1987) and EM5 evolved from the MIT and Manchester machines. We will study the EM5 (Sakai et al., 1989) in Section 9.5.2. These dataflow machines represent research concepts which have not had a major impact in terms of widespread use.

Summary
In science and in engineering, theory and practice go hand-in-hand, and any significant achievement invariably relies on a judicious blend of the two. In this chapter, as the first step towards a conceptual understanding of parallelism in computer architecture, we have looked at the models of parallel computer systems which have emerged over the years. We started our study with a brief look at the development of modern computers and computer architecture, including the means of classification of computer architecture, and in particular Flynn's scheme of classification.
The performance of any engineering system must be quantifiable. In the case of computer systems, we have performance measures such as processor clock rate, cycles per instruction (CPI), word size, and throughput in MIPS and/or MFLOPS. These measures have been defined, and basic relationships between them have been examined. Thus the ground has been prepared for our study in subsequent chapters of how processor architecture, system architecture, and software determine performance.
Next we looked at the architecture of shared-memory multiprocessors and distributed-memory multicomputers, laying the foundation for a taxonomy of MIMD computers. A key system characteristic is whether different processors in the system have access to a common shared memory and, if they do,

whether the access is uniform or non-uniform. Vector computers and SIMD computers were examined,
which address the needs of highly compute-intensive scientific and engineering applications.
Over the last two or three decades, advances in VLSI technology have resulted in huge advances in computer system performance; however, the basic architectural concepts which were developed prior to the 'VLSI revolution' continue to remain valid.
Parallel random access machine (PRAM) is a theoretical model of a parallel computer. No real computer system can behave exactly like the PRAM, but at the same time the PRAM model provides us with a basis for the study of parallel algorithms and their performance in terms of time and/or space complexity. Different sub-types of the PRAM model emerge, depending on whether or not multiple processors can perform concurrent read or write operations to the shared memory.
Towards the end of the chapter, we could discern the separate architectural development tracks which have emerged over the years in computer systems. We looked at multiple-processor systems, vector processing, SIMD systems, and multi-threaded and dataflow systems. We shall see in Chapters 12 and 13 that, due to various technological factors, multi-threaded processors have gained in importance over the last decade or so.

Exercises

Problem 1.1 A 400-MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle counts:

Instruction type        Instruction count    Clock cycle count
Integer arithmetic      45000                1
Data transfer           32000                2
Floating point          15000                2
Control transfer        8000                 2

Determine the effective CPI, MIPS rate, and execution time for this program.

Problem 1.2 Explain how instruction set, compiler technology, CPU implementation and control, and cache and memory hierarchy affect the CPU performance and justify the effects in terms of program length, clock rate, and effective CPI.

Problem 1.3 A workstation uses a 1.5 GHz processor with a claimed 1000-MIPS rating to execute a given program mix. Assume a one-cycle delay for each memory access.
(a) What is the effective CPI of this computer?
(b) Suppose the processor is being upgraded with a 3.0 GHz clock. However, even with faster cache, two clock cycles are needed per memory access. If 30% of the instructions require one memory access and another 5% require two memory accesses per instruction, what is the performance of the upgraded processor with a compatible instruction set and equal instruction counts in the given program mix?

Problem 1.4 Consider the execution of an object code with 2 × 10^6 instructions on a 400-MHz processor. The program consists of four major types of instructions. The instruction mix and the number of cycles (CPI) needed for each instruction type are given below based on the result of a program trace experiment:

Instruction type                    CPI    Instruction mix
Arithmetic and logic                1      60%
Load/store with cache hit           2      18%
Branch                              4      12%
Memory reference with cache miss    8      10%

(a) Calculate the average CPI when the program is executed on a uniprocessor with the above trace results.
(b) Calculate the corresponding MIPS rate based on the CPI obtained in part (a).

Problem 1.5 Indicate whether each of the following statements is true or false and justify your answer with reasoning and supportive or counter examples:
(a) The CPU computations and I/O operations cannot be overlapped in a multiprogrammed computer.
(b) Synchronization of all PEs in an SIMD computer is done by hardware rather than by software as is often done in most MIMD computers.
(c) As far as programmability is concerned, shared-memory multiprocessors offer simpler interprocessor communication support than that offered by a message-passing multicomputer.
(d) In an MIMD computer, all processors must execute the same instruction at the same time synchronously.
(e) As far as scalability is concerned, multicomputers with distributed memory are more scalable than shared-memory multiprocessors.

Problem 1.6 The execution times (in seconds) of four programs on three computers are given below:

Execution Time (in seconds)
Program      Computer A    Computer B    Computer C
Program 1    1             10            20
Program 2    1000          100           20
Program 3    500           1000          50
Program 4    100           800           100

Assume that 10^9 instructions were executed in each of the four programs. Calculate the MIPS rating of each program on each of the three machines. Based on these ratings, can you draw a clear conclusion regarding the relative performance of the three computers? Give reasons if you find a way to rank them statistically.

Problem 1.7 Characterize the architectural operations of SIMD and MIMD computers. Distinguish between multiprocessors and multicomputers based on their structures, resource sharing, and interprocessor communications. Also, explain the differences among UMA, NUMA, COMA, and NORMA computers.

Problem 1.8 The following code segment, consisting of six instructions, needs to be executed 64 times for the evaluation of the vector arithmetic expression: D(I) = A(I) + B(I) × C(I) for 0 ≤ I ≤ 63.

Load R1, B(I)       /R1 ← Memory(α + I)/
Load R2, C(I)       /R2 ← Memory(β + I)/
Multiply R1, R2     /R1 ← (R1) × (R2)/
Load R3, A(I)       /R3 ← Memory(γ + I)/
Add R3, R1          /R3 ← (R3) + (R1)/
Store D(I), R3      /Memory(θ + I) ← (R3)/

where R1, R2, and R3 are CPU registers, (R1) is the content of R1, and α, β, γ, and θ are the starting memory addresses of arrays B(I), C(I), A(I), and D(I), respectively. Assume four clock cycles for each Load or Store, two cycles for the Add, and eight cycles for the Multiply on either a uniprocessor or a single PE in an SIMD machine.
(a) Calculate the total number of CPU cycles needed to execute the above code segment repeatedly 64 times on an SISD uniprocessor computer sequentially, ignoring all other time delays.

(b) Consider the use of an SIMD computer with 64 PEs to execute the above vector operations in six synchronized vector instructions over 64-component vector data and both driven by the same-speed clock. Calculate the total execution time on the SIMD machine, ignoring instruction broadcast and other delays.
(c) What is the speedup gain of the SIMD computer over the SISD computer?

Problem 1.9 Prove that the best parallel algorithm written for an n-processor EREW PRAM model can be no more than O(log n) times slower than any algorithm for a CRCW model of PRAM having the same number of processors.

Problem 1.10 Consider the multiplication of two n-bit binary integers using a 1.2-μm CMOS multiplier chip. Prove the lower bound AT² > kn², where A is the chip area, T is the execution time, n is the word length, and k is a technology-dependent constant.

Problem 1.11 Compare the PRAM models with physical models of real parallel computers in each of the following categories:
(a) Which PRAM variant can best model SIMD machines and how?
(b) Repeat the question in part (a) for shared-memory MIMD machines.

Problem 1.12 Answer the following questions related to the architectural development tracks presented in Section 1.5:
(a) For the shared-memory track (Fig. 1.17), explain the trend in physical memory organizations from the earlier system (C.mmp) to more recent systems (such as Dash, etc.).
(b) Distinguish between medium-grain and fine-grain multicomputers in their architectures and programming requirements.
(c) Distinguish between register-to-register and memory-to-memory architectures for building conventional multivector supercomputers.
(d) Distinguish between single-threaded and multithreaded processor architectures.

Problem 1.13 Design an algorithm to find the maximum of n numbers in O(log n) time on an EREW-PRAM model. Assume that initially each location holds one input value. Explain how you would make the algorithm processor-time optimal.

Problem 1.14 Develop two algorithms for fast multiplication of two n × n matrices with a system of p processors, where 1 ≤ p ≤ n³/log n. Choose an appropriate PRAM machine model to prove that the matrix multiplication can be done in T = O(n³/p) time.
(a) Prove that T = O(n²) if p = n. The corresponding algorithm must be shown, similar to that in Example 1.5.
(b) Show the parallel algorithm with T = O(n) if p = n².

Problem 1.15 Match each of the following eight computer systems: KSR-1, RP3, Paragon, Dash, CM-2, VPP500, EM-5, and Tera, with one of the best descriptions listed below. The mapping is a one-to-one correspondence.
(a) A massively parallel system built with multiple-context processors and a 3-D torus architecture.
(b) A data-parallel computer built with bit-slice PEs interconnected by a hypercube/mesh network.
(c) A ring-connected multiprocessor using a cache-only memory architecture.
(d) An experimental multiprocessor built with a dynamic dataflow architecture.
(e) A crossbar-connected multiprocessor built with distributed processor/memory nodes forming a single address space.
(f) A multicomputer built with commercial microprocessors with multiple address spaces.
(g) A scalable multiprocessor built with distributed shared memory and coherent caches.
(h) An MIMD computer built with a large multistage switching network.

Program and Network Properties
This chapter covers fundamental properties of program behavior and introduces major classes of interconnection networks. We begin with a study of computational granularity, conditions for program partitioning, matching software with hardware, program flow mechanisms, and compilation support for parallelism. Interconnection architectures introduced include static and dynamic networks. Network complexity, communication bandwidth, and data-routing capabilities are discussed.

2.1 CONDITIONS OF PARALLELISM

The exploitation of parallelism has created a new dimension in computer science. In order to move parallel processing into the mainstream of computing, H. T. Kung (1991) has identified the need to make significant progress in three key areas: computation models for parallel computing, interprocessor communication in parallel architectures, and system integration for incorporating parallel systems into general computing environments.
A theoretical treatment of parallelism is thus needed to build a basis for the above challenges. In practice, parallelism appears in various forms in a computing environment. All forms can be attributed to levels of parallelism, computational granularity, time and space complexities, communication latencies, scheduling policies, and load balancing. Very often, tradeoffs exist among time, space, performance, and cost factors.

2.1.1 Data and Resource Dependences

The ability to execute several program segments in parallel requires each segment to be independent of the other segments. The independence comes in various forms as defined below separately. For simplicity, to illustrate the idea, we consider the dependence relations among instructions in a program. In general, each code segment may contain one or more statements.
We use a dependence graph to describe the relations. The nodes of a dependence graph correspond to the program statements (instructions), and the directed edges with different labels show the ordered relations among the statements. The analysis of dependence graphs shows where opportunity exists for parallelization and vectorization.

Data Dependence The ordering relationship between statements is indicated by the data dependence. Five types of data dependence are defined below:
(1) Flow dependence: A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output (variables assigned) of S1 feeds in as input (operands to be used) to S2. Flow dependence is denoted as S1 → S2.
(2) Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. A direct arrow crossed with a bar, as in S1 ↛ S2, indicates antidependence from S1 to S2.
(3) Output dependence: Two statements are output-dependent if they produce (write) the same output variable. S1 o→ S2 indicates output dependence from S1 to S2.
(4) I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
(5) Unknown dependence: The dependence relation between two statements cannot be determined in the following situations:
• The subscript of a variable is itself subscripted.
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different coefficients of the loop variable.
• The subscript is nonlinear in the loop index variable.
When one or more of these conditions exist, a conservative assumption is to claim unknown dependence among the statements involved.

Example 2.1 Data dependence in programs
Consider the following code fragment of four instructions:
S1:  Load R1, A      /R1 ← Memory(A)/
S2:  Add R2, R1      /R2 ← (R1) + (R2)/
S3:  Move R1, R3     /R1 ← (R3)/
S4:  Store B, R1     /Memory(B) ← (R1)/
As illustrated in Fig. 2.1a, S2 is flow-dependent on S1 because the variable A is passed via the register R1. S3 is antidependent on S2 because of potential conflicts in register content in R1. S3 is output-dependent on S1 because they both modify the same register R1. Other data dependence relationships can be similarly revealed on a pairwise basis. Note that dependence is a partial ordering relation; that is, the members of not every pair of statements are related. For example, the statements S2 and S4 in the above program are totally independent.
Next, we consider a code fragment involving I/O operations:
S1:  Read (4), A(I)      /Read array A from file 4/
S2:  Process             /Process data/
S3:  Write (4), B(I)     /Write array B into file 4/
S4:  Close (4)           /Close file 4/

As shown in Fig. 2.1b, the read/write statements, S1 and S3, are I/O-dependent on each other because they both access the same file. The above data dependence relations should not be arbitrarily violated during program execution. Otherwise, erroneous results may be produced with changed program order. The order in which statements are executed in a sequential program is well defined and repetitive runs produce identical results. On a multiprocessor system, the program order may or may not be preserved, depending on the memory model used. Determinism yielding predictable results can be controlled by a programmer as well as by constrained modification of writable data in a shared memory.

Fig. 2.1 Data and I/O dependences in the program of Example 2.1: (a) dependence graph; (b) I/O dependence caused by accessing the same file by the read and write statements
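The pairwise classification above can be mechanized from the read and write sets of each instruction. The short Python sketch below is only an illustration (the register operands follow Example 2.1, while Mem(A) and Mem(B) simply model the memory operands as named locations); the simple set test is conservative in that it ignores intervening writes that may kill a value.

    # Minimal sketch: classify pairwise dependences from read/write sets.
    # Instruction list mirrors Example 2.1: (name, read set, write set).
    instrs = [
        ("S1", {"Mem(A)"}, {"R1"}),        # Load  R1, A
        ("S2", {"R1", "R2"}, {"R2"}),      # Add   R2, R1
        ("S3", {"R3"}, {"R1"}),            # Move  R1, R3
        ("S4", {"R1"}, {"Mem(B)"}),        # Store B, R1
    ]

    def dependences(earlier, later):
        """Return the dependence types of 'later' on 'earlier' (program order assumed).
        Note: this pairwise check is conservative; it does not consider writes
        by instructions that lie between the two being compared."""
        _, r1, w1 = earlier
        _, r2, w2 = later
        found = []
        if w1 & r2:
            found.append("flow")    # earlier writes what later reads
        if r1 & w2:
            found.append("anti")    # later overwrites what earlier reads
        if w1 & w2:
            found.append("output")  # both write the same location
        return found

    for i in range(len(instrs)):
        for j in range(i + 1, len(instrs)):
            for dep in dependences(instrs[i], instrs[j]):
                print(f"{instrs[j][0]} is {dep}-dependent on {instrs[i][0]}")

Running the sketch reports, among others, that S2 is flow-dependent on S1, S3 is output-dependent on S1, and S3 is anti-dependent on S2, matching the discussion above.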

Control Dependence This refers to the situation where the order of execution of statements cannot be determined before run time. For example, conditional statements will not be resolved until run time. Different paths taken after a conditional branch may introduce or eliminate data dependence among instructions. Dependence may also exist between operations performed in successive iterations of a looping procedure. In the following, we show one loop example with and another without control-dependent iterations. The successive iterations of the following loop are control-independent:

      Do 20 I = 1, N
         A(I) = C(I)
         IF (A(I) .LT. 0) A(I) = 1
20    Continue

The following loop has control-dependent iterations:

      Do 10 I = 1, N
         IF (A(I-1) .EQ. 0) A(I) = 0
10    Continue

Control dependence often prohibits parallelism from being exploited. Compiler techniques or hardware
branch prediction techniques are needed to get around the control dependence in order to exploit more
parallelism.

Resource Dependence This is different from data or control dependence, which demands the independence of the work to be done. Resource dependence is concerned with the conflicts in using shared resources, such as integer units, floating-point units, registers, and memory areas, among parallel events. When the conflicting resource is an ALU, we call it ALU dependence.
If the conflicts involve workplace storage, we call it storage dependence. In the case of storage dependence, each task must work on independent storage locations or use protected access (such as locks or monitors to be described in Chapter 11) to shared writable data.
The detection of parallelism in programs requires a check of the various dependence relations.
Bernstein's Conditions In 1966, Bernstein revealed a set of conditions based on which two processes can execute in parallel. A process is a software entity corresponding to the abstraction of a program fragment defined at various processing levels. We define the input set Ii of a process Pi as the set of all input variables needed to execute the process.
Similarly, the output set Oi consists of all output variables generated after execution of the process Pi. Input variables are essentially operands which can be fetched from memory or registers, and output variables are the results to be stored in working registers or memory locations.
Now, consider two processes P1 and P2 with their input sets I1 and I2 and output sets O1 and O2, respectively. These two processes can execute in parallel and are denoted P1 || P2 if they are independent and therefore create deterministic results.
Formally, these conditions are stated as follows:

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅          (2.1)
    O1 ∩ O2 = ∅

These three conditions are known as Bernstein's conditions. The input set Ii is also called the read set or the domain of Pi by other authors. Similarly, the output set Oi has been called the write set or the range of a process Pi. In terms of data dependences, Bernstein's conditions simply imply that two processes can execute in parallel if they are flow-independent, antiindependent, and output-independent.
The parallel execution of such two processes produces the same results regardless of whether they are executed sequentially in any order or in parallel. This is because the output of one process will not be used as input to the other process. Furthermore, the two processes do not modify (write) the same set of variables, either in memory or in the registers.
In general, a set of processes, P1, P2, ..., Pk, can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; that is, P1 || P2 || ... || Pk if and only if Pi || Pj for all i ≠ j. This is exemplified by the following program illustrated in Fig. 2.2.

Example 2.2 Detection of parallelism in a program using Bernstein's conditions
Consider the simple case in which each process is a single HLL statement. We want to detect the parallelism embedded in the following five statements labeled P1, P2, P3, P4, and P5 in program order:

    P1: C = D × E
    P2: M = G + C
    P3: A = B + C          (2.2)
    P4: C = L + M
    P5: F = G + E
Assume that each statement requires one step to execute. No pipelining is considered here. The dependence graph shown in Fig. 2.2a demonstrates flow dependence as well as resource dependence. In sequential execution, five steps are needed (Fig. 2.2b).

Fig. 2.2 Detection of parallelism in the program of Example 2.2: (a) a dependence graph showing both data dependence (solid arrows) and resource dependence (dashed arrows); (b) sequential execution in five steps, assuming one step per statement (no pipelining); (c) parallel execution in three steps, assuming two adders are available per step

If two adders are available simultaneously, the parallel execution requires only three steps as shown in Fig. 2.2c. Pairwise, there are 10 pairs of statements to check against Bernstein's conditions. Only 5 pairs, P1 || P5, P2 || P3, P2 || P5, P5 || P3, and P4 || P5, can execute in parallel as revealed in Fig. 2.2a if there are no resource conflicts. Collectively, only P2 || P3 || P5 is possible (Fig. 2.2c) because P2 || P3, P3 || P5, and P2 || P5 are all possible.
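The pairwise check just performed is entirely mechanical. The following Python sketch (illustrative only; the input and output sets are read off the five statements of Eq. 2.2) applies Bernstein's conditions to every pair of processes and reports the pairs that may execute in parallel.

    # Minimal sketch: apply Bernstein's conditions pairwise to Eq. 2.2.
    # Each process is described by its input set I and output set O.
    procs = {
        "P1": ({"D", "E"}, {"C"}),   # C = D * E
        "P2": ({"G", "C"}, {"M"}),   # M = G + C
        "P3": ({"B", "C"}, {"A"}),   # A = B + C
        "P4": ({"L", "M"}, {"C"}),   # C = L + M
        "P5": ({"G", "E"}, {"F"}),   # F = G + E
    }

    def bernstein(p, q):
        """True if processes p and q satisfy all three of Bernstein's conditions."""
        (i1, o1), (i2, o2) = procs[p], procs[q]
        return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

    names = sorted(procs)
    parallel_pairs = [(p, q) for idx, p in enumerate(names)
                      for q in names[idx + 1:] if bernstein(p, q)]
    print(parallel_pairs)
    # Prints the 5 pairs cited in the text:
    # [('P1','P5'), ('P2','P3'), ('P2','P5'), ('P3','P5'), ('P4','P5')]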

In general, the parallelism relation || is commutative; i.e., Pi || Pj implies Pj || Pi. But the relation is not transitive; i.e., Pi || Pj and Pj || Pk do not necessarily guarantee Pi || Pk. For example, we have P1 || P5 and P5 || P3, but P1 ∦ P3, where ∦ means P1 and P3 cannot execute in parallel. In other words, the order in which P1 and P3 are executed will make a difference in the computational results.
Therefore, || is not an equivalence relation. However, Pi || Pj || Pk implies associativity; i.e., (Pi || Pj) || Pk = Pi || (Pj || Pk), since the order in which the parallel executable processes are executed should not make any difference in the output sets. It should be noted that the condition Ii ∩ Ij ≠ ∅ does not prevent parallelism between Pi and Pj.
Violation of any one or more of the three conditions in Eq. 2.1 prohibits parallelism between two processes. In general, violation of any one or more of the 3n(n-1)/2 Bernstein's conditions among n processes prohibits parallelism collectively or partially.
In general, data dependence, control dependence, and resource dependence all prevent parallelism from being exploitable.
The statement-level dependence can be generalized to higher levels, such as code segment, subroutine, process, task, and program levels. The dependence of two higher-level objects can be inferred from the dependence of statements in the corresponding objects. The goals of analyzing the data dependence, control dependence, and resource dependence in a code are to identify opportunities for parallelization or vectorization. Hardware techniques for detecting instruction-level parallelism in a running program are described in Chapter 12.
Very often program restructuring or code transformations need to be performed before such opportunities can be revealed. The dependence relations are also used in instruction issue and pipeline scheduling operations described in Chapters 6 and 12.

2.1.2 Hardware and Software Parallelism


For implementation of parallelism, we need special hardware and software support. In this section, we address these support issues. We first distinguish between hardware and software parallelism. The mismatch problem between hardware and software is discussed. Then we describe the fundamental concept of compilation support needed to close the gap between hardware and software.
Details of special hardware functions and software support for parallelism will be treated in the remaining chapters. The key idea being conveyed is that parallelism cannot be achieved free. Besides theoretical conditioning, joint efforts between hardware designers and software programmers are needed to exploit parallelism in upgrading computer performance.

Hardware Parallelism This refers to the type of parallelism defined by the machine architecture and hardware multiplicity. Hardware parallelism is often a function of cost and performance tradeoffs. It displays the resource utilization patterns of simultaneously executable operations. It can also indicate the peak performance of the processor resources.
One way to characterize the parallelism in a processor is by the number of instruction issues per machine cycle. If a processor issues k instructions per machine cycle, then it is called a k-issue processor.

A conventional pipelined processor takes one machine cycle to issue a single instruction. These types of processors are called one-issue machines, with a single instruction pipeline in the processor. In a modern processor, two or more instructions can be issued per machine cycle.
For example, the Intel i960CA was a three-issue processor with one arithmetic, one memory access, and one branch instruction issued per cycle. The IBM RISC/System 6000 is a four-issue processor capable of issuing one arithmetic, one memory access, one floating-point, and one branch operation per cycle.
Software Parallelism This type of parallelism is revealed in the program profile or in the program flow graph. Software parallelism is a function of algorithm, programming style, and program design. The program flow graph displays the patterns of simultaneously executable operations.

Example 2.3 Mismatch between software parallelism and hardware parallelism (Wen-Mei Hwu, 1991)
Consider the example program graph in Fig. 2.3a. There are eight instructions (four loads and four arithmetic operations) to be executed in three consecutive machine cycles. Four load operations are performed in the first cycle, followed by two multiply operations in the second cycle and two add/subtract operations in the third cycle. Therefore, the parallelism varies from 4 to 2 in three cycles. The average software parallelism is equal to 8/3 = 2.67 instructions per cycle in this example program.
Now consider execution of the same program by a two-issue processor which can execute one memory access (load or write) and one arithmetic (add, subtract, multiply, etc.) operation simultaneously. With this hardware restriction, the program must execute in seven machine cycles as shown in Fig. 2.3b. Therefore, the hardware parallelism displays an average value of 8/7 = 1.14 instructions executed per cycle. This demonstrates a mismatch between the software parallelism and the hardware parallelism.

Fig. 2.3 Executing an example program by a two-issue superscalar processor: (a) software parallelism, with the program completing in three cycles; (b) hardware parallelism, with the same program taking seven cycles (L: load operation; X: multiply operation)
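The cycle counts quoted above can be reproduced with a small greedy scheduler. The Python sketch below is illustrative only: it assumes the dependence structure suggested by Fig. 2.3 (two multiplies each consuming two loaded values, and an add and a subtract consuming both products) and, for the two-issue case, issues at most one memory operation and one arithmetic operation per cycle.

    # Minimal sketch: list-schedule the 8-instruction graph of Example 2.3.
    # Assumed dependence structure reproducing Fig. 2.3.
    deps = {
        "L1": set(), "L2": set(), "L3": set(), "L4": set(),
        "X1": {"L1", "L2"}, "X2": {"L3", "L4"},
        "ADD": {"X1", "X2"}, "SUB": {"X1", "X2"},
    }
    kind = {op: ("mem" if op.startswith("L") else "alu") for op in deps}

    def schedule(mem_slots, alu_slots):
        done, cycles = set(), 0
        while len(done) < len(deps):
            cycles += 1
            budget = {"mem": mem_slots, "alu": alu_slots}
            issued = []
            for op in deps:
                if op not in done and deps[op] <= done and budget[kind[op]] > 0:
                    budget[kind[op]] -= 1
                    issued.append(op)   # results become visible next cycle
            done |= set(issued)
        return cycles

    print(schedule(mem_slots=4, alu_slots=4))   # 3 cycles, as in Fig. 2.3a
    print(schedule(mem_slots=1, alu_slots=1))   # 7 cycles, as in Fig. 2.3b

The resulting averages, 8/3 = 2.67 and 8/7 = 1.14 instructions per cycle, are the software and hardware parallelism figures quoted in the example.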


Let us try to match the software parallelism shown in Fig. 2.3a in a hardware platform of a dual-processor system, where single-issue processors are used. The achievable hardware parallelism is shown in Fig. 2.4, where L/S stands for load/store operations. Note that six processor cycles are needed to execute the 12 instructions by two processors. S1 and S2 are two inserted store operations, and L5 and L6 are two inserted load operations. These added instructions are needed for interprocessor communication through the shared memory.

Fig. 2.4 Dual-processor execution of the program in Fig. 2.3a in six cycles (L/S: load/store operation; X: multiply operation; +/-: add/subtract operation; S1, S2, L5, L6: instructions added for interprocessor communication)

Of the many types of software parallelism, two are most frequently cited as important to parallel programming. The first is control parallelism, which allows two or more operations to be performed simultaneously. The second type has been called data parallelism, in which almost the same operation is performed over many data elements by many processors simultaneously.
Control parallelism, appearing in the form of pipelining or multiple functional units, is limited by the pipeline length and by the multiplicity of functional units. Both pipelining and functional parallelism are handled by the hardware; programmers need take no special actions to invoke them.
Data parallelism offers the highest potential for concurrency. It is practiced in both SIMD and MIMD modes on MPP systems. Data parallel code is easier to write and to debug than control parallel code. Synchronization in SIMD data parallelism is handled by the hardware. Data parallelism exploits parallelism in proportion to the quantity of data involved. Thus data parallel computations appeal to scaled problems, in which the performance of an MPP system does not drop sharply with the possibly small sequential fraction in the program.
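As a purely illustrative flavour of data parallelism, the sketch below applies one operation to every element of a data set using a pool of worker processes; the function, data set, and worker count are hypothetical choices, not part of the original text.

    # Minimal sketch of data parallelism: the same operation is applied to many
    # data elements by several workers at once (illustrative only).
    from multiprocessing import Pool

    def scale_and_offset(x):
        """The single operation replicated across all data elements."""
        return 3 * x + 1

    if __name__ == "__main__":
        data = list(range(1_000_000))           # hypothetical data set
        with Pool(processes=4) as pool:         # four workers share the elements
            results = pool.map(scale_and_offset, data, chunksize=10_000)
        print(results[:5])                      # [1, 4, 7, 10, 13]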
To solve the mismatch problem between software parallelism and hardware parallelism, one approach is to develop compilation support, and the other is through hardware redesign for more efficient exploitation of parallelism. These two approaches must cooperate with each other to produce the best result.

Hardware processors can be better designed to exploit parallelism by an optimizing compiler. Pioneer work in processor technology with this objective was seen in the IBM 801, Stanford MIPS, and Berkeley RISC. Such processors use a large register file and sustained instruction pipelining to execute nearly one instruction per cycle. The large register file supports fast access to temporary values generated by an optimizing compiler. The registers are exploited by the code optimizer and global register allocator in such a compiler.
The instruction scheduler exploits the pipeline hardware by filling branch and load delay slots. In superscalar processors, hardware and software branch prediction, multiple instruction issue, speculative execution, high bandwidth instruction cache, and support for dynamic scheduling are needed to facilitate the detection of parallelism opportunities. Further discussion on these topics can be found in Chapters 6 and 12.

2.1.3 The Role of Compilers


Compiler techniques are used to exploit hardware features to improve performance. The pioneer work on the IBM PL.8 and Stanford MIPS compilers aimed for this goal. Other early optimizing compilers for exploiting parallelism included the CDC STACKLIB, Cray CFT, Illinois Parafrase, Rice PFC, Yale Bulldog, and Illinois IMPACT.
In Chapter 10, we will study loop transformation, software pipelining, and features developed in existing optimizing compilers for supporting parallelism. Interaction between compiler and architecture design is a necessity in modern computer development. Conventional scalar processors issue at most one instruction per cycle and provide a few registers. This may cause excessive spilling of temporary results from the available registers. Therefore, more software parallelism may not improve performance in conventional scalar processors.
There exists a vicious cycle of limited hardware support and the use of a naive compiler. To break the cycle, ideally one must design the compiler and the hardware jointly at the same time. Interaction between the two can lead to a better solution to the mismatch problem between software and hardware parallelism.
The general guideline is to increase the flexibility in hardware parallelism and to exploit software parallelism in control-intensive programs. Hardware and software design tradeoffs also exist in terms of cost, complexity, expandability, compatibility, and performance. Compiling for multiprocessors is much more challenging than for uniprocessors. Both granularity and communication latency play important roles in the code optimization and scheduling process.

PROGRAM PARTITIONING AND SCHEDULING

This section introduces the basic definitions of computational granularity or level of parallelism in programs. Communication latency and scheduling issues are illustrated with programming examples.

2.2.1 Grain Sizes and Latency


Grain size or granularity is a measure of the amount of computation involved in a software process. The simplest measure is to count the number of instructions in a grain (program segment). Grain size determines the basic program segment chosen for parallel processing. Grain sizes are commonly described as fine, medium, or coarse, depending on the processing levels involved.
Latency is a time measure of the communication overhead incurred between machine subsystems. For example, the memory latency is the time required by a processor to access the memory. The time required for two processes to synchronize with each other is called the synchronization latency. Computational granularity and communication latency are closely related, as we shall see below.
Parallelism has been exploited at various processing levels. As illustrated in Fig. 2.5, five levels of program execution represent different computational grain sizes and changing communication and control requirements. The lower the level, the finer the granularity of the software processes.
In general, the execution of a program may involve a combination of these levels. The actual combination depends on the application, formulation, algorithm, language, program, compilation support, and hardware characteristics. We characterize below the parallelism levels and review their implementation issues from the viewpoints of a programmer and of a compiler writer.

Instruction Level At the lowest level, a typical grain contains less than 20 instructions, called fine grain in Fig. 2.5. Depending on individual programs, fine-grain parallelism at this level may range from two to thousands. Butler et al. (1991) has shown that single-instruction-stream parallelism is greater than two. Wall (1991) finds that the average parallelism at instruction level is around five, rarely exceeding seven, in an ordinary program. For scientific applications, Kumar (1988) has measured the average parallelism in the range of 500 to 3000 Fortran statements executing concurrently in an idealized environment.
Fig. 2.5 Levels of parallelism in program execution on modern computers (reprinted from Hwang, Proc. IEEE, October 1987). Level 5: jobs or programs (coarse grain); Level 4: subprograms, job steps or related parts of a program (medium grain); Level 3: procedures, subroutines, tasks, or coroutines (medium grain); Level 2: nonrecursive loops or unfolded iterations (fine grain); Level 1: instructions or statements (fine grain). The figure also marks the trends in communication demand, scheduling overhead, and degree of parallelism across these levels.

The exploitation of fine-grain parallelism can be assisted by an optimizing compiler which should be able to automatically detect parallelism and translate the source code to a parallel form which can be recognized by the run-time system. Instruction-level parallelism can be detected and exploited within the processors, as we shall see in Chapter 12.

Loop Level This corresponds to the iterative loop operations. A typical loop contains less than 500 instructions. Some loop operations, if independent in successive iterations, can be vectorized for pipelined execution or for lock-step execution on SIMD machines. Some loop operations can be self-scheduled for parallel execution on MIMD machines.
Loop-level parallelism is often the most optimized program construct to execute on a parallel or vector computer. However, recursive loops are rather difficult to parallelize. Vector processing is mostly exploited at the loop level (level 2 in Fig. 2.5) by a vectorizing compiler. The loop level may also be considered a fine grain of computation.

Procedure Level This level corresponds to medium-grain parallelism at the task, procedural, subroutine, and coroutine levels. A typical grain at this level contains less than 2000 instructions. Detection of parallelism at this level is much more difficult than at the finer-grain levels. Interprocedural dependence analysis is much more involved and history-sensitive.
Communication requirement is often less compared with that required in MIMD execution mode. SPMD execution mode is a special case at this level. Multitasking also belongs in this category. Significant efforts by programmers may be needed to restructure a program at this level, and some compiler assistance is also needed.
Subprogram Level This corresponds to the level of job steps and related subprograms. The grain size may typically contain tens or hundreds of thousands of instructions. Job steps can overlap across different jobs. Subprograms can be scheduled for different processors in SPMD or MPMD mode, often on message-passing multicomputers.
Multiprogramming on a uniprocessor or on a multiprocessor is conducted at this level. Traditionally, parallelism at this level has been exploited by algorithm designers or programmers, rather than by compilers. Good compilers for exploiting medium- or coarse-grain parallelism require suitably designed parallel programming languages.

Job (Program) Level This corresponds to the parallel execution of essentially independent jobs (programs) on a parallel computer. The grain size can be as high as millions of instructions in a single program. For supercomputers with a small number of very powerful processors, such coarse-grain parallelism is practical. Job-level parallelism is handled by the program loader and by the operating system in general. Time-sharing or space-sharing multiprocessors explore this level of parallelism. In fact, both time and space sharing are extensions of multiprogramming.
To summarize, fine-grain parallelism is often exploited at instruction or loop levels, preferably assisted by a parallelizing or vectorizing compiler. Medium-grain parallelism at the task or job step demands significant roles for the programmer as well as compilers. Coarse-grain parallelism at the program level relies heavily on an effective OS and on the efficiency of the algorithm used. Shared-variable communication is often used to support fine-grain and medium-grain computations.
Message-passing multicomputers have been used for medium- and coarse-grain computations. Massive parallelism is often explored at the fine-grain level, such as data parallelism on SIMD or MIMD computers.
Communication Latency By balancing granularity and latency, one can achieve better performance of a computer system. Various latencies are attributed to machine architecture, implementing technology, and communication patterns involved. The architecture and technology affect the design choices for latency tolerance between subsystems. In fact, latency imposes a limiting factor on the scalability of the machine size. For example, over the years memory latency has increased with respect to processor cycle time. Various latency hiding or tolerating techniques will be studied in Chapters 9 and 12.
The latency incurred with interprocessor communication is another important parameter for a system designer to minimize. Besides signal delays in the data path, IPC latency is also affected by the communication patterns involved. In general, n tasks communicating with each other may require n(n-1)/2 communication links among them. Thus the complexity grows quadratically. This leads to a communication bound which limits the number of processors allowed in a large computer system.
Communication patterns are determined by the algorithms used as well as by the architectural support provided. Frequently encountered patterns include permutations and broadcast, multicast, and conference (many-to-many) communications. The communication demand may limit the granularity or parallelism. Very often tradeoffs do exist between the two.
The communication issue thus involves the reduction of latency or complexity, the prevention of deadlock, minimizing blocking in communication patterns, and the tradeoff between parallelism and communication overhead. We will study techniques that minimize communication latency, prevent deadlock, and optimize grain size in later chapters of the book.

2.2.2 Grain Packing and Scheduling


Two fundamental questions to ask in parallel programming are: (i) How can we partition a program into parallel branches, program modules, microtasks, or grains to yield the shortest possible execution time? and (ii) What is the optimal size of concurrent grains in a computation?
This grain-size problem demands determination of both the number and the size of grains (or microtasks) in a parallel program. Of course, the solution is both problem-dependent and machine-dependent. The goal is to produce a short schedule for fast execution of subdivided program modules.
There exists a tradeoff between parallelism and scheduling/synchronization overhead. The time complexity involves both computation and communication overheads. The program partitioning involves the algorithm designer, programmer, compiler, operating system support, etc. We describe below a grain packing approach introduced by Kruatrachue and Lewis (1988) for parallel programming applications.

Example 2.4 Program graph before and after grain packing (Kruatrachue and Lewis, 1988)
The basic concept of program partitioning is introduced below. In Fig. 2.6, we show an example program graph in two different grain sizes. A program graph shows the structure of a program. It is very similar to the dependence graph introduced in Section 2.1.1. Each node in the program graph corresponds to a computational unit in the program. The grain size is measured by the number of basic machine cycles (including both processor and memory cycles) needed to execute all the operations within the node.
We denote each node in Fig. 2.6 by a pair (n, s), where n is the node name (id) and s is the grain size of the node. Thus grain size reflects the number of computations involved in a program segment. Fine-grain nodes have a smaller grain size, and coarse-grain nodes have a larger grain size.

The edge label (v, d) between two end nodes specifies the output variable v from the source node or the input variable to the destination node, and the communication delay d between them. This delay includes all the path delays and memory latency involved.
There are 17 nodes in the fine-grain program graph (Fig. 2.6a) and 5 in the coarse-grain program graph (Fig. 2.6b). The coarse-grain node is obtained by combining (grouping) multiple fine-grain nodes. The fine grain corresponds to the following program:

Fig. 2.6 A program graph before and after grain packing in Example 2.4 (modified from Kruatrachue and Lewis, IEEE Software, Jan. 1988): (a) fine-grain program graph before packing; (b) coarse-grain program graph after packing. Each node is labeled (n, s) with node name n and grain size s; each edge is labeled with the variable passed and the communication delay.

Var a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q
Begin
1)  a := 1          10) j := e × f
2)  b := 2          11) k := d × f
3)  c := 3          12) l := j × k
4)  d := 4          13) m := 4 × l
5)  e := 5          14) n := 3 × m
6)  f := 6          15) o := n × i
7)  g := a × b      16) p := o × h
8)  h := c × d      17) q := p × g
9)  i := d × e
End
Nodes 1, 2, 3, 4, 5, and 6 are memory reference (data fetch) operations. Each takes one cycle to address and six cycles to fetch from memory. All remaining nodes (7 to 17) are CPU operations, each requiring two cycles to complete. After packing, the coarse-grain nodes have larger grain sizes ranging from 4 to 8 as shown.
The node (A, 8) in Fig. 2.6b is obtained by combining the nodes (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), and (11, 2) in Fig. 2.6a. The grain size, 8, of node A is the summation of all grain sizes (1 + 1 + 1 + 1 + 1 + 1 + 2 = 8) being combined.

The idea of grain packing is to apply fine grain first in order to achieve a higher degree of parallelism. Then one combines (packs) multiple fine-grain nodes into a coarse-grain node if it can eliminate unnecessary communication delays or reduce the overall scheduling overhead.
Usually, all fine-grain operations within a single coarse-grain node are assigned to the same processor for execution. Fine-grain partition of a program often demands more interprocessor communication than that required in a coarse-grain partition. Thus grain packing offers a tradeoff between parallelism and scheduling/communication overhead.
Internal delays among fine-grain operations within the same coarse-grain node are negligible because the communication delay is contributed mainly by interprocessor delays rather than by delays within the same processor. The choice of the optimal grain size is meant to achieve the shortest schedule for the nodes on a parallel computer system.
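The packing step itself can be expressed compactly. The Python sketch below is an illustration under assumed edge delays, not the authors' algorithm: a chosen group of fine-grain nodes is merged into one coarse-grain node, grain sizes are summed, edges internal to the group are discarded, and edges crossing the group boundary are retained.

    # Minimal sketch of grain packing: merge fine-grain nodes into one coarse node.
    # Grain sizes follow node A of Example 2.4; the edge delays are assumed values.
    grain_size = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 11: 2, 12: 2}
    edges = {                      # (src, dst): communication delay (assumed)
        (4, 11): 6, (6, 11): 6,    # internal once nodes 4, 6, 11 are packed together
        (11, 12): 6,               # crosses the new boundary, so it survives
    }

    def pack(group, name):
        """Return the packed node's grain size and its surviving external edges."""
        size = sum(grain_size[n] for n in group)
        external = {}
        for (src, dst), delay in edges.items():
            if src in group and dst in group:
                continue                       # internal delay eliminated
            key = (name if src in group else src, name if dst in group else dst)
            external[key] = delay
        return size, external

    size_A, edges_A = pack({1, 2, 3, 4, 5, 6, 11}, "A")
    print(size_A)    # 8, i.e. 1+1+1+1+1+1+2 as computed in the text
    print(edges_A)   # {('A', 12): 6}: only the boundary-crossing edge remains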

Fig. 2.7 Scheduling of the fine-grain and coarse-grain programs (arrows: idle time; shaded areas: communication delays): (a) fine grain (Fig. 2.6a), 42 time units; (b) coarse grain (Fig. 2.6b), 38 time units

With respect to the fine-grain versus coarse-grain program graphs in Fig. 2.6, two multiprocessor schedules are shown in Fig. 2.7. The fine-grain schedule is longer (42 time units) because more communication delays were included, as shown by the shaded area. The coarse-grain schedule is shorter (38 time units) because communication delays among nodes 12, 13, and 14 within the same node D (and also the delays among 15, 16, and 17 within the node E) are eliminated after grain packing.

2.2.3 Static Multiprocessor Scheduling


Grain packing may not always produce a shorter schedule. In general, dynamic multiprocessor scheduling is an NP-hard problem. Very often heuristics are used to yield suboptimal solutions. We introduce below the basic concepts behind multiprocessor scheduling using static schemes.

Node Duplication In order to eliminate the idle time and to further reduce the communication delays among processors, one can duplicate some of the nodes in more than one processor.
Figure 2.8a shows a schedule without duplicating any of the five nodes. This schedule contains idle time as well as long interprocessor delays (8 units) between P1 and P2. In Fig. 2.8b, node A is duplicated into A' and assigned to P2 besides retaining the original copy A in P1. Similarly, a duplicated node C' is copied into P1 besides the original node C in P2. The new schedule shown in Fig. 2.8b is almost 50% shorter than that in Fig. 2.8a. The reduction in schedule time is caused by elimination of the (a, 8) and (c, 8) delays between the two processors.

Fig. 2.8 Node-duplication scheduling to eliminate communication delays between processors (I: idle time; shaded areas: communication delays): (a) schedule without node duplication; (b) schedule with node duplication (A duplicated into A and A'; C duplicated into C and C')

Grain packing and node duplication are often used jointly to determine the best grain size and corresponding schedule. Four major steps are involved in the grain determination and the process of scheduling optimization (a minimal scheduling sketch follows the list of steps):
Step 1. Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Perform grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.
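The schedules of Steps 2 and 4 are typically produced by a list-scheduling heuristic. The Python sketch below is illustrative only; the task graph, grain sizes, and uniform communication delay are made-up values, not the book's example. Each node, taken in topological order, is placed on the processor that lets it start earliest, and a communication delay is charged whenever a predecessor was placed on a different processor.

    # Minimal list-scheduling sketch with assumed grain sizes and delays.
    grain = {"A": 4, "B": 1, "C": 1, "D": 6, "E": 6}
    preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
    comm  = 2                      # assumed uniform interprocessor delay
    num_procs = 2

    proc_free = [0] * num_procs    # time at which each processor becomes free
    finish, placed_on = {}, {}

    for node in ["A", "B", "C", "D", "E"]:     # already a topological order
        best = None
        for p in range(num_procs):
            # Node may start once p is free and all predecessor data has arrived.
            ready = proc_free[p]
            for q in preds[node]:
                arrival = finish[q] + (0 if placed_on[q] == p else comm)
                ready = max(ready, arrival)
            if best is None or ready < best[0]:
                best = (ready, p)
        start, p = best
        finish[node] = start + grain[node]
        placed_on[node] = p
        proc_free[p] = finish[node]

    print(placed_on)              # {'A': 0, 'B': 0, 'C': 0, 'D': 0, 'E': 1}
    print(max(finish.values()))   # 14, versus 18 if everything ran on one processor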
The purpose of multiprocessor scheduling is to obtain a minimal time schedule for the computations involved. The following example clarifies this concept.

Example 2.5 Program decomposition for static multiprocessor scheduling (Kruatrachue and Lewis, 1988)
Figure 2.9 shows an example of how to calculate the grain size and communication latency. In this example, two 2 × 2 matrices A and B are multiplied to compute the sum of the four elements in the resulting product matrix C = A × B. There are eight multiplications and seven additions to be performed in this program, as written below:

    [A11 A12]   [B11 B12]   [C11 C12]
    [A21 A22] × [B21 B22] = [C21 C22]
Fig. 2.9 Calculation of grain size and communication delay for the program graph in Example 2.5: (a) grain size calculation in M68000 assembly code at a 20-MHz clock; (b) calculation of the communication delay d = T1 + T2 + T3 + T4 + T5 + T6 = 20 + 20 + 32 + 20 + 20 + 100 = 212 cycles, where T3 is the 32-bit transmission time at 20 Mbps normalized to M68000 cycles at 20 MHz and T6 is the delay due to software protocols; (c) the resulting fine-grain program graph (Courtesy of Kruatrachue and Lewis; reprinted with permission from IEEE Software, 1988)

    C11 = A11 × B11 + A12 × B21
    C12 = A11 × B12 + A12 × B22
    C21 = A21 × B11 + A22 × B21
    C22 = A21 × B12 + A22 × B22
    Sum = C11 + C12 + C21 + C22
As shown in Fig. 2.9a, the eight multiplications are performed in eight multiply nodes, each of which has a grain size of 101 CPU cycles. The remaining seven additions are performed in a 3-level binary tree consisting of seven addition nodes. Each addition node requires 8 CPU cycles.
The interprocessor communication latency along all edges in the program graph is calculated as d = 212 cycles by adding all path delays between two communicating processors (Fig. 2.9b).
A fine-grain program graph is thus obtained in Fig. 2.9c. Note that the grain size and communication delay may vary with the different processors and communication links used in the system.
Figure 2.10 shows scheduling of the fine-grain program first on a sequential uniprocessor (P1) and then on an eight-processor (P1 to P8) system (Step 2). Based on the fine-grain graph (Fig. 2.9c), the sequential execution requires 864 cycles to complete without incurring any communication delay.
Figure 2.10b shows the reduced schedule of 741 cycles needed to execute the 15 nodes on 8 processors with incurred communication delays (shaded areas). Note that the communication delays have slowed down the parallel execution significantly, resulting in many processors idling (indicated by I), except for P1 which produces the final sum. A speedup factor of 864/741 = 1.16 is observed.
Fig. 2.10 Sequential versus parallel scheduling in Example 2.5: (a) a sequential schedule on one processor, 864 cycles; (b) a parallel schedule on eight processors, 741 cycles, with I marking idle time and shaded areas marking communication delays

Next we show how to use grain packing (Step 3) to reduce the communication overhead. As shown in Fig. 2.11, we group the nodes in the top two levels into four coarse-grain nodes labeled V, W, X, and Y. The remaining three nodes (N, O, P) then form the fifth node Z. Note that there is only one level of interprocessor communication required, as marked by d in Fig. 2.11a.

Fig. 2.11 Parallel scheduling for Example 2.5 after grain packing to reduce communication delays: (a) grain packing of 15 small nodes into 5 bigger nodes, with V = W = X = Y = 101 + 101 + 8 = 210 and Z = 8 + 8 + 8 = 24; (b) parallel schedule for the packed program

Since the maximum degree of parallelism is now reduced to 4 in the program graph, we use only four processors to execute this coarse-grain program. A parallel schedule is worked out (Fig. 2.11) for this program in 446 cycles, resulting in an improved speedup of 864/446 = 1.94.
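The cycle counts and speedups quoted in this example follow from a few lines of arithmetic; the sketch below simply recomputes them from the grain sizes and schedule lengths given in the text.

    # Recompute the figures quoted in Example 2.5.
    multiply_cycles, add_cycles = 101, 8
    sequential = 8 * multiply_cycles + 7 * add_cycles    # 864 cycles on one processor

    fine_grain_parallel   = 741    # 8 processors, with communication delays (Fig. 2.10b)
    coarse_grain_parallel = 446    # 4 processors, after grain packing (Fig. 2.11b)

    print(sequential)                                    # 864
    print(round(sequential / fine_grain_parallel, 2))    # 1.17, the 1.16 quoted (truncated)
    print(round(sequential / coarse_grain_parallel, 2))  # 1.94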

PROGRAM FLOW MECHANISMS
Conventional computers are based on a control flow mechanism by which the order of program execution is explicitly stated in the user programs. Dataflow computers are based on a data-driven mechanism which allows the execution of any instruction to be driven by data (operand) availability. Dataflow computers emphasize a high degree of parallelism at the fine-grain instructional level. Reduction computers are based on a demand-driven mechanism which initiates an operation based on the demand for its results by other computations.

2.3.1 Control Flow Versus Data Flow


Conventional von Neumann computers use a program counter (PC) to sequence the execution of instructions in a program. The PC is sequenced by instruction flow in a program. This sequential execution style has been called control-driven, as program flow is explicitly controlled by programmers.
A uniprocessor computer is inherently sequential, due to use of the control-driven mechanism. However, control flow can be made parallel by using parallel language constructs or parallel compilers. In this book, we study primarily parallel control-flow computers and their programming techniques. Until the data-driven or demand-driven mechanism is proven to be cost-effective, the control-flow approach will continue to dominate the computer industry.

In a dataflow computer, the execution of an instruction is driven by data availability instead of being guided by a program counter. In theory, any instruction should be ready for execution whenever operands become available. The instructions in a data-driven program are not ordered in any way. Instead of being stored separately in a main memory, data are directly held inside instructions.
Computational results (data tokens) are passed directly between instructions. The data generated by an instruction will be duplicated into many copies and forwarded directly to all needy instructions. Data tokens, once consumed by an instruction, will no longer be available for reuse by other instructions.
This data-driven scheme requires no program counter and no control sequencer. However, it requires special mechanisms to detect data availability, to match data tokens with needy instructions, and to enable the chain reaction of asynchronous instruction executions. No memory sharing between instructions results in no side effects.
Asynchrony implies the need for handshaking or token-matching operations. A pure dataflow computer exploits fine-grain parallelism at the instruction level. Massive parallelism would be possible if the data-driven mechanism could be cost-effectively implemented with low instruction execution overhead.
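The firing rule just described, namely that an instruction executes as soon as all of its operand tokens are present, is easy to capture in a toy interpreter. The Python sketch below is illustrative only: the three-instruction graph and its inputs are invented, and no claim is made about any real dataflow machine.

    # Toy data-driven execution: an instruction fires when all its operand tokens
    # are available; its result token is forwarded to consumer instructions.
    # The graph computes q = (a + b) * (b - c) for illustrative inputs.
    import operator

    graph = {
        "add": {"op": operator.add, "inputs": ["a", "b"], "output": "s"},
        "sub": {"op": operator.sub, "inputs": ["b", "c"], "output": "t"},
        "mul": {"op": operator.mul, "inputs": ["s", "t"], "output": "q"},
    }

    tokens = {"a": 2, "b": 5, "c": 1}          # initial data tokens
    fired = set()

    while len(fired) < len(graph):
        for name, instr in graph.items():
            ready = name not in fired and all(v in tokens for v in instr["inputs"])
            if ready:                           # firing rule: all operands present
                args = [tokens[v] for v in instr["inputs"]]
                tokens[instr["output"]] = instr["op"](*args)
                fired.add(name)                 # result token forwarded to consumers

    print(tokens["q"])                          # (2 + 5) * (5 - 1) = 28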
A Dataflow Architecture There have been quite a few experimental dataflow computer projects. Arvind and his associates at MIT developed a tagged-token architecture for building dataflow computers. As shown in Fig. 2.12, the global architecture consists of n processing elements (PEs) interconnected by an n × n routing network. The entire system supports pipelined dataflow operations in all n PEs. Inter-PE communications are done through the pipelined routing network.

Fig. 2.12 The MIT tagged-token dataflow computer (adapted from Arvind and Iannucci, 1986, with permission): (a) the global architecture, with n processing elements (PEs) interconnected by an n × n routing network; (b) interior design of a processing element

Within each PE, the machine provides a low-level token-matching mechanism which dispatches only those instructions whose input data (tokens) are already available. Each datum is tagged with the address of the instruction to which it belongs and the context in which the instruction is being executed. Instructions are stored in the program memory. Tagged tokens enter the PE through a local path. The tokens can also be passed to other PEs through the routing network. All internal token circulation operations are pipelined without blocking.
One can think of the instruction address in a dataflow computer as replacing the program counter, and the context identifier replacing the frame base register in a control flow computer. It is the machine's job to match up data with the same tag to needy instructions. In so doing, new data will be produced with a new tag indicating the successor instruction(s). Thus, each instruction represents a synchronization operation. New tokens are formed and circulated along the PE pipeline for reuse or to other PEs through the global path, which is also pipelined.
Another synchronization mechanism, called the I-structure, is provided within each PE. The I-structure is a tagged memory unit for overlapped usage of a data structure by both the producer and consumer processes. Each word of I-structure uses a 2-bit tag indicating whether the word is empty, is full, or has pending read requests. The use of I-structure is a retreat from the pure dataflow approach. The purpose is to reduce excessive copying of large data structures in dataflow operations.

Example 2.6 Comparison of dataflow and control-flow computers (Gajski, Padua, Kuck, and Kuhn, 1982)

The dataflow graph in Fig. 2.13a shows that 24 instructions are to be executed (8 divides, 8 multiplies, and 8 adds). A dataflow graph is similar to a dependence graph or program graph. The only difference is that data tokens are passed around the edges in a dataflow graph. Assume that each add, multiply, and divide requires 1, 2, and 3 cycles to complete, respectively. Sequential execution of the 24 instructions on a control flow uniprocessor takes 48 cycles to complete, as shown in Fig. 2.13b.
On the other hand, a dataflow multiprocessor completes the execution in 14 cycles in Fig. 2.13c. Assume that all the external inputs (di, ei, fi for i = 1, 2, ..., 8, and c0) are available before entering the loop. With four processors, instructions a1, a2, a3, and a4 are all ready for execution in the first three cycles. The results produced then trigger the execution of a5, b1, a6, and a7 starting from cycle 4. The data-driven chain reactions are shown in Fig. 2.13c. The output c8 is the last one to produce, due to its dependence on all the previous ci's.
Figure 2.13d shows the execution of the same set of computations on a conventional multiprocessor using shared memory to hold the intermediate results (si and ti for i = 1, 2, 3, 4). Note that no shared memory is used in the dataflow implementation. The example does not show any time advantage of dataflow execution over control flow execution.
The theoretical minimum time is 13 cycles along the critical path a1 b1 c1 c2 ... c8. The chain reaction control in dataflow is more difficult to implement and may result in longer overhead, as compared with the uniform operations performed by all the processors in Fig. 2.13d.

1"P"1d~ 9-f ‘*1 "152 9243 93d-ti ‘*4 do "5 dc “ed? “Wis °s
cD=0
iorlfromltofldo
h I
232.; 312 33 34 35 36 3? 36
bl; .-. f1 f f f4 f fa f? f

e
"cg: .l?_~_p- f-
_¢-
+ —-In
'1-|'*2"-Baht: be babtbe
Otllfll-.l'l3, U, C ' f1 1:3 Cd 13-5 gs‘ QB’

[ajfitsampieptogramandttsdataflowgraph

1 4 6 7? 1|} 12 #3 46 4-B
I *1 I b1l*=1l as I be Isl
[bi Sequential execution on a unlpro-oeesor In #8 cycles

4 (7 1491011121314
*1 I *5 l°1|°2|°al°#l°5l°sl°t|e°B|

£1ti’ f.
K J
.%.
E‘
Zr .%.
gr_
ha‘ mg‘E
at . _.

[c] Data-driven exotatflon on a 4-1:-1'0-oemor dataflow computer In 14 cyctes

T 9 11 121314
31 as '11 I *5 |$1|l1|'=1|°5| $1=t’2*t*1-*1='*3"$1-‘=1=b1*“o-°5=*’s*°4
no ‘*6 l [_$2[l2|_°;|°t-§~| @'2=b4*b3-*2==-1"52-¢2=51*'=o-¢c=$s*°=1
‘*3;.N 3? D3] ll? l"_3[_‘a|°3]°?| 53:56‘be-la=t'?"5a~°3=‘1*°o-°?=‘a*°4
-‘viviji1
‘*4 ‘iilk- “B ?if mu’ |s4ll4|°4|°B| *4=be*t'?-*4=s=1*$3'°=1=l2"°o-‘=e=l4"°4
tjd] Parallel execution on a shared-memory 4-pro-oossor system in H cycles

Fig. 2.13 Comparison between datiflerw and control-flow computers [adapted from Grajsltl. Pacl1.n,K.u|:lt. and
Kuhn, ‘E952; reprinted with permission from IEEE Computer. Feb. 1931}

One advantage of tagging each datum is that data from different contexts can be mixed freely in the instruction execution pipeline. Thus, instruction-level parallelism of dataflow graphs can absorb the communication latency and minimize the losses due to synchronization waits. Besides token matching and I-structure, compiler technology is also needed to generate dataflow graphs for tagged-token dataflow computers. The dataflow architecture offers in theory a promising model for massively parallel computations because all far-reaching side effects are removed. However, implementation of these concepts on a commercial scale has proved to be very difficult.
2.3.2 Demand-Driven Mechanisms
In a reduction machine, the computation is triggered by the demand for an operation's result. Consider the evaluation of a nested arithmetic expression a = ((b + 1) × c − (d ÷ e)). The data-driven computation seen above chooses a bottom-up approach, starting from the innermost operations b + 1 and d ÷ e, then proceeding to the × operation, and finally to the outermost operation −. Such a computation has been called eager evaluation because operations are carried out immediately after all their operands become available.
A demand-driven computation chooses a top-down approach by first demanding the value of a, which triggers the demand for evaluating the next-level expressions (b + 1) × c and d ÷ e, which in turn triggers the demand for evaluating b + 1 at the innermost level. The results are then returned to the nested demander in the reverse order before a is evaluated.
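The top-down demand propagation just described can be mimicked with deferred computations (thunks). The Python sketch below is purely illustrative: each subexpression is wrapped in a zero-argument function that is evaluated only when its value is demanded, and the input values are arbitrary.

    # Minimal sketch of demand-driven evaluation of a = ((b + 1) * c - (d / e)).
    # Each subexpression is a thunk that runs only when a consumer demands it.
    b, c, d, e = 4, 3, 10, 2

    inner_sum = lambda: b + 1                     # demanded by the * node
    product   = lambda: inner_sum() * c           # demanded by the - node
    quotient  = lambda: d / e                     # demanded by the - node
    a         = lambda: product() - quotient()    # outermost demand

    # Nothing has been computed yet; the demand for a() propagates top-down,
    # triggering the nested demands in the order described in the text.
    print(a())    # ((4 + 1) * 3) - (10 / 2) = 10.0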
A demand-driven computation corresponds to lazy evaluation, because operations are executed only when their results are required by another instruction. The demand-driven approach matches naturally with the functional programming concept. The removal of side effects in functional programming makes programs easier to parallelize. There are two types of reduction machine models, both having a recursive control mechanism as characterized below.
Reduction Machine Models In a string reduction model, each demander gets a separate copy of the expression for its own evaluation. A long string expression is reduced to a single value in a recursive fashion. Each reduction step has an operator followed by an embedded reference to demand the corresponding input operands. The operator is suspended while its input arguments are being evaluated. An expression is said to be fully reduced when all the arguments have been replaced by literal values.
In a graph reduction model, the expression is represented as a directed graph. The graph is reduced by evaluation of branches or subgraphs. Different parts of a graph or subgraphs can be reduced or evaluated in parallel upon demand. Each demander is given a pointer to the result of the reduction. The demander manipulates all references to that graph.
Graph manipulation is based on sharing the arguments using pointers. This traversal of the graph and reversal of the references are continued until constant arguments are encountered. This proceeds until the value of a is determined and a copy is returned to the original demanding instruction.

2.3.3 Comparison of Flow Mechanisms


Control-flow, dataflow, and reduction computer architectures are compared in Table 2.1. The degree of explicit control decreases from control-driven to demand-driven to data-driven. Highlighted in the table are the differences between eager evaluation and lazy evaluation in data-driven and demand-driven computers, respectively.
Furthermore, control tokens and demand tokens are used in control-flow computers and reduction machines, respectively. The listed advantages and disadvantages of the dataflow and reduction machine models are based on research findings rather than on extensive operational experience.
Even though the conventional von Neumann model has many disadvantages, the industry is still building computers following the control-flow model. The choice was based on cost-effectiveness, marketability, and the narrow windows of competition used by the industry. Program flow mechanisms dictate architectural choices. Both dataflow and reduction models, despite a higher potential for parallelism, are still concepts in the research stage. Control-flow machines still dominate the market.

Table 2.1 Control-Flow, Dataflow, and Reduction Computers

Machine Model: Control Flow (control-driven)
  Basic Definition: Conventional computation; token of control indicates when a statement should be executed.
  Advantages: Full control; the most successful model for commercial products; complex data and control structures are easily implemented.
  Disadvantages: In theory, less efficient than the other two; difficult in preventing run-time errors.

Machine Model: Dataflow (data-driven)
  Basic Definition: Eager evaluation; statements are executed when all of their operands are available.
  Advantages: Very high potential for parallelism; high throughput; free from side effects.
  Disadvantages: Time lost waiting for unneeded arguments; high control overhead; difficult in manipulating data structures.

Machine Model: Reduction (demand-driven)
  Basic Definition: Lazy evaluation; statements are executed only when their result is required for another computation.
  Advantages: Only required instructions are executed; high degree of parallelism; easy manipulation of data structures.
  Disadvantages: Does not support sharing of objects with changing local state; time needed to propagate demand tokens.

(Courtesy of Wah, Lowrie, and Li; reprinted with permission from Computers for Artificial Intelligence Processing, edited by Wah and Ramamoorthy, Wiley and Sons, Inc., 1990)

In this book, we study mostly control-flow parallel computers. But dataflow and multithreaded architectures will be further studied in Chapter 9. Dataflow or hybrid von Neumann and dataflow machines offer design alternatives; stream processing (see Chapter 13) can be considered an example.
As far as innovative computer architecture is concerned, the dataflow or hybrid models cannot be ignored. Both the Electrotechnical Laboratory (ETL) in Japan and the Massachusetts Institute of Technology have paid attention to these approaches. The book edited by Gaudiot and Bic (1991) provides details of some development on dataflow computers in that period.

SYSTEM INTERCONNECT ARCHITECTURES


Static and dynamic networks for interconnecting computer subsystems or for constructing multiprocessors or multicomputers are introduced below. We study first the distinctions between direct networks for static connections and indirect networks for dynamic connections. These networks can be used for internal connections among processors, memory modules, and I/O adaptors in a centralized system, or for distributed networking of multicomputer nodes.
Various topologies for building networks are specified below. Then we focus on the communication properties of interconnection networks. These include latency analysis, bisection bandwidth, and data-routing functions. Finally, we analyze the scalability of parallel architecture in solving scaled problems.
The communication efficiency of the underlying network is critical to the performance of a parallel computer. What we hope to achieve is a low-latency network with a high data transfer rate and thus a wide communication bandwidth. These network properties help make design choices for machine architecture.

2.4.1 Network Properties and Routing


The topology of an interconnection network can be either static or dynamic. Static networks are formed of point-to-point direct connections which will not change during program execution. Dynamic networks are implemented with switched channels, which are dynamically configured to match the communication demand in user programs. Packet switching and routing is playing an important role in modern multiprocessor architecture, which is discussed in Chapter 13; the basic concepts are discussed in Chapter 7.
Static networks are used for fixed connections among subsystems of a centralized system or multiple computing nodes of a distributed system. Dynamic networks include buses, crossbar switches, multistage networks, and routers which are often used in shared-memory multiprocessors. Both types of networks have also been implemented for inter-PE data routing in SIMD computers.
Before we discuss various network topologies, let us define several parameters often used to estimate the complexity, communication efficiency, and cost of a network. In general, a network is represented by the graph of a finite number of nodes linked by directed or undirected edges. The number of nodes in the graph is called the network size.

Node Degree and Network Diameter  The number of edges (links or channels) incident on a node is called the node degree d. In the case of unidirectional channels, the number of channels into a node is the in-degree, and that out of a node is the out-degree. The node degree is then the sum of the two. The node degree reflects the number of I/O ports required per node, and thus the cost of a node. Therefore, the node degree should be kept a (small) constant in order to reduce cost. A constant node degree helps to achieve modularity in building blocks for scalable systems.
The diameter D of a network is the maximum shortest path between any two nodes. The path length is measured by the number of links traversed. The network diameter indicates the maximum number of distinct hops between any two nodes, thus providing a figure of communication merit for the network. Therefore, the network diameter should be as small as possible from a communication point of view.

Bisection Width  When a given network is cut into two equal halves, the minimum number of edges (channels) along the cut is called the channel bisection width b. In the case of a communication network, each edge may correspond to a channel with w bit wires. Then the wire bisection width is B = bw. This parameter B reflects the wiring density of a network. When B is fixed, the channel width (in bits) is w = B/b. Thus the bisection width provides a good indicator of the maximum communication bandwidth along the bisection of a network.
Another quantitative parameter is the wire length (or channel length) between nodes. This may affect the signal latency, clock skew, or power requirements. We label a network symmetric if the topology looks the same from any node. Symmetric networks are easier to implement or to program. Whether the nodes are homogeneous, the channels are buffered, or some of the nodes are switches, are some other useful properties for characterizing the structure of a network.
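To make these definitions concrete, the short Python sketch below (our own illustration; the function names are not from the text) builds an 8-node ring as an adjacency list and computes its node degree, its diameter by breadth-first search, and its channel bisection width by brute force. The expected results, d = 2, D = 4 and b = 2, agree with the ring entry of Table 2.2.

from collections import deque
from itertools import combinations

def ring(n):
    # adjacency list of an n-node bidirectional ring
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def degree(adj):
    # maximum node degree d
    return max(len(v) for v in adj.values())

def diameter(adj):
    # maximum over all node pairs of the shortest-path length in hops
    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist
    return max(max(bfs(s).values()) for s in adj)

def bisection_width(adj):
    # brute force over all equal halves; exponential, so only for small networks
    nodes = list(adj)
    best = None
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        cut = sum(1 for u in adj for v in adj[u] if u < v and (u in half) != (v in half))
        best = cut if best is None else min(best, cut)
    return best

r = ring(8)
print(degree(r), diameter(r), bisection_width(r))   # 2 4 2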
Data-Routing Functions  A data-routing network is used for inter-PE data exchange. This routing network can be static, such as the hypercube routing network used in the TMC/CM-2, or dynamic, such as the multistage network used in the IBM GF11. In the case of a multicomputer network, the data routing is achieved through message passing. Hardware routers are used to route messages among multiple computer nodes.
We specify below some primitive data-routing functions implementable on an inter-PE routing network. The versatility of a routing network will reduce the time needed for data exchange and thus can significantly improve the system performance.
Commonly seen data-routing functions among the PEs include shifting, rotation, permutation (one-to-one), broadcast (one-to-all), multicast (one-to-many), shuffle, exchange, etc. These routing functions can be implemented on ring, mesh, hypercube, or multistage networks.
Permutations  For n objects, there are n! permutations by which the n objects can be reordered. The set of all permutations forms a permutation group with respect to the composition operation. One can use cycle notation to specify a permutation function.
For example, the permutation π = (a, b, c)(d, e) stands for the bijection mapping a → b, b → c, c → a, d → e, and e → d in a circular fashion. The cycle (a, b, c) has a period of 3, and the cycle (d, e) a period of 2. Combining the two cycles, the permutation π has a period of 2 × 3 = 6. If one applies the permutation π six times, the identity mapping I = (a)(b)(c)(d)(e) is obtained.
One can use a crossbar switch to implement the permutation in connecting n PEs among themselves. Multistage networks can implement some of the permutations in one or multiple passes through the network. Permutations can also be implemented with shifting or broadcast operations. The permutation capability of a network is often used to indicate the data-routing capability. When n is large, the permutation speed often dominates the performance of a data-routing network.
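As a small worked illustration of the cycle notation (our own sketch, not part of the original text), the Python fragment below applies π = (a, b, c)(d, e) and verifies that its period equals the least common multiple of its cycle lengths, 2 × 3 = 6.

from math import lcm   # available in Python 3.9 and later

pi = {'a': 'b', 'b': 'c', 'c': 'a', 'd': 'e', 'e': 'd'}   # the permutation (a, b, c)(d, e)

def period(perm):
    # the smallest m such that applying perm m times yields the identity,
    # which equals the lcm of the cycle lengths
    lengths, seen = [], set()
    for x in perm:
        if x in seen:
            continue
        count, y = 0, x
        while True:
            seen.add(y)
            y = perm[y]
            count += 1
            if y == x:
                break
        lengths.append(count)
    return lcm(*lengths)

print(period(pi))   # 6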

Perfect Shuffle and Exchange  Perfect shuffle is a special permutation function suggested by Harold Stone (1971) for parallel processing applications. The mapping corresponding to a perfect shuffle is shown in Fig. 2.14a. Its inverse is shown on the right-hand side (Fig. 2.14b).

Fig. 2.14  Perfect shuffle and its inverse mapping over eight objects: (a) perfect shuffle; (b) inverse perfect shuffle (Courtesy of H. Stone; reprinted with permission from IEEE Trans. Computers, 1971)

In general, to shuffle n = 2^k objects evenly, one can express each object in the domain by a k-bit binary number x = (x_{k-1}, ..., x_1, x_0). The perfect shuffle maps x to y, where y = (x_{k-2}, ..., x_1, x_0, x_{k-1}) is obtained from x by shifting 1 bit to the left and wrapping around the most significant to the least significant position.
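Viewed on the k-bit addresses, the perfect shuffle is a cyclic left rotation and its inverse a cyclic right rotation. The brief Python sketch below (our own, for illustration only) reproduces the eight-object mapping of Fig. 2.14.

def shuffle(x, k):
    # cyclic left shift of the k-bit address of x
    return ((x << 1) | (x >> (k - 1))) & ((1 << k) - 1)

def unshuffle(x, k):
    # cyclic right shift: the inverse perfect shuffle
    return ((x >> 1) | ((x & 1) << (k - 1))) & ((1 << k) - 1)

k = 3
for x in range(1 << k):
    print(f"{x:03b} -> {shuffle(x, k):03b}")   # 000->000, 001->010, 010->100, ..., 111->111
assert all(unshuffle(shuffle(x, k), k) == x for x in range(1 << k))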
Hypercube Routing Functions  A three-dimensional binary cube network is shown in Fig. 2.15. Three routing functions are defined by the three bits in the node address. For example, one can exchange the data between adjacent nodes which differ in the least significant bit C0, as shown in Fig. 2.15b.
Similarly, two other routing patterns can be obtained by checking the middle bit C1 (Fig. 2.15c) and the most significant bit C2 (Fig. 2.15d), respectively. In general, an n-dimensional hypercube has n routing functions, defined by each bit of the n-bit address. These data-exchange functions can be used in routing messages in a hypercube multicomputer.
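The sketch below (an illustrative Python fragment of our own) enumerates the exchange pairs of routing function C_i on a 3-cube and traces an XOR-based path that corrects one differing address bit per hop.

def cube_exchange(n, i):
    # routing function C_i: each node x exchanges data with node x XOR 2**i
    return [(x, x ^ (1 << i)) for x in range(1 << n) if x < x ^ (1 << i)]

def hypercube_route(src, dst):
    # correct the differing bits one dimension at a time (lowest dimension first)
    path, cur, diff = [src], src, src ^ dst
    for i in range(diff.bit_length()):
        if diff & (1 << i):
            cur ^= 1 << i
            path.append(cur)
    return path

print(cube_exchange(3, 0))            # C0 pairs: (0, 1), (2, 3), (4, 5), (6, 7)
print(hypercube_route(0b000, 0b110))  # [0, 2, 6]: two hops, across dimensions C1 and C2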
Fig. 2.15  Three routing functions defined by a binary 3-cube: (a) a 3-cube with nodes denoted as C2C1C0 in binary; (b) routing by least significant bit, C0; (c) routing by middle bit, C1; (d) routing by most significant bit, C2

Broadcast and Multicast  Broadcast is a one-to-all mapping. This can be easily achieved in an SIMD computer using a broadcast bus extending from the array controller to all PEs. A message-passing multicomputer also has mechanisms to broadcast messages. Multicast corresponds to a mapping from one PE to other PEs (one to many).
Broadcast is often treated as a global operation in a multicomputer. Multicast has to be implemented with matching of destination codes in the network.

Network Performance  To summarize the above discussions, the performance of an interconnection network is affected by the following factors:

(1) Functionality: This refers to how the network supports data routing, interrupt handling, synchronization, request/message combining, and coherence.
(2) Network latency: This refers to the worst-case time delay for a unit message to be transferred through the network.
(3) Bandwidth: This refers to the maximum data transfer rate, in terms of Mbytes/s or Gbytes/s, transmitted through the network.
(4) Hardware complexity: This refers to implementation costs such as those for wires, switches, connectors, arbitration, and interface logic.
(5) Scalability: This refers to the ability of a network to be modularly expandable with a scalable performance with increasing machine resources.

2.4.2 Static Connection Networks


Static networks use direct links which are fixed once built. This type of network is more suitable for building computers where the communication patterns are predictable or implementable with static connections. We describe their topologies below in terms of network parameters and comment on their relative merits in relation to communication and scalability.

Linear Array  This is a one-dimensional network in which N nodes are connected by N - 1 links in a line (Fig. 2.16a). Internal nodes have degree 2, and the terminal nodes have degree 1. The diameter is N - 1, which is rather long for large N. The bisection width is b = 1. Linear arrays are the simplest connection topology. The structure is not symmetric and poses a communication inefficiency when N becomes very large.
For N = 2, it is clearly simple and economical to implement a linear array. As the diameter increases linearly with respect to N, it should not be used for large N. It should be noted that a linear array is very different from a bus, which is time-shared through switching among the many nodes attached to it. A linear array allows concurrent use of different sections (channels) of the structure by different source and destination pairs.
Ring and Chordal Ring  A ring is obtained by connecting the two terminal nodes of a linear array with one extra link (Fig. 2.16b). A ring can be unidirectional or bidirectional. It is symmetric with a constant node degree of 2. The diameter is ⌊N/2⌋ for a bidirectional ring, and N - 1 for a unidirectional ring.
The IBM token ring had this topology, in which messages circulate along the ring until they reach the destination with a matching ID. Pipelined or packet-switched rings have been implemented in the CDC Cyberplus multiprocessor (1985) and in the KSR-1 computer system (1992) for interprocessor communications.
By increasing the node degree from 2 to 3 or 4, we obtain two chordal rings, as shown in Figs. 2.16c and 2.16d, respectively. One and two extra links are added to produce the two chordal rings, respectively. In general, the more links added, the higher the node degree and the shorter the network diameter.
Comparing the 16-node ring (Fig. 2.16b) with the two chordal rings (Figs. 2.16c and 2.16d), the network diameter drops from 8 to 5 and to 3, respectively. In the extreme, the completely connected network in Fig. 2.16f has a node degree of 15 with the shortest possible diameter of 1.
Barrel Shifter  As shown in Fig. 2.16e for a network of N = 16 nodes, the barrel shifter is obtained from the ring by adding extra links from each node to those nodes having a distance equal to an integer power of 2. This implies that node i is connected to node j if |j - i| = 2^r for some r = 0, 1, 2, ..., n - 1, where the network size is N = 2^n. Such a barrel shifter has a node degree of d = 2n - 1 and a diameter D = n/2.
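A quick way to check these numbers (our own Python sketch, not from the text) is to enumerate the neighbours of a node: for N = 16 the set has 2n - 1 = 7 members, as stated above.

def barrel_neighbors(i, n):
    # node i of an N = 2**n node barrel shifter connects to the nodes at distance 2**r (mod N)
    N = 1 << n
    nbrs = set()
    for r in range(n):
        nbrs.add((i + (1 << r)) % N)
        nbrs.add((i - (1 << r)) % N)
    return sorted(nbrs)

nbrs = barrel_neighbors(0, 4)
print(nbrs, len(nbrs))   # [1, 2, 4, 8, 12, 14, 15] 7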
Fig. 2.16  Linear array, ring, chordal rings of degrees 3 and 4 (the degree-4 chordal ring is the same as the Illiac mesh), barrel shifter, and completely connected network

Obviously, the connectivity in the barrel shifter is increased over that of any chordal ring of lower node degree. For N = 16, the barrel shifter has a node degree of 7 with a diameter of 2. But the barrel shifter complexity is still much lower than that of the completely connected network (Fig. 2.16f).
Tree and Star  A binary tree of 31 nodes in five levels is shown in Fig. 2.17a. In general, a k-level, completely balanced binary tree should have N = 2^k - 1 nodes. The maximum node degree is 3 and the diameter is 2(k - 1). With a constant node degree, the binary tree is a scalable architecture. However, the diameter is rather long.
The star is a two-level tree with a high node degree at the central node of d = N - 1 (Fig. 2.17b) and a small constant diameter of 2. A DADO multiprocessor was built at Columbia University (1987) with a 10-level binary tree of 1023 nodes. The star architecture has been used in systems with a centralized supervisor node.
Fat Tree  The conventional tree structure used in computer science can be modified to become the fat tree, as introduced by Leiserson in 1985. A binary fat tree is shown in Fig. 2.17c. The channel width of a fat tree increases as we ascend from leaves to the root. The fat tree is more like a real tree in that branches get thicker toward the root.

I" " '

it; Ti
r

.-*-‘_‘\. _/-2
"-- r_.-’ {
/,,.
_'_'-tlr.__ '_‘;u" r.
.;-.___ O‘-._t4-.'\ )"1
c.~—"'“- . .,_ :j::cr’_ -.r ._,_-_':1L_
-__~ '____=tr ._'I-___\_._ 1-,,
"-1‘-"I -3- I1‘!
. -I1
_,._,_ ‘____.-T1r -“-_ ‘___-.Ji'
._1 _. ._.I:t____ CI Gib
-_r ~_ ii;
.-1' U’ -— Q},-3-
-'7.
Q‘-‘I . D--.
Q
' fr-".-"l'_‘ ‘i1’.1.-_ '' :1-‘ 1!'+___ 3}:
- .>___j_ _r_0-—-..
R3-__. .')-"fitt- ._. C1
(alfiirurytree fbifitar {ciflluryfattroo

Flg.1.1‘l' Tree. star.arrd far tree

One of the major problems in using the conventional binary tree is the bottleneck toward the root, since the traffic toward the root becomes heavier. The fat tree has been proposed to alleviate the problem. The idea of a fat tree was applied in the Connection Machine CM-5, to be studied in Chapter 8. The idea of binary fat trees can also be extended to multiway fat trees.
Mesh and Torus  A 3 × 3 example mesh network is shown in Fig. 2.18a. The mesh is a frequently used architecture which has been implemented in the Illiac IV, MPP, DAP, and Intel Paragon with variations.
In general, a k-dimensional mesh with N = n^k nodes has an interior node degree of 2k and the network diameter is k(n - 1). Note that the pure mesh as shown in Fig. 2.18a is not symmetric. The node degrees at the boundary and corner nodes are 3 or 2.
Figure 2.18b shows a variation of the mesh obtained by allowing wraparound connections. The Illiac IV assumed an 8 × 8 mesh with a constant node degree of 4 and a diameter of 7. The Illiac mesh is topologically equivalent to a chordal ring of degree 4, as shown in Fig. 2.16d for an N = 9 = 3 × 3 configuration.
In general, an n × n Illiac mesh should have a diameter of d = n - 1, which is only half of the diameter of a pure mesh. The torus shown in Fig. 2.18c can be viewed as another variant of the mesh with an even shorter diameter. This topology combines the ring and mesh and extends to higher dimensions.
The torus has ring connections along each row and along each column of the array. In general, an n × n torus has a node degree of 4 and a diameter of 2⌊n/2⌋. The torus is a symmetric topology. The added wraparound connections help reduce the diameter by one-half from that of the mesh.
Systolic Arrays  This is a class of multidimensional pipelined array architectures designed for implementing fixed algorithms. What is shown in Fig. 2.18d is a systolic array specially designed for performing matrix multiplication. The interior node degree is 6 in this example.
In general, static systolic arrays are pipelined with multidirectional flow of data streams. The commercial machine Intel iWarp system (Annaratone et al., 1986) was designed with a systolic architecture. The systolic array has become a popular research area ever since its introduction by Kung and Leiserson in 1978.
Fig. 2.18  Mesh, Illiac mesh, torus, and systolic array

With fixed interconnection and synchronous operation, a systolic array matches the communication structure of the algorithm. For special applications like signal/image processing, systolic arrays may offer a better performance/cost ratio. However, the structure has limited applicability and can be very difficult to program. Since this book emphasizes general-purpose computing, we will not study systolic arrays further. Interested readers may refer to the book by S.Y. Kung (1988) for using systolic and wavefront architectures in building VLSI array processors.
Hypercubes  This is a binary n-cube architecture which has been implemented in the iPSC, nCUBE, and CM-2 systems. In general, an n-cube consists of N = 2^n nodes spanning along n dimensions, with two nodes per dimension. A 3-cube with 8 nodes is shown in Fig. 2.19a.
A 4-cube can be formed by interconnecting the corresponding nodes of two 3-cubes, as illustrated in Fig. 2.19b. The node degree of an n-cube equals n and so does the network diameter. In fact, the node degree increases linearly with respect to the dimension, making it difficult to consider the hypercube a scalable architecture.
The binary hypercube was a very popular architecture for research and development in the 1980s. The Intel iPSC/1, iPSC/2, and nCUBE machines were all built with the hypercube architecture. The architecture has dense connections. Many other architectures, such as binary trees, meshes, etc., can be embedded in the hypercube.
With poor scalability and difficulty in packaging higher-dimensional hypercubes, the hypercube architecture was gradually replaced by other architectures. For example, the CM-5 employed the fat tree over the hypercube implemented in the CM-2. The Intel Paragon employed a two-dimensional mesh over its hypercube predecessors. Topological equivalence has been established among a number of network architectures. The bottom line for an architecture to survive in future systems is packaging efficiency and scalability to allow modular growth.

Cube-Connected Cycles  This architecture is modified from the hypercube. As illustrated in Fig. 2.19c, a 3-cube is modified to form 3-cube-connected cycles (CCC). The idea is to cut off the corner nodes (vertices) of the 3-cube and replace each by a ring (cycle) of 3 nodes.
In general, one can construct k-cube-connected cycles from a k-cube by replacing each of its 2^k vertices with a cycle of k nodes, as illustrated in Fig. 2.19d. A k-cube can thus be transformed to a k-CCC with k × 2^k nodes.
The 3-CCC shown in Fig. 2.19c has a diameter of 6, twice that of the original 3-cube. In general, the network diameter of a k-CCC equals 2k. The major improvement of a CCC lies in its constant node degree of 3, which is independent of the dimension of the underlying hypercube.
Fig. 2.19  Hypercubes and cube-connected cycles: (a) a 3-cube; (b) a 4-cube formed by interconnecting two 3-cubes; (c) 3-cube-connected cycles; (d) replacing each node of a k-cube by a ring (cycle) of k nodes to form the k-cube-connected cycles

Consider a hypercube with N = 2^n nodes. A CCC with an equal number of N nodes must be built from a lower-dimensional k-cube such that 2^n = k × 2^k for some k < n.
For example, a 64-node CCC can be formed by replacing the corner nodes of a 4-cube with cycles of four nodes, corresponding to the case n = 6 and k = 4. The CCC has a diameter of 2k = 8, longer than 6 in a 6-cube. But the CCC has a node degree of 3, smaller than the node degree of 6 in a 6-cube. In this sense, the CCC is a better architecture for building scalable systems if latency can be tolerated in some way.
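The sizing constraint 2^n = k × 2^k can be checked mechanically; the short Python sketch below (our own illustration) recovers k = 4 for the 64-node example above.

def ccc_cycle_size(n):
    # a CCC with N = 2**n nodes must satisfy 2**n = k * 2**k for some k < n
    N = 1 << n
    for k in range(3, n):
        if k * (1 << k) == N:
            return k
    return None   # no exact CCC of this size exists

k = ccc_cycle_size(6)
print(k, 2 * k)   # 4 8: cycles of four nodes, and the diameter 2k = 8 quoted above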
k-ary n-Cube Networks  Rings, meshes, tori, binary n-cubes (hypercubes), and Omega networks are topologically isomorphic to a family of k-ary n-cube networks. Figure 2.20 shows a 4-ary 3-cube network.

Fig. 2.20  The k-ary n-cube network shown with k = 4 and n = 3; hidden nodes and connections are not shown
The parameter n is the dimension of the cube and k is the radix, or the number of nodes (multiplicity) along each dimension. These two numbers are related to the number of nodes, N, in the network by:

N = k^n   (k = N^(1/n), n = log_k N)     (2.3)

A node in the k-ary n-cube can be identified by an n-digit radix-k address A = a1 a2 ... an, where ai represents the node's position in the ith dimension. For simplicity, all links are assumed bidirectional. Each line in the network represents two communication channels, one in each direction. In Fig. 2.20, the lines between nodes are bidirectional links.
Traditionally, low-dimensional k-ary n-cubes are called tori, and high-dimensional binary n-cubes are called hypercubes. The long end-around connections in a torus can be avoided by folding the network as shown in Fig. 2.21. In this case, all links along the ring in each dimension have equal wire length when the multidimensional network is embedded in a plane.
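The basic parameters of a k-ary n-cube follow directly from Eq. 2.3 and the torus structure; the Python sketch below (our own, assuming k > 2 and unit channel width) reproduces the entries of Table 2.2 for the 4-ary 3-cube of Fig. 2.20.

def kary_ncube(k, n, w=1):
    # number of nodes, node degree, diameter and bisection width of a k-ary n-cube
    N = k ** n                        # Eq. 2.3
    degree = 2 * n                    # two neighbours per dimension (valid for k > 2)
    diameter = n * (k // 2)           # wraparound halves the distance in each dimension
    bisection = 2 * w * k ** (n - 1)  # channels crossing the middle of one dimension
    return N, degree, diameter, bisection

print(kary_ncube(4, 3))   # (64, 6, 6, 32)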

Fig. 2.21  Folded connections to equalize the wire length in a torus network: (a) traditional torus (a 4-ary 2-cube); (b) a torus with folded connections (Courtesy of W. Dally; reprinted with permission from IEEE Trans. Computers, June 1990)

William Dally (1990) has revealed a number of interesting properties of k-ary n-cube networks. The cost of such a network is dominated by the amount of wire, rather than by the number of switches required. Under the assumption of constant wire bisection, low-dimensional networks with wide channels provide lower latency, less contention, and higher hot-spot throughput than higher-dimensional networks with narrow channels.
Network Throughput  The network throughput is defined as the total number of messages the network can handle per unit time. One method of estimating throughput is to calculate the capacity of a network, the total number of messages that can be in the network at once. Typically, the maximum throughput of a network is some fraction of its capacity.
A hot spot is a pair of nodes that accounts for a disproportionately large portion of the total network traffic. Hot-spot traffic can degrade performance of the entire network by causing congestion. The hot-spot throughput of a network is the maximum rate at which messages can be sent from one specific node Pi to another specific node Pj.
Low-dimensional networks operate better under nonuniform loads because they allow better resource sharing. In a high-dimensional network, wires are assigned to particular dimensions and cannot be shared between dimensions. For example, in a binary n-cube, it is possible for a wire to be saturated while a physically adjacent wire assigned to a different dimension remains idle. In a torus, all physically adjacent wires are combined into a single channel which is shared by all messages.

As a rule of thumb, minimum network latency is achieved when the network radix k and dimension n are chosen to make the components of communication latency due to distance D (the number of hops between nodes) and the message aspect ratio L/W (message length L normalized to the channel width W) approximately equal.
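A rough numerical illustration of this rule (our own sketch; the bisection budget B = 256 wires, the message length L = 128 bits, and the average torus distance of about nk/4 hops are assumptions made only for this example) compares three 256-node k-ary n-cubes under a constant wire bisection:

def latency_components(k, n, B=256, L=128):
    # channel width w follows from a fixed wire bisection B = 2 * w * k**(n-1);
    # D approximates the average hop count of a torus under uniform traffic, n*k/4
    w = B / (2 * k ** (n - 1))
    D = n * k / 4
    return D, L / w

for k, n in [(2, 8), (4, 4), (16, 2)]:       # three networks with 256 nodes each
    D, serialization = latency_components(k, n)
    print(f"k={k:2d} n={n}  D={D:5.1f}  L/W={serialization:6.1f}")
# k= 2 n=8  D=  4.0  L/W= 128.0
# k= 4 n=4  D=  4.0  L/W=  64.0
# k=16 n=2  D=  8.0  L/W=  16.0

The low-dimensional design brings the two latency components closest to each other, in line with the rule of thumb.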
Low-dimensional networks reduce contention because having a few high-bandwidth channels results in more resource sharing and thus a better queueing performance than having many low-bandwidth channels. While network capacity and worst-case blocking latency are independent of dimension, low-dimensional networks have a higher maximum throughput and lower average block latency than do high-dimensional networks.
Both fat tree networks and k-ary n-cube networks are considered universal in the sense that they can efficiently simulate any other network of the same volume. Dally claimed that any point-to-point network can be embedded in a 3-D mesh with no more than a constant increase in wiring length.
Summary of Static Networks  In Table 2.2, we summarize the important characteristics of static connection networks. The node degrees of most networks are less than 4, which is rather desirable. For example, the INMOS Transputer chip was a compute-communication microprocessor with four ports for communication. See also the TILE64 system-on-a-chip described in Chapter 13.

Table 2.2  Summary of Static Network Characteristics

Network type | Node degree, d | Network diameter, D | No. of links, l | Bisection width, B | Symmetry | Remarks on network size
Linear Array | 2 | N - 1 | N - 1 | 1 | No | N nodes
Ring | 2 | ⌊N/2⌋ | N | 2 | Yes | N nodes
Completely Connected | N - 1 | 1 | N(N - 1)/2 | (N/2)^2 | Yes | N nodes
Binary Tree | 3 | 2(h - 1) | N - 1 | 1 | No | Tree height h = ⌈log2 N⌉
Star | N - 1 | 2 | N - 1 | ⌊N/2⌋ | No | N nodes
2D-Mesh | 4 | 2(r - 1) | 2N - 2r | r | No | r × r mesh, where r = √N
Illiac Mesh | 4 | r - 1 | 2N | 2r | No | Equivalent to a chordal ring of r = √N
2D-Torus | 4 | 2⌊r/2⌋ | 2N | 2r | Yes | r × r torus, where r = √N
Hypercube | n | n | nN/2 | N/2 | Yes | N nodes, n = log2 N (dimension)
CCC | 3 | 2k - 1 + ⌊k/2⌋ | 3N/2 | N/(2k) | Yes | N = k × 2^k nodes with a cycle length k ≥ 3
k-ary n-cube | 2n | n⌊k/2⌋ | nN | 2k^(n-1) | Yes | N = k^n nodes
With a constant node degree of 4, a Transputer (such as the T800) becomes applicable as a building block. The node degrees for the completely connected and star networks are both bad. The hypercube node degree increases with log2 N and is also bad when the value of N becomes large.
Network diameters vary over a wide range. With the invention of hardware routing (wormhole routing), the diameter has become a less critical issue, because the communication delay between any two nodes becomes almost a constant with a high degree of pipelining. The number of links affects the network cost. The bisection width affects the network bandwidth.
The property of symmetry affects scalability and routing efficiency. It is fair to say that the total network cost increases with d and l. A smaller diameter is still a virtue, but the average distance between nodes may be a better measure. The bisection width can be enhanced by a wider channel width. Based on the above analysis, ring, mesh, torus, k-ary n-cube, and CCC all have some desirable features for building MPP systems.

2.4.3 Dynamic Connection Networks


For multipurpose or general-purpose applications, we may need to use dynamic connections which can implement all communication patterns based on program demands. Instead of using fixed connections, switches or arbiters must be used along the connecting paths to provide the dynamic connectivity. In increasing order of cost and performance, dynamic connection networks include bus systems, multistage interconnection networks (MIN), and crossbar switch networks.
The price tags of these networks are attributed to the cost of the wires, switches, arbiters, and connectors required. The performance is indicated by the network bandwidth, data transfer rate, network latency, and communication patterns supported. A brief introduction to dynamic connection networks is given below. Details can be found in subsequent chapters.
Digital Buses  A bus system is essentially a collection of wires and connectors for data transactions among processors, memory modules, and peripheral devices attached to the bus. The bus is used for only one transaction at a time between source and destination. In case of multiple requests, the bus arbitration logic must be able to allocate or deallocate the bus, servicing the requests one at a time.
For this reason, the digital bus has been called a contention bus or a time-sharing bus among multiple functional modules. A bus system has a lower cost and provides a limited bandwidth compared to the other two dynamic connection networks. Many industrial and IEEE bus standards are available.
Figure 2.22 shows a bus-connected multiprocessor system. The system bus provides a common communication path between the processors, I/O subsystem, and the memory modules, secondary storage devices, network adaptors, etc. The system bus is often implemented on a backplane of a printed circuit board. Other boards for processors, memories, or device interfaces are plugged into the backplane board via connectors or cables.
The active or master devices (processors or I/O subsystem) generate requests to address the memory. The passive or slave devices (memories or peripherals) respond to the requests. The common bus is used on a time-sharing basis, and important busing issues include bus arbitration, interrupt handling, coherence protocols, and transaction processing. We will study typical bus systems, such as the VME bus and others, in Chapter 5. Hierarchical bus structures for building larger multiprocessor systems are studied in Chapter 7.
Fig. 2.22  A bus-connected multiprocessor system, such as the Sequent Symmetry S1

Switch Modules  An a × b switch module has a inputs and b outputs. A binary switch corresponds to a 2 × 2 switch module in which a = b = 2. In theory, a and b do not have to be equal. However, in practice, a and b are often chosen as integer powers of 2; that is, a = b = 2^k for some k ≥ 1.
Table 2.3 lists several commonly used switch module sizes: 2 × 2, 4 × 4, and 8 × 8. Each input can be connected to one or more of the outputs. However, conflicts must be avoided at the output terminals. In other words, one-to-one and one-to-many mappings are allowed, but many-to-one mappings are not allowed due to conflicts at the output terminal.

Table 2.3  Switch Modules and Legitimate States

Module Size | Legitimate States | Permutation Connections
2 × 2 | 4 | 2
4 × 4 | 256 | 24
8 × 8 | 16,777,216 | 40,320
n × n | n^n | n!

When only one-to-one mappings (permutations) are allowed, we call the module an n × n crossbar switch. For example, a 2 × 2 crossbar switch can connect two possible patterns: straight or crossover. In general, an n × n crossbar can achieve n! permutations. The numbers of legitimate connection patterns for switch modules of various sizes are listed in Table 2.3.
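The counts in Table 2.3 follow from the observation that each output of an n × n module is driven by exactly one of the n inputs. The Python sketch below (our own illustration, not from the text) reproduces them.

from math import factorial

def switch_states(n):
    # each of the n outputs independently selects one of the n inputs (one-to-many from an
    # input is legal, many-to-one into an output is not), giving n**n legitimate states;
    # the n! states in which all outputs select distinct inputs are the permutations
    return n ** n, factorial(n)

for n in (2, 4, 8):
    print(n, switch_states(n))   # (2, (4, 2)), (4, (256, 24)), (8, (16777216, 40320))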
Multistage Interconnection Networks  MINs have been used in both MIMD and SIMD computers. A generalized multistage network is illustrated in Fig. 2.23. A number of a × b switches are used in each stage. Fixed interstage connections are used between the switches in adjacent stages. The switches can be dynamically set to establish the desired connections between the inputs and outputs.
Different classes of MINs differ in the switch modules used and in the kind of interstage connection (ISC) patterns used. The simplest switch module would be the 2 × 2 switch (a = b = 2 in Fig. 2.23). The ISC patterns often used include perfect shuffle, butterfly, multiway shuffle, crossbar, cube connection, etc. Some of these ISC patterns are shown below with examples.
Fig. 2.23  A generalized structure of a multistage interconnection network (MIN) built with a × b switch modules and interstage connection patterns ISC1, ISC2, ..., ISCn

Omega Network  Figures 2.24a to 2.24d show the four possible connection states of the 2 × 2 switches used in constructing the Omega network. A 16 × 16 Omega network is shown in Fig. 2.24e. Four stages of 2 × 2 switches are needed. There are 16 inputs on the left and 16 outputs on the right. The ISC pattern is the perfect shuffle over 16 objects.
In general, an n-input Omega network requires log2 n stages of 2 × 2 switches. Each stage requires n/2 switch modules. In total, the network uses (n/2) log2 n switches. Each switch module is individually controlled.
Various combinations of the switch states implement different permutations, broadcasts, or other connections from the inputs to the outputs. The interconnection capabilities of the Omega and other networks will be further studied in Chapter 7.
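The destination-tag (self-routing) property of the Omega network can be illustrated with a few lines of Python (our own sketch, under the usual convention that bit 0 selects the upper switch output and bit 1 the lower one): the line address is shuffled before each stage and its last bit is then overwritten by the corresponding destination bit.

def omega_route(src, dst, k):
    # trace a message through a 2**k-input Omega network of k stages of 2 x 2 switches
    n = 1 << k
    addr, ports = src, []
    for stage in range(k):
        addr = ((addr << 1) | (addr >> (k - 1))) & (n - 1)   # perfect shuffle of the lines
        bit = (dst >> (k - 1 - stage)) & 1                   # destination bit for this stage
        addr = (addr & ~1) | bit                             # leave on the upper (0) or lower (1) port
        ports.append(bit)
    assert addr == dst
    return ports

print(omega_route(5, 12, 4))   # [1, 1, 0, 0]: switch settings along the path in a 16 x 16 network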

Baseline Network  Wu and Feng (1980) have studied the relationship among a class of multistage interconnection networks. A Baseline network can be generated recursively, as shown in Fig. 2.25a.
The first stage contains one N × N block, and the second stage contains two (N/2) × (N/2) subblocks, labeled C0 and C1. The construction process can be recursively applied to the subblocks until the N/2 subblocks of size 2 × 2 are reached.
The small boxes and the ultimate building blocks of the subblocks are the 2 × 2 switches, each with two legitimate connection states: straight and crossover between the two inputs and two outputs. A 16 × 16 Baseline network is shown in Fig. 2.25b. In Problem 2.15, readers are asked to prove the topological equivalence between the Baseline and other networks.
Fig. 2.24  The use of 2 × 2 switches and the perfect shuffle as an interstage connection pattern to construct a 16 × 16 Omega network: (a) straight; (b) crossover; (c) upper broadcast; (d) lower broadcast; (e) the 16 × 16 Omega network (Courtesy of Duncan Lawrie; reprinted with permission from IEEE Trans. Computers, Dec. 1975)

Fig. 2.25  Recursive construction of a Baseline network: (a) recursive construction; (b) a 16 × 16 Baseline network (Courtesy of Wu and Feng; reprinted with permission from IEEE Trans. Computers, August 1980)
Crossbar Networks  The highest bandwidth and interconnection capability are provided by crossbar networks. A crossbar network can be visualized as a single-stage switch network. Like a telephone switchboard, the crosspoint switches provide dynamic connections between source and destination pairs. Each crosspoint switch can provide a dedicated connection path between a pair. The switch can be set on or off dynamically upon program demand. Two types of crossbar networks are illustrated in Fig. 2.26.
To build a shared-memory multiprocessor, one can use a crossbar network between the processors and memory modules (Fig. 2.26a). This is essentially a memory-access network. The pioneering C.mmp multiprocessor (Wulf and Bell, 1972) implemented a 16 × 16 crossbar network which connected 16 PDP-11 processors to 16 memory modules, each of which had a capacity of 1 million words of memory cells. The 16 memory modules could be accessed by the processors in parallel.
Fig. 2.26  Two crossbar switch network configurations: (a) interprocessor-memory crossbar network built in the C.mmp multiprocessor at Carnegie-Mellon University (1972); (b) the interprocessor crossbar network built in the Fujitsu VPP500 vector parallel processor (1992)

Note that each memory module can satisfy only one processor request at a time. When multiple requests arrive at the same memory module simultaneously, the crossbar must resolve the conflicts. The behavior of each crossbar switch is very similar to that of a bus. However, each processor can generate a sequence of addresses to access multiple memory modules simultaneously. Thus, in Fig. 2.26a, only one crosspoint switch can be set on in each column. However, several crosspoint switches can be set on simultaneously in order to support parallel (or interleaved) memory accesses.
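A toy arbitration loop (our own sketch; the function and its request format are illustrative only) shows how simultaneous requests to the same memory column are serialized while requests to distinct columns proceed in parallel.

def crossbar_schedule(requests):
    # requests: list of (processor, memory_module) pairs; at most one crosspoint may be
    # closed per memory column in a cycle, so conflicting requests wait for a later cycle
    cycles, pending = [], list(requests)
    while pending:
        granted, used, deferred = [], set(), []
        for proc, mem in pending:
            if mem in used:
                deferred.append((proc, mem))   # this column is already owned this cycle
            else:
                used.add(mem)
                granted.append((proc, mem))
        cycles.append(granted)
        pending = deferred
    return cycles

print(crossbar_schedule([(0, 2), (1, 2), (2, 5), (3, 2)]))
# [[(0, 2), (2, 5)], [(1, 2)], [(3, 2)]]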
Another type of crossbar network is used for interprocessor communication and is depicted in Fig. 2.26b. This large crossbar (224 × 224) was actually built in a vector parallel processor (VPP500) by Fujitsu Inc. (1992). The PEs are processors with attached memory. The CPs stand for control processors, which are used to supervise the entire system operation, including the crossbar networks. In this crossbar, at any one time only one crosspoint switch can be set on in each row and each column.
The interprocessor crossbar provides permutation connections among the processors. Only one-to-one connections are provided. Therefore, the n × n crossbar connects at most n source-destination pairs at a time. We will further study crossbar networks in Chapters 7 and 8.

Summary  In Table 2.4, we summarize the important features of buses, multistage networks, and crossbar switches in building dynamic networks. Obviously, the bus is the cheapest to build, but its drawback lies in the low bandwidth available to each processor.

Table 2.4  Summary of Dynamic Network Characteristics

Network Characteristics | Bus System | Multistage Network | Crossbar Switch
Minimum latency for unit data transfer | Constant | O(log_k n) | Constant
Bandwidth per processor | O(w/n) to O(w) | O(w) to O(nw) | O(w) to O(nw)
Wiring complexity | O(w) | O(nw log_k n) | O(n^2 w)
Switching complexity | O(n) | O(n log_k n) | O(n^2)
Connectivity and routing capability | Only one-to-one at a time | Some permutations and broadcast, if network unblocked | All permutations, one at a time
Early representative computers | Symmetry S-1, Encore Multimax | BBN TC-2000, IBM RP3 | Cray Y-MP/816, Fujitsu VPP500
Remarks | Assume n processors on the bus; bus width is w bits | n × n MIN using k × k switches with line width of w bits | Assume n × n crossbar with line width of w bits

Another problem with the bus is that it is prone to failure. Some fault-tolerant systems, like the Tandem multiprocessor for transaction processing, used dual buses to protect the system from single failures.
The crossbar switch is the most expensive to build, due to the fact that its hardware complexity increases as n^2. However, the crossbar has the highest bandwidth and routing capability. For a small network size, it is the desired choice.
Multistage networks provide a compromise between the two extremes. The major advantage of MINs lies in their scalability with modular construction. However, the latency increases with log n, the number of stages in the network. Also, costs due to increased wiring and switching complexity are another constraint. For building MPP systems, some of the static topologies are more scalable in specific applications.
Advances in VLSI and interconnect technologies have had a major impact on multiprocessor system architecture, as we shall see in Chapter 13, and there has been a clear shift toward the use of packet-based switched-media interconnects.

Summary
In this chapter, we have focused on basic program properties which make parallelism possible and determine the amount and type of parallelism which can be exploited. With an increasing degree of multiprocessing, the rate at which data must be communicated between subsystems also increases, and therefore the system interconnect architecture becomes important in determining system performance.
We started this chapter with a study of the basic conditions which must be satisfied for parallel computations to be possible. In essence, it is dependences between operations which limit the amount of parallelism which can be exploited. After all, any set of N fully independent operations can always be performed in parallel.
The three basic data dependences between operations are flow dependence, anti-dependence and output dependence. Resource dependence refers to a limitation in available hardware and/or software resources which limits the achievable degree of parallelism. Bernstein's conditions, which apply to the input and output sets of processes, must be satisfied for parallel execution of processes to be possible.
Parallelism may be exploited at the level of software or hardware. For software parallelism, program design, and the program development and runtime environments, play the key role. For hardware parallelism, availability of the right mix of hardware resources plays the key role. Program partitioning, grain size, communication latency and scheduling are important concepts; scheduling may be static or dynamic.
Program flow may be control-driven, data-driven or demand-driven. Of these, control-driven program flow, as exemplified in the von Neumann model, is the only one that has proved commercially successful over the last six decades. Other program flow models have been tried out on research-oriented systems, but in general these models have not found acceptance on a broader basis.
When computer systems consist of multiple processors, along with several other subsystems such as memory modules and network adapters, the system interconnect architecture plays a very important role in determining final system performance. We studied basic network properties, including topology and routing functionality. Network performance can be characterized in terms of bandwidth, latency, functionality and scalability.
We studied static network topologies such as the linear array, ring, tree, fat tree, torus and hypercube; we also looked at dynamic network topologies which involve switching and/or routing of data. With a higher degree of multiprocessing, bus-based systems are unable to meet the aggregate bandwidth requirements of the system; multistage interconnection networks and crossbar switches can provide better alternatives.

Exercises
Problem 2.1  Define the following terms related to parallelism and dependence relations:
(a) Computational granularity.
(b) Communication latency.
(c) Flow dependence.
(d) Antidependence.
(e) Output dependence.
(f) I/O dependence.
(g) Control dependence.
(h) Resource dependence.
(i) Bernstein conditions.
(j) Degree of parallelism.

Problem 2.2  Define the following terms for various system interconnect architectures:
(a) Node degree.
(b) Network diameter.
(c) Bisection bandwidth.
(d) Static connection networks.
(e) Dynamic connection networks.
(f) Nonblocking networks.
(g) Multicast and broadcast.
(h) Mesh versus torus.
(i) Symmetry in networks.
(j) Multistage networks.
(k) Crossbar networks.
(l) Digital buses.

Problem 2.3  Answer the following questions on program flow mechanisms and computer models:
(a) Compare control-flow, dataflow, and reduction computers in terms of the program flow mechanism used.
(b) Comment on the advantages and disadvantages in control complexity, potential for parallelism, and cost-effectiveness of the above computer models.
(c) What are the differences between string reduction and graph reduction machines?

Problem 2.4  Perform a data dependence analysis on each of the following Fortran program fragments. Show the dependence graphs among the statements with justification.
(a) S1: A = B + D
    S2: C = A × 3
    S3: A = A + C
    S4: E = A / 2
(b) S1: X = SIN(Y)
    S2: Z = X + W
    S3: Y = -2.5 × W
    S4: X = COS(Z)
(c) Determine the data dependences in the same and adjacent iterations of the following Do-loop.
    Do 10 I = 1, N
    S1: A(I + 1) = B(I - 1) + C(I)
    S2: B(I) = A(I) × K
    S3: C(I) = B(I) - 1
    10 Continue

Problem 2.5  Analyze the data dependences among the following statements in a given program:
    S1: Load R1, 1024        /R1 <- 1024/
    S2: Load R2, M(10)       /R2 <- Memory(10)/
    S3: Add R1, R2           /R1 <- (R1) + (R2)/
    S4: Store M(1024), R1    /Memory(1024) <- (R1)/
    S5: Store M((R2)), 1024  /Memory(64) <- 1024/
where (Ri) means the content of register Ri and Memory(10) contains 64 initially.
(a) Draw a dependence graph to show all the dependences.

(b) Are there any resource dependences if only one copy of each functional unit is available in the CPU?
(c) Repeat the above for the following program statements:
    S1: Load R1, M(100)   /R1 <- Memory(100)/
    S2: Move R2, R1       /R2 <- (R1)/
    S3: Inc R1            /R1 <- (R1) + 1/
    S4: Add R2, R1        /R2 <- (R2) + (R1)/
    S5: Store M(100), R1  /Memory(100) <- (R1)/

Problem 2.6  A sequential program consists of the following five statements, S1 through S5. Considering each statement as a separate process, clearly identify input set Ii and output set Oi of each process. Restructure the program using Bernstein's conditions in order to achieve maximum parallelism between processes. If any pair of processes cannot be executed concurrently, specify which of the three conditions is not satisfied.
    S1: A = B + C
    S2: E = B × D
    S3: S = 0
    S4: Do I = A, 100
          S = S + X(I)
        End Do
    S5: If (S .GT. 1000) C = C × 2

Problem 2.7  Consider the execution of the following code segment consisting of seven statements. Use Bernstein's conditions to detect the maximum parallelism embedded in this code. Justify the portions that can be executed in parallel and the remaining portions that must be executed sequentially. Rewrite the code using parallel constructs such as Cobegin and Coend. No variable substitution is allowed. All statements can be executed in parallel if they are declared within the same block of a (Cobegin, Coend) pair.
    S1: A = B + C
    S2: C = D + E
    S3: F = G + E
    S4: H = A + F
    S5: M = G + C
    S6: A = L + C
    S7: A = E + A

Problem 2.8  According to program order, the following six arithmetic expressions need to be executed in minimum time. Assume that all are integer operands, already loaded into working registers. No memory reference is needed for the operand fetch. Also, all intermediate or final results are written back to working registers without conflicts.
    P1: X <- (A + B) × (A - B)
    P2: Y <- (C + D) / (C - D)
    P3: Z <- X + Y
    P4: A <- E × F
    P5: Y <- E - Z
    P6: B <- (X - F) × A
(a) Use the minimum number of working registers to rewrite the above HLL program into a minimum-length assembly language code using arithmetic opcodes add, subtract, multiply, and divide exclusively. Assume a fixed instruction format with three register fields: two for sources and one for the destination.
(b) Perform a flow analysis of the assembly code obtained in part (a) to reveal all data dependences with a dependence graph.
(c) The CPU is assumed to have two add units, one multiply unit, and one divide unit. Work out an optimal schedule to execute the assembly code in minimum time, assuming 1 cycle for the add unit, 3 cycles for the multiply unit, and 18 cycles for the divide unit to complete the execution of one instruction. Ignore all overhead caused by instruction fetch, decode, and writeback. No pipelining is assumed here.

Problem 2.9  Consider the following assembly language code. Exploit the maximum degree of parallelism among the 16 instructions, assuming no resource conflicts and that multiple functional units are available simultaneously. For simplicity, no pipelining is assumed.

All instructions take one machine cycle to execute. Ignore all other overhead.
    1:  Load R1, A        /R1 <- Mem(A)/
    2:  Load R2, B        /R2 <- Mem(B)/
    3:  Mul R3, R1, R2    /R3 <- (R1) × (R2)/
    4:  Load R4, D        /R4 <- Mem(D)/
    5:  Mul R5, R1, R4    /R5 <- (R1) × (R4)/
    6:  Add R6, R3, R5    /R6 <- (R3) + (R5)/
    7:  Store X, R6       /Mem(X) <- (R6)/
    8:  Load R7, C        /R7 <- Mem(C)/
    9:  Mul R8, R7, R4    /R8 <- (R7) × (R4)/
    10: Load R9, E        /R9 <- Mem(E)/
    11: Add R10, R8, R9   /R10 <- (R8) + (R9)/
    12: Store Y, R10      /Mem(Y) <- (R10)/
    13: Add R11, R6, R10  /R11 <- (R6) + (R10)/
    14: Store U, R11      /Mem(U) <- (R11)/
    15: Sub R12, R6, R10  /R12 <- (R6) - (R10)/
    16: Store V, R12      /Mem(V) <- (R12)/
(a) Draw a program graph with 16 nodes to show the flow relationships among the 16 instructions.
(b) Consider the use of a three-issue superscalar processor to execute this program fragment in minimum time. The processor can issue one memory-access instruction (Load or Store but not both), one Add/Sub instruction, and one Mul (multiply) instruction per cycle. The Add unit, Load/Store unit, and Multiply unit can be used simultaneously if there is no data dependence.

Problem 2.10  Repeat part (b) of Problem 2.9 on a dual-processor system with shared memory. Assume that the same superscalar processors are used and that all instructions take one cycle to execute.
(a) Partition the given program into two balanced halves. You may want to insert some load or store instructions to pass intermediate results generated by the two processors to each other. Show the divided program flow graph with the final outputs U and V generated by the two processors separately.
(b) Work out an optimal schedule for parallel execution of the above divided program by the two processors in minimum time.

Problem 2.11  You are asked to design a direct network for a multicomputer with 64 nodes using a three-dimensional torus, a six-dimensional binary hypercube, and cube-connected cycles (CCC) with a minimum diameter. The following questions are related to the relative merits of these network topologies:
(a) Let d be the node degree, D the network diameter, and l the total number of links in a network. Suppose the quality of a network is measured by (d × D × l)^(-1). Rank the three architectures according to this quality measure.
(b) A mean internode distance is defined as the average number of hops (links) along the shortest path for a message to travel from one node to another. The average is calculated for all (source, destination) pairs. Order the three architectures based on their mean internode distances, assuming that the probability that a node will send a message to all other nodes at distance i is (D - i + 1) / Σ_{k=1}^{D} k, where D is the network diameter.

Problem 2.12  Consider an Illiac mesh (8 × 8), a binary hypercube, and a barrel shifter, all with 64 nodes labeled N0, N1, ..., N63. All network links are bidirectional.
(a) List all the nodes reachable from node N9 in exactly three steps for each of the three networks.
(b) Indicate in each case the tightest upper bound on the minimum number of routing steps needed to send data from any node Ni to another node Nj.
(c) Repeat part (b) for a larger network with 1024 nodes.

Problem 2.13  Compare buses, crossbar switches, and multistage networks for building a multiprocessor system with n processors and m shared-memory modules.

Assume a word length of w bits and that 2 × 2 switches are used in building the multistage networks. The comparison study is carried out separately in each of the following four categories:
(a) Hardware complexities such as switching, arbitration, wires, connector, or cable requirements.
(b) Minimum latency in unit data transfer between the processor and memory module.
(c) Bandwidth range available to each processor.
(d) Communication capabilities such as permutations, data broadcast, blocking handling, etc.

Problem 2.14  Answer the following questions related to multistage networks:
(a) How many legitimate states are there in a 4 × 4 switch module, including both broadcast and permutations? Justify your answer with reasoning.
(b) Construct a 64-input Omega network using 4 × 4 switch modules in multiple stages. How many permutations can be implemented directly in a single pass through the network without blocking?
(c) What is the percentage of one-pass permutations compared with the total number of permutations achievable in one or more passes through the network?

Problem 2.15  Topologically equivalent networks are those whose graph representations are isomorphic with the same interconnection capabilities. Prove the topological equivalence among the Omega, Flip, and Baseline networks.
(a) Prove that the Omega network (Fig. 2.24) is topologically equivalent to the Baseline network (Fig. 2.25b).
(b) The Flip network (Fig. 2.27) is constructed using the inverse perfect shuffle (Fig. 2.14b) for interstage connections. Prove that the Flip network is topologically equivalent to the Baseline network.
(c) Based on the results obtained in (a) and (b), prove the topological equivalence between the Flip network and the Omega network.

Fig. 2.27  A 16 × 16 Flip network (Courtesy of Ken Batcher; reprinted from Proc. Int. Conf. Parallel Processing, 1976)

Problem 2.16  Answer the following questions for the k-ary n-cube network:
(a) How many nodes does the network contain?
(b) What is the network diameter?
(c) What is the bisection bandwidth?
(d) What is the node degree?
(e) Explain the graph-theoretic relationship among k-ary n-cube networks and rings, meshes, tori, binary n-cubes, and Omega networks.
(f) Explain the difference between a conventional torus and a folded torus.
(g) Under the assumption of constant wire bisection, why do low-dimensional networks (tori) have lower latency and higher hot-spot throughput than high-dimensional networks (hypercubes)?

Problem 2.17  Read the paper on fat trees by Leiserson, which appeared in IEEE Trans. Computers, pp. 892-901, Oct. 1985.

Answer the following questions related to the organization and application of fat trees:
(a) Explain the advantages of using binary fat trees over conventional binary trees as a multiprocessor interconnection network.
(b) A universal fat tree is defined as a fat tree of n nodes with root capacity w, where n^(1/2) ≤ w ≤ n, and for each channel c_k at level k of the tree, the capacity is
    c_k = min(⌈n/2^k⌉, ⌈w/2^(k-1)⌉)
Prove that the capacities of a universal fat tree grow exponentially as we go up the tree from the leaves. The channel capacity is defined here as the number of wires in a channel.

Problem 2.18  Read the paper on k-ary n-cube networks by Dally, which appeared in IEEE Trans. Computers, June 1990, pp. 775-785. Answer the following questions related to the network properties and applications as a VLSI communication network:
(a) Prove that the bisection width B of a k-ary n-cube with w-bit wide communication channels is
    B(k, n) = 2wk^(n-1) = 2wN/k
where N = k^n is the network size.
(b) Prove that the hot-spot throughput of a k-ary n-cube network with deterministic routing is equal to the bandwidth of a single channel, w = k/2, under the assumption of a constant wire cost.

Problem 2.19  Network embedding is a technique to implement a network A on a network B. Explain how to perform the following network embeddings:
(a) Embed a two-dimensional torus r × r on an n-dimensional hypercube with N = 2^n nodes, where r^2 = 2^n.
(b) Embed the largest ring on a CCC with N = k × 2^k nodes, where k ≥ 3.
(c) Embed a complete balanced binary tree with maximum height on a mesh of r × r nodes.

Problem 2.20  Read the paper on hypernets by Hwang and Ghosh, which appeared in IEEE Trans. Computers, Dec. 1989. Answer the following questions related to the network properties and applications of hypernets:
(a) Explain how hypernets integrate positive features of hypercube and tree-based topologies into one combined architecture.
(b) Prove that the average node degree of a hypernet can be maintained essentially constant when the network size is increased.
(c) Discuss the application potentials of hypernets in terms of message routing complexity, cost-effective support for global as well as localized communication, I/O capabilities, and fault tolerance.
TM Htfiflhlt Hfllio-nponnti


Principles of Scalable Performance
We study performance measures, speedup laws, and scalability principles in this chapter. Three speedup models are presented under different computing objectives and resource constraints. These include Amdahl's law (1967), Gustafson's scaled speedup (1988), and the memory-bounded speedup by Sun and Ni (1993).
The efficiency, redundancy, utilization, and quality of a parallel computation are defined, involving the interplay between architectures and algorithms. Standard performance measures and several benchmark kernels are introduced with relevant performance data.
The performance of parallel computers relies on a design that balances hardware and software. The system architects and programmers must exploit parallelism, pipelining, and networking in a balanced approach. Toward building massively parallel systems, the scalability issues must be resolved first. Fundamental concepts of scalable systems are introduced in this chapter. Case studies can be found in subsequent chapters, especially in Chapters 9 and 13.

PERFORMANCE METRICS AND MEASURES


In this section, we first study parallelism profiles and define the asymptotic speedup factor, ignoring communication latency and resource limitations. Then we introduce the concepts of system efficiency, utilization, redundancy, and quality of parallel computations. Possible tradeoffs among these performance metrics are examined in the context of cost-effectiveness. Several commonly used performance measures, MIPS, Mflops, and TPS, are formally defined.

3.1.1 Parallelism Profile in Programs


The degree of parallelism reflects the extent to which software parallelism matches hardware parallelism. We characterize below parallelism profiles, introduce the concept of average parallelism, and define an ideal speedup with infinite machine resources. Variations on the ideal speedup factor will be presented in subsequent sections from various application viewpoints and under different system limitations.

Degree of Parallelism The execution of a program on a parallel computer may use different numbers of processors at different time periods during the execution cycle. For each time period, the number of processors used to execute a program is defined as the degree of parallelism (DOP). This is a discrete time function, assuming only nonnegative integer values.

The plot of the DOP as a function of time is called the parallelism profile of a given program. For simplicity, we concentrate on the analysis of single-program profiles. Some software tools are available to trace the parallelism profile. The profiling of multiple programs in an interleaved fashion can in theory be extended from this study.
Fluctuation of the profile during an observation period depends on the algorithmic structure, program optimization, resource utilization, and run-time conditions of a computer system. The DOP was defined under the assumption of having an unbounded number of available processors and other necessary resources. The DOP may not always be achievable on a real computer with limited resources.
When the DOP exceeds the maximum number of available processors in a system, some parallel branches must be executed in chunks sequentially. However, parallelism still exists within each chunk, limited by the machine size. The DOP may also be limited by memory and by other nonprocessor resources. We consider only the limit imposed by processors in our discussions on speedup models.
Average Parallelism In what follows, we consider a parallel computer consisting of n homogeneous processors. The maximum parallelism in a profile is m. In the ideal case, n >> m. The computing capacity \Delta of a single processor is approximated by the execution rate, such as MIPS or Mflops, without considering the penalties from memory access, communication latency, or system overhead. When i processors are busy during an observation period, we have DOP = i.
The total amount of work W (instructions or computations) performed is proportional to the area under the profile curve:

W = \Delta \int_{t_1}^{t_2} \mathrm{DOP}(t)\, dt    (3.1)

This integral is often computed with the following discrete summation:

W = \Delta \sum_{i=1}^{m} i \cdot t_i    (3.2)

where t_i is the total amount of time that DOP = i, and \sum_{i=1}^{m} t_i = t_2 - t_1 is the total elapsed time.
The average parallelism A is computed by

A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \mathrm{DOP}(t)\, dt    (3.3)

In discrete form, we have

A = \frac{\sum_{i=1}^{m} i \cdot t_i}{\sum_{i=1}^{m} t_i}    (3.4)

Example 3.1 Parallelism profile and average parallelism of a divide-and-conquer algorithm (Sun and Ni, 1993)

As illustrated in Fig. 3.1, the parallelism profile of a divide-and-conquer algorithm increases from 1 to its peak value m = 8 and then decreases to 0 during the observation period (t_1, t_2).

Fig. 3.1 Parallelism profile of a divide-and-conquer algorithm (degree of parallelism plotted against time, with the average parallelism marked)

In Fig. 3.1, the average parallelism A = (1 x 5 + 2 x 3 + 3 x 4 + 4 x 6 + 5 x 2 + 6 x 2 + 8 x 3)/(5 + 3 + 4 + 6 + 2 + 2 + 3) = 93/25 = 3.72. In fact, the total workload W = A \Delta (t_2 - t_1), and A is an upper bound of the asymptotic speedup to be defined below.

Available Parallelism There is a wide range of potential parallelism in application programs. Engineering and scientific codes exhibit a high DOP due to data parallelism. Manoj Kumar (1988) has reported that computation-intensive codes may execute 500 to 3500 arithmetic operations concurrently in each clock cycle in an idealized environment. Nicolau and Fisher (1984) reported that standard Fortran programs averaged about a factor of 90 parallelism available for very long instruction word architectures. These numbers show the optimistic side of available parallelism.
However, David Wall (1991) indicated that the limit of instruction-level parallelism is around 5, rarely exceeding 7. Butler et al. (1991) reported that when all constraints are removed, the DOP in programs may exceed 17 instructions per cycle. If the hardware is perfectly balanced, one can sustain from 2.0 to 5.8 instructions per cycle on a superscalar processor that is reasonably designed. These numbers show the pessimistic side of available parallelism.
The above measures of available parallelism show that computation that is less numeric than that in scientific codes has relatively little parallelism even when basic block boundaries are ignored. A basic block is a sequence or block of instructions in a program that has a single entry and a single exit point. While compiler optimization and algorithm redesign may increase the available parallelism in an application, limiting parallelism extraction to a basic block limits the potential instruction-level parallelism to a factor of about 2 to 5 in ordinary programs. However, the DOP may be pushed to thousands in some scientific codes when multiple processors are used to exploit parallelism beyond the boundary of basic blocks.
Asymptotic Speedup Denote the amount of work executed with DOP = i as W_i = i \Delta t_i, or we can write W = \sum_{i=1}^{m} W_i. The execution time of W_i on a single processor (sequentially) is t_i(1) = W_i / \Delta. The execution time of W_i on k processors is t_i(k) = W_i / (k\Delta). With an infinite number of available processors, t_i(\infty) = W_i / (i\Delta) for 1 <= i <= m. Thus we can write the response time as

T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta}    (3.5)

T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} \frac{W_i}{i\Delta}    (3.6)

The asymptotic speedup S_\infty is defined as the ratio of T(1) to T(\infty):

S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i}    (3.7)

Comparing Eqs. 3.4 and 3.7, we realize that S_\infty = A in the ideal case. In general, S_\infty <= A if communication latency and other system overhead are considered. Note that both S_\infty and A are defined under the assumption n = \infty or n >> m.
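The quantities in Example 3.1 can be checked directly against these definitions. The Python sketch below is only an illustration (the profile pairs are read off Fig. 3.1 and all variable names are ours): it evaluates Eqs. 3.2, 3.4 and 3.7 and confirms that S_infinity = A = 3.72 for that profile.

```python
# DOP profile of Example 3.1: (DOP value i, time t_i spent at that DOP).
profile = [(1, 5), (2, 3), (3, 4), (4, 6), (5, 2), (6, 2), (8, 3)]

delta = 1.0  # computing capacity of a single processor, normalized to 1

# Eq. 3.2: total work W = Delta * sum(i * t_i)
W = delta * sum(i * t for i, t in profile)

# Eq. 3.4: average parallelism A
A = sum(i * t for i, t in profile) / sum(t for _, t in profile)

# Eq. 3.7: asymptotic speedup, using W_i = i * Delta * t_i so W_i / i = Delta * t_i
S_inf = sum(i * delta * t for i, t in profile) / sum(delta * t for _, t in profile)

print(W, A, S_inf)   # 93.0 3.72 3.72 -- S_inf equals A in the ideal case
```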

3.1.2 Mean Performance

Consider a parallel computer with n processors executing m programs in various modes with different performance levels. We want to define the mean performance of such multimode computers. With a weight distribution we can define a meaningful performance expression.
Different execution modes may correspond to scalar, vector, sequential, or parallel processing with different program parts. Each program may be executed with a combination of these modes. Harmonic mean performance provides an average performance across a large number of programs running in various modes.
Before we derive the harmonic mean performance expression, let us study the arithmetic mean performance expression first derived by James Smith (1988). The execution rate R_i for program i is measured in MIPS rate or Mflops rate, and so are the various performance expressions to be derived below.
Arithmetic Mean Performance Let {R_i} be the execution rates of programs i = 1, 2, ..., m. The arithmetic mean execution rate is defined as

R_a = \frac{1}{m} \sum_{i=1}^{m} R_i    (3.8)

The expression R_a assumes equal weighting (1/m) on all m programs. If the programs are weighted with a distribution \pi = {f_i | i = 1, 2, ..., m}, we define a weighted arithmetic mean execution rate as follows:

R_a^* = \sum_{i=1}^{m} f_i R_i    (3.9)
Arithmetic mean execution rate is proportional to the sum of the inverses of execution times; it is not inversely proportional to the sum of execution times. Consequently, the arithmetic mean execution rate fails to represent the real times consumed by the benchmarks when they are actually executed.
Harmonic Mean Performance With the weakness of the arithmetic mean performance measure, we need to develop a mean performance expression based on arithmetic mean execution time. In fact, T_i = 1/R_i is the mean execution time per instruction for program i. The arithmetic mean execution time per instruction is defined by

T_a = \frac{1}{m} \sum_{i=1}^{m} T_i = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{R_i}    (3.10)

The harmonic mean execution rate across m benchmark programs is thus defined by the fact R_h = 1/T_a:

R_h = \frac{m}{\sum_{i=1}^{m} (1/R_i)}    (3.11)

Therefore, the harmonic mean performance is indeed related to the average execution time. With a weight distribution \pi = {f_i | i = 1, 2, ..., m}, we can define the weighted harmonic mean execution rate as:

R_h^* = \frac{1}{\sum_{i=1}^{m} (f_i / R_i)}    (3.12)

The above harmonic mean performance expressions correspond to the total number of operations divided by the total time. Compared to the arithmetic mean, the harmonic mean execution rate is closer to the real performance.
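A small numerical illustration may help. The Python sketch below uses three hypothetical MIPS rates (not benchmark data) to contrast the arithmetic mean of Eq. 3.8 with the harmonic mean of Eq. 3.11; only the harmonic mean matches the rate obtained from total work divided by total time.

```python
# Hypothetical execution rates (MIPS) of m programs, each assumed to execute
# the same number of instructions (say 1 million), so time_i = 1 / R_i.
rates = [500.0, 100.0, 20.0]

m = len(rates)
R_a = sum(rates) / m                       # Eq. 3.8: arithmetic mean rate
R_h = m / sum(1.0 / r for r in rates)      # Eq. 3.11: harmonic mean rate

# Real aggregate rate: total work divided by total elapsed time.
total_time = sum(1.0 / r for r in rates)   # m million instructions overall
real_rate = m / total_time

print(R_a, R_h, real_rate)  # R_h equals the real rate; R_a overstates it
```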
Harmonic Mean Speedup Another way to apply the harmonic mean concept is to tie the various modes of a program to the number of processors used. Suppose a program (or a workload of multiple programs combined) is to be executed on an n-processor system. During the executing period, the program may use i = 1, 2, ..., n processors in different time periods.
We say the program is executed in mode i if i processors are used. The corresponding execution rate R_i is used to reflect the collective speed of i processors. Assume that T_1 = 1/R_1 = 1 is the sequential execution time on a uniprocessor with an execution rate R_1 = 1. Then T_i = 1/R_i = 1/i is the execution time of using i processors with a combined execution rate of R_i = i in the ideal case.
Suppose the given program is executed in n execution modes with a weight distribution w = {f_i | i = 1, 2, ..., n}. A weighted harmonic mean speedup is defined as follows:

S = T_1 / T^* = \frac{1}{\sum_{i=1}^{n} (f_i / R_i)}    (3.13)

where T^* = 1/R_h^* is the weighted arithmetic mean execution time across the n execution modes, similar to that derived in Eq. 3.12.

Example 3.2 Harmonic mean speedup for a multiprocessor operating in n execution modes (Hwang and Briggs, 1984)

In Fig. 3.2, we plot Eq. 3.13 based on the assumption that T_i = 1/i for all i = 1, 2, ..., n. This corresponds to the ideal case in which a unit-time job is done by i processors in minimum time. The assumption can also be interpreted as R_i = i, because the execution rate increases i times from R_1 = 1 when i processors are fully utilized without waste.
The three probability distributions \pi_1, \pi_2, and \pi_3 correspond to three processor utilization patterns. Let s = \sum_{i=1}^{n} i. Then \pi_1 = (1/n, 1/n, ..., 1/n) corresponds to a uniform distribution over the n execution modes, \pi_2 = (1/s, 2/s, ..., n/s) favors using more processors, and \pi_3 = (n/s, (n-1)/s, ..., 2/s, 1/s) favors using fewer processors.
The ideal case corresponds to the 45-degree dashed line. Obviously, \pi_2 produces a higher speedup than \pi_1 does. The distribution \pi_1 is superior to the distribution \pi_3 in Fig. 3.2.

Fig. 3.2 Harmonic mean speedup performance with respect to three probability distributions: \pi_1 for uniform distribution, \pi_2 in favor of using more processors, and \pi_3 in favor of using fewer processors
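The behaviour plotted in Fig. 3.2 can be reproduced numerically. The sketch below is a minimal evaluation of Eq. 3.13 under the ideal assumption R_i = i, using the three distributions defined above (n = 8 is an arbitrary choice made only for this illustration).

```python
def harmonic_speedup(weights):
    """Eq. 3.13 with R_i = i: S = 1 / sum(f_i / R_i)."""
    return 1.0 / sum(f / i for i, f in enumerate(weights, start=1))

n = 8
s = n * (n + 1) // 2                      # s = sum of 1..n
pi1 = [1.0 / n] * n                       # uniform over the n modes
pi2 = [i / s for i in range(1, n + 1)]    # favors using more processors
pi3 = [i / s for i in range(n, 0, -1)]    # favors using fewer processors

for name, w in [("pi1", pi1), ("pi2", pi2), ("pi3", pi3)]:
    print(name, round(harmonic_speedup(w), 2))
# pi2 gives the highest speedup and pi3 the lowest, consistent with Fig. 3.2.
```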

Amdahl's Law Using Eq. 3.13, one can derive Amdahl's law as follows: First, assume R_i = i and w = (\alpha, 0, 0, ..., 0, 1 - \alpha); i.e., w_1 = \alpha, w_n = 1 - \alpha, and w_i = 0 for i != 1 and i != n. This implies that the system is used either in a pure sequential mode on one processor with a probability \alpha, or in a fully parallel mode using n processors with a probability 1 - \alpha. Substituting R_1 = 1, R_n = n, and w into Eq. 3.13, we obtain the following speedup expression:

S_n = \frac{n}{1 + (n - 1)\alpha}    (3.14)

This is known as Amdahl's law. The implication is that S -> 1/\alpha as n -> \infty. In other words, under the above probability assumption, the best speedup one can expect is upper-bounded by 1/\alpha, regardless of how many processors are employed.
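A direct evaluation of Eq. 3.14 makes the bound concrete. The sketch below is illustrative only; the values of alpha and n are arbitrary choices.

```python
def amdahl_speedup(n, alpha):
    """Eq. 3.14: S_n = n / (1 + (n - 1) * alpha)."""
    return n / (1.0 + (n - 1) * alpha)

for alpha in (0.0, 0.01, 0.1, 0.9):
    print(alpha, [round(amdahl_speedup(n, alpha), 1) for n in (4, 64, 1024)])
# Even with alpha = 0.01 the speedup saturates near 1/alpha = 100 as n grows,
# no matter how many processors are added.
```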
In Fig. 3.3, we plot Eq. 3.14 as a function of n for four values of \alpha. When \alpha = 0, the ideal speedup is achieved. As the value of \alpha increases from 0.01 to 0.1 to 0.9, the speedup performance drops sharply.

Fig. 3.3 Speedup performance with respect to the probability distribution \pi = (\alpha, 0, ..., 0, 1 - \alpha), where \alpha is the fraction of sequential bottleneck

For many years, Amdahl's law has painted a pessimistic picture for parallel processing. That is, the system performance cannot be high as long as the serial fraction \alpha exists. We will further examine Amdahl's law in Section 3.3.1 from the perspective of workload growth.

3.1.3 Efficiency, Utilization, and Quality


Ruby Lee (1980) has defined several parameters for evaluating parallel computations. These are fundamental concepts in parallel processing. Tradeoffs among these performance factors are often encountered in real-life applications.
System Efficiency Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be the execution time in unit time steps. In general, T(n) < O(n) if more than one operation is performed by n processors per unit time, where n >= 2. Assume T(1) = O(1) in a uniprocessor system. The speedup factor is defined as

S(n) = T(1)/T(n)    (3.15)

The system efficiency for an n-processor system is defined by

E(n) = \frac{S(n)}{n} = \frac{T(1)}{n\,T(n)}    (3.16)

Efficiency is an indication of the actual degree of speedup performance achieved as compared with the maximum value. Since 1 <= S(n) <= n, we have 1/n <= E(n) <= 1.
The lowest efficiency corresponds to the case of the entire program code being executed sequentially on a single processor, the other processors remaining idle. The maximum efficiency is achieved when all n processors are fully utilized throughout the execution period.

Redundancy and Utilization The redundancy in a parallel computation is defined as the ratio of O(n) to O(1):

R(n) = O(n)/O(1)    (3.17)

This ratio signifies the extent of matching between software parallelism and hardware parallelism. Obviously 1 <= R(n) <= n. The system utilization in a parallel computation is defined as

U(n) = R(n)E(n) = \frac{O(n)}{n\,T(n)}    (3.18)

The system utilization indicates the percentage of resources (processors, memories, etc.) that was kept busy during the execution of a parallel program. It is interesting to note the following relationships: 1/n <= E(n) <= U(n) <= 1 and 1 <= R(n) <= 1/E(n) <= n.

Quality of Parallelism The quality of a parallel computation is directly proportional to the speedup and efficiency and inversely related to the redundancy. Thus, we have

Q(n) = \frac{S(n)E(n)}{R(n)} = \frac{T^3(1)}{n\,T^2(n)\,O(n)}    (3.19)

Since E(n) is always a fraction and R(n) is a number between 1 and n, the quality Q(n) is always upper-bounded by the speedup S(n).

Example 3.3 A hypothetical workload and performance plots

In Fig. 3.4, we compare the relative magnitudes of S(n), E(n), R(n), U(n), and Q(n) as a function of machine size n, with respect to a hypothetical workload characterized by O(1) = T(1) = n^3, O(n) = n^3 + n^2 log_2 n, and T(n) = 4n^3/(n + 3).
Substituting these measures into Eqs. 3.15 to 3.19, we obtain the following performance expressions:

S(n) = (n + 3)/4
E(n) = (n + 3)/(4n)
R(n) = (n + log_2 n)/n
U(n) = (n + 3)(n + log_2 n)/(4n^2)
Q(n) = (n + 3)^2 / (16(n + log_2 n))

The relationships 1/n <= E(n) <= U(n) <= 1 and 0 <= Q(n) <= S(n) <= n are observed, where the linear speedup corresponds to the ideal case of 100% efficiency.
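The closed-form expressions above are easy to tabulate. The sketch below simply evaluates them for a few machine sizes (the workload itself is the hypothetical one stated in this example).

```python
import math

def measures(n):
    """S, E, R, U, Q for the hypothetical workload of Example 3.3."""
    S = (n + 3) / 4.0                                   # Eq. 3.15
    E = (n + 3) / (4.0 * n)                             # Eq. 3.16
    R = (n + math.log2(n)) / n                          # Eq. 3.17
    U = (n + 3) * (n + math.log2(n)) / (4.0 * n ** 2)   # Eq. 3.18
    Q = (n + 3) ** 2 / (16.0 * (n + math.log2(n)))      # Eq. 3.19
    return S, E, R, U, Q

for n in (1, 2, 4, 8, 16, 32):
    print(n, [round(x, 3) for x in measures(n)])
# One can verify 1/n <= E <= U <= 1 and Q <= S at every machine size.
```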

Fig. 3.4 Performance measures for Example 3.3 on a parallel computer with up to 32 processors (speedup S(n), efficiency E(n), redundancy R(n), utilization U(n), and quality Q(n) plotted against the number of processors n)

To summarize the above discussion on performance indices, we use the speedup S(n) to indicate the degree of speed gain in a parallel computation. The efficiency E(n) measures the useful portion of the total work performed by n processors. The redundancy R(n) measures the extent of workload increase. The utilization U(n) indicates the extent to which resources are utilized during a parallel computation.

Finally, the quality Q(n) combines the effects of speedup, efficiency, and redundancy into a single expression to assess the relative merit of a parallel computation on a computer system.
The speedup and efficiency of 10 parallel computers are reported in Table 3.1 for solving a linear system of 1000 equations. The table entries are excerpts from Table 1 in Dongarra's report (1992) on LINPACK benchmark performance over a large number of computers.
Either the standard LINPACK algorithm or an algorithm based on matrix-matrix multiplication was used in these experiments. A high degree of parallelism is embedded in these experiments. Thus high efficiency (such as 0.94 for the IBM 3090/600S VF and 0.95 for the Convex C3240) was achieved. The low efficiency reported on the Intel Delta was based on some initial data.

Table 3.1 Speedup and Efficiency of Parallel Computers for Solving a Linear System with 1000 Unknowns

Computer Model        No. of Processors n   Uniprocessor Timing T1 (s)   Multiprocessor Timing Tn (s)   Speedup S = T1/Tn   Efficiency E = S/n
Cray Y-MP C90                  16                    0.77                        0.069                      11.12               0.70
NEC SX-3                        2                    0.15                        0.082                       1.82               0.91
Cray Y-MP/8                     8                    2.17                        0.312                       6.96               0.87
Fujitsu AP 1000               512                  160.0                         1.10                      145.0                0.28
IBM 3090/600S VF                6                    7.27                        1.29                        5.64               0.94
Intel Delta                   512                   22.0                         1.50                       14.7                0.03
Alliant FX/2800-200            14                   22.9                         2.06                       11.1                0.79
nCUBE/2                      1024                  331.0                         2.59                      128.0                0.12
Convex C3240                    4                   14.9                         3.91                        3.81               0.95
Parsytec FT-400               400                 1075.0                         4.90                      219.0                0.55

Source: Jack Dongarra, "Performance of Various Computers Using Standard Linear Equations Software," Computer Science Dept., Univ. of Tennessee, Knoxville, TN 37996-1301, March 11, 1992.

3.1.4 Benchmarks and Performance Measures

We have used MIPS and Mflops to describe the instruction execution rate and floating-point capability of a parallel computer. The MIPS rate defined in Eq. 1.3 is calculated from clock frequency and average CPI. In practice, the MIPS and Mflops ratings and other performance indicators to be introduced below should be measured from running benchmarks or real programs on real machines.
In this section, we introduce standard measures adopted by the industry to compare various computer performance, including Mflops, MIPS, KLIPS, Dhrystone, and Whetstone, as often encountered in reported computer ratings.
Most computer manufacturers state peak or sustained performance in terms of MIPS or Mflops. These ratings are by no means conclusive. The real performance is always program-dependent or application-driven. In general, the MIPS rating depends on the instruction set, varies between programs, and even varies inversely with respect to performance, as observed by Hennessy and Patterson (1990).

To compare processors with different clock cycles and different instruction sets is not totally fair. Besides the native MIPS, one can define a relative MIPS with respect to a reference machine. We will discuss the relative MIPS rating against the VAX 11/780 when Dhrystone performance is introduced below. For numerical computing, the LINPACK results on a large number of computers are reported in Chapter 13.
Similarly, the Mflops rating depends on the machine hardware design and on the program behavior. MIPS and Mflops ratings are not convertible because they measure different ranges of operations. The conventional rating is called the native Mflops, which does not distinguish unnormalized from normalized floating-point operations.
For example, a real floating-point divide operation may correspond to four normalized floating-point divide operations. One needs to use a conversion table between real and normalized floating-point operations to convert a native Mflops rating to a normalized Mflops rating.
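As a toy illustration of this distinction, the sketch below uses a hypothetical operation mix and hypothetical normalization weights (they are not an industry-standard conversion table) to derive a native and a normalized Mflops rating from the same run.

```python
# Hypothetical operation counts from one run, and illustrative normalization
# weights (e.g. a divide counted as four normalized floating-point operations).
op_counts = {"add": 6.0e8, "multiply": 5.0e8, "divide": 0.5e8, "sqrt": 0.1e8}
weights   = {"add": 1, "multiply": 1, "divide": 4, "sqrt": 8}

elapsed_seconds = 2.0

native_mflops = sum(op_counts.values()) / elapsed_seconds / 1e6
normalized_mflops = sum(op_counts[op] * weights[op]
                        for op in op_counts) / elapsed_seconds / 1e6

print(round(native_mflops, 1), round(normalized_mflops, 1))
# The normalized rating is higher whenever expensive operations are present.
```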

The Dhrystone Results This is a CPU-intensive benchmark consisting of a mix of about 100 high-level language instructions and data types found in system programming applications where floating-point operations are not used (Weicker, 1984). The Dhrystone statements are balanced with respect to statement type, data type, and locality of reference, with no operating system calls and making no use of library functions or subroutines. Thus the Dhrystone rating should be a measure of the integer performance of modern processors. The unit KDhrystones/s is often used in reporting Dhrystone results.
The Dhrystone benchmark version 1.1 was applied to a number of processors. The DEC VAX 11/780 scored 1.7 KDhrystones/s. This machine has been used as a reference computer with a 1-MIPS performance. The relative VAX/MIPS rating is commonly accepted by the computer industry.

The Whetstone Results This is a Fortran-based synthetic benchmark assessing the floating-point performance, measured in the number of KWhetstones/s that a system can perform. The benchmark includes both integer and floating-point operations involving array indexing, subroutine calls, parameter passing, conditional branching, and trigonometric/transcendental functions.
The Whetstone benchmark does not contain any vectorizable code and shows dependence on the system's mathematics library and the efficiency of the code generated by a compiler.
The Whetstone performance is not equivalent to the Mflops performance, although the Whetstone contains a large number of scalar floating-point operations.
Both the Dhrystone and Whetstone are synthetic benchmarks whose performance results depend heavily on the compilers used. As a matter of fact, the Dhrystone benchmark program was originally written to test the CPU and compiler performance for a typical program. Compiler techniques, especially procedure in-lining, can significantly affect the Dhrystone performance.
Both benchmarks were criticized for being unable to predict the performance of user programs. The sensitivity to compilers is a major drawback of these benchmarks. In real-life problems, only application-oriented benchmarks will do the trick. We will examine the SPEC and other benchmark suites in Chapter 9.

The TPS and KLIPS Ratings On-line transaction processing applications demand rapid, interactive processing for a large number of relatively simple transactions. They are typically supported by very large databases. Automated teller machines and airline reservation systems are familiar examples. Today many such applications are web-based.

The throughput of computers for on-line transaction processing is often measured in transactions per second (TPS). Each transaction may involve a database search, query answering, and database update operations. Business computers and servers should be designed to deliver a high TPS rate. The TP1 benchmark was originally proposed in 1985 for measuring the transaction processing of business application computers. This benchmark also became a standard for gauging relational database performance.
Over the last couple of decades, there has been an enormous increase both in the diversity and the scale of computer applications deployed around the world. The world-wide web, web-based applications, multimedia applications and search engines did not exist in the early 1990s. Such scale and diversity have been made possible by huge advances in processing, storage, graphics display and networking capabilities over this period, which have been reviewed in Chapter 13.
For such applications, application-specific benchmarks have become more important than general-purpose benchmarks such as Whetstone. For web servers providing 24 x 7 service, for example, we may wish to benchmark, under simulated but realistic load conditions, performance parameters such as throughput (in number of requests served and/or amount of data delivered) and average response time.
In artificial intelligence applications, the measure KLIPS (kilo logic inferences per second) was used at one time to indicate the reasoning power of an AI machine. For example, the high-speed inference machine developed under Japan's Fifth-Generation Computer System Project claimed a performance of 400 KLIPS. Assuming that each logic inference operation involves about 100 assembly instructions, 400 KLIPS implies approximately 40 MIPS in this sense. The conversion ratio is by no means fixed. Logic inference demands symbolic manipulations rather than numeric computations. Interested readers are referred to the book edited by Wah and Ramamoorthy (1990).

PARALLEL PROCESSING APPLICATIONS

Massively parallel processing has become one of the frontier challenges in supercomputer applications. We introduce grand challenges in high-performance computing and communications and then assess the speed, memory, and I/O requirements to meet these challenges. Characteristics of parallel algorithms are also discussed in this context.

3.2.1 Massive Parallelism for Grand Challenges

The definition of massive parallelism varies with respect to time. Based on today's standards, any machine having hundreds or thousands of processors is a massively parallel processing (MPP) system. As computer technology advances rapidly, the demand for a higher degree of parallelism becomes more obvious.
The performance of most commercial computers is marked by their peak MIPS rate or peak Mflops rate. In reality, only a fraction of the peak performance is achievable in real benchmark or evaluation runs. Observing the sustained performance makes more sense in evaluating computer performance.

Grand Challenges We review below some of the grand challenges identified in the U.S. High-Performance Computing and Communication (HPCC) program, reveal opportunities for massive parallelism, assess past developments, and comment on future trends in MPP.

(1) The magnetic recording industry relies on the use of computers to study magnetostatic and exchange interactions in order to reduce noise in metallic thin films used to coat high-density disks. In general, all research in science and engineering makes heavy demands on computing power.
(2) Rational drug design is being aided by computers in the search for a cure for cancer, acquired immunodeficiency syndrome and other diseases. Using a high-performance computer, new potential agents have been identified that block the action of human immunodeficiency virus protease.
(3) Design of high-speed transport aircraft is being aided by computational fluid dynamics running on supercomputers. Fuel combustion can be made more efficient by designing better engine models through chemical kinetics calculations.
(4) Catalysts for chemical reactions are being designed with computers for many biological processes which are catalytically controlled by enzymes. Massively parallel quantum models demand large simulations to reduce the time required to design catalysts and to optimize their properties.
(5) Ocean modeling cannot be accurate without supercomputing MPP systems. Ozone depletion and climate research demands the use of computers in analyzing the complex thermal, chemical and fluid-dynamic mechanisms involved.
(6) Other important areas demanding computational support include digital anatomy in real-time medical diagnosis, air pollution reduction through computational modeling, the design of protein structures by computational biologists, image processing and understanding, and technology linking research to education.

Besides computer science and computer engineering, the above challenges also encourage the emerging discipline of computational science and engineering. This demands systematic application of computer systems and computational solution techniques to mathematical models formulated to describe and to simulate phenomena of scientific and engineering interest.
The HPCC Program also identified some grand challenge computing requirements of the time, as shown in Fig. 3.5. This diagram shows the levels of processing speed and memory size required to support scientific simulation modeling, advanced computer-aided design (CAD), and real-time processing of large-scale database and information retrieval operations. In the years since the early 1990s, there have been huge advances in the processing, storage and networking capabilities of computer systems. Some MPP systems have reached petaflop performance, while even PCs have gigabytes of memory. At the same time, computing requirements in science and engineering have also grown enormously.

Exploiting Massive Parallelism The parallelism embedded in the instruction level or procedural level is rather limited. Very few parallel computers can successfully execute more than two instructions per machine cycle from the same program. Instruction parallelism is often constrained by program behavior, compiler/OS incapabilities, and program flow and execution mechanisms built into modern computers.
On the other hand, data parallelism is much higher than instruction parallelism. Data parallelism refers to the situation where the same operation (instruction or program) executes over a large array of data (operands). Data parallelism has been implemented on pipelined vector processors, SIMD array processors, and SPMD or MPMD multicomputer systems.
In Table 1.6, we find SIMD data parallelism over 65,536 PEs in the CM-2. One may argue that the CM-2 was a bit-slice machine. Even if we divide the number by 64 (the word length of a typical supercomputer), we still end up with a DOP on the order of thousands in the CM-2.
The vector length can be used to determine the parallelism implementable on a vector supercomputer.
In the case of the Cray Y-MP C90, 32 pipelines in 16 processors could potentially achieve a DOP of 32 x 5 = 160 if the average pipeline has five stages. Thus a pipelined processor can support a lower degree of data parallelism than an SIMD computer.

Fig. 3.5 Grand challenge requirements in computing and communications, plotted as memory capacity versus system speed (courtesy of U.S. High-Performance Computing and Communication Program, 1992)

On a message-passing multicomputer, the parallelism is more scalable than in a shared-memory multiprocessor. As revealed in Table 1.4, the nCUBE/2 could achieve a maximum parallelism on the order of thousands if all the node processors were kept busy simultaneously.
The Past and the Future MPP systems started in 1968 with the introduction of the Illiac IV computer with 64 PEs under one controller. Subsequently, a massively parallel processor, called MPP, was built by Goodyear with 16,384 PEs. IBM built a GF11 machine with 576 PEs. The MasPar MP-1, AMT DAP610, and CM-2 were all early examples of SIMD computers.

Early MPP systems operating in MIMD mode included the BBN TC-2000 with a maximum configuration of 512 processors. The IBM RP-3 was designed to have 512 processors (only a 64-processor version was built). The Intel Touchstone Delta was a system with 570 processors.
Several subsequent MPP projects included the Paragon by Intel Supercomputer Systems, the CM-5 by Thinking Machines Corporation, the KSR-1 by Kendall Square Research, the Fujitsu VPP500 system, the Tera computer, and the MIT *T system.
IBM announced MPP projects using thousands of IBM RS/6000 and later Power processors, while Cray developed MPP systems using Digital's Alpha processors and later AMD Opteron processors as building blocks. Some early MPP projects are summarized in Table 3.2. We will study some of these systems in later chapters, and more recent advances in Chapter 13.

Table 3.2 Early Representative Massively Parallel Processing Systems

MPP System and its Architecture, Technology, and Operational Features:

Intel Paragon: A 2-D mesh-connected multicomputer, built with i860 XP processors and wormhole routers, targeted for 300 Gflops in peak performance.

IBM MPP Model: Uses IBM RISC/6000 processors as building blocks; 50 Gflops peak expected for a 1024-processor configuration.

TMC CM-5: A universal architecture for SIMD/MIMD computing using SPARC PEs and custom-designed FPUs, control and data networks; 2 Tflops peak for 16K nodes.

Cray Research MPP Model: A 3-D torus heterogeneous architecture using DEC Alpha chips with special communication support, global address space over physically distributed memory; first system offered 150 Gflops in a 1024-processor configuration in 1993; capable of growing to Tflops with larger configurations.

Kendall Square Research KSR-1: An ALLCACHE ring-connected multiprocessor with custom-designed processors; 43 Gflops peak performance for a 1088-processor configuration.

Fujitsu VPP500: A crossbar-connected 222-PE MIMD vector system, with shared distributed memories using the VP2000 as a host; peak performance = 355 Gflops.

3.2.2 Application Models of Parallel Computers

In general, if the workload is kept unchanged as shown by curve \alpha in Fig. 3.6a, then the efficiency E decreases rapidly as the machine size n increases. The reason is that the overhead h increases faster than the machine size. To maintain the efficiency at a desired level, one has to increase the machine size and problem size proportionally. Such a system is known as a scalable computer for solving scaled problems.
In the ideal case, we would like to see a workload curve which is a linear function of n (curve \gamma in Fig. 3.6a). This implies linear scalability in problem size. If the linear workload curve is not achievable, the second choice is to achieve a sublinear scalability as close to linearity as possible, as illustrated by curve \beta in Fig. 3.6a, which has a smaller constant of proportionality than curve \gamma.
Suppose that the workload follows an exponential growth pattern and becomes enormously large, as shown by curve \theta in Fig. 3.6a. The system is considered poorly scalable in this case. The reason is that to keep a constant efficiency or a good speedup, the increase in workload with problem size becomes explosive and exceeds the memory or I/O limits.
Fig. 3.6 Workload growth, efficiency curves, and application models of parallel computers under resource constraints: (a) four workload growth patterns (\theta exponential, \gamma linear, \beta sublinear, \alpha constant); (b) corresponding efficiency curves; (c) application models for parallel computers (fixed-memory, fixed-time, and fixed-load models bounded by the memory bound and the communication bound)

The Efficiency Curves Corresponding to the four workload patterns specified in Fig. 3.6a, four efficiency curves are shown in Fig. 3.6b, respectively. With a constant workload, the efficiency curve (\alpha) drops rapidly. In fact, curve \alpha corresponds to the famous Amdahl's law. For a linear workload, the efficiency curve (\gamma) is almost flat, as observed by Gustafson in 1988.
The exponential workload (\theta) may not be implementable due to memory shortage or I/O bounds (if real-time application is considered). Thus the \theta efficiency (dashed lines) is achievable only with exponentially increased memory (or I/O) capacity. The sublinear efficiency curve (\beta) lies somewhere between curves \alpha and \gamma.
Scalability analysis determines whether parallel processing of a given problem can offer the desired improvement in performance. The analysis should help guide the design of a massively parallel processor. It is clear that no single scalability metric suffices to cover all possible cases. Different measures will be useful in different contexts, and further analysis is needed along multiple dimensions for any specific application.
A parallel system can be used to solve arbitrarily large problems in a fixed time if and only if its workload pattern is allowed to grow linearly. Sometimes, even if minimum time is achieved with more processors, the system utilization (or efficiency) may be very poor.
Application Models The workload patterns shown in Fig. 3.6a are not allowed to grow unbounded. In Fig. 3.6c, we show three models for the application of parallel computers. These models are bounded by limited memory, limited tolerance of IPC latency, or limited I/O bandwidth. These models are briefly introduced below. They lead to three speedup performance models to be formulated in Section 3.3.
The fixed-load model corresponds to a constant workload (curve \alpha in Fig. 3.6a). The use of this model is eventually limited by the communication bound shown by the shaded area in Fig. 3.6c.
The fixed-time model demands a constant program execution time, regardless of how the workload scales up with machine size. The linear workload growth (curve \gamma in Fig. 3.6a) corresponds to this model. The fixed-memory model is limited by the memory bound, corresponding to a workload curve between \gamma and \theta in Fig. 3.6a.
From the application point of view, the shaded areas are forbidden. The communication bound includes not only the increasing IPC overhead but also the increasing I/O demands. The memory bound is determined by main memory and disk capacities.
In practice, an algorithm designer or a parallel computer programmer may choose an application model within the above resource constraints, as shown in the unshaded application region in Fig. 3.6c.
Tradeoffs in Scalability Analysis Computer cost c and programming overhead p (in addition to speedup and efficiency) are equally important in scalability analysis. After all, cost-effectiveness may impose the ultimate constraint on computing with a limited budget. What we have studied above was concentrated on system efficiency and fast execution of a single algorithm/program on a given parallel computer.
It would be interesting to extend the scalability analysis to multiuser environments in which multiple programs are executed concurrently by sharing the available resources. Sometimes one problem is poorly scalable, while another has good scalability characteristics. Tradeoffs exist in increasing resource utilization, but not necessarily in minimizing the overall execution time, in an optimization process.
Exploiting parallelism for higher performance demands both scalable architectures and scalable algorithms. The architectural scalability can be limited by long communication latency, bounded memory capacity, bounded I/O bandwidth, and limited processing speed. How to achieve a balanced design among these practical constraints is the major challenge of today's MPP system designers. On the other hand, parallel algorithms and efficient data structures also need to be scalable.

3.2.3 Scalability of Parallel Algorithms

In this subsection, we analyze the scalability of parallel algorithms with respect to key machine classes. An isoefficiency concept is introduced for scalability analysis of parallel algorithms. Two examples are used to illustrate the idea. Further studies of scalability are given in Section 3.4 after we study the speedup performance laws in Section 3.3.

Algorithmic Characteristics Computational algorithms are traditionally executed sequentially on uniprocessors. Parallel algorithms are those specially devised for parallel computers. The idealized parallel algorithms are those written for the PRAM models if no physical constraints or communication overheads are imposed. In the real world, an algorithm is considered efficient only if it can be cost-effectively implemented on physical machines. In this sense, all machine-implementable algorithms must be architecture-dependent. This means the effects of communication overhead and architectural constraints cannot be ignored.

We summarize below important characteristics of parallel algorithms which are machine implementable:
(1) Deterministic versus nondeterministic: As defined in Section 1.4.1, only deterministic algorithms are implementable on real machines. Our study is confined to deterministic algorithms with polynomial time complexity.
(2) Computational granularity: As introduced in Section 2.2.1, granularity decides the size of data items and program modules used in computation. In this sense, we also classify algorithms as fine-grain, medium-grain, or coarse-grain.
(3) Parallelism profile: The distribution of the degree of parallelism in an algorithm reveals the opportunity for parallel processing. This often affects the effectiveness of the parallel algorithms.
(4) Communication patterns and synchronization requirements: Communication patterns address both memory access and interprocessor communications. The patterns can be static or dynamic, depending on the algorithms. Static algorithms are more suitable for SIMD or pipelined machines, while dynamic algorithms are for MIMD machines. The synchronization frequency often affects the efficiency of an algorithm.
(5) Uniformity of the operations: This refers to the types of fundamental operations to be performed. Obviously, if the operations are uniform across the data set, the SIMD processing or pipelining may be more desirable. In other words, randomly structured algorithms are more suitable for MIMD processing. Other related issues include data types and precision desired.
(6) Memory requirement and data structures: In solving large-scale problems, the data sets may require huge memory space. Memory efficiency is affected by the data structures chosen and the data movement patterns in the algorithms. Both time and space complexities are key measures of the granularity of a parallel algorithm.

The Isoefficiency Concept The workload w of an algorithm grows with s, the problem size. Thus, we denote the workload w = w(s) as a function of s. Kumar and Rao (1987) have introduced an isoefficiency concept relating workload to the machine size n needed to maintain a fixed efficiency E when implementing a parallel algorithm on a parallel computer. Let h be the total communication overhead involved in the algorithm implementation. This overhead is usually a function of both machine size and problem size, thus denoted h = h(s, n).
The efficiency of a parallel algorithm implemented on a given parallel computer is thus defined as

E = \frac{w(s)}{w(s) + h(s, n)}    (3.20)

The workload w(s) corresponds to useful computations, while the overhead h(s, n) represents computations attributed to synchronization and data communication delays. In general, the overhead increases with respect to both increasing values of s and n. Thus, the efficiency is always less than 1. The question is hinged on the relative growth rates between w(s) and h(s, n).
With a fixed problem size (or fixed workload), the efficiency decreases as n increases. The reason is that the overhead h(s, n) increases with n. With a fixed machine size, the overhead h grows slower than the workload w. Thus the efficiency increases with increasing problem size for a fixed-size machine. Therefore, one can expect to maintain a constant efficiency if the workload w is allowed to grow properly with increasing machine size.

For a given algorithm, the workload w might need to grow polynomially or exponentially with respect to n in order to maintain a fixed efficiency. Different algorithms may require different workload growth rates to keep the efficiency from dropping as n is increased. The isoefficiency functions of common parallel algorithms are polynomial functions of n; i.e., they are O(n^k) for some k >= 1. The smaller the power of n in the isoefficiency function, the more scalable the parallel system. Here, the system includes the algorithm and architecture combination.
Isoefficiency Function We can rewrite Eq. 3.20 as E = 1/(1 + h(s, n)/w(s)). In order to maintain a constant E, the workload w(s) should grow in proportion to the overhead h(s, n). This leads to the following condition:

w(s) = \frac{E}{1 - E} \times h(s, n)    (3.21)

The factor C = E/(1 - E) is a constant for a fixed efficiency E. Thus we can define the isoefficiency function as follows:

f_E(n) = C \times h(s, n)    (3.22)

If the workload w(s) grows as fast as f_E(n) in Eq. 3.22, then a constant efficiency can be maintained for a given algorithm-architecture combination. Two examples are given below to illustrate the use of isoefficiency functions for scalability analysis.
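The definitions can be exercised numerically. The sketch below assumes a hypothetical overhead function h(s, n) = n log2 n + s sqrt(n) (of the same form as one of the overheads in Example 3.4) and grows the workload w(s) = s^3 according to Eq. 3.21; the resulting efficiency of Eq. 3.20 stays near the chosen target.

```python
import math

def overhead(s, n):
    # Hypothetical overhead h(s, n) = n*log2(n) + s*sqrt(n)
    return n * math.log2(n) + s * math.sqrt(n)

def efficiency(w, h):
    # Eq. 3.20: E = w / (w + h)
    return w / (w + h)

E_target = 0.8
C = E_target / (1 - E_target)            # C = E / (1 - E), as in Eq. 3.22

for n in (16, 64, 256, 1024):
    s = 2
    while s ** 3 < C * overhead(s, n):   # grow w(s) = s^3 until Eq. 3.21 holds
        s += 1
    print(n, s, round(efficiency(s ** 3, overhead(s, n)), 3))
# The required problem size s grows with n while E stays at (or just above)
# the 0.8 target; the small overshoot comes from s being an integer.
```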

Example 3.4 Scalability of matrix multiplication algorithms (Gupta and Kumar, 1992)

Four algorithms for matrix multiplication are compared below. The problem size s is represented by the matrix order. In other words, we consider the multiplication of two s x s matrices A and B to produce an output matrix C = A x B. The total workload involved is w = O(s^3). The number of processors used is confined within 1 <= n <= s^3. Some of the algorithms may use fewer than s^3 processors.
The isoefficiency functions of the four algorithms are derived below based on equating the workload with the communication overhead (Eq. 3.21) in each algorithm. Details of these algorithms and corresponding architectures can be found in the original papers identified in Table 3.3 as well as in the paper by Gupta and Kumar (1992). The derivation of the communication overheads is left as an exercise in Problem 3.14.
The Fox-Otto-Hey algorithm has a total overhead h(s, n) = O(n log n + s^2 \sqrt{n}). The workload w = O(s^3) must match this overhead. Thus we must have O(s^3) = O(n log n) and O(s^3) = O(s^2 \sqrt{n}). Combining the two, we obtain the isoefficiency function O(s^3) = O(n^{1.5}), where 1 <= n <= s^2, as shown in the first row of Table 3.3.
Although this algorithm is written for the torus architecture, the torus can be easily embedded in a hypercube architecture. Thus we can conduct a fair comparison of the four algorithms against the hypercube architecture.
Berntsen's algorithm restricts the use of n <= s^{3/2} processors. The total overhead is O(n^{4/3} + n log n + s^2 n^{1/3}). To match this with O(s^3), we must have O(s^3) = O(n^{4/3}) and O(s^3) = O(n). Thus, O(s^3) must be chosen to yield the isoefficiency function O(n^{4/3}).
The Gupta-Kumar algorithm has an overhead O(n log n + s^2 n^{1/3} log n). Thus we must have O(s^3) = O(n log n) and O(s^3) = O(s^2 n^{1/3} log n). This leads to the isoefficiency function O(n(log n)^3) in the third row of Table 3.3.
The Dekel-Nassimi-Sahni algorithm has a total overhead of O(n log n) besides a useful computation time of O(s^3/n) for s^2 <= n <= s^3. Thus the workload growth O(s^3) = O(n log n) will yield the isoefficiency function listed in the last row of Table 3.3.

Table 3.3 Asymptotic Isoefficiency Functions of Four Matrix Multiplication Algorithms (Gupta and Kumar, 1992)

Matrix Multiplication Algorithm     Isoefficiency Function f(n)   Range of Applicability   Target Machine Architecture
Fox, Otto, and Hey (1987)           O(n^{1.5})                    1 <= n <= s^2            A sqrt(n) x sqrt(n) torus
Berntsen (1989)                     O(n^{4/3})                    1 <= n <= s^{3/2}        A hypercube with n <= s^{3/2} nodes
Gupta and Kumar (1992)              O(n (log n)^3)                1 <= n <= s^3            A hypercube with n = 2^k nodes and k <= 3 log s
Dekel, Nassimi, and Sahni (1981)    O(n log n)                    s^2 <= n <= s^3          A hypercube with n = s^3 = 2^{3k} nodes

Note: Two s x s matrices are multiplied.

The above isoefficiency functions indicate the asymptotic scalabilities of the four algorithms. In practice, none of the algorithms is strictly better than the others for all possible problem sizes and machine sizes. For example, when these algorithms are implemented on a multicomputer with a long communication latency (as in the Intel iPSC/1), Berntsen's algorithm is superior to the others.
To map the algorithms on an SIMD computer with an extremely low synchronization overhead, the algorithm by Gupta and Kumar is inferior to the others. Hence, it is best to use the Dekel-Nassimi-Sahni algorithm for s^2 <= n <= s^3, the Fox-Otto-Hey algorithm for s^{3/2} <= n <= s^2, and Berntsen's algorithm for n <= s^{3/2} on SIMD hypercube machines.
Example 3.5 Fast Fourier transform on mesh and hypercube computers (Gupta and Kumar, 1993)

This example demonstrates the sensitivity of machine architecture to the scalability of the FFT on two different parallel computers: mesh and hypercube. We consider the Cooley-Tukey algorithm for the one-dimensional s-point fast Fourier transform.
Gupta and Kumar have established the overheads: h_1(s, n) = O(n log n + s log n) for FFT on a hypercube machine with n processors, and h_2(s, n) = O(n log n + s \sqrt{n}) on a \sqrt{n} x \sqrt{n} mesh with n processors.
For an s-point FFT, the total workload involved is w(s) = O(s log s). Equating the workload with the overheads, we must satisfy O(s log s) = O(n log n) and O(s log s) = O(s log n), leading to the isoefficiency function f_1 = O(n log n) for the hypercube machine.

Similarly, we must satisfy O(s log s) = O(n log n) and O(s log s) = O(s \sqrt{n}) by equating w(s) = h_2(s, n). This leads to the isoefficiency function f_2 = O(\sqrt{n}\, k^{\sqrt{n}}) for some constant k >= 2.
The above analysis leads to the conclusion that FFT is indeed very scalable on a hypercube computer. The result is plotted in Fig. 3.7a for three efficiency values.
Fig. 3.7 Isoefficiency curves for FFT on two parallel computers, plotting the problem size s required against the machine size n (Courtesy of Gupta and Kumar, 1993): (a) hypercube under three operating efficiencies; (b) comparison between mesh and hypercube

To maintain the same efficiency, the mesh is rather poorly scalable, as demonstrated in Fig. 3.7b. This is predictable from the fact that the workload must grow exponentially as O(\sqrt{n}\, k^{\sqrt{n}}) for the mesh architecture, while the hypercube demands only an O(n log n) workload increase as the machine size increases. Thus, we conclude that the FFT is scalable on a hypercube but not so on a mesh architecture.
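The same conclusion can be reached with a small numerical experiment. The sketch below keeps only the overhead terms quoted above, sets all hidden constants to 1 (a simplifying assumption), and finds the smallest s-point FFT that sustains E = 0.5 on each architecture.

```python
import math

def w(s):                       # workload of an s-point FFT
    return s * math.log2(s)

def h_cube(s, n):               # overhead on a hypercube (constants set to 1)
    return n * math.log2(n) + s * math.log2(n)

def h_mesh(s, n):               # overhead on a sqrt(n) x sqrt(n) mesh
    return n * math.log2(n) + s * math.sqrt(n)

def min_problem_size(h, n, E=0.5):
    s = 4
    while w(s) / (w(s) + h(s, n)) < E:   # grow s until the target efficiency holds
        s *= 2
    return s

for n in (16, 64, 256, 1024):
    print(n, min_problem_size(h_cube, n), min_problem_size(h_mesh, n))
# The hypercube needs only modest growth in s, while the mesh requires an
# enormously larger FFT as n increases, matching the isoefficiency analysis.
```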

If the bandwidth of the communication channels in a mesh architecture increases in proportion to the increase of machine size, the above conclusion will change. For the design and analysis of FFT on parallel machines, readers are referred to the books by Aho, Hopcroft and Ullman (1974) and by Quinn (1987). We will further address scalability issues from the architecture standpoint in Section 3.4.

SPEEDUP PERFORMANCE LAWS

Three speedup performance models are defined below. Amdahl's law (1967) is based on a fixed workload or a fixed problem size. Gustafson's law (1988) is applied to scaled problems, where the problem size increases with the increase in machine size. The speedup model by Sun and Ni (1993) is for scaled problems bounded by memory capacity.

3.3.1 Amdahl's Law for a Fixed Workload

In many practical applications that demand a real-time response, the computational workload is often fixed with a fixed problem size. As the number of processors increases in a parallel computer, the fixed load is distributed to more processors for parallel execution. Therefore, the main objective is to produce the results as soon as possible. In other words, minimal turnaround time is the primary goal. Speedup obtained for time-critical applications is called fixed-load speedup.
Fixed-Load Speedup The ideal speedup formula given in Eq. 3.7 is based on a fixed workload, regardless of the machine size. Traditional formulations for speedup, including Amdahl's law, are all based on a fixed problem size and thus on a fixed load. The speedup factor is upper-bounded by a sequential bottleneck in this case.
We consider below both the cases of DOP < n and of DOP >= n. We use the ceiling function \lceil x \rceil to represent the smallest integer that is greater than or equal to the positive real number x. When x is a fraction, \lceil x \rceil equals 1. Consider the case where DOP = i > n. Assume all n processors are used to execute W_i exclusively. The execution time of W_i is

t_i(n) = \frac{W_i}{i\Delta} \left\lceil \frac{i}{n} \right\rceil    (3.23)

Thus the response time is

T(n) = \sum_{i=1}^{m} \frac{W_i}{i\Delta} \left\lceil \frac{i}{n} \right\rceil    (3.24)

Note that if DOP = i < n, then t_i(n) = t_i(\infty) = W_i/(i\Delta). Now, we define the fixed-load speedup factor as the ratio of T(1) to T(n):

S_n = \frac{T(1)}{T(n)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{n} \right\rceil}    (3.25)

Note that S_n <= S_\infty <= A, by comparing Eqs. 3.4, 3.7, and 3.25.
A number of factors we have ignored may lower the speedup performance. These include communication latencies caused by delayed memory access, interprocessor communication over a bus or a network, or operating system overhead and delay caused by interrupts. Let Q(n) be the lumped sum of all system overheads on an n-processor system. We can rewrite Eq. 3.25 as follows:

S_n = \frac{T(1)}{T(n) + Q(n)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{n} \right\rceil + Q(n)}    (3.26)

The overhead delay Q(n) is certainly application-dependent as well as machine-dependent. It is very difficult to obtain a closed form for Q(n). Unless otherwise specified, we assume Q(n) = 0 to simplify the discussion.

Jl.mdahJ": Law Revisited In I967, Gene Amdahl derived a fixed-load speedup For the special case
where the computer operates either in sequential mode {with DOP = 1) or in perfectly parallel mode (with
DOP = n]. That is, ll’, = ID if i#= lor F asn in the profile. Equation 3.25 is then simplified to
if m+m
(3.2?)
m+mm
That Ml.'I;Ifllb' HI" l'n¢r.q|r_.tI|»r\ -

lltfii Advanced Cmtpttterfltrdaiiedtue

Amdahl’s law implies that the sequential portion oi‘ the program ii} does not change with respect to the
machine size H. However, the parallel portion is evenly executed by n processors, resulting in a reduced time.
Consider a normalized situation in which ii] + Hi, = rr i (1 - tr] = 1, with rr = IF, and rr = iij,
Equation 3.27 is reduced to Eq. 3.14, where cr represents the percentage of a program that must be executed
sequentially and 1 — rr corresponds to the portion of the code that can be executed in parallel.
Amdahl’s law is illustrated in Fig. 3.8. when the number of processors increases, the load on each
processor decreases. However. the total amount of work (workload) ii’, + W” is kept constant as shown in
Fig. 3.8a. In Fig. 3.Bb, the total execution time decreases because 7], = H-",,r'n. Eventually, the sequential part
will dominate the perfonnance because 1",, —-> U as n becomes very large and T1 is kept unchanged.
Workload Eioacution Tlmo

IE 1

W" W“ W“ W" W“ W" *~ T ~ 1


T.
1 2 3 4 5 6
n
1 2 3 4 5 B
T" J7
n
No of processors Ho. of processors
[a] Fixod workload [bl Decroasi ng execution time

-‘=‘-tmdtm
s
i "3 102491!

1024
91* Sass‘ m.
4-Bx
31” -——- 24x 1x

0% 1% 2% 3% 4% 100%
Sequential fraction of prcgram
[e] Speedup with afisod load

Fig. 3.! Fixed-ioa-d speedup model arrdhmdahfs law

Sequential Bottleneck Figure 3.8c plots Amdahl’s law using Eq. 3.14 over the range ii 5 rr 5 1. The
maximum speedup S" = rt if rr = D. The minimum speedup 8,, = 1 if rr = l. As rt —> M, the limiting value of
5',,—> 11"rr. This implies that the speedup is upper-bounded by lfrr, as the machine size becomes very large.
FM Mcfiruw Htllr1 orqr wins
Pimple Dfs-C|t?rlGbtl|E Perfwmorrce i |||

The speedup curve in Fig. 3.8:: drops very rapidly as rr increases. This means that with a small percentage
ofthe sequential code, the entire perfonnance cannot go higher than llrr. This rrhas been called the sequential
horrlene-:-Ir in a program.
The problem of a sequential bottleneck cannot be solved just by increasing the number of processors in a
system. The real problem lies in the existence ofa sequential fraction of the co-tlc. This property has imposed
a pessimistic view on parallel processing over the past two decades.
In fact, two major impacts on the parallel computer industry were observed. First, manufacturers were
discouraged from making large-scale parallel computers. Second, more research attention was shifted toward
tlevclopitrg parallelizing compilers which would reduce the value of rr and in turn boost the perfomianee.

3.3.2 Gus'l:afson’s Law for Scaled Problems


One of the major shortcomings in applying Amdahl’s law is that the problem (workload) cannot scale to
match the available computing power as the machine size increases. ln other words. the fixed load prevents
sealahility in performance. Although the sequential bottloneclt is a serious problem, the problem can be
greatly alleviated by removing the fixed-load (or fixed-problem-size) restriction. John Gustafson (I 988] has
proposed a fixed-time concept which leads to a scaled speedup model.
Scaling for Higher Accumcy Time-critical applications provided the major motivation leading to the
development of the fitted-load speedup model and Amdahl’s law. There are many other applications that
emphasize accuracy more than minimum turnaround time. As the machine size is upgraded to obtain more
computing power, we may want to increase the problem size in order to create a greater workload, producing
more accurate solution and yet keeping the execution time unchanged.
Many scientific modeling and engineering simulation applications demand the solution of very large-
soale matrix problems based on some partial differential equation (PDE} formulations diseretized with a
huge number of grid points. Representative examples include the use of finite-element method to perform
structural analysis or t.he use of finite-difference method to solve computational fluid dynamics problems in
weather forecasting.
Coarse grids require less computation, but finer grids require many more computations, yielding greater
accuracy. The weather forecasting simulation often demands the solution of four-dimensional PDEs. if one
reduces the grid spacing in each physical dimension (X, l", and Z] by a factor of lll and increases the time
steps by the same magnitude, then we are talking about an increase of 10* times more grid points. The
workload thus increases to at least 10,000 times greater.
With such a problem scaling, of course, we demand more computing power to yield the same execution
time. The main advantage is not in saving time but in producing much more accurate weather forecasting
This problem sealing for aecrlracy has motivated Gnstafson to develop a fixed-time speedup model. The
scaled problem keeps all the increased resources busy, resulting in a better system utilization ratio.
Fixed-Time Speedup In accuracy-critical applications. we wish to solve the largest problem size possible
on a larger machine with about the same execution time as for solving a smaller problem on a smaller machine.
As the machine size increases, we have to deal with an increased workload and thus a new parallelism profile.
Lct m’ be the maximum DCIP with respect to the scaled problem and Hr]-' be the scaled workload with D01‘ = i.
Note that in general ll";-’ re Hr’, for 2 E i E of and W1 = l-l"|. Thc fiitcd-time speedup is defined under the
assumption that TU) — T’{n], where T’('n') is the execution time of the scaled problem and Tl 1] corresponds
to the original problem without scaling. We thus obtain
I I115 Advanced CurtputerArdw'Iect||u'e

m ml _r; .

E"? = + Qt») (3.2s)


J | 1 l

A general formula for fitted-time speedup is defined by Sf, = Til )fT'(n), modified from Eq. 3.26:

M“ -s -s
s',,= = :tmi
~4= (3.29)
LL Qt») zlu-;
IP45
Gu.rtofson": Law Fixed-time speedup was originally developed by Gustafson for a special parallelism
profile with W, = 0 if i sé l and i as n. Similar to ."tmdahl‘s law, we can rewrite Eq. 3.29 as follows, asswtiing
Qfnl = D,
.
m

ii,_-1
Sir _ t_t'{+ it-',; _ ti-1 Hm-',, (3.30)
‘" HQ + PF" ll-"1 + W"
Zn}.
1" I

where H’; — n H-"',, and H-'1 + H-"',, — H-"1 + H-"'j,="n, corresponding to the fixed-time condition. From Eq. 3.30, the
parallel workload H-‘f, has been scaled to n times H--1, in a linear fashion.
The relationship of a scaled workload to Gustafson's scaled speedup is depicted in Fig. 3.9. In fact,
Guatafsonb law can be restated as follows in terms of rr = HF] and 1 — rr = HF" under the same assuntption
W, + H-',, — 1 that we have made for Arndahl‘s law:
s:.= l"‘i’
rr + {1— rt) =~--so' 1) cw)
In Fig. 3.911, we demonstrate the workload scaling situation. Figure 3.91:: shows the fixed-time execution
style. Figure 3.9c plots Ji-‘I, as a fiirlction of the sequential portion 1'1‘ ofa program ru.t|J1i.t1g on a system with
n — 1024 processors.
Note that the slope of the .5, curve in Fig. 3.90 is much flatter than that in Fig. 3.EIc. This implies that
Gustafson‘s law does support mlable perfomiance as the machine size increases. The idea is to keep all
processors busy by increasing the problem size. When the problem can scale to match available computing
power, the sequential fraction is no longer a bottleneck.

3.3.3 Memory-Bounded Speedup Model


Xian-I-le Sun and Lionel Ni (1993) have developed a memory-bounded speedup model which generalizes
A_mdahl‘s law and Gustafsons law to maximize the use of both CPU and memory capacities. The idea
is to solve the largest possible problem, limited by memory space. This also demands a scaled workload,
providing higher speedup, higher accuracy. and better resource utilization.
Memory-Bound Problem: Large-scale scientific or engineering computations often require larger
memory space. In fact, many applications of parallel computers are memory—bound rather than CPU—bound
H‘-1. - Mrfiruw HIM!‘ I NT .q|r_.u||rs ;5

Principim ofS<:oI-olale Peiftrmonce 1- | [3

Worlooad Eitacutlon Time

W“
W W»
n Til TH TH TH TH TH

1
‘"“
2 3 4 5 6
No. of processors
n
1 2 3 4 5 6
No. of processors
n

la] Scaled workload [bl Fined eioacutlon time

Speedup).
[SA] 1024a
'i F-1 F -t t
1014:: 1954;; 993,; *3,‘

aim =1o2t -102341

- -I-11
0% 1% 2% 3% 4%
S-aouortlal fraction of pro-gram
[cl Speedup with flioad oxia-cutlon time

Fig. 3.9 Fit-md~tlrne speedup model and Gttsrafsonb law

or U0-bound. This is especially true in a multicomputer system using distributed memory. The local memory
attached to each node may be relatively small. Therefore, each node can handle only a small subproblem.
when a large number of nodes are used collectively to solve a single large problem, the total memory
capacity increases proportionally. This enables the system to solve a scaled problem through program
partitioning or replication and domain decomposition of the data set.
instead of keeping the execution time fixed, one may want to use up all the increased memory by scaling
the problem size further. In other words, if you have adequate memory space and the scaled problem meets
the time limit imposed by Gustafsorfs law, you can further increase the problem size, yielding an even better
or more accurate solution.
A memory-bounded model was developed under this philosophy. The idea is to solve the largest possible
problem, limited only by the available memory capacity. This model may result in an increase in execution
time to achieve scalable pcrformanoe.
Fr‘:-r Mtfirow uritt-...¢-,.,at.¢. '
I I4 "XII Advanced Cunp-uterArdu';tectm'c

Fixed-Nlemory Speedup Lct M be the memory requirement ofa given problem and H-‘be the computational
workload. These two factors are related to each other in various ways, depending on the address space and
architectural constraints. Let us write It’ = g{'il-1") or M = g l{_ H"), where g l is the inverse ofg.
ln a multicomputer, the total memory capacity increases linearly with the number of nodes available.
We write W = E:-"1 Ii} as the workload for sequential execution of the prograln on a single node, and
H»-'* = if-7', li-'1 as the scaled workload for execution on n nodes, where m* is the maximum DOP of the scaled
problem. The memory requirement for an active node is thus bounded by g 1 (_E_j." I Iii.) .
A_,|"i.xed-memory speedup is defined below similarly to that in Eq. 3.29.
of

s:=.=
2»?
1 I
o-so
1 i'l+oc~i
2%. I N

The workload for sequential execution on a single processor is independent of the problem size or system
size. Therefore, we can write W1 = lt"1 = lt'1‘- in all three speedup models. Let us consider the special case of
two operational modes: .sequcnrin! versus pwrfirctlit parallel execution. The enhanced memory is related to the
sealed workload by W1‘, = g*(nM_], where mid is the increased memory capacity for an n-node multicomputer.
Furthermore, we assume g*{_nM‘j = G[n)g(_M) = G{n]li’,,, where ll"',, = gtit-f,t and g"‘ is a homogeneous
function. The factor Gin) reflects the increase in workload as memory increases n times. Now we are ready
110 reunite Eq. 3.32 tutder the Hssumpfion that Ht] = Cl if i ac lor n and Q11) = ll:
St: tr; + W; = ti; + o{_a)tt',, (3.33)
up + it-;r*;,, H-I + G-[n]li'".~’n
Rigorously speaking, the above speedup model is valid under two assumptions: {ll The collection of
all memory forms a global address space (in other words, we assume a shared distributed memory space];
and (2) All available memory areas are used up for the sealed problem. There are three special cases where
Eq. 3.33 canapply:
Cam I .' Gfnl = l. This corresponds to thc case whcrcthcpmblcm size is fixed. Thus, thc fixcd-memory
speedup bccomcs cquivalcnt to Arndahlls law; i.c. Eqs. 3.2? and 3.33 arc cquivalcnt whcn a fixcd
workload is given.
Case 2: G('n] = n. This applics to thc case whcrc thc workload increases rt times when thc mcmory is
increased rt timcs. Thus, Eq. 3.33 is identical to Gu.stafson‘s law (Eq. 3.30j with afixed csccution time.
Case 3: 6'1’rt] 2* n. This corresponds to thc situation where thc computational workload incrcascs faster
than thc memory tcquircmcnt. Thus, thc fixed-mcmoty modcl (Eq. 3.33] will likcly give a higbcr
speedup than thc fixed-time spccdup |[:Eq. 3.30).

The above analysis leads to the following conclusions: Amdahl’s law and Gustafson‘s law are special
cases of the fixed-memory model. When computation grows faster than the memory requirement, as is often
true in some scientific simulation and engineering applications, the fixed-memory model (Fig. 3.10} may
yield an even higher speedup (i.e., 5'1‘, 2 51, 2 3,) and better resource utilization.
The fixed-memory model also assumes a scaled workload and allows an increase in execution time. The
increase in workload (problem size) is memory-bottnd. The growth in machine sine is limited by increasing
rt».-Mrliruw rrmr1 ..|-qt;mm \ '
Principle: ofS<:olo.ble P£lf£I'fl"l0JlCE _ i - | [5

eonununication demands as the number ofproeessors becomes large. The fixed-time model can be moved
very close to the fi:r.ed—memory model if available memory is fully utilized.
Workload Ettoeutlon Time

T 1
W“ wn ‘ll
W T T

i n Tn TI] Tn Tn n n
II II
A5to -ll en co
toU5 _.. to or -ll on er:
No. of processors No. of pro-oossors
[a) Scaled workload {la} Increased 9JtQfl.lllO1'lllfl\B

Fig. 3.10 Scaled speedup model using fixed memory {Cournesy of Sun and Niwepfineed with purrnission from
MM Supercmnpud|1g,'l‘?9D}

L»)
£3 Example 3.6 Scaled matrix multiplication using global
versus local computation models (Sun and
Hi,1993)
In scientific computations, a matrix ofien represents some diseretized data continuum. Enlarging the matrix
size generally leads to a more aoeutate solution for the continutun. For matrices with dimension n, the munber
of computations involved in matrix multiplication is 2:13 and thc memory requirement ia roughly M = 3n2.
As the memory increases n times in an n-processor multicomputer system, nil! = n >< 3-n2 = 3n3. If the
enlarged matrix has a dimension of N, then 3:23 - 3N2. Therefore, N - 111-5. Thus Gui] - ni "5, and the sealed
workload lV‘f, = G'[n]W,, = nu Hr’. Using Eq. 3,33, we have

ll" '-‘W u-' ‘-5 tr


5" = WI + it H3, ii, + n - H5, ~23-34>
H

under the global computation model’ illustrated in Fig. 3.113,. where all the distributed memories are used as.
a common memory shared by all processor nodes.
As illustrated in Fig. 3.] lb, the node memories are used locally without sharing. In such a lam! t-om,nurnn'rn
nirztdef, Gin} - n, and we obtain the following speedup:

sf; _ (3.35)
ii, + ll"
I I6 1- Advanced Cornputerhrchitecture
- -1

liil
AB1 A52 AB3
~l ii ti
AB“
[a] Global computation with eistrihuted shared memories

AB1 AB: A53 AB“


[b] Local oomputatlon with distributed private memories

Fig. 3.11 ‘Fm: models for die distributed rnatrlx multiplication

The above example illustrates Gustafson’s scaled speedup for local computation. Comparing the above
two speedup expressions, we realize that the fixed-memory speedup (Eq. 3.34) may be higher than the fixed-
time speedup [Eq. 3.35}. In general, many applications demand the use of a combination of local and global
addressing spaces. Data may be distributed in some nodes and duplicated in other nodes. Data duplication is
added deliberately to reduce communication demand. Speedup factors for these applications depend on the
ratio between the global and local computations.

SCALABILITY ANALYSIS AND APPROACHES


_ The perfonnance of a computer system depends on a large number of factors, all affecting the
scalability of the computer architecture and the application program involved. The simplest
definition of .sc'ofrIlJr'fr'I_y is that the performance of a computer system increases linearly with respect to the
number of processors used for a given application,
Scalability analysis of a given computer system must be conducted for a given application program’
algorithm. The analysis can be pcnforrocd under different constraints on the growth of the problem size
(workload) and on the machine size (number of processors}. A good understanding of scalability will help
evaluate the performance of parallel computer architectures for large-scale applications.

3.4.1 Scalability Metrics and Goals


Scalability studies determine the degree of matching between a computer architecture and an application
algorithm. For different (architecture, algorithm) pairs, the analysis may end up with different conclusions. A
machine can be very efficient for one algorithm but bad for another, and vice versa.
Thus, a good computer architecture should be efficient in implementing a large class of application
algorithms. ln the ideal case, the computer performance should be linearly scalable w'ltl1 an increasing number
of processors employed in implementing the algorithms.
Scalability metric: identified below are the basic metrics (Fig. 3.12) affecting the scalability ofa computer
system for a given application:
r Mocfrinesize 1' n-j—tl'|e number ofproeessors employed in a parallel computer system. A large machine
size implies more resources and more computing power.
rs» Mefirulw Hlllf 1 nr".I||r_.u| w u :
Pnncrpla ofS<:olablePetfu'ma.nce ? - H1

CPU chine Computer


S an E
Scalability of
[architecture agorlthm] fit?
mancl
Demand Comb-I nation De

Eli
is Hui Problem
s
Fig. 3.12 Scalability metric
Communication
Ov ad

Clock rnre (fl—the clock rate dctennines the basic machine cycle. We hope to build a machine with
components (processors, memory, bus or network, etc.) driven by a clock which cart scale up with
better technology.
Problem size {s')—the amount ofcomputational workload or the ntlmber ofdata points used to solve a
given problem. The problem size is directly proportional to the seqrienrriei exeeririon rime T(.s, 1] for a
uniprocessor system because each data point may demand one or more operation s.
CPL’ time (T_t—the act1.|al CPU time {in seconds] elapsed in eseeuting a given program on a parallel
machine with rt processors oollectively. This is the parable! exeerrrion tirrte, denoted as ifs, rt] and is a
firnction ofbofn .1." and rr.
HO deemed (d')—the inputioutput demand in moving the program, data, and results associated with a
given application run. The [IO operations may overlap with the CPU operations in a mu ltiprog rammed
environment.
Merrtorft-' eupneiF_1-' {'rrrj—the amotmt of main memory [in bytes or words] used in a program execution.
Note that the memory demand is affected by the problem size, the program size, the algorithms, and
the data structures used.
The memory demand varies dynamically during program -tntecution. Here, we refer to the maximum
number of memory words demanded. Virtual memory is almost unlimited with a 64-bit address space.
lt is the physical memory which may be limited in capacity.
Cnrnmrmienrinn at-'erfi-end {.lr'j—the amount of time spent for interprocessor communication,
synchronization, remote memory access, etc. This overhead also includes all noneompute operations
which do not involve the C'PUs or [IO devices. This overhead !r{.s__ rt] is a function ofs and rr and is not
part of Tfs, rrj. Fora uniproeessor system, the overhead h(s, 1) = U.
Cornpurer ens‘! ('e]—the total cost of hartlware and software resources required to carry out the
execution ofa program.
Progmnrrrnng overlread {pj—the development overhead associated with an application program.
Programming overhead may slow down software productivity and thus implies a high cost. Unless
otherwise stated, both computer cost and programming cost are ignored in our scalability analysis.
Depending on the computational objectives and resource constraints imposed, one can fix some of the
above pra.ramEl¢1'$ and optimize the remaining ones to achieve the highest performance with the lowest eost.
The notion of scalability is tied to the notions of speedup and efiieieney. A sound definition of scalability
must be able to express the effects ofthe art:l:|iteeture‘s interconnection network, ofthe communication patterns
re» Altliruw um r-...=-mm. '
I in 1- _ .ltduenoedCornptrte|'.l|rclti1ectm'e

inltercnt to algorithms, of the physical constraints imposed by technology, and of the cost effectiveness or
system cfliciency. We introduce first the notion of speedup and efficiency. Then we define scalability based
on the relative performance of a real machine compared with that of an idealized theoretical machine.
Speedup and Efflcleney Revisited For a given architecture, algorithm, and problem size s, the est-‘ntpro.'ic
.5‘peerfi.ip 5'(s, n) is the best speedup that is attainable, varying only the number tn} of processors. Let Tis, 1)
be the sequential execution time on a uniprocessor, T(s, n) be the minimum parallel execution time on an
n-processor system, and Ms, n) be the lump sum of all comrnunioation and UCI overheads. The asymptotic
speedup is formally defined as follows:
sor, H] = i (3.315)
T(_s,rI)+ h(.~t, rt]
The problem size is the independent parameter, upon which all other metrics are based. A meaningful
measurement of asymptotic speedup mandates the use of a good sequential algorithm, even it is different from
the structure of the corresponding parallel algorithm. The Ti-t, rt) is minimal in the sense that the problem is
solved using as many processors as necessary to achieve the minimum runtime for the given problem size.
In scalability analysis, we are mainly interested in results obtained from solving large problems. Therefore,
the run times ‘Its, n) and Its, I} should be expressed using order-of-magnitude notations, reflecting the
asymptotic behavior.
The system efliciency of using the machine to solve a given problem is defined by the following ratio:
E(s, H) = ~‘i*i-1-”l om
ln general, the best possible cfliciency is one, implying that the best speedup is linear, or Slfs, n} = n.
Therefore, an intuitive definition of scalability is: A system it scalable ifrhe svsrem efliciertcy El.-r, rt} = lfor
nil nigtnrirhrns it-‘lift any ntvmirer ofn pmeessors turd any problerrr size .s_
Mark Hill (I990) has indicated that this definition is too restrictive to be useful because it precludes any
system from being called scalable. For this reason, a more practical elliciency or scalability definition is
needed, comparing the performance of the real machine with respect to the theoretical PRAM model.
Scalability Definition Nussbaum and Agarwa] ([991) have given the following scalability definition
based on a PRAM model. The scalability <D(_s, n) ofa machine for a given algorithm is defined as the ratio of
the asymptotic speedup Sts, rt) on the real machine to the asymptotic speedup S,{_s, rt] on the ideal realization
of an EREW PRAM.
T(.s,l)
s_,{s_. rt) = i
in [Sm]

where 1']-{'s, rt) is the parallel execution time on the PRAM, ignoring all communication overhead. The
scalability is defined as follows:
S(s, rt] T} (s, rt]
., = i = —? 3.38
(Du. H) .S',.{_.s,n] T(.s,n) ( )
intuitively, the larger the scalability, the better the performance that the given architecture can yield
running the given algorithm. in the ideal case, S;{_s, rt} = rt, the scalability definition in Bq. 3.38 becomes
identical to the efliciency definition given in Eq. 3.31".
Principle ofS-soluble Perftrmonce i | |g

5*? Example 3.? Scalability of various machine architectures


for parity calculation (Nussbaum and Agar-wal,
1991)
Table 3.4 shows the execution times, asymptotic speedups, and sealabilities (with respect to the ER.EW-PRAM
model) offive representative interconnection architectures: linear array, 2-D and 3-D meshes, hypercube, and
Omega network, for running a parallel parity calculation.

Table 3.4 Scoialility of ‘vlarious Hentrorlt-Based iirchinecnires fir the Parity Calculation

."l1lcrc'hine' ."l.F£'.i1'l:.l'é'£'l‘I.|'.l"¥.'
."l-{erricat
Linear army I 2-D ntrsir 3-D mesh fi'_iyx-rcube O-negu Nenwrir
Tl}, n) rm rm 3"‘ logs lflgzs

.'§[.'t', Ir] rm ity} .9“ .t.I'lDg .'l' .s/logzx

(Dis, rr) lDg.t/sm log .tf.sl'l3 log .\‘I.tm l lfiog .r

This calculation examines s bits, determining whether the number of bits set is evtm or odd using a
balanced binary tree. For this algorithm, T-[s, I}- s, i"]{s, n] — logs, and 5‘,{'s, n) - silogs for the ideal PRAIH
machine.
On real architectures, the parity algorithm"‘s performance is limited by network diameter. For example,
the linear array has a network diameter equal to n — l, yielding a total parallel running time of sin + n. The
optimal partition of the problem is to use n — J; processors so that each processor performs the parity check
on J; bits locally. This partition gives the best match between computation costs and communication costs
w ith ills, rt) = am, S[_s, n} = .3": and thus scalability <l>{s, n) = logsism.
The 2D and 3D mesh architectures use a similar partition to match their own cornniunieation structure with
the computational loads, yielding even better scalability results. It is interesting to note that the scalability
increases as the communication latency decreases in a network with a smaller diameter.
The hypercube and the Omega network provide richer communication structures [and lower diameters)
than meshes of lower dirnensionality. The hypercube tlocs as well as a PRAM for this algorithm, yielding
tlilts, n} — l.
The Omega network {I-‘lg. 2.24) does not exploit locality: communication with all processors takes the
same amount of time. This loss of locality hurts its performance when compared to the hypercube, but its
lower diameter gives it better scalability than any of the meshes.

Although performance is limited by network diameter for the above parity algorithm, for many other
algorithms the network bandwidth is the performance-limiting factor. The above analysis assumed unit
eornmunieafion time between directly connected cornniunication no:1es.An architecture may be scalable for
one algorithm but unsealable for another. One must examine a large class of useful algorithms before drawing
a scalability conclusion on a given architecture.
I 20 "Z5 Advanced Carnpinzerhrclnitectm-e

3.4.2 Evolution of Scalable Computers


The idea of massive parallelism is rather old, the technology is advancing steadily, and the software is
relatively unexplored, as was observed by C-ybenko and Kuck (1992). Cine cvolutional trend is to build
scalable supercomputers with distributed shared memory and standardized LJNIXILINLTX for parallel
processing. In this section, we present the cvolutional path and some scalable computer design concepts;
recent advances in this direction are discussed in Chapter I3.
The Evolutionol Path Figure 3.13 shows the early evolution of supercomputers with four-to-five-
year gestation and of micro-based scalable computers with three-year gestation. This plot shows the peak
performance ofCray and NBC supercomputers and ofCray, lntel, and Thinking Machines scalable computers
versus the introduction year. The marked nodes correspond to machine models with increasing size and cost.

10.000

The lntel
Teiaflop saaoii
..-. t _/'- C-ray
cos s2»:or:,/ /‘ Massively _
§ "“*\
I /’ .Ei‘sti,
ems S120-M I-‘T
_PG |ma|s55|v| a . /
CM5 SEJDM
S
l.

{Ganeeigs-flo NEC
" Sip-are _

10-
PeakPer
m fo
ray
Sipersr
1 .
19-B3 1990 19% 1994 1996 1998- 2000
Year
Fl‘. 3.13 The perfornwi-i:e {lo Gftops] of various comp-uters manufactured during 19% by Cray itesmrch. |ne..
NEG lra:el.antl'l'l'|lnltlng Machines Corporation Slloursesy oi Gordon Bell: reprinted with permlulon
from the Communications 0-MGM. August ‘i99'2][1

In 198 8, the Cray ‘r’-MP S delivered a peak of2.8 Gflops. By 1991, the Intel Touchstone Delta, a 672—node
multicomputer, and the Thinking Machines CM-2, a 2K PF. SIMD machine, both began to supply an order-
of-rnagnitude greater peak power (20 Gfiops} than conventional supercomputers. By mid-1992, a completely
new generation of computers were introduced. including the CM-5 and Paragon.
in Thinking l\"l.|'.‘|Cl1lt‘l£'S Corporation has since gone out of business.
F?» Mtfirpw Hlllf1 wt’qt mm
Principle: ofS-colohle Perftrmonce i |1|

tn the past, the IBM Systenfloll provided a ltltlzl range of growth for its various models. DEC VAX
machines spanned a range of lllllllcl over their lifetime. Based on past experiences, Gordon Bell has identified
three objectives for designing scalable computers. Implications and case studies of these challenges will be
further discussed in subsequent chapters.
Size Scalability The study of system scalability started with the desire to increase the machine size. A size-
scalable computer is designed to have a scaling range fi'om a small to a large number of resource components.
The expectation is to achieve linearly increased performance with incremental expansion for a well-defined
set of applications. The components include computers, processors or processing elements, memories,
interconnects, switches, cabinets, etc.
Size scalability depends on spatial and temporal locality as well as component bottleneck. Since very large
systems have inherently longer latencies than small and centralized systems, the locality behavior ofprogram
execution will help tolerate the increased latency. Locality will be characterized in Chapter 4. The bottleneck-
free condition demands a balanced design among pming, storage, and H0 bandwidth.
For example, since MP'Ps are mostly interconnected by large networks or switches, the bandwidth of the
switch should increase linearly with processor power. The L"O demand may exceed the processing bandwidth
in some real-time and large-scale applications.
The Cray Y-MP series scaled over a range of I6 processors (the C-90 model) and the current range of
Cray supercomputers offer a much larger range of scalability (see Chapter I3). The CM-2 was designed to
scale between SK and 6411 processing elements. The CM-5 scaling range was 1024 to 16K computers. The
KER-l had a range of 3 to I038 processor-memory pairs. Size-scalability cannot be achieved alone without
considering cost, efficiency, and prograrnmability on reasonable time scale.
Generation [Time] Scalability Since the basic processor nodes become obsolete every three years, the
time scalability is equally important as the size scalability. Not only should the hardware technology be
scalable, such as the CMOS circuits and packaging technologies in building processors and memory chips,
but also the soflwarelalgoritlun which demands software compatibility and portability with new hardware
systems.
DEC claimed that the Alpha microprocessor was generation-scalable for 25 years. In general, all computer
characteristics must scale proportionally: processing speed, memory speed and size. interconnect bandwidth
and latency, U0, and soflware overhead, in order to be useful for a given application.

Pmblelrl Scalability The problem size corresponds to the data set size. This is the key to achieving scalable
performance as the program granularity changes. A problem scalable computer should be able to perfomi
well as the problem size increases. The problem size can be scaled to he sufficiently large in order to operate
cfliciently on a computer with a given granularity.
Problems such as Monte Carlo simulation and ray tracing are “perfectly parallel”, since their threads of
computation do not come together over long spells of computation. Such an independence among threads is
very much desired in using a scalable MPP system. In general, the pro elem gnmu!rtrr't‘_t-‘ (operations on a grid
point./data required from adjacent grid points] must be greater than a nuichirw Is grannlariri-' (node operation
mte.|'no-dc-to-node communication data rate) in order for a multicomputer to be effective.
I 22' 1- Advanced Comp-uter Architecture

5*? Example 3.8 Problem scaling for solving Laplace equation


on a distributed memory multicomputer
(Gordon Bell, 1992)
Laplace equations are often used to model physical structures. A 3-D Laplace equation is specified by
‘Ila H21: H21:
V2 = '_+_+_=o 3.39)
“ of of H22 ( "
We want to determine the problem scalability of the Laplace equation solver on a distributed-memory
multicomputer with a sufficiently large number of processing nodes. Based on finite-diflierenee method,
solving Eq. 3.39 requires performing the following averaging operation iteratively across a very large grid,
as shown in Fig. 3. I4:
tot 1 z u t -Ii < -In -1: c —|i —u
“i'._."J~' * E |:i"';':iT_.I'.-l" +“i':r|._,|'.Jt +ui.n.Ii—l..l +Hi.3f+l_k +“:':[k—| +ui.:r.i+l

where I 5 i,j, It 5 N and Nis the number of grid points along each dimension. In total, there are hi] grid points
in the problem domain to be evaluated during eaeh iteration m for 1 S m S M.
The three-di.|:nensional domain can be partitioned into p sub-domains, each having n3 grid points sueh that
)'JH3 - Ev“. where p is the machine size. The computations involved in each subclomain are assigned to one
node of a multicomputer. Therefore, in each iteration, each node is required to perfomi Trri computations as
specified in Eq. 3.40.
Z

[0, r rm, r n]
[0, m, r n) [0, rn, r n]
AP
g [0, r n, r n-n]

[r n, ID, r rt)

,-L In
‘ ‘ L’ (0, r n, my

isQillqan A,
X
[rn, Ct, ID] [r n, rn, D]

la] 5|! GU59 5IJ|i'5<>"‘B|"$ [b] An Ha" N 1-: Hgrld partitioned Into p subdomelns,
aqaeent to a cube so belomaln 3 3 3
Mme came, each being an n3 etbe, where p = r = N In

Fig‘. 3.14 Partitioning of a ED domain for solving the Laplace equation


FM-Altfirulw Hiiir1 ..|-qt;our It '
Principle: ofS-colobile Petftrmonce i - [13

Each subdomain is adjacent to sis other subdomains (Fig. 3.14s]. Therefore, in each iteration, each node
needs to exchange (send or receive] a total of6:12 words offloating-point numbers with its neighbors. Assume
each floating-point number is double-precision (ti-4 bits, or 8 bytes}. Each processing node has the capability
ofperforming lll-tl Mtlops (or lJ.l.il he per floating-point operation}. The internode communication latency is
assumed to be 1,us [or l megaword/s} for transferring a floating-point number.
For a balanced multicomputer, the computation time within each node and inter-node communication
latency should be equal. Thus l].D?n3,us equals (Sn: its communication latency, implying that n has to be at
least as large as 86. A node rnernory of capacity 863 >< S = 640K >< 8 = 5120 Kwords = 5 megabytes is needed
to hold each subdomain of data.
On the other hand, suppose each message exchange takes 2 ,us [one receive and one send) per word. The
eonummicafion latency is doubled. We desire to scale up the problem size with an enlarged local memory
of 32 megabytes. The subdomain dimension sine n can be extended to at most loti, because I503 >< ti = 32
megabytes. This size problem requires 0.3 s of computation time and 2 >< 0.15 s of send and receive time.
Thus each iteration takes [L6 (0.3 + 0.3) s, resulting in a computation rate of fill Mflops, which is only 5i'l%
of the peak speed of each node.
ll" the problem size n is further increased, the elTective Milops rate and efficiency will be improved. But
this cannot be achieved unless the memory capacity is further enlarged. For a fixed memory capacity, the
situation corresponds to die memorybound region shown in Fig. 3.6c. Another risk of problem sealing is to
exacerbate the limited U0 capability which is not demonstrated in this example.

To summarize the above studies on scalability. we realize that the machine size, problem size, and
technology scalabilities are not necessarily orthogonal to each other. They must be considered jointly. In the
next section, we will identify additional issues relating scalability studies to software compatibility, latency
tolerance. machine prograrnmability. and cost-effectiveness.

3.4.3 Research Issues and Solutions


Toward the development of truly scalable computers, much research is being done. In this section, we briefly
identify several frontier research problems. Partial solutions to these problems will be studied in subsequent
chapters.
The Problem: When a computer is sealed up to become an MPP system, the following difficulties can
arise:
1 Memory-access latency becomes too long and too nonuniformly distributed to be considered tolerable.
I The [PC complexity or synchronization overhead becomes too high to be useful.
I The multicache inconsistency problem becomes out of control.
I The processor utilization rate deteriorates as the system size becomes large.
I Message passing [or page migration) becomes too time-consuming to benefit resource sharing in a
large distributed system.
' Overall system perforlnanoe becomes saturated with diminishing retum as system size increases
further.

Some Approaches In order to overcome the above difficulties, listed below are some approaches being
pursued by researchers:
re» Mtfiruw um =-...=-mam. '
I 24 1- _ flidmnced Clrnpti-tarhrchitecuue

' Searching for latency reducing and fast synchronization techniques.


- Using weaker memory consistency models.
- Developing scalable caehc coherence protocols.
I Realizing shared virtual memory system.
' [ntegrating multithreaded architectures for improved processor utilization and system throughput.
' Expanding software portability and standardizing parallel and distributed Ul\lIX|"I..lNU'X systems.

Scalability analysis can be carried out either by analytical methods or through trace-driven simulation
experiments. In Chapter '9, we will study both approaches toward the development of scalable computer
architectures that match program.-’ algorithmic behaviors. Analytical tools include the use of Markov chains,
Pelri nets, or queueing models. A number of simulation packages have already been develop-ed at Stanford
University and at MIT.
Supporting Issue: Besides the emphases of scalability on machine size. Problem sire and technology, we
identify below several extended areas For continued research and development:

-[1] .'i'q,iFinure seoloiaiiiri-': As problem sin: scales in proportion to the increase in machine size, the
algorithms can be optimized to match the architectural constraints. Software tools are being developed
to help programmers in mapping algorithms onto a target architecture.
A pericct match between architecture and algorithm requires matching both computational and
communication patterns through performance-tuning experiments in addition to simple numerical
analysis. Optimizing compilers and visualization tools should be designed to reveal opportunities for
algorithm.-‘program restructuring to match with the architectural growth.
(Zj Reducing eonrnnrniearion or-'crherm".' Scalability analysis should concern both uselitl computations
and available parallelism in programs. The most difficult part of the analysis is to estimate the
communication overhead accurately. Excessive communication overhead, such as the time required to
synchronin: a large number of processors, wastes system resources. This overhead grows rapidly as
machine size and problem size increase.
Furthermore, the run time conditions are often diflicult to capture. l-low to reduce the growth of
communication overhead and how to tolerate the growth of memory-access latency in very large
systems are still wide-open research problems.
(3) Enimnerhgrimgrornrnnhiiirt-': Thccomputing community generally agrees that multicomputers are more
scalable; multiproccssors may be more easily programmed but are less scalable thart multicomputers.
It is the centralized-memory versus distributed private-memory organization that makes the difihrcnoe.
In the ideal case, we want to build machines which will retain the advantages of both architectures.
This implies a system with shared distributed memory and simplified message communication among
process-ornodes. Hsterogencous programming paradigms are needed for fi.|t1.|re systems.
{'4} Providing longer-'ir_i-' and gent-miiry: Other scalability issues include longer-ir_t-, which requires an
architecture with sufliciently large address space, and g,v.=nemiiI_1-', which supports a wide variety oi
languages and binary migration of software.
Performance, scalability, progxarnmability, and generality will be studied throughout the book for general-
purpose parallel processing applications, unless otherwise noted.
n-rrrcrrmrv Hffliormortnrr ‘
Princripl-in of Scalable Performance Z I25

||-- -
I

I. $>~‘* --_»' Summary


'lNith rapid advances in technology, scalability becomes an important criterion for any modern computer
system—and especially so for a parallel processing system. However: system scalability can only be
defined in terms of system performance, and therefore issues of scalability and system performance are
very closely interrelated. in this chapter. we have studied some brnic issues related to the perfornnnce
and scalability of parallel processing systems.
The main performance metric considered is the execution tirne of a parallel program which has a
specific parallelism profilc.As a program executes, the degree of parallelism in it vanes with time, and
therefore we can calculate the average degree of parallelism in the program.The parallelism profile also
allows us to csrirnarc the speedup achievable on the system as the number of processors is increased.
Apart from speedup. we also defined system efiiciency and system utilization as asymptotic functions
of the number of processors. On an n processor system, efficiency is defined as the speedup achieved
divided by n [which is the ideal case speedup). System utilization, on the other hand. indi-canes tl'1e fraction
of processor cycles which was actually utilized during program execution on the n processor system.
Benchrmrk programs are very useful tools in measuring the performance of computer systerns.‘ir'\€a
looked at certain well-known benchmark programs. although it is also true that no two applications are
identical and that therefore, in the final analysis, application specific benchmark programs are more useful.
We took a brief look at so-called ‘grand challenge’ applications of high performance computer systems;
those are applications which are likely to have major impact in science and technology Massively parallel
processing {MPP} systems are increasingly being applied to such problems; clearly performance and
scalability are important criteria for all such applications.
We then looked at some speedup performance laws governing parallel applications. Ant-dahl's law
states in essence that. for a problem of a given fixed size,as the number of processors is increased. the
speedup achievable is limited by the program fraction which must necessarily run as a sequential program,
i.e. on one processor. Gustafson's lawton the other hand, studies also the effect of increasing the problem
size as the system size is increased, resulting in the so—called fixed time speedup model. The third model
studied was the memory-bounded speedup model proposed by Sun and Ni.
The specific metrics which affect the scalability of a computer system for a given application aro—
machine size in number of processors, processor clock rate, problem size. processor tlrnre consumed, IID
requirement. memory requirement communication requirement. system cost. and programming cost of
ld'lE application.Open research issues related to scalability in massively parallel systems were reviewed.

g Exercises
Problem 3.1 Consider the parallel execution for synchronization among the four program parts
of the some program in Problem 1.4 on tr four- 50000 extra instructions are added to each divided
p rocessor systc m with shared me mory. The p rogram p rogram part.
Cflfl bfl pQ|'l.llIlOl‘lIEd lI‘lI.O fUl.lf 'BqL|3.l PHFI3 fO|' b-Blflflflfld _A5g_,|r|1,E thg garfye in-5[f'|_,|fljQ|'1 35 in Ppgblem

execution by the four processors. Due to the need for each divided program part;
TM Illnffifihir Hillfiurnpennri .
|z‘—
Advanced Compuiter Architecture

The CPI for the memory reference (with cache Problem 3.3 Let rr be the percentage of a
miss) instructions has been increased from B to 12 program code which can be executed simultaneously
cycles due to contentions.The CPlsfor the remaining by n processors in a computer system.A.ssume that
instruction types do not change. the remaining code must be executed sequentially by
{a} Repeat part (a) in Problem 1.4 when die a single processor. Each processor has an execution
program is executed on the four-processor rate of it HIPS. and all the processors are assumed
system. equally capable.
(b) Repeat part [b] in Problem 1.4 when the {a} Derive an expression for the effective HIPS
program is executed on the four-processor rate when using the system for exclusive
system. execution of this program. in terms of five
(c) Calculate the speedup factor of the four- parameters n. rz, and x.
processor system over the uniprocessor (b) lfn= 16 and x=40O l"‘llPS.determine the value
system in Problem 1.4 under the respective of {I which will yield a system performance of
trace statistics. 4000 HIPS.
{cl} Calculate the efficiency of the four-processor
Problem 3.4 Consider a computer which can
system by comparing the speedup factor in
execute a program in two operational modes: regular
part (c) widw dwe ideal case.
mode versus enhanced mode. with a probability
Problem 3.1 A uniprocessor computer can distribution of irr. i rr}, respectively.
operate in either scalar or vector mode. In vector {a)lfr1'varies between oand band O E o -=‘-b E 1.
mode. computations can be performed nine tima derive an -expression for fine average speedup
faster than in scalar mode. A ceriain benchmark factor using the har'monic mean concept
program tool-c time T to run on this computer. (b) Calculate the speedup factor when 0 -3 O and
Further. it was found that 25% of T was attributed to b -1 1.
the vector mode. In the remaining time.the machine
Problem 3.5 Considertheuseofafour-proces.s»or.
operated in the scalar mode.
shared-memory computer for the execution of a
[a) Calculate the effective speedup under the
program mix.The multiprocessor can be used in
above condition as compared with die
four execution modes corresponding to die active
condition when dwe vector mode is not used
use of one.tvvo. three. and four processors. Assurn-e
at all. Also calculate :1‘, the percentage of
that each processor has a peak execution rate of
code that has been vectorized in the above
500 l‘-1| PS.
program.
Let f] be the percentage of time dwat I processors
(b) Suppose vve double the speed ratio between
will be used in the above program execution and
the vector mode and the scalar mode by
1’, + fl + 1‘:-1 + f4 = 1.You can assume the execution
hardware improvements. Calculate five
rates R1. R1. R3. and R4. corresponding to five
effective speedup that can be achieved.
distribution (fi. fi. 15. fl). respectively.
(c) Supposethesame speedup obtainedinpart [b]-
(a) Derive an expression to show the harmonic
must be obtained by compiler improvements
mn execution rate R of the multiprocessor
instead of hardware improvements. ‘What
in terms of ff and R. for i = 1. 2. 3. 4.Also
would be the new vectorization ratio rz
show an expression for the harmonic mean
that should be supported by the vectorizing execution time T in terms of R.
compiler for the same benchmark program?
FM liilcfimu-‘ Hilllimnponm
Piirrcipko of Scalable Peifwmance W r21
(b) What would be the value of dwe harmonic next 32 iterations (l' = 33 to 64). and so on.
mean execution time Tof die above program What are the execution time and speedup
mix given fi = 0.4.fi = 0.3.1‘; = 0.2.151 = 0.1 and factors compared with part (a)? (Note that
R1= 400 l"1lPS.R1 = B00 l"1|PS.R3 = 1100 MIPS. the computational worldoad. dictated by the
R4 = 1500 MIPS? Explain the possible causes j-loop. is unbalanced among the processors.)
of observed R. values in the above program (c]- Modify the given program to facilitate
execution. a balanced parallel execution of all the
{c} Suppose an intelligent compiler is used to computational workload over 32 processors.
enhance the degree of parallelization in the By a balanced load.we mun an equal number
above program mix with a new distribution of additions assigned to each processor with
f1 = 0.1. f1= 0.2. f3 =0.3. 1‘; =0.4.Whatwould respect to both loops.
be the harmonic mean execution time of the {d} What is the minimum execution time resulting
same program under the same assumption on from the balanced parallel execution on 32
{R} as in part (b)? processors? What is the new speedup over
the uniprocessor!‘
Problem 3.6 Explain the applicability and
die restrictions involved in using Amdahl's law. Problem 3.8 Consider the multiplications of
Gustafs/on‘s law. and Sun and Ni's law to estimate two n >< n matrices A = {oi} and B = {by} on a scalar
die speedup performance of an n-processor system uniprocessor and on a multiprocessor. respectively.
compared with that of a single-processor system. The matrix elements are floating-point numbers.
Ignore all communication overheads. initially stored in the main memory in row-major
order.The resulting product matrix C = {cg} where
Problem 3.? The following Fortran program is
C = A >< B. should be stored back to memory in
to be executed on a uniprocessor. and a parallel
contiguous locations.
version is to be executed on a shared-memory
multiprocessor. Assume a 2-address instruction format and
an instruction set of your choice. Each loadistore
Ll: no 10 I = 1,1024
instructiontakes. on the average.4 cycles to complete.
L2: SIJM-{Ii = If]
All ALU operations must be done sequentially on
L3: Do 2'3 J = 1, I the processor with 2 cycles if no memory reference
L4: 20 SLIM -:Ii = SUM -:Ii + I is required in the instruction. Dthenrvise. 4 cycles
La: 10 Continue are added for each memory reference to fetch an
Suppose statements 2 and 4 each take two operand. Branch-type instructions require. on the
machine cycle times, including all CPU and memory- average. 2 cycla.
access activities. Ignore the overhead caused by the {a} Write a minimal-length assembly-language
software loop control {statements L1. L3. and L5} program to perform the matrix multiplication
and all other system overhead and resource con- on a scalar processor with a load-store
flicts. architecture and floating-point hardware.
(a) What is the total execution time of the {b} Calculate the total instruction count. the
program on a uniprocessor? total number of cycles needed for the
[b] Divide the outer loop iterations among program execution. and the average cydes
32 processors with prescheduling as follows: per instruction [CPI]-.
Processor 1 executes the first 32 iterations {c} What is the MIPS rate of dwis scalar machine.
(i = 1 to 32). processor 2 executes the If the processor is driven by a 400-MH: clock?
FM liilcfimur Hilllimnponm
|zB—
Advanced Computer Architecture

(d) Sugest a partition of the above program to special case of the Si‘ expression.
execute dve divided program parts on an {d} Prove the relation 5.1.‘; S1. 2 Sn for solving the
N-processor shared-memory system with same problem on the same machine under
minimum time.Assume n = 1000N. Estimate different assumptions.
the potential speedup of the multiprocessor
Problem 3.11 Prove the following relations
over the uniprocasonassu ming dwe same type
among the speedup Sin}. efficiency E(n}, utilization
of processors are used in both systems. Ignore
U[n). redundancy Ri[n). and quality Q(n} of a parallel
the memory-access conflicts. synchronization
and other overheads.
computation. based on the definitions given by Lee
{W80}:
(e) Sketch a scheme to perform distributed
{a} Prove 1fn E E[n) E U[n) E 1, where n is the
matrix computations with distributed
number of processors used in the parallel
data sets on an N-node multicomputer
computation.
with distributed memory. Each node has a
{b} Prove 1 5 R(n} S 1i'E{n]- E n.
computer equivalent to the scalar processor
used in part [a]. {c} Prove the expression for Q(n} in 3.19.
|[f} Specify die message-passing operations (d} Verify the above relations using the
required in part [e). Suppose dvat. on the hypodwetical workload in Example 3.3.
average. each message passing requires 100 Problem 3.11 Rept Example 3.? for sorting s
processor cycles to complete. Estimate the numbers on five dilferentn-processor machines using
total execution time on the multicomputer dve linear array. 2D-mesh. 3D-mesh. hypercube. and
for the distributed matrix multiplication. Clnvep network as interprocessor communication
Hake appropriate assumptions if needed in ardwitectures. respectively.
your timing analysis. (aj Show the scalability of the five architectures
Problem 3.9 Consider the interleaved execution as compared with the EREVV-PRAM model.
of the four programs in Problem 1.6 on each of {b} Compare the results obtained in part (a) with
dve three machines. Each program is executed in a those in Example 3.7‘. Based on these two
particular mode with the msured HIPS rating. benchmark raults. rank the relative scalability
{a} Determine the arithmetic mean execution of the five architectures. Can the results be
time per instruction for each machine generalized to the performance of other
executing the combined workload. assuming algorithms?
equal weights for the four programs. Problem 3.13 Consider the execution of two
(b) Determine the harrnonic mean MIPS rate of benchmark programs. The performance of three
each machine. computers running these two benchmarks are given
{c} Rank the machines based on the harmonic below:
mean performance. Compare this ranking
Be.uit.l'z.|ritu'.I'r .lfii'.i'r'0.u.v' t'.'om.iJu.rc.i' ifonupurer t'.'o.m.imIe.i'
with that obtained in Problem 1.6.
of i 2 3
Problem 3.10 Answer or prove the following flmflhgv T. |".v'1‘r.: _i T;|’.\'ec) ]i'_.r.m-..1
statements related to speedup performance law: |l'J0fJI.i'
r.~p¢.=rr¢:|Ir'ni:'.v
(a] Derive the fixed-memory speedup expression
S3: in Eq. 3.33 under reasonable assumptions. Pmblcml | mo i I0
(b) Derive Amdahl‘s law {Sn in Eq. 3.14) as a Problem 2 | 100 rum S 1'-l l-l :I<:I

special case of the S: expression. Totaitime | 1001 110 40


{c} Derive Gus1afson's law [S1, in Eq. 3.31} as a
rr» Mcfi-rm-H um rmmm-I111
Principles of Scalable Peifwmance

{a} Calculate Rd and Fly, for each computer under {c} Prove that h(.s. n) = O{nlogn + s3} when
the equal-weight assumption fi = 15 = 0.5. mapping the Delcel-Nassimi-Sahni algorithm
{b} W"hen benchmark 1 has a constant R1 = on a hypercube with n = s3 = 23* nodes.
10 Mflops performance across the three Problem 3.15 Xian-He Sun [1992]] has
computers. plot Ra and Rh asa function of R1. introduced an iscrspeecl concept for scalability
which varies from 1 to 100 |*"'I'l"lops under the analysis.The concept is to maintain a fixed speed for
assumption fl = 0.8 and fl = 0.1 each processor while increasing d1e problem size.
{c} Repeat part (b} for the case f1 = O2 and Let W and W be two workloads corresponding to
f1 = 0.8. two problem sizes. Let N and N’ be two machine
{d} From the above performance results under sizes {in terms of the number of processors}. Let TH
diffe rentconditions,can you draw a conclusion and Ty be the parallel execution times using N and
regarding the relative performance of the N’ processors. respectively.
three machines! The isospeed is achieved when WKNTH} = VII’!
[N ’Thr). The isoeflfiolency concept defined by Kumar
Problem 3.14 In Example 3.5. four parallel
and Rao (1987) is achieved by maintaining a fixed
algorithms are mentioned for multiplioation of s >< s
elficiency through SN{W'_)fN = SH-(W)!N’. where
matrices. After read ing the original papers describing
S~(W} and are the corresponding speedup
these algorithms. prove the following com munication
factors.
overheads on the target machine architectures:
Prove that the two concepts are indeed equivalent
fa) Prove that h[s. n} = O(n log n +516) when if (i] the speedup factors are defined as the ratio of
mapping the Fox-Otto-Hey algorithm on a pom.ll'el speed Rh, to sequential speed R1 {rather than
J; X ‘ll; 1I'Dl"LlS. as the ratio of sequential execution time to parallel
{b} Prove that h(s. n} = Ofinm + nlogn + sin“)- execution time). and {ii} R1(W} = R1[W’}. ln other
words. isoefiiciency is identical to isospeed when
when mapping Berntsen‘s algorithm on
the sequential speed is fixed as the problem size is
a hypercube with n = 23* nodes, where
increased.
kS% logs.

You might also like