MODULE-1
Obviously, the fact that computing and communication were carried out with moving mechanical parts greatly limited the computing speed and reliability of mechanical computers. Modern computers were marked by the introduction of electronic components. The moving parts in mechanical computers were replaced by high-mobility electrons in electronic computers. Information transmission by mechanical gears or levers was replaced by electric signals traveling almost at the speed of light.
Computer Generations Over the past several decades, electronic computers have gone through roughly five generations of development. Table 1.1 provides a summary of the five generations of electronic computer development. Each of the first three generations lasted about 10 years. The fourth generation covered a time span of 15 years. The fifth generation today has processors and memory devices with more than 1 billion transistors on a single silicon chip.
The division of generations is marked primarily by major changes in hardware and software technologies. The entries in Table 1.1 indicate the new hardware and software features introduced with each generation. Most features introduced in earlier generations have been passed to later generations.
Progress in Hardware As far as hardware technology is concerned, the first generation (1945–1954) used vacuum tubes and relay memories interconnected by insulated wires. The second generation (1955–1964)
was marked by the use of discrete transistors, diodes, and magnetic ferrite cores, interconnected by printed
circuits.
The third generation (1965–1974) began to use integrated circuits (ICs) for both logic and memory in small-scale or medium-scale integration (SSI or MSI) and multilayered printed circuits. The fourth generation (1975–1990) used large-scale or very-large-scale integration (LSI or VLSI). Semiconductor memory replaced core memory as computers moved from the third to the fourth generation.
The fifth generation (1991–present) is highlighted by the use of high-density and high-speed processor and memory chips based on advanced VLSI technology. For example, 64-bit GHz-range processors are now available on a single chip with over one billion transistors.
The First Generation From the architectural and software points of view, first-generation computers were built with a single central processing unit (CPU) which performed serial fixed-point arithmetic using a program counter, branch instructions, and an accumulator. The CPU must be involved in all memory access and input/output (I/O) operations. Machine or assembly languages were used.
Representative systems include the ENIAC (Electronic Numerical Integrator and Calculator) built at the Moore School of the University of Pennsylvania in 1950; the IAS (Institute for Advanced Studies) computer based on a design proposed by John von Neumann, Arthur Burks, and Herman Goldstine at Princeton in 1946; and the IBM 701, the first electronic stored-program commercial computer built by IBM in 1953. Subroutine linkage was not implemented in early computers.
The Second Generation Index registers, floating-point arithmetic, multiplexed memory, and I/O processors were introduced with second-generation computers. High-level languages (HLLs), such as Fortran, Algol, and Cobol, were introduced along with compilers, subroutine libraries, and batch processing monitors. Register transfer language was developed by Irving Reed (1957) for systematic design of digital computers.
Representative systems include the IBM 7030 (the Stretch computer) featuring instruction lookahead and error-correcting memories built in 1962, the Univac LARC (Livermore Atomic Research Computer) built in 1959, and the CDC 1604 built in the 1960s.
The Third Generation The third generation was represented by the IBM 360/370 Series, the CDC 6600/7600 Series, the Texas Instruments ASC (Advanced Scientific Computer), and Digital Equipment's PDP-8 Series from the mid-1960s to the mid-1970s.
Microprogrammed control became popular with this generation. Pipelining and cache memory were introduced to close up the speed gap between the CPU and main memory. The idea of multiprogramming was implemented to interleave CPU and I/O activities across multiple user programs. This led to the development of time-sharing operating systems (OS) using virtual memory with greater sharing or multiplexing of resources.
The Fourth Generation Parallel computers in various architectures appeared in the fourth generation of computers using shared or distributed memory or optional vector hardware. Multiprocessing OS, special languages, and compilers were developed for parallelism. Software tools and environments were created for parallel processing or distributed computing.
Representative systems include the VAX 9000, Cray X-MP, IBM 3090/VF, BBN TC-2000, etc. During these 15 years (1975–1990), the technology of parallel processing gradually became mature and entered the production mainstream.
The Fifth Generation These systems emphasize superscalar processors, cluster computers, and massively parallel processing (MPP). Scalable and latency-tolerant architectures are being adopted in MPP systems using advanced VLSI technologies, high-density packaging, and optical technologies.
Fifth-generation computers achieved Teraflops (10^12 floating-point operations per second) performance by the mid-1990s, and have now crossed the Petaflops (10^15 floating-point operations per second) range. Heterogeneous processing is emerging to solve large-scale problems using a network of heterogeneous computers. Early fifth-generation MPP systems were represented by several projects at Fujitsu (VPP500), Cray Research (the MPP), Thinking Machines Corporation (the CM-5), and Intel (the Paragon). For present-day examples of advanced processors and systems, see Chapter 13.
Computing Problems It has been long recognized that the concept of computer architecture is no longer restricted to the structure of the bare machine hardware. A modern computer is an integrated system consisting of machine hardware, an instruction set, system software, application programs, and user interfaces. These system elements are depicted in Fig. 1.1. The use of a computer is driven by real-life problems demanding cost-effective solutions. Depending on the nature of the problems, the solutions may require different computing resources.
Fig. 1.1 Elements of a modern computer system: computing problems, algorithms and data structures, mapping, programming, hardware architecture, operating system, and performance evaluation
For numerical problems in science and technology, the solutions demand complex mathematical
formulations and intensive integer or floating-point computations. For alphanumerical problems in business
and government, the solutions demand efficient transaction processing, large database management, and information retrieval operations.
For artificial intelligence (AI) problems, the solutions demand logic inferences and symbolic manipulations. These computing problems have been labeled numerical computing, transaction processing, and logical reasoning. Some complex problems may demand a combination of these processing modes.
Algorithms and Data Structures Special algorithms and data structures are needed to specify the computations and communications involved in computing problems. Most numerical algorithms are deterministic, using regularly structured data. Symbolic processing may use heuristics or nondeterministic searches over large knowledge bases.
Problem formulation and the development of parallel algorithms often require interdisciplinary interactions among theoreticians, experimentalists, and computer programmers. There are many books dealing with the design and mapping of algorithms or heuristics onto parallel computers. In this book, we are more concerned about the resources mapping problem than about the design and analysis of parallel algorithms.
Hardware Resources The system architecture of a computer is represented by three nested circles on the right in Fig. 1.1. A modern computer system demonstrates its power through coordinated efforts by hardware resources, an operating system, and application software. Processors, memory, and peripheral devices form the hardware core of a computer system. We will study instruction-set processors, memory organization, multiprocessors, supercomputers, multicomputers, and massively parallel computers.
Special hardware interfaces are often built into I/O devices such as display terminals, workstations, optical page scanners, magnetic ink character recognizers, modems, network adaptors, voice data entry, printers, and plotters. These peripherals are connected to mainframe computers directly or through local- or wide-area networks.
In addition, software interface programs are needed. These software interfaces include file transfer systems, editors, word processors, device drivers, interrupt handlers, network communication programs, etc. These programs greatly facilitate the portability of user programs on different machine architectures.
Operating System An effective operating system manages the allocation and deallocation of resources
during the execution of user programs. Beyond the OS, application software must be developed to benefit the users. Standard benchmark programs are needed for performance evaluation.
Mapping is a bidirectional process matching algorithmic structure with hardware architecture, and vice versa. Efficient mapping will benefit the programmer and produce better source codes. The mapping of algorithmic and data structures onto the machine architecture includes processor scheduling, memory maps, interprocessor communications, etc. These activities are usually architecture-dependent.
Optimal mappings are sought for various computer architectures. The implementation of these mappings relies on efficient compiler and operating system support. Parallelism can be exploited at algorithm design time, at program time, at compile time, and at run time. Techniques for exploiting parallelism at these levels form the core of parallel processing technology.
System Software Support Software support is needed for the development of efficient programs in high-level languages. The source code written in an HLL must first be translated into object code by an optimizing compiler. The compiler assigns variables to registers or to memory words, and generates machine operations corresponding to HLL operators, to produce machine code which can be recognized by the machine hardware. A loader is used to initiate the program execution through the OS kernel.
Resource binding demands the use of the compiler, assembler, loader, and OS kernel to commit physical machine resources to program execution. The effectiveness of this process determines the efficiency of hardware utilization and the programmability of the computer. Today, programming parallelism is still difficult for most programmers due to the fact that existing languages were originally developed for sequential computers. Programmers are sometimes forced to program hardware-dependent features instead of programming parallelism in a generic and portable way. Ideally, we need to develop a parallel programming environment with architecture-independent languages, compilers, and software tools.
To develop a parallel language, we aim for efficiency in its implementation, portability across different
machines, compatibility with existing sequential languages, expressiveness of parallelism, and ease of
programming. One can attempt a new language approach or try to extend existing sequential languages
gradually. A new language approach has the advantage of using explicit high-level constructs for specifying
parallelism. However, new languages are often incompatible with existing languages and require new
compilers or new passes to existing compilers. Most systems choose the language extension approach; one
way to achieve this is by providing appropriate function libraries.
Compiler Support There are three compiler upgrade approaches: preprocessor, precompiler, and parallelizing compiler. A preprocessor uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs. The precompiler approach requires some program flow analysis, dependence checking, and limited optimizations toward parallelism detection. The third approach demands a fully developed parallelizing or vectorizing compiler which can automatically detect parallelism in source code and transform sequential codes into parallel constructs. These approaches will be studied in Chapter 10.
The efficiency of the binding process depends on the effectiveness of the preprocessor, the precompiler, the parallelizing compiler, the loader, and the OS support. Due to unpredictable program behavior, none of the existing compilers can be considered fully automatic or fully intelligent in detecting all types of parallelism. Very often compiler directives are inserted into the source code to help the compiler do a better job. Users may interact with the compiler to restructure the programs. This has been proven useful in enhancing the performance of parallel computers.
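As a present-day illustration of such directives (this example is an addition, not part of the original text), the OpenMP pragma below annotates an ordinary C loop so that a supporting compiler can parallelize it; the loop and array names are hypothetical, and without OpenMP support the directive is simply ignored and the code runs sequentially.

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];
        double sum = 0.0;

        /* Compiler directive: the annotated loop is split among threads and
           the partial sums are combined. Compiled without OpenMP, the pragma
           is ignored -- the directive only "helps the compiler do a better job". */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = b[i] + c[i];
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }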
Fig. 1.2 Tree showing architectural evolution from sequential scalar computers to vector processors and parallel computers. Legends: I/E = Instruction Fetch and Execute; SIMD = Single Instruction stream and Multiple Data streams; MIMD = Multiple Instruction streams and Multiple Data streams.
Lookahead, Parallelism, and Pipelining Lookahead techniques were introduced to prefetch instructions in order to overlap I/E (instruction fetch/decode and execution) operations and to enable functional parallelism. Functional parallelism was supported by two approaches: One is to use multiple functional units simultaneously, and the other is to practice pipelining at various processing levels.
The latter includes pipelined instruction execution, pipelined arithmetic computations, and memory-access operations. Pipelining has proven especially attractive in performing identical operations repeatedly over vector data strings. Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors.
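For instance (an illustrative sketch added here, not taken from the original text), the vector operation Y = A + B is carried out on a scalar processor by software-controlled looping, as in the C routine below; a vector processor would perform the same element-wise work with pipelined vector instructions over the operand arrays.

    #include <stddef.h>

    /* Element-wise vector addition performed by a scalar loop.
       On a pipelined vector machine the same operation would be issued as one
       (or a few) vector instructions operating on vector registers or memory. */
    void vector_add(const double *a, const double *b, double *y, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a[i] + b[i];
    }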
Flynn's Classification Michael Flynn (1972) introduced a classification of various computer architectures based on notions of instruction and data streams. As illustrated in Fig. 1.3a, conventional sequential machines are called SISD (single instruction stream over a single data stream) computers. Vector computers are equipped with scalar and vector hardware or appear as SIMD (single instruction stream over multiple data streams) machines (Fig. 1.3b). Parallel computers are reserved for MIMD (multiple instruction streams over multiple data streams) machines.
An MISD (multiple instruction streams and a single data stream) machine is modeled in Fig. 1.3d. The same data stream flows through a linear array of processors executing different instruction streams. This
architecture is also known as systolic arrays (Kung and Leiserson, 1978) for pipelined execution of specific algorithms.
Fig. 1.3 Flynn's classification of computer architectures (derived from Michael Flynn, 1972): (a) SISD uniprocessor architecture; (b) SIMD architecture (with distributed memory); (c) MIMD architecture (with shared memory); (d) MISD architecture. Legends: CU = Control Unit; PU = Processing Unit; MU = Memory Unit; PE = Processing Element; LM = Local Memory; IS = Instruction Stream; DS = Data Stream.
Of the four machine models, most parallel computers built in the past assumed the MIMD model for general-purpose computations. The SIMD and MISD models are more suitable for special-purpose computations. For this reason, MIMD is the most popular model, SIMD next, and MISD the least popular model being applied in commercial machines.
Parallel/Vector Computers Intrinsic parallel computers are those that execute programs in MIMD mode. There are two major classes of parallel computers, namely, shared-memory multiprocessors and message-passing multicomputers. The major distinction between multiprocessors and multicomputers lies in memory sharing and the mechanisms used for interprocessor communication.
The processors in a multiprocessor system communicate with each other through shared variables in a common memory. Each computer node in a multicomputer system has a local memory, unshared with other nodes. Interprocessor communication is done through message passing among the nodes.
Explicit vector instructions were introduced with the appearance of vector processors. A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control. There are two families of pipelined vector processors:
Memory-to-memory architecture supports the pipelined flow of vector operands directly from the memory to pipelines and then back to the memory. Register-to-register architecture uses vector registers to interface between the memory and functional pipelines. Vector processor architectures will be studied in Chapter 8.
Another important branch of the architecture tree consists of the SIMD computers for synchronized vector processing. An SIMD computer exploits spatial parallelism rather than temporal parallelism as in a pipelined computer. SIMD computing is achieved through the use of an array of processing elements (PEs) synchronized by the same controller. Associative memory can be used to build SIMD associative processors. SIMD machines will be treated in Chapter 8 along with pipelined vector computers.
Development Layers A layered development of parallel computers is illustrated in Fig. 1.4, based on a classification by Lionel Ni (1990). Hardware configurations differ from machine to machine, even those of the same model. The address space of a processor in a computer system varies among different architectures. It depends on the memory organization, which is machine-dependent. These features are up to the designer and should match the target application domains.
Fig. 1.4 Six layers for computer system development (courtesy of Lionel Ni, 1990): applications, programming environment, languages supported, communication model, addressing space, and hardware architecture, spanning from machine-independent (top) to machine-dependent (bottom) layers
On the other hand, we want to develop application programs and programming environments which are machine-independent. Independent of machine architecture, the user programs can be ported to many computers with minimum conversion costs. High-level languages and communication models depend on the architectural choices made in a computer system. From a programmer's viewpoint, these two layers should be architecture-transparent.
Programming languages such as Fortran, C, C++, Pascal, Ada, Lisp and others can be supported by most computers. However, the communication models, shared variables versus message passing, are mostly machine-dependent. The Linda approach using tuple spaces offers an architecture-transparent communication model for parallel computers. These language features will be studied in Chapter 10.
Application programmers prefer more architectural transparency. However, kernel programmers have to explore the opportunities supported by hardware. As a good computer architect, one has to approach the problem from both ends. The compilers and OS support should be designed to remove as many architectural constraints as possible from the programmer.
New Challenges The technology of parallel processing is the outgrowth of several decades of research and industrial advances in microelectronics, printed circuits, high-density packaging, advanced processors, memory systems, peripheral devices, communication channels, language evolution, compiler sophistication, operating systems, programming environments, and application challenges.
The rapid progress made in hardware technology has significantly increased the economical feasibility of building a new generation of computers adopting parallel processing. However, the major barrier preventing parallel processing from entering the production mainstream is on the software and application side.
To date, it is still fairly difficult to program parallel and vector computers. We need to strive for major progress in the software area in order to create a user-friendly environment for high-power computers.
A whole new generation of programmers needs to be trained to program parallelism effectively. High-performance computers provide fast and accurate solutions to scientific, engineering, business, social, and defense problems.
Representative real-life problems include weather forecast modeling, modeling of physical, chemical and biological processes, computer-aided design, large-scale database management, artificial intelligence, crime control, and strategic defense initiatives, just to name a few. The application domains of parallel processing computers are expanding steadily. With a good understanding of scalable computer architectures and mastery of parallel programming techniques, the reader will be better prepared to face future computing challenges.
Clock Rate and CPI The CPU (or simply the processor) of today's digital computer is driven by a clock with a constant cycle time τ. The inverse of the cycle time is the clock rate (f = 1/τ). The size of a program is determined by its instruction count (Ic), in terms of the number of machine instructions to be executed in the program. Different machine instructions may require different numbers of clock cycles to execute. Therefore, the cycles per instruction (CPI) becomes an important parameter for measuring the time needed to execute each instruction.
For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program. An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time. Unless specifically focusing on a single instruction type, we simply use the term CPI to mean the average value with respect to a given instruction set and a given program mix.
Performance Factors Let Ic be the number of instructions in a given program, or the instruction count. The CPU time (T in seconds/program) needed to execute the program is estimated by finding the product of three contributing factors:
T = Ic × CPI × τ     (1.1)
The execution of an instruction requires going through a cycle of events involving the instruction fetch, decode, operand(s) fetch, execution, and store results. In this cycle, only the instruction decode and execution phases are carried out in the CPU. The remaining three operations may require access to the memory. We define a memory cycle as the time needed to complete one memory reference. Usually, a memory cycle is k times the processor cycle τ. The value of k depends on the speed of the cache and memory technology and the processor-memory interconnection scheme used.
The CPI of an instruction type can be divided into two component terms corresponding to the total processor cycles and memory cycles needed to complete the execution of the instruction. Depending on the instruction type, the complete instruction cycle may involve one to as many as four memory references (one for instruction fetch, two for operand fetch, and one for store results). Therefore we can rewrite Eq. 1.1 as follows:
T = Ic × (p + m × k) × τ     (1.2)
where p is the number of processor cycles needed for the instruction decode and execution, m is the number of memory references needed, k is the ratio between memory cycle and processor cycle, Ic is the instruction count, and τ is the processor cycle time. Equation 1.2 can be further refined once the CPI components (p, m, k) are weighted over the entire instruction set.
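As a quick numerical sketch (added for illustration; the parameter values below are hypothetical, not taken from the text), Eq. 1.2 can be evaluated directly in a few lines of C:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical performance factors, chosen only for illustration. */
        double Ic  = 2.0e8;   /* instruction count                          */
        double p   = 4.0;     /* processor cycles per instruction           */
        double m   = 1.2;     /* memory references per instruction          */
        double k   = 3.0;     /* memory cycle / processor cycle ratio       */
        double tau = 2.0e-9;  /* processor cycle time in seconds (500 MHz)  */

        double cpi = p + m * k;        /* effective cycles per instruction  */
        double T   = Ic * cpi * tau;   /* Eq. 1.2: T = Ic x (p + m*k) x tau */

        printf("CPI = %.2f, CPU time = %.3f s\n", cpi, T);
        return 0;
    }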
System Attributes The above five performance factors (Ic, p, m, k, τ) are influenced by four system attributes: instruction-set architecture, compiler technology, CPU implementation and control, and cache and memory hierarchy, as specified in Table 1.2.
The instruction-set architecture affects the program length (Ic) and processor cycles needed (p). The compiler technology affects the values of Ic, p, and the memory reference count (m). The CPU implementation and control determine the total processor time (p · τ) needed. Finally, the memory technology and hierarchy design affect the memory access latency (k · τ). The above CPU time can be used as a basis in estimating the execution rate of a processor.
Table 1.2 Performance Factors versus System Attributes

Performance factors: instruction count (Ic); average cycles per instruction (CPI), split into processor cycles per instruction (p), memory references per instruction (m), and memory-access latency (k); and processor cycle time (τ).

System Attributes                       Ic    p     m     k     τ
Instruction-set architecture            ✓     ✓
Compiler technology                     ✓     ✓     ✓
Processor implementation and control          ✓                 ✓
Cache and memory hierarchy                                ✓     ✓
MIPS Rate Let C be the total number of clock cycles needed to execute a given program. Then the CPU time in Eq. 1.2 can be estimated as T = C × τ = C/f. Furthermore, CPI = C/Ic and T = Ic × CPI × τ = Ic × CPI/f. The processor speed is often measured in terms of million instructions per second (MIPS). We simply call it the MIPS rate of a given processor. It should be emphasized that the MIPS rate varies with respect to a number of factors, including the clock rate (f), the instruction count (Ic), and the CPI of a given machine, as defined below:
MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)     (1.3)
Based on Eq. 1.3, the CPU time in Eq. 1.2 can also be written as T = Ic × 10^-6 / MIPS. Based on the system attributes identified in Table 1.2 and the above derived expressions, we conclude by indicating the fact that the MIPS rate of a given computer is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes, instruction set, compiler, processor, and memory technologies, affect the MIPS rate, which varies also from program to program because of variations in the instruction mix.
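Continuing the same hypothetical numbers used above (again an illustrative sketch, not from the original text), Eq. 1.3 and the relation T = Ic × 10^-6 / MIPS can be checked as follows:

    #include <stdio.h>

    int main(void)
    {
        double f   = 500.0e6;  /* clock rate in Hz (hypothetical)            */
        double cpi = 7.6;      /* average cycles per instruction (= p + m*k) */
        double Ic  = 2.0e8;    /* instruction count                          */

        double mips = f / (cpi * 1.0e6);   /* Eq. 1.3: MIPS = f / (CPI x 10^6) */
        double T    = Ic * 1.0e-6 / mips;  /* CPU time recovered from the MIPS rate */

        printf("MIPS rate = %.1f, CPU time = %.3f s\n", mips, T);
        return 0;
    }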
Floating Point Operations per Second Most compute-intensive applications in science and engineering make heavy use of floating point operations. Compared to instructions per second, for such applications a more relevant measure of performance is floating point operations per second, which is abbreviated as flops. With prefix mega (10^6), giga (10^9), tera (10^12) or peta (10^15), this is written as megaflops (mflops), gigaflops (gflops), teraflops or petaflops.
Throughput Rate Another important concept is related to how many programs a system can execute per unit time, called the system throughput Ws (in programs/second). In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp, defined by:
Wp = f / (Ic × CPI)     (1.4)
Note that Wp = (MIPS) × 10^6 / Ic, from Eq. 1.3. The unit for Wp is also programs/second. The CPU throughput is a measure of how many programs can be executed per second, based only on the MIPS rate and average program length (Ic). Usually Ws < Wp due to the additional system overheads caused by the I/O, compiler, and OS when multiple programs are interleaved for CPU execution by multiprogramming or time-sharing operations. If the CPU is kept busy in a perfect program-interleaving fashion, then Ws = Wp. This will probably never happen, since the system overhead often causes an extra delay and the CPU may be left idle for some cycles.
These data indicate that the measured CPU time on S1 is 12 times longer than that measured on S2. The object codes running on the two machines have different lengths due to the differences in the machines and compilers used. All other overhead times are ignored.
Based on Eq. 1.3, we can see that the instruction count of the object code running on S2 must be 1.5 times longer than that of the code running on S1. Furthermore, the average CPI on S1 is seen to be 5, while that on S2 is 1.39 executing the same benchmark program.
S1 has a typical CISC (complex instruction set computing) architecture, while S2 has a typical RISC (reduced instruction set computing) architecture, to be characterized in Chapter 4. This example offers a simple comparison between the two types of computers based on a single program run. When a different program is run, the conclusion may not be the same.
We cannot calculate the CPU throughput Wp unless we know the program length and the average CPI of each code. The system throughput Ws should be measured across a large number of programs over a long observation period. The message being conveyed is that one should not draw a sweeping conclusion about the performance of a machine based on one or a few program runs.
When using a parallel computer, one desires a parallel environment where parallelism is automatically exploited. Language extensions or new constructs must be developed to specify parallelism or to facilitate easy detection of parallelism at various granularity levels by more intelligent compilers.
Besides parallel languages and compilers, the operating systems must also be extended to support parallel processing. The OS must be able to manage the resources behind parallelism. Important issues include parallel scheduling of concurrent processes, inter-process communication and synchronization, shared memory allocation, and shared peripheral and communication links.
Implicit Parallelism An implicit approach uses a conventional language, such as C, C++, Fortran, or Pascal, to write the source program. The sequentially coded source program is translated into parallel object code by a parallelizing compiler. As illustrated in Fig. 1.5a, this compiler must be able to detect parallelism and assign target machine resources. This compiler approach has been applied in programming shared-memory multiprocessors.
With parallelism being implicit, success relies heavily on the "intelligence" of a parallelizing compiler. This approach requires less effort on the part of the programmer.
Explicit Parallelism The second approach (Fig. 1.5b) requires more effort by the programmer to develop a source program using parallel dialects of C, C++, Fortran, or Pascal. Parallelism is explicitly specified in the user programs. This reduces the burden on the compiler to detect parallelism. Instead, the compiler needs to preserve parallelism and, where possible, assign target machine resources. The new programming language Chapel (see Chapter 13) is in this category.
Fig. 1.5 Two approaches to parallel programming: (a) implicit parallelism, producing parallel object code; (b) explicit parallelism, producing concurrent object code (courtesy of Charles Seitz; adapted with permission from "Concurrent Architectures", p. 51 and p. 53, VLSI and Parallel Computation, edited by Suaya and Birtwistle, Morgan Kaufmann Publishers, 1990)
Special software tools are needed to make an environment more friendly to user groups. Some of the tools are parallel extensions of conventional high-level languages. Others are integrated environments which include tools providing different levels of program abstraction, validation, testing, debugging, and tuning; performance prediction and monitoring; and visualization support to aid program development, performance measurement, and graphics display and animation of computational results.
The UMA Model In a UMA multiprocessor model (Fig. 1.6), the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. Peripherals are also shared in some fashion.
Fig. 1.6 The UMA multiprocessor model (processors connected to shared-memory modules and shared peripherals through a system interconnect: bus, crossbar, or multistage network)
Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network to be studied in Chapter 7.
Some computer manufacturers have multiprocessor (MP) extensions of their uniprocessor (UP) product line. The UMA model is suitable for general-purpose and time-sharing applications by multiple users. It can be used to speed up the execution of a single large program in time-critical applications. To coordinate parallel events, synchronization and communication among processors are done using shared variables in the common memory.
When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In this case, all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines.
In an asymmetric multiprocessor, only one or a subset of processors are executive-capable. An executive or a master processor can execute the operating system and handle I/O. The remaining processors have no I/O capability and thus are called attached processors (APs). Attached processors execute user codes under the supervision of the master processor. In both MP and AP configurations, memory sharing among master and attached processors is still in place.
Example 1.2 Approximated performance of a multiprocessor
This example exposes the reader to parallel program execution on a shared-memory multiprocessor system. Consider the following Fortran program written for sequential execution on a uniprocessor system. All the arrays, A(I), B(I), and C(I), are assumed to have N elements.
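The sequential program itself is not reproduced in this excerpt; the listing below is a reconstruction consistent with the line labels L1–L7 and the analysis that follows (an I loop adding two arrays, followed by a J loop accumulating the sum):

    L1:      Do 10 I = 1, N
    L2:         A(I) = B(I) + C(I)
    L3:   10 Continue
    L4:      SUM = 0
    L5:      Do 20 J = 1, N
    L6:         SUM = SUM + A(J)
    L7:   20 Continue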
Suppose each line of code L2, L4, and L6 takes 1 machine cycle to execute. The time required to execute the program control statements, L1, L3, L5, and L7, is ignored to simplify the analysis. Assume that k cycles are needed for each interprocessor communication operation via the shared memory.
Initially, all arrays are assumed already loaded in the main memory and the short program fragment already loaded in the instruction cache. In other words, instruction fetch and data loading overhead is ignored. Also, we ignore bus contention and memory access conflicts. In this way, we can concentrate on the analysis of CPU demand.
The above program can be executed on a sequential machine in 2N cycles under the above assumptions. N cycles are needed to execute the N independent iterations in the I loop. Similarly, N cycles are needed for the J loop, which contains N recursive iterations.
To execute the program on an M-processor system, we partition the looping operations into M sections with L = N/M elements per section. In the following parallel code, Doall declares that all M sections be executed by M processors in parallel.
For M-way parallel execution, the sectioned I loop can be done in L cycles.
The sectioned J loop produces M partial sums in L cycles. Thus 2L cycles are consumed to produce all M partial sums. Still, we need to merge these M partial sums to produce the final sum of N elements.
     Doall K = 1, M
        Do 10 I = (K - 1) * L + 1, K * L
          A(I) = B(I) + C(I)
10      Continue
        SUM(K) = 0
        Do 20 J = 1, L
          SUM(K) = SUM(K) + A((K - 1) * L + J)
20      Continue
     Endall
The addition of each pair of partial sums requires k cycles through the shared memory. An l-level binary adder tree can be constructed to merge all the partial sums, where l = log2 M. The adder tree takes l(k + 1) cycles to merge the M partial sums sequentially from the leaves to the root of the tree. Therefore, the multiprocessor requires 2L + l(k + 1) = 2N/M + (k + 1)log2 M cycles to produce the final sum.
Suppose N = 2^20 elements in the array. Sequential execution of the original program takes 2N = 2^21 machine cycles. Assume that each IPC synchronization overhead has an average value of k = 200 cycles. Parallel execution on M = 256 processors requires 2^13 + 1608 = 9800 machine cycles.
Comparing the above timing results, the multiprocessor shows a speedup factor of 214 out of the maximum value of 256. Therefore, an efficiency of 214/256 = 83.6% has been achieved. We will study the speedup and efficiency issues in Chapter 3.
The above result was obtained under favorable assumptions about overhead. In reality, the resulting speedup might be lower after considering all software overhead and potential resource conflicts. Nevertheless, the example shows the promising side of parallel processing if the interprocessor communication overhead can be maintained at a sufficiently low level, reflected here in the value of k.
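The timing model of this example is easy to evaluate mechanically; the short C sketch below (an addition for illustration, not part of the original text) plugs N, M, and k into the derived formulas and reproduces the speedup and efficiency figures quoted above.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double N = 1048576.0;  /* 2^20 array elements                        */
        double M = 256.0;      /* number of processors                       */
        double k = 200.0;      /* IPC synchronization overhead, in cycles    */

        double seq = 2.0 * N;                            /* sequential cycles */
        double par = 2.0 * N / M + (k + 1.0) * log2(M);  /* parallel cycles   */
        double speedup = seq / par;

        printf("sequential = %.0f, parallel = %.0f cycles\n", seq, par);
        printf("speedup = %.1f, efficiency = %.1f%%\n", speedup, 100.0 * speedup / M);
        return 0;
    }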
The NUMA Model A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word. Two NUMA machine models are depicted in Fig. 1.7. The shared memory is physically distributed to all processors, called local memories. The collection of all local memories forms a global address space accessible by all processors.
It is faster to access a local memory with a local processor. The access of remote memory attached to other processors takes longer due to the added delay through the interconnection network. The BBN TC-2000 Butterfly multiprocessor had the configuration shown in Fig. 1.7a.
Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case, there are three memory-access patterns: The fastest is local memory access. The next is global memory access. The slowest is access of remote memory, as illustrated in Fig. 1.7b. As a matter of fact, the models shown in Figs. 1.6 and 1.7 can be easily modified to allow a mixture of shared memory and private memory with prespecified access rights.
A hierarchically structured multiprocessor is modeled in Fig. 1.7b. The processors are divided into several clusters*. Each cluster is itself a UMA or a NUMA multiprocessor. The clusters are connected to global shared-memory modules. The entire system is considered a NUMA multiprocessor. All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules.
*The word 'cluster' is used in a different sense in cluster computing, as we shall see later.
All clusters have equal access to the global memory. However, the access time to the cluster memory is shorter than that to the global memory. One can specify the access rights among intercluster memories in various ways. The Cedar multiprocessor, built at the University of Illinois, had such a structure in which each cluster was an Alliant FX/80 multiprocessor.
Fig. 1.7 Two NUMA models for multiprocessor systems: (a) shared local memories (e.g. the BBN Butterfly); (b) a hierarchical cluster model (e.g. the Cedar system at the University of Illinois)
The COMA Model A multiprocessor using cache-only memory assumes the COMA model. Early examples of COMA machines include the Swedish Institute of Computer Science's Data Diffusion Machine (DDM, Hagersten et al., 1990) and Kendall Square Research's KSR-1 machine (Burkhardt et al., 1992). The COMA model is depicted in Fig. 1.8. Details of KSR-1 are given in Chapter 9.
Fig. 1.8 The COMA model of a multiprocessor (P: Processor, C: Cache, D: Directory; e.g. the KSR-1). Processor nodes, each with a processor, cache, and directory, are connected by an interconnection network.
The COMA model is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node. All the caches form a global
address space. Remote cache access is assisted by the distributed cache directories (D in Fig. 1.8). Depending on the interconnection network used, sometimes hierarchical directories may be used to help locate copies of cache blocks. Initial data placement is not critical because data will eventually migrate to where it will be used.
Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent non-uniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories. Early examples of the CC-NUMA model include the Stanford Dash (Lenoski et al., 1990) and the MIT Alewife (Agarwal et al., 1990), to be studied in Chapter 9. A cache-coherent COMA machine is one in which all cache copies must be kept consistent.
The S-81 was a transaction processing multiprocessor consisting of 30 386/i486 microprocessors tied to a common backplane bus. The IBM ES/9000 models were the latest IBM mainframes having up to 6 processors with attached vector facilities. The TC-2000 could be configured to have 512 M88100 processors interconnected by a multistage Butterfly network. This was designed as a NUMA machine for real-time or time-critical applications.
Multiprocessor systems are suitable for general-purpose multiuser applications where programmability is the major concern. A major shortcoming of multiprocessors is the lack of scalability. It is rather difficult to build MPP machines using a centralized shared-memory model. Latency tolerance for remote memory access is also a major limitation.
Packaging and cooling impose additional constraints on scalability. We will study scalability and programmability in subsequent chapters.
Fig. 1.9 Generic model of a message-passing multicomputer (nodes, each with a processor P and local memory M, connected by a message-passing interconnection network: mesh, ring, torus, hypercube, cube-connected cycle, etc.)
The message-passing network provides point-to-point static connections among the nodes. All local memories are private and are accessible only by local processors. For this reason, traditional multicomputers have also been called no-remote-memory-access (NORMA) machines. Internode communication is carried out by passing messages through the static connection network. With advances in interconnection and network technologies, this model of computing has gained importance, because of its suitability for certain applications, scalability, and fault tolerance.
Multicomputer Generations Modern multicomputers use hardware routers to pass messages. A computer node is attached to each router. The boundary router may be connected to I/O and peripheral devices. Message passing between any two nodes involves a sequence of routers and channels. Mixed types of nodes are allowed in a heterogeneous multicomputer. The internode communication in a heterogeneous multicomputer is achieved through compatible data representations and message-passing protocols.
Early message-passing multicomputers were based on processor board technology using hypercube architecture and software-controlled message switching. The Caltech Cosmic and Intel iPSC/1 represented this early development.
The second generation was implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grain distributed computing, as represented by the Intel Paragon and the Parsys SuperNode 1000.
Subsequent systems of this type are fine-grain multicomputers, early examples being the MIT J-Machine and Caltech Mosaic, implemented with both processor and communication gears on the same VLSI chip. For further discussion, see Chapter 13.
Commonly used topologies include the rirrg, tree. rrrcsh. torus. in-‘perenbe. enbe-mrrrrer:rede\'eie, etc. Various
communication patterns are demanded among the nodes, such as one-to-one, broadcasting, permutations, and
multicast pattern s.
Important issues for multicomputers include message-routing schemes, network flow control strategies,
deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques.
The Paragon system had a mesh architecture, and the nCUBE/2 had a hypercube architecture. The Intel i860s and some custom-designed VLSI processors were used as building blocks in these machines. All three OSs were UNIX-compatible with extended functions to support message passing.
Most multicomputers can be upgraded to yield a higher degree of parallelism with enhanced processors. We will study various massively parallel systems in Part III where the tradeoffs between scalability and programmability are analyzed.
Fig. 1.10 Bell's taxonomy of MIMD computers (courtesy of Gordon Bell; reprinted with permission from the Communications of the ACM, August 1991)
Multicomputers use distributed memories with multiple address spaces. They are scalable with distributed memory. The evolution of fast LAN (local area network)-connected workstations has created "commodity supercomputing". Bell was the first to advocate high-speed workstation clusters interconnected by high-speed switches in lieu of special-purpose multicomputers. The CM-5 development was an early move in that direction.
The scalability of MIMD computers will be further studied in Section 3.4 and Chapter 9. In Part III, we will study distributed-memory multiprocessors (KSR-1, SCI, etc.); central-memory multiprocessors (Cray, IBM, DEC, Fujitsu, Encore, etc.); multicomputers by Intel, TMC, and nCUBE; fast LAN-based workstation clusters; and other exploratory research systems.
Fig. 1.11 The architecture of a vector supercomputer: a scalar processor with scalar functional pipelines handles scalar instructions, while vector instructions are passed to a vector processor containing a vector control unit, vector registers, and vector functional pipelines; main memory (holding program and data), mass storage, a host computer, and I/O (user) complete the system.
If the instruction is decoded as a vector operation, it will be sent to the vector control unit. This control unit will supervise the flow of vector data between the main memory and vector functional pipelines. The vector data flow is coordinated by the control unit. A number of vector functional pipelines may be built into a vector processor. Two pipeline vector supercomputer models are described below.
Vector Processor Models Figure 1.11 shows a register-to-register architecture. Vector registers are used to hold the vector operands, intermediate and final vector results. The vector functional pipelines retrieve operands from and put results into the vector registers. All vector registers are programmable in user instructions. Each vector register is equipped with a component counter which keeps track of the component registers used in successive pipeline cycles.
The length of each vector register is usually fixed, say, sixty-four 64-bit component registers in a vector register in a Cray Series supercomputer. Other machines, like the Fujitsu VP2000 Series, use reconfigurable vector registers to dynamically match the register length with that of the vector operands.
In general, there are fixed numbers of vector registers and functional pipelines in a vector processor. Therefore, both resources must be reserved in advance to avoid resource conflicts between different vector operations. Some early vector-register based supercomputers are summarized in Table 1.5.
Representative Supercomputers Over a dozen pipelined vector computers have been manufactured, ranging from workstations to mini- and supercomputers. Notable early examples include the Stardent 3000 multiprocessor equipped with vector pipelines, the Convex C3 Series, the DEC VAX 9000, the IBM 390/VF, the Cray Research Y-MP family, the NEC SX Series, the Fujitsu VP2000, and the Hitachi S-810/20. For further discussion, see Chapters 8 and 13.
The Convex C1 and C2 Series were made with ECL/CMOS technologies. The latter C3 Series was based on GaAs technology.
The DEC VAX 9000 was Digital's largest mainframe system providing concurrent scalar/vector and multiprocessing capabilities. The VAX 9000 processors used a hybrid architecture. The vector unit was an optional feature attached to the VAX 9000 CPU. The Cray Y-MP family offered both vector and multiprocessing capabilities.
An operational model of an SIMD computer is specified by a 5-tuple M = (N, C, I, M, R),
where
(1) N is the number of processing elements (PEs) in the machine. For example, the Illiac IV had 64 PEs and the Connection Machine CM-2 had 65,536 PEs.
(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.
(4) M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
(5) R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
One can describe a particular SIMD machine architecture by specifying the 5-tuple. An example SIMD machine is partially specified below.
Example 1.3 Operational specification of the MasPar MP-1 computer
We will study the detailed architecture of the MasPar MP-1 in Chapter 7. Listed below is a partial specification of the 5-tuple for this machine:
(1) The MP-1 was an SIMD machine with N = 1024 to 16,384 PEs, depending on which configuration is considered.
(2) The CU executed scalar instructions, broadcast decoded vector instructions to the PE array, and controlled inter-PE communications.
(3) Each PE was a register-based load/store RISC processor capable of executing integer operations over various data sizes and standard floating-point operations. The PEs received instructions from the CU.
(4) The masking scheme was built within each PE and continuously monitored by the CU, which could set and reset the status of each PE dynamically at run time.
(5) The MP-1 had an X-Net mesh network plus a global multistage crossbar router for inter-CU-PE, X-Net nearest 8-neighbor, and global router communications.
Representative SIMD Computers Three early commercial SIMD supercomputers are summarized in Table 1.6. The number of PEs in these systems ranges from 4096 in the DAP610 to 16,384 in the MasPar MP-1 and 65,536 in the CM-2. Both the CM-2 and DAP610 were fine-grain, bit-slice SIMD computers with attached floating-point accelerators for blocks of PEs*.
Each PE of the MP-1 was equipped with a 1-bit logic unit, 4-bit integer ALU, 64-bit mantissa unit, and 16-bit exponent unit. Multiple PEs could be built on a single chip due to the simplicity of each PE. The MP-1
*With rapid advances in VLSI technology, the use of bit-slice processors in systems has disappeared.
implemented 32 PEs per chip with forty 32-bit registers per PE. The 32 PEs were interconnected by an X-Net mesh, which was a 4-neighbor mesh augmented with diagonal dual-stage links.
The CM-2 implemented 16 PEs as a mesh on a single chip. Each 16-PE mesh chip was placed at one vertex of a 12-dimensional hypercube. Thus 16 × 2^12 = 2^16 = 65,536 PEs formed the entire SIMD array.
The DAP610 implemented 64 PEs as a mesh on a chip. Globally, a large mesh (64 × 64) was formed by interconnecting these small meshes on chips. Fortran 90 and modified versions of C, Lisp, and other sequential programming languages have been developed to program SIMD machines.
Table 1.6 Representative SIMD supercomputers

MasPar Computer Corporation, MP-1 Family
  Machine and architecture: Designed for configurations from 1024 to 16,384 processors, with 26,000 MIPS or 1.3 Gflops. Each PE was a RISC processor with 16 Kbytes of local memory. An X-Net mesh plus a multistage crossbar interconnect.
  Languages and software: Fortran 77, MasPar Fortran (MPF), and the MasPar Parallel Application Language; UNIX/OS with X-window, symbolic debugger, visualizers and animators.

Thinking Machines Corporation, CM-2
  Machine and architecture: A bit-slice array of up to 65,536 PEs arranged as a 10-dimensional hypercube with a 4 × 4 mesh on each vertex, up to 1M bits of memory per PE, with an optional FPU shared between blocks of 32 PEs. 28 Gflops peak and 5.6 Gflops sustained.
  Languages and software: Driven by a host VAX, Sun, or Symbolics 3600; Lisp compiler, Fortran 90, C, and *Lisp supported by PARIS.

Active Memory Technology, DAP600 Family
  Machine and architecture: A fine-grain, bit-slice SIMD array of up to 4096 PEs interconnected by a square mesh, with 1K bits per PE, orthogonal and 4-neighbor links, 20 GIPS and 560 Mflops peak performance.
  Languages and software: Provided by host VAX/VMS or UNIX; Fortran-plus or APAL on the DAP, Fortran 77 or C on the host.
also useful in scalability and programmability analysis, when real machines are compared with an idealized parallel machine without worrying about communication overhead among processing nodes.
NP-Completeness An algorithm has a polynomial complexity if there exists a polynomial p(s) such that the time complexity is O(p(s)) for problem size s. The set of problems having polynomial-complexity algorithms is called the P-class (for polynomial class). The set of problems solvable by nondeterministic algorithms in polynomial time is called the NP-class (for nondeterministic polynomial class).
Since deterministic algorithms are special cases of the nondeterministic ones, we know that P ⊂ NP. The P-class problems are computationally tractable, while the NP − P-class problems are intractable. But we do not know whether P = NP or P ≠ NP. This is still an open problem in computer science.
To simulate a nondeterministic algorithm with a deterministic algorithm may require exponential time. Therefore, intractable NP-class problems are also said to have exponential-time complexity.
Example 1.4 Polynomial- and exponential-complexity algorithms
Polynomial-complexity algorithms are known for sorting n numbers in O(n log n) time and for multiplication of two n × n matrices in O(n^3) time. Therefore, both problems belong to the P-class.
Nonpolynomial algorithms have been developed for the traveling salesperson problem with complexity O(n^2 2^n) and for the knapsack problem with complexity O(2^(n/2)). These complexities are exponential, greater than the polynomial complexities. So far, deterministic polynomial algorithms have not been found for these problems. Therefore, these exponential-complexity problems belong to the NP-class.
Most computer scientists believe that P ≠ NP. This leads to the conjecture that there exists a subclass, called NP-complete (NPC) problems, such that NPC ⊂ NP but NPC ∩ P = ∅ (Fig. 1.13). In fact, it has been proved that if any NP-complete problem is polynomial-time solvable, then one can conclude P = NP. Thus NP-complete problems are considered the hardest ones to solve. Only approximation algorithms can be derived for solving the NP-complete problems in polynomial time.
Fig. 1.13 The relationships conjectured among the NP, P, and NPC classes of computational problems
PRAM Models Conventional uniprocessor computers have been modeled as random access machines (RAM) by Sheperdson and Sturgis (1963). A parallel random-access machine (PRAM) model has been developed by Fortune and Wyllie (1978) for modeling idealized parallel computers with zero synchronization or memory access overhead. This PRAM model will be used for parallel algorithm development and for scalability and complexity analysis.
An n-processor PRAM (Fig. 1.14) has a globally addressable memory. The shared memory can be distributed among the processors or centralized in one place. The n processors—also called processing elements (PEs)—operate on a synchronized read-memory, compute, and write-memory cycle. With shared memory, the model must specify how concurrent read and concurrent write of memory are handled. Four memory-update options are possible:
Fig. 1.14  PRAM model of a multiprocessor system with shared memory, on which all n processors operate in lockstep in memory access and program execution operations. Each processor can access any memory location in unit time
• Exclusive read (ER)—This allows at most one processor to read from any memory location in each cycle, a rather restrictive policy.
• Exclusive write (EW)—This allows at most one processor to write into a memory location at a time.
• Concurrent read (CR)—This allows multiple processors to read the same information from the same memory cell in the same cycle.
• Concurrent write (CW)—This allows simultaneous writes to the same memory location. In order to avoid confusion, some policy must be set up to resolve the write conflicts.
Various combinations of the above options lead to several variants of the PRAM model as specified below.
Since CR does not create a conflict problem, variants differ mainly in how they handle the CW conflicts.
PRAM Variants  Described below are four variants of the PRAM model, depending on how the memory reads and writes are handled.
(1) EREW-PRAM model—This model forbids more than one processor from reading or writing the same memory cell simultaneously (Snir, 1982; Karp and Ramachandran, 1988). This is the most restrictive PRAM model proposed.
(2) CREW-PRAM model—The write conflicts are avoided by mutual exclusion. Concurrent reads to the same memory location are allowed.
(3) ERCW-PRAM model—This allows exclusive read or concurrent writes to the same memory location.
(4) CRCW-PRAM model—This model allows either concurrent reads or concurrent writes to the same memory location.
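As a concrete illustration of how these variants differ, the Python sketch below (ours, not the text's; the class name and the lowest-index-wins policy are illustrative choices) gathers the reads and writes issued in one synchronous cycle and then enforces the access rule of the chosen variant, resolving CRCW write conflicts by letting the lowest-numbered processor win.

# A minimal sketch of the memory-access rules enforced by the PRAM variants
# in each synchronous cycle.
from collections import defaultdict

class PRAMStep:
    """Collect the reads and writes issued by all processors in one cycle,
    then check them against a chosen PRAM variant."""
    def __init__(self, variant):
        assert variant in ("EREW", "CREW", "CRCW")
        self.variant = variant
        self.reads = defaultdict(list)   # address -> list of processor ids
        self.writes = defaultdict(list)  # address -> list of (pid, value)

    def read(self, pid, addr):
        self.reads[addr].append(pid)

    def write(self, pid, addr, value):
        self.writes[addr].append((pid, value))

    def commit(self, memory):
        if self.variant == "EREW":
            assert all(len(p) <= 1 for p in self.reads.values()), "concurrent read forbidden"
        if self.variant in ("EREW", "CREW"):
            assert all(len(w) <= 1 for w in self.writes.values()), "concurrent write forbidden"
        for addr, writers in self.writes.items():
            # CRCW needs a conflict-resolution policy; here the lowest-numbered
            # processor wins (one common textbook choice among several).
            pid, value = min(writers)
            memory[addr] = value

# Usage: two processors write the same cell in one cycle under CRCW.
mem = {0: 0}
step = PRAMStep("CRCW")
step.write(3, 0, "from P3")
step.write(1, 0, "from P1")
step.commit(mem)
print(mem[0])        # "from P1" under the lowest-index-wins policy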
In Example 1.5, two n × n matrices are multiplied on a CRCW-PRAM in O(log n) time: Step 1 computes all n³ product terms C(i, j, k) = A(i, k) × B(k, j) in one step, and the following reduction step sums them along the k dimension in log n iterations:

Step 2
1.  ℓ ← n
2.  Repeat
        ℓ ← ℓ/2
        if (k < ℓ) then
            begin
                Read C(i, j, k)
                Read C(i, j, k + ℓ)
                Compute C(i, j, k) + C(i, j, k + ℓ)
                Store in C(i, j, k)
            end
    until (ℓ = 1)
To reduce the number of PEs to n³/log n, use a PE array of size n × n × n/log n. Each PE is responsible for computing log n product terms and summing them up. Step 1 can be easily modified to produce n/log n partial sums, each consisting of log n multiplications and (log n − 1) additions. Now we have an array C(i, j, k), 0 ≤ i, j ≤ n − 1, 0 ≤ k ≤ n/log n − 1, which can be summed up in log(n/log n) time. Combining the time spent in step 1 and step 2, we have a total execution time 2 log n − 1 + log(n/log n) = O(log n) for large n.
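The algorithm can be checked by simulating its synchronous steps serially. The Python sketch below (ours, not the text's) forms all n³ products in one step and then performs the log n halving iterations of Step 2; the assertion compares the result against an ordinary triple-loop multiplication. n is assumed to be a power of two.

# A minimal sketch that serially simulates the two PRAM steps for n x n
# matrix multiplication: step 1 forms all n**3 products C[i][j][k] at once,
# and step 2 halves the active range l each cycle, finishing in log2(n) cycles.
import random

def pram_matmul(A, B):
    n = len(A)                      # n is assumed to be a power of two
    C = [[[A[i][k] * B[k][j] for k in range(n)]   # step 1: all products in one cycle
          for j in range(n)] for i in range(n)]
    l = n
    while l > 1:                    # step 2: log2(n) reduction cycles
        l //= 2
        for i in range(n):
            for j in range(n):
                for k in range(l):  # every (i, j, k) triple acts in the same cycle
                    C[i][j][k] += C[i][j][k + l]
    return [[C[i][j][0] for j in range(n)] for i in range(n)]

n = 4
A = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
B = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
ref = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
assert pram_matmul(A, B) == ref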
Discrepancy with Physical Models  PRAM models idealize parallel computers, in which all memory references and program executions by multiple processors are synchronized without extra cost. In reality, such parallel machines do not exist. An SIMD machine with shared memory is the closest architecture modeled by PRAM. However, PRAM allows different instructions to be executed on different processors simultaneously. Therefore, PRAM really operates in synchronized MIMD mode with a shared memory.
Among the four PRAM variants, the EREW and CRCW are the most popular models used. In fact, every CRCW algorithm can be simulated by an EREW algorithm. The CRCW algorithm runs faster than an equivalent EREW algorithm. It has been proved that the best n-processor EREW algorithm can be no more than O(log n) times slower than any n-processor CRCW algorithm.
The CREW model has received more attention in the literature than the ERCW model. For our purposes, we will use the CRCW-PRAM model unless otherwise stated. This particular model will be used in defining scalability in Chapter 3.
For complexity analysis or performance comparison, various PRAM variants offer an ideal model of parallel computers. Therefore, computer scientists use the PRAM model more often than computer engineers. In this book, we design parallel/vector computers using physical architectural models rather than PRAM models. The PRAM model will be used for scalability and performance studies in Chapter 3 as a theoretical reference machine. PRAM models can indicate upper and lower bounds on the performance of real parallel computers.
is presented below, based on the work of Clark Thompson (1980). Three lower bounds on VLSI circuits are interpreted by Jeffrey Ullman (1984). The bounds are obtained by setting limits on memory, I/O, and communication for implementing parallel algorithms with VLSI chips.

The AT² Model  Let A be the chip area and T be the latency for completing a given computation using a VLSI circuit chip. Let s be the problem size involved in the computation. Thompson stated in his doctoral thesis that for certain computations, there exists a lower bound f(s) such that

    A × T² ≥ O(f(s))        (1.6)
Memory Bound on Chip Area A  There are many computations which are memory-bound, due to the need to process large data sets. To implement this type of computation in silicon, one is limited by how densely information (bit cells) can be placed on the chip. As depicted in Fig. 1.15a, the memory requirement of a computation sets a lower bound on the chip area A.
The amount of information processed by the chip can be visualized as information flow upward across the chip area. Each bit can flow through a unit area of the horizontal chip slice. Thus, the chip area bounds the amount of memory bits stored on the chip.
I/O Bound on Volume AT  The volume of the rectangular cube is represented by the product AT. As information flows through the chip for a period of time T, the number of input bits cannot exceed the volume AT, as demonstrated in Fig. 1.15a.
Fig. 1.15  The AT² complexity model of VLSI chips: (a) the chip area A and the volume AT of information flow; (b) the bisection area √A · T
The area A corresponds to data into and out of the entire surface of the silicon chip. This areal measure sets the maximum I/O limit rather than using the peripheral I/O pads as seen in conventional chips. The height T of the volume can be visualized as a number of snapshots on the chip, as computing time elapses. The volume represents the amount of information flowing through the chip during the entire course of the computation.
Bisection Communication Bound, √A T  Figure 1.15b depicts a communication-limited lower bound on the bisection area √A T. The bisection is represented by the vertical slice cutting across the shorter dimension of the chip area. The distance of this dimension is √A for a square chip. The height of the cross section is T.
The bisection area represents the maximum amount of information exchange between the two halves of the chip circuit during the time period T. The cross-section area √A T limits the communication bandwidth of a computation. VLSI complexity theoreticians have used the square of this measure, AT², to which the lower bound applies, as seen in Eq. 1.6.
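For reference, the three lower bounds discussed in this subsection can be collected in LaTeX notation (this compact restatement is ours, not the text's; the implied constants are technology-dependent):

A \;\ge\; O(\text{memory bits}), \qquad
A\,T \;\ge\; O(\text{I/O bits}), \qquad
\bigl(\sqrt{A}\,T\bigr)^{2} \;=\; A\,T^{2} \;\ge\; O\bigl(f(s)\bigr)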
Example 1.6  VLSI chip implementation of a matrix multiplication algorithm (Viktor Prasanna, 1992)
This example shows how to estimate the chip area A and compute time T for n × n matrix multiplication C = A × B on a mesh of processing elements (PEs) with a broadcast bus on each row and each column. The 2-D mesh architecture is shown in Fig. 1.16. Inter-PE communication is done through the broadcast buses. We want to prove the bound AT² = O(n⁴) by developing a parallel matrix multiplication algorithm with time T = O(n), using the mesh with broadcast buses. Therefore, we need to prove that the chip area is bounded by A = O(n²).
Each PE occupies a unit area, and the broadcast buses require O(n²) wire area. Thus the total chip area needed is O(n²) for an n × n mesh with broadcast buses. We show next that the n × n matrix multiplication can be performed on this mesh chip in T = O(n) time. Denote the PEs as PE(i, j), 0 ≤ i, j ≤ n − 1.
Initially the input matrix elements A(i, j) and B(i, j) are stored in PE(i, j) with no duplicated data. The memory is distributed among all the PEs. Each PE can access only its own local memory. The following parallel algorithm shows how to perform the dot-product operations in generating all the output elements C(i, j) = Σ_{k=0}^{n−1} A(i, k) × B(k, j) for 0 ≤ i, j ≤ n − 1.
Fig. 1.16  A 4 × 4 mesh of processing elements (PEs) with broadcast buses on each row and on each column (Courtesy of Prasanna Kumar and Raghavendra; reprinted from Journal of Parallel and Distributed Computing, April 1987)
Doall 10 for 0 ≤ i, j ≤ n − 1
10      PE(i, j) sets C(i, j) to 0    /Initialization/
Do 50 for 0 ≤ k ≤ n − 1
        Doall 20 for 0 ≤ i ≤ n − 1
20      PE(i, k) broadcasts A(i, k) along its row bus
        Doall 30 for 0 ≤ j ≤ n − 1
30      PE(k, j) broadcasts B(k, j) along its column bus
        /PE(i, j) now has A(i, k) and B(k, j), 0 ≤ i, j ≤ n − 1/
        Doall 40 for 0 ≤ i, j ≤ n − 1
40      PE(i, j) computes C(i, j) ← C(i, j) + A(i, k) × B(k, j)
50 Continue
The above algorithm has a sequential loop along the dimension indexed by k. It takes n time units (iterations) in this k-loop. Thus, we have T = O(n). Therefore, AT² = O(n²) × (O(n))² = O(n⁴).
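The Python sketch below (ours, not the text's; variable names mirror the pseudocode) serially simulates the broadcast-bus algorithm: in each of the n iterations of the k-loop, the row and column buses deliver A(i, k) and B(k, j) to every PE(i, j), which accumulates its local C(i, j).

# A minimal sketch simulating the broadcast-bus mesh matrix multiplication.
def mesh_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]          # Doall 10: every PE clears its C(i, j)
    for k in range(n):                        # Do 50: the only sequential loop, n steps
        row_bus = [A[i][k] for i in range(n)] # Doall 20: PE(i, k) broadcasts A(i, k) on row i
        col_bus = [B[k][j] for j in range(n)] # Doall 30: PE(k, j) broadcasts B(k, j) on column j
        for i in range(n):                    # Doall 40: all n*n PEs update in one time unit
            for j in range(n):
                C[i][j] += row_bus[i] * col_bus[j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert mesh_matmul(A, B) == [[19, 22], [43, 50]]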
1.5  ARCHITECTURAL DEVELOPMENT TRACKS

The architectures of most existing computers follow certain development tracks. Understanding features of various tracks provides insights for new architectural development. We look into six tracks to be studied in later chapters. These tracks are distinguished by similarity in computational models and technological bases. We also review a few early representative systems in each track.
1.5.1  Multiple-Processor Tracks
Generally speaking, a multiple-processor system can be either a shared-memory multiprocessor or a distributed-memory multicomputer as modeled in Section 1.2. Bell listed these machines at the leaf nodes of the taxonomy tree (Fig. 1.10). Instead of a horizontal listing, we show a historical development along each important track of the taxonomy.
Shared-Memory Track  Figure 1.17a shows a track of multiprocessor development employing a single address space in the entire system. The track started with the C.mmp system developed at Carnegie-Mellon University (Wulf and Bell, 1972). The C.mmp was a UMA multiprocessor. Sixteen PDP 11/40 processors were interconnected to 16 shared-memory modules via a crossbar switch. A special interprocessor interrupt bus was provided for fast interprocess communication, besides the shared memory. The C.mmp project pioneered shared-memory multiprocessor development, not only in the crossbar architecture but also in the multiprocessor operating system (Hydra) development.
Fig. 1.17  Two multiple-processor tracks with and without shared memory: (a) the shared-memory track, from the CMU C.mmp (Wulf and Bell, 1972) through the NYU Ultracomputer (Gottlieb et al., 1983), Illinois Cedar (Kuck et al., 1987), IBM RP3 (Pfister et al., 1985), BBN Butterfly (BBN, 1989), Stanford Dash (Lenoski, Hennessy et al., 1992), Fujitsu VPP500 (Fujitsu, Inc., 1992), and KSR1 (Kendall Square Research, 1990); (b) the message-passing track
Both the NYU Ultracomputer project (Gottlieb et al., 1983) and the Illinois Cedar project (Kuck et al., 1987) were developed with a single address space. Both systems used multistage networks as a system interconnect. The major achievements in the Cedar project were in parallel compilers and performance benchmarking experiments. The Ultracomputer developed the combining network for fast synchronization among multiple processors, to be studied in Chapter 7.
The Stanford Dash (Lenoski, Hennessy et al., 1992) was a NUMA multiprocessor with distributed memories forming a global address space. Cache coherence was enforced with distributed directories. The KSR-1 was a typical COMA model. The Fujitsu VPP500 was a 222-processor system with a crossbar interconnect. The shared memories were distributed to all processor nodes. We will study the Dash and the KSR-1 in Chapter 9 and the VPP500 in Chapter 8.
Following the Ultracomputer are two large-scale multiprocessors, both using multistage networks but with different interstage connections, to be studied in Chapters 2 and 7. Among the systems listed in Fig. 1.17a, only the KSR-1, VPP500, and BBN Butterfly (BBN Advanced Computers, 1989) were commercial products. The rest were research systems; only prototypes were built in laboratories, with a view to validate specific architectural concepts.
Message-Passing Track  The Cosmic Cube (Seitz et al., 1981) pioneered the development of message-passing multicomputers (Fig. 1.17b). Since then, Intel produced a series of medium-grain hypercube computers (the iPSCs). The nCUBE 2 also assumed a hypercube configuration. A subsequent Intel system was the Paragon (1992) to be studied in Chapter 7. On the research track, the Mosaic C (Seitz, 1992) and the MIT J-Machine (Dally et al., 1992) were two fine-grain multicomputers, to be studied in Chapter 9.
Fig. 1.18  Multivector and SIMD tracks: (a) the multivector track; (b) the SIMD track, including the Illiac IV (Barnes et al., 1968), BSP (Kuck and Stokes, 1982), Goodyear MPP (Batcher, 1980), IBM GF11 (Beetem et al., 1985), DAP610 (AMT, Inc., 1987), CM-2 (TMC, 1990), MasPar MP-1 (Nickolls, 1990), and CM-5 (TMC, 1991)
Both tracks are useful for concurrent scalar/vector processing. Detailed studies can be found in Chapter 8, with further discussion in Chapter 13.
Multivector Track  These are traditional vector supercomputers. The CDC 7600 was the first vector dual-processor system. Two subtracks were derived from the CDC 7600. The Cray and Japanese supercomputers all followed the register-to-register architecture. Cray 1 pioneered the multivector development in 1978. The Cray/MPP was a massively parallel system with distributed shared memory, to work as a back-end accelerator engine compatible with the Cray Y-MP Series.
The other subtrack used memory-to-memory architecture in building vector supercomputers. We have identified only the CDC Cyber 205 and its successor the ETA10 here, for completeness in tracking different supercomputer architectures.
The SIMD Track  The Illiac IV pioneered the construction of SIMD computers, although the array processor concept can be traced back far earlier to the 1960s. The subtrack, consisting of the Goodyear MPP, the AMT/DAP610, and the TMC/CM-2, were all SIMD machines built with bit-slice PEs. The CM-5 was a synchronized MIMD machine executing in a multiple-SIMD mode.
The other subtrack corresponds to medium-grain SIMD computers using word-wide PEs. The BSP (Kuck and Stokes, 1982) was a shared-memory SIMD machine built with 16 processors updating a group of 17 memory modules synchronously. The GF11 (Beetem et al., 1985) was developed at the IBM Watson Laboratory for scientific simulation research use. The MasPar MP1 was the only medium-grain SIMD computer to achieve production use in that time period. We will describe the CM-2, MasPar MP1, and CM-5 in Chapter 8.
Fig. 1.19  Multithreaded and dataflow tracks; (b) the dataflow track: Static Dataflow (Dennis, 1974), Manchester (Gurd and Watson, 1982), Sigma 1 (Shimada et al., 1987), and EM-5 (Sakai et al., 1989)
The Multithreading Track  The conventional von Neumann machines are built with processors that execute a single context at a time. In other words, each processor maintains a single thread of control with its hardware resources. In a multithreaded architecture, each processor can execute multiple contexts at the same time. The term multithreading implies that there are multiple threads of control in each processor. Multithreading offers an effective mechanism for hiding long latency in building large-scale multiprocessors and is today a mature technology.
As shown in Fig. 1.19a, the multithreading idea was pioneered by Burton Smith (1978) in the HEP system, which extended the concept of scoreboarding of multiple functional units in the CDC 6600. Subsequent multithreaded multiprocessor projects were the Tera computer (Alverson, Smith et al., 1990) and the MIT Alewife (Agarwal et al., 1989) to be studied in Section 9.4. In Chapters 12 and 13, we shall discuss the present technological factors which have led to the design of multithreaded processors.
The Dataflow Track  We will introduce the basic concepts of dataflow computers in Section 2.3. Some experimental dataflow systems are described in Section 9.5. The key idea is to use a dataflow mechanism, instead of a control-flow mechanism as in von Neumann machines, to direct the program flow. Fine-grain, instruction-level parallelism is exploited in dataflow computers.
As listed in Fig. 1.19b, the dataflow concept was pioneered by Jack Dennis (1974) with a "static" architecture. The concept later inspired the development of "dynamic" dataflow computers. A series of tagged-token architectures was developed at MIT by Arvind and coworkers. We will describe the tagged-token architecture in Section 2.3.1 and then the *T prototype (Nikhil et al., 1991) in Section 9.5.3.
Another subtrack of dynamic dataflow computers was represented by the Manchester machine (Gurd and Watson, 1982). The ETL Sigma 1 (Shimada et al., 1987) and EM-5 evolved from the MIT and Manchester machines. We will study the EM-5 (Sakai et al., 1989) in Section 9.5.2. These dataflow machines represent research concepts which have not had a major impact in terms of widespread use.
Summary
In science and in engineering, theory and practice go hand-in-hand, and any significant achievement invariably relies on a judicious blend of the two. In this chapter, as the first step towards a conceptual understanding of parallelism in computer architecture, we have looked at the models of parallel computer systems which have emerged over the years. We started our study with a brief look at the development of modern computers and computer architecture, including the means of classification of computer architecture, and in particular Flynn's scheme of classification.
The performance of any engineering system must be quantifiable. In the case of computer systems, we have performance measures such as processor clock rate, cycles per instruction (CPI), word size, and throughput in MIPS and/or MFLOPS. These measures have been defined, and basic relationships between them have been examined. Thus the ground has been prepared for our study in subsequent chapters of how processor architecture, system architecture, and software determine performance.
Next we looked at the architecture of shared-memory multiprocessors and distributed-memory multicomputers, laying the foundation for a taxonomy of MIMD computers. A key system characteristic is whether different processors in the system have access to a common shared memory and, if they do, whether the access is uniform or non-uniform. Vector computers and SIMD computers were examined,
which address the needs of highly compute-intensive scientific and engineering applications.
Over the last two or three decades, advances in VLSI technology have resulted in huge advances in computer system performance; however, the basic architectural concepts which were developed prior to the 'VLSI revolution' continue to remain valid.
The parallel random access machine (PRAM) is a theoretical model of a parallel computer. No real computer system can behave exactly like the PRAM, but at the same time the PRAM model provides us with a basis for the study of parallel algorithms and their performance in terms of time and/or space complexity. Different sub-types of the PRAM model emerge, depending on whether or not multiple processors can perform concurrent read or write operations to the shared memory.
Towards the end of the chapter, we could discern the separate architectural development tracks which have emerged over the years in computer systems. We looked at multiple-processor systems, vector processing, SIMD systems, and multithreaded and dataflow systems. We shall see in Chapters 12 and 13 that, due to various technological factors, multithreaded processors have gained in importance over the last decade or so.
Exercises
Problem 1.1  A 400-MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle counts:

Instruction type      Instruction count    Clock cycle count
Integer arithmetic    45,000               1
Data transfer         32,000               2
Floating point        15,000               2
Control transfer      8,000                1

Determine the effective CPI, MIPS rate, and execution time for this program.

Problem 1.2  Explain how instruction set, compiler technology, CPU implementation and control, and cache and memory hierarchy affect the CPU performance, and justify the effects in terms of program length, clock rate, and effective CPI.

Problem 1.3  A workstation uses a 1.5-GHz processor with a claimed 1000-MIPS rating to execute a given program mix. Assume a one-cycle delay for each memory access.
(a) What is the effective CPI of this computer?
(b) Suppose the processor is being upgraded with a 3.0-GHz clock. However, even with a faster cache, two clock cycles are needed per memory access. If 30% of the instructions require one memory access and another 5% require two memory accesses per instruction, what is the performance of the upgraded processor with a compatible instruction set and equal instruction counts in the given program mix?

Problem 1.4  Consider the execution of an object code with 2 × 10^6 instructions on a 400-MHz processor. The program consists of four major types of instructions. The instruction mix and the number of cycles (CPI) needed for each instruction type are given below based on the result of a program trace experiment:

Instruction type                     CPI    Instruction mix
Arithmetic and logic                 1      60%
Load/store with cache hit            2      18%
Branch                               4      12%
Memory reference with cache miss     8      10%

(a) Calculate the average CPI when the program is executed on a uniprocessor with the above trace results.
(b) Calculate the corresponding MIPS rate based on the CPI obtained in part (a).

Problem 1.5  Indicate whether each of the following statements is true or false and justify your answer with reasoning and supportive or counter examples:
(a) The CPU computations and I/O operations cannot be overlapped in a multiprogrammed computer.
(b) Synchronization of all PEs in an SIMD computer is done by hardware rather than by software, as is often done in most MIMD computers.
(c) As far as programmability is concerned, shared-memory multiprocessors offer simpler interprocessor communication support than that offered by a message-passing multicomputer.
(d) In an MIMD computer, all processors must execute the same instruction at the same time synchronously.
(e) As far as scalability is concerned, multicomputers with distributed memory are more scalable than shared-memory multiprocessors.

Problem 1.6  The execution times (in seconds) of four programs on three computers are given below:

Program      Computer A    Computer B    Computer C
Program 1    1             10            20
Program 2    1000          100           20
Program 3    500           1000          50
Program 4    100           800           100

Assume that 10^9 instructions were executed in each of the four programs. Calculate the MIPS rating of each program on each of the three machines. Based on these ratings, can you draw a clear conclusion regarding the relative performance of the three computers? Give reasons if you find a way to rank them statistically.

Problem 1.7  Characterize the architectural operations of SIMD and MIMD computers. Distinguish between multiprocessors and multicomputers based on their structures, resource sharing, and interprocessor communications. Also, explain the differences among UMA, NUMA, COMA, and NORMA computers.

Problem 1.8  The following code segment, consisting of six instructions, needs to be executed 64 times for the evaluation of the vector arithmetic expression D(I) = A(I) + B(I) × C(I) for 0 ≤ I ≤ 63.

Load R1, B(I)       /R1 ← Memory (α + I)/
Load R2, C(I)       /R2 ← Memory (β + I)/
Multiply R1, R2     /R1 ← (R1) × (R2)/
Load R3, A(I)       /R3 ← Memory (γ + I)/
Add R3, R1          /R3 ← (R3) + (R1)/
Store D(I), R3      /Memory (θ + I) ← (R3)/

where R1, R2, and R3 are CPU registers, (R1) is the content of R1, and α, β, γ, and θ are the starting memory addresses of arrays B(I), C(I), A(I), and D(I), respectively. Assume four clock cycles for each Load or Store, two cycles for the Add, and eight cycles for the Multiply on either a uniprocessor or a single PE in an SIMD machine.
(a) Calculate the total number of CPU cycles needed to execute the above code segment repeatedly 64 times on an SISD uniprocessor computer sequentially, ignoring all other time delays.
(b) Consider the use of an SIMD computer with 64 PEs to execute the above vector operations in six synchronized vector instructions over 64-component vector data, both driven by the same-speed clock. Calculate the total execution time on the SIMD machine, ignoring instruction broadcast and other delays.
(c) What is the speedup gain of the SIMD computer over the SISD computer?

Problem 1.9  Prove that the best parallel algorithm written for an n-processor EREW PRAM model can be no more than O(log n) times slower than any algorithm for a CRCW model of PRAM having the same number of processors.

Problem 1.10  Consider the multiplication of two n-bit binary integers using a 1.2-μm CMOS multiplier chip. Prove the lower bound AT² > k n², where A is the chip area, T is the execution time, n is the word length, and k is a technology-dependent constant.

Problem 1.11  Compare the PRAM models with physical models of real parallel computers in each of the following categories:
(a) Which PRAM variant can best model SIMD machines and how?
(b) Repeat the question in part (a) for shared-memory MIMD machines.

Problem 1.12  Answer the following questions related to the architectural development tracks presented in Section 1.5:
(a) For the shared-memory track (Fig. 1.17), explain the trend in physical memory organizations from the earlier system (C.mmp) to more recent systems (such as Dash, etc.).
(b) Distinguish between medium-grain and fine-grain multicomputers in their architectures and programming requirements.
(c) Distinguish between register-to-register and memory-to-memory architectures for building conventional multivector supercomputers.
(d) Distinguish between single-threaded and multithreaded processor architectures.

Problem 1.13  Design an algorithm to find the maximum of n numbers in O(log n) time on an EREW-PRAM model. Assume that initially each location holds one input value. Explain how you would make the algorithm processor-time optimal.

Problem 1.14  Develop two algorithms for fast multiplication of two n × n matrices with a system of p processors, where 1 ≤ p ≤ n³/log n. Choose an appropriate PRAM machine model to prove that the matrix multiplication can be done in T = O(n³/p) time.
(a) Prove that T = O(n²) if p = n. The corresponding algorithm must be shown, similar to that in Example 1.5.
(b) Show the parallel algorithm with T = O(n) if p = n².

Problem 1.15  Match each of the following eight computer systems: KSR-1, RP3, Paragon, Dash, CM-2, VPP500, EM-5, and Tera, with one of the best descriptions listed below. The mapping is a one-to-one correspondence.
(a) A massively parallel system built with multiple-context processors and a 3-D torus architecture.
(b) A data-parallel computer built with bit-slice PEs interconnected by a hypercube/mesh network.
(c) A ring-connected multiprocessor using a cache-only memory architecture.
(d) An experimental multiprocessor built with a dynamic dataflow architecture.
(e) A crossbar-connected multiprocessor built with distributed processor/memory nodes forming a single address space.
(f) A multicomputer built with commercial microprocessors with multiple address spaces.
(g) A scalable multiprocessor built with distributed shared memory and coherent caches.
(h) An MIMD computer built with a large multistage switching network.
2.1  CONDITIONS OF PARALLELISM

The exploitation of parallelism has created a new dimension in computer science. In order to move parallel processing into the mainstream of computing, H. T. Kung (1991) has identified the need to make significant progress in three key areas: computation models for parallel computing, interprocessor communication in parallel architectures, and system integration for incorporating parallel systems into general computing environments.
A theoretical treatment of parallelism is thus needed to build a basis for the above challenges. In practice, parallelism appears in various forms in a computing environment. All forms can be attributed to levels of parallelism, computational granularity, time and space complexities, communication latencies, scheduling policies, and load balancing. Very often, tradeoffs exist among time, space, performance, and cost factors.
Data Dependence  The ordering relationship between statements is indicated by the data dependence. Five types of data dependence are defined below:
(1) Flow dependence: A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output (variables assigned) of S1 feeds in as input (operands to be used) to S2. Flow dependence is denoted as S1 → S2.
(2) Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. A direct arrow crossed with a bar, as in S1 ↛ S2, indicates antidependence from S1 to S2.
(3) Output dependence: Two statements are output-dependent if they produce (write) the same output variable. S1 o→ S2 indicates output dependence from S1 to S2.
(4) I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
(5) Unknown dependence: The dependence relation between two statements cannot be determined in certain situations, for example when a subscript is itself subscripted, when a subscript does not contain the loop index variable, when a variable appears more than once with subscripts having different coefficients of the loop variable, or when a subscript is nonlinear in the loop index variable.
Example 2.1  Data dependence in programs
Consider the following code fragment of four instructions:

S1:  Load R1, A       /R1 ← Memory(A)/
S2:  Add R2, R1       /R2 ← (R1) + (R2)/
S3:  Move R1, R3      /R1 ← (R3)/
S4:  Store B, R1      /Memory(B) ← (R1)/
As illustrated in Fig. 2.1a, S2 is flow-dependent on S1 because the variable A is passed via the register R1. S3 is antidependent on S2 because of potential conflicts in register content in R1. S3 is output-dependent on S1 because they both modify the same register R1. Other data dependence relationships can be similarly revealed on a pairwise basis. Note that dependence is a partial ordering relation; that is, the members of not every pair of statements are related. For example, the statements S2 and S4 in the above program are totally independent.
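These pairwise relations can be derived mechanically from the read and write sets of each instruction. The Python sketch below (ours, not the text's) does so for the four instructions above; note that the check is purely pairwise and conservative, so it also reports a flow dependence from S1 to S4 even though S3 redefines R1 in between.

# A minimal sketch deriving flow, anti, and output dependences from the
# register/memory read and write sets of the instructions in Example 2.1.
instrs = {                 # name: (locations read, locations written)
    "S1": ({"A"},        {"R1"}),   # Load  R1, A
    "S2": ({"R1", "R2"}, {"R2"}),   # Add   R2, R1
    "S3": ({"R3"},       {"R1"}),   # Move  R1, R3
    "S4": ({"R1"},       {"B"}),    # Store B, R1
}

order = list(instrs)
for a_idx, a in enumerate(order):
    for b in order[a_idx + 1:]:
        ra, wa = instrs[a]
        rb, wb = instrs[b]
        # Pairwise checks only: intervening redefinitions are not "killed".
        if wa & rb:
            print(f"{a} -> {b}   flow dependence on {sorted(wa & rb)}")
        if ra & wb:
            print(f"{a} -+-> {b} antidependence on {sorted(ra & wb)}")
        if wa & wb:
            print(f"{a} o-> {b}  output dependence on {sorted(wa & wb)}")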
Next, we consider a code fragment involving I/O operations:

S1:  Read (4), A(I)     /Read array A from file 4/
S2:  Process            /Process data/
S3:  Write (4), B(I)    /Write array B into file 4/
S4:  Close (4)          /Close file 4/
As shown in Fig. 2.1b, the read/write statements, S1 and S3, are I/O-dependent on each other because they both access the same file. The above data dependence relations should not be arbitrarily violated during program execution. Otherwise, erroneous results may be produced with changed program order. The order in which statements are executed in a sequential program is well defined and repetitive runs produce identical results. On a multiprocessor system, the program order may or may not be preserved, depending on the memory model used. Determinism yielding predictable results can be controlled by a programmer as well as by constrained modification of writable data in a shared memory.
Control Dependence  This refers to the situation where the order of execution of statements cannot be determined before run time. For example, conditional statements will not be resolved until run time. Different paths taken after a conditional branch may introduce or eliminate data dependence among instructions. Dependence may also exist between operations performed in successive iterations of a looping procedure. In the following, we show one loop example with and another without control-dependent iterations. The successive iterations of the following loop are control-independent:
Do 20 I = 1, N
    A(I) = C(I)
    IF (A(I) .LT. 0) A(I) = 1
20  Continue
The successive iterations of the following loop are control-dependent:

Do 10 I = 1, N
    IF (A(I − 1) .EQ. 0) A(I) = 0
10  Continue
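The difference between the two loops is easy to see when they are rewritten in an imperative language. The Python sketch below (ours, not the text's; N and the sample array C are arbitrary) runs both loops: the first touches only A[i] in iteration i, so its iterations are mutually independent, while the second tests A[i-1], a value the previous iteration may have just written.

# A minimal sketch contrasting control-independent and control-dependent loops.
N = 8
C = [3, -1, 0, 5, -2, 0, 4, 0]

A = [0] * N
for i in range(N):          # loop 1: each iteration reads and writes only A[i]
    A[i] = C[i]
    if A[i] < 0:
        A[i] = 1

for i in range(1, N):       # loop 2: the branch condition reads A[i-1], which the
    if A[i - 1] == 0:       # previous iteration may have just set to 0, so the
        A[i] = 0            # iterations cannot simply be executed in parallel
print(A)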
Control dependence often prohibits parallelism from being exploited. Compiler techniques or hardware
branch prediction techniques are needed to get around the control dependence in order to exploit more
parallelism.
Resource Dependence  This is different from data or control dependence, which demands the independence of the work to be done. Resource dependence is concerned with the conflicts in using shared resources, such as integer units, floating-point units, registers, and memory areas, among parallel events. When the conflicting resource is an ALU, we call it ALU dependence.
If the conflicts involve workplace storage, we call it storage dependence. In the case of storage dependence, each task must work on independent storage locations or use protected access (such as locks or monitors to be described in Chapter 11) to shared writable data.
The detection of parallelism in programs requires a check of the various dependence relations.
Bernstein's Conditions  In 1966, Bernstein revealed a set of conditions based on which two processes can execute in parallel. A process is a software entity corresponding to the abstraction of a program fragment defined at various processing levels. We define the input set Ii of a process Pi as the set of all input variables needed to execute the process.
Similarly, the output set Oi consists of all output variables generated after execution of the process Pi. Input variables are essentially operands which can be fetched from memory or registers, and output variables are the results to be stored in working registers or memory locations.
Now, consider two processes P1 and P2 with their input sets I1 and I2 and output sets O1 and O2, respectively. These two processes can execute in parallel and are denoted P1 || P2 if they are independent and therefore create deterministic results.
Formally, these conditions are stated as follows:

    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅        (2.1)
    O1 ∩ O2 = ∅
These three conditions are known as Bernstein's conditions. The input set Ii is also called the read set or the domain of Pi by other authors. Similarly, the output set Oi has been called the write set or the range of a process Pi. In terms of data dependences, Bernstein's conditions simply imply that two processes can execute in parallel if they are flow-independent, antiindependent, and output-independent.
The parallel execution of such two processes produces the same results regardless of whether they are executed sequentially in any order or in parallel. This is because the output of one process will not be used as input to the other process. Furthermore, the two processes do not modify (write) the same set of variables, either in memory or in the registers.
In general, a set of processes, P1, P2, ..., Pk, can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; that is, P1 || P2 || ... || Pk if and only if Pi || Pj for all i ≠ j. This is exemplified by the following program illustrated in Fig. 2.2.
Example 2.2  Detection of parallelism in a program using Bernstein's conditions

Consider the simple case in which each process is a single HLL statement. We want to detect the parallelism embedded in the following five statements labeled P1, P2, P3, P4, and P5 in program order.
    P1:  C = D × E
    P2:  M = G + C
    P3:  A = B + C        (2.2)
    P4:  C = L + M
    P5:  F = G ÷ E
Assume that each statement requires one step to execute. No pipelining is considered here. The dependence graph shown in Fig. 2.2a demonstrates flow dependence as well as resource dependence. In sequential execution, five steps are needed (Fig. 2.2b).
Fig. 2.2  (a) A dependence graph showing both data dependence and resource dependence; (b) sequential execution in five steps, assuming one step per statement (no pipelining); (c) parallel execution in three steps, assuming two adders are available per step
If two adders are available simultaneously, the parallel execution requires only three steps as shown in Fig. 2.2c. Pairwise, there are 10 pairs of statements to check against Bernstein's conditions. Only 5 pairs, P1 || P5, P2 || P3, P2 || P5, P3 || P5, and P4 || P5, can execute in parallel as revealed in Fig. 2.2a if there are no resource conflicts. Collectively, only P2 || P3 || P5 is possible (Fig. 2.2c) because P2 || P3, P3 || P5, and P2 || P5 are all possible.
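The same pairwise test can be carried out mechanically. The Python sketch below (ours, not the text's) encodes the input and output sets of P1 through P5 and applies Eq. 2.1 to every pair; it reports exactly the five parallel pairs listed above.

# A minimal sketch checking Bernstein's conditions for the statements of Eq. 2.2.
from itertools import combinations

procs = {                 # name: (input set I, output set O)
    "P1": ({"D", "E"}, {"C"}),   # C = D x E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

def bernstein(p, q):
    i1, o1 = procs[p]
    i2, o2 = procs[q]
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

parallel = [(p, q) for p, q in combinations(procs, 2) if bernstein(p, q)]
print(parallel)   # the five pairs quoted in the text: (P1,P5), (P2,P3), (P2,P5), (P3,P5), (P4,P5)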
In general, the parallelism relation || is commutative; i.e., Pi || Pj implies Pj || Pi. But the relation is not transitive; i.e., Pi || Pj and Pj || Pk do not necessarily guarantee Pi || Pk. For example, we have P1 || P5 and P5 || P3, but P1 ∦ P3, where ∦ means that P1 and P3 cannot execute in parallel. In other words, the order in which P1 and P3 are executed will make a difference in the computational results.
Therefore, || is not an equivalence relation. However, Pi || Pj || Pk implies associativity; i.e., (Pi || Pj) || Pk = Pi || (Pj || Pk), since the order in which the parallel executable processes are executed should not make any difference in the output sets. It should be noted that the condition Ii ∩ Ij ≠ ∅ does not prevent parallelism between Pi and Pj.
Violation of any one or more of the three conditions in Eq. 2.1 prohibits parallelism between two processes. In general, violation of any one or more of the 3n(n − 1)/2 Bernstein's conditions among n processes prohibits parallelism collectively or partially.
In general, data dependence, control dependence, and resource dependence all prevent parallelism from being exploitable.
The statement-level dependence can be generalized to higher levels, such as code segment, subroutine, process, task, and program levels. The dependence of two higher-level objects can be inferred from the dependence of statements in the corresponding objects. The goals of analyzing the data dependence, control dependence, and resource dependence in a code are to identify opportunities for parallelization or vectorization. Hardware techniques for detecting instruction-level parallelism in a running program are described in Chapter 12.
Very often program restructuring or code transformations need to be performed before such opportunities can be revealed. The dependence relations are also used in instruction issue and pipeline scheduling operations described in Chapters 6 and 12.
Hardware Parallelism  This refers to the type of parallelism defined by the machine architecture and hardware multiplicity. Hardware parallelism is often a function of cost and performance tradeoffs. It displays the resource utilization patterns of simultaneously executable operations. It can also indicate the peak performance of the processor resources.
One way to characterize the parallelism in a processor is by the number of instruction issues per machine cycle. If a processor issues k instructions per machine cycle, then it is called a k-issue processor.
A conventional pipelined processor takes one machine cycle to issue a single instruction. These types of processors are called one-issue machines, with a single instruction pipeline in the processor. In a modern processor, two or more instructions can be issued per machine cycle.
For example, the Intel i960CA was a three-issue processor with one arithmetic, one memory access, and one branch instruction issued per cycle. The IBM RISC/System 6000 is a four-issue processor capable of issuing one arithmetic, one memory access, one floating-point, and one branch operation per cycle.
Software Parallelism  This type of parallelism is revealed in the program profile or in the program flow graph. Software parallelism is a function of algorithm, programming style, and program design. The program flow graph displays the patterns of simultaneously executable operations.
Fig. 2.3  (a) Software parallelism and (b) hardware parallelism for an example program (L: load operation; X: multiply operation; +/−: add/subtract operation)

Fig. 2.4  Dual-processor execution of the program in Fig. 2.3a
Of the many types of software parallelism, two are most frequently cited as important to parallel programming. The first is control parallelism, which allows two or more operations to be performed simultaneously. The second type has been called data parallelism, in which almost the same operation is performed over many data elements by many processors simultaneously.
Control parallelism, appearing in the form of pipelining or multiple functional units, is limited by the pipeline length and by the multiplicity of functional units. Both pipelining and functional parallelism are handled by the hardware; programmers need take no special actions to invoke them.
Data parallelism offers the highest potential for concurrency. It is practiced in both SIMD and MIMD modes on MPP systems. Data parallel code is easier to write and to debug than control parallel code. Synchronization in SIMD data parallelism is handled by the hardware. Data parallelism exploits parallelism in proportion to the quantity of data involved. Thus data parallel computations appeal to scaled problems, in which the performance of an MPP system does not drop sharply with the possibly small sequential fraction in the program.
To solve the mismatch problem between software parallelism and hardware parallelism, one approach is to develop compilation support, and the other is through hardware redesign for more efficient exploitation of parallelism. These two approaches must cooperate with each other to produce the best result.
Hardware processors can be better designed to exploit parallelism by an optimizing compiler. Pioneer work in processor technology with this objective was seen in the IBM 801, Stanford MIPS, and Berkeley RISC. Such processors use a large register file and sustained instruction pipelining to execute nearly one instruction per cycle. The large register file supports fast access to temporary values generated by an optimizing compiler. The registers are exploited by the code optimizer and global register allocator in such a compiler.
The instruction scheduler exploits the pipeline hardware by filling branch and load delay slots. In superscalar processors, hardware and software branch prediction, multiple instruction issue, speculative execution, a high-bandwidth instruction cache, and support for dynamic scheduling are needed to facilitate the detection of parallelism opportunities. Further discussion on these topics can be found in Chapters 6 and 12.
Instruction Level  At the lowest level, a typical grain contains less than 20 instructions, called fine grain in Fig. 2.5. Depending on individual programs, fine-grain parallelism at this level may range from two to thousands. Butler et al. (1991) have shown that single-instruction-stream parallelism is greater than two. Wall (1991) finds that the average parallelism at instruction level is around five, rarely exceeding seven, in an ordinary program. For scientific applications, Kumar (1988) has measured the average parallelism in the range of 500 to 3000 Fortran statements executing concurrently in an idealized environment.
Fig. 2.5  Levels of parallelism in program execution on modern computers: Level 1, instructions or statements (fine grain); Level 2, nonrecursive loops or unfolded iterations (fine grain); Level 3, procedures, subroutines, tasks, or coroutines (medium grain); Level 4, subprograms, job steps or related parts of a program (coarse grain). Communication demand and scheduling overhead increase toward the coarser grains, while the degree of parallelism is higher at the finer grains. (Reprinted from Hwang, Proc. IEEE, October 1987)
The exploitation of fine-grain parallelism can be assisted by an optimizing compiler which should be able to automatically detect parallelism and translate the source code to a parallel form which can be recognized by the run-time system. Instruction-level parallelism can be detected and exploited within the processors, as we shall see in Chapter 12.
Loop Level  This corresponds to the iterative loop operations. A typical loop contains less than 500 instructions. Some loop operations, if independent in successive iterations, can be vectorized for pipelined execution or for lock-step execution on SIMD machines. Some loop operations can be self-scheduled for parallel execution on MIMD machines.
Loop-level parallelism is often the most optimized program construct to execute on a parallel or vector computer. However, recursive loops are rather difficult to parallelize. Vector processing is mostly exploited at the loop level (level 2 in Fig. 2.5) by a vectorizing compiler. The loop level may also be considered a fine grain of computation.
Procedure Level  This level corresponds to medium-grain parallelism at the task, procedural, subroutine, and coroutine levels. A typical grain at this level contains less than 2000 instructions. Detection of parallelism at this level is much more difficult than at the finer-grain levels. Interprocedural dependence analysis is much more involved and history-sensitive.
Communication requirement is often less compared with that required in MIMD execution mode. SPMD execution mode is a special case at this level. Multitasking also belongs in this category. Significant efforts by programmers may be needed to restructure a program at this level, and some compiler assistance is also needed.
Subprogram Level  This corresponds to the level of job steps and related subprograms. The grain size may typically contain tens or hundreds of thousands of instructions. Job steps can overlap across different jobs. Subprograms can be scheduled for different processors in SPMD or MPMD mode, often on message-passing multicomputers.
Multiprogramming on a uniprocessor or on a multiprocessor is conducted at this level. Traditionally, parallelism at this level has been exploited by algorithm designers or programmers, rather than by compilers. Good compilers for exploiting medium- or coarse-grain parallelism require suitably designed parallel programming languages.
Job (Program) Level  This corresponds to the parallel execution of essentially independent jobs (programs) on a parallel computer. The grain size can be as high as millions of instructions in a single program. For supercomputers with a small number of very powerful processors, such coarse-grain parallelism is practical. Job-level parallelism is handled by the program loader and by the operating system in general. Time-sharing or space-sharing multiprocessors explore this level of parallelism. In fact, both time and space sharing are extensions of multiprogramming.
To summarize, fine-grain parallelism is often exploited at instruction or loop levels, preferably assisted by a parallelizing or vectorizing compiler. Medium-grain parallelism at the task or job step level demands significant roles for the programmer as well as compilers. Coarse-grain parallelism at the program level relies heavily on an effective OS and on the efficiency of the algorithm used. Shared-variable communication is often used to support fine-grain and medium-grain computations.
Message-passing multicomputers have been used for medium- and coarse-grain computations. Massive parallelism is often explored at the fine-grain level, such as data parallelism on SIMD or MIMD computers.
Communication Latency  By balancing granularity and latency, one can achieve better performance of a computer system. Various latencies are attributed to machine architecture, implementing technology, and communication patterns involved. The architecture and technology affect the design choices for latency tolerance between subsystems. In fact, latency imposes a limiting factor on the scalability of the machine size. For example, over the years memory latency has increased with respect to processor cycle time. Various latency hiding or tolerating techniques will be studied in Chapters 9 and 12.
The latency incurred with interprocessor communication is another important parameter for a system designer to minimize. Besides signal delays in the data path, IPC latency is also affected by the communication patterns involved. In general, n tasks communicating with each other may require n(n − 1)/2 communication links among them. Thus the complexity grows quadratically. This leads to a communication bound which limits the number of processors allowed in a large computer system.
Communication patterns are determined by the algorithms used as well as by the architectural support provided. Frequently encountered patterns include permutations and broadcast, multicast, and conference (many-to-many) communications. The communication demand may limit the granularity or parallelism. Very often tradeoffs do exist between the two.
The communication issue thus involves the reduction of latency or complexity, the prevention of deadlock, minimizing blocking in communication patterns, and the tradeoff between parallelism and communication overhead. We will study techniques that minimize communication latency, prevent deadlock, and optimize grain size in later chapters of the book.
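The quadratic growth mentioned above is worth seeing in numbers. The short Python sketch below (ours, not the text's) evaluates n(n − 1)/2 for a few system sizes.

# A minimal sketch of the quadratic growth in dedicated point-to-point links.
for n in (4, 16, 64, 256, 1024):
    print(f"n = {n:5d}  links = {n * (n - 1) // 2:9d}")
# n = 1024 already needs 523,776 links, which is why large systems share an
# interconnection network instead of using dedicated pairwise links.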
Example 2.4  Program graph before and after grain packing (Kruatrachue and Lewis, 1988)
The basic concept of program partitioning is introduced below. In Fig. 2.6, we show an example program graph in two different grain sizes. A program graph shows the structure of a program. It is very similar to the dependence graph introduced in Section 2.1.1. Each node in the program graph corresponds to a computational unit in the program. The grain size is measured by the number of basic machine cycles (including both processor and memory cycles) needed to execute all the operations within the node.
We denote each node in Fig. 2.6 by a pair (n, s), where n is the node name (id) and s is the grain size of the node. Thus grain size reflects the number of computations involved in a program segment. Fine-grain nodes have a smaller grain size, and coarse-grain nodes have a larger grain size.
The edge label (v, d) between two end nodes specifies the output variable v from the source node or the input variable to the destination node, and the communication delay d between them. This delay includes all the path delays and memory latency involved.
There are 17 nodes in the fine-grain program graph (Fig. 2.6a) and 5 in the coarse-grain program graph (Fig. 2.6b). The coarse-grain node is obtained by combining (grouping) multiple fine-grain nodes. The fine grain corresponds to the program shown in Fig. 2.6a.
Fig. 2.6  A program graph before and after grain packing in Example 2.4 (Modified from Kruatrachue and Lewis, IEEE Software, Jan. 1988)
The idea of grain packing is to apply fine grain first in order to achieve a higher degree of parallelism. Then one combines (packs) multiple fine-grain nodes into a coarse-grain node if it can eliminate unnecessary communication delays or reduce the overall scheduling overhead.
Usually, all fine-grain operations within a single coarse-grain node are assigned to the same processor for execution. Fine-grain partition of a program often demands more interprocessor communication than that required in a coarse-grain partition. Thus grain packing offers a tradeoff between parallelism and scheduling/communication overhead.
Internal delays among fine-grain operations within the same coarse-grain node are negligible because the communication delay is contributed mainly by interprocessor delays rather than by delays within the same processor. The choice of the optimal grain size is meant to achieve the shortest schedule for the nodes on a parallel computer system.
Fig. 2.7  Scheduling of the fine-grain and coarse-grain programs: (a) fine grain (Fig. 2.6a); (b) coarse grain (Fig. 2.6b) (arrows: idle time; shaded areas: communication delays)
With respect to the fine-grain versus coarse-grain program graphs in Fig. 2.6, two multiprocessor schedules are shown in Fig. 2.7. The fine-grain schedule is longer (42 time units) because more communication delays were included, as shown by the shaded area. The coarse-grain schedule is shorter (38 time units) because communication delays among nodes 12, 13, and 14 within the same node D (and also the delays among 15, 16, and 17 within the node E) are eliminated after grain packing.
Node Duplication  In order to eliminate the idle time and to further reduce the communication delays among processors, one can duplicate some of the nodes in more than one processor.
Figure 2.8a shows a schedule without duplicating any of the five nodes. This schedule contains idle time as well as long interprocessor delays (8 units) between P1 and P2. In Fig. 2.8b, node A is duplicated into A′ and assigned to P2 besides retaining the original copy A in P1. Similarly, a duplicated node C′ is copied into P1 besides the original node C in P2. The new schedule shown in Fig. 2.8b is almost 50% shorter than that in Fig. 2.8a. The reduction in schedule time is caused by elimination of the (a, 8) and (c, 8) delays between the two processors.
Fig. 2.8  Node-duplication scheduling to eliminate communication delays between processors: (a) schedule without node duplication; (b) schedule with node duplication (A → A and A′; C → C and C′) (I: idle time; shaded areas: communication delays)
Grain packing and node duplication are often used jointly to determine the best grain size and corresponding schedule. Four major steps are involved in the grain determination and the process of scheduling optimization:
Step 1.  Construct a fine-grain program graph.
Step 2.  Schedule the fine-grain computation.
Step 3.  Perform grain packing to produce the coarse grains.
Step 4.  Generate a parallel schedule based on the packed graph.
The purpose of multiprocessor scheduling is to obtain a minimal time schedule for the computations involved. The following example clarifies this concept.
Fig. 2.9  Calculation of grain size and communication delay for the program graph in Example 2.5: (a) grain size calculation in M68000 assembly code at a 20-MHz clock; (b) calculation of the communication delay d = T1 + T2 + T3 + T4 + T5 + T6 = 20 + 20 + 32 + 20 + 20 + 100 = 212 cycles, where T3 is the 32-bit transmission time at 20 Mbps normalized to M68000 cycles at 20 MHz and T6 is the delay due to software protocols; (c) the resulting fine-grain program graph. (Courtesy of Kruatrachue and Lewis; reprinted with permission from IEEE Software, 1988)
C11 = A11 × B11 + A12 × B21
C12 = A11 × B12 + A12 × B22
C21 = A21 × B11 + A22 × B21
C22 = A21 × B12 + A22 × B22
Sum = C11 + C12 + C21 + C22
As shown in Fig. 2.9a, the eight multiplications are performed in eight multiply nodes, each of which has a grain size of 101 CPU cycles. The remaining seven additions are performed in a 3-level binary tree consisting of seven add nodes. Each add node requires 8 CPU cycles.
The interprocessor communication latency along all edges in the program graph is estimated as d = 212 cycles by adding all path delays between two communicating processors (Fig. 2.9b).
A fine-grain program graph is thus obtained in Fig. 2.9c. Note that the grain size and communication delay may vary with the different processors and communication links used in the system.
Figure 2.10 shows scheduling of the fine-grain program first on a sequential uniprocessor (P1) and then on an eight-processor (P1 to P8) system (Step 2). Based on the fine-grain graph (Fig. 2.9c), the sequential execution requires 864 cycles to complete without incurring any communication delay.
Figure 2.10b shows the reduced schedule of 741 cycles needed to execute the 15 nodes on 8 processors with incurred communication delays (shaded areas). Note that the communication delays have slowed down the parallel execution significantly, resulting in many processors idling (indicated by I), except for P1 which produces the final sum. A speedup factor of 864/741 = 1.16 is observed.
Fig. 2.10 Scheduling of the fine-grain program graph of Example 2.5: (a) a sequential schedule; (b) a parallel schedule
Next we show how to use grain packing (Step 3) to reduce the communication overhead. As shown in Fig. 2.11, we group the nodes in the top two levels into four coarse-grain nodes labeled V, W, X, and Y. The
remaining three nodes (N, O, P) then form the fifth node Z. Note that there is only one level of interprocessor communication required, as marked by d in Fig. 2.11a.
Fig. 2.11 Parallel scheduling for Example 2.5 after grain packing to reduce communication delays
Since the maximum degree of parallelism is now reduced to 4 in the program graph, we use only four processors to execute this coarse-grain program. A parallel schedule is worked out (Fig. 2.11) for this program in 446 cycles, resulting in an improved speedup of 864/446 = 1.94.
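The speedup figures quoted for Example 2.5 follow directly from the cycle counts given in the text; a short Python sketch of the arithmetic:

# Cycle counts taken from Example 2.5.
sequential = 8 * 101 + 7 * 8          # eight multiply grains plus seven add grains = 864 cycles
fine_grain_parallel = 741             # 8-processor fine-grain schedule with communication delays (Fig. 2.10b)
coarse_grain_parallel = 446           # 4-processor schedule after grain packing (Fig. 2.11)
print(sequential)                                  # 864
print(sequential / fine_grain_parallel)            # about 1.166, quoted as 1.16
print(sequential / coarse_grain_parallel)          # about 1.937, quoted as 1.94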
In a dataflow computer, the execution of an instruction is driven by data availability instead of being guided by a program counter. In theory, any instruction should be ready for execution whenever operands become available. The instructions in a data-driven program are not ordered in any way. Instead of being stored separately in a main memory, data are directly held inside instructions.
Computational results (data tokens) are passed directly between instructions. The data generated by an instruction will be duplicated into many copies and forwarded directly to all needy instructions. Data tokens, once consumed by an instruction, will no longer be available for reuse by other instructions.
This data-driven scheme requires no program counter and no control sequencer. However, it requires special mechanisms to detect data availability, to match data tokens with needy instructions, and to enable the chain reaction of asynchronous instruction executions. No memory sharing between instructions results in no side effects.
Asynchrony implies the need for handshaking or token-matching operations. A pure dataflow computer exploits fine-grain parallelism at the instruction level. Massive parallelism would be possible if the data-driven mechanism could be cost-effectively implemented with low instruction execution overhead.
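A toy token-driven interpreter helps make the firing rule concrete: an instruction fires as soon as all of its input tokens have arrived, with no program counter involved. The Python sketch below is a generic illustration of data-driven execution; the instruction graph, names and values are hypothetical and do not model any particular machine.

from collections import defaultdict

def run_dataflow(instructions, initial_tokens):
    # instructions: name -> (operation, [input port names], [consumer instruction names])
    tokens = defaultdict(dict)                       # instruction -> {port: value}
    for target, port, value in initial_tokens:
        tokens[target][port] = value
    fired, results = set(), {}
    progress = True
    while progress:
        progress = False
        for name, (op, inputs, consumers) in instructions.items():
            if name not in fired and all(i in tokens[name] for i in inputs):
                results[name] = op(*(tokens[name][i] for i in inputs))
                for c in consumers:                  # forward the result as a new token
                    tokens[c][name] = results[name]
                fired.add(name)
                progress = True
    return results

# Hypothetical two-instruction graph computing (b + 1) * c:
instrs = {
    "add": (lambda b, one: b + one, ["b", "one"], ["mul"]),
    "mul": (lambda c, add: c * add, ["c", "add"], []),
}
print(run_dataflow(instrs, [("add", "b", 4), ("add", "one", 1), ("mul", "c", 3)]))   # {'add': 5, 'mul': 15}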
A Dataflow Architecture  There have been quite a few experimental dataflow computer projects. Arvind and his associates at MIT developed a tagged-token architecture for building dataflow computers. As shown in Fig. 2.12, the global architecture consists of n processing elements (PEs) interconnected by an n × n routing network. The entire system supports pipelined dataflow operations in all n PEs. Inter-PE communications are done through the pipelined routing network.
Fig. 2.12 The MIT tagged-token dataflow computer (adapted from Arvind and Iannucci, 1986 with permission): (a) the global architecture; (b) interior design of a processing element
Within each PE, the machine provides a low-level token-matching mechanism which dispatches only those instructions whose input data (tokens) are already available. Each datum is tagged with the address of
the instruction to which it belongs and the context in which the instruction is being executed. Instructions are stored in the program memory. Tagged tokens enter the PE through a local path. The tokens can also be passed to other PEs through the routing network. All internal token circulation operations are pipelined without blocking.
One can think of the instruction address in a dataflow computer as replacing the program counter, and the context identifier as replacing the frame base register in a control flow computer. It is the machine's job to match up data with the same tag to needy instructions. In so doing, new data will be produced with a new tag indicating the successor instruction(s). Thus, each instruction represents a synchronization operation. New tokens are formed and circulated along the PE pipeline for reuse or to other PEs through the global path, which is also pipelined.
Another synchronization mechanism, called the I-structure, is provided within each PE. The I-structure is a tagged memory unit for overlapped usage of a data structure by both the producer and consumer processes. Each word of the I-structure uses a 1-bit tag indicating whether the word is empty, is full, or has pending read requests. The use of the I-structure is a retreat from the pure dataflow approach. The purpose is to reduce excessive copying of large data structures in dataflow operations.
Example 2.6  Comparison of dataflow and control-flow computers (Gajski, Padua, Kuck, and Kuhn, 1982)
The dataflow graph in Fig. 2.13a shows that 24 instructions are to be executed (8 divides, 8 multiplies, and 8 adds). A dataflow graph is similar to a dependence graph or program graph. The only difference is that data tokens are passed around the edges in a dataflow graph. Assume that each add, multiply, and divide requires 1, 2, and 3 cycles to complete, respectively. Sequential execution of the 24 instructions on a control-flow uniprocessor takes 48 cycles to complete, as shown in Fig. 2.13b.
On the other hand, a dataflow multiprocessor completes the execution in 14 cycles in Fig. 2.13c. Assume that all the external inputs (di, ei, fi for i = 1, 2, ..., 8, and c0) are available before entering the loop. With four processors, instructions a1, a2, a3, and a4 are all ready for execution in the first three cycles. The results produced then trigger the execution of a5, b1, a6, and a7 starting from cycle 4. The data-driven chain reactions are shown in Fig. 2.13c. The output c8 is the last one to be produced, due to its dependence on all the previous ci's.
Figure 2.13d shows the execution of the same set of computations on a conventional multiprocessor using shared memory to hold the intermediate results (si and ti for i = 1, 2, 3, 4). Note that no shared memory is used in the dataflow implementation. The example does not show any time advantage of dataflow execution over control flow execution.
The theoretical minimum time is 13 cycles along the critical path a1 b1 c1 c2 ... c8. The chain reaction control in dataflow is more difficult to implement and may result in longer overhead, as compared with the uniform operations performed by all the processors in Fig. 2.13d.
1"P"1d~ 9-f ‘*1 "152 9243 93d-ti ‘*4 do "5 dc “ed? “Wis °s
cD=0
iorlfromltofldo
h I
232.; 312 33 34 35 36 3? 36
bl; .-. f1 f f f4 f fa f? f
e
"cg: .l?_~_p- f-
_¢-
+ —-In
'1-|'*2"-Baht: be babtbe
Otllfll-.l'l3, U, C ' f1 1:3 Cd 13-5 gs‘ QB’
[ajfitsampieptogramandttsdataflowgraph
1 4 6 7? 1|} 12 #3 46 4-B
I *1 I b1l*=1l as I be Isl
[bi Sequential execution on a unlpro-oeesor In #8 cycles
4 (7 1491011121314
*1 I *5 l°1|°2|°al°#l°5l°sl°t|e°B|
£1ti’ f.
K J
.%.
E‘
Zr .%.
gr_
ha‘ mg‘E
at . _.
T 9 11 121314
31 as '11 I *5 |$1|l1|'=1|°5| $1=t’2*t*1-*1='*3"$1-‘=1=b1*“o-°5=*’s*°4
no ‘*6 l [_$2[l2|_°;|°t-§~| @'2=b4*b3-*2==-1"52-¢2=51*'=o-¢c=$s*°=1
‘*3;.N 3? D3] ll? l"_3[_‘a|°3]°?| 53:56‘be-la=t'?"5a~°3=‘1*°o-°?=‘a*°4
-‘viviji1
‘*4 ‘iilk- “B ?if mu’ |s4ll4|°4|°B| *4=be*t'?-*4=s=1*$3'°=1=l2"°o-‘=e=l4"°4
tjd] Parallel execution on a shared-memory 4-pro-oossor system in H cycles
Fig. 2.13 Comparison between datiflerw and control-flow computers [adapted from Grajsltl. Pacl1.n,K.u|:lt. and
Kuhn, ‘E952; reprinted with permission from IEEE Computer. Feb. 1931}
One advantage of tagging each datum is that data from different contexts can be mixed freely in the instruction execution pipeline. Thus, instruction-level parallelism of dataflow graphs can absorb the communication latency and minimize the losses due to synchronization waits. Besides token matching and the I-structure, compiler technology is also needed to generate dataflow graphs for tagged-token dataflow computers. The dataflow architecture offers in theory a promising model for massively parallel computations because all far-reaching side effects are removed. However, implementation of these concepts on a commercial scale has proved to be very difficult.
2.3.2 Demand-Driven Mechanisms
In a reduction machine, the computation is triggered by the demand for an operation's result. Consider the evaluation of a nested arithmetic expression a = ((b + 1) × c − (d + e)). The data-driven computation seen above chooses a bottom-up approach, starting from the innermost operations b + 1 and d + e, then proceeding to the × operation, and finally to the outermost subtraction. Such a computation has been called eager evaluation because operations are carried out immediately after all their operands become available.
A demand-driven computation chooses a top-down approach by first demanding the value of a, which triggers the demand for evaluating the next-level expressions (b + 1) × c and d + e, which in turn triggers the demand for evaluating b + 1 at the innermost level. The results are then returned to the nested demander in the reverse order before a is evaluated.
A demand-driven computation corresponds to lazy evaluation, because operations are executed only when their results are required by another instruction. The demand-driven approach matches naturally with the functional programming concept. The removal of side effects in functional programming makes programs easier to parallelize. There are two types of reduction machine models, both having a recursive control mechanism as characterized below.
Reduction Machine Models  In a string reduction model, each demander gets a separate copy of the expression for its own evaluation. A long string expression is reduced to a single value in a recursive fashion. Each reduction step has an operator followed by an embedded reference to demand the corresponding input operands. The operator is suspended while its input arguments are being evaluated. An expression is said to be fully reduced when all the arguments have been replaced by literal values.
In a graph reduction model, the expression is represented as a directed graph. The graph is reduced by evaluation of branches or subgraphs. Different parts of a graph or subgraphs can be reduced or evaluated in parallel upon demand. Each demander is given a pointer to the result of the reduction. The demander manipulates all references to that graph.
Graph manipulation is based on sharing the arguments using pointers. The traversal of the graph and reversal of the references are continued until constant arguments are encountered. This proceeds until the value of a is determined and a copy is returned to the original demanding instruction.
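The evaluation order can be mimicked with a small Python sketch in which each subexpression of a = ((b + 1) × c − (d + e)) is a thunk that is evaluated only when its result is demanded; the operand values and helper name are illustrative only, and this is a generic model of lazy evaluation rather than of any particular reduction machine.

def thunk(f):
    cache = {}
    def demand():
        if "v" not in cache:              # evaluate only on the first demand
            cache["v"] = f()
        return cache["v"]
    return demand

b, c, d, e = 2, 3, 4, 5                                  # illustrative operand values
inner1 = thunk(lambda: b + 1)                            # (b + 1)
inner2 = thunk(lambda: d + e)                            # (d + e)
product = thunk(lambda: inner1() * c)                    # (b + 1) * c
a = thunk(lambda: product() - inner2())                  # outermost subtraction
print(a())   # the top-level demand for a triggers the nested demands: 9 - 9 = 0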
."t-fuclrin-t.' .'H0dt'f I.’_'o.rr.|'m-1' Firm‘ fconrmf-dri ten) Dalaflan' |"da.t.r.t-drr'ven) Re'dr.rc.rion {J£'rru.rnuLd'ri\=e'n}
(Courtesy olwali, Lowrie, and Li; reprinI.e=d with p-eimission from Computersfor A‘rnfficiol intelligence Pmcersing edited
by Wah and Ra.maJ:noorthy, Wiley and Sons. l.uc., 1990]
In this book, we study mostly control-flow parallel computers. But dataflow and multithreaded architectures will be further studied in Chapter 9. Dataflow or hybrid von Neumann and dataflow machines offer design alternatives; stream processing (see Chapter 13) can be considered an example.
As far as innovative computer architecture is concerned, the dataflow or hybrid models cannot be ignored. Both the Electrotechnical Laboratory (ETL) in Japan and the Massachusetts Institute of Technology have paid attention to these approaches. The book edited by Gaudiot and Bic (1991) provides details of some development on dataflow computers in that period.
Node Degree and Network Diameter  The number of edges (links or channels) incident on a node is called the node degree d. In the case of unidirectional channels, the number of channels into a node is the in degree, and that out of a node is the out degree. Then the node degree is the sum of the two. The node degree reflects the number of I/O ports required per node, and thus the cost of a node. Therefore, the node degree should be kept a (small) constant, in order to reduce cost. A constant node degree helps to achieve modularity in building blocks for scalable systems.
The diameter D of a network is the maximum shortest path between any two nodes. The path length is measured by the number of links traversed. The network diameter indicates the maximum number of distinct hops between any two nodes, thus providing a figure of communication merit for the network. Therefore, the network diameter should be as small as possible from a communication point of view.
Bisection Width  When a given network is cut into two equal halves, the minimum number of edges (channels) along the cut is called the channel bisection width b. In the case of a communication network, each edge may correspond to a channel with w bit wires. Then the wire bisection width is B = bw. This parameter B reflects the wiring density of a network. When B is fixed, the channel width (in bits) is w = B/b. Thus the bisection width provides a good indicator of the maximum communication bandwidth along the bisection of a network.
Another quantitative parameter is the wire length (or channel length) between nodes. This may affect the signal latency, clock skewing, or power requirements. We label a network symmetric if the topology is the same looking from any node. Symmetric networks are easier to implement or to program. Whether the nodes are homogeneous, the channels are buffered, or some of the nodes are switches, are some other useful properties for characterizing the structure of a network.
Data-Routing Functions  A data-routing network is used for inter-PE data exchange. This routing network can be static, such as the hypercube routing network used in the TMC/CM-2, or dynamic, such as the multistage
network used in the IBM GF11. In the case of a multicomputer network, the data routing is achieved through message passing. Hardware routers are used to route messages among multiple computer nodes.
We specify below some primitive data-routing functions implementable on an inter-PE routing network. The versatility of a routing network will reduce the time needed for data exchange and thus can significantly improve the system performance.
Commonly seen data-routing functions among the PEs include shifting, rotation, permutation (one-to-one), broadcast (one-to-all), multicast (one-to-many), shuffle, exchange, etc. These routing functions can be implemented on ring, mesh, hypercube, or multistage networks.
Permutations  For n objects, there are n! permutations by which the n objects can be reordered. The set of all permutations forms a permutation group with respect to the composition operation. One can use cycle notation to specify a permutation function.
For example, the permutation π = (a, b, c)(d, e) stands for the bijection mapping a → b, b → c, c → a, d → e, and e → d in a circular fashion. The cycle (a, b, c) has a period of 3, and the cycle (d, e) a period of 2. Combining the two cycles, the permutation π has a period of 2 × 3 = 6. If one applies the permutation π six times, the identity mapping I = (a), (b), (c), (d), (e) is obtained.
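The period computation generalizes: the period of a permutation is the least common multiple of its cycle lengths, as the short Python sketch below (with the cycles of π above as input) illustrates.

from math import gcd
from functools import reduce

def permutation_period(cycles):
    # The period of a permutation is the lcm of its cycle lengths.
    lengths = [len(c) for c in cycles]
    return reduce(lambda x, y: x * y // gcd(x, y), lengths, 1)

print(permutation_period([("a", "b", "c"), ("d", "e")]))   # cycles of length 3 and 2 give lcm = 6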
One can use a crossbar switch to implement the permutation in connecting n PEs among themselves. Multistage networks can implement some of the permutations in one or multiple passes through the network. Permutations can also be implemented with shifting or broadcast operations. The permutation capability of a network is often used to indicate the data-routing capability. When n is large, the permutation speed often dominates the performance of a data-routing network.
Perfect Shuffle and Exchange  Perfect shuffle is a special permutation function suggested by Harold Stone (1971) for parallel processing applications. The mapping corresponding to a perfect shuffle is shown in Fig. 2.14a. Its inverse is shown on the right-hand side (Fig. 2.14b).
Fig. 2.14 Perfect shuffle and its inverse mapping over eight objects (Courtesy of H. Stone; reprinted with permission from IEEE Trans. Computers, 1971)
In general, to shuffle n = 2^k objects evenly, one can express each object in the domain by a k-bit binary number x = (x_k-1, ..., x_1, x_0). The perfect shuffle maps x to y, where y = (x_k-2, ..., x_1, x_0, x_k-1) is obtained from x by shifting 1 bit to the left and wrapping around the most significant to the least significant position.
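The bit-rotation definition can be checked directly; a brief Python sketch for n = 2^k objects:

def perfect_shuffle(x, k):
    # Rotate the k-bit address of x one position to the left:
    # the most significant bit wraps around to the least significant position.
    msb = (x >> (k - 1)) & 1
    return ((x << 1) | msb) & ((1 << k) - 1)

# For eight objects (k = 3) this gives the perfect shuffle of Fig. 2.14a:
print([perfect_shuffle(x, 3) for x in range(8)])   # [0, 2, 4, 6, 1, 3, 5, 7]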
Hypercube Routing Functions  A three-dimensional binary cube network is shown in Fig. 2.15. Three routing functions are defined by the three bits in the node address. For example, one can exchange the data between adjacent nodes which differ in the least significant bit C0, as shown in Fig. 2.15b.
Similarly, two other routing patterns can be obtained by checking the middle bit C1 (Fig. 2.15c) and the most significant bit C2 (Fig. 2.15d), respectively. In general, an n-dimensional hypercube has n routing functions, defined by each bit of the n-bit address. These data exchange functions can be used in routing messages in a hypercube multicomputer.
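Each routing function simply complements one address bit, which a short Python sketch makes explicit (the function name is illustrative):

def cube_route(node, i):
    # The i-th hypercube routing function exchanges data between nodes
    # whose addresses differ only in bit C_i (complement bit i).
    return node ^ (1 << i)

# Partner nodes in a 3-cube under the C0, C1 and C2 routing functions:
for i in range(3):
    print(i, [(n, cube_route(n, i)) for n in range(8)])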
Fig. 2.15 Three routing functions defined by a binary 3-cube: (a) a 3-cube with nodes denoted as C2C1C0 in binary; (b) routing by least significant bit, C0; (c) routing by middle bit, C1; (d) routing by most significant bit, C2
Broadcast and Multicast  Broadcast is a one-to-all mapping. This can be easily achieved in an SIMD computer using a broadcast bus extending from the array controller to all PEs. A message-passing multicomputer also has mechanisms to broadcast messages. Multicast corresponds to a mapping from one PE to other PEs (one to many).
Broadcast is often treated as a global operation in a multicomputer. Multicast has to be implemented with matching of destination codes in the network.
(1) Functionality: This refers to how the network supports data routing, interrupt handling, synchronization, request/message combining, and coherence.
(2) Network latency: This refers to the worst-case time delay for a unit message to be transferred through the network.
(3) Bandwidth: This refers to the maximum data transfer rate, in terms of Mbytes/s or Gbytes/s, transmitted through the network.
(4) Hardware complexity: This refers to implementation costs such as those for wires, switches, connectors, arbitration, and interface logic.
(5) Scalability: This refers to the ability of a network to be modularly expandable with a scalable performance with increasing machine resources.
Linear Array  This is a one-dimensional network in which N nodes are connected by N − 1 links in a line (Fig. 2.16a). Internal nodes have degree 2, and the terminal nodes have degree 1. The diameter is N − 1, which is rather long for large N. The bisection width is b = 1. Linear arrays are the simplest connection topology. The structure is not symmetric and poses a communication inefficiency when N becomes very large.
For N = 2, it is clearly simple and economical to implement a linear array. As the diameter increases linearly with respect to N, it should not be used for large N. It should be noted that a linear array is very different from a bus, which is time-shared through switching among the many nodes attached to it. A linear array allows concurrent use of different sections (channels) of the structure by different source and destination pairs.
Ring and Chordal Ring  A ring is obtained by connecting the two terminal nodes of a linear array with one extra link (Fig. 2.16b). A ring can be unidirectional or bidirectional. It is symmetric with a constant node degree of 2. The diameter is ⌊N/2⌋ for a bidirectional ring, and N for a unidirectional ring.
The IBM token ring had this topology, in which messages circulate along the ring until they reach the destination with a matching ID. Pipelined or packet-switched rings have been implemented in the CDC Cyberplus multiprocessor (1985) and in the KSR-1 computer system (1992) for interprocessor communications.
By increasing the node degree from 2 to 3 or 4, we obtain two chordal rings as shown in Figs. 2.16c and 2.16d, respectively. One and two extra links are added to produce the two chordal rings, respectively. In general, the more links added, the higher the node degree and the shorter the network diameter.
Comparing the 16-node ring (Fig. 2.16b) with the two chordal rings (Figs. 2.16c and 2.16d), the network diameter drops from 8 to 5 and to 3, respectively. In the extreme, the completely connected network in Fig. 2.16f has a node degree of 15 with the shortest possible diameter of 1.
Barrel Shifter  As shown in Fig. 2.16e for a network of N = 16 nodes, the barrel shifter is obtained from the ring by adding extra links from each node to those nodes having a distance equal to an integer power of 2. This implies that node i is connected to node j if |j − i| = 2^r for some r = 0, 1, 2, ..., n − 1, and the network size is N = 2^n. Such a barrel shifter has a node degree of d = 2n − 1 and a diameter D = n/2.
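The connectivity rule |j − i| = 2^r translates directly into code; the Python sketch below (helper name illustrative) lists the neighbours of a node and confirms the degree d = 2n − 1 quoted above for N = 16.

def barrel_shifter_neighbors(i, n):
    # Node i connects to nodes at distance 2^r (mod N) in both directions, N = 2^n.
    N = 1 << n
    nbrs = set()
    for r in range(n):
        nbrs.add((i + (1 << r)) % N)
        nbrs.add((i - (1 << r)) % N)
    return sorted(nbrs)

print(barrel_shifter_neighbors(0, 4))        # the 7 distinct neighbours of node 0 for N = 16
print(len(barrel_shifter_neighbors(0, 4)))   # node degree d = 2n - 1 = 7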
Fig. 2.16 Linear array, ring, chordal rings of degrees 3 and 4, barrel shifter, and completely connected network: (a) linear array; (b) ring; (c) chordal ring of degree 3; (d) chordal ring of degree 4; (e) barrel shifter; (f) completely connected network
Obviously, the connectivity in the barrel shifter is increased over that of any chordal ring of lower node degree. For N = 16, the barrel shifter has a node degree of 7 with a diameter of 2. But the barrel shifter complexity is still much lower than that of the completely connected network (Fig. 2.16f).
Tree and Star  A binary tree of 31 nodes in five levels is shown in Fig. 2.17a. In general, a k-level, completely balanced binary tree should have N = 2^k − 1 nodes. The maximum node degree is 3 and the diameter is 2(k − 1). With a constant node degree, the binary tree is a scalable architecture. However, the diameter is rather long.
The star is a two-level tree with a high node degree at the central node of d = N − 1 (Fig. 2.17b) and a small constant diameter of 2. A DADO multiprocessor was built at Columbia University (1987) with a 10-level binary tree of 1023 nodes. The star architecture has been used in systems with a centralized supervisor node.
Fat Tree  The conventional tree structure used in computer science can be modified to become the fat tree, as introduced by Leiserson in 1985. A binary fat tree is shown in Fig. 2.17c. The channel width of a fat tree increases as we ascend from leaves to the root. The fat tree is more like a real tree in that branches get thicker toward the root.
Fig. 2.17 (a) Binary tree; (b) star; (c) binary fat tree
One of the major problems in using the conventional binary tree is the bottleneck problem toward the root, since the traffic toward the root becomes heavier. The fat tree has been proposed to alleviate the problem. The idea of a fat tree was applied in the Connection Machine CM-5, to be studied in Chapter 8. The idea of binary fat trees can also be extended to multiway fat trees.
Mesh and Torus  A 3 × 3 example mesh network is shown in Fig. 2.18a. The mesh is a frequently used architecture which has been implemented in the Illiac IV, MPP, DAP, and Intel Paragon with variations.
In general, a k-dimensional mesh with N = n^k nodes has an interior node degree of 2k and the network diameter is k(n − 1). Note that the pure mesh as shown in Fig. 2.18a is not symmetric. The node degrees at the boundary and corner nodes are 3 or 2.
Figure 2.18b shows a variation of the mesh by allowing wraparound connections. The Illiac IV assumed an 8 × 8 mesh with a constant node degree of 4 and a diameter of 7. The Illiac mesh is topologically equivalent to a chordal ring of degree 4 as shown in Fig. 2.16d for an N = 9 = 3 × 3 configuration.
In general, an n × n Illiac mesh should have a diameter of d = n − 1, which is only half of the diameter of a pure mesh. The torus shown in Fig. 2.18c can be viewed as another variant of the mesh with an even shorter diameter. This topology combines the ring and mesh and extends to higher dimensions.
The torus has ring connections along each row and along each column of the array. In general, an n × n binary torus has a node degree of 4 and a diameter of 2⌊n/2⌋. The torus is a symmetric topology. The added wraparound connections help reduce the diameter by one-half from that of the mesh.
Systolic Arrays  This is a class of multidimensional pipelined array architectures designed for implementing fixed algorithms. What is shown in Fig. 2.18d is a systolic array specially designed for performing matrix multiplication. The interior node degree is 6 in this example.
In general, static systolic arrays are pipelined with multidirectional flow of data streams. The commercial machine Intel iWarp system (Annaratone et al., 1986) was designed with a systolic architecture. The systolic array has become a popular research area ever since its introduction by Kung and Leiserson in 1978.
Fig. 2.18 (a) Mesh; (b) Illiac mesh; (c) torus; (d) systolic array
With fixed interconnection and synchronous operation, a systolic array matches the communication structure of the algorithm. For special applications like signal/image processing, systolic arrays may offer a better performance/cost ratio. However, the structure has limited applicability and can be very difficult to program. Since this book emphasizes general-purpose computing, we will not study systolic arrays further. Interested readers may refer to the book by S.Y. Kung (1988) for using systolic and wavefront architectures in building VLSI array processors.
Hypercubes  This is a binary n-cube architecture which has been implemented in the iPSC, nCUBE, and CM-2 systems. In general, an n-cube consists of N = 2^n nodes spanning along n dimensions, with two nodes per dimension. A 3-cube with 8 nodes is shown in Fig. 2.19a.
A 4-cube can be formed by interconnecting the corresponding nodes of two 3-cubes, as illustrated in Fig. 2.19b. The node degree of an n-cube equals n and so does the network diameter. In fact, the node degree increases linearly with respect to the dimension, making it difficult to consider the hypercube a scalable architecture.
The binary hypercube has been a very popular architecture for research and development in the 1980s. The Intel iPSC/1, iPSC/2, and nCUBE machines were all built with the hypercube architecture. The architecture has dense connections. Many other architectures, such as binary trees, meshes, etc., can be embedded in the hypercube.
With poor scalability and difficulty in packaging higher-dimensional hypercubes, the hypercube architecture was gradually replaced by other architectures. For example, the CM-5 employed the fat tree over the hypercube implemented in the CM-2. The Intel Paragon employed a two-dimensional mesh over its hypercube predecessors. Topological equivalence has been established among a number of network architectures. The bottom line for an architecture to survive in future systems is packaging efficiency and scalability to allow modular growth.
Cube-Connected Cycles  This architecture is modified from the hypercube. As illustrated in Fig. 2.19c, a 3-cube is modified to form the 3-cube-connected cycles (CCC). The idea is to cut off the corner nodes (vertices) of the 3-cube and replace each by a ring (cycle) of 3 nodes.
In general, one can construct k-cube-connected cycles from a k-cube with n = 2^k cycles, as illustrated in Fig. 2.19d. The idea is to replace each vertex of the k-dimensional hypercube by a ring of k nodes. A k-cube can thus be transformed to a k-CCC with k × 2^k nodes.
The 3-CCC shown in Fig. 2.19c has a diameter of 6, twice that of the original 3-cube. In general, the network diameter of a k-CCC equals 2k. The major improvement of a CCC lies in its constant node degree of 3, which is independent of the dimension of the underlying hypercube.
Fig. 2.19 (a) A 3-cube; (b) a 4-cube formed by interconnecting two 3-cubes; (c) 3-cube-connected cycles; (d) replacing each node of a k-cube by a ring (cycle) of k nodes to form the k-cube-connected cycles
Consider a hypercube with N = 2^n nodes. A CCC with an equal number of N nodes must be built from a lower-dimensional k-cube such that 2^n = k × 2^k for some k < n.
For example, a 64-node CCC can be formed by replacing the corner nodes of a 4-cube with cycles of four nodes, corresponding to the case n = 6 and k = 4. The CCC has a diameter of 2k = 8, longer than 6 in a 6-cube. But the CCC has a node degree of 3, smaller than the node degree of 6 in a 6-cube. In this sense, the CCC is a better architecture for building scalable systems if latency can be tolerated in some way.
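The comparison can be tabulated quickly; the Python sketch below (helper name illustrative, using the k-CCC diameter formula 2k quoted above) reproduces the 64-node figures for the 6-cube and the 4-CCC.

def ccc_from_hypercube(n):
    # Find k < n with k * 2^k = 2^n, then compare the CCC against the n-cube.
    for k in range(1, n):
        if k * 2 ** k == 2 ** n:
            return {"nodes": 2 ** n,
                    "hypercube": {"degree": n, "diameter": n},
                    "ccc":       {"degree": 3, "diameter": 2 * k}}
    return None

print(ccc_from_hypercube(6))
# 64 nodes: the 6-cube has degree 6 and diameter 6; the 4-CCC has degree 3 and diameter 8.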
k-ary n-Cube Networks  Rings, meshes, tori, binary n-cubes (hypercubes), and Omega networks are topologically isomorphic to a family of k-ary n-cube networks. Figure 2.20 shows a 4-ary 3-cube network.
Fig. 2.20 A 4-ary 3-cube network
Fig. 2.21 Folded connections to equalize the wire length in a torus network (Courtesy of W. Dally; reprinted with permission from IEEE Trans. Computers, June 1990): (a) traditional torus (a 4-ary 2-cube); (b) a torus with folded connections
William Dally (1990) has revealed a number of interesting properties of k-ary n-cube networks. The cost of such a network is dominated by the amount of wire, rather than by the number of switches required. Under the assumption of constant wire bisection, low-dimensional networks with wide channels provide lower latency, less contention, and higher hot-spot throughput than higher-dimensional networks with narrow channels.
Network Throughput  The network throughput is defined as the total number of messages the network can handle per unit time. One method of estimating throughput is to calculate the capacity of a network, the total number of messages that can be in the network at once. Typically, the maximum throughput of a network is some fraction of its capacity.
A hot spot is a pair of nodes that accounts for a disproportionately large portion of the total network traffic. Hot-spot traffic can degrade performance of the entire network by causing congestion. The hot-spot throughput of a network is the maximum rate at which messages can be sent from one specific node Pi to another specific node Pj.
Low-dimensional networks operate better under nonuniform loads because they allow better resource sharing. In a high-dimensional network, wires are assigned to particular dimensions and cannot be shared between dimensions. For example, in a binary n-cube, it is possible for a wire to be saturated while a physically adjacent wire assigned to a different dimension remains idle. In a torus, all physically adjacent wires are combined into a single channel which is shared by all messages.
As a rule of thumb, minimum network latency is achieved when the network radix k and dimension n are chosen to make the components of communication latency due to distance D (the number of hops between nodes) and the message aspect ratio L/W (message length L normalized to the channel width W) approximately equal.
Low-dimensional networks reduce contention because having a few high-bandwidth channels results in more resource sharing and thus a better queueing performance than having many low-bandwidth channels. While network capacity and worst-case blocking latency are independent of dimension, low-dimensional networks have a higher maximum throughput and lower average block latency than do high-dimensional networks.
Both fat tree networks and k-ary n-cube networks are considered universal in the sense that they can efficiently simulate any other network of the same volume. Dally claimed that any point-to-point network can be embedded in a 3-D mesh with no more than a constant increase in wiring length.
Summary of Static Networks  In Table 2.2, we summarize the important characteristics of static connection networks. The node degrees of most networks are less than 4, which is rather desirable. For example, the INMOS Transputer chip was a compute-communication microprocessor with four ports for communication. See also the TILE64 system-on-a-chip described in Chapter 13.
Switch Modules  An a × b switch module has a inputs and b outputs. A binary switch corresponds to a 2 × 2 switch module in which a = b = 2. In theory, a and b do not have to be equal. However, in practice, a and b are often chosen as integer powers of 2; that is, a = b = 2^k for some k ≥ 1.
Table 2.3 lists several commonly used switch module sizes: 2 × 2, 4 × 4, and 8 × 8. Each input can be connected to one or more of the outputs. However, conflicts must be avoided at the output terminals. In other words, one-to-one and one-to-many mappings are allowed, but many-to-one mappings are not allowed due to conflicts at the output terminal.
When only one-to-one mappings (permutations) are allowed, we call the module an n × n crossbar switch. For example, a 2 × 2 crossbar switch can connect two possible patterns: straight or crossover. In general, an n × n crossbar can achieve n! permutations. The numbers of legitimate connection patterns for switch modules of various sizes are listed in Table 2.3.
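One plausible way to count these patterns, stated here as an assumption consistent with the 2 × 2 case (four states in Fig. 2.24a-d, of which two are permutations), is that each of the n outputs independently selects one of the n inputs, giving n^n legitimate states and n! permutations. A short Python sketch of that counting rule:

from math import factorial

def switch_module_counts(n):
    # Assumption: every output connects to exactly one input, so each of the
    # n outputs independently selects one of n inputs, giving n^n legitimate
    # states; the one-to-one settings among them are the n! permutations.
    return n ** n, factorial(n)

for n in (2, 4, 8):
    states, perms = switch_module_counts(n)
    print(n, "x", n, ":", states, "legitimate states,", perms, "permutations")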
Multistage Interconnection Networks  MINs have been used in both MIMD and SIMD computers. A generalized multistage network is illustrated in Fig. 2.23. A number of a × b switches are used in each stage. Fixed interstage connections are used between the switches in adjacent stages. The switches can be dynamically set to establish the desired connections between the inputs and outputs.
Different classes of MINs differ in the switch modules used and in the kind of interstage connection (ISC) patterns used. The simplest switch module would be the 2 × 2 switch (a = b = 2 in Fig. 2.23). The ISC patterns often used include perfect shuffle, butterfly, multiway shuffle, crossbar, cube connection, etc. Some of these ISC patterns are shown below with examples.
Fig. 2.23 A generalized structure of a multistage interconnection network (MIN) built with a × b switch modules and interstage connection patterns ISC1, ISC2, ..., ISCn
Omega Network  Figures 2.24a to 2.24d show four possible connections of 2 × 2 switches used in constructing the Omega network. A 16 × 16 Omega network is shown in Fig. 2.24e. Four stages of 2 × 2 switches are needed. There are 16 inputs on the left and 16 outputs on the right. The ISC pattern is the perfect shuffle over 16 objects.
In general, an n-input Omega network requires log2 n stages of 2 × 2 switches. Each stage requires n/2 switch modules. In total, the network uses (n/2) log2 n switches. Each switch module is individually controlled.
Various combinations of the switch states implement different permutations, broadcast, or other connections from the inputs to the outputs. The interconnection capabilities of the Omega and other networks will be further studied in Chapter 7.
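Destination-tag self-routing through such a network can be sketched compactly: at each of the log2 n stages the current line number is perfect-shuffled and the switch exit is selected by the next destination bit, most significant first. The Python sketch below is a generic illustration of this self-routing idea (function name illustrative), not a description of any particular machine.

def omega_route(src, dst, k):
    # Self-routing in a 2^k-input Omega network built from 2 x 2 switches:
    # shuffle the current line number, then exit on the switch output selected
    # by the next destination bit (MSB first). Returns the line after each stage.
    n_mask = (1 << k) - 1
    line, path = src, []
    for stage in range(k):
        line = ((line << 1) | (line >> (k - 1))) & n_mask     # perfect shuffle
        bit = (dst >> (k - 1 - stage)) & 1                    # destination tag bit
        line = (line & ~1) | bit                              # straight or crossover exit
        path.append(line)
    return path

print(omega_route(5, 12, 4))   # route from input 5 to output 12 in a 16 x 16 Omega network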
Baseline Network  Wu and Feng (1980) have studied the relationship among a class of multistage interconnection networks. A Baseline network can be generated recursively as shown in Fig. 2.25a.
The first stage contains one N × N block, and the second stage contains two (N/2) × (N/2) subblocks, labeled C0 and C1. The construction process can be recursively applied to the subblocks until the N/2 subblocks of size 2 × 2 are reached.
The small boxes and the ultimate building blocks of the subblocks are the 2 × 2 switches, each with two legitimate connection states: straight and crossover between the two inputs and two outputs. A 16 × 16 Baseline network is shown in Fig. 2.25b. In Problem 2.15, readers are asked to prove the topological equivalence between the Baseline and other networks.
Fig. 2.24 The use of 2 × 2 switches and perfect shuffle as an interstage connection pattern to construct a 16 × 16 Omega network (Courtesy of Duncan Lawrie; reprinted with permission from IEEE Trans. Computers, Dec. 1975): (a) straight; (b) crossover; (c) upper broadcast; (d) lower broadcast; (e) a 16 × 16 Omega network
Fig. 2.25 Recursive construction of a Baseline network (Courtesy of Wu and Feng; reprinted with permission from IEEE Trans. Computers, August 1980): (a) recursive construction; (b) a 16 × 16 Baseline network
Crossbar Network  The highest bandwidth and interconnection capability are provided by crossbar networks. A crossbar network can be visualized as a single-stage switch network. Like a telephone switchboard, the crosspoint switches provide dynamic connections between source, destination pairs. Each crosspoint switch can provide a dedicated connection path between a pair. The switch can be set on or off dynamically upon program demand. Two types of crossbar networks are illustrated in Fig. 2.26.
To build a shared-memory multiprocessor, one can use a crossbar network between the processors and memory modules (Fig. 2.26a). This is essentially a memory-access network. The pioneering C.mmp multiprocessor (Wulf and Bell, 1972) implemented a 16 × 16 crossbar network which connected 16 PDP 11 processors to 16 memory modules, each of which had a capability of 1 million words of memory cells. The 16 memory modules could be accessed by the processors in parallel.
Fig. 2.26 Two crossbar network configurations: (a) the interprocessor-memory crossbar network built in the C.mmp multiprocessor at Carnegie-Mellon University (1972); (b) the interprocessor crossbar network built in the Fujitsu VPP500 vector parallel processor (1992)
Note that each memory module can satisfy only one processor request at a time. When multiple requests arrive at the same memory module simultaneously, the crossbar must resolve the conflicts. The behavior of each crossbar switch is very similar to that of a bus. However, each processor can generate a sequence
of addresses to access multiple memory modules simultaneously. Thus, in Fig. 2.26a, only one crosspoint switch can be set on in each column. However, several crosspoint switches can be set on simultaneously in order to support parallel (or interleaved) memory accesses.
Another type of crossbar network is for interprocessor communication and is depicted in Fig. 2.26b. This large crossbar (224 × 224) was actually built in a vector parallel processor (VPP500) by Fujitsu Inc. (1992). The PEs are processors with attached memory. The CPs stand for control processors which are used to supervise the entire system operation, including the crossbar networks. In this crossbar, at one time only one crosspoint switch can be set on in each row and each column.
The interprocessor crossbar provides permutation connections among the processors. Only one-to-one connections are provided. Therefore, the n × n crossbar connects at most n source, destination pairs at a time. We will further study crossbar networks in Chapters 7 and 8.
Summary  In Table 2.4, we summarize the important features of buses, multistage networks, and crossbar switches in building dynamic networks. Obviously, the bus is the cheapest to build, but its drawback lies in the low bandwidth available to each processor.
Another problem with the bus is that it is prone to failure. Some fault-tolerant systems, like the Tandem multiprocessor for transaction processing, used dual buses to protect the system from single failures.
The crossbar switch is the most expensive one to build, due to the fact that its hardware complexity increases as n^2. However, the crossbar has the highest bandwidth and routing capability. For a small network size, it is the desired choice.
Multistage networks provide a compromise between the two extremes. The major advantage of MINs lies in their scalability with modular construction. However, the latency increases with log n, the number of stages in the network. Also, costs due to increased wiring and switching complexity are another constraint. For building MPP systems, some of the static topologies are more scalable in specific applications.
Advances in VLSI and interconnect technologies have had a major impact on multiprocessor system architecture, as we shall see in Chapter 13, and there has been a clear shift towards the use of packet-based switched-media interconnects.
Summary
In this chapter, we have focused on basic program properties which make parallelism possible and determine the amount and type of parallelism which can be exploited. With increasing degree of multiprocessing, the rate at which data must be communicated between subsystems also increases, and therefore the system interconnect architecture becomes important in determining system performance.
We started this chapter with a study of the basic conditions which must be satisfied for parallel computations to be possible. In essence, it is dependences between operations which limit the amount of parallelism which can be exploited. After all, any set of N fully independent operations can always be performed in parallel.
The three basic data dependences between operations are flow dependence, anti-dependence and output dependence. Resource dependence refers to a limitation in available hardware and/or software resources which limits the achievable degree of parallelism. Bernstein's conditions, which apply to input and output sets of processes, must be satisfied for parallel execution of processes to be possible.
Parallelism may be exploited at the level of software or hardware. For software parallelism, program design and the program development and runtime environments play the key role. For hardware parallelism, availability of the right mix of hardware resources plays the key role. Program partitioning, grain size, communication latency and scheduling are important concepts; scheduling may be static or dynamic.
Program flow may be control-driven, data-driven or demand-driven. Of these, control-driven program flow, as exemplified in the von Neumann model, is the only one that has proved commercially successful over the last six decades. Other program flow models have been tried out on research-oriented systems, but in general these models have not found acceptance on a broader basis.
When computer systems consist of multiple processors, and several other subsystems such as memory modules and network adapters, the system interconnect architecture plays a very important role in determining final system performance. We studied basic network properties, including topology and routing functionality. Network performance can be characterized in terms of bandwidth, latency, functionality and scalability.
We studied static network topologies such as the linear array, ring, tree, fat tree, torus and hypercube; we also looked at dynamic network topologies which involve switching and/or routing of data. With higher degree of multiprocessing, bus-based systems are unable to meet aggregate bandwidth requirements of the system; multistage interconnection networks and crossbar switches can provide better alternatives.
Exercises
Problem 2.1  Define the following terms related to parallelism and dependence relations:
(a) Computational granularity.
(b) Communication latency.
(c) Flow dependence.
(d) Antidependence.
(e) Output dependence.
(f) I/O dependence.
(g) Control dependence.
(h) Resource dependence.
(i) Bernstein conditions.
(j) Degree of parallelism.
Problem 2.2  Define the following terms for various system interconnect architectures:
(a) Node degree.
(b) Network diameter.
(c) Bisection bandwidth.
(d) Static connection networks.
(e) Dynamic connection networks.
(f) Nonblocking networks.
(g) Multicast and broadcast.
(h) Mesh versus torus.
(i) Symmetry in networks.
(j) Multistage networks.
(k) Crossbar networks.
(l) Digital buses.
Problem 2.3  Answer the following questions on program flow mechanisms and computer models:
(a) Compare control-flow, dataflow, and reduction computers in terms of the program flow mechanism used.
(b) Comment on the advantages and disadvantages in control complexity, potential for parallelism, and cost-effectiveness of the above computer models.
(c) What are the differences between string reduction and graph reduction machines?
Problem 2.4  Perform a data dependence analysis on each of the following Fortran program fragments. Show the dependence graphs among the statements with justification.
(a) S1: A = B + D
    S2: C = A × 3
    S3: A = A + C
    S4: E = A / 2
(b) S1: X = SIN(Y)
    S2: Z = X + W
    S3: Y = -2.5 × W
    S4: X = COS(Z)
(c) Determine the data dependences in the same and adjacent iterations of the following Do-loop.
    Do 10 I = 1, N
    S1: A(I + 1) = B(I - 1) + C(I)
    S2: B(I) = A(I) × K
    S3: C(I) = B(I) - 1
    10 Continue
Problem 2.5  Analyze the data dependences among the following statements in a given program:
S1: Load R1, 1024        /R1 <- 1024/
S2: Load R2, M(10)       /R2 <- Memory(10)/
S3: Add R1, R2           /R1 <- (R1) + (R2)/
S4: Store M(1024), R1    /Memory(1024) <- (R1)/
S5: Store M((R2)), 1024  /Memory(64) <- 1024/
where (Ri) means the content of register Ri and Memory(10) contains 64 initially.
(a) Draw a dependence graph to show all the dependences.
(b) Are there any resource dependences if only
one copy of each functional unit is available in the CPU?
(c) Repeat the above for the following program statements:
S1: Load R1, M(100)   /R1 <- Memory(100)/
S2: Move R2, R1       /R2 <- (R1)/
S3: Inc R1            /R1 <- (R1) + 1/
S4: Add R2, R1        /R2 <- (R2) + (R1)/
S5: Store M(100), R1  /Memory(100) <- (R1)/
Problem 2.6  A sequential program consists of the following five statements, S1 through S5. Considering each statement as a separate process, clearly identify input set Ii and output set Oi of each process. Restructure the program using Bernstein's conditions in order to achieve maximum parallelism between processes. If any pair of processes cannot be executed concurrently, specify which of the three conditions is not satisfied.
S1: A = B + C
S2: B = A × D
S3: S = 0
S4: Do I = A, 100
      S = S + X(I)
    End Do
S5: If (S .GT. 1000) C = C × 2
Problem 2.7  Consider the execution of the following code segment consisting of seven statements. Use Bernstein's conditions to detect the maximum parallelism embedded in this code. Justify the portions that can be executed in parallel and the remaining portions that must be executed sequentially. Rewrite the code using parallel constructs such as Cobegin and Coend. No variable substitution is allowed. All statements can be executed in parallel if they are declared within the same block of a (Cobegin, Coend) pair.
S1: A = B + C
S2: C = D + E
S3: F = G + E
S4: H = A + F
S5: M = G + C
S6: A = L + C
S7: A = E + A
Problem 2.8  According to program order, the following six arithmetic expressions need to be executed in minimum time. Assume that all are integer operands already loaded into working registers. No memory reference is needed for the operand fetch. Also, all intermediate or final results are written back to working registers without conflicts.
P1: X <- (A + B) × (A - B)
P2: Y <- (C + D) / (C - D)
P3: Z <- X + Y
P4: A <- E × F
P5: Y <- E - Z
P6: B <- (X - F) × A
(a) Use the minimum number of working registers to rewrite the above HLL program into a minimum-length assembly language code using arithmetic opcodes add, subtract, multiply, and divide exclusively. Assume a fixed instruction format with three register fields: two for sources and one for the destination.
(b) Perform a flow analysis of the assembly code obtained in part (a) to reveal all data dependences with a dependence graph.
(c) The CPU is assumed to have two add units, one multiply unit, and one divide unit. Work out an optimal schedule to execute the assembly code in minimum time, assuming 1 cycle for the add unit, 3 cycles for the multiply unit, and 18 cycles for the divide unit to complete the execution of one instruction. Ignore all overhead caused by instruction fetch, decode, and writeback. No pipelining is assumed here.
Problem 2.9  Consider the following assembly language code. Exploit the maximum degree of parallelism among the 16 instructions, assuming no resource conflicts and multiple functional units are available simultaneously. For simplicity, no pipelining
is assumed. All instructions take one machine cycle to execute. Ignore all other overhead.
1: Load R1, A         /R1 <- Mem(A)/
2: Load R2, B         /R2 <- Mem(B)/
3: Mul R3, R1, R2     /R3 <- (R1) × (R2)/
4: Load R4, D         /R4 <- Mem(D)/
5: Mul R5, R1, R4     /R5 <- (R1) × (R4)/
6: Add R6, R3, R5     /R6 <- (R3) + (R5)/
7: Store X, R6        /Mem(X) <- (R6)/
8: Load R7, C         /R7 <- Mem(C)/
9: Mul R8, R7, R4     /R8 <- (R7) × (R4)/
10: Load R9, E        /R9 <- Mem(E)/
11: Add R10, R8, R9   /R10 <- (R8) + (R9)/
12: Store Y, R10      /Mem(Y) <- (R10)/
13: Add R11, R6, R10  /R11 <- (R6) + (R10)/
14: Store U, R11      /Mem(U) <- (R11)/
15: Sub R12, R6, R10  /R12 <- (R6) - (R10)/
16: Store V, R12      /Mem(V) <- (R12)/
(a) Draw a program graph with 16 nodes to show the flow relationships among the 16 instructions.
(b) Consider the use of a three-issue superscalar processor to execute this program fragment in minimum time. The processor can issue one memory-access instruction (Load or Store but not both), one Add/Sub instruction, and one Mul (multiply) instruction per cycle. The Add unit, Load/Store unit, and Multiply unit can be used simultaneously if there is no data dependence.
Problem 2.10  Repeat part (b) of Problem 2.9 on a dual-processor system with shared memory. Assume that the same superscalar processors are used and that all instructions take one cycle to execute.
(a) Partition the given program into two balanced halves. You may want to insert some load or store instructions to pass intermediate results generated by the two processors to each other. Show the divided program flow graph with the final outputs U and V generated by the two processors separately.
(b) Work out an optimal schedule for parallel execution of the above divided program by the two processors in minimum time.
Problem 2.11  You are asked to design a direct network for a multicomputer with 64 nodes using a three-dimensional torus, a six-dimensional binary hypercube, and cube-connected cycles (CCC) with a minimum diameter. The following questions are related to the relative merits of these network topologies:
(a) Let d be the node degree, D the network diameter, and l the total number of links in a network. Suppose the quality of a network is measured by (d × D × l)^(-1). Rank the three architectures according to this quality measure.
(b) A mean internode distance is defined as the average number of hops (links) along the shortest path for a message to travel from one node to another. The average is calculated for all (source, destination) pairs. Order the three architectures based on their mean internode distances, assuming that the probability that a node will send a message to all other nodes with distance i is (D - i + 1)/Σ_{k=1}^{D} k, where D is the network diameter.
Problem 2.12  Consider an Illiac mesh (8 × 8), a binary hypercube, and a barrel shifter, all with 64 nodes labeled N0, N1, ..., N63. All network links are bidirectional.
(a) List all the nodes reachable from node N0 in exactly three steps for each of the three networks.
(b) Indicate in each case the tightest upper bound on the minimum number of routing steps needed to send data from any node Ni to another node Nj.
(c) Repeat part (b) for a larger network with 1024 nodes.
Problem 2.13  Compare buses, crossbar switches, and multistage networks for building a multiprocessor system with n processors and m shared-memory
modules. Assume a word length of w bits and that 2 × 2 switches are used in building the multistage networks. The comparison study is carried out separately in each of the following four categories:
(a) Hardware complexities such as switching, arbitration, wires, connector, or cable requirements.
(b) Minimum latency in unit data transfer between the processor and memory module.
(c) Bandwidth range available to each processor.
(d) Communication capabilities such as permutations, data broadcast, blocking handling, etc.
Problem 2.14  Answer the following questions related to multistage networks:
(a) How many legitimate states are there in a 4 × 4 switch module, including both broadcast and permutations? Justify your answer with reasoning.
(b) Construct a 64-input Omega network using 4 × 4 switch modules in multiple stages. How many permutations can be implemented directly in a single pass through the network without blocking?
(c) What is the percentage of one-pass permutations compared with the total number of permutations achievable in one or more passes through the network?
Problem 2.15  Topologically equivalent networks are those whose graph representations are isomorphic with the same interconnection capabilities. Prove the topological equivalence among the Omega, Flip, and Baseline networks.
(a) Prove that the Omega network (Fig. 2.24) is topologically equivalent to the Baseline network (Fig. 2.25b).
(b) The Flip network (Fig. 2.27) is constructed using inverse perfect shuffle (Fig. 2.14b) for interstage connections. Prove that the Flip network is topologically equivalent to the Baseline network.
(c) Based on the results obtained in (a) and (b), prove the topological equivalence between the Flip network and the Omega network.
Fig. 2.27 A 16 × 16 Flip network (Courtesy of Ken Batcher; reprinted from Proc. Int. Conf. Parallel Processing, 1976)
Problem 2.16  Answer the following questions for the k-ary n-cube network:
(a) How many nodes does the network contain?
(b) What is the network diameter?
(c) What is the bisection bandwidth?
(d) What is the node degree?
(e) Explain the graph-theoretic relationship among k-ary n-cube networks and rings, meshes, tori, binary n-cubes, and Omega networks.
(f) Explain the difference between a conventional torus and a folded torus.
(g) Under the assumption of constant wire bisection, why do low-dimensional networks (tori) have lower latency and higher hot-spot throughput than high-dimensional networks (hypercubes)?
Problem 2.17  Read the paper on fat trees by Leiserson, which appeared in IEEE Trans. Computers,
pp. 892-901, Oct. 1985. Answer the following questions related to the organization and application of fat trees:
(a) Explain the advantages of using binary fat trees over conventional binary trees as a multiprocessor interconnection network.
(b) A universal fat tree is defined as a fat tree of n nodes with root capacity w, where n^{1/2} ≤ w ≤ n, and for each channel c_k at level k of the tree, the capacity is c_k = min(⌈n/2^k⌉, ⌈w/2^{2k/3}⌉). Prove that the capacities of a universal fat tree grow exponentially as we go up the tree from the leaves. The channel capacity is defined here as the number of wires in a channel.

Problem 2.18 Read the paper on k-ary n-cube networks by Dally, which appeared in IEEE Trans. Computers, June 1990, pp. 775-785. Answer the following questions related to the network properties and applications as a VLSI communication network:
(a) Prove that the bisection width B of a k-ary n-cube with w-bit wide communication channels is B(k, n) = 2wk^{n−1} = 2wN/k, where N = k^n is the network size.
(b) Prove that the hot-spot throughput of a k-ary n-cube network with deterministic routing is equal to the bandwidth of a single channel w = k − 1, under the assumption of a constant wire cost.

Problem 2.19 Network embedding is a technique to implement a network A on a network B. Explain how to perform the following network embeddings:
(a) Embed a two-dimensional torus r × r on an n-dimensional hypercube with N = 2^n nodes, where r² = 2^n.
(b) Embed the largest ring on a CCC with N = k × 2^k nodes for k ≥ 3.
(c) Embed a complete balanced binary tree with maximum height on a mesh of r × r nodes.

Problem 2.20 Read the paper on hypernets by Hwang and Ghosh, which appeared in IEEE Trans. Computers, Dec. 1989. Answer the following questions related to the network properties and applications of hypernets:
(a) Explain how hypernets integrate positive features of hypercube and tree-based topologies into one combined architecture.
(b) Prove that the average node degree of a hypernet can be maintained essentially constant when the network size is increased.
(c) Discuss the application potentials of hypernets in terms of message routing complexity, cost-effective support for global as well as localized communication, I/O capabilities, and fault tolerance.
Principles of Scalable
Performance
We study performance measures, speedup laws, and scalability principles in this chapter. Three speedup models are presented under different computing objectives and resource constraints. These include Amdahl's law (1967), Gustafson's scaled speedup (1988), and the memory-bounded speedup by Sun and Ni (1993).

The efficiency, redundancy, utilization, and quality of a parallel computation are defined, involving the interplay between architectures and algorithms. Standard performance measures and several benchmark kernels are introduced with relevant performance data.

The performance of parallel computers relies on a design that balances hardware and software. The system architects and programmers must exploit parallelism, pipelining, and networking in a balanced approach. Toward building massively parallel systems, the scalability issues must be resolved first. Fundamental concepts of scalable systems are introduced in this chapter. Case studies can be found in subsequent chapters, especially in Chapters 9 and 13.
The plot of the DOP as a function of time is called the parallelism profile of a given program. For simplicity, we concentrate on the analysis of single-program profiles. Some software tools are available to trace the parallelism profile. The profiling of multiple programs in an interleaved fashion can in theory be extended from this study.

Fluctuation of the profile during an observation period depends on the algorithmic structure, program optimization, resource utilization, and run-time conditions of a computer system. The DOP was defined under the assumption of having an unbounded number of available processors and other necessary resources. The DOP may not always be achievable on a real computer with limited resources.

When the DOP exceeds the maximum number of available processors in a system, some parallel branches must be executed in chunks sequentially. However, parallelism still exists within each chunk, limited by the machine size. The DOP may also be limited by memory and by other nonprocessor resources. We consider only the limit imposed by processors in our discussions on speedup models.
Average Parallelism In what follows, we consider a parallel computer consisting of n homogeneous processors. The maximum parallelism in a profile is m. In the ideal case, n >> m. The computing capacity Δ of a single processor is approximated by the execution rate, such as MIPS or Mflops, without considering the penalties from memory access, communication latency, or system overhead. When i processors are busy during an observation period, we have DOP = i.

The total amount of work W (instructions or computations) performed is proportional to the area under the profile curve:

W = Δ ∫_{t1}^{t2} DOP(t) dt    (3.1)

This integral is often computed in the following discrete form:

W = Δ Σ_{i=1}^{m} i·t_i    (3.2)

where t_i is the total amount of time that DOP = i and Σ_{i=1}^{m} t_i = t2 − t1 is the total elapsed time.

The average parallelism A is computed by

A = (1/(t2 − t1)) ∫_{t1}^{t2} DOP(t) dt    (3.3)

In discrete form, we have

A = (Σ_{i=1}^{m} i·t_i) / (Σ_{i=1}^{m} t_i)    (3.4)
As illustrated in Fig. 3.1, the parallelism profile of a divide-and-conquer algorithm increases from 1 to its peak value m = 8 and then decreases to 0 during the observation period (t1, t2).
Fig. 3.1 Parallelism profile of a divide-and-conquer algorithm (degree of parallelism plotted against time, with a peak DOP of 8 and the average parallelism indicated by a dashed line)
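The discrete forms in Eqs. 3.2 and 3.4 are easy to evaluate directly. The short Python sketch below computes W and A for a small, made-up parallelism profile; the profile values and the capacity Δ are illustrative assumptions, not data taken from Fig. 3.1.

# Hypothetical parallelism profile: profile[i] = t_i, the total time
# (arbitrary units) during which the degree of parallelism equals i.
profile = {1: 5.0, 2: 4.0, 3: 3.0, 4: 6.0, 8: 7.0}
delta = 1.0  # assumed computing capacity of a single processor

# Eq. 3.2: total work W = delta * sum_i(i * t_i)
W = delta * sum(i * t for i, t in profile.items())

# Eq. 3.4: average parallelism A = sum_i(i * t_i) / sum_i(t_i)
A = sum(i * t for i, t in profile.items()) / sum(profile.values())

print(f"W = {W:.1f}, average parallelism A = {A:.2f}")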
Available Parallelism There is a wide range of potential parallelism in application programs. Engineering and scientific codes exhibit a high DOP due to data parallelism. Manoj Kumar (1988) has reported that computation-intensive codes may execute 500 to 3500 arithmetic operations concurrently in each clock cycle in an idealized environment. Nicolau and Fisher (1984) reported that standard Fortran programs averaged about a factor of 90 parallelism available for very long instruction word architectures. These numbers show the optimistic side of available parallelism.

However, David Wall (1991) indicated that the limit of instruction-level parallelism is around 5, rarely exceeding 7. Butler et al. (1991) reported that when all constraints are removed, the DOP in programs may exceed 17 instructions per cycle. If the hardware is perfectly balanced, one can sustain from 2.0 to 5.8 instructions per cycle on a superscalar processor that is reasonably designed. These numbers show the pessimistic side of available parallelism.

The above measures of available parallelism show that computation that is less numeric than that in scientific codes has relatively little parallelism even when basic block boundaries are ignored. A basic block is a sequence or block of instructions in a program that has a single entry and a single exit point. While compiler optimization and algorithm redesign may increase the available parallelism in an application, limiting parallelism extraction to a basic block limits the potential instruction-level parallelism to a factor of about 2 to 5 in ordinary programs. However, the DOP may be pushed to thousands in some scientific codes when multiple processors are used to exploit parallelism beyond the boundary of basic blocks.
Asymptotic Speedup Denote the amount of work executed with DOP = i as W_i = iΔt_i, so that we can write W = Σ_{i=1}^{m} W_i. The execution time of W_i on a single processor (sequentially) is t_i(1) = W_i/Δ. The execution time of W_i on k processors is t_i(k) = W_i/(kΔ). With an infinite number of available processors, t_i(∞) = W_i/(iΔ) for 1 ≤ i ≤ m. Thus we can write the response times as

T(1) = Σ_{i=1}^{m} t_i(1) = Σ_{i=1}^{m} W_i/Δ    (3.5)

T(∞) = Σ_{i=1}^{m} t_i(∞) = Σ_{i=1}^{m} W_i/(iΔ)    (3.6)
The asymptotic speedup S_∞ is defined as the ratio of T(1) to T(∞):

S_∞ = T(1)/T(∞) = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} W_i/i)    (3.7)

Comparing Eqs. 3.4 and 3.7, we realize that S_∞ = A in the ideal case. In general, S_∞ ≤ A if communication latency and other system overhead are considered. Note that both S_∞ and A are defined under the assumption n = ∞ or n >> m.
The arithmetic mean execution rate over m benchmark programs is defined as

R_a = (1/m) Σ_{i=1}^{m} R_i    (3.8)

The expression R_a assumes equal weighting (1/m) on all m programs. If the programs are weighted with a distribution π = {f_i | i = 1, 2, ..., m}, we define a weighted arithmetic mean execution rate as follows:

R*_a = Σ_{i=1}^{m} (f_i R_i)    (3.9)

The arithmetic mean execution rate is proportional to the sum of the inverses of execution times; it is not inversely proportional to the sum of execution times. Consequently, the arithmetic mean execution rate fails to represent the real times consumed by the benchmarks when they are actually executed.
Harmonic Mean Performance With the weakness of the arithmetic mean performance measure, we need to develop a mean performance expression based on arithmetic mean execution time. In fact, T_i = 1/R_i is the mean execution time per instruction for program i. The arithmetic mean execution time per instruction is defined by

T_a = (1/m) Σ_{i=1}^{m} T_i = (1/m) Σ_{i=1}^{m} (1/R_i)    (3.10)

The harmonic mean execution rate across m benchmark programs is thus defined by the fact R_h = 1/T_a:

R_h = m / (Σ_{i=1}^{m} (1/R_i))    (3.11)

Therefore, the harmonic mean performance is indeed related to the average execution time. With a weight distribution π = {f_i | i = 1, 2, ..., m}, we can define the weighted harmonic mean execution rate as

R*_h = 1 / (Σ_{i=1}^{m} f_i/R_i)    (3.12)

The above harmonic mean performance expressions correspond to the total number of operations divided by the total time. Compared to the arithmetic mean, the harmonic mean execution rate is closer to the real performance.
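A small numerical sketch makes the contrast concrete. The MIPS rates below are invented for illustration; the point is only that the harmonic mean (Eq. 3.11), unlike the arithmetic mean (Eq. 3.8), equals total operations divided by total time when each program performs the same amount of work.

# Illustrative MIPS rates of m benchmark programs, each executing the same
# number of instructions (normalized to 1 million).
rates = [400.0, 50.0, 125.0, 200.0]
m = len(rates)

arithmetic_mean = sum(rates) / m                       # Eq. 3.8
harmonic_mean = m / sum(1.0 / r for r in rates)        # Eq. 3.11

total_instructions = m * 1.0                           # millions of instructions
total_time = sum(1.0 / r for r in rates)               # seconds
print("arithmetic mean :", round(arithmetic_mean, 1))  # overstates throughput
print("harmonic mean   :", round(harmonic_mean, 1))
print("ops / total time:", round(total_instructions / total_time, 1))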
Harmonic Mean Speedup Another way to apply the harmonic mean concept is to tie the various modes of a program to the number of processors used. Suppose a program (or a workload of multiple programs combined) is to be executed on an n-processor system. During the executing period, the program may use i = 1, 2, ..., n processors in different time periods.

We say the program is executed in mode i if i processors are used. The corresponding execution rate R_i is used to reflect the collective speed of i processors. Assume that T_1 = 1/R_1 = 1 is the sequential execution time on a uniprocessor with an execution rate R_1 = 1. Then T_i = 1/R_i = 1/i is the execution time of using i processors with a combined execution rate of R_i = i in the ideal case.

Suppose the given program is executed in n execution modes with a weight distribution w = {f_i | i = 1, 2, ..., n}. A weighted harmonic mean speedup is defined as follows:

S = T_1/T* = 1 / (Σ_{i=1}^{n} f_i/R_i)    (3.13)

where T* = 1/R*_h is the weighted arithmetic mean execution time across the n execution modes, similar to that derived in Eq. 3.12.
Example 3.2 Harmonic mean speedup for a multiprocessor operating in n execution modes (Hwang and Briggs, 1984)
In Fig. 3.2, we plot Eq. 3.13 based on the assumption that T_i = 1/i for all i = 1, 2, ..., n. This corresponds to the ideal case in which a unit-time job is done by i processors in minimum time. The assumption can also be interpreted as R_i = i, because the execution rate increases i times from R_1 = 1 when i processors are fully utilized without waste.

The three probability distributions π1, π2, and π3 correspond to three processor utilization patterns. Let s = Σ_{i=1}^{n} i. Then π1 = (1/n, 1/n, ..., 1/n) corresponds to a uniform distribution over the n execution modes, π2 = (1/s, 2/s, ..., n/s) favors using more processors, and π3 = (n/s, (n − 1)/s, ..., 2/s, 1/s) favors using fewer processors.

The ideal case corresponds to the 45° dashed line. Obviously, π2 produces a higher speedup than π1 does. The distribution π1 is in turn superior to the distribution π3 in Fig. 3.2.
Fig. 3.2 Harmonic mean speedup performance with respect to three probability distributions: π1 for uniform distribution, π2 in favor of using more processors, and π3 in favor of using fewer processors
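The weighted harmonic mean speedup of Eq. 3.13 can be evaluated directly for the three distributions of Example 3.2. The sketch below assumes the ideal rates R_i = i and a hypothetical machine size n = 8; it only mirrors the qualitative ordering π2 > π1 > π3 shown in Fig. 3.2.

# Weighted harmonic mean speedup (Eq. 3.13) with ideal rates R_i = i.
def harmonic_speedup(weights):
    return 1.0 / sum(f / i for i, f in enumerate(weights, start=1))

n = 8                                   # assumed number of processors
s = n * (n + 1) // 2                    # s = 1 + 2 + ... + n
pi1 = [1.0 / n] * n                     # uniform over the n modes
pi2 = [i / s for i in range(1, n + 1)]  # favors using more processors
pi3 = [(n - i + 1) / s for i in range(1, n + 1)]  # favors fewer processors

for name, dist in (("pi1", pi1), ("pi2", pi2), ("pi3", pi3)):
    print(name, round(harmonic_speedup(dist), 2))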
Amdahl's Law Using Eq. 3.13, one can derive Amdahl's law as follows. First, assume R_i = i and w = (α, 0, 0, ..., 0, 1 − α); i.e., f_1 = α, f_n = 1 − α, and f_i = 0 for i ≠ 1 and i ≠ n. This implies that the system is used either in a pure sequential mode on one processor with a probability α, or in a fully parallel mode using n processors with a probability 1 − α. Substituting R_1 = 1 and R_n = n and w into Eq. 3.13, we obtain the following speedup expression:

S_n = n / (1 + (n − 1)α)    (3.14)

This is known as Amdahl's law. The implication is that S → 1/α as n → ∞. In other words, under the above probability assumption, the best speedup one can expect is upper-bounded by 1/α, regardless of how many processors are employed.
In Fig. 3.3, we plot Eq. 3.14 as a function of n for four values of α. When α = 0, the ideal speedup is achieved. As the value of α increases from 0.01 to 0.1 to 0.9, the speedup performance drops sharply.
Fig. 3.3 Speedup performance with respect to the probability distribution π = (α, 0, ..., 0, 1 − α), where α is the fraction of sequential bottleneck
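Equation 3.14 is simple enough to tabulate directly. The sketch below reproduces the qualitative behaviour plotted in Fig. 3.3 for a few machine sizes; the chosen values of n and α are just sample points.

# Amdahl's law (Eq. 3.14): S_n = n / (1 + (n - 1) * alpha).
def amdahl(n, alpha):
    return n / (1 + (n - 1) * alpha)

for alpha in (0.0, 0.01, 0.1, 0.9):
    speedups = [round(amdahl(n, alpha), 1) for n in (4, 16, 64, 256, 1024)]
    print(f"alpha = {alpha}: {speedups}")
# For any alpha > 0 the speedup saturates near the bound 1/alpha as n grows.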
For many years, Amdahl's law has painted a pessimistic picture for parallel processing. That is, the system performance cannot be high as long as the serial fraction α exists. We will further examine Amdahl's law in Section 3.3.1 from the perspective of workload growth.
Redundancy and Utilization The redundancy in a parallel computation is defined as the ratio of O(n) to O(1):

R(n) = O(n)/O(1)    (3.17)

This ratio signifies the extent of matching between software parallelism and hardware parallelism. Obviously 1 ≤ R(n) ≤ n. The system utilization in a parallel computation is defined as

U(n) = R(n)E(n) = O(n)/(nT(n))    (3.18)

The system utilization indicates the percentage of resources (processors, memories, etc.) that was kept busy during the execution of a parallel program. It is interesting to note the following relationships: 1/n ≤ E(n) ≤ U(n) ≤ 1 and 1 ≤ R(n) ≤ 1/E(n) ≤ n.
Quality of Parallelism The quality of a parallel computation is directly proportional to the speedup and efficiency and inversely related to the redundancy. Thus, we have

Q(n) = S(n)E(n)/R(n) = T³(1)/(nT²(n)O(n))    (3.19)

Since E(n) is always a fraction and R(n) is a number between 1 and n, the quality Q(n) is always upper-bounded by the speedup S(n).
Example 3.3 A hypothetical workload and performance plots

In Fig. 3.4, we compare the relative magnitudes of S(n), E(n), R(n), U(n), and Q(n) as a function of machine size n, with respect to a hypothetical workload characterized by O(1) = T(1) = n³, O(n) = n³ + n²log₂n, and T(n) = 4n³/(n + 3).

Substituting these measures into Eqs. 3.15 to 3.19, we obtain the following performance expressions:

S(n) = (n + 3)/4
E(n) = (n + 3)/(4n)
R(n) = (n + log₂ n)/n
U(n) = (n + 3)(n + log₂ n)/(4n²)
Q(n) = (n + 3)²/(16(n + log₂ n))

The relationships 1/n ≤ E(n) ≤ U(n) ≤ 1 and 1 ≤ Q(n) ≤ S(n) ≤ n are observed, where the linear speedup corresponds to the ideal case of 100% efficiency.
Fig. 3.4 Performance measures for Example 3.3 on a parallel computer with up to 32 processors
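The five performance expressions of Example 3.3 can also be evaluated numerically. The Python sketch below simply plugs machine sizes into the closed forms derived above; it should reproduce the trends plotted in Fig. 3.4.

# Performance measures for the hypothetical workload of Example 3.3:
# O(1) = T(1) = n^3, O(n) = n^3 + n^2*log2(n), T(n) = 4*n^3/(n + 3).
from math import log2

def measures(n):
    S = (n + 3) / 4                              # speedup S(n) = T(1)/T(n)
    E = (n + 3) / (4 * n)                        # efficiency E(n) = S(n)/n
    R = (n + log2(n)) / n                        # redundancy, Eq. 3.17
    U = R * E                                    # utilization, Eq. 3.18
    Q = (n + 3) ** 2 / (16 * (n + log2(n)))      # quality, Eq. 3.19
    return S, E, R, U, Q

for n in (2, 4, 8, 16, 32):
    S, E, R, U, Q = measures(n)
    print(f"n={n:2d}  S={S:5.2f}  E={E:.2f}  R={R:.2f}  U={U:.2f}  Q={Q:5.2f}")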
To summarize the above discussion on performance indices, we use the speedup S(n) to indicate the degree of speed gain in a parallel computation. The efficiency E(n) measures the useful portion of the total work performed by n processors. The redundancy R(n) measures the extent of workload increase. The utilization U(n) indicates the extent to which resources are utilized during a parallel computation.
Finally, the quality Q(n) combines the effects of speedup, efficiency, and redundancy into a single expression to assess the relative merit of a parallel computation on a computer system.
The speedup and efficiency of 10 parallel computers are reported in Table 3.1 for solving a linear system of 1000 equations. The table entries are excerpts from Table 1 in Dongarra's report (1992) on LINPACK benchmark performance over a large number of computers.

Either the standard LINPACK algorithm or an algorithm based on matrix-matrix multiplication was used in these experiments. A high degree of parallelism is embedded in these experiments. Thus high efficiency (such as 0.94 for the IBM 3090/600S VF and 0.95 for the Convex C3240) was achieved. The low efficiency reported on the Intel Delta was based on some initial data.
Table 3.1 Speedup and Efficiency of Parallel Computers for Solving a Linear System with 1000 Unknowns
Source: Jack Dongarra, "Performance of Various Computers Using Standard Linear Equations Software," Computer Science Dept., Univ. of Tennessee, Knoxville, TN 37996-1301, March 11, 1992.
To compare processors with different clock cycles and different instruction sets is not totally fair. Besides the native MIPS, one can define a relative MIPS with respect to a reference machine. We will discuss the relative MIPS rating against the VAX 11/780 when Dhrystone performance is introduced below. For numerical computing, the LINPACK results on a large number of computers are reported in Chapter 8.

Similarly, the Mflops rating depends on the machine hardware design and on the program behavior. MIPS and Mflops ratings are not convertible because they measure different ranges of operations. The conventional rating is called the native Mflops, which does not distinguish unnormalized from normalized floating-point operations.

For example, a real floating-point divide operation may correspond to four normalized floating-point divide operations. One needs to use a conversion table between real and normalized floating-point operations to convert a native Mflops rating to a normalized Mflops rating.
The Dhrystone Results This is a CPU-intensive benchmark consisting of a mix of about 100 high-level language instructions and data types found in system programming applications where floating-point operations are not used (Weicker, 1984). The Dhrystone statements are balanced with respect to statement type, data type, and locality of reference, with no operating system calls and making no use of library functions or subroutines. Thus the Dhrystone rating should be a measure of the integer performance of modern processors. The unit KDhrystones/s is often used in reporting Dhrystone results.

The Dhrystone benchmark version 1.1 was applied to a number of processors. The DEC VAX 11/780 scored 1.7 KDhrystones/s. This machine has been used as a reference computer with a 1-MIPS performance. The relative VAX/MIPS rating is commonly accepted by the computer industry.
The Whetstone Results This is a Fortran-based synthetic benchmark assessing the floating-point performance, measured in the number of KWhetstones/s that a system can perform. The benchmark includes both integer and floating-point operations involving array indexing, subroutine calls, parameter passing, conditional branching, and trigonometric/transcendental functions.

The Whetstone benchmark does not contain any vectorizable code and shows dependence on the system's mathematics library and on the efficiency of the code generated by a compiler.

The Whetstone performance is not equivalent to the Mflops performance, although the Whetstone contains a large number of scalar floating-point operations.
Both the Dhrystone and Whetstone are synthetic benchmarks whose performance results depend heavily on the compilers used. As a matter of fact, the Dhrystone benchmark program was originally written to test the CPU and compiler performance for a typical program. Compiler techniques, especially procedure in-lining, can significantly affect the Dhrystone performance.

Both benchmarks were criticized for being unable to predict the performance of user programs. The sensitivity to compilers is a major drawback of these benchmarks. In real-life problems, only application-oriented benchmarks will do the trick. We will examine the SPEC and other benchmark suites in Chapter 9.
The TPS and KLIPS Ratings On-line transaction processing applications demand rapid, interactive processing for a large number of relatively simple transactions. They are typically supported by very large databases. Automated teller machines and airline reservation systems are familiar examples. Today many such applications are web-based.
The throughput of computers for on-line transaction processing is often measured in transactions per second (TPS). Each transaction may involve a database search, query answering, and database update operations. Business computers and servers should be designed to deliver a high TPS rate. The TP1 benchmark was originally proposed in 1985 for measuring the transaction processing of business application computers. This benchmark also became a standard for gauging relational database performance.
Over the last couple of decades, there has been an enormous increase both in the diversity and the scale of computer applications deployed around the world. The world-wide web, web-based applications, multimedia applications and search engines did not exist in the early 1990s. Such scale and diversity have been made possible by huge advances in processing, storage, graphics display and networking capabilities over this period, which have been reviewed in Chapter 13.

For such applications, application-specific benchmarks have become more important than general purpose benchmarks such as Whetstone. For web servers providing 24 × 7 service, for example, we may wish to benchmark, under simulated but realistic load conditions, performance parameters such as throughput (in number of requests served and/or amount of data delivered) and average response time.
In artificial intelligence applications, the measure KLIPS (kilo logic inferences per second) was used at one time to indicate the reasoning power of an AI machine. For example, the high-speed inference machine developed under Japan's Fifth-Generation Computer System Project claimed a performance of 400 KLIPS. Assuming that each logic inference operation involves about 100 assembly instructions, 400 KLIPS implies approximately 40 MIPS in this sense. The conversion ratio is by no means fixed. Logic inference demands symbolic manipulations rather than numeric computations. Interested readers are referred to the book edited by Wah and Ramamoorthy (1990).
(1) The magnetic recording industry relies on the use of computers to study magnetostatic and exchange interactions in order to reduce noise in metallic thin films used to coat high-density disks. In general, all research in science and engineering makes heavy demands on computing power.
(2) Rational drug design is being aided by computers in the search for a cure for cancer, acquired immunodeficiency syndrome and other diseases. Using a high-performance computer, new potential agents have been identified that block the action of human immunodeficiency virus protease.
(3) Design of high-speed transport aircraft is being aided by computational fluid dynamics running on supercomputers. Fuel combustion can be made more efficient by designing better engine models through chemical kinetics calculations.
(4) Catalysts for chemical reactions are being designed with computers for many biological processes which are catalytically controlled by enzymes. Massively parallel quantum models demand large simulations to reduce the time required to design catalysts and to optimize their properties.
(5) Ocean modeling cannot be accurate without supercomputing MPP systems. Ozone depletion and climate research demands the use of computers in analyzing the complex thermal, chemical and fluid-dynamic mechanisms involved.
(6) Other important areas demanding computational support include digital anatomy in real-time medical diagnosis, air pollution reduction through computational modeling, the design of protein structures by computational biologists, image processing and understanding, and technology linking research to education.
Besides computer science and computer engineering, the above challenges also encourage the emerging discipline of computational science and engineering. This demands systematic application of computer systems and computational solution techniques to mathematical models formulated to describe and to simulate phenomena of scientific and engineering interest.

The HPCC Program also identified some grand challenge computing requirements of the time, as shown in Fig. 3.5. This diagram shows the levels of processing speed and memory size required to support scientific simulation modeling, advanced computer-aided design (CAD), and real-time processing of large-scale database and information retrieval operations. In the years since the early 1990s, there have been huge advances in the processing, storage and networking capabilities of computer systems. Some MPP systems have reached petaflop performance, while even PCs have gigabytes of memory. At the same time, computing requirements in science and engineering have also grown enormously.
Exploiting Massive Parallelism The parallelism embedded in the instruction level or procedural level is rather limited. Very few parallel computers can successfully execute more than two instructions per machine cycle from the same program. Instruction parallelism is often constrained by program behavior, compiler/OS incapabilities, and program flow and execution mechanisms built into modern computers.

On the other hand, data parallelism is much higher than instruction parallelism. Data parallelism refers to the situation where the same operation (instruction or program) executes over a large array of data (operands). Data parallelism has been implemented on pipelined vector processors, SIMD array processors, and SPMD or MPMD multicomputer systems.

In Table 1.6, we find SIMD data parallelism over 65,536 PEs in the CM-2. One may argue that the CM-2 was a bit-slice machine. Even if we divide the number by 64 (the word length of a typical supercomputer), we still end up with a DOP on the order of thousands in the CM-2.

The vector length can be used to determine the parallelism implementable on a vector supercomputer.
In the case of the Cray Y-MP C90, 32 pipelines in 16 processors could potentially achieve a DOP of 32 × 5 = 160 if the average pipeline has five stages. Thus a pipelined processor can support a lower degree of data parallelism than an SIMD computer.
Fig. 3.5 Grand challenge computing requirements of the early 1990s: memory capacity plotted against processing speed for applications such as global change, human genome, fluid turbulence, vehicle dynamics, ocean circulation, viscous fluid dynamics, superconductor and semiconductor modeling, quantum chromodynamics, vision, 72-hour and 48-hour weather prediction, pharmaceutical design, 3D plasma modeling, chemical dynamics, airfoil design, and oil reservoir modeling
Early MPP systems operating in MIMD mode included the BBN TC-2000 with a maximum configuration of 512 processors. The IBM RP-3 was designed to have 512 processors (only a 64-processor version was built). The Intel Touchstone Delta was a system with 570 processors.

Several subsequent MPP projects included the Paragon by Intel Supercomputer Systems, the CM-5 by Thinking Machines Corporation, the KSR-1 by Kendall Square Research, the Fujitsu VPP500 system, the Tera computer, and the MIT *T system.

IBM announced MPP projects using thousands of IBM RS/6000 and later Power processors, while Cray developed MPP systems using Digital's Alpha processors and later AMD Opteron processors as building blocks. Some early MPP projects are summarized in Table 3.2. We will study some of these systems in later chapters, and more recent advances in Chapter 13.
Table 3.2 Some Early MPP Projects

Intel Paragon: A 2-D mesh-connected multicomputer, built with i860 XP processors and wormhole routers, targeted for 300 Gflops in peak performance.
IBM MPP Model: Uses IBM RISC/6000 processors as building blocks; 50 Gflops peak expected for a 1024-processor configuration.
TMC CM-5: A universal architecture for SIMD/MIMD computing using SPARC PEs and custom-designed FPUs, control and data networks; 2 Tflops peak for 16K nodes.
Cray Research MPP Model: A 3-D torus heterogeneous architecture using DEC Alpha chips with special communication support, global address space over physically distributed memory; first system offered 150 Gflops in a 1024-processor configuration in 1993; capable of growing to Tflops with larger configurations.
Kendall Square Research KSR-1: An ALLCACHE ring-connected multiprocessor with custom-designed processors; 43 Gflops peak performance for a 1088-processor configuration.
Fujitsu VPP500: A crossbar-connected 222-PE MIMD vector system, with shared distributed memories using a VP2000 as host; peak performance = 355 Gflops.
Fig. 3.6 Workload growth, efficiency curves, and application models of parallel computers under resource constraints: (a) four workload growth patterns (α constant, β sublinear, γ linear, θ exponential), (b) the corresponding efficiency curves, (c) application models for parallel computers (fixed-load, fixed-time, and fixed-memory models bounded by communication and memory limits)
The Efficiency Curves Corresponding to the four workload patterns specified in Fig. 3.6a, four efficiency curves are shown in Fig. 3.6b, respectively. With a constant workload, the efficiency curve (α) drops rapidly. In fact, curve α corresponds to the famous Amdahl's law. For a linear workload, the efficiency curve (γ) is almost flat, as observed by Gustafson in 1988.

The exponential workload (θ) may not be implementable due to memory shortage or I/O bounds (if a real-time application is considered). Thus the θ efficiency (dashed line) is achievable only with exponentially increased memory (or I/O) capacity. The sublinear efficiency curve (β) lies somewhere between curves α and γ.
Scalability analysis determines whether parallel processing of a given problem can offer the desired improvement in performance. The analysis should help guide the design of a massively parallel processor. It is clear that no single scalability metric suffices to cover all possible cases. Different measures will be useful in different contexts, and further analysis is needed along multiple dimensions for any specific application.
A parallel system can be used to solve arbitrarily large problems in a fixed time if and only if its workload pattern is allowed to grow linearly. Sometimes, even if minimum time is achieved with more processors, the system utilization (or efficiency) may be very poor.
Application Models The workload patterns shown in Fig. 3.6a are not allowed to grow unbounded. In Fig. 3.6c, we show three models for the application of parallel computers. These models are bounded by limited memory, limited tolerance of IPC latency, or limited I/O bandwidth. These models are briefly introduced below. They lead to three speedup performance models to be formulated in Section 3.3.

The fixed-load model corresponds to a constant workload (curve α in Fig. 3.6a). The use of this model is eventually limited by the communication bound shown by the shaded area in Fig. 3.6c.

The fixed-time model demands a constant program execution time, regardless of how the workload scales up with machine size. The linear workload growth (curve γ in Fig. 3.6a) corresponds to this model. The fixed-memory model is limited by the memory bound, corresponding to a workload curve between γ and θ in Fig. 3.6a.

From the application point of view, the shaded areas are forbidden. The communication bound includes not only the increasing IPC overhead but also the increasing I/O demands. The memory bound is determined by main memory and disk capacities.

In practice, an algorithm designer or a parallel computer programmer may choose an application model within the above resource constraints, as shown in the unshaded application region in Fig. 3.6c.
Tradeoffs in Scalability Analysis Computer cost c and programming overhead p (in addition to speedup and efficiency) are equally important in scalability analysis. After all, cost-effectiveness may impose the ultimate constraint on computing with a limited budget. What we have studied above was concentrated on system efficiency and fast execution of a single algorithm/program on a given parallel computer.

It would be interesting to extend the scalability analysis to multiuser environments in which multiple programs are executed concurrently by sharing the available resources. Sometimes one problem is poorly scalable, while another has good scalability characteristics. Tradeoffs exist in increasing resource utilization, but not necessarily to minimize the overall execution time in an optimization process.

Exploiting parallelism for higher performance demands both scalable architectures and scalable algorithms. The architectural scalability can be limited by long communication latency, bounded memory capacity, bounded I/O bandwidth, and limited processing speed. How to achieve a balanced design among these practical constraints is the major challenge of today's MPP system designers. On the other hand, parallel algorithms and efficient data structures also need to be scalable.
We summarize below important characteristics of parallel algorithms which are machine implementable:

(1) Deterministic versus nondeterministic: As defined in Section 1.4.1, only deterministic algorithms are implementable on real machines. Our study is confined to deterministic algorithms with polynomial time complexity.
(2) Computational granularity: As introduced in Section 2.2.1, granularity decides the size of data items and program modules used in computation. In this sense, we also classify algorithms as fine-grain, medium-grain, or coarse-grain.
(3) Parallelism profile: The distribution of the degree of parallelism in an algorithm reveals the opportunity for parallel processing. This often affects the effectiveness of the parallel algorithms.
(4) Communication patterns and synchronization requirements: Communication patterns address both memory access and interprocessor communications. The patterns can be static or dynamic, depending on the algorithms. Static algorithms are more suitable for SIMD or pipelined machines, while dynamic algorithms are for MIMD machines. The synchronization frequency often affects the efficiency of an algorithm.
(5) Uniformity of the operations: This refers to the types of fundamental operations to be performed. Obviously, if the operations are uniform across the data set, SIMD processing or pipelining may be more desirable. In other words, randomly structured algorithms are more suitable for MIMD processing. Other related issues include data types and precision desired.
(6) Memory requirement and data structures: In solving large-scale problems, the data sets may require huge memory space. Memory efficiency is affected by data structures chosen and data movement patterns in the algorithms. Both time and space complexities are key measures of the granularity of a parallel algorithm.
The Isoefficiency Concept The workload w of an algorithm grows with s, the problem size. Thus, we denote the workload w = w(s) as a function of s. Kumar and Rao (1987) have introduced an isoefficiency concept relating workload to the machine size n needed to maintain a fixed efficiency E when implementing a parallel algorithm on a parallel computer. Let h be the total communication overhead involved in the algorithm implementation. This overhead is usually a function of both machine size and problem size, thus denoted h = h(s, n).

The efficiency of a parallel algorithm implemented on a given parallel computer is thus defined as

E = w(s) / (w(s) + h(s, n))    (3.20)

The workload w(s) corresponds to useful computations, while the overhead h(s, n) represents time attributed to synchronization and data communication delays. In general, the overhead increases with respect to both increasing values of s and n. Thus, the efficiency is always less than 1. The question hinges on the relative growth rates between w(s) and h(s, n).

With a fixed problem size (or fixed workload), the efficiency decreases as n increases. The reason is that the overhead h(s, n) increases with n. With a fixed machine size, the overhead h grows slower than the workload w. Thus the efficiency increases with increasing problem size for a fixed-size machine. Therefore, one can expect to maintain a constant efficiency if the workload w is allowed to grow properly with increasing machine size.
For a given algorithm, the workload w might need to grow polynomially or exponentially with respect to n in order to maintain a fixed efficiency. Different algorithms may require different workload growth rates to keep the efficiency from dropping as n is increased. The isoefficiency functions of common parallel algorithms are polynomial functions of n; i.e., they are O(n^k) for some k ≥ 1. The smaller the power of n in the isoefficiency function, the more scalable the parallel system. Here, the system includes the algorithm and architecture combination.
Isoefficiency Function We can rewrite Eq. 3.20 as E = 1/(1 + h(s, n)/w(s)). In order to maintain a constant E, the workload w(s) should grow in proportion to the overhead h(s, n). This leads to the following condition:

w(s) = (E/(1 − E)) × h(s, n)    (3.21)

The factor C = E/(1 − E) is a constant for a fixed efficiency E. Thus we can define the isoefficiency function as follows:

f_E(n) = C × h(s, n)    (3.22)

If the workload w(s) grows as fast as f_E(n) in Eq. 3.22, then a constant efficiency can be maintained for a given algorithm-architecture combination. Two examples are given below to illustrate the use of isoefficiency functions for scalability analysis.
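The sketch below illustrates the isoefficiency idea numerically for a made-up overhead function h(s, n) = n log2 n + sqrt(sn); this is a toy model only, not one of the algorithms analyzed in Example 3.4. For each machine size, the workload w(s) = s is grown until condition 3.21 is met, and the achieved efficiency settles near the target E.

from math import log2, sqrt

def overhead(s, n):
    return n * log2(n) + sqrt(s * n)     # assumed overhead h(s, n)

def efficiency(w, h):
    return w / (w + h)                   # Eq. 3.20

E = 0.8
C = E / (1 - E)                          # C = E / (1 - E)

for n in (4, 16, 64, 256):
    s = 1.0
    # grow the workload w(s) = s until it reaches C * h(s, n)  (Eq. 3.21)
    while s < C * overhead(s, n):
        s *= 1.1
    print(f"n={n:4d}  required workload ~ {s:10.0f}  "
          f"achieved E = {efficiency(s, overhead(s, n)):.2f}")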
Example 3.4 Scalability of matrix multiplication algorithms (Gupta and Kumar, 1992)
Four algorithms for matrix multiplication are compared below. The problem size s is represented by the matrix order. In other words, we consider the multiplication of two s × s matrices A and B to produce an output matrix C = A × B. The total workload involved is w = O(s³). The number of processors used is confined within 1 ≤ n ≤ s³. Some of the algorithms may use fewer than s³ processors.

The isoefficiency functions of the four algorithms are derived below based on equating the workload with the communication overhead (Eq. 3.21) in each algorithm. Details of these algorithms and corresponding architectures can be found in the original papers identified in Table 3.3 as well as in the paper by Gupta and Kumar (1992). The derivation of the communication overheads is left as an exercise in Problem 3.14.
The Fox-Otto-Hey algorithm has a total overhead h(s, n) = O(n log n + s²√n). The workload w = O(s³) must match this: O(s³) = O(n log n) and O(s³) = O(s²√n). Combining the two, we obtain the isoefficiency function O(s³) = O(n^{1.5}), where 1 ≤ n ≤ s², as shown in the first row of Table 3.3.

Although this algorithm is written for the torus architecture, the torus can be easily embedded in a hypercube architecture. Thus we can conduct a fair comparison of the four algorithms against the hypercube architecture.
Berntsen's algorithm restricts the use of n ≤ s^{3/2} processors. The total overhead is O(n^{4/3} + n log n + s²n^{1/3}). To match this with O(s³), we must have O(s³) = O(n^{4/3}) and O(s³) = O(n). Thus, O(s³) must be chosen to yield the isoefficiency function O(n^{4/3}).
The Gupta-Kumar algorithm has an overhead O(n log n + s²n^{1/3} log n). Thus we must have O(s³) = O(n log n) and O(s³) = O(s²n^{1/3} log n). This leads to the isoefficiency function O(n(log n)³) in the third row of Table 3.3.
The Dekel-Nassimi-Sahni algorithm has a total overhead of O(n log n + s³) besides a useful computation time of O(s³) for s² ≤ n ≤ s³. Thus the workload growth O(s³) = O(n log n) will yield the isoefficiency listed in the last row of Table 3.3.
Table 3.3 Asymptotic Isoefficiency Functions of Four Matrix Multiplication Algorithms (Gupta and Kumar, 1992)
The above isoefficiency functions indicate the asymptotic scalabilities of the four algorithms. In practice, none of the algorithms is strictly better than the others for all possible problem sizes and machine sizes. For example, when these algorithms are implemented on a multicomputer with a long communication latency (as in the Intel iPSC/1), Berntsen's algorithm is superior to the others.

To map the algorithms on an SIMD computer with an extremely low synchronization overhead, the algorithm by Gupta and Kumar is inferior to the others. Hence, it is best to use the Dekel-Nassimi-Sahni algorithm for s² ≤ n ≤ s³, the Fox-Otto-Hey algorithm for s^{3/2} ≤ n ≤ s², and Berntsen's algorithm for n ≤ s^{3/2} on SIMD hypercube machines.
Example 3.5 Fast Fourier transform on mesh and hypercube computers (Gupta and Kumar, 1993)
This example demonstrates the sensitivity of machine architecture to the scalability of the FFT on two different parallel computers: a mesh and a hypercube. We consider the Cooley-Tukey algorithm for the one-dimensional s-point fast Fourier transform.

Gupta and Kumar have established the overheads: h₁(s, n) = O(n log n + s log n) for FFT on a hypercube machine with n processors, and h₂(s, n) = O(n log n + s√n) on a √n × √n mesh with n processors.

For an s-point FFT, the total workload involved is w(s) = O(s log s). Equating the workload with the overheads, we must satisfy O(s log s) = O(n log n) and O(s log s) = O(s log n), leading to the isoefficiency function f₁ = O(n log n) for the hypercube machine.
Similarly, we must satisfy O(s log s) = O(n log n) and O(s log s) = O(s√n) by equating w(s) = h₂(s, n). This leads to the isoefficiency function f₂ = O(√n k^{√n}) for some constant k ≥ 2.
The above analysis leads to the conclusion that the FFT is indeed very scalable on a hypercube computer. The result is plotted in Fig. 3.7a for three efficiency values.
Fig. 3.7 Isoefficiency curves for FFT on two parallel computers (Courtesy of Gupta and Kumar, 1993): problem size s plotted against machine size n for three efficiency values on (a) a hypercube and (b) a mesh
To maintain the same efficiency, the mesh is rather poorly scalable, as demonstrated in Fig. 3.7b. This is predictable from the fact that the workload must grow exponentially as O(√n k^{√n}) for the mesh architecture, while the hypercube demands only an O(n log n) workload increase as the machine size increases. Thus, we conclude that the FFT is scalable on a hypercube but not so on a mesh architecture.

If the bandwidth of the communication channels in a mesh architecture increases in proportion to the increase in machine size, the above conclusion will change. For the design and analysis of FFT on parallel machines, readers are referred to the books by Aho, Hopcroft and Ullman (1974) and by Quinn (1987). We will further address scalability issues from the architecture standpoint in Section 3.4.
as soon as possible. In other words, minimal turnaround time is the primary goal. Speedup obtained for time-critical applications is called fixed-load speedup.

Fixed-Load Speedup The ideal speedup formula given in Eq. 3.7 is based on a fixed workload, regardless of the machine size. Traditional formulations for speedup, including Amdahl's law, are all based on a fixed problem size and thus on a fixed load. The speedup factor is upper-bounded by a sequential bottleneck in this case.
We consider below both the cases of DOP < n and of DOP ≥ n. We use the ceiling function ⌈x⌉ to represent the smallest integer that is greater than or equal to the positive real number x. When x is a fraction, ⌈x⌉ equals 1.

Consider the case where DOP = i ≥ n. Assume all n processors are used to execute W_i exclusively. The execution time of W_i is

t_i(n) = (W_i/(iΔ)) ⌈i/n⌉    (3.23)

Thus the response time is

T(n) = Σ_{i=1}^{m} (W_i/(iΔ)) ⌈i/n⌉    (3.24)

Note that if i < n, then t_i(n) = t_i(∞) = W_i/(iΔ). Now, we define the fixed-load speedup factor as the ratio of T(1) to T(n):

S_n = T(1)/T(n) = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} (W_i/i)⌈i/n⌉)    (3.25)

Note that S_n ≤ S_∞ ≤ A, by comparing Eqs. 3.4, 3.7, and 3.25.
A number of factors we have ignored may lower the speedup performance. These include communication latencies caused by delayed memory access, interprocessor communication over a bus or a network, and operating system overhead and delay caused by interrupts. Let Q(n) be the lumped sum of all system overheads on an n-processor system. We can rewrite Eq. 3.25 as follows:

S_n = T(1)/(T(n) + Q(n)) = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} (W_i/i)⌈i/n⌉ + Q(n))    (3.26)
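The following sketch evaluates Eqs. 3.25 and 3.26 for a small, invented workload profile W_i; both the profile and the lumped overhead Q(n) are assumptions chosen only to show how the ceiling term and the overhead depress the fixed-load speedup.

from math import ceil

# Hypothetical workload profile: W[i] is the work executed with DOP = i.
W = {1: 10.0, 2: 20.0, 8: 40.0, 32: 30.0}

def fixed_load_speedup(W, n, Q=0.0):
    # Eq. 3.25 / 3.26: T(1) ~ sum(W_i); T(n) charges ceil(i/n) rounds of
    # W_i/i work for each DOP level, plus the lumped overhead Q(n).
    t1 = sum(W.values())
    tn = sum((w / i) * ceil(i / n) for i, w in W.items()) + Q
    return t1 / tn

for n in (1, 2, 4, 8, 16, 32):
    print(n, round(fixed_load_speedup(W, n), 2),
          round(fixed_load_speedup(W, n, Q=2.0), 2))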
Amdahl's Law Revisited In 1967, Gene Amdahl derived a fixed-load speedup for the special case where the computer operates either in sequential mode (with DOP = 1) or in perfectly parallel mode (with DOP = n). That is, W_i = 0 if i ≠ 1 and i ≠ n in the profile. Equation 3.25 is then simplified to

S_n = (W_1 + W_n) / (W_1 + W_n/n)    (3.27)
Amdahl's law implies that the sequential portion of the program W_1 does not change with respect to the machine size n. However, the parallel portion is evenly executed by n processors, resulting in a reduced time.

Consider a normalized situation in which W_1 + W_n = α + (1 − α) = 1, with α = W_1 and 1 − α = W_n. Equation 3.27 is then reduced to Eq. 3.14, where α represents the percentage of a program that must be executed sequentially and 1 − α corresponds to the portion of the code that can be executed in parallel.
Amdahl's law is illustrated in Fig. 3.8. When the number of processors increases, the load on each processor decreases. However, the total amount of work (workload) W_1 + W_n is kept constant, as shown in Fig. 3.8a. In Fig. 3.8b, the total execution time decreases because T_n = W_n/n. Eventually, the sequential part will dominate the performance because T_n → 0 as n becomes very large while T_1 is kept unchanged.
Fig. 3.8 Fixed-load speedup model: (a) constant workload, (b) decreasing execution time, (c) speedup with a fixed load plotted against the sequential fraction of the program for n = 1024 processors (speedups of 1024x, 91x, 48x, 31x, and 24x at sequential fractions of 0% through 4%)
Sequential Bottleneck Figure 3.8c plots Amdahl's law using Eq. 3.14 over the range 0 ≤ α ≤ 1. The maximum speedup S_n = n is achieved if α = 0. The minimum speedup S_n = 1 results if α = 1. As n → ∞, the limiting value of S_n → 1/α. This implies that the speedup is upper-bounded by 1/α as the machine size becomes very large.
The speedup curve in Fig. 3.8c drops very rapidly as α increases. This means that with even a small percentage of sequential code, the entire performance cannot go higher than 1/α. This α has been called the sequential bottleneck in a program.

The problem of a sequential bottleneck cannot be solved just by increasing the number of processors in a system. The real problem lies in the existence of a sequential fraction of the code. This property has imposed a pessimistic view on parallel processing over the past two decades.
In fact, two major impacts on the parallel computer industry were observed. First, manufacturers were discouraged from making large-scale parallel computers. Second, more research attention was shifted toward developing parallelizing compilers which would reduce the value of α and in turn boost the performance.
A general formula for fixed-time speedup is defined by S'_n = T(1)/T'(n), modified from Eq. 3.26:

S'_n = (Σ_{i=1}^{m'} W'_i) / (Σ_{i=1}^{m'} (W'_i/i)⌈i/n⌉ + Q(n)) = (Σ_{i=1}^{m'} W'_i) / (Σ_{i=1}^{m} W_i)    (3.29)
Gustafson's Law Fixed-time speedup was originally developed by Gustafson for a special parallelism profile with W_i = 0 if i ≠ 1 and i ≠ n. Similar to Amdahl's law, we can rewrite Eq. 3.29 as follows, assuming Q(n) = 0:

S'_n = (Σ_{i=1}^{m'} W'_i) / (Σ_{i=1}^{m} W_i) = (W'_1 + W'_n)/(W_1 + W_n) = (W_1 + nW_n)/(W_1 + W_n)    (3.30)

where W'_n = nW_n and W_1 + W_n = W'_1 + W'_n/n, corresponding to the fixed-time condition. From Eq. 3.30, the parallel workload W'_n has been scaled to n times W_n in a linear fashion.

The relationship of a scaled workload to Gustafson's scaled speedup is depicted in Fig. 3.9. In fact, Gustafson's law can be restated as follows in terms of α = W_1 and 1 − α = W_n, under the same assumption W_1 + W_n = 1 that we have made for Amdahl's law:

S'_n = (α + n(1 − α)) / (α + (1 − α)) = n − α(n − 1)    (3.31)
In Fig. 3.9a, we demonstrate the workload scaling situation. Figure 3.9b shows the fixed-time execution style. Figure 3.9c plots S'_n as a function of the sequential portion α of a program running on a system with n = 1024 processors.

Note that the slope of the S'_n curve in Fig. 3.9c is much flatter than that in Fig. 3.8c. This implies that Gustafson's law does support scalable performance as the machine size increases. The idea is to keep all processors busy by increasing the problem size. When the problem can scale to match the available computing power, the sequential fraction is no longer a bottleneck.
Fig. 3.9 Fixed-time speedup model (Gustafson's scaled speedup): (a) scaled workload growing with the number of processors, (b) constant execution time, (c) speedup with fixed execution time plotted against the sequential fraction of the program for n = 1024 processors
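A direct comparison of Eq. 3.14 and Eq. 3.31 at n = 1024 shows why the two figures look so different; the sketch below uses the same small sequential fractions that label Figs. 3.8c and 3.9c.

# Fixed-load (Amdahl, Eq. 3.14) versus fixed-time (Gustafson, Eq. 3.31) speedup.
def amdahl(n, alpha):
    return n / (1 + (n - 1) * alpha)

def gustafson(n, alpha):
    return n - alpha * (n - 1)

n = 1024
for alpha in (0.0, 0.01, 0.02, 0.03, 0.04):
    print(f"alpha={alpha:.2f}  Amdahl S_n = {amdahl(n, alpha):7.1f}   "
          f"Gustafson S'_n = {gustafson(n, alpha):7.1f}")
# Gustafson's speedup declines only linearly (1024, 1013.8, 1003.5, ...),
# while Amdahl's speedup collapses toward 1/alpha.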
or I/O-bound. This is especially true in a multicomputer system using distributed memory. The local memory attached to each node may be relatively small. Therefore, each node can handle only a small subproblem.

When a large number of nodes are used collectively to solve a single large problem, the total memory capacity increases proportionally. This enables the system to solve a scaled problem through program partitioning or replication and domain decomposition of the data set.

Instead of keeping the execution time fixed, one may want to use up all the increased memory by scaling the problem size further. In other words, if you have adequate memory space and the scaled problem meets the time limit imposed by Gustafson's law, you can further increase the problem size, yielding an even better or more accurate solution.

A memory-bounded model was developed under this philosophy. The idea is to solve the largest possible problem, limited only by the available memory capacity. This model may result in an increase in execution time to achieve scalable performance.
Fixed-Memory Speedup Let M be the memory requirement of a given problem and W be the computational workload. These two factors are related to each other in various ways, depending on the address space and architectural constraints. Let us write W = g(M) or M = g⁻¹(W), where g⁻¹ is the inverse of g.

In a multicomputer, the total memory capacity increases linearly with the number of nodes available. We write W = Σ_{i=1}^{m} W_i as the workload for sequential execution of the program on a single node, and W* = Σ_{i=1}^{m*} W*_i as the scaled workload for execution on n nodes, where m* is the maximum DOP of the scaled problem. The memory requirement for an active node is thus bounded by g⁻¹(Σ_{i=1}^{m} W_i).

A fixed-memory speedup is defined below, similarly to that in Eq. 3.29:

S*_n = (Σ_{i=1}^{m*} W*_i) / (Σ_{i=1}^{m*} (W*_i/i)⌈i/n⌉ + Q(n))    (3.32)
The workload for sequential execution on a single processor is independent of the problem size or system size. Therefore, we can write W_1 = W'_1 = W*_1 in all three speedup models. Let us consider the special case of two operational modes: sequential versus perfectly parallel execution. The enhanced memory is related to the scaled workload by W*_n = g*(nM), where nM is the increased memory capacity for an n-node multicomputer.

Furthermore, we assume g*(nM) = G(n)g(M) = G(n)W_n, where W_n = g(M) and g* is a homogeneous function. The factor G(n) reflects the increase in workload as memory increases n times. Now we are ready to rewrite Eq. 3.32 under the assumption that W_i = 0 if i ≠ 1 and i ≠ n and that Q(n) = 0:

S*_n = (W*_1 + W*_n) / (W*_1 + W*_n/n) = (W_1 + G(n)W_n) / (W_1 + G(n)W_n/n)    (3.33)
Rigorously speaking, the above speedup model is valid under two assumptions: (1) The collection of all memory forms a global address space (in other words, we assume a shared distributed memory space); and (2) All available memory areas are used up for the scaled problem. There are three special cases where Eq. 3.33 can apply:

Case 1: G(n) = 1. This corresponds to the case where the problem size is fixed. Thus, the fixed-memory speedup becomes equivalent to Amdahl's law; i.e., Eqs. 3.27 and 3.33 are equivalent when a fixed workload is given.

Case 2: G(n) = n. This applies to the case where the workload increases n times when the memory is increased n times. Thus, Eq. 3.33 is identical to Gustafson's law (Eq. 3.30) with a fixed execution time.

Case 3: G(n) > n. This corresponds to the situation where the computational workload increases faster than the memory requirement. Thus, the fixed-memory model (Eq. 3.33) will likely give a higher speedup than the fixed-time speedup (Eq. 3.30).
The above analysis leads to the following conclusions: Amdahl's law and Gustafson's law are special
cases of the fixed-memory model. When computation grows faster than the memory requirement, as is often
true in some scientific simulation and engineering applications, the fixed-memory model (Fig. 3.10) may
yield an even higher speedup (i.e., S_n^* ≥ S_n' ≥ S_n) and better resource utilization.
The fixed-memory model also assumes a scaled workload and allows an increase in execution time. The
increase in workload (problem size) is memory-bound. The growth in machine size is limited by increasing
communication demands as the number of processors becomes large. The fixed-time model can be moved
very close to the fixed-memory model if available memory is fully utilized.
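To make the relationship among the three laws concrete, the following Python sketch evaluates the two-mode special cases side by side: Eq. 3.33 with G(n) = 1 reduces to Amdahl's fixed-load speedup, G(n) = n reproduces Gustafson's fixed-time speedup, and a faster-growing G(n) illustrates Case 3. The workload split W_1 = 0.3, W_n = 0.7 and the choice G(n) = n^1.5 are arbitrary illustrations, not values taken from the text.

def sun_ni(w1, wn, n, G):
    # Eq. 3.33: S_n* = (W1 + G(n) Wn) / (W1 + G(n) Wn / n)
    return (w1 + G(n) * wn) / (w1 + G(n) * wn / n)

w1, wn, n = 0.3, 0.7, 64
amdahl    = sun_ni(w1, wn, n, lambda n: 1)         # Case 1: fixed problem size
gustafson = sun_ni(w1, wn, n, lambda n: n)         # Case 2: fixed execution time
fixed_mem = sun_ni(w1, wn, n, lambda n: n**1.5)    # Case 3: workload grows faster than memory
print(amdahl, gustafson, fixed_mem)                # roughly 3.2 < 45.1 < 60.8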
Fig. 3.10 Scaled speedup model using fixed memory: (a) scaled workload; (b) increased execution time
(Courtesy of Sun and Ni; reprinted with permission from ACM Supercomputing, 1990)
Example 3.6 Scaled matrix multiplication using global versus local computation models (Sun and Ni, 1993)
In scientific computations, a matrix often represents some discretized data continuum. Enlarging the matrix
size generally leads to a more accurate solution for the continuum. For matrices with dimension n, the number
of computations involved in matrix multiplication is 2n^3 and the memory requirement is roughly M = 3n^2.
As the memory increases n times in an n-processor multicomputer system, nM = n × 3n^2 = 3n^3. If the
enlarged matrix has a dimension of N, then 3N^2 = 3n^3. Therefore, N = n^{1.5}. Thus G(n) = n^{1.5}, and the scaled
workload is W_n^* = G(n)W_n = n^{1.5} W_n. Using Eq. 3.33, we have

S_n^* = \frac{W_1 + n^{1.5} W_n}{W_1 + n^{0.5} W_n}     (3.34)
under the global computation model illustrated in Fig. 3.11a, where all the distributed memories are used as
a common memory shared by all processor nodes.
As illustrated in Fig. 3.11b, the node memories are used locally without sharing. In such a local computation
model, G(n) = n, and we obtain the following speedup:

S_n^* = \frac{W_1 + n W_n}{W_1 + W_n}     (3.35)
Fig. 3.11 (a) Global computation with distributed shared memories; (b) local computation with private node memories
The above example illustrates Gustafson's scaled speedup for local computation. Comparing the above
two speedup expressions, we realize that the fixed-memory speedup (Eq. 3.34) may be higher than the fixed-
time speedup (Eq. 3.35). In general, many applications demand the use of a combination of local and global
addressing spaces. Data may be distributed in some nodes and duplicated in other nodes. Data duplication is
added deliberately to reduce communication demand. Speedup factors for these applications depend on the
ratio between the global and local computations.
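The two closed forms above are easy to compare numerically. The Python sketch below evaluates Eq. 3.34 (global computation, G(n) = n^1.5) against Eq. 3.35 (local computation, G(n) = n) for an assumed workload split W_1 : W_n of 1 : 9; the split and the machine sizes are purely illustrative.

def global_speedup(w1, wn, n):
    # Eq. 3.34: distributed memories shared globally, G(n) = n**1.5
    return (w1 + n**1.5 * wn) / (w1 + n**0.5 * wn)

def local_speedup(w1, wn, n):
    # Eq. 3.35: node memories used locally, G(n) = n
    return (w1 + n * wn) / (w1 + wn)

for n in (4, 16, 64):
    print(n, round(global_speedup(1.0, 9.0, n), 1), round(local_speedup(1.0, 9.0, n), 1))
# The global (fixed-memory) figures stay ahead of the local (fixed-time) ones as n grows.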
Fig. 3.12 Scalability metrics (machine size, clock rate, problem size, CPU time, I/O demand, memory capacity, communication overhead, computer cost, and programming overhead)
Clock rate (f)—the clock rate determines the basic machine cycle. We hope to build a machine with
components (processors, memory, bus or network, etc.) driven by a clock which can scale up with
better technology.
Problem size (s)—the amount of computational workload or the number of data points used to solve a
given problem. The problem size is directly proportional to the sequential execution time T(s, 1) for a
uniprocessor system because each data point may demand one or more operations.
CPU time (T)—the actual CPU time (in seconds) elapsed in executing a given program on a parallel
machine with n processors collectively. This is the parallel execution time, denoted as T(s, n), and is a
function of both s and n.
I/O demand (d)—the input/output demand in moving the program, data, and results associated with a
given application run. The I/O operations may overlap with the CPU operations in a multiprogrammed
environment.
Memory capacity (m)—the amount of main memory (in bytes or words) used in a program execution.
Note that the memory demand is affected by the problem size, the program size, the algorithms, and
the data structures used.
The memory demand varies dynamically during program execution. Here, we refer to the maximum
number of memory words demanded. Virtual memory is almost unlimited with a 64-bit address space.
It is the physical memory which may be limited in capacity.
Communication overhead (h)—the amount of time spent for interprocessor communication,
synchronization, remote memory access, etc. This overhead also includes all noncompute operations
which do not involve the CPUs or I/O devices. This overhead h(s, n) is a function of s and n and is not
part of T(s, n). For a uniprocessor system, the overhead h(s, 1) = 0.
Computer cost (c)—the total cost of hardware and software resources required to carry out the
execution of a program.
Programming overhead (p)—the development overhead associated with an application program.
Programming overhead may slow down software productivity and thus implies a high cost. Unless
otherwise stated, both computer cost and programming cost are ignored in our scalability analysis.
Depending on the computational objectives and resource constraints imposed, one can fix some of the
above parameters and optimize the remaining ones to achieve the highest performance with the lowest cost.
The notion of scalability is tied to the notions of speedup and efficiency. A sound definition of scalability
must be able to express the effects of the architecture's interconnection network, of the communication patterns
inherent to algorithms, of the physical constraints imposed by technology, and of the cost effectiveness or
system efficiency. We introduce first the notions of speedup and efficiency. Then we define scalability based
on the relative performance of a real machine compared with that of an idealized theoretical machine.
Speedup and Efficiency Revisited For a given architecture, algorithm, and problem size s, the asymptotic
speedup S(s, n) is the best speedup that is attainable, varying only the number (n) of processors. Let T(s, 1)
be the sequential execution time on a uniprocessor, T(s, n) be the minimum parallel execution time on an
n-processor system, and h(s, n) be the lump sum of all communication and I/O overheads. The asymptotic
speedup is formally defined as follows:

S(s, n) = \frac{T(s, 1)}{T(s, n) + h(s, n)}     (3.36)
The problem size is the independent parameter, upon which all other metrics are based. A meaningful
measurement of asymptotic speedup mandates the use of a good sequential algorithm, even if it is different from
the structure of the corresponding parallel algorithm. The time T(s, n) is minimal in the sense that the problem is
solved using as many processors as necessary to achieve the minimum runtime for the given problem size.
In scalability analysis, we are mainly interested in results obtained from solving large problems. Therefore,
the run times T(s, n) and T(s, 1) should be expressed using order-of-magnitude notations, reflecting the
asymptotic behavior.
The system efficiency of using the machine to solve a given problem is defined by the following ratio:

E(s, n) = \frac{S(s, n)}{n}     (3.37)

In general, the best possible efficiency is one, implying that the best speedup is linear, or S(s, n) = n.
Therefore, an intuitive definition of scalability is: A system is scalable if the system efficiency E(s, n) = 1 for
all algorithms with any number n of processors and any problem size s.
Mark Hill (1990) has indicated that this definition is too restrictive to be useful because it precludes any
system from being called scalable. For this reason, a more practical efficiency or scalability definition is
needed, comparing the performance of the real machine with respect to the theoretical PRAM model.
Scalability Definition Nussbaum and Agarwal (1991) have given the following scalability definition
based on a PRAM model. The scalability Φ(s, n) of a machine for a given algorithm is defined as the ratio of
the asymptotic speedup S(s, n) on the real machine to the asymptotic speedup S_I(s, n) on the ideal realization
of an EREW PRAM:

S_I(s, n) = \frac{T(s, 1)}{T_I(s, n)}

where T_I(s, n) is the parallel execution time on the PRAM, ignoring all communication overhead. The
scalability is defined as follows:

\Phi(s, n) = \frac{S(s, n)}{S_I(s, n)} = \frac{T_I(s, n)}{T(s, n)}     (3.38)
Intuitively, the larger the scalability, the better the performance that the given architecture can yield
running the given algorithm. In the ideal case, S_I(s, n) = n, and the scalability definition in Eq. 3.38 becomes
identical to the efficiency definition given in Eq. 3.37.
Machine architecture:   Linear array   2-D mesh   3-D mesh   Hypercube   Omega network
T(s, n):                s^{1/2}        s^{1/3}    s^{1/4}    log s       log^2 s
This calculation examines s bits, determining whether the number of bits set is even or odd using a
balanced binary tree. For this algorithm, T(s, 1) = s, T_I(s, n) = log s, and S_I(s, n) = s/log s for the ideal PRAM
machine.
On real architectures, the parity algorithm's performance is limited by the network diameter. For example,
the linear array has a network diameter equal to n − 1, yielding a total parallel running time of s/n + n. The
optimal partition of the problem is to use n = √s processors so that each processor performs the parity check
on √s bits locally. This partition gives the best match between computation costs and communication costs,
with T(s, n) = √s, S(s, n) = √s, and thus scalability Φ(s, n) = log s/√s.
The 2-D and 3-D mesh architectures use a similar partition to match their own communication structure with
the computational loads, yielding even better scalability results. It is interesting to note that the scalability
increases as the communication latency decreases in a network with a smaller diameter.
The hypercube and the Omega network provide richer communication structures (and lower diameters)
than meshes of lower dimensionality. The hypercube does as well as a PRAM for this algorithm, yielding
Φ(s, n) = 1.
The Omega network (Fig. 2.24) does not exploit locality: communication with all processors takes the
same amount of time. This loss of locality hurts its performance when compared to the hypercube, but its
lower diameter gives it better scalability than any of the meshes.
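The scalability figures quoted above follow directly from Eq. 3.38 with T_I(s, n) = log s. The Python sketch below evaluates Φ(s, n) = log s / T(s, n) for the five architectures, using the T(s, n) entries from the table; the problem size s = 2^60 is an arbitrary large value chosen so that the asymptotic ordering (hypercube, then Omega network, then the meshes, then the linear array) is visible in the printed numbers.

from math import log2, sqrt

# T(s, n) for the parity calculation on each architecture (from the table above)
T = {
    "linear array":  lambda s: sqrt(s),
    "2-D mesh":      lambda s: s ** (1/3),
    "3-D mesh":      lambda s: s ** (1/4),
    "hypercube":     lambda s: log2(s),
    "Omega network": lambda s: log2(s) ** 2,
}

def scalability(s, arch):
    # Eq. 3.38 with T_I(s, n) = log s on the ideal EREW-PRAM
    return log2(s) / T[arch](s)

s = 2 ** 60
for arch in T:
    print(f"{arch:14s} Phi = {scalability(s, arch):.2e}")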
Although performance is limited by network diameter for the above parity algorithm, for many other
algorithms the network bandwidth is the performance-limiting factor. The above analysis assumed unit
communication time between directly connected communication nodes. An architecture may be scalable for
one algorithm but unscalable for another. One must examine a large class of useful algorithms before drawing
a scalability conclusion on a given architecture.
Fig. 3.13 The peak performance (in Gflops) of various computers manufactured by Cray Research, Inc.,
NEC, Intel, and Thinking Machines Corporation(1) (Courtesy of Gordon Bell; reprinted with permission
from the Communications of the ACM, August 1992)
In 1988, the Cray Y-MP 8 delivered a peak of 2.8 Gflops. By 1991, the Intel Touchstone Delta, a 672-node
multicomputer, and the Thinking Machines CM-2, a 2K PE SIMD machine, both began to supply an order-
of-magnitude greater peak power (20 Gflops) than conventional supercomputers. By mid-1992, a completely
new generation of computers was introduced, including the CM-5 and Paragon.
(1) Thinking Machines Corporation has since gone out of business.
In the past, the IBM System/360 provided a 100:1 range of growth for its various models. DEC VAX
machines spanned a range of 1000:1 over their lifetime. Based on past experiences, Gordon Bell has identified
three objectives for designing scalable computers. Implications and case studies of these challenges will be
further discussed in subsequent chapters.
Size Scalability The study of system scalability started with the desire to increase the machine size. A size-
scalable computer is designed to have a scaling range from a small to a large number of resource components.
The expectation is to achieve linearly increased performance with incremental expansion for a well-defined
set of applications. The components include computers, processors or processing elements, memories,
interconnects, switches, cabinets, etc.
Size scalability depends on spatial and temporal locality as well as component bottlenecks. Since very large
systems have inherently longer latencies than small and centralized systems, the locality behavior of program
execution will help tolerate the increased latency. Locality will be characterized in Chapter 4. The bottleneck-
free condition demands a balanced design among processing, storage, and I/O bandwidth.
For example, since MPPs are mostly interconnected by large networks or switches, the bandwidth of the
switch should increase linearly with processor power. The I/O demand may exceed the processing bandwidth
in some real-time and large-scale applications.
The Cray Y-MP series scaled over a range of 16 processors (the C-90 model), and the current range of
Cray supercomputers offers a much larger range of scalability (see Chapter 13). The CM-2 was designed to
scale between 8K and 64K processing elements. The CM-5 scaling range was 1024 to 16K computers. The
KSR-1 had a range of 8 to 1088 processor-memory pairs. Size scalability cannot be achieved alone without
considering cost, efficiency, and programmability on a reasonable time scale.
Generation (Time) Scalability Since the basic processor nodes become obsolete every three years,
time scalability is equally as important as size scalability. Not only should the hardware technology be
scalable, such as the CMOS circuits and packaging technologies in building processors and memory chips,
but also the software/algorithms, which demand software compatibility and portability with new hardware
systems.
DEC claimed that the Alpha microprocessor was generation-scalable for 25 years. In general, all computer
characteristics must scale proportionally: processing speed, memory speed and size, interconnect bandwidth
and latency, I/O, and software overhead, in order to be useful for a given application.
Problem Scalability The problem size corresponds to the data set size. This is the key to achieving scalable
performance as the program granularity changes. A problem-scalable computer should be able to perform
well as the problem size increases. The problem size can be scaled to be sufficiently large in order to operate
efficiently on a computer with a given granularity.
Problems such as Monte Carlo simulation and ray tracing are "perfectly parallel", since their threads of
computation do not come together over long spells of computation. Such an independence among threads is
very much desired in using a scalable MPP system. In general, the problem granularity (operations on a grid
point/data required from adjacent grid points) must be greater than a machine's granularity (node operation
rate/node-to-node communication data rate) in order for a multicomputer to be effective.
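As a rough check of this granularity condition, the short Python sketch below compares the two ratios for a hypothetical grid solver in which each grid point costs 7 operations and a node holding an n^3-point subdomain exchanges 6n^2 boundary words per iteration, on nodes rated at 100 Mflops with 1-megaword/s links (the same rates used in the example that follows); the helper function names are ours, not the text's.

def problem_granularity(ops_per_node, words_exchanged):
    # operations performed per node / data words required from adjacent nodes
    return ops_per_node / words_exchanged

def machine_granularity(node_flops_per_s, link_words_per_s):
    # node operation rate / node-to-node communication data rate
    return node_flops_per_s / link_words_per_s

machine = machine_granularity(100e6, 1e6)           # 100 flops per communicated word
for n in (32, 86, 160):                             # subdomain edge length in grid points
    problem = problem_granularity(7 * n**3, 6 * n**2)
    print(n, round(problem, 1), problem > machine)  # the multicomputer is effective only when True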
where 1 ≤ i, j, k ≤ N and N is the number of grid points along each dimension. In total, there are N^3 grid points
in the problem domain to be evaluated during each iteration m, for 1 ≤ m ≤ M.
The three-dimensional domain can be partitioned into p subdomains, each having n^3 grid points, such that
pn^3 = N^3, where p is the machine size. The computations involved in each subdomain are assigned to one
node of a multicomputer. Therefore, in each iteration, each node is required to perform 7n^3 computations as
specified in Eq. 3.40.
Fig. 3.14 (a) Six cube subdomains adjacent to a cube subdomain at the center; (b) an N × N × N grid
partitioned into p subdomains, each being an n^3 cube, where p = r^3 = (N/n)^3
Each subdomain is adjacent to six other subdomains (Fig. 3.14a). Therefore, in each iteration, each node
needs to exchange (send or receive) a total of 6n^2 words of floating-point numbers with its neighbors. Assume
each floating-point number is double-precision (64 bits, or 8 bytes). Each processing node has the capability
of performing 100 Mflops (or 0.01 μs per floating-point operation). The internode communication latency is
assumed to be 1 μs (or 1 megaword/s) for transferring a floating-point number.
For a balanced multicomputer, the computation time within each node and the internode communication
latency should be equal. Thus 0.07n^3 μs of computation equals 6n^2 μs of communication latency, implying that n has to be at
least as large as 86. A node memory of capacity 86^3 × 8 bytes ≈ 640K × 8 = 5120 Kbytes ≈ 5 megabytes is needed
to hold each subdomain of data.
On the other hand, suppose each message exchange takes 2 μs (one receive and one send) per word. The
communication latency is then doubled. We desire to scale up the problem size with an enlarged local memory
of 32 megabytes. The subdomain dimension n can be extended to at most 160, because 160^3 × 8 ≈ 32
megabytes. This size of problem requires 0.3 s of computation time and 2 × 0.15 s of send and receive time.
Thus each iteration takes 0.6 (0.3 + 0.3) s, resulting in a computation rate of 50 Mflops, which is only 50%
of the peak speed of each node.
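The arithmetic of this scaled case can be replayed in a few lines of Python. The constants below restate the assumptions of the example (100 Mflops per node, 2 μs per word for a send/receive pair, 7 operations per grid point, 6n^2 boundary words per iteration); the function is a sketch for checking the numbers, not part of the original text.

FLOP_TIME = 0.01e-6      # 100 Mflops -> 0.01 microsecond per floating-point operation
WORD_TIME = 2e-6         # 2 microseconds per word (one send plus one receive)

def iteration(n):
    compute = 7 * n**3 * FLOP_TIME                 # 7 operations per grid point, n^3 points per node
    comm    = 6 * n**2 * WORD_TIME                 # 6 n^2 boundary words exchanged per iteration
    mflops  = 7 * n**3 / (compute + comm) / 1e6    # delivered rate per node
    return compute, comm, mflops

print(iteration(160))    # about 0.29 s compute + 0.31 s communication -> roughly 48 Mflops (~50% of peak)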
If the problem size n is further increased, the effective Mflops rate and efficiency will be improved. But
this cannot be achieved unless the memory capacity is further enlarged. For a fixed memory capacity, the
situation corresponds to the memory-bound region shown in Fig. 3.6c. Another risk of problem scaling is to
exacerbate the limited I/O capability, which is not demonstrated in this example.
To summarize the above studies on scalability, we realize that the machine size, problem size, and
technology scalabilities are not necessarily orthogonal to each other. They must be considered jointly. In the
next section, we will identify additional issues relating scalability studies to software compatibility, latency
tolerance, machine programmability, and cost-effectiveness.
Some Approaches In order to overcome the above difficulties, listed below are some approaches being
pursued by researchers:
Scalability analysis can be carried out either by analytical methods or through trace-driven simulation
experiments. In Chapter 9, we will study both approaches toward the development of scalable computer
architectures that match program/algorithmic behaviors. Analytical tools include the use of Markov chains,
Petri nets, or queueing models. A number of simulation packages have already been developed at Stanford
University and at MIT.
Supporting Issues Besides the emphases of scalability on machine size, problem size, and technology, we
identify below several extended areas for continued research and development:
(1) Software scalability: As the problem size scales in proportion to the increase in machine size, the
algorithms can be optimized to match the architectural constraints. Software tools are being developed
to help programmers in mapping algorithms onto a target architecture.
A perfect match between architecture and algorithm requires matching both computational and
communication patterns through performance-tuning experiments in addition to simple numerical
analysis. Optimizing compilers and visualization tools should be designed to reveal opportunities for
algorithm/program restructuring to match with the architectural growth.
(2) Reducing communication overhead: Scalability analysis should concern both useful computations
and available parallelism in programs. The most difficult part of the analysis is to estimate the
communication overhead accurately. Excessive communication overhead, such as the time required to
synchronize a large number of processors, wastes system resources. This overhead grows rapidly as
machine size and problem size increase.
Furthermore, the run-time conditions are often difficult to capture. How to reduce the growth of
communication overhead and how to tolerate the growth of memory-access latency in very large
systems are still wide-open research problems.
(3) Enhancing programmability: The computing community generally agrees that multicomputers are more
scalable; multiprocessors may be more easily programmed but are less scalable than multicomputers.
It is the centralized-memory versus distributed private-memory organization that makes the difference.
In the ideal case, we want to build machines which will retain the advantages of both architectures.
This implies a system with shared distributed memory and simplified message communication among
processor nodes. Heterogeneous programming paradigms are needed for future systems.
(4) Providing longevity and generality: Other scalability issues include longevity, which requires an
architecture with a sufficiently large address space, and generality, which supports a wide variety of
languages and binary migration of software.
Performance, scalability, programmability, and generality will be studied throughout the book for general-
purpose parallel processing applications, unless otherwise noted.
Exercises
Problem 3.1 Consider the parallel execution of the same program in Problem 1.4 on a four-processor
system with shared memory. The program can be partitioned into four equal parts for balanced execution
by the four processors. Due to the need for synchronization among the four program parts, 50000 extra
instructions are added to each divided program part. Assume the same instruction mix as in Problem 1.4 for
each divided program part.
The CPI for the memory reference (with cache miss) instructions has been increased from 8 to 12 cycles
due to contentions. The CPIs for the remaining instruction types do not change.
(a) Repeat part (a) in Problem 1.4 when the program is executed on the four-processor system.
(b) Repeat part (b) in Problem 1.4 when the program is executed on the four-processor system.
(c) Calculate the speedup factor of the four-processor system over the uniprocessor system in Problem 1.4
under the respective trace statistics.
(d) Calculate the efficiency of the four-processor system by comparing the speedup factor in part (c) with
the ideal case.

Problem 3.2 A uniprocessor computer can operate in either scalar or vector mode. In vector mode,
computations can be performed nine times faster than in scalar mode. A certain benchmark program took
time T to run on this computer. Further, it was found that 25% of T was attributed to the vector mode. In the
remaining time, the machine operated in the scalar mode.
(a) Calculate the effective speedup under the above condition as compared with the condition when the
vector mode is not used at all. Also calculate α, the percentage of code that has been vectorized in the
above program.
(b) Suppose we double the speed ratio between the vector mode and the scalar mode by hardware
improvements. Calculate the effective speedup that can be achieved.
(c) Suppose the same speedup obtained in part (b) must be obtained by compiler improvements instead of
hardware improvements. What would be the new vectorization ratio α that should be supported by the
vectorizing compiler for the same benchmark program?

Problem 3.3 Let α be the percentage of a program code which can be executed simultaneously by n
processors in a computer system. Assume that the remaining code must be executed sequentially by a single
processor. Each processor has an execution rate of x MIPS, and all the processors are assumed equally
capable.
(a) Derive an expression for the effective MIPS rate when using the system for exclusive execution of this
program, in terms of the parameters n, α, and x.
(b) If n = 16 and x = 400 MIPS, determine the value of α which will yield a system performance of
4000 MIPS.

Problem 3.4 Consider a computer which can execute a program in two operational modes: regular mode
versus enhanced mode, with a probability distribution of (α, 1 − α), respectively.
(a) If α varies between a and b, with 0 ≤ a < b ≤ 1, derive an expression for the average speedup factor
using the harmonic mean concept.
(b) Calculate the speedup factor when a → 0 and b → 1.

Problem 3.5 Consider the use of a four-processor, shared-memory computer for the execution of a
program mix. The multiprocessor can be used in four execution modes corresponding to the active use of
one, two, three, and four processors. Assume that each processor has a peak execution rate of 500 MIPS.
Let f_i be the percentage of time that i processors will be used in the above program execution, with
f_1 + f_2 + f_3 + f_4 = 1. You can assume the execution rates R_1, R_2, R_3, and R_4, corresponding to the
distribution (f_1, f_2, f_3, f_4), respectively.
(a) Derive an expression to show the harmonic mean execution rate R of the multiprocessor in terms of
f_i and R_i, for i = 1, 2, 3, 4. Also show an expression for the harmonic mean execution time T in terms
of R.
(b) What would be the value of the harmonic mean execution time T of the above program mix, given
f_1 = 0.4, f_2 = 0.3, f_3 = 0.2, f_4 = 0.1 and R_1 = 400 MIPS, R_2 = 800 MIPS, R_3 = 1100 MIPS,
R_4 = 1500 MIPS? Explain the possible causes of the observed R_i values in the above program
execution.
(c) Suppose an intelligent compiler is used to enhance the degree of parallelization in the above program
mix with a new distribution f_1 = 0.1, f_2 = 0.2, f_3 = 0.3, f_4 = 0.4. What would be the harmonic mean
execution time of the same program under the same assumption on {R_i} as in part (b)?

Problem 3.6 Explain the applicability of and the restrictions involved in using Amdahl's law, Gustafson's
law, and Sun and Ni's law to estimate the speedup performance of an n-processor system compared with that
of a single-processor system. Ignore all communication overheads.

Problem 3.7 The following Fortran program is to be executed on a uniprocessor, and a parallel version is
to be executed on a shared-memory multiprocessor.

L1:       Do 10 I = 1, 1024
L2:          SUM(I) = 0
L3:          Do 20 J = 1, I
L4:    20       SUM(I) = SUM(I) + I
L5:    10 Continue

Suppose statements 2 and 4 each take two machine cycle times, including all CPU and memory-access
activities. Ignore the overhead caused by the software loop control (statements L1, L3, and L5) and all other
system overheads and resource conflicts.
(a) What is the total execution time of the program on a uniprocessor?
(b) Divide the outer loop iterations among 32 processors with prescheduling as follows: Processor 1
executes the first 32 iterations (I = 1 to 32), processor 2 executes the next 32 iterations (I = 33 to 64),
and so on. What are the execution time and speedup factors compared with part (a)? (Note that the
computational workload, dictated by the J-loop, is unbalanced among the processors.)
(c) Modify the given program to facilitate a balanced parallel execution of all the computational workload
over 32 processors. By a balanced load, we mean an equal number of additions assigned to each
processor with respect to both loops.
(d) What is the minimum execution time resulting from the balanced parallel execution on 32 processors?
What is the new speedup over the uniprocessor?

Problem 3.8 Consider the multiplication of two n × n matrices A = (a_ij) and B = (b_ij) on a scalar
uniprocessor and on a multiprocessor, respectively. The matrix elements are floating-point numbers, initially
stored in the main memory in row-major order. The resulting product matrix C = (c_ij), where C = A × B,
should be stored back to memory in contiguous locations.
Assume a 2-address instruction format and an instruction set of your choice. Each load/store instruction
takes, on the average, 4 cycles to complete. All ALU operations must be done sequentially on the processor,
with 2 cycles if no memory reference is required in the instruction. Otherwise, 4 cycles are added for each
memory reference to fetch an operand. Branch-type instructions require, on the average, 2 cycles.
(a) Write a minimal-length assembly-language program to perform the matrix multiplication on a scalar
processor with a load-store architecture and floating-point hardware.
(b) Calculate the total instruction count, the total number of cycles needed for the program execution, and
the average cycles per instruction (CPI).
(c) What is the MIPS rate of this scalar machine, if the processor is driven by a 400-MHz clock?
(d) Suggest a partition of the above program to execute the divided program parts on an N-processor
shared-memory system with minimum time. Assume n = 1000N. Estimate the potential speedup of the
multiprocessor over the uniprocessor, assuming the same type of processors is used in both systems.
Ignore the memory-access conflicts, synchronization, and other overheads.
(e) Sketch a scheme to perform distributed matrix computations with distributed data sets on an N-node
multicomputer with distributed memory. Each node has a computer equivalent to the scalar processor
used in part (a).
(f) Specify the message-passing operations required in part (e). Suppose that, on the average, each
message passing requires 100 processor cycles to complete. Estimate the total execution time on the
multicomputer for the distributed matrix multiplication. Make appropriate assumptions if needed in
your timing analysis.

Problem 3.9 Consider the interleaved execution of the four programs in Problem 1.6 on each of the three
machines. Each program is executed in a particular mode with the measured MIPS rating.
(a) Determine the arithmetic mean execution time per instruction for each machine executing the combined
workload, assuming equal weights for the four programs.
(b) Determine the harmonic mean MIPS rate of each machine.
(c) Rank the machines based on the harmonic mean performance. Compare this ranking with that obtained
in Problem 1.6.

Problem 3.10 Answer or prove the following statements related to speedup performance laws:
(a) Derive the fixed-memory speedup expression S_n* in Eq. 3.33 under reasonable assumptions.
(b) Derive Amdahl's law (S_n in Eq. 3.14) as a special case of the S_n* expression.
(d) Prove the relation S_n* ≥ S_n' ≥ S_n for solving the same problem on the same machine under different
assumptions.

Problem 3.11 Prove the following relations among the speedup S(n), efficiency E(n), utilization U(n),
redundancy R(n), and quality Q(n) of a parallel computation, based on the definitions given by Lee (1980):
(a) Prove 1/n ≤ E(n) ≤ U(n) ≤ 1, where n is the number of processors used in the parallel computation.
(b) Prove 1 ≤ R(n) ≤ 1/E(n) ≤ n.
(c) Prove the expression for Q(n) in Eq. 3.19.
(d) Verify the above relations using the hypothetical workload in Example 3.3.

Problem 3.12 Repeat Example 3.7 for sorting s numbers on five different n-processor machines using the
linear array, 2-D mesh, 3-D mesh, hypercube, and Omega network as interprocessor communication
architectures, respectively.
(a) Show the scalability of the five architectures as compared with the EREW-PRAM model.
(b) Compare the results obtained in part (a) with those in Example 3.7. Based on these two benchmark
results, rank the relative scalability of the five architectures. Can the results be generalized to the
performance of other algorithms?

Problem 3.13 Consider the execution of two benchmark programs. The performance of three computers
running these two benchmarks is given below:

Benchmark     Millions of floating-    Computer 1    Computer 2    Computer 3
              point operations         T_1 (sec)     T_2 (sec)     T_3 (sec)
Problem 1     100                      1             10            20
Problem 2     100                      1000          100           20
(a) Calculate R_a and R_h for each computer under the equal-weight assumption f_1 = f_2 = 0.5.
(b) When benchmark 1 has a constant R_1 = 10 Mflops performance across the three computers, plot R_a
and R_h as a function of R_2, which varies from 1 to 100 Mflops, under the assumption f_1 = 0.8 and
f_2 = 0.2.
(c) Repeat part (b) for the case f_1 = 0.2 and f_2 = 0.8.
(d) From the above performance results under different conditions, can you draw a conclusion regarding
the relative performance of the three machines?

Problem 3.14 In Example 3.5, four parallel algorithms are mentioned for the multiplication of s × s
matrices. After reading the original papers describing these algorithms, prove the following communication
overheads on the target machine architectures:
(a) Prove that h(s, n) = O(n log n + s^2 √n) when mapping the Fox-Otto-Hey algorithm on a √n × √n torus.
(b) Prove that h(s, n) = O(n^{4/3} + n log n + s^2 n^{1/3}) when mapping Berntsen's algorithm on a hypercube
with n = 2^{3k} nodes, where k ≤ (1/2) log s.
(c) Prove that h(s, n) = O(n log n + s^3) when mapping the Dekel-Nassimi-Sahni algorithm on a hypercube
with n = s^3 = 2^{3k} nodes.

Problem 3.15 Xian-He Sun (1992) has introduced an isospeed concept for scalability analysis. The concept
is to maintain a fixed speed for each processor while increasing the problem size. Let W and W' be two
workloads corresponding to two problem sizes. Let N and N' be two machine sizes (in terms of the number
of processors). Let T_N and T_N' be the parallel execution times using N and N' processors, respectively.
The isospeed condition is achieved when W/(N T_N) = W'/(N' T_N'). The isoefficiency concept defined by
Kumar and Rao (1987) is achieved by maintaining a fixed efficiency through S_N(W)/N = S_N'(W')/N', where
S_N(W) and S_N'(W') are the corresponding speedup factors.
Prove that the two concepts are indeed equivalent if (i) the speedup factors are defined as the ratio of
parallel speed R_N to sequential speed R_1 (rather than as the ratio of sequential execution time to parallel
execution time), and (ii) R_1(W) = R_1(W'). In other words, isoefficiency is identical to isospeed when the
sequential speed is fixed as the problem size is increased.