
ACA NOTES

Parallel Computing

A common way of satisfying the described needs is to use parallel computers. A parallel
computer consists of two or more processing units, which operate more or less
independently in parallel. Using such a computer, a problem can (theoretically) be
divided into n subproblems (where n is typically the number of available processing
units), and each part of the problem is solved by one of the processing units
concurrently. Ideally, the completion time of the computation will be t/n, where t is the
completion time for the problem on a computer containing only one processing unit. In
practice, a value of t/n is rarely achieved, for several reasons: sometimes a problem
cannot be divided exactly into n independent parts, there is usually a need for
communication between the parallel executing processes (e.g. for data exchange,
synchronization, etc.), some problems contain parts that are inherently sequential and
therefore cannot be processed in parallel, and so on. This leads us to the term scalability.
Scalability is a measure that specifies whether or not a given problem can be solved
faster as more processing units are added to the computer. This applies to hardware and
software.
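The effect of an inherently sequential part can be made concrete with a small calculation. The sketch below is a minimal Python example (the sequential fraction s = 0.05 is an arbitrary illustrative value, not taken from the text) comparing the ideal speedup n with the speedup predicted by Amdahl's law.

```python
def ideal_speedup(n):
    # Perfect division of the work among n processing units: t / (t/n) = n.
    return n

def amdahl_speedup(n, s):
    # s is the fraction of the work that is inherently sequential (0 <= s <= 1).
    # Only the parallel part (1 - s) is divided among the n processing units.
    return 1.0 / (s + (1.0 - s) / n)

for n in (1, 4, 16, 64):
    print(n, ideal_speedup(n), round(amdahl_speedup(n, s=0.05), 2))
```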

Scalability
A computer system, including all its hardware and software resources, is called scalable
if it can scale up (i.e., improve its resources) to accommodate ever-increasing
performance and functionality demand and/or scale down (i.e., decrease its resources) to
reduce cost.

Parallel computers can be classified by various aspects of their architecture. Here we
present three different classification schemes. In the first, parallel computers are
distinguished by the way the processors are connected with the memory. The second
scheme (called "Flynn's Classification Scheme") takes the number of instruction streams
and the number of data streams into account. Finally, the third scheme (ECS, the
"Erlangen Classification System") focuses on the number of control units, functional
units, and the word size of the computer.

Memory-Processor Organization

In terms of memory-processor organization, three main groups of architectures can be
distinguished. These are

• shared memory architectures,


• distributed memory architectures, and
• distributed shared memory architectures
Shared Memory Architectures

The main property of shared memory architectures is that all processors in the system
have access to the same memory; there is only one global address space. Typically, the
main memory consists of several memory modules (whose number is not necessarily
equal to the number of processors in the computer, see Figure 2-1). In such a system,
communication and synchronization between the processors are done implicitly via shared
variables.

The processors are connected to the memory modules via some kind of interconnection
network. This type of parallel computer is also called UMA, which stands for uniform
memory access, since all processors access every memory module in the same way
concerning latency and bandwidth.

A big advantage of shared memory computers is that programming them is very
convenient: all data are accessible by all processors, so there is no need to copy data.
Furthermore, the programmer does not have to take care of synchronization, since this is
carried out by the system automatically (which makes the hardware more complex and
hence more expensive). However, it is very difficult to obtain high levels of parallelism
with shared memory machines; most systems do not have more than 64 processors. This
limitation stems from the fact that a centralized memory and the interconnection network
are both difficult to scale once built.
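As a rough illustration of communication through shared variables, the following minimal Python sketch lets several threads update one shared variable; the explicit lock stands in for the synchronization support that, as noted above, a real shared memory machine provides automatically. Names such as `counter` and `worker` are purely illustrative.

```python
import threading

counter = 0                       # shared variable, visible to every thread
lock = threading.Lock()           # explicit synchronization for this sketch

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:                # all "processors" update the same memory location
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                    # 4000: no data had to be copied between workers
```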

Distributed Memory Architectures

In the case of a distributed memory computer (in the literature also called a
multicomputer), each processor has its own, private memory. There is no common
address space, i.e. the processors can access only their own memories. Communication
and synchronization between the processors are done by exchanging messages over the
interconnection network.

The figure shows the organization of the processors and memory modules in a distributed
memory computer. In contrast to the shared memory architecture, a distributed memory
machine scales very well, since all processors have their own local memory, which means
that there are no memory access conflicts. Using this architecture, massively parallel
processors (MPPs) can be built, with up to several hundred or even thousands of
processors.
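A minimal sketch of the message-passing style of a distributed memory machine, using Python processes and pipes in place of a real interconnection network (the data chunks and function names are illustrative only):

```python
from multiprocessing import Process, Pipe

def worker(conn, my_data):
    # Each process owns its data privately; the only way to share a result
    # is to send it as a message over the "interconnect" (here, a pipe).
    conn.send(sum(my_data))
    conn.close()

if __name__ == "__main__":
    chunks = [range(0, 50), range(50, 100)]
    parents, procs = [], []
    for chunk in chunks:
        parent, child = Pipe()
        p = Process(target=worker, args=(child, list(chunk)))
        p.start()
        parents.append(parent)
        procs.append(p)
    total = sum(conn.recv() for conn in parents)   # gather the partial sums
    for p in procs:
        p.join()
    print(total)                                   # 4950
```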

Typical representatives of a pure distributed memory architecture are clusters of
computers, which are becoming more and more important nowadays. In a cluster, each
node is a complete computer, and these computers are connected through a low-cost
commodity network (e.g. Ethernet, Myrinet, etc.). The big advantage of clusters compared
to MPPs is that they have a much better cost/performance ratio.

Distributed Shared Memory Architectures

To combine the advantages of the architectures described above, ease of programming on
the one hand and high scalability on the other, a third kind of architecture has been
established: distributed shared memory machines. Here, each processor has its own local
memory, but, contrary to the distributed memory architecture, all memory modules form
one common address space, i.e. each memory cell has a system-wide unique address. In
order to avoid the disadvantage of shared memory computers, namely the low scalability,
each processor uses a cache, which keeps the number of memory access conflicts and the
network contention low. However, the usage of caches introduces a number of problems,
for example how to keep the data in the memory and the copies in the caches up to date.
This problem is solved by using sophisticated cache coherence and consistency protocols.
A detailed description of the most important protocols can be found in the literature.
Interconnection Networks

Since the processors in a parallel computer need to communicate in order to solve a given
problem, there is a need for some kind of communication infrastructure, i.e. the
processors need to be connected in some way. Basically, there are two kinds of
interconnection networks: static and dynamic. In case of a static interconnection network,
all connections are fixed, i.e. the processors are wired directly, whereas in the latter case
there are switches in between. The decision whether to use a static or dynamic
interconnection network depends on the kind of problem that should be solved with the
computer. Generally, static topologies are suitable for problems whose communication
patterns can be predicted reasonably well, whereas dynamic topologies (switching
networks), though more expensive, are suitable for a wider class of problems [1].

In the following, we will give a description of some important static and dynamic
topologies, including routing protocols.

Static Topologies

Descriptions

Meshes and Rings

The simplest - and cheapest - way to connect the nodes of a parallel computer is to use a
one-dimensional mesh. Each node has two connections, boundary nodes have one. If the
boundary nodes are connected to each other, we have a ring, and all nodes have two
connections. The one-dimensional mesh can be generalized to a k-dimensional mesh,
where each node (except boundary nodes) has 2k connections. Again, boundary nodes
can be connected, but there is no general consensus on how to treat the boundary nodes.
However, this type of topology is not suitable for building large-scale computers, since the
maximum message latency, that is, the maximum delay of a message from one of the N
processors to another, is O(N^(1/k)); this is bad for two reasons: firstly, there is a wide
range of latencies (the latency between neighbouring processors is much lower than
between non-neighbours), and secondly the maximum latency grows with the number of
processors.

Stars

In a star topology there is one central node, to which all other nodes are connected; each
node has one connection, except the centre node, which has N-1 connections.

Stars are also not suitable for large systems, since the centre node will become a
bottleneck with increasing number of processors.
Hypercubes

The hypercube topology is one of the most popular and is used in many large-scale systems.
A k-dimensional hypercube has 2^k nodes, each with k connections. In the figure a four-
dimensional hypercube is displayed.

Hypercubes scale very well; the maximum latency in a k-dimensional hypercube is
log2 N, with N = 2^k. An important property of hypercubes is the relationship between
the node numbers and which nodes are connected together. The rule is that any two
nodes in the hypercube whose binary representations differ in exactly one bit are
connected together. For example, in a four-dimensional hypercube, node 0 (0000) is
connected to node 1 (0001), node 2 (0010), node 4 (0100) and node 8 (1000). This
numbering scheme is called the Gray code scheme.
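The one-bit rule can be checked directly on the node numbers; the short Python sketch below is only an illustration of the property, not part of any routing hardware.

```python
def connected(a, b):
    # Two hypercube nodes are linked iff their binary labels differ in exactly one bit.
    diff = a ^ b
    return diff != 0 and (diff & (diff - 1)) == 0   # power of two => single bit set

print([n for n in range(16) if connected(0, n)])    # [1, 2, 4, 8] in a 4-D cube
```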

Routing

Meshes and Rings

Typically, in meshes the so-called dimension-order routing technique is used. That is,
routing is performed in one dimension at a time. In a three-dimensional mesh, for
example, a message travelling from node (a,b,c) to node (x,y,z) would be moved along
the first dimension to node (x,b,c), then along the second dimension to node (x,y,c), and
finally along the third dimension to the destination node (x,y,z).
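A minimal sketch of dimension-order routing in a three-dimensional mesh, assuming integer coordinates and no wraparound links (function and variable names are illustrative):

```python
def dimension_order_route(src, dst):
    # Correct one coordinate at a time: first dimension, then second, then third.
    path = [src]
    cur = list(src)
    for dim in range(len(src)):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path

print(dimension_order_route((0, 0, 0), (2, 1, 1)))
# [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 1, 1)]
```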

Stars

Routing in stars is trivial. If one of the communicating nodes is the centre node, then the
path is just the edge connecting them. If not, the message is routed from the source node
to the centre node, and from there to the destination node.
Hypercubes

A k-dimensional hypercube is nothing other than a k-dimensional mesh with only two
nodes in each dimension, and thus the routing algorithm is the same as for meshes, apart
from one difference: the path from node A to node B is calculated by simply computing
the exclusive-or X = A ⊕ B of the binary representations of nodes A and B. If the i-th
bit in X is '1', the message is moved to the neighbouring node in the i-th dimension. If
the i-th bit is '0', the message is not moved in that dimension. This means that it takes at most
log2 N steps for a message to reach its destination (where N is the number of nodes in the
hypercube).
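The XOR rule can be written down directly; the following Python sketch (illustrative only) routes a message dimension by dimension, correcting one differing bit per hop:

```python
def hypercube_route(a, b, k):
    # XOR of the node labels marks the dimensions in which the message must move.
    path = [a]
    cur = a
    x = a ^ b
    for i in range(k):
        if x & (1 << i):            # bit i set: cross the link in dimension i
            cur ^= (1 << i)
            path.append(cur)
    return path

print([format(n, "04b") for n in hypercube_route(0b0000, 0b1011, 4)])
# ['0000', '0001', '0011', '1011']  -- at most k = log2(N) hops
```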

Dynamic Topologies

Single-Stage Networks

Buses and crossbars are the two main representatives of this class. A bus is the simplest
way to connect a number of processors with each other: all processors are simply
connected to one wire. This makes communication and especially message routing very
simple. The drawback of this type of network is that the available bandwidth is inversely
proportional to the number of connected processors. This means that buses are good only
for small networks with a maximum of about 10 processors.

The other extreme in terms of complexity is the crossbar network. With a crossbar, full
connectivity is given, i.e. all processors can communicate with each other simultaneously
without reduction of bandwidth. In the figure the connection of n processors with m memory
modules (as in a shared memory system) is shown. Crossbars can certainly also be used
to connect processors with each other. In that case the memory modules are connected
directly to the processors (which results in a distributed memory system), and the lines
that were connected to the memory modules Mi are now connected to the processors Pi.
To connect n processors to n memory modules, n^2 switches are needed. Consequently,
crossbar networks cannot be scaled to arbitrary size. Today's commercially available
crossbars can connect up to 256 units.

Multi-Stage Networks

Multi-stage networks are based on the so-called shuffle-exchange switching element,
which is basically a 2 x 2 crossbar. Multiple layers of these elements are connected and
form the network. Depending on the way these elements are connected, the following
topologies can be distinguished:
• Banyan
• Baseline
• Cube
• Delta
• Flip
• Indirect cube
• Omega

As an example of a multistage network, an 8 x 8 Benes network is shown in the figure.

Summary
The networks can be classified as static or dynamic. Static interconnection networks are
mainly used in message-passing architectures; the following types are commonly defined:

• completely-connected network.
• star-connected network.
• linear array or ring of processors.
• mesh network (in 2D or 3D). Each processor has a direct link to four/six (in
2D/3D) neighbour processors. An extension of this kind of network is the wraparound
mesh or torus. Commercial examples are the Intel Paragon XP/S and Cray T3D/E.
These examples also cover another class, namely the direct network topology.
• tree network of processors. The communication bottleneck likely to occur in large
configurations can be alleviated by increasing the number of communication links
for processors closer to the root, which results in the fat-tree topology, efficiently
used in the TMC CM5 computer. The CM5 could also serve as an example of an
indirect network topology.
• hypercube network. Classically this is a multidimensional mesh of processors
with exactly two processors in each dimension. An example of such a system is
the Intel iPSC/860 computer. Some newer projects incorporate the idea of several
processors in each node, which results in a fat hypercube, i.e. an indirect
network topology. An example is the SGI/Cray Origin2000 computer.

Dynamic interconnection networks implement one of four main alternatives:

• bus-based networks - the simplest and most cost-effective solution when a
moderate number of processors is involved. Its main drawbacks are the bottleneck to
the memory when the number of processors becomes large, and the fact that the bus
is a single point of failure. To overcome these problems, several parallel buses are
sometimes incorporated. The classical example of such a machine is the SGI Power
Challenge computer with a packet data bus.

Table 1: Properties of various types of multiprocessor interconnections

Property         Bus     Crossbar   Multistage
Speed            low     high       high
Cost             low     high       moderate
Reliability      low     high       high
Configurability  high    low        moderate
Complexity       low     high       moderate
• crossbar switching networks, which employ a grid of switching elements. The
network is nonblocking, since the connection of a processor to a memory bank
does not block the connection of any other processor to any other memory bank.
In spite of their high speed, their use is limited, due to the nonlinear complexity
(O(p^2), where p is the number of processors) and the cost (cf. Table 1). They are
applied mostly in multiprocessor vector computers (like the Cray YMP) and in
multiprocessors with multilevel interconnections (e.g. HP/Convex Exemplar SPP).
One outstanding example is the Fujitsu VPP500, which incorporates a 224 x 224
crossbar switch.
• multistage interconnection networks form the most advanced pure solution,
which lies between the two extremes (Table 1). A typical example is the omega
network, which consists of log2 p stages, where p is the number of inputs and outputs
(usually the number of processors and of memory banks). Its complexity is
(p/2) log2 p switching elements, less than the p^2 of the crossbar switch. However,
in the omega network some memory accesses can be blocked. Although machines
with this kind of interconnection offer a virtual global memory programming model
and ease of use, they are still not very popular. Examples from the past include the
BBN Butterfly and IBM RP-3 computers; at present the IBM RS6K SP incorporates
multistage interconnections with the Vulcan switch.
• multilevel interconnection network seems to be a relatively recent development.
The idea comes directly from clusters of computers and consists of two or more
levels of connections with different aggregated bandwidths. Typical examples are:
SGI/Cray Origin2000, IBM RS6K SP with PowerPC604 SMP nodes and
HP/Convex Exemplar. This kind of architecture is getting the most interest at
present.

Hierarchical Memory Technology

The hierarchical arrangement of storage in current computer architectures is called the
memory hierarchy. It is designed to take advantage of memory locality in computer
programs. Each level of the hierarchy has higher bandwidth, smaller size, and lower
latency than the levels below it.

LEVELS in Memory Hierarchy

A ranking of computer memory devices, with devices having the fastest access time at
the top of the hierarchy, and devices with slower access times but larger capacity and
lower cost at lower levels.

Most modern CPUs are so fast that for most program workloads, the locality of reference
of memory accesses and the efficiency of the caching and memory transfer between
different levels of the hierarchy are the practical limitation on processing speed. As a
result, the CPU spends much of its time idling, waiting for memory I/O to complete.

The memory hierarchy in most computers is:


• Processor registers – fastest possible access (usually 1 CPU cycle), only hundreds
of bytes in size.
• Level 1 (L1) cache – often accessed in just a few cycles, usually tens of kilobytes.
• Level 2 (L2) cache – higher latency than L1 by 2× to 10×, often 512 KB or more.
• Main memory (DRAM) – may take hundreds of cycles, but can be multiple
gigabytes. Access times may not be uniform, in the case of a NUMA machine.
• Disk storage – millions of cycles latency, but very large.
• Tertiary storage – several seconds latency, can be huge.

The various major units in a typical memory system can be viewed as forming a
hierarchy of memories (M1,M2,...,Mn) in which each member Mi is in a sense subordinate
to the next highest member Mi-1 of the hierarchy.

Memory Hierarchy MANAGEMENT

Modern programming languages mainly assume two levels of memory, main memory
and disk storage, though in assembly language, and in inline assembler in languages such
as C, registers can be directly accessed.

Taking optimal advantage of the memory hierarchy requires the cooperation of
programmers, hardware, and compilers (as well as underlying support from the operating
system):

• Programmers are responsible for moving data between disk and memory through
file I/O.
• Hardware is responsible for moving data between memory and caches.
• Optimizing compilers are responsible for generating code that, when executed,
will cause the hardware to use caches and registers efficiently.

Memory Hierarchy: TERMINOLOGY


• Hit: the data appears in some block in the cache (example: Block X)
– Hit Rate: the fraction of memory accesses found in the cache
– Hit Time: the time to access the upper level, which consists of the RAM access
time + the time to determine hit/miss

• Miss: the data needs to be retrieved from a block in the main memory (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: the time to replace a block in the cache + the time to deliver the
block to the processor
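These terms combine into the usual average (effective) memory access time formula, AMAT = hit time + miss rate x miss penalty. The sketch below is a minimal Python illustration with made-up cycle counts, not measurements of any particular machine:

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    # Effective access time of one cache level in front of a slower memory.
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers only (in cycles).
print(average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=100))   # 6.0
```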

Memory Hierarchy – PARAMETERS & PROPERTIES

The 5 parameters associated with memory technologies arranged in a hierarchy:

1) Access Time: Time for the CPU to fetch a value from memory -- including delays
through any intermediate levels.

2) Memory Size: The amount of memory of a given type in a system.

3) Cost Per Unit (byte): Cost per unit times size roughly equals total cost.

4) Transfer bandwidth: Units (bytes) per second transferred to the next level.

5) Unit of transfer: Number of units moved between adjacent levels in a single move.

The 3 properties of a memory hierarchy:

1) Inclusion: If a value is found at one level, it is present at all of the levels below it.
2) Coherence: The copies at all of the levels are consistent.

3) Locality: Programs access a restricted portion of their address space in any time
window.

Obviously, none of these is strictly true. Most hierarchies are inclusive from the registers
to the main memory (although we could imagine a multi-level cache that skips a level on
loading, and only copies out to the lower level when writing back). However, most tape
units do not spool to disk before going to main memory -- they are DMA devices, just
like the disks.

INCLUSION Property

The Inclusion property is stated by the following set inclusion relations among n memory
levels.

M1 ⊂ M2 ⊂ M3 ⊂ … ⊂ Mn

Here :
M1,M2,M3 are memory levels.
n is the number of levels.

The above equation signifies that if a value is found at one level in the memory hierarchy,
it is present at all of the memory levels below it.

COHERENCE Property

Coherence requires that copies of the same information items be consistent at different
memory levels.

The coherence property extends all the way from the cache at M1 to the outermost
memory Mn.

2 strategies to maintain the coherence in a memory hierarchy are:

1) Write Through (WT)


2) Write Back (WB)
LOCALITY Property

Locality is entirely program-dependent. For example, LISP programs have logical
locality that does not correspond to physical address locality. Most caches assume array-
type data access and sequential code. The book identifies three aspects of this form of
locality:

1) Temporal locality: Recently accessed items tend to be accessed again in the near
future.

2) Spatial locality: Accesses are clustered in the address space.

3) Sequential locality: Instructions tend to be accessed in sequential memory locations.

NEED of Multilevel Hierarchy


The current trends in hardware technologies show a constantly increasing performance
gap between the speed of CPUs (processors) and main memory. This gap results in the
need for large but fast memory caches. If we had only one very big (L1) cache, then we
would be forced to have longer clock cycles. Thus, we overcome this problem by adding
additional levels of larger and slower memory caches between the CPU and the main
memory, as shown in Figure 1.

Figure 1: Multi-level cache hierarchy (CPU, L1, additional levels L2 and L3, main memory).

The main reason for using multi-level caching is the need to reduce the penalty for
cache misses. In particular, the cache that is the smallest and closest to the
processor can operate at a very high frequency, equal to the frequency of the processor. The
lower-level caches are further away from the processor and operate at lower frequencies,
but their access time is still much smaller than the corresponding access time of the main
memory. These caches are big enough to contain data, effectively reducing the need for
main memory accesses (they have large hit ratios). It is obvious that by using this
technique the average memory access time is reduced.
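The benefit of the extra levels can be illustrated with the average access time of a two-level cache in front of main memory; the figures in the Python sketch below are illustrative only:

```python
def two_level_amat(t_l1, m_l1, t_l2, m_l2, t_mem):
    # An L1 miss goes to L2; only references that miss in both levels
    # pay the main-memory latency.
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem)

# Illustrative cycle counts, chosen only to show the effect of the extra level.
print(two_level_amat(t_l1=1, m_l1=0.05, t_l2=10, m_l2=0.20, t_mem=200))  # 3.5
```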

INTRODUCTION

A vector processor, or array processor, is a CPU design where the instruction
set includes operations that can perform mathematical operations on multiple data
elements simultaneously. This is in contrast to a scalar processor, which handles
one element at a time using multiple instructions. The vast majority of CPUs are
scalar. Vector processors were common in the scientific computing area, where
they formed the basis of most supercomputers through the 1980s and into the
1990s, but general increases in performance and processor design saw the near
disappearance of the vector processor as a general-purpose CPU.

A vector is a set of scalar data items, all of the same type, stored in memory.
Usually, the vector elements are ordered to have a fixed addressing increment
between successive elements, called the stride.
Vector processing occurs when arithmetic or logical operations are applied to
vectors. It is distinguished from scalar processing, which operates on one or one
pair of data. The conversion from scalar code to vector code is called
vectorization. Vector processing is faster and more efficient than scalar processing.
It reduces the software overhead incurred in the maintenance of looping control and
reduces memory-access conflicts.

VECTOR INSTRUCTION TYPES

1. Vector-vector instructions – one or two vector operands are fetched from the
respective vector registers, enter through a functional pipeline unit and produce
results in another vector register. They are defined by the following mappings:
f1 : Vi → Vj
f2 : Vj x Vk → Vi
Examples are V1 = sin(V2) and V3 = V1 + V2 for the mappings f1 and f2,
respectively.

2. Vector-scalar instructions – a scalar s is operated on with the elements of a
vector V to produce another vector. It is defined as:
f3 : s x Vi → Vj

3. Vector-memory instructions – these correspond to vector load or vector store,
element by element, between the vector register (V) and the memory (M),
defined as:
f4 : M → V (vector load)
f5 : V → M (vector store)

4. Vector reduction instructions – these are defined as:
f6 : Vi → sj
f7 : Vi x Vj → sk
Examples of f6 include finding the maximum, minimum, sum and mean value of
all elements in a vector. An example of f7 is the dot product.

5. Gather and scatter instructions – these instructions use two vector registers to
gather or to scatter vector elements randomly throughout the memory. They are
defined as follows:
f8 : M → V1 x Vo (gather)
f9 : V1 x Vo → M (scatter)
Gather is an operation that fetches from memory the nonzero elements of a sparse
vector using indices that themselves are indexed. Scatter stores into memory a
sparse vector whose nonzero entries are indexed. The vector register
V1 contains the data, and the vector register Vo is used as an index to gather or
scatter data from or to random memory locations.

6. Masking instructions – this type of instruction uses a mask vector to compress
or to expand a vector to a shorter or longer index vector, respectively. They are
defined as follows:
f10 : Vo x Vm → V1
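The mappings f1-f10 above can be mimicked with NumPy array operations, as in the illustrative Python sketch below; a real vector processor implements each of them as a single pipelined machine instruction, and all array values here are made up:

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([10.0, 20.0, 30.0, 40.0])
s  = 2.5

f1 = np.sin(v1)          # vector-vector:    f1 : Vi -> Vj
f2 = v1 + v2             # vector-vector:    f2 : Vj x Vk -> Vi
f3 = s * v1              # vector-scalar:    f3 : s x Vi -> Vj
f6 = v1.sum()            # vector reduction: f6 : Vi -> s (max/min/mean are similar)
f7 = np.dot(v1, v2)      # vector reduction: f7 : Vi x Vj -> s (dot product)

memory = np.array([0.0, 5.0, 0.0, 7.0, 0.0, 9.0])
index  = np.array([1, 3, 5])
gathered = memory[index]          # gather:  f8 uses an index register Vo
memory[index] = gathered * 2      # scatter: f9 writes back to the indexed locations

mask = np.array([True, False, True, False])
compressed = v1[mask]             # masking: f10 compresses V1 under the mask Vm
```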

VECTOR – ACCESS MEMORY SCHEMES

The flow of vector operands between the main memory and the vector registers is
usually pipelined with multiple access paths.

Vector Operand Specifications – Vector operands may have arbitrary length.
Vector elements are not necessarily stored in contiguous memory locations. For
example, the entries in a matrix may be stored in row major or in column major
order. Each row, column or diagonal of the matrix can be used as a vector.
When row elements are stored in contiguous locations with a unit stride, the
column elements must be stored with a stride of n, where n is the matrix order.
Similarly, the diagonal elements are separated by a stride of n+1.
To access a vector in memory, one must specify its base address, stride and
length. Since each vector register has a fixed number of component registers, only
a segment of the vector can be loaded into the vector register in a fixed number
of cycles. Long vectors must be segmented and processed one segment at a time.
Vector operands should be stored in memory to allow pipelined or parallel access.
The memory system for a vector processor must be specifically designed to
enable fast vector access. The access rate should match the pipeline rate. The
access path is often itself pipelined and is called an access pipe.
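The stride rules for rows, columns and diagonals can be illustrated with strided slicing over a linear array; the Python sketch below assumes a 4 x 4 row-major matrix and is illustrative only:

```python
import numpy as np

n = 4
a = np.arange(n * n)              # an n x n matrix stored row-major as a linear array

row_0    = a[0:n:1]               # row:      contiguous, stride 1
col_0    = a[0:n * n:n]           # column:   stride n
diagonal = a[0:n * n:n + 1]       # diagonal: stride n + 1

print(row_0, col_0, diagonal)     # row: 0..3, column 0: 0,4,8,12, diagonal: 0,5,10,15
```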

The main memory is built with multiple modules. Once presented with a memory
address, each memory module returns one word per cycle. It is possible to
present different addresses to different memory modules so that parallel access of
multiple words can be done simultaneously or in a pipelined fashion. Consider a
main memory formed with m = 2^a memory modules, each containing w = 2^b
words of memory cells. The total memory capacity is m . w = 2^(a+b) words. These
memory words are assigned linear addresses. Different ways of assigning linear
addresses result in different memory organizations. There are three vector access
memory organizations:

C-Access Memory Organization: This uses low-order interleaving, which
spreads contiguous memory locations across the m modules horizontally. This
implies that the low-order a bits of the memory address are used to identify the
memory module. The high-order b bits are the word addresses within each module.
Access to the m memory modules can be overlapped in a pipelined fashion,
called pipelined memory access. For this purpose, the memory cycle (called
the major cycle) is subdivided into m minor cycles. An example is an eight-way
interleaved memory (with m = 8 and w = 8, and thus a = b = 3):

Memory address register (6 bits): the high-order 3 bits form the word address and the
low-order 3 bits form the module address, selecting one of the modules M0 through M7.

The same word address is applied to all memory modules simultaneously. A
module address decoder is used to distribute module addresses. This type of
concurrent access of contiguous words has been called the C-access memory scheme.
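In low-order interleaving, the module and word addresses can be obtained with a modulus and a division (or, since m is a power of two, a mask and a shift). The Python sketch below is a minimal illustration of this address decoding:

```python
def decode_interleaved(addr, a_bits):
    # Low-order interleaving: the low-order a bits select the module,
    # the remaining high-order bits give the word address inside that module.
    m = 1 << a_bits
    module = addr & (m - 1)        # addr mod m
    word   = addr >> a_bits        # addr div m
    return module, word

# Eight-way interleaving (m = 8, a = 3): consecutive addresses hit consecutive modules.
for addr in range(10):
    print(addr, decode_interleaved(addr, a_bits=3))
```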

S-Access Memory Organization: The low-order interleaved memory can be
rearranged to allow simultaneous access, or S-access. In this case, all memory
modules are accessed simultaneously in a synchronized manner. The high-order
b bits select the same offset word from each module.

C/S-Access Memory Organization: This is a memory access scheme in which C-access
and S-access are combined. Here n access buses are used, with m interleaved
memory modules attached to each bus. The m modules on each bus are m-way
interleaved to allow C-access. The n buses operate in parallel to allow S-access.
In each memory cycle, at most m . n words are fetched if the n buses are fully
used with pipelined memory accesses.

SYSTEM INTERCONNECT ARCHITECTURES


Direct networks for static connections and indirect networks for dynamic connections can
be used for internal connections among processors, memory modules and I/O disk arrays
in a centralised system, or for distributed networking of multicomputer nodes.

Various topologies for building networks are specified below. We focus on the
communication properties of interconnection networks. These include latency analysis,
bisection bandwidth, and data routing functions.

The communication efficiency of the underlying network is critical to the performance of
a parallel computer. We hope to achieve a low-latency network with a high data transfer
rate and thus a wide communication bandwidth. These network properties help in making
design choices for the machine architecture.

NETWORK PROPERTIES AND ROUTING

The topology of an interconnection network can be either static or dynamic. Static
networks are formed of point-to-point direct connections, which do not change during
program execution. Dynamic networks are implemented with switched channels, which
are dynamically configured to match the communication demands in user programs.

Static networks are used for fixed connections among subsystems of a centralized system
or multiple computing nodes of a distributed system. Dynamic networks include buses,
crossbar switches, and multistage networks, which are often used in shared memory
multiprocessors.

Some of the parameters used to estimate the complexity, communication efficiency, and
cost of a network are defined below:-

(i) Network size:- In general, a network is represented by the graph of a finite number of
nodes linked by directed or undirected edges. The number of nodes in a graph is called
the network size.

(ii) Node Degree and Network Diameter: The number of edges (links or channels)
incident on a node is called the node degree d. In the case of unidirectional channels, the
number of channels into a node is the in-degree, and the number of channels out of a node
is the out-degree. The node degree is then the sum of the two. The node degree should be
kept constant, and small, in order to reduce the cost.

The diameter D of a network is the maximum shortest path between any two nodes. The
path length is measured by the number of links traversed. The diameter should be small
from a communication point of view.

(iii) Bisection Width: When a given network is cut into two equal halves, the minimum
number of edges (channels) along the cut is called the channel bisection width b. In the case
of a communication network, each edge corresponds to a channel with w bit wires. Then
the wire bisection width is B = bw. B reflects the wiring density of a network. Thus, the
bisection width provides a good indicator of the maximum communication bandwidth
along the bisection of a network.
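For a concrete case, the definitions above can be evaluated on a k-dimensional hypercube; the Python sketch below (illustrative only) builds the edge list, reads off a node degree, and counts the edges crossing the cut on the most significant address bit, which for a hypercube gives the channel bisection width b = N/2:

```python
def hypercube_edges(k):
    # Nodes 0 .. 2^k - 1; an edge joins two labels differing in exactly one bit.
    n = 1 << k
    return [(u, u ^ (1 << i)) for u in range(n) for i in range(k)
            if u < (u ^ (1 << i))]

def node_degree(edges, node):
    return sum(node in e for e in edges)

def bisection_width(edges, k):
    # Split the cube by the most significant address bit and count crossing edges;
    # for the hypercube this cut achieves the minimum, b = N/2.
    msb = 1 << (k - 1)
    return sum((u & msb) != (v & msb) for u, v in edges)

k = 4
edges = hypercube_edges(k)
print(node_degree(edges, 0), bisection_width(edges, k))   # 4 (degree k), 8 (N/2)
```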

(iv) Data Routing Functions: A data routing network is used for inter-processing-element
data exchange. This routing network can be static, such as the hypercube routing
network used in the TMC/CM-2, or dynamic, such as the multistage network used in the
IBM GF11. In the case of a multicomputer network, data routing is achieved through
message passing. Hardware routers are used to route messages among multiple
computers. The versatility of a routing network reduces the time needed for data
exchange and thus significantly improves performance.

Commonly used data routing functions among the processing elements include shifting,
rotation, broadcast (one-to-all), multicast (many-to-many), personalized communication
(one-to-many) etc.

Hyper Cube Routing Functions

A three-dimensional binary cube network is shown below. Three routing functions are
defined by three bits in the node address.

a. One can exchange data between adjacent nodes which differ in the least
significant bit, e.g. 000---001, 010---011, 100---101, 110---111.
b. Routing by the middle bit, e.g. 000---010, 001---011.
c. Routing by the most significant bit, e.g. 000---100, 001---101.
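These routing functions simply complement one address bit; a minimal Python illustration (not tied to any particular machine) is:

```python
def cube_routing_function(node, i):
    # C_i complements bit i of the node address: data is exchanged between
    # neighbours whose addresses differ only in that bit.
    return node ^ (1 << i)

for node in range(8):
    print(format(node, "03b"), "->", format(cube_routing_function(node, 0), "03b"))
# C_0 pairs: 000<->001, 010<->011, 100<->101, 110<->111
```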

Instruction Set Architecture (ISA)


An instruction set is a list of all the instructions, and all their variations, that a processor
(or in the case of a virtual machine, an interpreter) can execute.

Instructions include:

 Arithmetic such as add and subtract


 Logic instructions such as and, or, and not
 Data instructions such as move, input, output, load, and store
 Control flow instructions such as goto, if ... goto, call, and return.

An instruction set, or instruction set architecture (ISA), is the part of the computer
architecture related to programming, including the native data types, instructions,
registers, addressing modes, memory architecture, interrupt and exception handling, and
external I/O. An ISA includes a specification of the set of opcodes (machine language),
the native commands implemented by a particular CPU design.
Categories of ISA

 CISC
 RISC

RISC
Why is this architecture called RISC? What is reduced about it?
The answer is that, to make all instructions the same length, the number of bits used
for the opcode is reduced. Thus fewer instructions are provided. The instructions that
were thrown out are the less important string and BCD (binary-coded decimal)
operations. In fact, now that memory access is restricted, there aren't several kinds of
MOV or ADD instructions. The older architecture is thus called CISC (Complex
Instruction Set Computer). RISC architectures are also
called LOAD/STORE architectures.
The number of registers in RISC is usually 32 or more.
The CISC Architecture

The RISC Architecture

Reduced instruction set computer


The acronym RISC (pronounced risk), for reduced instruction set computing,
represents a CPU design strategy emphasizing the insight that simplified instructions that
"do less" may still provide higher performance if this simplicity can be utilized to
make instructions execute very quickly. Many proposals for a "precise" definition have
been attempted, and the term is being slowly replaced by the more descriptive load-store
architecture. Well-known RISC families include Alpha, ARC, ARM, AVR, MIPS, PA-
RISC, Power Architecture (including PowerPC), SuperH, and SPARC.
Complex addressing inherently takes many cycles to perform. It was argued that such
functions would be better performed by sequences of simpler instructions, if this could
yield implementations simple enough to cope with really high frequencies, and small
enough to leave room for many registers [1],
factoring out slow memory accesses. Uniform, fixed-length instructions with arithmetic
restricted to registers were chosen to ease instruction pipelining in these simple designs,
with special load-store instructions accessing memory.
Typical characteristics of RISC
For any given level of general performance, a RISC chip will typically have far
fewer transistors dedicated to the core logic which originally allowed designers to
increase the size of the register set and increase internal parallelism.

Other features typically found in RISC architectures are:

 Uniform instruction format, using a single word with the opcode in the same bit
positions in every instruction, demanding less decoding;
 Identical general purpose registers, allowing any register to be used in any
context, simplifying compiler design (although normally there are separate floating
point registers);
 Simple addressing modes. Complex addressing performed via sequences of
arithmetic and/or load-store operations;
 Few data types in hardware, some CISCs have byte string instructions, or
support complex numbers; this is so far unlikely to be found on a RISC.

Complex instruction set computer


A complex instruction set computer (CISC, pronounced like "sisk") is
a computer instruction set architecture (ISA) in which each instruction can execute
several low-level operations, such as a load from memory, an arithmetic operation, and
a memory store, all in a single instruction.
Benefits
Before the RISC philosophy became prominent, many computer architects tried to bridge
the so-called semantic gap, i.e. to design instruction sets that directly supported high-level
programming constructs such as procedure calls, loop control, and complex addressing
modes, allowing data structure and array accesses to be combined into single instructions.
The compact nature of such instruction sets results in smaller program sizes and fewer
calls to main memory, which at the time (early 1960s and onwards) resulted in
tremendous savings on the cost of computer memory and disc storage.
Problems
While many designs achieved the aim of higher throughput at lower cost and also
allowed high-level language constructs to be expressed by fewer instructions, it was
observed that this was not always the case. For instance, low-end versions of complex
architectures (i.e. using less hardware) could lead to situations where it was possible to
improve performance by not using a complex instruction (such as a procedure call or enter
instruction), but instead using a sequence of simpler instructions.

CISC vs RISC

CISC

Pronounced "sisk", and stands for Complex Instruction Set Computer. Most PCs use CPUs
based on this architecture. For instance, Intel and AMD CPUs are based on CISC
architectures.

Typically, CISC chips have a large number of different and complex instructions. The
philosophy behind this is that hardware is always faster than software, therefore one should
make a powerful instruction set, which provides programmers with assembly instructions
that do a lot with short programs.

In general, CISC chips are relatively slow (compared to RISC chips) per instruction, but
use fewer instructions than RISC.

RISC

Pronounced "risk", and stands for Reduced Instruction Set Computer. RISC
chips evolved around the mid-1980s as a reaction to CISC chips. The
philosophy behind them is that almost no one uses the complex assembly language
instructions provided by CISC, and people mostly use compilers, which never
use the complex instructions. Apple, for instance, uses RISC chips.

Therefore, fewer, simpler and faster instructions would be better than the large, complex
and slower CISC instructions. However, more instructions are needed to accomplish a
task.

Another advantage of RISC is that - in theory - because of the simpler instructions,
RISC chips require fewer transistors, which makes them easier to design and cheaper to
produce.

Finally, it's easier to write powerful optimised compilers, since fewer instructions exist.

RISC vs CISC

There is still considerable controversy among experts about which architecture is better.
Some say that RISC is cheaper and faster and therefore the architecture of the future.
Others note that by making the hardware simpler, RISC puts a greater burden on the
software: software needs to become more complex, and software developers need to write
more lines for the same tasks.

Therefore they argue that RISC is not the architecture of the future, since conventional
CISC chips are becoming faster and cheaper anyway.

RISC has now existed for more than 10 years and hasn't been able to kick CISC out of the
market. If we forget about the embedded market and mainly look at the market for PCs,
workstations and servers, I guess at least 75% of the processors are based on the CISC
architecture. Most of them follow the x86 standard (Intel, AMD, etc.), but even in mainframe
territory CISC is dominant via the IBM/390 chip. Looks like CISC is here to stay …

Is RISC then really not better? The answer isn't quite that simple. RISC and CISC
architectures are becoming more and more alike. Many of today's RISC chips support
just as many instructions as yesterday's CISC chips. The PowerPC 601, for example,
supports more instructions than the Pentium. Yet the 601 is considered a RISC chip,
while the Pentium is definitely CISC. Furthermore, today's CISC chips use many
techniques formerly associated with RISC chips.

So, simply said: RISC and CISC are growing toward each other.

x86

An important factor is also that the x86 standard, as used by for instance Intel and AMD,
is based on the CISC architecture. x86 is the standard for home PCs; Windows 95 and
98 won't run on any other platform. Therefore companies like AMD and Intel will not
abandon the x86 market just overnight, even if RISC were more powerful.

Changing their chips in such a way that on the outside they stay compatible with the
CISC x86 standard, but use a RISC architecture inside, is difficult and gives all kinds of
overhead which could undo all the possible gains. Nevertheless, Intel and AMD are doing
this more or less with their current CPUs. Most acceleration mechanisms available to
RISC CPUs are now available to x86 CPUs as well.

Since competition in the x86 market is cut-throat, prices are low, even lower than for most
RISC CPUs. Although RISC prices are also dropping, a SUN UltraSPARC, for instance, is
still more expensive than an equally performing PII workstation.

Equal, that is, in terms of integer performance. In the floating-point area RISC still holds
the crown. However, CISC's 7th-generation x86 chips like the K7 will catch up with that.

The one exception to this might be the Alpha EV-6. Those machines are overall about
twice as fast as the fastest x86 CPU available. However, this Alpha chip costs about
€20000, not something you're willing to pay for a home PC.
It is perhaps interesting to mention that it is no coincidence that AMD's K7 was developed
in co-operation with Alpha and is for a large part based on the same Alpha EV-6
technology.

EPIC

The biggest threat for CISC and RISC might not be each other, but a new technology
called EPIC. EPIC stands for Explicitly Parallel Instruction Computing. As the word
parallel already says, EPIC can perform many instruction executions in parallel to one another.

EPIC was created by Intel and is in a way a combination of both CISC and RISC. This
will in theory allow the processing of Windows-based as well as UNIX-based
applications by the same CPU.

It will not be until 2000 before we can see an EPIC chip. Intel is working on it under the
code name Merced. Microsoft is already developing their Win64 standard for it. As the
name says, Merced will be a 64-bit chip.

If Intel's EPIC architecture is successful, it might be the biggest threat for RISC. All of
the big CPU manufacturers but Sun and Motorola are now selling x86-based products, and
some are just waiting for Merced to come out (HP, SGI). Because of the x86 market it is
not likely that CISC will die soon, but RISC may.

So the future might bring EPIC processors and more CISC processors, while the RISC
processors become extinct.

Conclusion

The difference between RISC and CISC chips is getting smaller and smaller. What counts
is how fast a chip can execute the instructions it is given and how well it runs existing
software. Today, both RISC and CISC manufacturers are doing everything to get an edge
on the competition.

The future might not bring victory to either of them, but might make both extinct. EPIC
might first make RISC obsolete and later CISC too.

Written by A.A.Gerritsen
for the CPU Site
March '99

DATAFLOW ARCHITECTURE
One of the few experimental dataflow computer projects is the tagged-token architecture
for building dataflow computers developed at MIT.

The global architecture consists of n processing elements (PEs) interconnected by an
n x n routing network. The entire system supports pipelined dataflow operations in all n
PEs. Inter-PE communications are done through the pipelined routing network.

Processing Element Architecture

Within each PE, the machine provides a low-level token-matching mechanism which
dispatches only those instructions whose input data (tokens) are already available. Each
datum is tagged with the address of the instruction to which it belongs and the context in
which the instruction is being executed. Instructions are stored in the program
memory. Tagged tokens enter the PE through a local path. The tokens can also be passed
to other PEs through the routing network.

Another synchronization mechanism, called the I-structure, is provided within each
PE. The I-structure is a tagged memory unit for overlapped usage of a data structure by
both the producer and consumer processes.
Demand-Driven Mechanisms

In a reduction machine, the computation is triggered by the demand for an operation's
result. Consider the evaluation of a nested arithmetic expression a = ((b+1)*c - (d%e)). A
data-driven computation chooses a bottom-up approach, starting from the innermost
operations b+1 and d%e, then proceeding to the * operation, and finally to the outermost
operation -. Such a computation has been called eager evaluation because operations are
carried out immediately after all their operands become available.
A demand-driven computation chooses a top-down approach by first demanding the value
of a, which triggers the demand for evaluating the next-level expressions (b+1)*c and
d%e, which in turn triggers the demand for evaluating b+1 at the innermost level. The
results are then returned to the nested demander in the reverse order before a is evaluated.
A demand-driven computation corresponds to lazy evaluation, because operations are
executed only when their results are required by another instruction. The demand-driven
approach matches naturally with the functional programming concept. The removal of
side effects in functional programming makes programs easier to parallelize. There are
two types of reduction machine models, both having a recursive control mechanism.
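The difference between the two evaluation orders can be sketched in a few lines of Python; the operand values below are arbitrary, and the lambdas merely stand in for demand tokens:

```python
# Data-driven (eager): every subexpression is evaluated as soon as its operands exist.
b, c, d, e = 3, 4, 10, 2
t1 = b + 1          # evaluated immediately
t2 = d % e          # evaluated immediately, even if its result were never needed
a_eager = t1 * c - t2

# Demand-driven (lazy): each subexpression is wrapped and evaluated only on demand.
t1_lazy = lambda: b + 1
t2_lazy = lambda: d % e
a_lazy = lambda: t1_lazy() * c - t2_lazy()   # nothing runs until a's value is demanded
print(a_eager, a_lazy())                     # 16 16
```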

Reduction Machine Models


In a string reduction model, each demander gets a separate copy of the expression for its
own evaluation. A long string expression is reduced to a single value in a recursive
fashion. Each reduction step has an operator followed by an embedded reference to
demand the corresponding input operands. The operator is suspended while its input
arguments are being evaluated. An expression is said to be fully reduced when all the
arguments have been replaced by literal values.
In a graph reduction model, the expression is represented as a directed graph. Different
parts of the graph, or subgraphs, can be reduced or evaluated in parallel upon demand. Each
demander is given a pointer to the result of the reduction. The demander manipulates all
references to that graph.
Graph manipulation is based on sharing the arguments using pointers. This traversal of the
graph and reversal of the references are continued until constant arguments are
encountered. This proceeds until the value of a is determined and a copy is returned to the
original demanding instruction.

MULTIPROCESSOR SYSTEM INTERCONNECTS


Parallel processing demands the use of efficient system interconnects for fast
communication among multiple processors and shared memory, I/O, and peripheral
devices. Hierarchical buses, crossbar switches, and multistage networks are often used for
this purpose.
A generalized multiprocessor architecture combines features from the UMA, NUMA, and
COMA models. Each processor Pi is attached to its own local memory and private cache.
Multiple processors are connected to shared-memory modules through an interprocessor-
memory network (IPMN).
The processors share access to I/O and peripheral devices through a processor-I/O
network (PION). Both IPMN and PION are necessary in a shared-resource
multiprocessor. Direct interprocessor communications are supported by an optional
interprocessor communication network (IPCN) instead of through the shared memory.
Network characteristics: Each of the above types of networks can be designed with
many choices. The choices are based on the topology, timing protocol, switching
method, and control strategy. Dynamic networks are used in multiprocessors in which the
interconnections are under program control. Timing, switching, and control are three
major operational characteristics of an interconnection network. The timing control can
be either synchronous or asynchronous. Synchronous networks are controlled by a global
clock that synchronizes all network activities. Asynchronous networks use handshaking
or interlocking mechanisms to coordinate fast and slow devices requesting use of the
same network.
A network can transfer data using either circuit switching or packet switching. In circuit
switching, once a device is granted a path in the network, it occupies the path for the
entire duration of the data transfer. In packet switching, the information is broken into
small packets individually competing for a path in the network.
Network control strategy is classified as centralized or distributed. With centralized
control, a global controller receives requests from all devices attached to the network and
grants the network access to one or more requesters. In a distributed system, requests are
handled by local devices independently.

HIERARCHICAL BUS SYSTEMS

A bus system consists of a hierarchy of buses connecting various
system and subsystem components in a computer. Each bus is formed with a number of
signal, control and power lines. Different buses are used to perform different
interconnection functions.
In general, the hierarchy of bus systems is packaged at different levels, including local
buses on boards, backplane buses, and I/O buses.

Local Bus:
Buses implemented on printed circuit boards are called local buses. On a processor board
one often finds a local bus which provides a common communication path among major
components (chips) mounted on the board. A memory board uses a memory bus to
connect the memory with the interface logic.
An I/O board or network interface board uses a data bus. Each of these board buses
consists of signal and utility lines. With the sharing of the lines by many I/O devices, the
layout of these lines may be at different layers of the PC board.
Backplane Bus :
A backplane is a printed circuit on which many connectors are used to plug in functional
boards. A System bus, consisting of shared signal paths and utility lines, is built on the
backplane. This system bus provides a common communication path among all plug in
boards.
Several backplane bus systems have been developed such as VME bus, multibus II and
Futurebus+.
I/O Bus:
Input/output devices are connected to a computer system through an I/O bus such as the
SCSI (Small Computer System Interface) bus. This bus is made of coaxial cables with taps
connecting disks, printers, and tape units to a processor through an I/O controller. Special
interface logic is used to connect various board types to the backplane bus.

Hierarchical Bus System


Multiprocessor System Interconnects
Legend:
IPMN - Interprocessor-memory network
PION - Processor-I/O network
IPCN - Interprocessor communication network
P - Processor
C - Cache
SM - Shared memory
LM - Local memory

COMPARISON OF FLOW MECHANISMS

Machine model: Control flow (control-driven)
• Basic definition: conventional computation; a token of control indicates when a
statement should be executed.
• Advantages: full control; complex data and control structures are easily implemented.
• Disadvantages: less efficient; difficult in programming; difficult in preventing
run-time errors.

Machine model: Dataflow (data-driven)
• Basic definition: eager evaluation; statements are executed when all of their
operands are available.
• Advantages: very high potential for parallelism; high throughput; free from side
effects.
• Disadvantages: time lost waiting for unneeded arguments; high control overhead;
difficulty in manipulating data structures.

Machine model: Reduction (demand-driven)
• Basic definition: lazy evaluation; statements are executed only when their result is
required for another computation.
• Advantages: only required instructions are executed; high degree of parallelism;
easy manipulation of data structures.
• Disadvantages: does not support sharing of objects with changing local state; time
needed to propagate demand tokens.

Computer Arithmetic Principles


Arithmetic operations can be performed in two basic forms:
1) fixed-point operations, whose results are constrained by the fixed word size of the memory, and

2) floating-point operations, whose results may be rounded off or truncated.

Fixed Point Operations: As defined earlier, fixed-point operation uses a sign-magnitude
representation, or the concepts of 1's complement and 2's complement. 1's complement,
however, introduces a second zero, also known as the dirty zero. Fixed-point arithmetic
includes the general operations:
1) Add

2) Subtract

3) Multiply
4) Divide

Floating Point Numbers: A floating-point number has two parts:

1) M - the mantissa

2) E - the exponent, with an implied base R

The formula we work with is X = M . R^E

where R = 2 in the case of the binary number system.
The 32 bits are utilized as:
1) 1 bit for the sign (bit 0),

2) 8 bits for the exponent (bits 1-8), and

3) 23 bits for the mantissa (bits 9-31).

E in the range (-127, 128) is represented as (0, 255), and X = (-1)^S . 2^(E-127) . (1.M)

The special cases are as follows:

1) If E = 255 and M != 0, then X is Not a Number (NaN).

2) If E = 255 and M = 0, then X is an infinite number.

3) If E = 0 and M != 0, then X is a denormalized number.

4) If E = 0 and M = 0, then X is +0 or -0.

Floating Point Operations: The operations that can be performed are as follows
(assuming Ex <= Ey):
1) X + Y = (Mx . 2^(Ex-Ey) + My) . 2^Ey

2) X - Y = (Mx . 2^(Ex-Ey) - My) . 2^Ey

3) X * Y = (Mx * My) . 2^(Ex+Ey)

4) X / Y = (Mx / My) . 2^(Ex-Ey)
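The 1-8-23 bit split and the (-1)^S . 2^(E-127) . (1.M) interpretation can be checked with a short Python sketch (the value -6.5 is an arbitrary example):

```python
import struct

def decode_float32(x):
    # Reinterpret the 32 bits of an IEEE 754 single-precision value.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                        # 1 sign bit
    e = (bits >> 23) & 0xFF               # 8 exponent bits, biased by 127
    m = bits & 0x7FFFFF                   # 23 mantissa (fraction) bits
    return s, e, m

s, e, m = decode_float32(-6.5)
# -6.5 = (-1)^1 * 2^(129-127) * 1.625
print(s, e, m, (-1) ** s * 2 ** (e - 127) * (1 + m / 2 ** 23))   # 1 129 5242880 -6.5
```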

SIMD ARCHITECTURE

FLYNN’S TAXONOMY OF COMPUTER ARCHITECTURE


Flynn’s classification scheme is based on the notion of a stream of information. Two
types of information flow into a processor: instructions and data. The instruction stream
is defined as the sequence of instructions performed by the processing unit. The data
stream is defined as the data traffic exchanged between the memory and the processing
unit. According to Flynn’s classification, either of the instruction or data streams can be
single or multiple. Computer architecture can be classified into the following four distinct
categories:
1) Single-Instruction Single-Data streams (SISD);
2) Single-Instruction Multiple-Data streams (SIMD);
3) Multiple-Instruction Single-Data streams (MISD); and
4) Multiple-Instruction Multiple-Data streams (MIMD).

The architecture of SIMD Computer models are determined by:


1) Memory Distribution, and

2) Addressing Schemes Used.

SIMD computers use a single control unit and distributed memories and some of them
use associative memories. The instruction set of an SIMD computer is decoded by the
array control unit. The major components of SIMD computers are:

1) Processing Elements (PEs) – the PEs in the SIMD array are passive devices.

2) Arithmetic Logic Units (ALUs) – each PE's ALU executes the instructions broadcast from the control unit.

All PEs must operate in lockstep, synchronized by the same array controller.
DISTRIBUTED MEMORY MODEL

• It consists of an array of PEs which is controlled by the same array control unit.

• The host computer is responsible for loading programs and data into the control memory.

• When an instruction is sent to the control unit for decoding, a scalar or program-control
operation is executed directly by the scalar processor attached to the control unit.

• In the case of a vector operation, the instruction is broadcast to all the PEs for parallel
execution. The partitioned data are distributed to all the local memories through a
vector data bus.

• The data-routing network among the PEs is under program control through the control unit.

• Masking logic is provided to enable or disable any PE from participating in a given instruction
cycle.

Examples of Distributed Memory:


1) MESH ARCHITECTURE: Illiac IV, Goodyear MPP, AMT DAP 610.

2) HYPERCUBE: CM-2, X-Net.

SHARED MEMORY

• An alignment network is used as the inter PE memory communication network.

• The number of memory modules (m) should be relatively prime to the number of PEs so that
parallel memory access can be achieved through skewing without conflicts.

Examples of Shared Memory: Burroughs Scientific Processor (BSP- having 16 PEs &
17 memory modules), CM/ 200.

SIMD Instructions: SIMD computers execute vector instructions for arithmetic, logic,
data-routing and masking operations over vector quantities. In the case of

• Bit-Slice SIMD: vectors are binary vectors.

• Word-Parallel SIMD: vector components are 4-byte or 8-byte numerical values.

All the SIMD instructions operate on vectors of equal length n, where n corresponds to the
number of PEs.

HOST & I/ O: All I/O operations are handled by the host computers in an SIMD
organization. A special control memory is used between the host and the array control
unit. This is a staging memory for holding program and data.
Partitioned data sets are distributed to the local memory modules or the shared
memory modules before program execution starts.

The host manages the mass storage or the graphics display of computational
results.


Introduction:

In computing, a pipeline is a set of data processing elements connected in series, so that


the output of one element is the input of the next one. The elements of a pipeline are
often executed in parallel or in time-sliced fashion; in that case, some amount of buffer
storage is often inserted between elements.

Computer-related pipelines include:

1. Instruction pipelines, such as the classic RISC pipeline, which are used
in processors to allow overlapping execution of multiple instructions with the
same circuitry. The circuitry is usually divided up into stages, including
instruction decoding, arithmetic, and register fetching stages, wherein each stage
processes one instruction at a time.
2. Graphics pipelines, found in most graphics cards, which consist of
multiple arithmetic units, or complete CPUs, that implement the various stages of
common rendering operations (perspective projection, window
clipping, color and light calculation, rendering, etc.).
3. Software pipelines, consisting of multiple processes arranged so that the output
stream of one process is automatically and promptly fed as the input stream of the
next one. Unix pipelines are the classical implementation of this concept.

Advantages of Pipelining:

1. The cycle time of the processor is reduced, thus increasing instruction issue-rate
in most cases.
2. Some combinational circuits such as adders or multipliers can be made faster by
adding more circuitry.
3. If pipelining is used instead, it can save circuitry vs. a more complex
combinational circuit.

Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This


prevents branch delays (in effect, every branch is delayed) and problems with
serial instructions being executed concurrently. Consequently the design is
simpler and cheaper to manufacture.

2. The instruction latency in a non-pipelined processor is slightly lower than in a


pipelined equivalent. This is due to the fact that extra flip flops must be added to
the data path of a pipelined processor.

3. A non-pipelined processor will have a stable instruction bandwidth. The


performance of a pipelined processor is much harder to predict and may vary
more widely between different programs.

Arithmetic pipelines
The most popular arithmetic operation utilized to illustrate the operation
of arithmetic pipelines in the literature are: floating-point addition and
multiplication.
Floating-point addition
Consider the addition of two normalized floating-point numbers:

A = (Ea, Ma) and B = (Eb, Mb)

to obtain the sum

S = (Es, Ms)

where E and M represent the exponent and mantissa, respectively.

The addition follows the steps shown below:

1. Equalize the exponents:

if Ea < Eb, swap A and B; Ediff = Ea - Eb


Shift Mb right by Ediff bits

2. Add Mantissae:

Ms = Ma + Mb

Es=Ea

3. Normalize Ms and adjust Es to reflect the number of shifts required to


normalize.
4. The normalized Ms might have a larger number of bits than can be
accommodated by the mantissa field in the representation. If so, round
Ms.
5. If rounding causes a mantissa overflow, renormalize Ms and adjust Es
accordingly.
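The five steps can also be traced in software. The C sketch below is illustrative only: it uses a simplified (exponent, scaled-integer-mantissa) pair rather than a real IEEE 754 encoding, the type and function names are invented for this example, and rounding is reduced to truncation of the shifted-out bits.

```c
#include <stdio.h>
#include <math.h>

/* Simplified normalized floating-point value: X = M * 2^(E-23),
   with the mantissa M kept as an integer in [2^23, 2^24), i.e. "1.M" scaled up. */
typedef struct { int e; unsigned long m; } fp_t;

#define MANT_ONE (1UL << 23)   /* mantissa value representing 1.0 */

static fp_t fp_add(fp_t a, fp_t b)
{
    fp_t s;

    /* Step 1: equalize exponents (swap so that a holds the larger exponent),
       then shift the smaller mantissa right by Ediff bits. */
    if (a.e < b.e) { fp_t t = a; a = b; b = t; }
    int ediff = a.e - b.e;
    unsigned long mb = (ediff < 24) ? (b.m >> ediff) : 0;

    /* Step 2: add mantissae; the result exponent is the larger exponent. */
    s.e = a.e;
    s.m = a.m + mb;

    /* Steps 3-5: normalize; adding two normalized mantissae can overflow by
       at most one bit, so a single right shift suffices (rounding is reduced
       to truncation in this sketch). */
    if (s.m >= 2 * MANT_ONE) { s.m >>= 1; s.e += 1; }
    return s;
}

int main(void)
{
    fp_t a = { 3, MANT_ONE };                    /* 1.0 * 2^3 = 8.0 */
    fp_t b = { 1, MANT_ONE + (MANT_ONE >> 1) };  /* 1.5 * 2^1 = 3.0 */
    fp_t s = fp_add(a, b);
    printf("8.0 + 3.0 = %g\n", ldexp((double)s.m, s.e - 23));  /* prints 11 */
    return 0;
}
```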

Figure shows a five-stage pipeline configuration for the addition process


given above.
Floating-point add pipeline
The throughput of the above pipeline can be enhanced by rearranging the
computations into a larger number of stages, each consuming a smaller
amount of time, as shown in Figure 3.6. Here, equalizing exponents is
performed using a subtract exponents stage and a shift stage that shifts
mantissa appropriately. Similarly, normalizing is split into two stages.
This eight-stage pipeline provides a speedup of 8/5 = 1.6 over the
pipeline of the above figure.

Modified floating-point add pipeline

In the pipeline of above figure we have assumed that the shift stages can
perform an arbitrary number of shifts in one cycle. If that is not the case, the
shifters have to be used repeatedly. Figure 3.7 shows the rearranged pipeline
where the feedback paths indicate the reuse of the corresponding stage.
Floating-point multiplication
Consider the multiplication of two floating-point numbers A = (Ea, Ma) and B =
(Eb, Mb), resulting in the product
P = (Ep, Mp). The multiplication follows the pipeline configuration shown in figure 1
and the steps are listed below:

1. Add exponents: Ep = Ea + Eb.


2. Multiply mantissae: Mp = Ma × Mb. Mp will be a double-length mantissa.
3. Normalize Mp and adjust Ep accordingly.

4. Convert Mp into single-length mantissa by rounding.


5. If rounding causes a mantissa overflow, renormalize and adjust
Ep accordingly.

Stage 2 in the above pipeline would consume the largest amount of time. In
Figure below stage 2 is split into two stages, one performing partial products
and the other accumulating them. In fact, the operations of these two stages
can be overlapped in the sense that when the accumulate stage is adding, the
other stage can be producing the next partial product.
Floating-point multiplication pipeline

Floating-point multiplier pipeline with feedback loops

Floating-point adder/ multiplier

The pipelines shown so far in this section are unifunction pipelines since
they are designed to perform only one function. Note that the pipelines

of Figures above have several common stages. If a processor is


required to perform both addition and multiplication, the two
pipelines can be merged into one as shown in figure above. Obviously,
there will be two distinct paths of dataflow in this pipeline, one for
addition and the other for multiplication. This is a
multifunction pipeline. A multifunction pipeline can perform
more than one operation. The interconnection between the stages of
the pipeline changes according to the function it is performing.
Obviously, a control input that determines the particular function to be
performed on the operand being input is needed for proper operation of the
multifunction pipeline.

Static Arithmetic Pipelines


Most of today’s arithmetic pipelines are designed to perform fixed functions.
These arithmetic and logic units perform fixed point and floating point
operations separately. The fixed point unit is also called the integer unit. The
floating point unit can be built either as part of control processor or on a
separate coprocessor.
These arithmetic units perform scalar operations involving one pair of
operands at a time. The pipelining in scalar arithmetic pipelines is controlled
by software loops. Vector arithmetic units can be designed with pipeline
hardware directly under firmware or hardwired control.
Scalar and vector arithmetic pipelines differ mainly in the area of register
files and control mechanism involved. Vector hardware pipelines are often
built as add on option to a scalar processor or as an attached processor driven
by a control processor. Both scalar and vector processors are used in modern
supercomputers.
Arithmetic Pipeline Stages

Depending on the function to be implemented, different pipeline stages in


an arithmetic unit require different hardware logic. Since all arithmetic
operations (such as add, subtract, multiply, divide, squaring, square rooting,
logarithm, etc.) can be implemented with the basic add and shifting
operations, the core arithmetic stages require some form of hardware to
add or to shift.

For example, a typical three-stage floating-point adder includes a first stage for
exponent comparison and equalization which is implemented with an
integer adder and some shifting logic; a second stage for fraction addition
using a high-speed carry look-ahead adder; and a third stage for fraction
normalization and exponent readjustment using a shifter and another addition
logic.
Arithmetic or logical shifts can be easily implemented with shift
registers. Highspeed addition requires either the use of a carry-
propagation adder (CPA) which adds two numbers and produces an
arithmetic sum as shown in Fig. 6.22a, or the use of a carry-save adder
(CSA) to "add" three input numbers and produce one sum output and
a carry output as exemplified in Figure below.

In a CPA, the carries generated in successive digits are allowed to


propagate from the low end to the high end, using either ripple carry
propagation or some carry lookahead technique.

In a CSA, the carries are not allowed to propagate but instead are
saved in a carry vector. In general, an n-bit CSA is specified as follows: let
X, Y, and Z be three n-bit input numbers, expressed as
X = (x(n-1), x(n-2), ..., x1, x0), and similarly for Y and Z. The CSA performs bitwise operations
simultaneously on all columns of digits to produce two output numbers, denoted as

S^b = (0, S(n-1), S(n-2), ..., S1, S0) and C = (Cn, C(n-1), ..., C1, 0).

Note that the leading bit of the bitwise sum S^b is always a 0, and the tail
bit of the carry vector C is always a 0. The input-output relationships are
expressed below:

Si = xi ⊕ yi ⊕ zi
C(i+1) = xi·yi ∨ yi·zi ∨ zi·xi

for i = 0, 1, 2, ..., n-1, where ⊕ is the exclusive OR and ∨ is the logical OR
operation.
Note that the arithmetic sum of the three input numbers, i.e.,
S = X + Y + Z, is obtained by adding the two output numbers, i.e., S
= S^b + C, using a CPA. We use the CPA
and CSAs to implement the pipeline stages of a fixed-point multiply unit
as follows.
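The CSA relations translate almost directly into C. The sketch below (illustrative only; it packs the bit vectors into ordinary unsigned integers so that the bitwise operators act on all columns at once) produces S^b and C for three 8-bit inputs and verifies that S^b + C = X + Y + Z.

```c
#include <stdio.h>
#include <stdint.h>

/* Carry-save "addition" of three numbers: no carry propagation between digits,
   just a bitwise sum S^b and a carry vector C shifted left by one position. */
static void csa(uint32_t x, uint32_t y, uint32_t z, uint32_t *sum_b, uint32_t *carry)
{
    *sum_b = x ^ y ^ z;                           /* Si     = xi XOR yi XOR zi        */
    *carry = ((x & y) | (y & z) | (z & x)) << 1;  /* C(i+1) = xi.yi OR yi.zi OR zi.xi */
}

int main(void)
{
    uint32_t x = 0x5A, y = 0x3C, z = 0x77;        /* three 8-bit inputs */
    uint32_t sb, c;

    csa(x, y, z, &sb, &c);
    /* The final arithmetic sum requires one carry-propagate addition (the CPA). */
    printf("S^b = 0x%02X, C = 0x%02X, S^b + C = 0x%03X, X + Y + Z = 0x%03X\n",
           sb, c, sb + c, x + y + z);
    return 0;
}
```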

Multiply Pipeline Design


a.) An n-bit carry-propagate adder (CPA), which either allows carry propagation or applies
the carry look-ahead technique.

b.) An n-bit carry-save adder (CSA), where S^b is the bitwise sum of X, Y, and Z, and C is the
carry vector generated without carry propagation between digits.

Consider the multiplication of two 8-bit integers A × B = P, where P is the
16-bit product in double precision. This fixed-point multiplication can be
written as the summation of eight partial products:
P = A × B = P0 + P1 + P2 + ... + P7, where × and + are arithmetic multiply and add
operations, respectively.

Note that the partial product Pj is obtained by multiplying the multiplicand A by the jth bit of B
and then shifting the result j bits to the left, for j = 0, 1, 2, ..., 7. Thus Pj is (8+j) bits long with j
trailing zeros.

The first stage generates all eight partial products, ranging from 8 bits to 15 bits, simultaneously.
The second stage is made up of two levels of four CSAs, which essentially merge eight numbers
into four numbers ranging from 13 to 15 bits. The third stage consists of two CSAs which merge
the four numbers into two 16-bit numbers. The final stage is a CPA which adds up the last two
numbers to produce the final product P.
A pipeline unit for fixed point multiplication of 8 bit integers.
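The identity P = P0 + P1 + ... + P7 can be verified with a few lines of C. The sketch below does not model the pipeline stages or their timing; it simply generates the eight shifted partial products (stage 1, conceptually) and accumulates them, which is the arithmetic work the CSA tree and final CPA perform.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t  a = 0xB5, b = 0x6E;    /* multiplicand A and multiplier B */
    uint16_t p = 0;

    /* Generate the eight partial products Pj = A * bj * 2^j and accumulate
       them; the hardware reduces them with two levels of CSAs and a final
       CPA instead of a sequential loop. */
    for (int j = 0; j < 8; j++) {
        uint16_t pj = ((b >> j) & 1) ? (uint16_t)(a << j) : 0;  /* (8+j)-bit partial product */
        p += pj;
    }

    printf("P = 0x%04X, direct A*B = 0x%04X\n", p, (uint16_t)(a * b));
    return 0;
}
```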

SIMD Computers and Performance Enhancement

SIMD (Single Instruction, Multiple Data) is a technique employed to achieve data level
parallelism.

Performance Enhancement Using SIMD:

The SIMD concept is a method of improving performance in applications where highly


repetitive operations need to be performed. Simply put, SIMD is a technique of
performing the same operation, be it arithmetic or otherwise, on multiple pieces of data
simultaneously.
Traditionally, when an application is being programmed and a single operation needs to
be performed across a large dataset, a loop is used to iterate through each element in the
dataset and perform the required procedure. During each iteration, a single piece of data
has a single operation performed on it. This is known as Single Instruction Single Data
(SISD) programming. SISD is generally trivial to implement and both the intent and
method of the programmer can quickly be seen at a later time.
Loops such as this, however, are typically very inefficient, as they may have to iterate
thousands, or even millions of times.
Ideally, to increase performance, the number of iterations of a loop needs to be reduced.
One method of reducing iterations is known as loop unrolling. This takes the single
operation that was being performed in the loop, and carries it out multiple times in each
iteration. For example, if a loop was previously performing a single operation and taking
10,000 iterations, its efficiency could be improved by performing this operation 4 times
in each loop and only having 2500 iterations.
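As a concrete illustration in C (the function names and the factor-of-two operation are invented for this sketch, and the dataset length is assumed to be a multiple of four), the plain loop and its four-way unrolled form look like this:

```c
#include <stddef.h>

/* Plain SISD loop: one multiply per iteration. */
void scale_sisd(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

/* Unrolled by four: the same work in n/4 iterations
   (n is assumed to be a multiple of four for simplicity). */
void scale_unrolled(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        dst[i]     = src[i]     * 2.0f;
        dst[i + 1] = src[i + 1] * 2.0f;
        dst[i + 2] = src[i + 2] * 2.0f;
        dst[i + 3] = src[i + 3] * 2.0f;
    }
}
```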
The SIMD concept takes loop unrolling one step further by incorporating the multiple
actions in each loop iteration, and performing them simultaneously. With SIMD, not only
can the number of loop iterations be reduced, but also the multiple operations that are
required can be reduced to a single, optimized action.
SIMD does this through the use of ‘packed vectors’ (hence the alternate name of vector
processing). A packed vector, like traditional programming vectors or arrays, is a data
structure that contains multiple pieces of basic data. Unlike traditional vectors, however,
a SIMD packed vector can then be used as an argument for a specific instruction (For
example an arithmetic operation) that will then be performed on all elements in the vector
simultaneously (Or very close to). Because of this, the number of values that can be
loaded into the vector directly affects performance; the more values being processed at
once, the faster a complete dataset can be completed.
This size depends on two things:
1. The data type being used (i.e. int, float, double, etc.)
2. The SIMD implementation
When values are stored in packed vectors and ‘worked upon’ by a SIMD operation, they
are actually moved to a special set of CPU registers where the parallel processing takes
place. The size and number of these registers is determined by the SIMD implementation
being used.
The other area that dictates the usefulness of a SIMD implementation (Other than the
level of hardware performance itself) is the instruction set. The instruction set is the list
of available operations that a SIMD implementation provides for use with packed
vectors. These typically include operations to efficiently store and load values to and
from a vector, arithmetic operations (add, subtract, divide, square root etc), logical
operations (AND, OR etc) and comparison operations (greater than, equal to etc).
The more operations a SIMD implementation provides, the simpler it is for a developer to
perform the required function. SIMD operations are available directly when writing code
in assembly however not in the C language. To simplify SIMD optimization in C,
intrinsics can be used that are essentially a header file containing functions that translate
values to their corresponding call in assembler.
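For instance, with Intel's SSE intrinsics (one widely available SIMD implementation whose 128-bit registers hold four packed single-precision floats), the unrolled scaling loop shown earlier could be rewritten roughly as follows. This is a sketch under the assumption of an x86 target with SSE support and a length that is a multiple of four; a scalar tail loop would be needed in the general case.

```c
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ... */
#include <stddef.h>

/* Multiply every element of src by 2.0f, four packed floats per iteration
   (n is assumed to be a multiple of four). */
void scale_simd(float *dst, const float *src, size_t n)
{
    const __m128 two = _mm_set1_ps(2.0f);      /* packed vector {2, 2, 2, 2}       */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&src[i]);      /* load four floats into a register */
        v = _mm_mul_ps(v, two);                /* one instruction, four multiplies */
        _mm_storeu_ps(&dst[i], v);             /* store the four results           */
    }
}
```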

SIMD Example

The best way to demonstrate the effectiveness of SIMD is through an example. One area
where SIMD instructions are particularly useful is within image manipulation. When a
raster-based image, for example a photo, has a filter of some kind applied to it, the filter
has to process the colour value of each pixel and return the new value. The larger the
image, the more pixels that need to be processed. The operation of calculating each new
pixel value, however, is the same for every pixel. Put another way, there is a single
operation to be performed and multiple pieces of data on which it must be completed.
Such a scenario is perfect for SIMD optimization.
In this case, a SIMD optimized version of the filter would still have a main loop to go
through the entire pixel array, however the number of iterations would be significantly
reduced because in each pass, the loop would be transforming multiple pixels.

SIMD Computers

A SIMD computer consists of N identical processors, each with its own local memory
where it can store data. All processors work under the control of a single instruction
stream issued by a central control unit. There are N data streams, one per processor. The
processors operate synchronously: at each step, all processors execute the same
instruction on a different data element.
SIMD computers are much more versatile than MISD computers. Numerous problems
covering a wide variety of applications can be solved by parallel algorithms on SIMD
computers. Another interesting feature is that algorithms for these computers are
relatively easy to design, analyze and implement. On the downside, only problems that
can be subdivided into a set of identical subproblems, all of which are then solved
simultaneously by the same set of instructions, can be tackled with SIMD computers.
There are many computations that do not fit this pattern: such problems are typically
subdivided into subproblems that are not necessarily identical, and are solved using
MIMD computers.
SIMD machines have one instruction processing unit, sometimes called a controller and
indicated by a K in the PMS notation, and several data processing units, generally called
D-units or processing elements (PEs). The first operational machine of this class was the
ILLIAC-IV, a joint project by DARPA, Burroughs Corporation, and the University of
Illinois Institute for Advanced Computation. Later machines included the Distributed
Array Processor (DAP) from the British corporation ICL, and the Goodyear MPP.

The control unit is responsible for fetching and interpreting instructions. When it
encounters an arithmetic or other data processing instruction, it broadcasts the instruction
to all PEs, which then all perform the same operation. For example, the instruction might
be "add R3,R0". Each PE would add the contents of its own internal register R3 to its
own R0. To allow for needed flexibility in implementing algorithms, a PE can be
deactivated. Thus on each instruction, a PE is either idle, in which case it does nothing, or
it is active, in which case it performs the same operation as all other active PEs. Each PE
has its own memory for storing data. A memory reference instruction, for example "load
R0,100", directs each PE to load its internal register with the contents of memory location
100, meaning the 100th cell in its own local memory.

One of the advantages of this style of parallel machine organization is a savings in the
amount of logic. Anywhere from 20% to 50% of the logic on a typical processor chip is
devoted to control, namely to fetching, decoding, and scheduling instructions. The
remainder is used for on-chip storage (registers and cache) and the logic required to
implement the data processing (adders, multipliers, etc.). In an SIMD machine, only one
control unit fetches and processes instructions, so more logic can be dedicated to
arithmetic circuits and registers. For example, 32 PEs fit on one chip in the MasPar MP-
1, and a 1024-processor system is built from 32 chips, all of which fit on a single board
(the control unit occupies a separate board).

Vector processing is performed on an SIMD machine by distributing elements of vectors


across all data memories. For example, suppose we have two vectors, a and b, and a
machine with 1024 PEs. We would store element a_i in location 0 of memory i and b_i in
location 1 of memory i. To add a and b, the machine would tell each PE to load the contents of
location 0 into one register, the contents of location 1 into another register, add the two
registers, and write the result. As long as the number of PEs is greater than the length of
the vectors, vector processing on an SIMD machine is done in constant time, i.e. it does
not depend on the length of the vectors. Vector operations on a pipelined SISD vector
processor, however, take time that is a linear function of the length of the vectors.

DIRECT MAPPING AND ASSOCIATIVE CACHES

As you will recall, we discussed three cache mapping functions, i.e., methods of
addressing to locate data within a cache.

• Direct
• Full Associative
• Set Associative

Each of these depends on two facts:

• RAM is divided into blocks of memory locations. In other words, memory


locations are grouped into blocks of 2^n locations, where n represents the number of
bits used to identify a word within a block. These n bits are found at the least-
significant end of the physical address. The image below has n=2, indicating that
for each block of memory, there are 2^2 = 4 memory locations.
Therefore, for this example, the least two significant bits of an address indicate
the location within a block while the remaining bits indicate the block number.
The table below shows an example with a 20 bit address with four words per
block. Notice that for each group of four words, the word bits take on each of the
four possible values allowed with 2 bits while the block identification bits remain
constant.

Block          Address    Block identification bits    Word bits

Block 0        0x00000    0000 0000 0000 0000 00       00
               0x00001    0000 0000 0000 0000 00       01
               0x00002    0000 0000 0000 0000 00       10
               0x00003    0000 0000 0000 0000 00       11
Block 1        0x00004    0000 0000 0000 0000 01       00
               0x00005    0000 0000 0000 0000 01       01
               0x00006    0000 0000 0000 0000 01       10
               0x00007    0000 0000 0000 0000 01       11
Block 2        0x00008    0000 0000 0000 0000 10       00
               0x00009    0000 0000 0000 0000 10       01
               0x0000A    0000 0000 0000 0000 10       10
               0x0000B    0000 0000 0000 0000 10       11
Block 3        0x0000C    0000 0000 0000 0000 11       00
               0x0000D    0000 0000 0000 0000 11       01
               0x0000E    0000 0000 0000 0000 11       10
               0x0000F    0000 0000 0000 0000 11       11
And so on... until we get to the last block
Block 2^18-1   0xFFFFC    1111 1111 1111 1111 11       00
               0xFFFFD    1111 1111 1111 1111 11       01
               0xFFFFE    1111 1111 1111 1111 11       10
               0xFFFFF    1111 1111 1111 1111 11       11
• The cache is organized into lines, each of which contains enough space to store
exactly one block of data and a tag uniquely identifying where that block came
from in memory.

As far as the mapping functions are concerned, the book did an okay job describing the
details and differences of each. I, however, would like to describe them with an emphasis
on how we would model them using code.

Direct Mapping

Remember that direct mapping assigned each memory block to a specific line in the
cache. If a line is already taken up by a memory block when a new block needs to be
loaded, the old block is trashed. The figure below shows how multiple blocks are mapped
to the same line in the cache. This line is the only line that each of these blocks can be
sent to. In the case of this figure, there are 8 bits in the block identification portion of the
memory address.
The address for this example is broken down something like the following:

Tag | 8 bits identifying line in cache | word id bits

Once the block is stored in the line of the cache, the tag is copied to the tag location of
the line.

Direct Mapping Summary

The address is broken into three parts: (s-r) MSB bits represent the tag to be stored in a
line of the cache corresponding to the block stored in the line; r bits in the middle
identifying which line the block is always stored in; and the w LSB bits identifying each
word within the block. This means that:

• The number of addressable units = 2^(s+w) words or bytes


• The block size (cache line width not including tag) = 2^w words or bytes
• The number of blocks in main memory = 2^s (i.e., all the bits that are not in w)
• The number of lines in cache = m = 2^r
• The size of the tag stored in each line of the cache = (s - r) bits

Direct mapping is simple and inexpensive to implement, but if a program accesses 2


blocks that map to the same line repeatedly, the cache begins to thrash back and forth
reloading the line over and over again meaning misses are very high.
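A minimal sketch of the direct-mapped address breakdown in C (the field widths are illustrative and simply anticipate the 24-bit example used later in this section; the function name is invented for this sketch):

```c
#include <stdio.h>
#include <stdint.h>

/* Direct-mapped breakdown: | tag (s-r bits) | line (r bits) | word (w bits) |.
   The widths below are illustrative: 24-bit address, 4-word blocks (w = 2)
   and a 1K-line cache (r = 10), leaving a 12-bit tag. */
#define W_BITS 2
#define R_BITS 10

static void direct_map(uint32_t addr)
{
    uint32_t word = addr & ((1u << W_BITS) - 1);
    uint32_t line = (addr >> W_BITS) & ((1u << R_BITS) - 1);
    uint32_t tag  = addr >> (W_BITS + R_BITS);

    printf("addr 0x%06X -> tag 0x%03X, line %4u, word %u\n", addr, tag, line, word);
}

int main(void)
{
    direct_map(0x000004);   /* tag 0, line 1 */
    direct_map(0x001004);   /* tag 1, line 1: same line, so it would evict the block above */
    return 0;
}
```

The second address maps to the same cache line with a different tag, which is exactly the conflict that causes the thrashing described above.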
Full Associative Mapping

In full associative, any block can go into any line of the cache. This means that the word
id bits are used to identify which word in the block is needed, but the tag becomes all of
the remaining bits.

Tag | word id bits

Full Associative Mapping Summary

The address is broken into two parts: a tag used to identify which block is stored in which
line of the cache (s bits) and a fixed number of LSB bits identifying the word within the
block (w bits). This means that:

• The number of addressable units = 2^(s+w) words or bytes


• The block size (cache line width not including tag) = 2^w words or bytes
• The number of blocks in main memory = 2^s (i.e., all the bits that are not in w)
• The number of lines in cache is not dependent on any part of the memory address
• The size of the tag stored in each line of the cache = s bits

Set Associative Mapping

This is the one that you really need to pay attention to because this is the one for the
homework. Set associative addresses the problem of possible thrashing in the direct
mapping method. It does this by saying that instead of having exactly one line that a
block can map to in the cache, we will group a few lines together creating a set. Then a
block in memory can map to any one of the lines of a specific set. There is still only one
set that the block can map to.
Note that blocks 0, 256, 512, 768, etc. can only be mapped to one set. Within the set,
however, they can be mapped associatively to one of two lines.

The memory address is broken down in a similar way to direct mapping except that there
is a slightly different number of bits for the tag (s-r) and the set identification (r). It
should look something like the following:

Tag (s-r bits) | set identifier (r bits) | word id (w bits)

Now if you have a 24 bit address in direct mapping with a block size of 4 words (2 bit id)
and 1K lines in a cache (10 bit id), the partitioning of the address for the cache would
look like this.

Direct Mapping Address Partitions


Tag (12 bits) | line identifier (10 bits) | word id (2 bits)

If we took the exact same system, but converted it to 2-way set associative mapping (2-
way meaning we have 2 lines per set), we'd get the following:

Tag (13 bits) | set identifier (9 bits) | word id (2 bits)


Notice that by making the number of sets equal to half the number of lines (i.e., 2 lines
per set), one less bit is needed to identify the set within the cache. This bit is moved to the
tag so that the tag can be used to identify the block within the set.
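Translating that 2-way example into code, a hedged sketch of the set-associative breakdown (same assumed geometry: 24-bit address, 4-word blocks, 1K lines grouped into 512 two-way sets) would be:

```c
#include <stdio.h>
#include <stdint.h>

/* 2-way set-associative breakdown for the same geometry: 24-bit address,
   4-word blocks (w = 2) and 1K lines grouped into 512 sets (9 set bits),
   leaving a 13-bit tag. */
#define W_BITS   2
#define SET_BITS 9

static void set_assoc_map(uint32_t addr)
{
    uint32_t word = addr & ((1u << W_BITS) - 1);
    uint32_t set  = (addr >> W_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag  = addr >> (W_BITS + SET_BITS);

    /* The block may be placed in either line of this set; the stored tag
       identifies which memory block currently occupies a given line. */
    printf("addr 0x%06X -> tag 0x%04X, set %3u, word %u\n", addr, tag, set, word);
}

int main(void)
{
    set_assoc_map(0x000004);   /* tag 0, set 1 */
    set_assoc_map(0x001004);   /* tag 2, set 1: same set, but it can use the other line */
    return 0;
}
```

With two lines per set, the second block can occupy the other line of set 1 instead of evicting the first block, which is the whole point of grouping lines into sets.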
