Aca Notes
Parallel Computing
A common way of satisfying the described needs is to use parallel computers. A parallel computer consists of two or more processing units, which operate more or less independently in parallel. Using such a computer, a problem can (theoretically) be divided into n subproblems (where n is typically the number of available processing units), and each part of the problem is solved by one of the processing units concurrently. Ideally, the completion time of the computation will be t/n, where t is the completion time for the problem on a computer containing only one processing unit. In practice, a value of t/n will rarely be achieved, for several reasons: sometimes a problem cannot be divided exactly into n independent parts, usually there is a need for communication between the concurrently executing processes (e.g. for data exchanges, synchronization, etc.), some problems contain parts that are inherently sequential and therefore cannot be processed in parallel, and so on. This leads us to the term scalability. Scalability is a measure that specifies whether or not a given problem can be solved faster as more processing units are added to the computer. This applies to both hardware and software.
Scalability
A computer system, including all its hardware and software resources, is called scalable
if it can scale up (i.e., improve its resources) to accommodate ever-increasing
performance and functionality demand and/or scale down (i.e., decrease its resources) to
reduce cost.
Memory-Processor Organization
The main property of shared memory architectures is that all processors in the system have access to the same memory; there is only one global address space. Typically, the main memory consists of several memory modules (whose number is not necessarily equal to the number of processors in the computer, see Figure 2-1). In such a system, communication and synchronization between the processors is done implicitly via shared variables.
The processors are connected to the memory modules via some kind of interconnection network. This type of parallel computer is also called UMA, which stands for uniform memory access, since all processors access every memory module in the same way with respect to latency and bandwidth.
A big advantage of shared memory computers is that programming them is very convenient: all data are accessible by all processors, so there is no need to copy data. Furthermore, the programmer does not have to manage data placement or explicit communication, since this is carried out by the system automatically (which makes the hardware more complex and hence more expensive). However, it is very difficult to obtain high levels of parallelism with shared memory machines; most systems do not have more than 64 processors. This limitation stems from the fact that a centralized memory and the interconnection network are both difficult to scale once built.
The figure shows the organization of the processors and memory modules in a distributed memory computer. In contrast to the shared memory architecture, a distributed memory machine scales very well, since all processors have their own local memory, which means that there are no memory access conflicts. Using this architecture, massively parallel processors (MPPs) can be built, with up to several hundred or even thousands of processors.
Since the processors in a parallel computer need to communicate in order to solve a given
problem, there is a need for some kind of communication infrastructure, i.e. the
processors need to be connected in some way. Basically, there are two kinds of
interconnection networks: static and dynamic. In the case of a static interconnection network, all connections are fixed, i.e. the processors are wired together directly, whereas in a dynamic network there are switches in between. The decision whether to use a static or dynamic
interconnection network depends on the kind of problem that should be solved with the
computer. Generally, static topologies are suitable for problems whose communication
patterns can be predicted reasonably well, whereas dynamic topologies (switching
networks), though more expensive, are suitable for a wider class of problems [1].
In the following, we will give a description of some important static and dynamic
topologies, including routing protocols.
Static Topologies
Descriptions
The simplest - and cheapest - way to connect the nodes of a parallel computer is to use a one-dimensional mesh. Each node has two connections; boundary nodes have one. If the boundary nodes are connected to each other, we have a ring, and all nodes have two connections. The one-dimensional mesh can be generalized to a k-dimensional mesh, where each node (except boundary nodes) has 2k connections. Again, boundary nodes can be connected, but there is no general consensus on how to treat them. However, this type of topology is not suitable for building large-scale computers, since the maximum message latency, that is, the maximum delay of a message from one of the N processors to another, grows on the order of k·N^(1/k) hops (N-1 for a one-dimensional mesh); this is bad for two reasons: firstly, there is a wide range of latencies (the latency between neighbouring processors is much lower than between non-neighbours), and secondly the maximum latency grows with the number of processors.
Stars
In a star topology there is one central node, to which all other nodes are connected; each
node has one connection, except the centre node, which has N-1 connections.
Stars are also not suitable for large systems, since the centre node becomes a bottleneck as the number of processors increases.
Hypercubes
The hypercube topology is one of the most popular and is used in many large-scale systems. A k-dimensional hypercube has 2^k nodes, each with k connections. In the figure a four-dimensional hypercube is displayed.
Hypercubes scale very well: the maximum latency in a k-dimensional (or "k-ary") hypercube is log2 N, with N = 2^k. An important property of hypercubes is the relationship between node numbers and which nodes are connected together. The rule is that any two nodes in the hypercube whose binary representations differ in exactly one bit are connected together. For example, in a four-dimensional hypercube, node 0 (0000) is connected to node 1 (0001), node 2 (0010), node 4 (0100) and node 8 (1000). This numbering scheme is called the Gray code scheme.
Routing
Typically, in meshes the so-called dimension-order routing technique is used. That is, routing is performed in one dimension at a time. In a three-dimensional mesh, for example, a message travelling from node (a,b,c) to node (x,y,z) would first be moved along the first dimension to node (x,b,c), then along the second dimension to node (x,y,c), and finally along the third dimension to the destination node (x,y,z), as sketched below.
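To make the idea concrete, here is a minimal sketch in C of dimension-order routing in a three-dimensional mesh (the function name and coordinate representation are chosen just for this illustration); it corrects one coordinate at a time and prints every intermediate node:

    #include <stdio.h>

    /* Dimension-order routing in a 3-D mesh: move along dimension 0,
     * then dimension 1, then dimension 2, one hop at a time. */
    static void route_3d_mesh(const int src[3], const int dst[3]) {
        int cur[3] = { src[0], src[1], src[2] };
        printf("(%d,%d,%d)", cur[0], cur[1], cur[2]);
        for (int dim = 0; dim < 3; dim++) {
            int step = (dst[dim] > cur[dim]) ? 1 : -1;
            while (cur[dim] != dst[dim]) {
                cur[dim] += step;                    /* one hop in this dimension */
                printf(" -> (%d,%d,%d)", cur[0], cur[1], cur[2]);
            }
        }
        printf("\n");
    }

    int main(void) {
        int a[3] = {0, 2, 1}, b[3] = {3, 0, 2};
        route_3d_mesh(a, b);  /* passes through (x,b,c) and (x,y,c) on its way to (x,y,z) */
        return 0;
    }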
Stars
Routing in stars is trivial. If one of the communicating nodes is the centre node, then the
path is just the edge connecting them. If not, the message is routed from the source node
to the centre node, and from there to the destination node.
Hypercubes
A k-dimensional hypercube is nothing else than a k-dimensional mesh with only two nodes in each dimension, and thus the routing algorithm is the same as for meshes, apart from one difference: the path from node A to node B is calculated by simply computing the exclusive-or X = A XOR B of the binary representations of A and B. If the i-th bit of X is '1', the message is moved to the neighbouring node in the i-th dimension; if the i-th bit is '0', the message is not moved in that dimension. This means that it takes at most log2 N steps for a message to reach its destination (where N is the number of nodes in the hypercube).
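A corresponding sketch in C of this XOR-based routing; the node labels and the helper name are illustrative only:

    #include <stdio.h>

    /* Routing in a k-dimensional hypercube: flip, one dimension at a time,
     * every bit in which source and destination differ. */
    static void route_hypercube(unsigned src, unsigned dst, int k) {
        unsigned cur = src;
        unsigned diff = src ^ dst;            /* X = A XOR B */
        printf("%u", cur);
        for (int i = 0; i < k; i++) {
            if (diff & (1u << i)) {           /* bit i differs: move in dimension i */
                cur ^= (1u << i);
                printf(" -> %u", cur);
            }
        }
        printf("\n");                         /* at most k = log2 N hops */
    }

    int main(void) {
        route_hypercube(0u /*0000*/, 13u /*1101*/, 4);
        return 0;
    }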
Dynamic Topologies
Single-Stage Networks
Buses and crossbars are the two main representatives of this class. A bus is the simplest way to connect a number of processors with each other: all processors are simply connected to one wire. This makes communication, and especially message routing, very simple. The drawback of this type of network is that the available bandwidth is inversely proportional to the number of connected processors. This means that buses are good only for small networks with a maximum of about 10 processors.
The other extreme in terms of complexity is the crossbar network. With a crossbar, full connectivity is given, i.e. all processors can communicate with each other simultaneously without reduction of bandwidth. The figure shows the connection of n processors with m memory modules (as in a shared memory system). Of course, crossbars can also be used to connect processors with each other. In that case the memory modules are connected directly to the processors (which results in a distributed memory system), and the lines that were connected to the memory modules Mi are now connected to the processors Pi. To connect n processors to n memory modules, n^2 switches are needed. Consequently, crossbar networks cannot be scaled to arbitrary sizes. Today's commercially available crossbars can connect up to 256 units.
Multi-Stage Networks
Summary
The networks can be classified as static or dynamic. Static interconnection networks are
mainly used in message-passing architectures; the following types are commonly defined:
• completely-connected network.
• star-connected network.
• linear array or ring of processors.
• mesh network (in 2- or 3D). Each processor has a direct link to four/six (in 2D/3D) neighbor processors. An extension of this kind of network is the wraparound mesh, or torus. Commercial examples are the Intel Paragon XP/S and Cray T3D/E. These examples also belong to another class, namely direct network topologies.
• tree network of processors. The communication bottleneck likely to occur in large configurations can be alleviated by increasing the number of communication links for processors closer to the root, which results in the fat-tree topology, efficiently used in the TMC CM5 computer. The CM5 could also serve as an example of an indirect network topology.
• hypercube network. Classically this is a multidimensional mesh of processors with exactly two processors in each dimension. An example of such a system is the Intel iPSC/860 computer. Some newer projects incorporate the idea of several processors in each node, which results in a fat hypercube, i.e. an indirect network topology. An example is the SGI/Cray Origin2000 computer.
• bus-based networks - the simplest and most cost-efficient solution when only a moderate number of processors is involved. The main drawbacks are the bottleneck to memory when the number of processors becomes large, and the fact that the bus is a single point of failure. To overcome these problems, several parallel buses are sometimes incorporated. The classical example of such a machine is the SGI Power Challenge computer with a packet data bus.
• multistage networks, such as the omega network. The number of switches needed to connect n processors to n memory modules is on the order of (n/2)·log2 n, less than the n^2 needed for the crossbar switch. However, in the omega network some memory accesses can be blocked. Although machines with this kind of interconnection offer a virtual global memory programming model and ease of use, they are still not very popular. Examples from the past include the BBN Butterfly and IBM RP-3 computers; at present, the IBM RS6K SP incorporates multistage interconnections with the Vulcan switch.
• multilevel interconnection network seems to be a relatively recent development.
The idea comes directly from clusters of computers and consists of two or more
levels of connections with different aggregated bandwidths. Typical examples are:
SGI/Cray Origin2000, IBM RS6K SP with PowerPC604 SMP nodes and
HP/Convex Exemplar. This kind of architecture is getting the most interest at
present.
The memory hierarchy is a ranking of computer memory devices, with devices having the fastest access time at the top of the hierarchy, and devices with slower access times but larger capacity and lower cost at lower levels.
Most modern CPUs are so fast that for most program workloads, the locality of reference
of memory accesses and the efficiency of the caching and memory transfer between
different levels of the hierarchy are the practical limitation on processing speed. As a
result, the CPU spends much of its time idling, waiting for memory I/O to complete.
The various major units in a typical memory system can be viewed as forming a
hierarchy of memories (M1,M2,...,Mn) in which each member Mi is in a sense subordinate
to the next highest member Mi-1 of the hierarchy.
Modern programming languages mainly assume two levels of memory, main memory
and disk storage, though in assembly language, and in inline assembler in languages such
as C, registers can be directly accessed.
• Programmers are responsible for moving data between disk and memory through
file I/O.
• Hardware is responsible for moving data between memory and caches.
• Optimizing compilers are responsible for generating code that, when executed,
will cause the hardware to use caches and registers efficiently.
• Miss: the data needs to be retrieved from a block in the main memory (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the cache + time to deliver the block to the processor
1) Access Time: Time for the CPU to fetch a value from memory, including delays through any intermediate levels.
2) Size (capacity): The amount of storage available at that level of the hierarchy.
3) Cost Per Unit (byte): Cost per unit times size roughly equals total cost.
4) Transfer bandwidth: Units (bytes) per second transferred to the next level.
1) Inclusion: If a value is found at one level, it is present at all of the levels below it.
2) Coherence: The copies at all of the levels are consistent.
3) Locality: Programs access a restricted portion of their address space in any time
window.
Obviously, none of these is strictly true. Most hierarchies are inclusive from the registers
to the main memory (although we could imagine a multi-level cache that skips a level on
loading, and only copies out to the lower level when writing back). However, most tape
units do not spool to disk before going to main memory -- they are DMA devices, just
like the disks.
INCLUSION Property
The Inclusion property is stated by the following set inclusion relations among n memory
levels.
M1 ⊂ M2⊂ M3 ⊂ … ⊂ Mn
Here :
M1,M2,M3 are memory levels.
n is the number of levels.
The above equation signifies that if a value is found at one level in the memory hierarchy,
it is present at all of the memory levels below it.
COHERENCE Property
Coherence requires that copies of the same information items be consistent at different
memory levels.
The coherence property extends all the way from the cache at M1 to the outermost
memory Mn.
1) Temporal locality: Recently accessed items tend to be accessed again in the near future.
2) Spatial locality: Items with addresses near those of recently accessed items tend to be accessed in the near future.
Figure 1: Multi-level cache hierarchy (CPU, L1, L2, L3, main memory).
The main reason for using multi-level caching is the need to reduce the penalty for cache misses. In particular, the cache that is smallest and closest to the processor can operate at a very high frequency, equal to the frequency of the processor. The lower-level caches are further away from the processor and operate at lower frequencies, but their access time is still much smaller than the corresponding access time of main memory. Those caches are big enough to hold data that effectively reduces the need for main memory accesses (they have large hit ratios). It is obvious that by using this technique the average memory access time is reduced.
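As a rough worked example (the latency and miss-rate numbers below are invented for illustration, not taken from these notes), the average memory access time (AMAT) of a two-level cache can be estimated as AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * t_mem):

    #include <stdio.h>

    /* Estimate average memory access time for a two-level cache.
     * Hit times are in processor cycles; miss rates are fractions. */
    int main(void) {
        double t_l1 = 1.0,  m_l1 = 0.05;   /* L1: 1-cycle hit, 5% miss rate   */
        double t_l2 = 10.0, m_l2 = 0.20;   /* L2: 10-cycle hit, 20% miss rate */
        double t_mem = 100.0;              /* main memory access time         */

        double amat = t_l1 + m_l1 * (t_l2 + m_l2 * t_mem);
        printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05*(10 + 0.2*100) = 2.5 */
        return 0;
    }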
INTRODUCTION
A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the vector elements are ordered to have a fixed addressing increment between successive elements, called the stride.
Vector processing occurs when arithmetic or logical operations are applied to vectors. It is distinguished from scalar processing, which operates on one datum or one pair of data. The conversion from scalar code to vector code is called vectorization. Vector processing is faster and more efficient than scalar processing: it reduces the software overhead incurred in the maintenance of looping control and reduces memory-access conflicts.
1. Vector-vector instructions – one or two vector operands are fetched from the respective vector registers, passed through a functional pipeline unit, and produce results in another vector register. They are defined by the following mappings:
f1 : Vi → Vj
f2 : Vj × Vk → Vi
Examples are V1 = sin(V2) and V3 = V1 + V2 for the mappings f1 and f2, respectively.
5. Gather and scatter instructions – These instructions use two vector registers to gather or to scatter vector elements randomly throughout memory. They are defined as follows:
f8 : M → V1 × V0 (gather)
f9 : V1 × V0 → M (scatter)
Gather is an operation that fetches from memory the nonzero elements of a sparse vector using indices that themselves are indexed. Scatter does the opposite, storing into memory a sparse vector whose nonzero entries are indexed. The vector register V1 contains the data, and the vector register V0 is used as an index to gather or scatter data from or to random memory locations.
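A minimal sketch of the two operations written as plain C loops, with arrays standing in for the vector registers V1 and V0 and for memory M:

    /* Gather: V1[i] = M[V0[i]];  Scatter: M[V0[i]] = V1[i].
     * V0 holds the indices, V1 holds the data, M is main memory. */
    void gather(const double *M, const int *V0, double *V1, int n) {
        for (int i = 0; i < n; i++)
            V1[i] = M[V0[i]];    /* fetch scattered elements into a dense vector */
    }

    void scatter(double *M, const int *V0, const double *V1, int n) {
        for (int i = 0; i < n; i++)
            M[V0[i]] = V1[i];    /* store dense vector back to scattered locations */
    }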
The flow of vector operands between the main memory and the vector registers is usually pipelined with multiple access paths.
The main memory is built with multiple modules. Once presented with a memory address, each memory module returns one word per cycle. It is possible to present different addresses to different memory modules so that parallel access to multiple words can be done simultaneously or in a pipelined fashion. Consider a main memory formed of m = 2^a memory modules, each containing w = 2^b words of memory cells. The total memory capacity is m · w = 2^(a+b) words. These memory words are assigned linear addresses, and different ways of assigning linear addresses result in different memory organizations. There are three vector access memory organizations: C-access (concurrent access), S-access (simultaneous access), and C/S-access, which combines the two.
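For example, with low-order interleaving an address splits into a module number and an offset within that module; a small C sketch (the parameter values are arbitrary):

    #include <stdio.h>

    /* Low-order interleaving: with m = 2^a modules, consecutive addresses
     * fall into consecutive modules, so a stride-1 vector can be fetched
     * from all modules in parallel. */
    int main(void) {
        unsigned a = 3;                         /* m = 2^a = 8 memory modules */
        unsigned m = 1u << a;
        for (unsigned addr = 0; addr < 16; addr++) {
            unsigned module = addr & (m - 1);   /* addr mod m */
            unsigned offset = addr >> a;        /* addr div m */
            printf("address %2u -> module %u, offset %u\n", addr, module, offset);
        }
        return 0;
    }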
Various topologies for building networks are specified below. We focus on the
communication properties of interconnection networks. These include latency analysis,
bisection bandwidth, and data routing functions.
Static networks are used for fixed connections among subsystems of a centralized system
or multiple computing nodes of a distributed system. Dynamic networks include buses,
crossbar switches, and multistage networks, which are often used in shared memory
multiprocessors.
Some of the parameters used to estimate the complexity, communication efficiency, and cost of a network are defined below:
(i) Network size: In general, a network is represented by a graph of a finite number of nodes linked by directed or undirected edges. The number of nodes in the graph is called the network size.
(ii) Node degree and network diameter: The number of edges (links or channels) incident on a node is called the node degree d. In the case of unidirectional channels, the number of channels into a node is the in-degree, and the number of channels out of a node is the out-degree; the node degree is then the sum of the two. The node degree should be kept constant and small in order to reduce cost.
The diameter D of a network is the maximum shortest path between any two nodes. The path length is measured by the number of links traversed. The diameter should be small from a communication point of view.
(iii) Bisection width: When a given network is cut into two equal halves, the minimum number of edges (channels) along the cut is called the channel bisection width b. In the case of a communication network, each edge corresponds to a channel with w bit wires, so the wire bisection width is B = bw. B reflects the wiring density of a network. Thus, the bisection width provides a good indicator of the maximum communication bandwidth along the bisection of a network.
Commonly used data routing functions among the processing elements include shifting,
rotation, broadcast (one-to-all), multicast (many-to-many), personalized communication
(one-to-many) etc.
A three-dimensional binary cube network is shown below. Three routing functions are defined by the three bits of the node address:
a. Exchange of data between adjacent nodes that differ in the least significant bit, e.g. 000---001, 010---011, 100---101, 110---111.
b. Routing by the middle bit, e.g. 000---010, 001---011.
c. Routing by the most significant bit, e.g. 000---100, 001---101.
An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O. An ISA includes a specification of the set of opcodes (machine language), the native commands implemented by a particular CPU design.
Categories of ISA
CISC
RISC
RISC
Why is this architecture called RISC? What is reduced about it?
The answer is that, to make all instructions the same length, the number of bits used for the opcode is reduced. Thus fewer instructions are provided. The instructions that were thrown out are the less important string and BCD (binary-coded decimal) operations. In fact, now that memory access is restricted, there aren't several kinds of MOV or ADD instructions. Thus the older architecture is called CISC (Complex Instruction Set Computer). RISC architectures are also called LOAD/STORE architectures.
The number of registers in RISC is usually 32 or more.
Typical RISC design features
• Uniform instruction format, using a single word with the opcode in the same bit positions in every instruction, demanding less decoding;
• Identical general-purpose registers, allowing any register to be used in any context, simplifying compiler design (although normally there are separate floating-point registers);
• Simple addressing modes, with complex addressing performed via sequences of arithmetic and/or load-store operations;
• Few data types in hardware; some CISCs have byte string instructions or support complex numbers, which is so far unlikely to be found on a RISC.
CISC vs RISC
CISC
Pronounced "sisk", CISC stands for Complex Instruction Set Computer. Most PCs use CPUs based on this architecture; for instance, Intel and AMD CPUs are based on CISC architectures.
Typically, CISC chips have a large number of different and complex instructions. The philosophy behind this is that hardware is always faster than software, therefore one should make a powerful instruction set that provides programmers with assembly instructions to do a lot with short programs.
In general, CISC chips are relatively slow (compared to RISC chips) per instruction, but they need fewer instructions than RISC.
RISC
Pronounced "risk", RISC stands for Reduced Instruction Set Computer. RISC chips evolved around the mid-1980s as a reaction to CISC chips. The philosophy behind them is that almost no one uses the complex assembly language instructions provided by CISC, and people mostly use compilers, which never use complex instructions. Apple, for instance, uses RISC chips.
Therefore, fewer, simpler and faster instructions would be better than the large, complex and slower CISC instructions. However, more instructions are needed to accomplish a task.
Another advantage of RISC is that - in theory - because of the simpler instructions, RISC chips require fewer transistors, which makes them easier to design and cheaper to produce.
Finally, it's easier to write powerful optimised compilers, since fewer instructions exist.
RISC vs CISC
There is still considerable controversy among experts about which architecture is better. Some say that RISC is cheaper and faster and therefore the architecture of the future. Others note that by making the hardware simpler, RISC puts a greater burden on the software: software needs to become more complex, and software developers need to write more lines of code for the same tasks.
Therefore they argue that RISC is not the architecture of the future, since conventional CISC chips are becoming faster and cheaper anyway.
RISC has now existed for more than 10 years and hasn't been able to kick CISC out of the market. If we forget about the embedded market and mainly look at the market for PCs, workstations and servers, I guess at least 75% of the processors are based on the CISC architecture. Most of them follow the x86 standard (Intel, AMD, etc.), but even in mainframe territory CISC is dominant via the IBM/390 chip. Looks like CISC is here to stay …
Is RISC then really not better? The answer isn't quite that simple. RISC and CISC architectures are becoming more and more alike. Many of today's RISC chips support just as many instructions as yesterday's CISC chips. The PowerPC 601, for example, supports more instructions than the Pentium, yet the 601 is considered a RISC chip, while the Pentium is definitely CISC. Furthermore, today's CISC chips use many techniques formerly associated with RISC chips.
x86
An important factor is also that the x86 standard, as used by for instance Intel and AMD, is based on the CISC architecture. x86 is the standard for home PCs; Windows 95 and 98 won't run on any other platform. Therefore companies like AMD and Intel will not abandon the x86 market overnight, even if RISC were more powerful.
Changing their chips in such a way that on the outside they stay compatible with the CISC x86 standard, but use a RISC architecture inside, is difficult and introduces all kinds of overhead that could undo all the possible gains. Nevertheless, Intel and AMD are doing this more or less with their current CPUs. Most acceleration mechanisms available to RISC CPUs are now available to x86 CPUs as well.
Since competition in the x86 market is cutthroat, prices are low, even lower than for most RISC CPUs. Although RISC prices are dropping as well, a SUN UltraSPARC, for instance, is still more expensive than an equally performing PII workstation.
Equal, that is, in terms of integer performance. In the floating-point area RISC still holds the crown, although CISC's 7th-generation x86 chips like the K7 will catch up with that. The one exception to this might be the Alpha EV-6: those machines are overall about twice as fast as the fastest x86 CPU available. However, this Alpha chip costs about €20000, not something you're willing to pay for a home PC.
It is perhaps interesting to mention that it is no coincidence that AMD's K7 was developed in co-operation with Alpha and is for a large part based on the same Alpha EV-6 technology.
EPIC
The biggest threat for CISC and RISC might not be each other, but a new technology called EPIC. EPIC stands for Explicitly Parallel Instruction Computing. As the word parallel already suggests, EPIC can execute many instructions in parallel with one another.
EPIC was created by Intel and is in a way a combination of both CISC and RISC. This will in theory allow the processing of Windows-based as well as UNIX-based applications by the same CPU.
It will not be until 2000 before we see an EPIC chip. Intel is working on it under the code name Merced. Microsoft is already developing its Win64 standard for it. As the name says, Merced will be a 64-bit chip.
If Intel's EPIC architecture is successful, it might be the biggest threat for RISC. All of the big CPU manufacturers but Sun and Motorola are now selling x86-based products, and some are just waiting for Merced to come out (HP, SGI). Because of the x86 market it is not likely that CISC will die soon, but RISC may.
So the future might bring EPIC processors and more CISC processors, while RISC processors become extinct.
Conclusion
The difference between RISC and CISC chips is getting smaller and smaller. What counts
is how fast a chip can execute the instructions it is given and how well it runs existing
software. Today, both RISC and CISC manufacturers are doing everything to get an edge
on the competition.
The future might not bring victory to either of them, but may make both extinct: EPIC might first make RISC obsolete and later CISC too.
Written by A.A.Gerritsen
for the CPU Site
March '99
DATAFLOW ARCHITECTURE
One of the few experimental dataflow computer projects is the tagged-token architecture for building dataflow computers, developed at MIT.
Local Bus:
Buses implemented on printed circuit boards are called local buses. On a processor board
one often finds a local bus which provides a common communication path among major
components (chips) mounted on the board. A memory board uses a memory bus to
connect the memory with the interface logic.
An I/O board or network interface board uses a data bus. Each of these board buses
consists of signal and utility lines. With the sharing of the lines by many I/O devices, the
layout of these lines may be at different layers of the PC board.
Backplane Bus :
A backplane is a printed circuit on which many connectors are used to plug in functional
boards. A System bus, consisting of shared signal paths and utility lines, is built on the
backplane. This system bus provides a common communication path among all plug in
boards.
Several backplane bus systems have been developed such as VME bus, multibus II and
Futurebus+.
I/O Bus :
Input/Output devices are connected to a computer system through an I/O bus such as the SCSI (small computer system interface) bus. This bus is made of coaxial cables with taps connecting disks, printers, and tape units to a processor through an I/O controller. Special interface logic is used to connect various board types to the backplane bus.
2) The other which can be performed by rounding off or truncating the value
Fixed Point Operations: As defined earlier, fixed-point operands are represented in sign-magnitude form, or by using 1's complement or 2's complement notation; note that 1's complement introduces a second zero, also known as the dirty zero. Fixed-point arithmetic includes the general operations:
1) Add
2) Subtract
3) Multiply
4) Divide
Floating Point Operations: For two numbers X = (Mx, Ex) and Y = (My, Ey), the operations that can be performed include:
1) X + Y = (Mx · 2^(Ex-Ey) + My) · 2^Ey
SIMD ARCHITECTURE
SIMD computers use a single control unit and distributed memories, and some of them use associative memories. The instruction set of an SIMD computer is decoded by the array control unit. Among the major components of an SIMD computer are the processing elements (PEs) of the array; these are passive devices, and all PEs must operate in lockstep, synchronized by the same array controller.
DISTRIBUTED MEMORY MODEL
• It consists of an array of PEs controlled by the same array control unit.
• The host computer is responsible for loading programs and data into the control memory.
• When an instruction is sent to the control unit for decoding, a scalar or program control operation is executed by the scalar processor attached to the control unit.
• In the case of a vector operation, the instruction is broadcast to all the PEs for parallel execution, and the partitioned data are distributed to all the local memories through a vector data bus.
• Masking logic is provided to enable or disable any PE from participating in an instruction cycle (see the sketch after this list).
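A minimal sketch of masked, lockstep execution modelled in plain C (the array sizes and names are purely illustrative):

    #define NUM_PE 8

    /* One broadcast instruction ("c = a + b") executed by all PEs in lockstep;
     * a PE whose mask bit is 0 is disabled and leaves its result unchanged. */
    void simd_masked_add(const int a[NUM_PE], const int b[NUM_PE],
                         int c[NUM_PE], const int mask[NUM_PE]) {
        for (int pe = 0; pe < NUM_PE; pe++) {
            if (mask[pe])                 /* masking logic: only active PEs execute */
                c[pe] = a[pe] + b[pe];
        }
    }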
SHARED MEMORY
• The number of memory modules (m) should be relatively prime to the number of PEs, so that parallel memory access can be achieved through skewing without conflicts.
Examples of the shared memory model: Burroughs Scientific Processor (BSP, with 16 PEs and 17 memory modules) and the CM-200.
SIMD Instructions: SIMD computers execute vector instructions for arithmetic, logic, data routing and masking operations over vector quantities. The operands of these instructions are vectors of equal length n, where n corresponds to the number of PEs.
HOST & I/O: All I/O operations are handled by the host computer in an SIMD organization. A special control memory is used between the host and the array control unit; this is a staging memory for holding programs and data.
Divided data sets are distributed to the local memory modules or the shared memory modules before the program starts executing.
The host also manages the mass storage and the graphics display of computational results.
Introduction:
1. Instruction pipelines, such as the classic RISC pipeline, which are used
in processors to allow overlapping execution of multiple instructions with the
same circuitry. The circuitry is usually divided up into stages, including
instruction decoding, arithmetic, and register fetching stages, wherein each stage
processes one instruction at a time.
2. Graphics pipelines, found in most graphics cards, which consist of
multiple arithmetic units, or complete CPUs, that implement the various stages of
common rendering operations (perspective projection, window
clipping, color and light calculation, rendering, etc.).
3. Software pipelines, consisting of multiple processes arranged so that the output
stream of one process is automatically and promptly fed as the input stream of the
next one. Unix pipelines are the classical implementation of this concept.
Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most cases.
2. Some combinational circuits, such as adders or multipliers, can be made faster by adding more circuitry.
3. If pipelining is used instead, it can save circuitry compared with a more complex combinational circuit.
Disadvantages of Pipelining:
Arithmetic pipelines
The most popular arithmetic operations used in the literature to illustrate the operation of arithmetic pipelines are floating-point addition and multiplication.
Floating-point addition
Consider the addition of two normalized floating-point numbers:
S = (Es, Ms)
2. Add Mantissae:
Ms = Ma + Mb
Es=Ea
In the pipeline of the figure above we have assumed that the shift stages can perform an arbitrary number of shifts in one cycle. If that is not the case, the shifters have to be used repeatedly. Figure 3.7 shows the rearranged pipeline, where the feedback paths indicate the reuse of the corresponding stages.
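A rough, non-pipelined C sketch of these three steps on a toy (exponent, mantissa) representation; the struct and helper names are invented for illustration, and real hardware additionally handles signs, rounding and guard bits:

    #include <stdio.h>

    /* Toy floating-point value: value = mant * 2^exp, mant is an integer. */
    struct toyfp { int exp; long mant; };

    static struct toyfp toyfp_add(struct toyfp a, struct toyfp b) {
        struct toyfp s;
        /* Stage 1: compare exponents and align the smaller operand. */
        if (a.exp < b.exp) { struct toyfp t = a; a = b; b = t; }
        b.mant >>= (a.exp - b.exp);
        /* Stage 2: add mantissas; the result carries the larger exponent. */
        s.exp = a.exp;
        s.mant = a.mant + b.mant;
        /* Stage 3: (re)normalize, here simply keeping the mantissa odd. */
        while (s.mant != 0 && (s.mant % 2) == 0) { s.mant /= 2; s.exp++; }
        return s;
    }

    int main(void) {
        struct toyfp x = { 3, 5 };    /* 5 * 2^3 = 40 */
        struct toyfp y = { 1, 12 };   /* 12 * 2^1 = 24 */
        struct toyfp s = toyfp_add(x, y);
        printf("result: %ld * 2^%d\n", s.mant, s.exp);   /* 1 * 2^6 = 64 */
        return 0;
    }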
Floating-point multiplication
Consider the multiplication of two floating-point numbers A = (Ea, Ma) and B = (Eb, Mb), resulting in the product P = (Er, Mr). The multiplication follows the pipeline configuration shown in Figure 1, and the steps are: add the exponents (Er = Ea + Eb), multiply the mantissas (Mr = Ma × Mb), and normalize the result, adjusting the exponent if necessary.
Stage 2 in the above pipeline would consume the largest amount of time. In
Figure below stage 2 is split into two stages, one performing partial products
and the other accumulating them. In fact, the operations of these two stages
can be overlapped in the sense that when the accumulate stage is adding, the
other stage can be producing the next partial product.
Floating-point multiplication pipeline
Floating-point adder/ multiplier
The pipelines shown so far in this section are unifunctional pipelines, since each is designed to perform only one function. For example, a typical three-stage floating-point adder includes a first stage for exponent comparison and equalization, which is implemented with an integer adder and some shifting logic; a second stage for fraction addition using a high-speed carry look-ahead adder; and a third stage for fraction normalization and exponent readjustment using a shifter and further addition logic.
Arithmetic or logical shifts can be easily implemented with shift registers. High-speed addition requires either the use of a carry-propagation adder (CPA), which adds two numbers and produces an arithmetic sum as shown in Fig. 6.22a, or the use of a carry-save adder (CSA) to "add" three input numbers and produce one sum output and a carry output, as exemplified in the figure below.
In a CSA, the carries are not allowed to propagate but instead are saved in a carry vector. In general, an n-bit CSA is specified as follows. Let X, Y, and Z be three n-bit input numbers, expressed as X = (xn-1, xn-2, ..., x1, x0) and similarly for Y and Z. The CSA performs bitwise operations simultaneously on all columns of digits to produce two output numbers, the bitwise sum Sb and the carry vector C, denoted as
Sb = (0, Sn-1, Sn-2, ..., S1, S0) and C = (Cn, Cn-1, ..., C1, 0).
Note that the leading bit of the bitwise sum Sb is always 0, and the tail bit of the carry vector C is always 0. The input-output relationships are expressed below:
Si = xi ⊕ yi ⊕ zi
Ci+1 = xi·yi ∨ yi·zi ∨ zi·xi
for i = 0, 1, 2, ..., n-1, where ⊕ is the exclusive OR and ∨ is the logical OR operation.
Note that the arithmetic sum of the three input numbers, i.e., S = X + Y + Z, is obtained by adding the two output numbers, i.e., S = Sb + C, using a CPA. The CPA and CSAs are used to implement the pipeline stages of a fixed-point multiply unit as follows.
b) An n-bit carry-save adder (CSA), where Sb is the bitwise sum of X, Y, and Z, and C is the carry vector generated without carry propagation between digits.
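The CSA equations translate directly into bitwise C operations; a minimal sketch, with each operand held in a single machine word (an assumption made only for this illustration):

    #include <assert.h>
    #include <stdio.h>

    /* Carry-save "addition" of three numbers: Sb is the bitwise sum (XOR of
     * the three inputs) and C is the carry vector, shifted left by one
     * because carry Ci+1 is produced from column i. */
    static void csa(unsigned x, unsigned y, unsigned z,
                    unsigned *sb, unsigned *c) {
        *sb = x ^ y ^ z;                           /* Si   = xi XOR yi XOR zi     */
        *c  = ((x & y) | (y & z) | (z & x)) << 1;  /* Ci+1 = xiyi OR yizi OR zixi */
    }

    int main(void) {
        unsigned x = 23, y = 42, z = 77, sb, c;
        csa(x, y, z, &sb, &c);
        /* A final carry-propagate addition of Sb and C gives X + Y + Z. */
        assert(sb + c == x + y + z);
        printf("Sb = %u, C = %u, Sb + C = %u\n", sb, c, sb + c);
        return 0;
    }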
SIMD (Single Instruction, Multiple Data) is a technique employed to achieve data level
parallelism.
SIMD Example
The best way to demonstrate the effectiveness of SIMD is through an example. One area
where SIMD instructions are particularly useful is within image manipulation. When a
raster-based image, for example a photo, has a filter of some kind applied to it, the filter
has to process the colour value of each pixel and return the new value. The larger the
image, the more pixels that need to be processed. The operation of calculating each new
pixel value, however, is the same for every pixel. Put another way, there is a single
operation to be performed and multiple pieces of data on which it must be completed.
Such a scenario is perfect for SIMD optimization.
In this case, a SIMD-optimized version of the filter would still have a main loop to go through the entire pixel array; however, the number of iterations would be significantly reduced because in each pass the loop would transform multiple pixels (see the sketch below).
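As a hedged sketch using x86 SSE2 intrinsics, here is a trivial "brighten" filter over an 8-bit greyscale image that processes 16 pixels per iteration; the function name and the choice of filter are illustrative, and a real photo filter would be more involved:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Add 'delta' to every 8-bit pixel with saturation, 16 pixels per step. */
    void brighten(unsigned char *pixels, size_t n, unsigned char delta) {
        const __m128i d = _mm_set1_epi8((char)delta);
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {                       /* SIMD main loop */
            __m128i p = _mm_loadu_si128((const __m128i *)(pixels + i));
            p = _mm_adds_epu8(p, d);                         /* saturating add */
            _mm_storeu_si128((__m128i *)(pixels + i), p);
        }
        for (; i < n; i++) {                                 /* scalar remainder */
            unsigned v = pixels[i] + delta;
            pixels[i] = (unsigned char)(v > 255 ? 255 : v);
        }
    }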
SIMD Computers
A SIMD computer consists of N identical processors, each with its own local memory
where it can store data. All processors work under the control of a single instruction
stream issued by a central control unit. There are N data streams, one per processor. The
processors operate synchronously: at each step, all processors execute the same
instruction on a different data element.
SIMD computers are much more versatile than MISD computers. Numerous problems covering a wide variety of applications can be solved by parallel algorithms on SIMD computers. Another interesting feature is that algorithms for these computers are relatively easy to design, analyze and implement. On the downside, only problems that can be subdivided into a set of identical subproblems, all of which are then solved simultaneously by the same set of instructions, can be tackled with SIMD computers. There are many computations that do not fit this pattern: such problems are typically subdivided into subproblems that are not necessarily identical, and are solved using MIMD computers.
SIMD machines have one instruction processing unit, sometimes called a controller and
indicated by a K in the PMS notation, and several data processing units, generally called
D-units or processing elements (PEs). The first operational machine of this class was the
ILLIAC-IV, a joint project by DARPA, Burroughs Corporation, and the University of
Illinois Institute for Advanced Computation. Later machines included the Distributed
Array Processor (DAP) from the British corporation ICL, and the Goodyear MPP.
The control unit is responsible for fetching and interpreting instructions. When it encounters an arithmetic or other data processing instruction, it broadcasts the instruction to all PEs, which then all perform the same operation. For example, the instruction might be "add R3,R0": each PE would add the contents of its own internal register R3 to its own R0. To allow for needed flexibility in implementing algorithms, a PE can be deactivated. Thus on each instruction, a PE is either idle, in which case it does nothing, or it is active, in which case it performs the same operation as all other active PEs. Each PE has its own memory for storing data. A memory reference instruction, for example "load R0,100", directs each PE to load its internal register with the contents of memory location 100, meaning the 100th cell in its own local memory.
One of the advantages of this style of parallel machine organization is a savings in the
amount of logic. Anywhere from 20% to 50% of the logic on a typical processor chip is
devoted to control, namely to fetching, decoding, and scheduling instructions. The
remainder is used for on-chip storage (registers and cache) and the logic required to
implement the data processing (adders, multipliers, etc.). In an SIMD machine, only one
control unit fetches and processes instructions, so more logic can be dedicated to
arithmetic circuits and registers. For example, 32 PEs fit on one chip in the MasPar MP-1, and a 1024-processor system is built from 32 chips, all of which fit on a single board (the control unit occupies a separate board).
As you will recall, we discussed three cache mapping functions, i.e., methods of
addressing to locate data within a cache.
• Direct
• Fully Associative
• Set Associative
As far as the mapping functions are concerned, the book did an okay job describing the
details and differences of each. I, however, would like to describe them with an emphasis
on how we would model them using code.
Direct Mapping
Remember that direct mapping assigns each memory block to a specific line in the cache. If a line is already taken up by a memory block when a new block needs to be loaded, the old block is trashed. The figure below shows how multiple blocks are mapped to the same line in the cache. This line is the only line that each of these blocks can be sent to. In the case of this figure, there are 8 bits in the block identification portion of the memory address.
The address for this example is broken down into a tag, a line number and a word offset, as described below. Once the block is stored in the line of the cache, the tag is copied to the tag location of the line.
The address is broken into three parts: the (s-r) MSB bits represent the tag to be stored in a line of the cache along with the block stored in that line; the r bits in the middle identify which line the block is always stored in; and the w LSB bits identify each word within the block. This means that:
• the cache holds 2^r lines;
• main memory contains 2^s blocks of 2^w words each;
• the tag stored with each line is s - r bits long.
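Modelled in C, as suggested above, the three fields can be extracted from an address like this (the field widths r and w are example values):

    #include <stdio.h>

    /* Split a physical address into tag / line / word for a direct-mapped
     * cache with 2^r lines and 2^w words per block. */
    int main(void) {
        unsigned r = 10, w = 2;              /* 1K lines, 4-word blocks    */
        unsigned addr = 0x00ABCDEF;          /* 24-bit address (example)   */

        unsigned word = addr & ((1u << w) - 1);
        unsigned line = (addr >> w) & ((1u << r) - 1);
        unsigned tag  = addr >> (w + r);     /* the remaining (s - r) bits */

        printf("tag=0x%X line=%u word=%u\n", tag, line, word);
        return 0;
    }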
In fully associative mapping, any block can go into any line of the cache. This means that the word id bits are used to identify which word in the block is needed, but the tag becomes all of the remaining bits.
The address is broken into two parts: a tag used to identify which block is stored in a line of the cache (s bits) and a fixed number of LSB bits identifying the word within the block (w bits). This means that:
• the tag is the whole block number, s bits long;
• a lookup must compare the tag against every line of the cache (an associative search).
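In code the split collapses to two operations (again with a hypothetical word-field width w):

    /* Fully associative mapping: the address has only a word offset and a tag. */
    static void split_fully_associative(unsigned addr, unsigned w,
                                        unsigned *tag, unsigned *word) {
        *word = addr & ((1u << w) - 1);
        *tag  = addr >> w;       /* compared against the tag of every cache line */
    }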
This is the one that you really need to pay attention to because this is the one for the
homework. Set associative addresses the problem of possible thrashing in the direct
mapping method. It does this by saying that instead of having exactly one line that a
block can map to in the cache, we will group a few lines together creating a set. Then a
block in memory can map to any one of the lines of a specific set. There is still only one
set that the block can map to.
Note that blocks 0, 256, 512, 768, etc. can only be mapped to one set. Within the set,
however, they can be mapped associatively to one of two lines.
The memory address is broken down in a similar way to direct mapping, except that there is a slightly different number of bits for the tag (s-r) and for the set identification (r). It looks something like this: tag (s-r bits), then set id (r bits), then word id (w bits).
Now if you have a 24-bit address in direct mapping with a block size of 4 words (2-bit word id) and 1K lines in the cache (10-bit line id), the partitioning of the address for the cache works out to a 12-bit tag, a 10-bit line id and a 2-bit word id.
If we took the exact same system but converted it to 2-way set associative mapping (2-way meaning we have 2 lines per set), we'd get the following: the 1K lines form 512 sets, so the set id needs 9 bits, the word id still needs 2 bits, and the tag grows to 24 - 9 - 2 = 13 bits.
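Modelled in C for this 2-way set-associative example (24-bit address, 4-word blocks, 1K lines, 2 lines per set; the address value itself is arbitrary):

    #include <stdio.h>

    int main(void) {
        unsigned w = 2;                       /* 4 words per block          */
        unsigned lines = 1024, ways = 2;
        unsigned sets = lines / ways;         /* 512 sets -> 9 set bits     */
        unsigned set_bits = 9, addr_bits = 24;

        unsigned addr = 0x00ABCDEF & ((1u << addr_bits) - 1);
        unsigned word = addr & ((1u << w) - 1);
        unsigned set  = (addr >> w) & (sets - 1);
        unsigned tag  = addr >> (w + set_bits);   /* 24 - 9 - 2 = 13 tag bits */

        printf("tag=0x%X (13 bits), set=%u, word=%u\n", tag, set, word);
        return 0;
    }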