Advanced Computer Architecture - Unit 1
(Figure: N processors with their control units, instruction streams and data streams)
an important parameter.
Average CPI
It is easy to determine the average number of cycles per instruction for a particular processor if we know the
frequency of occurrence of each instruction type. Any estimate is valid only for a specific set of programs
(which defines the instruction mix), and then only if there is a sufficiently large number of instructions.
In general, the term CPI is used with respect to a particular instruction set and a given program mix. The time
required to execute a program containing Ic instructions is just T = Ic * CPI * τ.
Each instruction must be fetched from memory, decoded, then operands fetched from memory, the
instruction executed, and the results stored.
The time required to access memory is called the memory cycle time, which is usually k times the processor
cycle time τ. The value of k depends on the memory technology and the processor-memory interconnection
scheme. The processor cycles required for each instruction (CPI) can be attributed to cycles needed for
instruction decode and execution (p), and cycles needed for memory references (m* k).
The total time needed to execute a program can then be rewritten as
T = Ic * (p + m*k) * τ
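As a rough illustration, the sketch below evaluates both forms of the execution-time formula; the instruction count, cycle counts and cycle time are made-up values chosen only for the example, not figures from these notes.

#include <stdio.h>

int main(void) {
    /* Hypothetical values, chosen only to illustrate the formulas. */
    double Ic  = 2.0e6;   /* instruction count                        */
    double p   = 4.0;     /* cycles for instruction decode/execution  */
    double m   = 1.2;     /* memory references per instruction        */
    double k   = 3.0;     /* memory cycle time / processor cycle time */
    double tau = 2.0e-9;  /* processor cycle time in seconds          */

    double cpi = p + m * k;       /* average cycles per instruction */
    double t   = Ic * cpi * tau;  /* T = Ic * (p + m*k) * tau       */

    printf("CPI = %.2f, T = %.6f s\n", cpi, t);
    return 0;
}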
Figure 1.5: Shared Memory (UMA)
Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
If cache coherency is maintained, this architecture may also be called CC-NUMA (Cache Coherent NUMA).
(Figure: bus interconnect)
Advantages:
• Memory is scalable with the number of processors: increasing the number of processors increases the total memory size proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead
incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
• The programmer is responsible for many of the details associated with data communication
between processors.
• It may be difficult to map existing data structures, based on global memory, to this memory
organization.
Multi-vector and SIMD Computers
A vector operand contains an ordered set of n elements, where n is called the length of the vector. Each
element in a vector is a scalar quantity, which may be a floating point number, an integer, a logical value or a
character.
A vector processor consists of a scalar processor and a vector unit, which could be thought of as an
independent functional unit capable of efficient vector operations.
Vector Supercomputer
Vector computers have hardware to perform the vector operations efficiently. Operands cannot be used directly from memory but rather are loaded into registers, and results are put back into registers after the operation. Vector hardware has the special ability to overlap or pipeline operand processing. Vector functional units are pipelined and fully segmented: each stage of the pipeline performs a step of the function on different operands, and once the pipeline is full a new result is produced each clock period (cp).
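As a simple illustration (our own example, not taken from the notes), the element-wise loop below is the kind of operation a vector unit pipelines: every iteration is independent, so successive elements can occupy successive pipeline stages.

#include <stddef.h>

/* Element-wise vector add: c[i] = a[i] + b[i].
 * On a vector machine this whole loop maps onto a single vector
 * instruction whose operands stream through the pipelined adder. */
void vector_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}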
Figure 1.9: Configurations of SIMD Computers
These processors consist of a number of memory modules which can be either global or dedicated to each processor; thus the main memory is the aggregate of the memory modules. The processing elements and memory units communicate with each other through an interconnection network. SIMD processors are especially designed for performing vector computations. SIMD has two basic architectural organizations:
a. Array processors using random access memory
b. Associative processors using content addressable memory
All N identical processors operate under the control of a single instruction stream issued by a central control unit (CU). Popular examples of this type of SIMD configuration are the ILLIAC IV, CM-2 and MP-1. Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers and a local memory PEMi for the storage of distributed data. The CU also has its own main memory for the storage of programs. The function of the CU is to decode the instructions and determine where each decoded instruction should be executed. The PEs perform the same function (same instruction) synchronously, in lock-step fashion, under the command of the CU.
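A minimal sketch of this lock-step behaviour (our own simulation in software, with a hypothetical PE count and local-memory size): one broadcast operation per step is applied by every PE to its own slice of distributed data.

#include <stdio.h>

#define N_PE  4   /* number of processing elements (hypothetical) */
#define LOCAL 8   /* words of local memory (PEM) per PE           */

int main(void) {
    /* Each row models one PE's local memory PEMi holding distributed data. */
    int pem[N_PE][LOCAL] = {0};

    /* The CU broadcasts one instruction per step, e.g. "add 5 to element j";
     * every enabled PE executes it on its own local data in lock step.     */
    for (int j = 0; j < LOCAL; j++) {        /* one broadcast per step       */
        for (int pe = 0; pe < N_PE; pe++) {  /* all PEs act synchronously    */
            pem[pe][j] += 5;
        }
    }

    printf("pem[0][0] = %d\n", pem[0][0]);
    return 0;
}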
Control Dependence: This refers to the situation where the order of execution of statements cannot be determined before run time; for example, in any conditional statement the flow of control depends on the outcome of the condition. The paths taken after a conditional branch may depend on the data, hence we need to eliminate this dependence among the instructions to exploit parallelism. This dependence also exists between operations performed in successive iterations of a loop. Control dependence often prohibits parallelism from being exploited.
Control-independent example:
for (i=0;i<n;i++) {
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
Control-dependent example:
for (i=1;i<n;i++) {
if (a[i-1] < 0) a[i] = 1;
}
Control dependence also prevents parallelism from being exploited. Compilers are used to eliminate this control dependence and exploit the parallelism.
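One technique compilers use to remove such branches is if-conversion. As a sketch (our own illustration, applied to the first loop above), the conditional is replaced by a conditional assignment so that the loop body becomes straight-line code suitable for pipelining or SIMD execution:

/* Original, control-dependent body:
 *     if (a[i] < 0) a[i] = 1;
 * If-converted, branch-free body: */
for (i = 0; i < n; i++) {
    a[i] = c[i];
    a[i] = (a[i] < 0) ? 1 : a[i];   /* conditional move instead of a branch */
}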
Resource dependence:
Data and control dependencies are based on the independence of the work to be done. Resource
independence is concerned with conflicts in using shared resources, such as registers, integer and floating
point ALUs, etc. ALU conflicts are called ALU dependence. Memory (storage) conflicts are called storage
dependence.
Bernstein’s Conditions - 1
Bernstein’s conditions are a set of conditions which must exist if two processes can execute in parallel.
Notation
Ii is the set of all input variables for a process Pi; Ii is also called the read set or domain of Pi. Oi is the set of all output variables for a process Pi; Oi is also called the write set of Pi.
If P1 and P2 can execute in parallel (which is written as P1 || P2), then:
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅
Bernstein’s Conditions - 2
In terms of data dependencies, Bernstein’s conditions imply that two processes can execute in parallel if
they are flow-independent, anti-independent, and output-independent. The parallelism relation || is
commutative (Pi || Pj implies Pj || Pi), but not transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk).
Therefore, || is not an equivalence relation. Intersection of the input sets is allowed.
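A minimal sketch of checking Bernstein's conditions (our own construction: variables are encoded as bit positions and the two example processes are invented for illustration). Note that P1 and P2 share an input (x), which is allowed.

#include <stdio.h>
#include <stdbool.h>

/* Variables encoded as bit positions: x=1, y=2, z=4, a=8, b=16. */
enum { X = 1, Y = 2, Z = 4, A = 8, B = 16 };

/* P1 || P2 holds when the intersections I1&O2, I2&O1 and O1&O2 are all empty. */
bool can_run_in_parallel(unsigned i1, unsigned o1, unsigned i2, unsigned o2)
{
    return (i1 & o2) == 0 && (i2 & o1) == 0 && (o1 & o2) == 0;
}

int main(void) {
    /* P1: a = x + y   reads {x, y}, writes {a}
     * P2: b = x * z   reads {x, z}, writes {b} */
    printf("%s\n", can_run_in_parallel(X | Y, A, X | Z, B) ? "P1 || P2"
                                                           : "not parallel");
    return 0;
}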
Software Parallelism
Software parallelism is defined by the control and data dependence of programs, and is revealed in the
program’s flow graph i.e., it is defined by dependencies within the code and is a function of algorithm,
programming style, and compiler optimization.
Levels of Parallelism
Instruction Level Parallelism
This fine-grained, or smallest granularity, level typically involves less than 20 instructions per grain. The number of candidates for parallel execution varies from 2 to thousands, with about five instructions or statements being the average level of parallelism.
Advantages:
• There are usually many candidates for parallel execution
• Compilers can usually do a reasonable job of finding this parallelism
Loop-level Parallelism
A typical loop has less than 500 instructions. If a loop operation is independent between iterations, it can be handled by a pipeline or by a SIMD machine. It is the most optimized program construct to execute on a
parallel or vector machine. Some loops (e.g. recursive) are difficult to handle. Loop-level parallelism is still
considered fine grain computation.
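For illustration (our own example, not from the notes), the first loop below has independent iterations and exhibits loop-level parallelism, while the second carries a dependence from one iteration to the next and is hard to parallelize:

/* Independent iterations: can be vectorized or split across processors. */
for (i = 0; i < n; i++) {
    y[i] = 2.0 * x[i] + 1.0;
}

/* Loop-carried recurrence: iteration i needs the result of iteration i-1. */
for (i = 1; i < n; i++) {
    s[i] = s[i - 1] + x[i];
}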
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements
are observed, and no two nodes are executing on the same processor at the same time. Some general
scheduling goals:
• Schedule all fine-grain activities in a node to the same processor to minimize communication
delays.
• Select grain sizes for packing to achieve better schedules for a particular parallel machine.
Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not always produce a
shorter schedule. By duplicating nodes (that is, executing some instructions on multiple processors),
we may eliminate some interprocessor communication, and thus produce a shorter schedule.
Demand-Driven Mechanisms
Data-driven machines select instructions for execution based on the availability of their operands; this is
essentially a bottom-up approach.
Demand-driven machines take a top-down approach, attempting to execute the instruction (a
demander) that yields the final result. This triggers the execution of instructions that yield its operands, and so
forth. The demand-driven approach matches naturally with functional programming languages (e.g. LISP and
SCHEME).
Pattern-driven computers: An instruction is executed when a particular data pattern is obtained as output. There are two types of pattern-driven computers:
String-reduction model: Each demander gets a separate copy of the expression string to evaluate. Each reduction step has an operator and embedded references to demand the corresponding operands, and each operator is suspended while its arguments are evaluated.
Graph-reduction model: The expression graph is reduced by evaluation of branches or sub-graphs, possibly in parallel, with demanders given pointers to the results of the reductions. It is based on sharing of pointers to arguments; traversal and reversal of pointers continues until constant arguments are encountered.
Diameter: The maximum distance between any two processors in the network; in other words, the diameter is the maximum number of (routing) processors through which a message must pass on its way from source to destination. The diameter thus measures the maximum delay for transmitting a message from one processor to another, and since it determines communication time, the smaller the diameter the better the network topology.
Connectivity: The number of paths possible between any two processors, i.e., the multiplicity of paths between two processors. Higher connectivity is desirable as it minimizes contention.
Arc connectivity of the network: The minimum number of arcs that must be removed to break the network into two disconnected networks. The arc connectivity of various networks is as follows:
• 1 for linear arrays and binary trees
• 2 for rings and 2-d meshes
• 4 for 2-d torus
• d for d-dimensional hypercubes
The larger the arc connectivity, the lower the contention and the better the network topology.
Channel width: The channel width is the number of bits that can be communicated simultaneously over an interconnection link connecting two processors.
Bisection Width and Bandwidth: To divide the network into two equal halves, we must remove some communication links. The minimum number of communication links that have to be removed is called the bisection width. The bisection width basically tells us the largest number of messages which can be sent simultaneously (without needing to use the same wire or routing processor at the same time and so delaying one another), no matter which processors are sending to which other processors. Thus the larger the bisection width, the better the network topology is considered. Bisection bandwidth is the minimum volume of communication allowed between the two halves of the network with equal numbers of processors. This is important for networks with weighted arcs, where the weights correspond to the link width (i.e., how much data the link can transfer). The larger the bisection bandwidth, the better the network topology is considered.
Cost: The cost of a network can be estimated on a variety of criteria; here we take the number of communication links or wires used to design the network as the basis of cost estimation, and the smaller the cost, the better.
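As a worked illustration of these metrics (using the standard textbook formulas, not values taken from these notes), the sketch below tabulates diameter and bisection width for a d-dimensional hypercube and a square 2-D mesh with the same number of processors:

#include <stdio.h>

int main(void) {
    /* Compare a d-dimensional hypercube with a square 2-D mesh holding the
     * same number of processors p = 2^d (d kept even so the mesh is square). */
    for (int d = 2; d <= 6; d += 2) {
        int p    = 1 << d;        /* number of processors              */
        int side = 1 << (d / 2);  /* side length of the square mesh    */

        printf("p=%2d  hypercube: diameter=%d, bisection=%d | "
               "mesh: diameter=%d, bisection=%d\n",
               p,
               d,               /* hypercube diameter = d              */
               p / 2,           /* hypercube bisection width = p/2     */
               2 * (side - 1),  /* mesh diameter = 2(sqrt(p)-1)        */
               side);           /* mesh bisection width = sqrt(p)      */
    }
    return 0;
}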
Data Routing Functions: A data routing network is used for inter-PE data exchange. It can be static, as in the case of a hypercube routing network, or dynamic, such as a multistage network. Various types of data routing functions are shifting, rotating, permutation (one to one), broadcast (one to all), multicast (many to many), personalized broadcast (one to many), shuffle, exchange, etc.
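As a small illustration of two of these routing functions (our own sketch), the perfect shuffle cyclically rotates the bits of a PE address left by one position, and the exchange function complements the lowest address bit:

#include <stdio.h>

/* Perfect shuffle on an n-bit PE address: cyclic left rotation by one bit. */
unsigned shuffle(unsigned addr, int n)
{
    unsigned msb = (addr >> (n - 1)) & 1u;        /* bit that wraps around */
    return ((addr << 1) | msb) & ((1u << n) - 1u);
}

/* Exchange: complement the least significant address bit. */
unsigned exchange(unsigned addr)
{
    return addr ^ 1u;
}

int main(void) {
    int n = 3;                                    /* 8 PEs, addresses 0..7 */
    for (unsigned a = 0; a < (1u << n); a++)
        printf("PE %u -> shuffle %u, exchange %u\n", a, shuffle(a, n), exchange(a));
    return 0;
}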
In linear and matrix structures, processors are interconnected with their neighbors in a regular structure on a plane. A torus is a matrix structure in which elements at the matrix borders are connected along the same rows and columns. In a complete connection structure, all elements (e.g. processors) are directly interconnected (point-to-point).
Figure 1.14: A complete interconnection
Figure 1.15: A Ring
Figure 1.16: A Chordal Ring
In a tree structure, system elements are arranged in a hierarchical structure from the root to the leaves; see the figure below. Either all nodes of the tree are processors, or only the leaves are processors and the remaining nodes are linking elements which mediate transmissions. If from one node two or more connections go to different nodes towards the leaves, we speak of a binary or k-ary tree. If from one node more than one connection goes to the neighboring node, we speak of a fat tree. A binary tree in which the number of connections between neighboring nodes doubles at each level towards the root provides a uniform transmission throughput between the tree levels, a feature not available in a standard tree.
transfers that have to be done to send data between the most distant nodes of a network. In this respect the
hypercubes have very good properties, especially for a very large number of constituent nodes. Due to this
hypercubes are popular networks in existing parallel systems.
Figure 1.19: A two-by-two switching box and its four interconnection states
A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Generally it consists of n stages, where N = 2^n is the number of input and output lines, and each stage uses N/2 switch boxes. The interconnection patterns from one stage to the next are determined by the network topology. Each stage is connected to the next stage by at least N paths. The total wait time is proportional to the number of stages, i.e., n, and the total cost depends on the total number of switches used, which is on the order of N log2 N.
The control structure can be individual stage control, in which the same control signal is used to set all switch boxes in the same stage; this needs only n control signals. The second control structure is individual box control, in which a separate control signal is used to set the state of each switch box. This provides flexibility but requires a control signal for each of the nN/2 switch boxes, which increases the complexity of the control circuit. A compromise between the two is partial stage control.
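As a concrete illustration of how such a network can be controlled (our own sketch, using the well-known destination-tag routing of an Omega network as an example), each stage's two-by-two box is set according to one bit of the destination address:

#include <stdio.h>

/* Destination-tag routing through an n-stage Omega network built from
 * two-by-two switch boxes: at stage s the box examines bit (n-1-s) of the
 * destination address; 0 selects the box's upper output, 1 the lower one. */
void route(unsigned src, unsigned dst, int n)
{
    unsigned pos = src;                           /* current line number   */
    printf("%u", pos);
    for (int s = 0; s < n; s++) {
        unsigned bit  = (dst >> (n - 1 - s)) & 1u;
        /* Perfect-shuffle wiring between stages, then pick the box output. */
        unsigned shuf = ((pos << 1) | ((pos >> (n - 1)) & 1u)) & ((1u << n) - 1u);
        pos = (shuf & ~1u) | bit;
        printf(" -> %u", pos);
    }
    printf("\n");
}

int main(void) { route(5, 2, 3); return 0; }   /* input 5 to output 2, N = 8 */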
Bus networks
A bus is the simplest type of dynamic interconnection networks. It constitutes a common data transfer path
for many devices. Depending on the type of implemented transmissions we have serial busses and parallel
busses. The devices connected to a bus can be processors, memories, I/O units, as shown in the figure below.
Figure 1.20: A diagram of a system based on a single bus Figure 1.21: A diagram of a system based on a multibus
Only one device connected to a bus can transmit data at a time, while many devices can receive the data; in that case we speak of a multicast transmission. If the data are meant for all devices connected to the bus, we speak of a broadcast transmission. Access to the bus must be synchronized. This is done with the use of two methods: a token method and a bus arbiter method. With the token method, a token (a special control message or signal) circulates between the devices connected to the bus and gives the right to transmit on the bus to a single device at a time. With the arbiter method, the bus arbiter receives data transmission requests from the devices connected to the bus. It selects one device according to a chosen strategy (e.g. using a system of assigned priorities) and sends an acknowledge message (signal) to that device, granting it the right to transmit. After the selected device completes its transmission, it informs the arbiter, which can then select another request. The receiver(s) address is usually given in the header of the message, and special header values are used for broadcasts and multicasts. All receivers read and decode the headers, and the devices specified in the header read in the data transmitted over the bus.
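A minimal sketch of the arbiter method (our own illustration, assuming a fixed-priority strategy in which lower-numbered devices have higher priority):

#include <stdio.h>

/* Fixed-priority arbiter: among the devices currently requesting the bus
 * (bit i of 'requests' set), grant the lowest-numbered device.
 * Returns the granted device index, or -1 if there are no requests. */
int arbitrate(unsigned requests, int n_devices)
{
    for (int dev = 0; dev < n_devices; dev++) {
        if (requests & (1u << dev))
            return dev;             /* send acknowledge to this device */
    }
    return -1;                      /* bus stays idle                  */
}

int main(void) {
    unsigned requests = 0x0A;       /* devices 1 and 3 request the bus */
    printf("bus granted to device %d\n", arbitrate(requests, 4));
    return 0;
}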
The throughput of the network based on a bus can be increased by the use of a multibus network shown in
the figure below. In this network, processors connected to the busses can transmit data in parallel (one for
each bus) and many processors can read data from many busses at a time.
Crossbar switches
A crossbar switch is a circuit that enables many interconnections between elements of a parallel system at a
time. A crossbar switch has a number of input and output data pins and a number of control pins. In response
to control instructions set to its control input, the crossbar switch implements a stable connection of a
determined input with a determined output. The diagrams of a typical crossbar switch are shown in the figure
below.
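A minimal sketch of the crossbar idea (our own construction, not the notes' design): a table records which input currently drives each output, and any number of disjoint input-output pairs can be connected and transfer data at the same time.

#include <stdio.h>

#define PORTS 4
#define NONE  (-1)

/* out_to_in[j] records which input is connected to output j. */
int out_to_in[PORTS];

/* Close the crosspoint connecting 'in' to 'out' if the output is free. */
int connect(int in, int out)
{
    if (out_to_in[out] != NONE)
        return -1;                  /* output already in use          */
    out_to_in[out] = in;
    return 0;
}

int main(void) {
    for (int j = 0; j < PORTS; j++) out_to_in[j] = NONE;
    connect(0, 2);                  /* input 0 -> output 2            */
    connect(3, 1);                  /* input 3 -> output 1, in parallel */
    for (int j = 0; j < PORTS; j++)
        printf("output %d <- input %d\n", j, out_to_in[j]);
    return 0;
}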
It requires expensive memory control logic and a large number of cables and connections.