Pipelining and Others

This document discusses the concept of pipelining in computer architecture, which enhances performance by allowing multiple operations to occur simultaneously, similar to an assembly line in manufacturing. It explains the organization of hardware components in a pipelined processor, detailing the fetch, decode, execute, and write stages, as well as the role of cache memory in improving access times. Additionally, it addresses potential hazards that can stall the pipeline, including data, control, and structural hazards, and emphasizes the importance of minimizing these hazards to maintain high throughput.

UNIT-5

PIPELINING

5.1 BASIC CONCEPTS

The speed of execution of programs is influenced by many factors. One way to improve performance is to use faster circuit technology to build the processor and the main memory. Another possibility is to arrange the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

We have encountered concurrent activities several times before. Chapter 1 introduced the concept of multiprogramming and explained how it is possible for I/O transfers and computational activities to proceed simultaneously. DMA devices make this possible because they can perform I/O transfers independently once these transfers are initiated by the processor.

Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining is commonly known as an assembly-line operation. Readers are undoubtedly familiar with the assembly line used in car manufacturing. The first station in an assembly line may prepare the chassis of a car, the next station adds the body, the next one installs the engine, and so on. While one group of workers is installing the engine on one car, another group is fitting a car body on the chassis of another car, and yet another group is preparing a new chassis for a third car. It may take days to complete work on a given car, but it is possible to have a new car rolling off the end of the assembly line every few minutes.

Consider how the idea of pipelining can be used in a computer. The processor
executes a program by fetching and executing instructions, one after the other. Let Fi
and Ei refer to the fetch and execute steps for instruction Ii . Execution of a program
consists of a sequence of fetch and execute steps, as shown in Figure 5.1a.

Now consider a computer that has two separate hardware units, one for fetching
instructions and another for executing them, as shown in Figure 5.1b. The instruction
fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer
is needed to enable the execution unit to execute the instruction while the fetch unit is
fetching the next instruction. The results of execution are deposited in the destination
location specified by the instruction. For the purposes of this discussion, we assume
that both the source and the destination of the data operated on by the instructions
are inside the block labeled “Execution unit.”

[Figure 5.1: (a) sequential execution, in which the steps F1 E1, F2 E2, F3 E3 follow one another in time; (b) hardware organization, with an instruction fetch unit and an execution unit separated by interstage buffer B1; (c) pipelined execution, in which the execute step of each instruction overlaps the fetch step of the next in successive clock cycles.]
Figure 5.1 Basic idea of instruction pipelining.

The computer is controlled by a clock whose period is such that the fetch and
execute steps of any instruction can each be completed in one clock cycle. Operation of
the computer proceeds as in Figure 5.1c. In the first clock cycle, the fetch unit fetches an
instruction I1 (step F1 ) and stores it in buffer B1 at the end of the clock cycle. In the
second clock cycle, the instruction fetch unit proceeds with the fetch operation for
instruction I2 (step F2 ). Meanwhile, the execution unit performs the operation specified by
instruction I1 , which is available to it in buffer B1 (step E1 ).

By the end of the second clock cycle, the execution of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute units are kept busy all the time. If the pattern in Figure 5.1c can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by the sequential operation depicted in Figure 5.1a.
In summary, the fetch and execute units in Figure 5.1b constitute a two-stage
pipeline in which each stage performs one step in processing an instruction. An inter-
stage storage buffer, B1, is needed to hold the information being passed from one stage
to the next. New information is loaded into this buffer at the end of each clock cycle.
The processing of an instruction need not be divided into only two steps. For
example, a pipelined processor may process each instruction in four steps, as follows:

F  Fetch: read the instruction from the memory.
D  Decode: decode the instruction and fetch the source operand(s).
E  Execute: perform the operation specified by the instruction.
W  Write: store the result in the destination location.

The sequence of events for this case is shown in Figure 5.2a. Four instructions are in progress at any given time. This means that four distinct hardware units are needed, as shown in Figure 5.2b. These units must be capable of performing their tasks simultaneously and without interfering with one another. Information is passed from one unit to the next through a storage buffer. As an instruction progresses through the pipeline, all the information needed by the stages downstream must be passed along. For example, during clock cycle 4, the information in the buffers is as follows:
• Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the instruction-decoding unit.
• Buffer B2 holds both the source operands for instruction I2 and the specification of the operation to be performed. This is the information produced by the decoding hardware in cycle 3. The buffer also holds the information needed for the write step of instruction I2 (step W2). Even though it is not needed by stage E, this information must be passed on to stage W in the following clock cycle to enable that stage to perform the required Write operation.
• Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.
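The overlap described above and depicted in Figure 5.2a can be reproduced with a short simulation. The following sketch is illustrative only and is not part of the original text; it assumes the four-step model (F, D, E, W) with every stage taking exactly one cycle.

# Illustrative sketch (not from the text): print the F/D/E/W schedule of a
# 4-stage pipeline for n instructions, assuming every stage takes one cycle.
def pipeline_schedule(n_instructions, stages=("F", "D", "E", "W")):
    rows = []
    for i in range(n_instructions):
        # Instruction i enters stage s in cycle i + s + 1 (1-based cycles).
        rows.append({i + s + 1: f"{name}{i + 1}" for s, name in enumerate(stages)})
    total_cycles = n_instructions + len(stages) - 1
    for i, row in enumerate(rows, start=1):
        cells = [row.get(c, "  ") for c in range(1, total_cycles + 1)]
        print(f"I{i}: " + " ".join(f"{c:>3}" for c in cells))

pipeline_schedule(4)   # reproduces the pattern of Figure 5.2a over 7 clock cycles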

5.1.1 ROLE OF CACHE MEMORY

Each stage in a pipeline is expected to complete its operation in one clock cycle. Hence, the clock period should be sufficiently long to complete the task being performed in any stage. If different units require different amounts of time, the clock period must allow the longest task to be completed. A unit that completes its task early is idle for the remainder of the clock period. Hence, pipelining is most effective in improving performance if the tasks being performed in different stages require about the same amount of time.

[Figure 5.2: (a) instruction execution divided into four steps, with instructions I1 through I4 each passing through F, D, E, and W in successive clock cycles 1 through 7; (b) hardware organization, with four stage units (F: Fetch instruction, D: Decode instruction and fetch operands, E: Execute operation, W: Write results) connected through interstage buffers B1, B2, and B3.]
Figure 5.2 A 4-stage pipeline.

This consideration is particularly important for the instruction fetch step, which is
assigned one clock period in Figure 5.2a. The clock cycle has to be equal to or greater
than the time needed to complete a fetch operation. However, the access time of the
main memory may be as much as ten times greater than the time needed to perform
basic pipeline stage operations inside the processor, such as adding two numbers. Thus,
if each instruction fetch required access to the main memory, pipelining would be of
little value.
The use of cache memories solves the memory access problem. In particular, when a cache is included on the same chip as the processor, access time to the cache is usually the same as the time needed to perform other basic operations inside the processor. This makes it possible to divide instruction fetching and processing into steps that are more or less equal in duration. Each of these steps is performed by a different pipeline stage, and the clock period is chosen to correspond to the longest one.
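A quick calculation makes this point concrete. The numbers below are illustrative assumptions (the 10:1 ratio between main-memory access time and a basic internal stage operation is the figure suggested above); they are not measurements from the text.

# Illustrative sketch: the pipeline clock must accommodate the slowest stage.
stage_op_ns = 2.0                  # assumed time for a basic internal operation
main_memory_access_ns = 20.0       # assumed ~10x slower, as suggested in the text
cache_access_ns = 2.0              # assumed on-chip cache matches internal operations

clock_without_cache = max(stage_op_ns, main_memory_access_ns)   # 20 ns per cycle
clock_with_cache = max(stage_op_ns, cache_access_ns)            # 2 ns per cycle
print(clock_without_cache / clock_with_cache)   # pipelining is ~10x more effective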

5.1.2 PIPELINE PERFORMANCE

The pipelined processor in Figure 5.2 completes the processing of one instruction
in each clock cycle, which means that the rate of instruction processing is four times that
of sequential operation. The potential increase in performance resulting from
pipelining is proportional to the number of pipeline stages. However, this increase
would be achieved only if pipelined operation as depicted in Figure 5.2a could be
sustained without interruption throughout program execution. Unfortunately, this is not
the case.

For a variety of reasons, one of the pipeline stages may not be able to complete its
processing task for a given instruction in the time allotted. For example, stage E in the
four-stage pipeline of Figure 5.2b is responsible for arithmetic and logic operations,
and one clock cycle is assigned for this task. Although this may be sufficient for
most operations, some operations, such as divide, may require more time to complete.
Figure 5.3 shows an example in which the operation specified in instruction I2 requires
three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the
Write stage must be told to do nothing, because it has no data to work with. Meanwhile,
the information in buffer B2 must remain intact until the Execute stage has completed
its operation. This means that stage 2 and, in turn, stage 1 are blocked from accepting
new instructions because the information in B1 cannot be overwritten. Thus, steps D4
and F5 must be postponed as shown.

[Figure 5.3: instruction I2 occupies the Execute stage from cycle 4 through cycle 6, so W2 completes in cycle 7; steps D4 and F5 are postponed, and the normal pipelined pattern for I1 through I5 resumes afterwards.]
Figure 5.3 Effect of an execution operation taking more than one clock cycle.

Pipelined operation in Figure 5.3 is said to have been stalled for two clock cycles. Normal pipelined operation resumes in cycle 7. Any condition that causes the pipeline to stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. As a result, some operation has to be delayed, and the pipeline stalls.
The pipeline may also be stalled because of a delay in the availability of an instruction. For example, this may be a result of a miss in the cache, requiring the instruction to be fetched from the main memory. Such hazards are often called control hazards or instruction hazards. The effect of a cache miss on pipelined operation is illustrated in Figure 5.4. Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds normally. However, the fetch operation for instruction I2, which is started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5. The pipeline resumes its normal operation at that point.

[Figure 5.4: (a) instruction execution steps in successive clock cycles, with F2 occupying the Fetch stage from cycle 2 through cycle 5 because of the cache miss; (b) function performed by each processor stage in successive clock cycles: the Fetch stage repeats F2 in cycles 2 through 5, the Decode stage is idle in cycles 3 through 5, the Execute stage is idle in cycles 4 through 6, and the Write stage is idle in cycles 5 through 7.]
Figure 5.4 Pipeline stall caused by a cache miss in F2.

An alternative representation of the operation of a pipeline in the case of a cache miss is shown in Figure 5.4b. This figure gives the function performed by each pipeline stage in each clock cycle. Note that the Decode unit is idle in cycles 3 through 5, the Execute unit is idle in cycles 4 through 6, and the Write unit is idle in cycles 5 through 7. Such idle periods are called stalls. They are also often referred to as bubbles in the pipeline. Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.

A third type of hazard that may be encountered in pipelined operation is known as a structural hazard. This is the situation when two instructions require the use of a given hardware resource at the same time. The most common case in which this hazard may arise is in access to memory. One instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other instruction is delayed. Many processors use separate instruction and data caches to avoid this delay.

An example of a structural hazard is shown in Figure 5.5. This figure shows how the
load instruction

Load X(R1),R2
can be accommodated in our example 4-stage pipeline. The memory address, X+[R1], is computed in step E2 in cycle 4, then memory access takes place in cycle 5. The operand read from memory is written into register R2 in cycle 6. This means that the execution step of this instruction takes two clock cycles (cycles 4 and 5). It causes the pipeline to stall for one cycle, because both instructions I2 and I3 require access to the register file in cycle 6, even though the instructions and their data are all available.

Figure 5.5 Effect of a Load instruction on pipeline timing.

The pipeline is stalled because one hardware resource, the register file,
cannot handle two operations at once. If the register file had two input ports, that
is, if it allowed two simultaneous write operations, the pipeline would not be
stalled. In general, structural hazards are avoided by providing sufficient hardware
resources on the processor chip.
It is important to understand that pipelining does not result in individual
instructions being executed faster; rather, it is the throughput that increases, where
throughput is measured by the rate at which instruction execution is completed.
Any time one of the stages in the pipeline cannot complete its operation in one clock
cycle, the pipeline stalls, and some degradation in performance occurs. Thus, the
performance level of one instruction completion in each clock cycle is actually the
upper limit for the throughput achievable in a pipelined processor organized as in
Figure 5.2b.
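As a rough model (a standard pipeline timing argument, not a derivation from this text), n instructions on an ideal k-stage pipeline take k + n - 1 cycles, and every stall cycle adds directly to that total. The sketch below uses assumed values for n, k, and the number of stall cycles.

# Illustrative sketch: ideal pipeline time and the effect of stall cycles.
def pipeline_cycles(n_instructions, n_stages, stall_cycles=0):
    # First instruction fills the pipeline (n_stages cycles); each remaining
    # instruction completes one per cycle, plus any cycles lost to stalls.
    return n_stages + (n_instructions - 1) + stall_cycles

n, k = 100, 4
ideal = pipeline_cycles(n, k)            # 103 cycles
with_stalls = pipeline_cycles(n, k, 20)  # 123 cycles if 20 cycles are lost to hazards
print(n * k / ideal, n * k / with_stalls)  # speedup over sequential execution (n*k cycles)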
An important goal in designing processors is to identify all hazards that may cause
the pipeline to stall and to find ways to minimize their impact. In the following sections
we discuss various hazards, starting with data hazards, followed by control hazards. In
each case we present some of the techniques used to mitigate their negative effect
on performance.
5.2 DATA HAZARDS
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason, as illustrated in Figure 5.3. We will now examine the issue of availability of data in some detail.
Consider a program that contains two instructions, I1 followed by I2 . When
this program is executed in a pipeline, the execution of I2 can begin before the
execution of I1 is completed. This means that the results generated by I1 may not
be available for use by I2 . We must ensure that the results obtained when
instructions are executed in a pipelined processor are identical to those obtained
when the same instructions are executed sequentially. The potential for obtaining
incorrect results when operations are performed concurrently can be demonstrated
by a simple example. Assume that A = 5, and consider the following two operations:
A ← 3 + A
B ← 4 × A
When these operations are performed in the order given, the result is B = 32.
But if they are performed concurrently, the value of A used in computing B would
be the original value, 5, leading to an incorrect result. If these two operations are
performed by instructions in a program, then the instructions must be executed
one after the other, because the data used in the second instruction depend on the
result of the first instruction. On the other hand, the two operations
A ← 5 × C
B ← 20 + C

can be performed concurrently, because these operations are independent.
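A two-line check (not from the text) shows why order matters in the first pair of operations but not in the second. The value C = 10 is an arbitrary assumption for the example; A = 5 is taken from the text.

# Illustrative sketch: dependent operations give a wrong result if the old value
# of A is used, while independent operations can be evaluated in either order.
A, C = 5, 10            # C = 10 is an arbitrary assumption for the example

A_seq = 3 + A           # sequential: A becomes 8
B_seq = 4 * A_seq       # B = 32, the correct result

B_wrong = 4 * A         # "concurrent" use of the original A = 5 gives B = 20

A2, B2 = 5 * C, 20 + C  # the second pair is independent; any order gives 50 and 30
print(B_seq, B_wrong, A2, B2)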

Figure 5.6 Pipeline stalled by data dependency between D2 and W1.

This example illustrates a basic constraint that must be enforced to guarantee correct results. When two operations depend on each other, they must be performed sequentially in the correct order. This rather obvious condition has far-reaching consequences. Understanding its implications is the key to understanding the variety of design alternatives and trade-offs encountered in pipelined computers.
Consider the pipeline in Figure 5.2. The data dependency just described arises
when the destination of one instruction is used as a source in the next instruction.
For example, the two instructions

Mul R2,R3,R4
Add R5,R4,R6

give rise to a data dependency. The result of the multiply instruction is placed
into register R4, which in turn is one of the two source operands of the Add
instruction. Assuming that the multiply operation takes one clock cycle to complete,
execution would proceed as shown in Figure 5.6.

As the Decode unit decodes the Add instruction in cycle 3, it realizes that R4 is
used as a source operand. Hence, the D step of that instruction cannot be
completed until the W step of the multiply instruction has been completed.
Completion of step D2 must be delayed to clock cycle 5, and is shown as step D2A in
the figure. Instruction I3 is fetched in cycle 3, but its decoding must be delayed
because step D3 cannot precede D2 . Hence, pipelined execution is stalled for two
cycles.

5.2.1 OPERAND FORWARDING

The data hazard just described arises because one instruction, instruction I2 in Figure
5.6, is waiting for data to be written in the register file. However, these data are
available at the output of the ALU once the Execute stage completes step E1 . Hence,
the delay can
be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be
forwarded directly for use in step E2 .
Figure 5.7a shows a part of the processor datapath involving the ALU and the register file. This arrangement is similar to the three-bus structure except that registers SRC1, SRC2, and RSLT have been added. These registers constitute the interstage buffers needed for pipelined operation, as illustrated in Figure 5.7b. With reference to Figure 5.2b, registers SRC1 and SRC2 are part of buffer B2 and RSLT is part of B3. The data forwarding mechanism is provided by the connection lines shown in the figure. The two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be selected instead of the contents of either the SRC1 or SRC2 register.

When the instructions in Figure 5.6 are executed in the datapath of Figure 5.7,
the operations performed in each clock cycle are as follows. After decoding
instruction I2 and detecting the data dependency, a decision is made to use data
forwarding. The operand not involved in the dependency, register R2, is read and
loaded in register SRC1 in clock cycle 3. In the next clock cycle, the product
produced by instruction I1 is available in register RSLT, and because of the forwarding
connection, it can be used in step E2 . Hence, execution of I2 proceeds without
interruption.
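The decision the forwarding hardware makes can be expressed compactly. The sketch below is a simplified software model (the register names and the single forwarding source are illustrative assumptions, not the text's hardware description): a source operand is taken from the in-flight result instead of the register file whenever the previous instruction writes the register that the current instruction reads.

# Illustrative sketch of operand forwarding: if the previous instruction is about
# to write a register that the current instruction reads, use that in-flight
# value (the contents of RSLT) instead of the stale register-file contents.
def read_operand(reg, reg_file, forward_from):
    # forward_from: (dest_register, value) produced by the previous instruction, or None
    if forward_from is not None and forward_from[0] == reg:
        return forward_from[1]          # forwarded result, no stall needed
    return reg_file[reg]                # normal register-file read

reg_file = {"R2": 7, "R3": 3, "R4": 0, "R5": 10, "R6": 0}
mul_result = ("R4", reg_file["R2"] * reg_file["R3"])   # Mul R2,R3,R4 in stage E
add_src = read_operand("R4", reg_file, mul_result)     # Add R5,R4,R6 reads 21, not 0
print(add_src)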

Figure 5.7 Operand forwarding in a pipelined processor.

5.2.2 HANDLING DATA HAZARDS IN SOFTWARE


In Figure 5.6, we assumed the data dependency is discovered by the hardware while the instruction is being decoded. The control hardware delays reading register R4 until cycle 5, thus introducing a 2-cycle stall unless operand forwarding is used. An alternative approach is to leave the task of detecting data dependencies and dealing with them to the software. In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting NOP (No-operation) instructions, as follows:

I1: Mul R2,R3,R4
    NOP
    NOP
I2: Add R5,R4,R6

If the responsibility for detecting such dependencies is left entirely to the software, the compiler must insert the NOP instructions to obtain a correct result. This possibility illustrates the close link between the compiler and the hardware. A particular feature can be either implemented in hardware or left to the compiler. Leaving tasks such as inserting NOP instructions to the compiler leads to simpler hardware. Being aware of the need for a delay, the compiler can attempt to reorder instructions to perform useful tasks in the NOP slots, and thus achieve better performance. On the other hand, the insertion of NOP instructions leads to larger code size. Also, it is often the case that a given processor architecture has several hardware implementations, offering different features. NOP instructions inserted to satisfy the requirements of one implementation may not be needed and, hence, would lead to reduced performance on a different implementation.
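A compiler pass that performs this insertion can be sketched in a few lines. The representation of instructions as (opcode, sources, destination) tuples and the fixed two-slot hazard window are simplifying assumptions for illustration; this is not the text's compiler.

# Illustrative sketch: insert NOPs so that no instruction reads a register
# written by either of the two immediately preceding instructions.
def insert_nops(program, window=2):
    # program: list of (opcode, sources, dest) tuples; dest may be None
    out = []
    for op, srcs, dest in program:
        # pad with NOPs while any of the last `window` emitted instructions
        # writes a register that this instruction reads
        while any(prev[2] is not None and prev[2] in srcs
                  for prev in out[-window:]):
            out.append(("NOP", (), None))
        out.append((op, tuple(srcs), dest))
    return out

prog = [("Mul", ("R2", "R3"), "R4"),
        ("Add", ("R5", "R4"), "R6")]
for ins in insert_nops(prog):
    print(ins)    # Mul, NOP, NOP, Add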

5.2.3 SIDE EFFECTS

The data dependencies encountered in the preceding examples are explicit and easily
detected because the register involved is named as the destination in instruction I1 and
as a source in I2 . Sometimes an instruction changes the contents of a register other
than the one named as the destination. An instruction that uses an auto increment or
auto decrement addressing mode is an example. In addition to storing new data in its
destination location, the instruction changes the contents of a source register used to
access one of its operands. All the precautions needed to handle data dependencies
involving the destination location must also be applied to the registers affected by an
auto increment or auto decrement operation. When a location other than one explicitly
named in an instruction as a destination operand is affected, the instruction is said to
have a side effect. For example, stack instructions, such as push and pop, produce
similar side effects because they implicitly use the auto increment and auto decrement
addressing modes.
Another possible side effect involves the condition code flags, which are used by instructions such as conditional branches and add-with-carry. Suppose that registers R1 and R2 hold a double-precision integer number that we wish to add to another double-precision number in registers R3 and R4. This may be accomplished as follows:
Add R1,R3
Add With Carry R2,R4

An implicit dependency exists between these two instructions through the carry
flag. This flag is set by the first instruction and used in the second instruction, which
performs the operation

R4 ← [R2] + [R4] + carry
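The effect of the carry flag can be checked with a short sketch. The 32-bit word size and the register contents below are assumptions chosen for illustration; the text does not specify them.

# Illustrative sketch: 64-bit addition built from two 32-bit adds, where the
# second add consumes the carry produced by the first (the implicit dependency).
WORD = 1 << 32                      # assumed 32-bit words

def add_with_carry_pair(lo1, hi1, lo2, hi2):
    total_lo = lo1 + lo2
    carry = total_lo >= WORD        # carry flag set by the first Add
    total_hi = hi1 + hi2 + carry    # Add With Carry uses that flag
    return total_lo % WORD, total_hi % WORD

lo, hi = add_with_carry_pair(0xFFFFFFFF, 0x00000001, 0x00000001, 0x00000002)
print(hex(lo), hex(hi))             # 0x0 0x4: the carry propagated correctly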

Instructions that have side effects give rise to multiple data dependencies, which
lead to a substantial increase in the complexity of the hardware or software needed
to resolve them. For this reason, instructions designed for execution on pipelined
hardware should have few side effects. Ideally, only the contents of the destination
location, either a register or a memory location, should be affected by any given
instruction. Side effects, such as setting the condition code flags or updating the
contents of an address pointer, should be kept to a minimum. Note, however, that the auto increment and auto decrement addressing modes are potentially useful. Condition code flags are also needed for recording such information as the generation of a carry or the occurrence of overflow in an arithmetic operation. In Section 5.4 we show how such functions can be provided by other means that are consistent with a pipelined organization and with the requirements of optimizing compilers.

5.3 INSTRUCTION HAZARDS

The purpose of the instruction fetch unit is to supply the execution units with a
steady stream of instructions. Whenever this stream is interrupted, the pipeline
stalls, as Figure 5.4 illustrates for the case of a cache miss. A branch instruction
may also cause the pipeline to stall. We will now examine the effect of branch
instructions and the techniques that can be used for mitigating their impact. We start
with unconditional branches.

5.3.1 UNCONDITIONAL BRANCHES

Figure 5.8 shows a sequence of instructions being executed in a two-stage pipeline. Instructions I1 to I3 are stored at successive memory addresses, and I2 is a branch instruction. Let the branch target be instruction Ik. In clock cycle 3, the fetch operation for instruction I3 is in progress at the same time that the branch instruction is being decoded and the target address computed. In clock cycle 4, the processor must discard I3, which has been incorrectly fetched, and fetch instruction Ik. In the meantime, the hardware unit responsible for the Execute (E) step must be told to do nothing during that clock period. Thus, the pipeline is stalled for one clock cycle.
The time lost as a result of a branch instruction is often referred to as the branch penalty. In Figure 5.8, the branch penalty is one clock cycle. For a longer pipeline, the branch penalty may be higher. For example, Figure 5.9a shows the effect of a branch instruction on a four-stage pipeline. We have assumed that the branch address is computed in step E2. Instructions I3 and I4 must be discarded, and the target instruction, Ik, is fetched in clock cycle 5. Thus, the branch penalty is two clock cycles.

Reducing the branch penalty requires the branch address to be computed earlier
in the pipeline. Typically, the instruction fetch unit has dedicated hardware to
identify a branch instruction and compute the branch target address as quickly as
possible after an instruction is fetched. With this additional hardware, both of
these tasks can be performed in step D2 , leading to the sequence of events shown
in Figure 5.9b. In this case, the branch penalty is only one clock cycle.
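The cost of the penalty can be estimated with a simple model. The 20 percent branch frequency quoted later in Section 5.3.2 and the penalties of Figures 5.8 and 5.9 are used here as assumed inputs; the model itself (base CPI plus penalty per branch) is a standard approximation, not a result from the text.

# Illustrative sketch: average cycles per instruction (CPI) with a branch penalty,
# assuming every branch pays the full penalty (no prediction, no delay slots).
def average_cpi(branch_fraction, penalty_cycles, base_cpi=1.0):
    return base_cpi + branch_fraction * penalty_cycles

print(average_cpi(0.20, 2))   # target computed in E: CPI = 1.4
print(average_cpi(0.20, 1))   # target computed in D instead: CPI = 1.2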

Figure 5.8 An idle cycle caused by a branch instruction.

Figure 5.9 Branch timing.

Instruction Queue and Prefetching


Either a cache miss or a branch instruction stalls the pipeline for one or more clock
cycles.

Figure 5.10 Use of an instruction queue in the hardware organization of Figure 5.2b.

To reduce the effect of these interruptions, many processors employ
sophisticated fetch units that can fetch instructions before they are needed and put
them in a queue.
Typically, the instruction queue can store several instructions. A separate unit,
which we call the dispatch unit, takes instructions from the front of the queue and
sends them to the execution unit. This leads to the organization shown in Figure
5.10. The dispatch unit also performs the decoding function.
To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions. It attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions. When the pipeline stalls because of a data hazard, for example, the dispatch unit is not able to issue instructions from the instruction queue. However, the fetch unit continues to fetch instructions and add them to the queue. Conversely, if there is a delay in fetching instructions because of a branch or a cache miss, the dispatch unit continues to issue instructions from the instruction queue.
We have assumed that initially the queue contains one instruction. Every fetch
operation adds one instruction to the queue and every dispatch operation reduces
the queue length by one. Hence, the queue length remains the same for the first
four clock cycles. (There is both an F and a D step in each of these cycles.) Suppose
that instruction I1 introduces a 2-cycle stall. Since space is available in the queue,
the fetch unit continues to fetch instructions and the queue length rises to 3 in
clock cycle 6.
Instruction I5 is a branch instruction. Its target instruction, Ik , is fetched in cycle
7, and instruction I6 is discarded. The branch instruction would normally cause a
stall in cycle 7 as a result of discarding instruction I6 . Instead, instruction I4 is
dispatched from the queue to the decoding stage. After discarding I6 , the queue
length drops to 1 in cycle 8. The queue length will be at this value until another stall
is encountered.
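The decoupling of fetching from dispatching can be sketched with a simple queue model. The queue capacity and the specific stall pattern below are arbitrary assumptions for illustration, not the values used in the example above.

# Illustrative sketch: a fetch unit keeps filling an instruction queue while the
# dispatch unit drains it, so short dispatch stalls need not stop fetching.
from collections import deque

queue = deque(maxlen=8)             # assumed queue capacity
next_to_fetch = 1
dispatch_stalled_in = {5, 6}        # assumed cycles in which dispatch is stalled

for cycle in range(1, 11):
    if len(queue) < queue.maxlen:   # fetch one instruction per cycle if space exists
        queue.append(f"I{next_to_fetch}")
        next_to_fetch += 1
    dispatched = None
    if cycle not in dispatch_stalled_in and queue:
        dispatched = queue.popleft()
    print(cycle, dispatched, len(queue))   # the queue length grows during the stall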
5.3.2 CONDITIONAL BRANCHES AND BRANCH PREDICTION

A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction. The decision to branch cannot be made until the execution of that instruction has been completed.
Branch instructions occur frequently. In fact, they represent about 20 percent of the dynamic instruction count of most programs. (The dynamic count is the number of instruction executions, taking into account the fact that some program instructions are executed many times because of loops.) Because of the branch penalty, this large percentage would reduce the gain in performance expected from pipelining. Fortunately, branch instructions can be handled in several ways to reduce their negative impact on the rate of execution of instructions.

Delayed Branch
In Figure 5.8, the processor fetches instruction I3 before it determines whether
the current instruction, I2 , is a branch instruction. When execution of I2 is
completed and a branch is to be made, the processor must discard I3 and fetch
the instruction at the branch target. The location following a branch instruction is
called a branch delay slot. There may be more than one branch delay slot, depending
on the time it takes to execute a branch instruction. For example, there are two
branch delay slots in Figure 5.9a and one delay slot in Figure 5.9b. The instructions
in the delay slots are always fetched and at least partially executed before the
branch decision is made and the branch target address is computed.

A technique called delayed branching can minimize the penalty incurred as a result of conditional branch instructions. The idea is simple. The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken. The objective is to be able to place useful instructions in these slots. If no useful instructions can be placed in the delay slots, these slots must be filled with NOP instructions. This situation is exactly the same as in the case of data dependency.

Consider the instruction sequence given in Figure 5.12a. Register R2 is used as a counter to determine the number of times the contents of register R1 are shifted left. For a processor with one delay slot, the instructions can be reordered as shown in Figure 5.12b. The shift instruction is fetched while the branch instruction is being executed. After evaluating the branch condition, the processor fetches the instruction at LOOP or at NEXT, depending on whether the branch condition is true or false, respectively.

In either case, it completes execution of the shift instruction. The sequence of events during the last two passes in the loop is illustrated in Figure 5.13. Pipelined operation is not interrupted at any time, and there are no idle cycles. Logically, the program is executed as if the branch instruction were placed after the shift instruction. That is, branching takes place one instruction later than where the branch instruction appears in the instruction sequence in the memory, hence the name “delayed branch.”

Figure 5.12 Reordering of instructions for a delayed branch.

Figure 5.13 Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 5.12.

The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions as in Figure 5.12. Experimental data collected from many
programs indicate that sophisticated compilation techniques can use one branch delay
slot in as many as 85 percent of the cases. For a processor with two branch delay slots,
the compiler attempts to find two instructions preceding the branch instruction that it
can move into the delay slots without introducing a logical error. The chances of finding
two such instructions are considerably less than the chances of finding one. Thus, if
increasing the number of pipeline stages involves an increase in the number of branch
delay slots, the potential gain in performance may not be fully realized.

Branch Prediction
Another technique for reducing the branch penalty associated with conditional
branches is to attempt to predict whether or not a particular branch will be taken. The
simplest form of branch prediction is to assume that the branch will not take place and to
continue to fetch instructions in sequential address order. Until the branch condition is
evaluated, instruction execution along the predicted path must be done on a speculative
basis. Speculative execution means that instructions are executed before the processor
is certain that they are in the correct execution sequence. Hence, care must be taken that
no processor registers or memory locations are updated until it is confirmed that these
instructions should indeed be executed. If the branch decision indicates otherwise, the
instructions and all their associated data in the execution units must be purged, and the
correct instructions fetched and executed.

An incorrectly predicted branch is illustrated in Figure 5.14 for a four-stage pipeline.

The figure shows a Compare instruction followed by a Branch>0 instruction. Branch prediction takes place in cycle 3, while instruction I3 is being fetched. The fetch unit predicts that the branch will not be taken, and it continues to fetch instruction I4 as I3 enters the Decode stage. The results of the compare operation are available at the end of cycle 3. Assuming that they are forwarded immediately to the instruction fetch unit, the branch condition is evaluated in cycle 4. At this point, the instruction fetch unit realizes that the prediction was incorrect, and the two instructions in the execution pipe are purged. A new instruction, Ik, is fetched from the branch target address in clock cycle 5.

If branch outcomes were random, then half the branches would be taken. Then
the simple approach of assuming that branches will not be taken would save the time
lost to conditional branches 50 percent of the time. However, better performance can
be achieved if we arrange for some branch instructions to be predicted as taken and
others as not taken, depending on the expected program behavior. For example, a branch
instruction at the end of a loop causes a branch to the start of the loop for every pass
through the loop except the last one. Hence, it is advantageous to assume that this
branch will be taken and to have the instruction fetch unit start to fetch instructions at
the branch target address. On the other hand, for a branch instruction at the beginning
of a program loop, it is advantageous to assume that the branch will not be taken.
A decision on which way to predict the result of the branch may be made in
hardware by observing whether the target address of the branch is lower than or higher
than the address of the branch instruction. A more flexible approach is to have the
compiler decide whether a given branch instruction should be predicted taken or not
taken. The branch instructions of some processors, such as SPARC, include a branch
prediction bit, which is set to 0 or 1 by the compiler to indicate the desired behavior.
The instruction fetch unit checks this bit to predict whether the branch will be taken or
not taken.

With either of these schemes, the branch prediction decision is always the same
every time a given instruction is executed. Any approach that has this characteristic is
called static branch prediction. Another approach in which the prediction decision may
change depending on execution history is called dynamic branch prediction.
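Dynamic prediction is usually based on a small amount of per-branch history. The 2-bit saturating counter below is one common realization; it is offered as an illustrative sketch and is not described in this text, and the initial state is an arbitrary assumption.

# Illustrative sketch: a 2-bit saturating counter, one common dynamic predictor.
# States 0-1 predict "not taken", states 2-3 predict "taken"; the state moves
# one step toward the actual outcome after every execution of the branch.
class TwoBitPredictor:
    def __init__(self):
        self.state = 1                     # start weakly "not taken" (an assumption)

    def predict(self):
        return self.state >= 2             # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:         # e.g. a loop branch: taken 9 times, then not
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "correct predictions out of 10")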

5.4 INFLUENCES ON INSTRUCTION SETS

We have seen that some instructions are much better suited to pipelined execution than
others. For example, instruction side effects can lead to undesirable data dependencies.
In this section, we examine the relationship between pipelined execution and machine
instruction features. We discuss two key aspects of machine instructions — addressing
modes and condition code flags.

5.4.1 ADDRESSING MODES

Addressing modes should provide the means for accessing a variety of data structures
simply and efficiently. Useful addressing modes include index, indirect, autoincrement,
and autodecrement. Many processors provide various combinations of these modes to
increase the flexibility of their instruction sets. Complex addressing modes, such as those
involving double indexing, are often encountered.
In choosing the addressing modes to be implemented in a pipelined processor, we must
consider the effect of each addressing mode on instruction flow in the pipeline. Two
important considerations in this regard are the side effects of modes such as
autoincrement and autodecrement and the extent to which complex addressing modes
cause the pipeline to stall. Another important factor is whether a given mode is likely to
be used by compilers.
To compare various approaches, we assume a simple model for accessing operands in the
memory. The load instruction Load X(R1),R2 takes five cycles to complete execution, as
indicated in Figure 5.5. However, the instruction

Load (R1),R2
can be organized to fit a four-stage pipeline because no address computation is required.
Access to memory can take place in stage E. A more complex addressing mode may
require several accesses to the memory to reach the named operand. For example, the instruction
Load (X(R1)),R2
may be executed as shown in Figure 5.16a, assuming that the index offset, X, is given in
the instruction word. After computing the address in cycle 3, the processor needs to
access memory twice — first to read location X+[R1] in clock cycle 4 and then to read
location [X+[R1]] in cycle 5. If R2 is a source operand in the next instruction, that
instruction would be stalled for three cycles, which can be reduced to two cycles with
operand forwarding, as shown.

Figure 5.16 Equivalent operations using complex and simple addressing modes.
To implement the same Load operation using only simple addressing modes
requires several instructions. For example, on a computer that allows three
operand addresses, we can use

Add #X,R1,R2
Load (R2),R2
Load (R2),R2

The Add instruction performs the operation R2 ← X+ [R1]. The two Load instructions
fetch the address and then the operand from the memory. This sequence of
instructions takes exactly the same number of clock cycles as the original, single
Load instruction, as shown in Figure 5.16b.

This example indicates that, in a pipelined processor, complex addressing modes that involve several accesses to the memory do not necessarily lead to faster execution. The main advantage of such modes is that they reduce the number of instructions needed to perform a given task and thereby reduce the program space needed in the main memory. Their main disadvantage is that their long execution times cause the pipeline to stall, thus reducing its effectiveness. They require more complex hardware to decode and execute them. Also, they are not convenient for compilers to work with.
The instruction sets of modern processors are designed to take maximum
advantage of pipelined hardware. Because complex addressing modes are not suitable
for pipelined execution, they should be avoided. The addressing modes used in
modern processors often have the following features:

• Access to an operand does not require more than one access to the memory.
• Only load and store instructions access memory operands.
• The addressing modes used do not have side effects.

Three basic addressing modes that have these features are register, register
indirect, and index. The first two require no address computation. In the index
mode, the address can be computed in one cycle, whether the index value is given
in the instruction or in a register. Memory is accessed in the following cycle. None of
these modes has any side effects, with one possible exception. Some architectures,
such as ARM, allow the address computed in the index mode to be written back
into the index register. This is a side effect that would not be allowed under the
guidelines above. Note also that relative addressing can be used; this is a special
case of indexed addressing in which the program counter is used as the index
register.

5.4.2 CONDITION CODES

In many processors, the condition code flags are stored in the processor status
register. They are either set or cleared by many instructions, so that they can be
tested by subsequent conditional branch instructions to change the flow of
program execution. An optimizing compiler for a pipelined processor attempts to
reorder instructions to avoid stalling the pipeline when branches or data
dependencies between successive instructions occur. In doing so, the compiler
must ensure that reordering does not cause a change in the outcome of a
computation. The dependency introduced by the condition-code flags reduces the
flexibility available for the compiler to reorder instructions.
Consider the sequence of instructions in Figure 5.17a, and assume that the execution of the Compare and Branch=0 instructions proceeds as in Figure 5.14. The branch decision takes place in step E2 rather than D2 because it must await the result of the Compare instruction. The execution time of the Branch instruction can be reduced by interchanging the Add and Compare instructions, as shown in Figure 5.17b.

Figure 5.17 Instruction reordering.

This will delay the branch instruction by one cycle relative to the Compare instruction. As a result, at the time the Branch instruction is being decoded the result of the Compare instruction will be available and a correct branch decision will be made. There would be no need for branch prediction. However, interchanging the Add and Compare instructions can be done only if the Add instruction does not affect the condition codes.

These observations lead to two important conclusions about the way condition codes should be handled. First, to provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible. Second, the compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not. An instruction set designed with pipelining in mind usually provides the desired flexibility. Figure 5.17b shows the instructions reordered assuming that the condition code flags are affected only when this is explicitly stated as part of the instruction OP code. The SPARC and ARM architectures provide this flexibility.

5.5 Large Computer Systems

5.5.1 Parallel Processing

Parallel processing is a term used to denote a large class of techniques that are used to
provide simultaneous data-processing tasks for the purpose of increasing the computational
speed of a computer system.
The purpose of parallel processing is to speed up the computer processing capability and
increase its throughput, that is, the amount of processing that can be accomplished during a
given interval of time. The amount of hardware increases with parallel processing, and with it,
the cost of the system increases.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we distinguish between parallel and serial operations by the type of registers used, e.g., shift registers versus registers with parallel load. There are a variety of ways that parallel processing can be classified:
• the internal organization of the processors,
• the interconnection structure between processors, and
• the flow of information through the system.

M. J. Flynn considers the organization of a computer system by the number of instructions and data items that are manipulated simultaneously:
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data stream (SIMD)
• Multiple instruction stream, single data stream (MISD)
• Multiple instruction stream, multiple data stream (MIMD)

SISD
Represents the organization of a single computer containing a control unit, a processor unit, and a memory unit. Instructions are executed sequentially, and the system may or may not have internal parallel processing capabilities. Parallel processing may be achieved by means of multiple functional units or by pipeline processing.
SIMD
Represents an organization that includes many processing units under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data. The shared memory unit must contain multiple modules so that it can communicate with all the processors simultaneously.
MISD & MIMD
The MISD structure is only of theoretical interest, since no practical system has been constructed using this organization.
The MIMD organization refers to a computer system capable of processing several programs at the same time, e.g., multiprocessor and multicomputer systems.

Flynn’s classification depends on the distinction between the performance of the control unit and
the data-processing unit. It emphasizes the behavioral characteristics of the computer system
rather than its operational and structural interconnections. One type of parallel processing that
does not fit Flynn’s classification is pipelining.
5.5.2 Array Processing
An array processor is a processor that performs computations on large arrays of data. The term is used to refer to two different types of processors.
Attached array processor: an auxiliary processor intended to improve the performance of the host computer in specific numerical computation tasks.
SIMD array processor: a processor with a single-instruction multiple-data organization. It manipulates vector instructions by means of multiple functional units responding to a common instruction.
Attached Array Processor
Its purpose is to enhance the performance of the computer by providing vector processing for complex scientific applications.
The figure below shows the interconnection of an attached array processor to a host computer. For example, when attached to a VAX 11 computer, the FSP-164/MAX from Floating-Point Systems increases the computing power of the VAX to 100 megaflops. The objective of the attached array processor is to provide vector manipulation capabilities to a conventional computer at a fraction of the cost of a supercomputer.

Fig: Attached array processor with host computer


SIMD Array Processor
An SIMD array processor is a computer with multiple processing units operating in parallel. A general block diagram of an array processor is shown in the figure below. It contains a set of identical processing elements (PEs), each having a local memory M. Each PE includes an ALU, a floating-point arithmetic unit, and working registers. Vector instructions are broadcast to all PEs simultaneously. Masking schemes are used to control the status of each PE during the execution of vector instructions. Each PE has a flag that is set when the PE is active and reset when the PE is inactive.

Fig: SIMD array processor organization

Characteristics of multiprocessors

A multiprocessor system is an interconnection of two or more CPUs with memory and input-output equipment. The term “processor” in multiprocessor can mean either a central processing unit (CPU) or an input-output processor (IOP). Multiprocessors are classified as multiple instruction stream, multiple data stream (MIMD) systems.
The similarity and distinction between a multiprocessor and a multicomputer are as follows. Similarity: both support concurrent operations. Distinction: a multicomputer network consists of several autonomous computers that may or may not communicate with each other, whereas a multiprocessor system is controlled by one operating system that provides interaction between processors, and all the components of the system cooperate in the solution of a problem.
Multiprocessing improves the reliability of the system.
The benefit derived from a multiprocessor organization is improved system performance. Multiple independent jobs can be made to operate in parallel, or a single job can be partitioned into multiple parallel tasks.
Multiprocessing can improve performance by decomposing a program into parallel executable tasks. The user can explicitly declare that certain tasks of the program be executed in parallel; this must be done prior to loading the program by specifying the parallel executable segments. Another approach is to provide a compiler with multiprocessor software that can automatically detect parallelism in a user’s program. Multiprocessors are classified by the way their memory is organized.

A multiprocessor system with common shared memory is classified as a shared-memory or tightly coupled multiprocessor; such systems can tolerate a higher degree of interaction between tasks. A system in which each processor element has its own private local memory is classified as a distributed-memory or loosely coupled system; these systems are most efficient when the interaction between tasks is minimal.

5.5.3 THE STRUCTURE OF GENERAL-PURPOSE MULTIPROCESSORS

5.5.3.1 UMA (Uniform Memory Access) Multiprocessor


An interconnection network permits n processors to access k memories; thus, any of the processors can access any of the memories. The interconnection network may introduce a network delay between a processor and a memory module.
A system which has the same network latency for all accesses from the processors to the memory modules is called a UMA multiprocessor.
Although the latency is uniform, it may be large for a network that connects many processors and many memory modules. For better performance, it is desirable to place a memory module close to each processor.
Disadvantage:
Interconnection networks with very short delays are costly and complex to implement.

5.5.3.2 NUMA (Non-Uniform Memory Access) Multiprocessors

Memory modules are attached directly to the processors. The network latency is avoided when a processor makes a request to access its local memory. However, a request to access a remote memory module must pass through the network. Because of the difference in latencies for accessing local and remote portions of the shared memory, systems of this type are called NUMA multiprocessors.
Advantage:
A high computation rate is achieved in all processors.
Disadvantage:
The remote accesses take considerably longer than accesses to the local memory.
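The impact of the local/remote latency gap can be estimated with a weighted average. The latencies and the fractions of local accesses below are illustrative assumptions, not figures from the text.

# Illustrative sketch: average memory access time in a NUMA system.
local_ns, remote_ns = 100.0, 400.0   # assumed latencies for local and remote accesses
for local_fraction in (0.5, 0.9, 0.99):
    avg = local_fraction * local_ns + (1 - local_fraction) * remote_ns
    print(local_fraction, avg)        # the more accesses stay local, the closer to 100 ns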

5.5.3.3 Distributed Memory Systems

All memory modules serve as private memories for the processors that are directly connected to them. A processor cannot access a remote memory without the cooperation of the remote processor. This cooperation takes place in the form of messages exchanged by the processors. Such systems are often called distributed-memory systems.

5.5.4 Interconnection Structures
The components that form a multiprocessor system are CPUs, IOPs connected to input-output devices, and a memory unit. The interconnection between the components can have different physical configurations, depending on the number of transfer paths that are available:
• between the processors and memory in a shared-memory system, or
• among the processing elements in a loosely coupled system.
There are several physical forms available for establishing an interconnection network:
• Time-shared common bus
• Multiport memory
• Crossbar switch
• Multistage switching network
• Hypercube system

Time Shared Common Bus:
A common-bus multiprocessor system consists of a number of processors connected through a common path to a memory unit.
Disadvantage: only one processor can communicate with the memory or another processor at any given time. As a consequence, the total overall transfer rate within the system is limited by the speed of the single path.
A more economical implementation of a dual bus structure is depicted in the figure below. Part of the local memory may be designed as a cache memory attached to the CPU.

Fig: Time shared common bus organization

Fig: System bus structure for multiprocessors

Multiport Memory:
A multiport memory system employs separate buses between each memory module and each CPU. The module must have internal control logic to determine which port will have access to memory at any given time. Memory access conflicts are resolved by assigning fixed priorities to each memory port.
Advantage: a high transfer rate can be achieved because of the multiple paths.
Disadvantage: it requires expensive memory control logic and a large number of cables and connections.

Fig: Multiport memory organization
Crossbar Switch:
Consists of a number of crosspoints that are placed at intersections between processor buses and memory module paths. The small square in each crosspoint is a switch that determines the path from a processor to a memory module.
Advantage: supports simultaneous transfers from all memory modules.
Disadvantage: the hardware required to implement the switch can become quite large and complex. The figure below shows the functional design of a crossbar switch connected to one memory module.

Fig: Crossbar switch

Multistage Switching Network:
The basic component of a multistage network is a two-input, two-output interchange switch, as shown in the figure below.

Using the 2x2 switch as a building block, it is possible to build a multistage network to control the communication between a number of sources and destinations. To see how this is done, consider the binary tree shown in the figure below. Certain request patterns cannot be satisfied simultaneously; for example, if P1 is connected to one of the destinations 000 through 011, then P2 can only be connected to one of the destinations 100 through 111. One such topology is the omega switching network shown in the figure below.

Fig: 8 x 8 Omega Switching Network
Some request patterns cannot be connected simultaneously; for example, two sources cannot be connected simultaneously to destinations 000 and 001. In a tightly coupled multiprocessor system, the source is a processor and the destination is a memory module. In a loosely coupled multiprocessor system, both the source and destination are processing elements.
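Routing in an omega network is commonly done with destination-tag routing, in which successive bits of the destination address steer the request through successive switch stages. This is a standard technique offered here as an illustrative sketch; the text does not spell out the routing rule.

# Illustrative sketch: destination-tag routing in an 8 x 8 omega network.
# At stage i, bit i of the destination (most significant bit first) selects the
# upper (0) or lower (1) output of the 2x2 switch the request passes through.
def omega_route(destination, n_bits=3):
    choices = []
    for i in reversed(range(n_bits)):           # MSB first
        choices.append("lower" if (destination >> i) & 1 else "upper")
    return choices

print(omega_route(0b000))   # ['upper', 'upper', 'upper']
print(omega_route(0b101))   # ['lower', 'upper', 'lower']
# Requests headed for 000 and 001 differ only in the last bit, so they meet at
# the same final switch and cannot be connected simultaneously.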
Hypercube System:
The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube. Each processor forms a node of the cube; in effect, it contains not only a CPU but also local memory and an I/O interface. Each processor address differs from that of each of its n neighbors by exactly one bit position. The figure below shows the hypercube structure for n = 1, 2, and 3. Routing messages through an n-cube structure may take from one to n links from a source node to a destination node. A routing procedure can be developed by computing the exclusive-OR of the source node address with the destination node address; the resulting binary value has 1 bits in the positions corresponding to the axes on which the two nodes differ. The message is then sent along any one of those axes. A representative of the hypercube architecture is the Intel iPSC computer complex. It consists of 128 (n = 7) microcomputers; each node consists of a CPU, a floating-point processor, local memory, and serial communication interface units.
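The exclusive-OR routing rule translates directly into code. The sketch below is illustrative only; picking the lowest differing bit first is one arbitrary choice among the valid axes.

# Illustrative sketch: route a message through an n-cube by repeatedly flipping
# one bit in which the current node and the destination still differ.
def hypercube_route(source, destination):
    path = [source]
    current = source
    while current != destination:
        differing = current ^ destination          # exclusive-OR marks the differing axes
        lowest_axis = differing & -differing       # pick one axis (lowest set bit here)
        current ^= lowest_axis                     # move to the neighbor along that axis
        path.append(current)
    return path

print([format(node, "03b") for node in hypercube_route(0b000, 0b101)])
# ['000', '001', '101']: two links, one per bit in which the addresses differ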

Fig: Hypercube structures for n=1,2,3
