UNIT 3 Second Half Notes

Pipelining and Hazards


UNIT 3

PROCESSOR AND CONTROL UNIT


3.4 PIPELINING
Pipelining is an implementation technique in which multiple instructions are
overlapped in execution.
MIPS instructions classically take five steps:
1. Fetch instruction from memory.
2. Read registers while decoding the instruction. The regular format of MIPS
instructions allows reading and decoding to occur simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.
Let’s create a pipeline using the following eight instructions: load word (lw), store
word (sw), add (add), subtract (sub), AND (and), OR (or), set less than (slt), and
branch on equal (beq).
Compare the average time between instructions of a single-cycle
implementation, in which all instructions take one clock cycle, to a pipelined
implementation. The operation times for the major functional units in this example
are 200 ps for memory access, 200 ps for ALU operation, and 100 ps for register
file read or write. In the single-cycle model, every instruction takes exactly one
clock cycle, so the clock cycle must be stretched to accommodate the slowest
instruction.
Figure 4.26 shows the time required for each of the eight instructions. The single-
cycle design must allow for the slowest instruction—in Figure 4.26 it is
lw—so the time required for every instruction is 800 ps. Similarly Figure 4.27
compares nonpipelined and pipelined execution of three load word instructions.
Thus, the time between the first and fourth instructions in the nonpipelined design
is 3 × 800 ns or 2400 ps.
All the pipeline stages take a single clock cycle, so the clock cycle must be
long enough to accommodate the slowest operation. Just as the single-cycle design
must take the worst-case clock cycle of 800 ps, even though some instructions can
be as fast as 500 ps, the pipelined execution clock cycle must have the worst-case
clock cycle of 200 ps, even though some stages take only 100 ps. Pipelining still
offers a fourfold performance improvement: the time between the first and fourth
instructions is 3 × 200 ps or 600 ps.
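The arithmetic above can be checked with a short Python sketch (not part of the original notes); the stage latencies follow the example figures of 200 ps for memory access, 200 ps for ALU operation, and 100 ps for register file read or write:

```python
# Sketch: single-cycle vs. pipelined timing for the latencies given above.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_clock():
    # One long cycle sized for the slowest instruction: lw uses all five
    # stages, so 200 + 100 + 200 + 200 + 100 = 800 ps.
    return sum(STAGE_PS.values())

def pipelined_clock():
    # The pipelined clock is set by the slowest single stage.
    return max(STAGE_PS.values())

def time_first_to_fourth(clock_ps):
    # Three instruction issues separate the first and fourth instructions.
    return 3 * clock_ps

print(time_first_to_fourth(single_cycle_clock()))  # 2400
print(time_first_to_fourth(pipelined_clock()))     # 600
```

This reproduces the 2400 ps vs. 600 ps comparison: a fourfold improvement, limited by the 200 ps worst-case stage rather than the ideal fivefold speed-up.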

We can turn the pipelining speed-up discussion above into a formula. If the
stages are perfectly balanced, then the time between instructions on the pipelined
processor—assuming ideal conditions—is equal to

Time between instructions (pipelined) =
    Time between instructions (nonpipelined) / Number of pipe stages

Under ideal conditions, the speed-up from pipelining is therefore approximately
equal to the number of pipe stages.
Pipelining improves performance by increasing instruction
throughput, as opposed to decreasing the execution time of an individual
instruction, but instruction throughput is the important metric because real
programs execute billions of instructions.
3.4.1 Designing Instruction Sets for Pipelining
First, all MIPS instructions are the same length. This restriction makes it
much easier to fetch instructions in the first pipeline stage and to decode them in
the second stage.
Second, MIPS has only a few instruction formats, with the source register
fields being located in the same place in each instruction. This symmetry means
that the second stage can begin reading the register file at the same time that the
hardware is determining what type of instruction was fetched. If MIPS instruction
formats were not symmetric, we would need to split stage 2, resulting in six
pipeline stages.
Third, memory operands only appear in loads or stores in MIPS. This
restriction means we can use the execute stage to calculate the memory address and
then access memory in the following stage.
Fourth, operands must be aligned in memory in MIPS. So, we need not
worry about a single data transfer instruction requiring two data memory accesses;
the requested data can be transferred between processor and memory in a single
pipeline stage.

3.5 Pipelined Datapath


The division of an instruction into five stages means a five-stage pipeline,
which in turn means that up to five instructions will be in execution during any
single clock cycle. Thus, we must separate the datapath into five pieces, with each
piece named corresponding to a stage of instruction execution:

1. IF: Instruction fetch

2. ID: Instruction decode and register file read


3. EX: Execution or address calculation

4. MEM: Data memory access

5. WB: Write back
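In an ideal pipeline, instruction i enters IF in cycle i and each stage one cycle later; the following Python sketch (an illustration, not from the notes) computes this occupancy:

```python
# Sketch: which stage each instruction occupies in each cycle of an ideal
# five-stage pipeline with no stalls (both indices 0-based).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr_index, cycle):
    """Return the stage instruction `instr_index` is in during `cycle`,
    or None if it is not in the pipeline that cycle."""
    s = cycle - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None

# Pipeline diagram for three instructions over cycles 0..6:
for i in range(3):
    row = [(stage_of(i, c) or "..") for c in range(7)]
    print(f"I{i}: " + " ".join(f"{st:>3}" for st in row))
```

The printed diagram shows the diagonal pattern of the standard pipeline figures: each instruction is shifted one cycle right of its predecessor, and up to five instructions overlap in any cycle.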

In Figure 4.33, these five components correspond roughly to the way the datapath
is drawn; instructions and data move generally from left to right through the five
stages as they complete execution. There are, however, two exceptions to this
left-to-right flow of instructions:

 The write-back stage, which places the result back into the register file in the
middle of the datapath
 The selection of the next value of the PC, choosing between the incremented
PC and the branch address from the MEM stage

Data flowing from right to left does not affect the current instruction; these reverse
data movements influence only later instructions in the pipeline. Note that the first
right-to-left flow of data can lead to data hazards and the second leads to control
hazards.
The following diagram shows the pipelined datapath with the pipeline registers
highlighted. All instructions advance during each clock cycle from one pipeline
register to the next. The registers are named for the two stages separated by that
register.

For example, the pipeline register between the IF and ID stages is called
IF/ID. Notice that there is no pipeline register at the end of the write-back stage.
All instructions must update some state in the processor: the register file,
memory, or the PC.
For example, a load instruction will place its result in one of the 32 registers,
and any later instruction that needs that data will simply read the appropriate
register.

The pipeline registers separate each pipeline stage. They are labeled by the
stages that they separate; for example, the first is labeled IF/ID because it
separates the instruction fetch and instruction decode stages.

The registers must be wide enough to store all the data corresponding to the
lines that go through them. For example, the IF/ID register must be 64 bits wide,
because it must hold both the 32-bit instruction fetched from memory and the
incremented 32-bit PC address.
We highlight the right half of registers or memory when they are being read
and highlight the left half when they are being written.

Example: Load Instruction (lw) - lw $s1, 100($s0)

1. Instruction fetch: The Figure shows the instruction being read from
memory using the address in the PC and then being placed in the IF/ID
pipeline register. The PC address is incremented by 4 and then written back
into the PC to be ready for the next clock cycle. This incremented address is
also saved in the IF/ID pipeline register in case it is needed later for an
instruction, such as beq.
2. Instruction decode and register file read: The Figure shows the instruction
portion of the IF/ID pipeline register supplying the 16-bit immediate field,
which is sign-extended to 32 bits, and the register numbers to read the two
registers. All three values are stored in the ID/EX pipeline register, along
with the incremented PC address. We again transfer everything that might be
needed by any instruction during a later clock cycle.
3. Execute or address calculation: Figure shows that the load instruction
reads the contents of register 1 and the sign-extended immediate from the
ID/EX pipeline register and adds them using the ALU. That sum is placed in
the EX/MEM pipeline register.
4. Memory access: The Figure shows the load instruction reading the data
memory using the address from the EX/MEM pipeline register and loading
the data into the MEM/WB pipeline register.
5. Write-back: The Figure shows the final step: reading the data from the
MEM/WB pipeline register and writing it into the register file in the middle
of the figure.
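The five steps above can be traced with a small Python sketch (not from the notes; the register values, address 1100, and loaded value 42 are made-up illustrations), with each dict standing in for a pipeline register:

```python
# Hedged sketch: lw $s1, 100($s0) moving through the pipeline registers.
regs = {"$s0": 1000, "$s1": 0}   # illustrative register-file contents
memory = {1100: 42}              # word at address $s0 + 100 (made up)
pc = 0x400000

# 1. IF: fetch the instruction and save the incremented PC.
if_id = {"instr": ("lw", "$s1", 100, "$s0"), "pc_plus_4": pc + 4}

# 2. ID: read the base register and sign-extend the immediate.
op, rt, imm, rs = if_id["instr"]
id_ex = {"rs_val": regs[rs], "imm": imm, "rt": rt,
         "pc_plus_4": if_id["pc_plus_4"]}

# 3. EX: the ALU adds base register and immediate to form the address.
ex_mem = {"alu_out": id_ex["rs_val"] + id_ex["imm"], "rt": id_ex["rt"]}

# 4. MEM: read data memory at the computed address.
mem_wb = {"data": memory[ex_mem["alu_out"]], "rt": ex_mem["rt"]}

# 5. WB: write the loaded value into the register file.
regs[mem_wb["rt"]] = mem_wb["data"]
print(regs["$s1"])  # 42
```

Note how every value an instruction may need later (the incremented PC, the destination register number) is carried forward in the pipeline registers, exactly as the text describes.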

3.6 Pipelined Control


We start with a simple design that adds control to the pipelined datapath. The
first step is to label the control lines on the existing datapath. In the single-cycle
implementation, we assume that the PC is written on each clock cycle, so there
is no separate write signal for the PC. By the same argument, there are no
separate write signals for the pipeline registers (IF/ ID, ID/EX, EX/MEM, and
MEM/WB), since the pipeline registers are also written during each clock cycle.
To specify control for the pipeline, we need only set the control values during
each pipeline stage. Because each control line is associated with a component
active in only a single pipeline stage, we can divide the control lines into five
groups according to the pipeline stage.
1. Instruction fetch: The control signals to read instruction memory and to
write the PC are always asserted, so there is nothing special to control in
this pipeline stage.
2. Instruction decode/register file read: As in the previous stage, the same
thing happens at every clock cycle, so there are no optional control lines
to set.
3. Execution/address calculation: The signals to be set are RegDst, ALUOp,
and ALUSrc. These signals select the result register, the ALU operation,
and either Read data 2 or the sign-extended immediate as the second ALU input.
4. Memory access: The control lines set in this stage are Branch, MemRead,
and MemWrite. The branch equal, load, and store instructions set these
signals, respectively. PCSrc selects the next sequential address unless
control asserts Branch and the ALU result was 0.
5. Write-back: The two control lines are MemtoReg, which decides between
sending the ALU result or the memory value to the register file, and
RegWrite, which writes the chosen value.
Implementing control means setting the nine control lines to these values in
each stage for each instruction. The simplest way to do this is to extend the
pipeline registers to include control information. Since the control lines start
with the EX stage, we can create the control information during instruction
decode. Figure 4.50 above shows that these control signals are then used in
the appropriate pipeline stage as the instruction moves down the pipeline.
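The scheme of generating control values in ID and carrying them along can be sketched in Python (an illustration, not from the notes; the bit settings follow the standard textbook control table for R-format, lw, sw, and beq):

```python
# Sketch: the nine control lines, grouped by the stage that consumes them.
CONTROL = {
    "R":   {"RegDst": 1, "ALUSrc": 0, "ALUOp": 0b10,
            "Branch": 0, "MemRead": 0, "MemWrite": 0,
            "RegWrite": 1, "MemtoReg": 0},
    "lw":  {"RegDst": 0, "ALUSrc": 1, "ALUOp": 0b00,
            "Branch": 0, "MemRead": 1, "MemWrite": 0,
            "RegWrite": 1, "MemtoReg": 1},
    "sw":  {"RegDst": 0, "ALUSrc": 1, "ALUOp": 0b00,   # RegDst is a don't-care
            "Branch": 0, "MemRead": 0, "MemWrite": 1,
            "RegWrite": 0, "MemtoReg": 0},
    "beq": {"RegDst": 0, "ALUSrc": 0, "ALUOp": 0b01,
            "Branch": 1, "MemRead": 0, "MemWrite": 0,
            "RegWrite": 0, "MemtoReg": 0},
}

def signals_for(op, stage):
    """Pick out the control bits a given pipeline stage consumes; the rest
    ride along in the ID/EX, EX/MEM, and MEM/WB pipeline registers."""
    groups = {"EX":  ("RegDst", "ALUSrc", "ALUOp"),
              "MEM": ("Branch", "MemRead", "MemWrite"),
              "WB":  ("RegWrite", "MemtoReg")}
    return {k: CONTROL[op][k] for k in groups[stage]}

print(signals_for("lw", "MEM"))  # {'Branch': 0, 'MemRead': 1, 'MemWrite': 0}
```

All nine values are determined in ID from the opcode; each pipeline register then hands the next stage only the bits it needs.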
3.7 PIPELINE HAZARDS
There are situations in pipelining when the next instruction cannot execute in the
following clock cycle. These events are called pipeline hazards, and there are
three different types.
 A structural hazard occurs when a planned instruction cannot execute in the
proper clock cycle because the hardware does not support the combination
of instructions that are set to execute.
 A data hazard, also called a pipeline data hazard, occurs when a planned
instruction cannot execute in the proper clock cycle because data that is
needed to execute the instruction is not yet available.
 A control hazard, also called a branch hazard, occurs when the proper
instruction cannot execute in the proper pipeline clock cycle because the
instruction that was fetched is not the one that is needed; that is, the flow of
instruction addresses is not what the pipeline expected.

3.8 HANDLING DATA HAZARDS


Data hazards can be handled by two methods:

1. Forwarding
2. Stalling
3.8.1 FORWARDING

Let’s look at a sequence with many dependences:

sub $2, $1,$3 # Register $2 written by sub

and $12,$2,$5 # 1st operand($2) depends on sub

or $13,$6,$2 # 2nd operand($2) depends on sub

add $14,$2,$2 # 1st($2) & 2nd($2) depend on sub

sw $15,100($2) # Base ($2) depends on sub

The last four instructions are all dependent on the result in register $2 of the first
instruction. If register $2 had the value 10 before the subtract instruction and −20
afterwards, the programmer intends that −20 will be used in the following
instructions that refer to register $2.

Figure 4.52 illustrates the execution of these instructions using a
multiple-clock-cycle pipeline representation. Figure 4.52 shows the value of register $2,
which changes during the middle of clock cycle 5, when the sub instruction writes
its result.
Figure 4.52 shows that the values read for register $2 would not be the result
of the sub instruction unless the read occurred during clock cycle 5 or later. Thus,
the instructions that would get the correct value of −20 are add and sw; the AND
and OR instructions would get the incorrect value 10.

The desired result is available at the end of the EX stage or clock cycle 3.
When is the data actually needed by the AND and OR instructions? At the
beginning of the EX stage, or clock cycles 4 and 5, respectively. Thus, we can
execute this segment without stalls if we simply forward the data as soon as it is
available to any units that need it before it is available to read from the register file.

The primary solution is based on the observation that we don’t need to wait
for the instruction to complete before trying to resolve the data hazard.
Forwarding, also called bypassing, is a method of resolving a data hazard by
retrieving the missing data element from internal buffers rather than waiting for it
to arrive from programmer-visible registers or memory.

Figure 4.53 shows the dependences between the pipeline registers and the
inputs to the ALU for the same code sequence. The change is that the dependence
begins from a pipeline register, rather than waiting for the WB stage to write the
register file. Thus, the required data exists in time for later instructions, with the
pipeline registers holding the data to be forwarded.
If we can take the inputs to the ALU from any pipeline register rather than
just ID/EX, then we can forward the proper data by adding multiplexors to the
inputs of the ALU, along with the proper controls.
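The conditions the forwarding unit checks can be sketched in Python (an illustration, not from the notes; `Rd` and `RegWrite` are simplified stand-ins for the EX/MEM.RegisterRd and RegWrite fields of the classic textbook forwarding unit, and only the ALU's first input is shown):

```python
# Sketch of the forwarding conditions for the ALU's first input (ForwardA).
# "10" selects the EX/MEM result, "01" the MEM/WB result, and "00" the
# value read from the register file in ID.

def forward_a(id_ex_rs, ex_mem, mem_wb):
    """Return the forwarding mux select for ALU input A.
    ex_mem and mem_wb are dicts with 'RegWrite' and 'Rd' fields."""
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
        return "10"   # forward the most recent result, from EX/MEM
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
        return "01"   # forward the older value, from MEM/WB
    return "00"       # no hazard: use the register-file value

# sub $2,$1,$3 followed immediately by and $12,$2,$5: when AND reaches EX,
# SUB's result sits in EX/MEM, so input A must be forwarded from there.
print(forward_a(2, {"RegWrite": 1, "Rd": 2}, {"RegWrite": 0, "Rd": 0}))  # 10
```

Checking EX/MEM first matters: if both pipeline registers hold a result for the same register, the more recent EX/MEM value must win. The test against register 0 prevents forwarding a "result" destined for $zero.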

3.8.2 PIPELINE STALL

A pipeline stall, also called a bubble, is a stall initiated in order to resolve a hazard.

Load-use data hazard: A specific form of data hazard in which the data being
loaded by a load instruction has not yet become available when it is needed by
another instruction.

nop: An instruction that does no operation and changes no state.

One case where forwarding cannot save the day is when an instruction tries
to read a register following a load instruction that writes the same register. Figure
4.58 illustrates the problem. The data is still being read from memory in clock
cycle 4 while the ALU is performing the operation for the following instruction.
Something must stall the pipeline for the combination of load followed by an
instruction that reads its result.

Hence, in addition to a forwarding unit, we need a hazard detection unit. It
operates during the ID stage so that it can insert the stall between the load and its
use.
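The test the hazard detection unit applies can be sketched in Python (an illustration, not from the notes; the condition mirrors the classic textbook check, with `rt` of the load in ID/EX compared against both source fields of the instruction in IF/ID):

```python
# Sketch: the load-use hazard check performed in the ID stage.
# Stall when the instruction in EX is a load (MemRead asserted) whose
# destination register matches either source of the instruction now in ID.

def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) followed by and $4, $2, $5: the AND must stall one cycle.
print(must_stall(1, 2, 2, 5))   # True
# lw followed by an instruction that does not use $2: no stall needed.
print(must_stall(1, 2, 6, 5))   # False
```

When the condition holds, the unit stalls by preventing the PC and IF/ID registers from changing and by zeroing the control signals in ID/EX, which is what turns the instruction behind the load into a bubble.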
The following diagram shows the AND instruction is turned into a nop and all
instructions beginning with the AND instructions are delayed one cycle. In this
example, the hazard forces the AND and OR instructions to repeat in clock cycle 4
what they did in clock cycle 3: AND reads registers and decodes, and OR is
refetched from instruction memory.

A bubble is inserted beginning in clock cycle 4, by changing the AND
instruction to a nop. Note that the AND instruction is really fetched and decoded in
clock cycles 2 and 3, but its EX stage is delayed until clock cycle 5. Likewise the
OR instruction is fetched in clock cycle 3, but its ID stage is delayed until clock
cycle 5. After insertion of the bubble, all the dependences go forward in time and
no further hazards occur.
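The cycle arithmetic behind this one-cycle stall can be checked with a short Python sketch (an illustration, not from the notes; cycles are 1-based, and the MEM-to-EX forward is assumed to deliver the loaded value at the start of the cycle after the load's MEM stage):

```python
# Sketch: why a load-use dependence needs exactly one bubble.

def lw_data_ready(lw_issue_cycle):
    # lw occupies IF, ID, EX, MEM in cycles i..i+3; its data arrives at
    # the end of the MEM stage.
    return lw_issue_cycle + 3

def ex_start(issue_cycle, bubbles=0):
    # EX is the third stage; each bubble delays it by one cycle.
    return issue_cycle + 2 + bubbles

lw, dependent = 1, 2
# The forward works only if the dependent EX begins after the data is ready.
print(ex_start(dependent, bubbles=0) >= lw_data_ready(lw) + 1)  # False: hazard
print(ex_start(dependent, bubbles=1) >= lw_data_ready(lw) + 1)  # True: OK
```

With no bubble the dependent instruction's EX (cycle 4) would overlap the load's MEM; one bubble pushes EX to cycle 5, where the MEM/WB register can supply the value, matching the figure's description.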

3.9 HANDLING CONTROL HAZARDS


A control hazard, also called a branch hazard, occurs when the proper
instruction cannot execute in the proper pipeline clock cycle because the
instruction that was fetched is not the one that is needed; that is, the flow of
instruction addresses is not what the pipeline expected.
An instruction must be fetched at every clock cycle to sustain the pipeline, yet in
our design the decision about whether to branch doesn’t occur until the MEM
pipeline stage.
We look at four schemes for resolving control hazards:

 Assume Branch Not Taken


 Reducing the Delay of Branches
 Dynamic Branch Prediction
 Branch delay slot.

3.9.1 ASSUME BRANCH NOT TAKEN

In control hazards, assuming a branch is not taken means that the next
instruction is fetched and execution continues down the sequential instruction
stream. If the branch is taken, the instructions in the pipeline are discarded.
Discarding instructions, then, means we must be able to flush instructions in the
IF, ID, and EX stages of the pipeline. To flush means to discard instructions in a
pipeline, usually due to an unexpected event.

3.9.2 REDUCING THE DELAY OF BRANCHES

Thus far, we have assumed the next PC for a branch is selected in the MEM
stage, but if we move the branch execution earlier in the pipeline, then fewer
instructions need be flushed.
Moving the branch decision up requires two actions to occur earlier:
computing the branch target address and evaluating the branch decision. The easy
part of this change is to move up the branch address calculation. We already have
the PC value in the IF/ID pipeline register, so we just move the branch adder from
the EX stage to the ID stage.
Adding hardware to compute the branch target address and evaluate the
branch decision in the ID stage reduces the number of stall (flush) cycles to one.
3.9.3 DYNAMIC BRANCH PREDICTION

This strategy uses recent branch history during program execution to predict
whether or not a branch will be taken the next time it occurs. This technique is
called dynamic branch prediction: prediction of branches at runtime using
runtime information.
A branch prediction buffer or branch history table is a small memory
indexed by the lower portion of the address of the branch instruction. The memory
contains a bit that says whether the branch was recently taken or not.
1-Bit Branch Prediction
 Simple Prediction: Uses a single bit to predict the outcome of a branch.
 Initialization: Bit is initially set to a default value (often "not taken").
 Prediction:
o If the bit is 1, the branch is predicted to be taken.
o If the bit is 0, the branch is predicted to be not taken.
 Update:
o If the prediction is correct, the bit remains unchanged.
o If the prediction is incorrect, the bit is changed.
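The rules above can be sketched as a small Python class (an illustration, not from the notes; the table size and branch address are arbitrary):

```python
# Minimal sketch of a 1-bit branch predictor: a small table indexed by the
# low-order bits of the branch address, one bit per entry.

class OneBitPredictor:
    def __init__(self, entries=16):
        self.table = [0] * entries        # 0 = predict not taken (default)

    def predict(self, branch_addr):
        return self.table[branch_addr % len(self.table)] == 1

    def update(self, branch_addr, taken):
        # On a mispredict the single bit simply flips to the actual outcome.
        self.table[branch_addr % len(self.table)] = 1 if taken else 0

p = OneBitPredictor()
outcomes = [True] * 8 + [False] + [True]  # loop branch: mostly taken
mispredicts = 0
for outcome in outcomes:
    if p.predict(0x40) != outcome:
        mispredicts += 1
    p.update(0x40, outcome)
print(mispredicts)  # 3
```

Running this shows the weakness discussed next: besides the cold-start miss, the single not-taken outcome costs two mispredictions, one when the branch is not taken and another on the taken branch right after, because the bit flipped.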

The simple 1-bit prediction scheme has a performance shortcoming: even if
a branch is almost always taken, we can predict incorrectly twice, rather than once,
when it is not taken.
To remedy this weakness, 2-bit prediction schemes are often used. In a 2-bit
scheme, a prediction must be wrong twice before it is changed. The following
diagram shows the finite-state machine for a 2-bit prediction scheme.
State Transitions:
 00: Strongly not taken
 01: Weakly not taken
 10: Weakly taken
 11: Strongly taken
Prediction: Based on the current state of the counter.
Update:
 If the branch is taken, the counter is incremented, saturating at 11 (strongly taken).
 If the branch is not taken, the counter is decremented, saturating at 00 (strongly not taken).
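The four-state scheme is a 2-bit saturating counter, sketched below in Python (an illustration, not from the notes):

```python
# Sketch of a 2-bit saturating counter matching the four states above:
# 0 = strongly not taken (00), 1 = weakly not taken (01),
# 2 = weakly taken (10),       3 = strongly taken (11).

class TwoBitPredictor:
    def __init__(self):
        self.counter = 0                  # start at 00: strongly not taken

    def predict(self):
        return self.counter >= 2          # states 10 and 11 predict taken

    def update(self, taken):
        # Move toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
for taken in [True, True, True, False, True]:
    p.update(taken)
print(p.predict())  # True
```

After three taken branches the counter saturates at strongly taken; the single not-taken outcome only drops it to weakly taken, so the prediction does not flip. This is exactly the "must be wrong twice" behavior that fixes the 1-bit scheme's weakness.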
3.9.4 BRANCH DELAY SLOT

Compilers and assemblers try to place an instruction that always executes
after the branch in the branch delay slot. The job of the software is to make the
successor instructions valid and useful.
Branch delay slot: The slot directly after a delayed branch instruction,
which in the MIPS architecture is filled by an instruction that does not affect the
branch. Figure 4.64 shows the three ways in which the branch delay slot can be
scheduled.
3.10 EXCEPTIONS
Exceptions and interrupts are events other than branches or jumps that change the
normal flow of instruction execution.

Exception: An exception, also called an interrupt, is an unscheduled event that
disrupts program execution; it is used, for example, to detect arithmetic overflow.
The two types of exceptions that our current implementation can generate are
execution of an undefined instruction and arithmetic overflow.
Interrupt: It is an exception that comes from outside of the processor.
We use the term interrupt only when the event is externally caused. Here are five
examples showing whether the situation is internally generated by the processor or
externally generated:

Type of event                       From where?   MIPS terminology
I/O device request                  External      Interrupt
Invoke the OS from user program     Internal      Exception
Arithmetic overflow                 Internal      Exception
Using an undefined instruction      Internal      Exception
Hardware malfunctions               Either        Exception or interrupt

Handling Exception:
The two types of exceptions can occur in the basic MIPS architecture
implementation.
1. Execution of an undefined instruction
2. An arithmetic overflow.
When an exception occurs, the processor saves the address of the offending
instruction in the exception program counter (EPC) and then transfers control to the
operating system at some specified address. The operating system then takes the
appropriate
system at some specified address. The operating system then takes the appropriate
action, which may involve providing some service to the user program, taking
some predefined action in response to an overflow, or stopping the execution of the
program and reporting an error.
After performing whatever action is required because of the exception, the
operating system can terminate the program or may continue its execution, using
the EPC to determine where to restart the execution of the program.
Two main methods used to communicate the reason for an exception:
 The first method used in the MIPS architecture is to include a status
register (called the Cause register), which holds a field that indicates
the reason for the exception.
 A second method is to use vectored interrupts. In a vectored interrupt,
the address to which control is transferred is determined by the cause
of the exception.
We will need to add two additional registers to our current MIPS implementation:
 EPC: A 32-bit register used to hold the address of the affected instruction.
 Cause: A register used to record the cause of the exception. In the MIPS
architecture, this register is 32 bits, although some bits are currently unused.
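The Cause-register method can be sketched in Python (an illustration, not from the notes; the handler address 0x80000180 and the ExcCode values 10 for an undefined/reserved instruction and 12 for overflow follow the usual MIPS convention, and the `cpu` dict is a made-up stand-in for processor state):

```python
# Hedged sketch: on an exception, save the offending PC in EPC, record the
# reason in Cause, and transfer to a single OS entry point.

HANDLER = 0x80000180                      # conventional MIPS exception vector
CAUSE_CODES = {"undefined_instruction": 10, "overflow": 12}

def raise_exception(pc, reason, cpu):
    cpu["EPC"] = pc                       # address of the affected instruction
    cpu["Cause"] = CAUSE_CODES[reason]    # OS reads this to pick an action
    cpu["PC"] = HANDLER                   # jump to the OS exception handler

cpu = {"PC": 0x400068, "EPC": 0, "Cause": 0}
raise_exception(cpu["PC"], "overflow", cpu)
print(hex(cpu["EPC"]), cpu["Cause"], hex(cpu["PC"]))
```

Under the vectored-interrupt alternative, the `HANDLER` address would instead be computed from the cause itself, giving each exception class its own entry point and making the Cause register unnecessary for dispatch.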
