Chapter 4
Pipelining
Basic Computer Architecture
What is computer architecture?
Instruction set architecture: what to do
Computer organization: how to do it (datapath and control)
Chapter 4 — The Processor — 2
Instruction Set Architecture
Instruction set architecture for MIPS
Arithmetic-logical instructions
add R3,R2,R1
Data transfer instructions
lw R2, offset(R1)
sw R2, offset(R1)
Branch instructions
beq rs, rt, offset
Computer Organization
Sequential (single cycle) Execution
One instruction is fetched from instruction memory.
All the steps in instruction execution are completed.
Then the next instruction is fetched.
In other words, a new instruction cannot be fetched from memory until the previous instruction has completed its execution.
(Diagram: Instruction 1, Instruction 2, and Instruction 3 execute strictly one after another.)
Single Cycle
Why a single-cycle implementation is not used today:
Inefficient: a single clock cycle of the same length for every instruction
The longest path determines the clock cycle (which instruction? the slowest one, lw)
CPI is 1, but the clock cycle is too long, so overall performance is poor
We need another implementation technique that is more efficient and has higher throughput: pipelining
Pipelining
The next instruction is fetched from memory before the previous instruction has completed its execution.
In other words, instruction execution is overlapped.
Why is pipelining used?
To improve performance (higher throughput)
Pipelining
Original: each stage = 30 min; total execution time for 4 loads = 8 h
Improved (pipelined): each stage = 30 min; total execution time for 4 loads = 3.5 h
Speedup for four loads = 8 / 3.5 ≈ 2.3
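The laundry arithmetic above can be sketched in code. This is a minimal sketch, not from the slides; it assumes 4 loads, each passing through 4 stages of 30 minutes:

```python
# Sequential vs. pipelined completion time for the 4-load laundry example.
STAGE_MIN = 30   # minutes per stage (assumed 4 stages: wash, dry, fold, put away)
STAGES = 4
LOADS = 4

sequential = LOADS * STAGES * STAGE_MIN        # each load waits for the previous one
pipelined = (STAGES + LOADS - 1) * STAGE_MIN   # stages overlap after the first load

print(sequential / 60, pipelined / 60)   # hours: 8.0 3.5
print(round(sequential / pipelined, 1))  # speedup: 2.3
```

The `STAGES + LOADS - 1` term captures the pipeline fill: the first load takes all 4 stages, and each later load finishes one stage-time after the previous one.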
Pipelining vs Performance
What is performance ?
Latency or response time
How long it takes to do a single task
Throughput
Total work done (all tasks) per unit time
Does pipelining increase latency or throughput?
Only throughput
MIPS Pipeline
Five stages, one step per stage:
1. IF (Fetch): instruction fetch from memory
2. ID (Decode): instruction decode & register read
3. EX (Execute): execute operation or calculate address
4. MEM (Memory): access data memory
5. WB (Writeback): write result back to register
Focus on 8 instructions:
lw, sw, add, sub, AND, OR, slt, beq
Graphical Representation of MIPS 5-stage Pipeline
(Figure: in the pipeline diagram, the register file is shaded on its right half when read, in ID, and on its left half when written, in WB; add's MEM stage has a white background because add does not access memory.)
Pipeline Performance
Assume that:
Time for register read or write is 100 ps
Time for any other stage is 200 ps

Inst.                               Inst. fetch  Reg. read  ALU op  Mem. access  Reg. write  Tot. time
lw                                  200 ps       100 ps     200 ps  200 ps       100 ps      800 ps
sw                                  200 ps       100 ps     200 ps  200 ps      -            700 ps
R-format (add, sub, AND, OR, slt)   200 ps       100 ps     200 ps  -            100 ps      600 ps
Branch (beq)                        200 ps       100 ps     200 ps  -           -            500 ps
Pipeline Performance
Single-cycle (clock cycle time Tc = 800 ps):
Time between 1st and 4th instructions = 3 × 800 = 2400 ps
Pipelined (Tc = 200 ps):
Time between 1st and 4th instructions = 3 × 200 = 600 ps
The clock cycle must allow for the slowest instruction (see the previous table).
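The comparison above reduces to two multiplications; a sketch (not from the slides) makes the two cycle times explicit:

```python
# Time between the 1st and 4th instruction completions in each design.
n = 4
single_cycle_tc = 800   # ps; the cycle must fit the slowest instruction (lw)
pipelined_tc = 200      # ps; the cycle must fit the slowest stage

print((n - 1) * single_cycle_tc)   # 2400 ps
print((n - 1) * pipelined_tc)      # 600 ps
```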
Pipeline Performance
1 instruction: CC = 5
5 instructions: CC = 5 + (5 − 1) = 9
100 instructions: CC = 100 + (5 − 1) = 104
(On a 5-stage pipeline with no stalls, n instructions complete in n + 4 clock cycles.)
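The cycle counts above follow one formula; a sketch (not from the slides):

```python
# Clock cycles to complete n instructions on a 5-stage pipeline with no stalls.
def clock_cycles(n, stages=5):
    # The first instruction takes `stages` cycles; each later one finishes 1 cycle after.
    return n + (stages - 1)

print(clock_cycles(1), clock_cycles(5), clock_cycles(100))   # 5 9 104
```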
Example 1
Consider a nonpipelined machine with 8 execution stages, each of length 20 ns.
The time between two instructions:
20+20+20+20+20+20+20+20 = 160 ns
Suppose we introduce pipelining on this machine.
The time between two instructions = 20 ns
The speedup obtained from pipelining:
Speedup = 160 / 20 = 8
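Example 1 as code (a sketch, not from the slides): with equal stages, the pipelined clock is just one stage time, so the speedup equals the stage count.

```python
# Speedup from pipelining a machine with 8 equal 20 ns stages.
stages = [20] * 8   # ns

nonpipelined = sum(stages)   # time between instructions without pipelining
pipelined = max(stages)      # pipelined clock must fit the slowest stage

print(nonpipelined, pipelined, nonpipelined // pipelined)   # 160 20 8
```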
Example 2
Consider a nonpipelined machine with 10 execution stages of lengths 10, 20, 20, 30, 10, 10, 50, 45, 20, 10 ns.
The time between two instructions on this machine:
10+20+20+30+10+10+50+45+20+10 = 225 ns
Suppose we introduce pipelining on this machine.
The time between two instructions = 50 ns (the clock cycle must allow for the slowest stage).
Speedup = 225 / 50 = 4.5
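Example 2 as code (a sketch, not from the slides): unbalanced stages limit the speedup, because the clock must fit the slowest stage.

```python
# Speedup with unbalanced stages; the clock is set by the slowest stage.
stages = [10, 20, 20, 30, 10, 10, 50, 45, 20, 10]   # ns

nonpipelined = sum(stages)   # 225 ns
pipelined = max(stages)      # 50 ns, the slowest stage
print(nonpipelined / pipelined)   # 4.5 (well below the 10-stage ideal of 10)
```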
Pipeline Speedup
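The formula for this slide did not survive the conversion; what follows is a reconstruction of the standard textbook relation that the surrounding examples rely on, assuming balanced stages and no stalls:

```latex
\text{Time between instructions}_{\text{pipelined}}
  = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}
```

Under these ideal conditions the speedup equals the number of stages; unbalanced stages (Example 2) or stalls make the actual speedup smaller.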
Example 3
In a nonpipelined machine, the time between instructions is 200 ns. If we use pipelining with four balanced stages, what is the time between instructions after pipelining?
Time between instructions after pipelining = 200 / 4 = 50 ns
Notice: speedup = 200 / 50 = 4 = number of pipeline stages
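Example 3 as code (a sketch, not from the slides), showing that with perfectly balanced stages the ideal speedup equals the stage count:

```python
# Ideal pipelining with balanced stages divides instruction time by the stage count.
time_nonpipelined = 200   # ns
num_stages = 4

time_pipelined = time_nonpipelined / num_stages
print(time_pipelined, time_nonpipelined / time_pipelined)   # 50.0 4.0
```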
Pipelining and ISA Design
MIPS ISA designed for pipelining
All instructions are 32 bits
Easier to fetch and decode in one cycle
cf. x86: 1- to 17-byte instructions
Few and regular instruction formats
Can decode and read registers in one step
Load/store addressing
Can calculate address in 3rd stage, access memory
in 4th stage
Alignment of memory operands
Memory access takes only one cycle
Pipelining Hazards
There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called hazards.
In other words, any condition that causes a pipeline to stall is called a hazard.
There are three types of hazards:
Structural hazards:
A required resource is busy
Data hazards:
Need to wait for previous instruction to complete its data read/write
Control hazards:
Deciding on control action depends on previous instruction
Structural Hazards
Due to a conflict for use of a resource: a required resource is busy
e.g., using a combined washer-dryer
Assume a MIPS pipeline with a single memory:
Load/store requires data access
Instruction fetch would have to stall (wait) for that cycle
Would cause a pipeline "bubble"
Hence, pipelined datapaths require separate instruction/data memories
Data Hazards
Data hazards arise from the dependence of one instruction on an earlier one that is still in the pipeline; we need to stall (wait) for the previous instruction to complete its data read/write.

add $s0, $t0, $t1
sub $t2, $s0, $t3

The value of $s0 is written back by add only in its WB stage, but sub needs it when it reads registers; since the value is not yet available in that stage, sub must wait until it becomes available.
Data Hazards: Example
An instruction depends on completion of data access by a previous instruction:

add $s0, $t0, $t1
sub $t2, $s0, $t3

To resolve this hazard:
(1) Wait (stall) until the hazard is resolved, but this impacts CPI: the result is written only in the fifth (write-back) stage.
Data Hazards: (2) Forwarding (Bypassing)
Use result when it is computed
Don’t wait for it to be stored in a register
Requires extra connections in the datapath (Hardware)
Valid only if the destination stage is later in time than
the source stage
Can’t prevent all pipeline stalls
Load-Use Data Hazard
Can’t always avoid stalls by forwarding
If value not computed when needed
Can’t forward backward in time!
Load-Use Data Hazard
So we have to stall one cycle for a load-use data hazard.
(3) Code Scheduling to Avoid Stalls
Reorder code to avoid use of a load result in the next instruction (a software solution)
C code for A = B + E; C = B + F;

      lw  $t1, 0($t0)
      lw  $t2, 4($t0)
stall
      add $t3, $t1, $t2    (forwarding is adopted here)
      sw  $t3, 12($t0)
      lw  $t4, 8($t0)
stall
      add $t5, $t1, $t4
      sw  $t5, 16($t0)

13 cycles
(3) Code Scheduling to Avoid Stalls
Reorder code to avoid use of a load result in the next instruction (a software solution)
C code for A = B + E; C = B + F; (forwarding is adopted here)

Original (13 cycles):           Reordered (11 cycles):
      lw  $t1, 0($t0)               lw  $t1, 0($t0)
      lw  $t2, 4($t0)               lw  $t2, 4($t0)
stall                               lw  $t4, 8($t0)
      add $t3, $t1, $t2             add $t3, $t1, $t2
      sw  $t3, 12($t0)              sw  $t3, 12($t0)
      lw  $t4, 8($t0)               add $t5, $t1, $t4
stall                               sw  $t5, 16($t0)
      add $t5, $t1, $t4
      sw  $t5, 16($t0)
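The cycle counts can be checked with a small model. This is my own sketch, not from the slides; it assumes a 5-stage pipeline with forwarding, where the only stall is one bubble when a load is immediately followed by an instruction that uses its result:

```python
# Count cycles for the two schedules above (load-use hazards cost 1 bubble each).

def cycles(program):
    """program: list of (dest_reg_or_None, set_of_source_regs, is_load)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        dest, _, is_load = prev
        _, sources, _ = cur
        if is_load and dest in sources:   # load immediately followed by a use
            stalls += 1
    n = len(program)
    return n + 4 + stalls                 # n instructions + 4 fill cycles + stalls

original = [
    ("$t1", {"$t0"}, True),           # lw  $t1, 0($t0)
    ("$t2", {"$t0"}, True),           # lw  $t2, 4($t0)
    ("$t3", {"$t1", "$t2"}, False),   # add $t3, $t1, $t2  (uses $t2 right after its load)
    (None,  {"$t3", "$t0"}, False),   # sw  $t3, 12($t0)
    ("$t4", {"$t0"}, True),           # lw  $t4, 8($t0)
    ("$t5", {"$t1", "$t4"}, False),   # add $t5, $t1, $t4  (uses $t4 right after its load)
    (None,  {"$t5", "$t0"}, False),   # sw  $t5, 16($t0)
]

reordered = [
    ("$t1", {"$t0"}, True),
    ("$t2", {"$t0"}, True),
    ("$t4", {"$t0"}, True),           # lw $t4 moved up, separating both load-use pairs
    ("$t3", {"$t1", "$t2"}, False),
    (None,  {"$t3", "$t0"}, False),
    ("$t5", {"$t1", "$t4"}, False),
    (None,  {"$t5", "$t0"}, False),
]

print(cycles(original), cycles(reordered))   # 13 11
```

Moving the third lw up removes both stalls, matching the slide's 13-cycle vs. 11-cycle counts.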
Control Hazards
Also called branch hazards, because control hazards are due to branch instructions.
A branch determines the flow of control:
Fetching the next instruction depends on the branch outcome
The pipeline can't always fetch the correct instruction
A control hazard occurs when the proper instruction was not fetched.
What is the solution?
Control Hazards: (1) Stall on Branch
Wait until the branch outcome is determined before fetching the next instruction.
This means we have to wait until stage 4.
Advantage: simple for both software and hardware.
Stall 3 cycles
Control Hazards: (2) Adding Extra Hardware
Let's assume we put enough extra hardware into the second pipeline stage (ID) so that we can:
test registers (comparator)
calculate the branch address (adder)
update the PC
Even with this extra hardware, we have to wait until stage 2.
Stall 1 cycle
Control Hazards: (3) Branch Prediction
Longer pipelines can’t determine branch
outcome early
Stall penalty becomes unacceptable
Solution: Predict outcome of branch
Only stall if prediction is wrong
In the MIPS pipeline:
Can predict branches as not taken
Fetch the instruction after the branch, with no delay
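The cost of predict-not-taken can be sketched as an average-CPI calculation. The numbers below are my own assumptions, not from the slides; only the structure of the formula matters:

```python
# Effect of predict-not-taken on average CPI (all numbers are assumed).
base_cpi = 1.0
branch_fraction = 0.2   # assumed fraction of instructions that are branches
taken_fraction = 0.6    # assumed fraction of branches that are taken (mispredicted)
penalty = 1             # assumed cycles lost per misprediction (branch resolved in ID)

cpi = base_cpi + branch_fraction * taken_fraction * penalty
print(cpi)   # 1.12
```

A stall is paid only for mispredicted (taken) branches, so a longer pipeline or a higher taken fraction raises the average CPI.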
MIPS with Predict Not Taken
Prediction correct, i.e., the branch is not taken
Prediction incorrect, i.e., the branch is taken
Check Yourself (HW)
[Link]
[Link]
Pipeline Summary
Pipelining improves performance by
increasing instruction throughput
Executes multiple instructions in parallel
Each instruction has the same latency
Pipeline hazards
Structural, data, control
Instruction set design affects complexity of
pipeline implementation