16.482 / 16.561
Computer Architecture and Design
Instructor: Dr. Michael Geiger
Fall 2013
Lecture 5: Pipelining
Lecture outline
Announcements/reminders
HW 4 to be posted; due 10/16
Lecture next week on Wednesday, 10/16
Review: Processor datapath and control
Today’s lecture: Pipelining
02/23/18 Computer Architecture Lecture 5 2
Review: Simple MIPS datapath
[Figure: single-cycle MIPS datapath; multiplexer annotations:]
Chooses PC+4 or branch target
Chooses ALU output or memory output
Chooses register or sign-extended immediate
Datapath for R-type instructions
EXAMPLE:
add $4, $10, $30
($4 = $10 + $30)
Datapath for I-type ALU instructions
EXAMPLE:
addi $4, $10, 15
($4 = $10 + 15)
Datapath for beq (not taken)
EXAMPLE:
beq $1,$2,label
(branch to label if $1 == $2)
Datapath for beq (taken)
EXAMPLE:
beq $1,$2,label
(branch to label if $1 == $2)
Datapath for lw instruction
EXAMPLE:
lw $2, 10($3)
($2 = mem[$3 + 10])
Datapath for sw instruction
EXAMPLE:
sw $2, 10($3)
(mem[$3 + 10] = $2)
Motivating pipelining
We’ve seen basic single-cycle datapath
Offers 1 CPI ...
... but cycle time determined by longest instruction
Load essentially uses all stages
We’d like both low CPI and a short cycle
Solution: pipelining
Simultaneously execute multiple instructions
Use multi-cycle “assembly line” approach
Pipelining is like …
… doing laundry (no, really)
Say 4 people (Ann, Brian, Cathy, Don) want
to use a laundry service that has four
components:
Washer, which takes 30 minutes
Dryer, which takes 30 minutes
“Folder,” which takes 30 minutes
“Storer,” which takes 30 minutes
Sequential laundry service
Each person starts when previous one finishes
4 loads take 8 hours
Pipelined laundry service
As soon as a particular component is free, next
person can use it
4 loads take 3 ½ hours
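The two laundry totals can be checked with a short sketch. It assumes an idealized lock-step model in which a new load enters the pipeline once per "slot", where a slot is as long as the slowest component; the function names are illustrative only.

```python
def sequential_minutes(loads, stage_minutes):
    """Each load runs through every component before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(loads, stage_minutes):
    """Ideal pipeline: loads advance in lock step, one slot per component.

    Assumption: the slot length equals the slowest component's time.
    """
    slot = max(stage_minutes)
    return (len(stage_minutes) + loads - 1) * slot


stages = [30, 30, 30, 30]  # washer, dryer, "folder", "storer"
print(sequential_minutes(4, stages) / 60)  # 8.0 hours
print(pipelined_minutes(4, stages) / 60)   # 3.5 hours
```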
Pipelining questions
Does pipelining improve latency or throughput?
Throughput—time for each instruction same, but more
instructions per unit time
What’s the maximum potential speedup of
pipelining?
The number of stages, N—before, each instruction took N
cycles, now we can (theoretically) finish 1 instruction per
cycle
If one stage can run faster, how does that affect the
speedup?
No effect—cycle time depends on longest stage, because
you may be using hardware from all stages at once
Principles of pipelining
Every instruction takes same number of steps
Pipeline stages
1 stage per cycle (like multi-cycle datapath)
MIPS (like most simple processors) has 5 stages
IF: Instruction fetch
ID: Instruction decode and register read
EX: Execution / address calculation
MEM: Memory access
WB: Write back result to register
Pipeline Performance
Assume time for stages is
100ps for register read or write
200ps for other stages
Compare pipelined datapath with single-cycle
datapath
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps
Pipeline Performance
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
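Both clock periods follow from the stage times in the table. The sketch below assumes the single-cycle clock must fit the slowest instruction and the pipelined clock must fit the slowest stage; the dictionary names are made up for illustration.

```python
# Stage latencies in picoseconds, taken from the table above.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses.
USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}

totals = {name: sum(STAGE_PS[s] for s in stages) for name, stages in USED.items()}

single_cycle_tc = max(totals.values())   # slowest instruction sets the clock
pipelined_tc = max(STAGE_PS.values())    # slowest stage sets the clock

print(totals)                            # lw 800, sw 700, R-format 600, beq 500
print(single_cycle_tc, pipelined_tc)     # 800 200
print(single_cycle_tc / pipelined_tc)    # 4.0
```

The steady-state speedup here is 4 rather than 5 (the number of stages) because the stages are not balanced: the 100 ps register stages still occupy a full 200 ps cycle.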
Pipeline diagram
Cycle   1    2    3    4    5    6    7    8
lw      IF   ID   EX   MEM  WB
add          IF   ID   EX   MEM  WB
beq               IF   ID   EX   MEM  WB
sw                     IF   ID   EX   MEM  WB
Pipeline diagram shows execution of multiple instructions
Instructions listed vertically
Cycles shown horizontally
Each instruction divided into stages
Can see what instructions are in a particular stage at any cycle
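A diagram like this can be generated mechanically. The sketch below assumes an ideal pipeline in which instruction i (counting from 0) enters IF in cycle i + 1; the helper names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_of(index, cycle):
    """Stage held by the index-th instruction (0-based) in a 1-based cycle.

    Returns None when the instruction is not in the pipeline that cycle.
    """
    k = cycle - 1 - index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def pipeline_diagram(names):
    """Text diagram: instructions listed vertically, cycles horizontally."""
    n_cycles = len(STAGES) + len(names) - 1
    rows = ["Cycle   " + "".join(f"{c:<5}" for c in range(1, n_cycles + 1))]
    for i, name in enumerate(names):
        cells = [stage_of(i, c) or "" for c in range(1, n_cycles + 1)]
        rows.append(f"{name:<8}" + "".join(f"{cell:<5}" for cell in cells))
    return "\n".join(row.rstrip() for row in rows)


def occupants(names, cycle):
    """Map each busy stage to the instruction occupying it in one cycle."""
    return {stage_of(i, cycle): name
            for i, name in enumerate(names)
            if stage_of(i, cycle) is not None}


program = ["lw", "add", "beq", "sw"]
print(pipeline_diagram(program))
print(occupants(program, 4))
```

Reading down the cycle 4 column gives lw in MEM, add in EX, beq in ID, and sw in IF, which is what `occupants(program, 4)` reports.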
Performance example
Say we have the following code:
loop: add  $t1, $t2, $t3
      lw   $t4, 0($t1)
      beq  $t4, $t3, end
      sw   $t3, 4($t1)
      add  $t2, $t2, 8
      j    loop
end:  ...
Assume each pipeline stage takes 4 ns
How long would one loop iteration take in an ideal pipelined
processor (i.e., no delays between instructions)?
Solution
Cycle   1    2    3    4    5    6    7    8    9    10
add     IF   ID   EX   MEM  WB
lw           IF   ID   EX   MEM  WB
beq               IF   ID   EX   MEM  WB
sw                     IF   ID   EX   MEM  WB
add                         IF   ID   EX   MEM  WB
j                                IF   ID   EX   MEM  WB
Can draw pipelining diagram to show # cycles
In ideal pipelining, with M instructions & N pipeline stages,
total cycles = N + (M-1)
Here, M = 6, N = 5, so 5 + (6-1) = 10 cycles
Total time = (10 cycles) * (4 ns/cycle) = 40 ns
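The same arithmetic as a small sketch (function names are illustrative). The last line shows why the ideal rate approaches one instruction per cycle as the instruction count grows.

```python
def ideal_cycles(m_instructions, n_stages):
    """Cycles for M instructions on an ideal N-stage pipeline (no stalls)."""
    return n_stages + (m_instructions - 1)


def ideal_time_ns(m_instructions, n_stages, cycle_ns):
    """Total time when every stage takes cycle_ns nanoseconds."""
    return ideal_cycles(m_instructions, n_stages) * cycle_ns


print(ideal_cycles(6, 5))            # 10
print(ideal_time_ns(6, 5, 4))        # 40
print(ideal_cycles(1000, 5) / 1000)  # 1.004 cycles per instruction
```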
Pipelined datapath principles
[Figure: pipelined datapath, with the MEM and WB stages labeled]
Right-to-left flow leads to hazards
Pipeline registers
Need registers between stages for info from previous cycles
Register must be able to hold all needed info for given stage
For example, IF/ID must be 64 bits—32 bits for instruction, 32 bits for PC+4
May need to propagate info through multiple stages for later use
For example, destination reg. number determined in ID, but not used until WB
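As a rough illustration of the 64-bit IF/ID example, the sketch below packs an instruction word and PC+4 into one value. The field order (PC+4 in the upper half) and the function names are assumptions made for this example, not something the lecture specifies.

```python
MASK32 = 0xFFFFFFFF


def pack_if_id(instruction, pc_plus_4):
    """Pack two 32-bit fields into one 64-bit IF/ID value.

    Assumed layout: PC+4 in bits 63..32, instruction in bits 31..0.
    """
    return ((pc_plus_4 & MASK32) << 32) | (instruction & MASK32)


def unpack_if_id(value):
    """Return (instruction, pc_plus_4) from a packed IF/ID value."""
    return value & MASK32, (value >> 32) & MASK32


# 0x015E2020 is the encoding of add $4, $10, $30; the PC value is arbitrary.
if_id = pack_if_id(0x015E2020, 0x00400004)
print(if_id.bit_length() <= 64)                # True
print([hex(x) for x in unpack_if_id(if_id)])
```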
Pipeline hazards
A hazard is a situation that prevents an
instruction from executing during its
designated clock cycle
3 types:
Structure hazards: two instructions attempt to
simultaneously use the same hardware
Data hazards: instruction attempts to use data
before it’s ready
Control hazards: attempt to make a decision
before condition is evaluated
Structure hazards
Examples in MIPS pipeline
May need to calculate addresses and perform operations: need multiple adders + ALU
May need to access memory for both instructions and data: need separate instruction & data memories (caches)
May need to read and write register file in the same cycle: write in first half of cycle, read in second
Cycle   1    2    3    4    5    6    7    8
lw      IF   ID   EX   MEM  WB
add          IF   ID   EX   MEM  WB
beq               IF   ID   EX   MEM  WB
sw                     IF   ID   EX   MEM  WB
Data Hazard Example
Consider this sequence:
sub $2, $1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,100($2)
Can’t use value of $2 until it’s actually
computed and stored
No hazard for sw
Register hardware takes care of add
What about and, or?
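One way to reason about the question is to look at how far each reader of $2 sits behind sub. The sketch below assumes the 5-stage pipeline without forwarding, with registers read in ID and written in WB (write in first half of the cycle, read in second). It ignores the case where a later instruction overwrites the register; the tuple format and function name are hypothetical.

```python
def classify_readers(program, producer=0):
    """Classify later readers of the producer's destination register.

    program: list of (name, dest, sources) tuples in program order.
    A reader d instructions behind the producer is in ID during the
    producer's WB cycle exactly when d == 3.
    """
    dest = program[producer][1]
    verdicts = {}
    for i in range(producer + 1, len(program)):
        name, _, sources = program[i]
        if dest not in sources:
            continue
        distance = i - producer
        if distance < 3:
            verdicts[name] = "hazard"
        elif distance == 3:
            verdicts[name] = "ok (register file: write first half, read second)"
        else:
            verdicts[name] = "ok"
    return verdicts


sequence = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),
]
for name, verdict in classify_readers(sequence).items():
    print(name, "->", verdict)
```

Under these assumptions, and and or read $2 before sub has written it, add is covered by the split-cycle register file, and sw reads after the write.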
Software solution: no-ops
No-ops: instructions that do nothing
Effectively “stalls” pipeline until data is ready
Compiler can recognize hazards ahead of
time and insert nop instructions
Cycle   1    2    3    4    5    6    7    8
sub     IF   ID   EX   MEM  WB
nop          IF   ID   EX   MEM  WB
nop               IF   ID   EX   MEM  WB
and                    IF   ID   EX   MEM  WB
(Result written to reg file in cycle 5, during WB of sub)
No-op example
Given the following code, where are no-ops
needed?
add $t2, $t3, $t4
sub $t5, $t1, $t2
or $t6, $t2, $t7
slt $t8, $t9, $t5
Solution
Given the following code, where are no-ops
needed?
add $t2, $t3, $t4    # $t2 used by sub, or
nop
nop
sub $t5, $t1, $t2    # $t5 used by slt
or  $t6, $t2, $t7
nop                  # could also be before or
slt $t8, $t9, $t5
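The placement of the no-ops can be reproduced with a small greedy pass. This is a sketch under the same assumptions as the slides (no forwarding, register written in the first half of WB and read in the second half of ID), so a reader must be issued at least 3 slots after the writer. The tuple format and function name are hypothetical, and the $zero register is not treated specially.

```python
def insert_nops(program):
    """Insert nops so no instruction reads a register too early.

    program: list of (text, dest, sources) tuples in program order.
    Returns the list of instruction texts with "nop" entries added.
    """
    out = []
    ready = {}  # register -> earliest issue slot at which it may be read
    slot = 0
    for text, dest, sources in program:
        earliest = max((ready.get(r, 0) for r in sources), default=0)
        while slot < earliest:
            out.append("nop")
            slot += 1
        out.append(text)
        if dest is not None:
            ready[dest] = slot + 3  # reader's ID lines up with writer's WB
        slot += 1
    return out


code = [
    ("add $t2, $t3, $t4", "$t2", ["$t3", "$t4"]),
    ("sub $t5, $t1, $t2", "$t5", ["$t1", "$t2"]),
    ("or $t6, $t2, $t7", "$t6", ["$t2", "$t7"]),
    ("slt $t8, $t9, $t5", "$t8", ["$t9", "$t5"]),
]
print("\n".join(insert_nops(code)))
```

This greedy version always places a nop directly before the instruction that needs it, so it puts the third nop before slt; as noted above, placing it before or would work equally well.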
Avoiding stalls
Inserting no-ops rarely best solution
Complicates compiler
Reduces performance
Can we solve problem in hardware? (Hint: when do we know value of $2?)
Dependencies & Forwarding
Value computed at end of EX stage
Use pipeline registers to forward
Add additional paths to ALU inputs from EX/MEM, MEM/WB
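The selection logic for one ALU input can be sketched as follows. It follows the commonly taught forwarding conditions for the 5-stage MIPS pipeline (prefer the newer value in EX/MEM, never forward register 0); the dictionary field names are assumptions for this example, not signal names from the lecture.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Pick the source for one ALU input of the instruction in EX.

    ex_mem / mem_wb: dicts with "reg_write" (bool) and "rd" (dest reg number)
    describing the instructions currently one and two stages ahead.
    """
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"    # newest value: result of the previous instruction
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"    # value from two instructions back
    return "REGFILE"       # no hazard: use the value read in ID


# sub $2,$1,$3 ; and $12,$2,$5 ; or $13,$6,$2
# While "and" is in EX, sub's result sits in EX/MEM.
print(forward_select(2, {"reg_write": True, "rd": 2},
                     {"reg_write": False, "rd": 0}))   # EX/MEM
# While "or" is in EX, "and" is in EX/MEM and sub is in MEM/WB.
print(forward_select(2, {"reg_write": True, "rd": 12},
                     {"reg_write": True, "rd": 2}))    # MEM/WB
```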
Load-Use Data Hazard
Need to stall for one cycle
Chapter 4 — The
Processor — 31
How to Stall the Pipeline
Force control values in ID/EX register to 0
EX, MEM and WB do nop (no-operation)
Prevent update of PC and IF/ID register
Using instruction is decoded again
Following instruction is fetched again
1-cycle stall allows MEM to read data for lw
Can subsequently forward to EX stage
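The stall decision described above can be sketched as a tiny hazard-detection function. It uses the usual textbook load-use condition (the instruction in EX is a load whose destination rt matches a source register of the instruction in ID); the dictionary fields and output names are hypothetical.

```python
def load_use_hazard(id_ex, if_id):
    """True when the load in EX writes a register the instruction in ID reads."""
    return bool(id_ex["mem_read"]) and id_ex["rt"] in (if_id["rs"], if_id["rt"])


def hazard_controls(id_ex, if_id):
    """Control outputs for one cycle, following the three stall actions."""
    stall = load_use_hazard(id_ex, if_id)
    return {
        "pc_write": not stall,         # hold PC: following instruction refetched
        "if_id_write": not stall,      # hold IF/ID: using instruction re-decoded
        "zero_id_ex_control": stall,   # bubble: EX, MEM and WB do a nop
    }


# lw $2, 20($1) is in EX while and $4, $2, $5 is in ID.
lw_in_ex = {"mem_read": True, "rt": 2}
and_in_id = {"rs": 2, "rt": 5}
print(hazard_controls(lw_in_ex, and_in_id))
```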
Stall/Bubble in the Pipeline
Stall inserted here
Stall/Bubble in the Pipeline
Or, more accurately…
Datapath with Hazard Detection
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in
the next instruction
C code for A = B + E; C = B + F;
Original order                    Reordered
       lw  $t1, 0($t0)            lw  $t1, 0($t0)
       lw  $t2, 4($t0)            lw  $t2, 4($t0)
stall  add $t3, $t1, $t2          lw  $t4, 8($t0)
       sw  $t3, 12($t0)           add $t3, $t1, $t2
       lw  $t4, 8($t0)            sw  $t3, 12($t0)
stall  add $t5, $t1, $t4          add $t5, $t1, $t4
       sw  $t5, 16($t0)           sw  $t5, 16($t0)
       13 cycles                  11 cycles
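The two cycle counts can be reproduced with a short sketch. It assumes a 5-stage pipeline with full forwarding, so the only stall is one cycle whenever an instruction uses the result of the load immediately before it. The tuple format and function name are made up for illustration.

```python
def cycles_with_forwarding(program, n_stages=5):
    """Return (total_cycles, stalls) for a straight-line program.

    program: list of (op, dest, sources) tuples in program order.
    Counts one stall per load-use pair of adjacent instructions.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in cur[2]:
            stalls += 1
    return n_stages + (len(program) - 1) + stalls, stalls


original = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]
# Same instructions with the third load hoisted above the first add.
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(cycles_with_forwarding(original))   # (13, 2)
print(cycles_with_forwarding(reordered))  # (11, 0)
```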
Control Hazards
Branch determines flow of control
Fetching next instruction depends on branch
outcome
Pipeline can’t always fetch correct instruction
Still working on ID stage of branch
In MIPS pipeline
Need to compare registers and compute target
early in the pipeline
Add hardware to do it in ID stage
Stall on Branch
Wait until branch outcome determined before
fetching next instruction
Branch Prediction
Longer pipelines can’t readily determine
branch outcome early
Stall penalty becomes unacceptable
Predict outcome of branch
Only stall if prediction is wrong
In MIPS pipeline
Can predict branches not taken
Fetch instruction after branch, with no delay
MIPS with Predict Not Taken
Prediction correct
Prediction incorrect
More-Realistic Branch Prediction
Static branch prediction
Based on typical branch behavior
Example: loop and if-statement branches
Predict backward branches taken
Predict forward branches not taken
Dynamic branch prediction
Hardware measures actual branch behavior
e.g., record recent history of each branch
Assume future behavior will continue the trend
When wrong, stall while re-fetching, and update history
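Both ideas can be sketched in a few lines. The static rule is the backward-taken / forward-not-taken heuristic listed above. For the dynamic case the lecture only says "record recent history of each branch"; the 2-bit saturating counter used here is one common scheme chosen for illustration, and all names and the initial state are assumptions.

```python
def static_predict(branch_pc, target_pc):
    """Static rule: predict backward branches taken, forward not taken."""
    return target_pc < branch_pc


class TwoBitPredictor:
    """Per-branch 2-bit saturating counter (states 0..3; >= 2 predicts taken)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state

    def predict(self, pc):
        # Assumed initial state: 1 (weakly not taken).
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        state = self.counters.get(pc, 1)
        self.counters[pc] = min(3, state + 1) if taken else max(0, state - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run the predictor over a sequence of actual outcomes for one branch."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A loop branch: taken 9 times, then falls through; the loop runs twice.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(TwoBitPredictor(), 0x400, loop_branch * 2))  # 3
```

With this counter, a single loop exit does not flip the prediction, so the second run of the loop mispredicts only at its exit.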
Final notes
Next time:
Instruction scheduling issues
Dynamic branch prediction
Dynamic scheduling
Multiple issue
Midterm exam preview
Announcements/reminders
HW 4 to be posted; due 10/16
Lecture next week on Wednesday, 10/16