Lec03-Pipelining 2021

ADVANCED SYSTEM ARCHITECTURES
Pipelining

BK TP.HCM, Trần Ngọc Thịnh
http://www.cse.hcmut.edu.vn/~tnthinh
©2021, dce

What is pipelining?

• Implementation technique in which multiple instructions are overlapped in execution
• Real-life pipelining examples?
– Laundry
– Factory production lines
– Traffic??

Instruction Pipelining (1/2)

• Instruction pipelining is a CPU implementation technique where multiple operations on a number of instructions are overlapped.
• An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction. Each step is called a pipeline stage or a pipeline segment.
• The stages are connected in a linear fashion, one stage to the next, to form the pipeline: instructions enter at one end, progress through the stages, and exit at the other end.
• The time to move an instruction one step down the pipeline is equal to the machine cycle and is determined by the stage with the longest processing delay.
Instruction Pipelining (2/2)

• Pipelining increases the CPU instruction throughput: the number of instructions completed per cycle.
– Under ideal conditions (no stall cycles), instruction throughput is one instruction per machine cycle, or ideal CPI = 1.
• Pipelining does not reduce the execution time of an individual instruction: the time needed to complete all processing steps of an instruction (also called instruction completion latency).
– Minimum instruction latency = n cycles, where n is the number of pipeline stages.

Pipelining Example: Laundry

• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

Sequential Laundry

[Figure: task order vs. time chart, 6 PM to midnight. Loads A, B, C, D run one after another; each load occupies the washer (30 min), dryer (40 min), and folder (20 min) in sequence.]

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP

[Figure: the same chart, 6 PM to 9:30 PM. Load B starts washing as soon as load A moves to the dryer, so the stage pattern along the time axis is 30, 40, 40, 40, 40, 20 minutes.]

• Pipelined laundry takes 3.5 hours for 4 loads
• Speedup = 6/3.5 = 1.7
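The laundry arithmetic above can be checked with a short sketch (the function names are illustrative, not from the slides):

```python
# Completion times for the laundry example: washer 30 min,
# dryer 40 min, folder 20 min, 4 loads.
def sequential_time(loads, stage_times):
    # each load runs all stages to completion before the next starts
    return loads * sum(stage_times)

def pipelined_time(loads, stage_times):
    # fill the pipeline once, then the slowest stage (the dryer)
    # paces every following load
    return sum(stage_times) + (loads - 1) * max(stage_times)

stages = [30, 40, 20]
print(sequential_time(4, stages))  # 360 min = 6 hours
print(pipelined_time(4, stages))   # 210 min = 3.5 hours
```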
Pipelining Lessons

[Figure: pipelined laundry chart, 6 PM to about 9:30 PM; stage pattern 30, 40, 40, 40, 40, 20 minutes.]

• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate limited by the slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup

Pipelining Example: Laundry

• Pipelined Laundry Observations:
– At some point, all stages of washing will be operating concurrently
– Pipelining doesn’t reduce the number of stages
• doesn’t help latency of a single task
• helps throughput of the entire workload
– As long as we have separate resources, we can pipeline the tasks
– Multiple tasks operating simultaneously use different resources

Pipelining Example: Laundry

• Pipelined Laundry Observations:
– Speedup due to pipelining depends on the number of stages in the pipeline
– Pipeline rate limited by the slowest pipeline stage
• If the dryer needs 45 min, the time for all stages has to be 45 min to accommodate it
• Unbalanced lengths of pipe stages reduce speedup
– Time to “fill” the pipeline and time to “drain” it reduce speedup
– If one load depends on another, we will have to wait (Delay/Stall for Dependencies)

CPU Pipelining

• 5 stages of a MIPS instruction:
– Fetch instruction from instruction memory
– Read registers while decoding the instruction
– Execute the operation or calculate an address, depending on the instruction type
– Access an operand from data memory
– Write the result into a register
• We can reduce the cycles to fit the stages.

Load: Cycle 1 Ifetch | Cycle 2 Reg/Dec | Cycle 3 Exec | Cycle 4 Mem | Cycle 5 Wr
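The stage-per-cycle behavior of this 5-stage pipeline can be visualized with a small sketch (the helper name and `--` padding are illustrative; stage names follow the review slides later in the deck):

```python
# Print cycle-by-cycle stage occupancy in an ideal 5-stage pipeline:
# each instruction enters IF one cycle after the previous one and
# advances one stage per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_rows(n_instructions):
    # row i is padded by i idle cycles, then runs all five stages
    return [["--"] * i + STAGES for i in range(n_instructions)]

for row in pipeline_rows(3):
    print(" ".join(f"{s:>3}" for s in row))
```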
CPU Pipelining

• Example: resources for the Load instruction
– Fetch instruction from instruction memory (Ifetch): instruction memory (IM)
– Read registers while decoding instruction (Reg/Dec): register file & decoder (Reg)
– Execute operation or calculate address, depending on the instruction type (Exec): ALU
– Access an operand from data memory (Mem): data memory (DM)
– Write result into a register (Wr): register file (Reg)

CPU Pipelining

• Note that accessing the source & destination registers is performed in two different parts of the cycle
• We need to decide upon which part of the cycle reading from and writing to the register file should take place.

[Figure: Inst 0 through Inst 4 overlapped in time, each passing through Im, Reg, ALU, Dm, Reg; register writing happens in one part of the cycle and reading in the other. “Fill time” marks the start-up cycles and “sink time” the drain cycles at the end.]

CPU Pipelining: Example

• Single-cycle, non-pipelined execution
• Total time for 3 instructions: 24 ns

[Figure: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) executed back to back; each takes 8 ns through Instruction fetch, Reg, ALU, Data access, Reg.]

CPU Pipelining: Example

• Single-cycle, pipelined execution
– Improve performance by increasing instruction throughput
– Total time for 3 instructions = 14 ns
– Each instruction adds 2 ns to the total execution time
– Stage time limited by the slowest resource (2 ns)
– Assumptions:
• Write to register occurs in the 1st half of the clock
• Read from register occurs in the 2nd half of the clock

[Figure: the same three lw instructions overlapped, a new one starting every 2 ns; each stage takes 2 ns.]
CPU Pipelining: Example

• Assumptions:
– Only consider the following instructions: lw, sw, add, sub, and, or, slt, beq
– Operation times for instruction classes are:
• Memory access: 2 ns
• ALU operation: 2 ns
• Register file read or write: 1 ns
– Use a single-cycle (not multi-cycle) model
– The pipelined clock cycle must accommodate the slowest stage (2 ns)
– Both pipelined & non-pipelined approaches use the same HW components

Instruction class       | InstrFetch | RegRead | ALUOp | DataAccess | RegWrite | Total time
lw                      | 2 ns       | 1 ns    | 2 ns  | 2 ns       | 1 ns     | 8 ns
sw                      | 2 ns       | 1 ns    | 2 ns  | 2 ns       |          | 7 ns
add, sub, and, or, slt  | 2 ns       | 1 ns    | 2 ns  |            | 1 ns     | 6 ns
beq                     | 2 ns       | 1 ns    | 2 ns  |            |          | 5 ns

MIPS dataflow

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the PC/instruction memory, register file, ALU, data memory, and sign-extend (16 to 32 bits) units; a branch-taken comparator and PC muxes select the next PC.]

• Data must be carried from one stage to the next in pipeline registers/latches; they hold temporary values between clocks and the information needed for execution.

CPU Pipelining Example: (1/2)

• Theoretically:
– Speedup should be equal to the number of stages (n tasks, k stages, p latency)
– Speedup = n*p / (p/k*(n-1) + p) ≈ k (for large n)
• Practically:
– Stages are imperfectly balanced
– Pipelining needs overhead
– Speedup is less than the number of stages

CPU Pipelining Example: (2/2)

• If we have 3 consecutive instructions:
– Non-pipelined needs 8 x 3 = 24 ns
– Pipelined needs 14 ns
=> Speedup = 24 / 14 = 1.7
• If we have 1003 consecutive instructions:
– Add the time for 1000 more instructions (i.e. 1003 instructions) to the previous example
• Non-pipelined total time = 1000 x 8 + 24 = 8024 ns
• Pipelined total time = 1000 x 2 + 14 = 2014 ns
=> Speedup ≈ 3.98 ≈ (8 ns / 2 ns), near-perfect speedup
=> Performance (throughput) increases for a larger number of instructions
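These totals and speedups can be reproduced with a short sketch (the 2 ns cycle and 8 ns unpipelined instruction time come from the slides; the function name is illustrative):

```python
# Speedup of a 5-stage pipeline with a 2 ns cycle over an 8 ns
# non-pipelined instruction, for n back-to-back instructions.
def speedup(n, n_stages, cycle_ns, unpipelined_ns):
    sequential = n * unpipelined_ns
    # first instruction takes n_stages cycles, then one finishes per cycle
    pipelined = (n_stages + (n - 1)) * cycle_ns
    return sequential / pipelined

print(round(speedup(3, 5, 2, 8), 2))     # 24 / 14     ≈ 1.71
print(round(speedup(1003, 5, 2, 8), 2))  # 8024 / 2014 ≈ 3.98
```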
Pipelining MIPS Instruction Set

• MIPS was designed with pipelining in mind
=> Pipelining is easy in MIPS:
– All instructions are the same length
– Limited instruction format
– Memory operands appear only in lw & sw instructions
– Operands must be aligned in memory

1. All MIPS instructions are the same length
– Fetch instruction in the 1st pipeline stage
– Decode instructions in the 2nd stage
– If instruction length varies (e.g. 80x86), pipelining will be more challenging

Pipelining MIPS Instruction Set

2. MIPS has a limited instruction format
– Source register in the same place for each instruction (symmetric)
– The 2nd stage can begin reading at the same time as decoding
– If the instruction format weren’t symmetric, stage 2 would have to be split into 2 distinct stages
=> Total stages = 6 (instead of 5)

Pipelining MIPS Instruction Set

3. Memory operands appear only in lw & sw instructions
– We can use the execute stage to calculate the memory address
– Access memory in the next stage
– If we needed to operate on operands in memory (e.g. 80x86), stages 3 & 4 would expand to:
• Address calculation
• Memory access
• Execute

Pipelining MIPS Instruction Set

4. Operands must be aligned in memory
– Transfer of more than one data operand can be done in a single stage with no conflicts
– Need not worry about a single data transfer instruction requiring 2 data memory accesses
– Requested data can be transferred between the CPU & memory in a single pipeline stage
Instruction Pipelining Review

– MIPS In-Order Single-Issue Integer Pipeline
– Performance of Pipelines with Stalls
– Pipeline Hazards
• Structural hazards
• Data hazards
 Minimizing Data Hazard Stalls by Forwarding
 Data Hazard Classification
 Data Hazards Present in Current MIPS Pipeline
• Control hazards
 Reducing Branch Stall Cycles
 Static Compiler Branch Prediction
 Delayed Branch Slot
» Canceling Delayed Branch Slot

MIPS In-Order Single-Issue Integer Pipeline: Ideal Operation (no stall cycles)

Clock number:     1   2   3   4   5   6   7   8   9
Instruction I     IF  ID  EX  MEM WB
Instruction I+1       IF  ID  EX  MEM WB
Instruction I+2           IF  ID  EX  MEM WB
Instruction I+3               IF  ID  EX  MEM WB
Instruction I+4                   IF  ID  EX  MEM WB

• Fill cycles = number of stages - 1 = 4 cycles (n - 1): the time to fill the pipeline
• The first instruction, I, completes in cycle 5; the last instruction, I+4, completes in cycle 9
• MIPS pipeline stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execution, MEM = Memory Access, WB = Write Back
• n = 5 pipeline stages; ideal CPI = 1
• In-order = instructions executed in original program order
• Ideal pipeline operation without any stall cycles

5 Steps of MIPS Datapath

[Figure: datapath spanning the five stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between them; Next PC mux and adder, register file (RS1, RS2, RD), ALU with zero test, sign extend / immediate, data memory, and WB data mux.]

Per-stage register transfers:
IF:  IR <= mem[PC]; PC <= PC + 4
ID:  A <= Reg[IR.rs]; B <= Reg[IR.rt]
EX:  rslt <= A op(IR.op) B
MEM: WB <= rslt
WB:  Reg[IR.rd] <= WB

• Data stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipelining
(Figure A.2, Page A-8)

[Figure: time in clock cycles, Cycle 1 through Cycle 7; each instruction drawn as Ifetch, Reg, ALU, DMem, Reg boxes, with successive instructions offset by one cycle.]

• Write to the destination register in the first half of the WB cycle
• Read operand registers in the second half of the ID cycle
• Operation of an ideal integer in-order 5-stage pipeline
Pipelining Performance Example

• Example: for an unpipelined CPU:
– Clock cycle = 1 ns; ALU operations and branches take 4 cycles and memory operations take 5 cycles, with instruction frequencies of 40%, 20% and 40%, respectively.
– If pipelining adds 0.2 ns to the machine clock cycle, then the speedup in instruction execution from pipelining is:

Non-pipelined average instruction execution time = Clock cycle x Average CPI
= 1 ns x ((40% + 20%) x 4 + 40% x 5) = 1 ns x 4.4 = 4.4 ns

In the pipelined implementation, five stages are used with an average instruction execution time of 1 ns + 0.2 ns = 1.2 ns

Speedup from pipelining = Instruction time unpipelined / Instruction time pipelined
= 4.4 ns / 1.2 ns = 3.7 times faster

Pipeline Hazards

• Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during its designated clock cycle, possibly resulting in one or more stall (or wait) cycles.
• Hazards reduce the ideal speedup (increase CPI > 1) gained from pipelining and are classified into three classes:
– Structural hazards: arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
– Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
– Control hazards: arise from the pipelining of conditional branches and other instructions that change the PC.
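A quick sketch of the arithmetic in the example above (all values from the slides):

```python
# Unpipelined vs. pipelined average instruction time.
mix = [(0.40, 4),  # ALU ops: 4 cycles
       (0.20, 4),  # branches: 4 cycles
       (0.40, 5)]  # memory ops: 5 cycles
avg_cpi = sum(freq * cycles for freq, cycles in mix)  # 4.4
unpipelined_ns = 1.0 * avg_cpi                        # 4.4 ns
pipelined_ns = 1.0 + 0.2                              # 1.2 ns
print(round(unpipelined_ns / pipelined_ns, 1))        # 3.7
```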

How do we deal with hazards?

• Often, the pipeline must be stalled
• Stalling the pipeline usually lets some instruction(s) in the pipeline proceed while another/others wait for data, a resource, etc.
• A note on terminology:
– If we say an instruction was “issued later than instruction x”, we mean that it was issued after instruction x and is not as far along in the pipeline
– If we say an instruction was “issued earlier than instruction x”, we mean that it was issued before instruction x and is further along in the pipeline

Stalls and performance

• Stalls impede progress of a pipeline and result in deviation from 1 instruction executing per clock cycle
• Pipelining can be viewed to decrease CPI or clock cycle time for an instruction
– Let’s see what effect stalls have on CPI…
• CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction
= 1 + Pipeline stall cycles per instruction
• Ignoring overhead and assuming stages are balanced:

Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
Even more pipeline performance issues!

• This results in:

Clock cycle pipelined = Clock cycle unpipelined / Pipeline depth
Pipeline depth = Clock cycle unpipelined / Clock cycle pipelined

• Which leads to:

Speedup from pipelining = [1 / (1 + Pipeline stall cycles per instruction)] x (Clock cycle unpipelined / Clock cycle pipelined)
= [1 / (1 + Pipeline stall cycles per instruction)] x Pipeline depth

• If there are no stalls, the speedup equals the number of pipeline stages in the ideal case

Structural Hazards

• In pipelined machines, overlapped instruction execution requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
• If a resource conflict arises because a hardware resource is required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred. For example:
– when a pipelined machine has a shared single-memory pipeline stage for data and instructions
=> stall the pipeline for one cycle for the memory data access
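The speedup formula above, sketched directly (the function name is illustrative):

```python
# Speedup from pipelining with balanced stages and no overhead.
def speedup(pipeline_depth, stall_cycles_per_instr):
    return pipeline_depth / (1 + stall_cycles_per_instr)

print(speedup(5, 0.0))            # 5.0: ideal case, equals pipeline depth
print(round(speedup(5, 0.4), 2))  # 3.57 with 0.4 stall cycles/instruction
```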

An example of a structural hazard

[Figure: a Load followed by Instructions 1–4, each flowing through Mem (fetch), Reg, ALU, DM, Reg. In one cycle, the Load’s data-memory access and Instruction 3’s instruction fetch both need the single shared memory.]

What’s the problem here?

How is it resolved?

[Figure: the same sequence, but Instruction 3’s fetch is delayed one cycle; a bubble travels down every stage in its place.]

• The pipeline is generally stalled by inserting a “bubble” or NOP
Or alternatively…

Clock number:  1   2   3   4     5   6   7   8   9   10
LOAD           IF  ID  EX  MEM   WB
Inst. i+1          IF  ID  EX    MEM WB
Inst. i+2              IF  ID    EX  MEM WB
Inst. i+3                  stall IF  ID  EX  MEM WB
Inst. i+4                            IF  ID  EX  MEM WB
Inst. i+5                                IF  ID  EX  MEM
Inst. i+6                                    IF  ID  EX

• The LOAD instruction “steals” an instruction fetch cycle, which causes the pipeline to stall.
• Thus, no instruction completes on clock cycle 8.

A Structural Hazard Example

• Given that data references are 40% for a specific instruction mix or program, and that the ideal pipelined CPI ignoring hazards is equal to 1.
• A machine with a data-memory-access structural hazard requires a single stall cycle for data references and has a clock rate 1.05 times higher than the ideal machine. Ignoring other performance losses for this machine:

Average instruction time = CPI x Clock cycle time
= (1 + 0.4 x 1) x (Clock cycle time ideal / 1.05)
= 1.3 x Clock cycle time ideal

• Therefore the machine without the hazard is better.
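The arithmetic of this example, as a sketch (cycle time normalized to 1.0):

```python
# Average instruction time of the hazard machine, relative to an
# ideal machine with CPI = 1 and cycle time 1.0 (normalized).
cpi = 1 + 0.4 * 1             # 40% of instructions stall one cycle
cycle = 1.0 / 1.05            # hazard machine clocks 1.05x faster
print(round(cpi * cycle, 2))  # ~1.33, slower than the ideal machine's 1.0
```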

Remember the common case!

• All things being equal, a machine without structural hazards will always have a lower CPI.
• But, in some cases it may be better to allow them than to eliminate them.
• These are situations a computer architect might have to consider:
– Is pipelining functional units or duplicating them costly in terms of HW?
– Does the structural hazard occur often?
– What’s the common case???

Data Hazards

• Data hazards occur when the pipeline changes the order of read/write accesses to instruction operands in such a way that the resulting access order differs from the original sequential instruction operand access order of the unpipelined machine, resulting in incorrect execution.
• Data hazards may require one or more instructions to be stalled to ensure correct execution.
• Example:
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
– All the instructions after the ADD use the result of the ADD instruction
– The SUB and AND instructions need to be stalled for correct execution.
Data Hazard on R1

[Figure: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in the pipeline (stages IF, ID/RF, EX, MEM, WB). The sub and and read r1 in ID before the add has written it back in WB.]

Minimizing Data Hazard Stalls by Forwarding

• Forwarding is a hardware-based technique (also called register bypassing or short-circuiting) used to eliminate or minimize data hazard stalls.
• Using forwarding hardware, the result of an instruction is copied directly from where it is produced (ALU, memory read port, etc.) to where subsequent instructions need it (ALU input register, memory write port, etc.)
• For example, in the MIPS integer pipeline with forwarding:
– The ALU result from the EX/MEM register may be forwarded, or fed back, to the ALU input latches as needed, instead of the register operand value read in the ID stage.
– Similarly, the data memory unit result from the MEM/WB register may be fed back to the ALU input latches as needed.
– If the forwarding hardware detects that a previous ALU operation is to write the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

HW Change for Forwarding

[Figure: datapath with multiplexers added at both ALU inputs; the muxes select among the ID/EX register operands, the immediate, and results fed back from the EX/MEM and MEM/WR pipeline registers.]

What circuit detects and resolves this hazard?

Forwarding to Avoid Data Hazard

[Figure: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. The add’s ALU result is forwarded directly to the ALU inputs of the dependent instructions, so no stalls are needed.]
Forwarding

• Fix data hazards by forwarding results as soon as they are available to where they are needed

[Figure: add $1,$2,$3; sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5. The add’s result is forwarded from the ALU output to each dependent instruction’s ALU input.]

Forwarding to Avoid LW-SW Data Hazard

[Figure: add r1,r2,r3; lw r4, 0(r1); sw r4,12(r1); or r8,r6,r9; xor r10,r9,r11. The lw’s memory result is forwarded to the sw’s data input; no stall is needed.]

Data Hazard Classification

Given two instructions I and J, with I occurring before J in an instruction stream:

• RAW (read after write): a true data dependence
J tries to read a source before I writes to it, so J incorrectly gets the old value.
• WAW (write after write): a name dependence
J tries to write an operand before it is written by I.
The writes end up being performed in the wrong order.
• WAR (write after read): a name dependence
J tries to write to a destination before it is read by I, so I incorrectly gets the new value.
• RAR (read after read): not a hazard.

Data Hazard Classification

[Figure: four program-order diagrams of instructions I and J sharing an operand: I writes / J reads (Read after Write, RAW); I reads / J writes (Write after Read, WAR); I writes / J writes (Write after Write, WAW); I reads / J reads (Read after Read, RAR, not a hazard).]
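The four cases can be expressed as a small classifier (a sketch; the function and register-set encoding are illustrative, not from the slides):

```python
# Classify hazards between instruction I (earlier) and J (later),
# given each instruction's destination and source registers.
def classify(i_dst, i_srcs, j_dst, j_srcs):
    hazards = []
    if i_dst and i_dst in j_srcs:
        hazards.append("RAW")  # J reads what I writes (true dependence)
    if i_dst and i_dst == j_dst:
        hazards.append("WAW")  # both write the same register
    if j_dst and j_dst in i_srcs:
        hazards.append("WAR")  # J writes what I reads
    return hazards  # empty list: RAR or independent, not a hazard

# ADD R1,R2,R3 followed by SUB R4,R1,R5: a RAW hazard on R1
print(classify("R1", {"R2", "R3"}, "R4", {"R1", "R5"}))  # ['RAW']
```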
Read after write (RAW) hazards

• With a RAW hazard, instruction j tries to read a source operand before instruction i writes it.
• Thus, j would incorrectly receive an old or incorrect value.
• Graphically/Example:

i: ADD R1, R2, R3
j: SUB R4, R1, R6

… j i …  (instruction j is a read instruction issued after i; instruction i is a write instruction issued before j)

• Can use stalling or forwarding to resolve this hazard

Write after write (WAW) hazards

• With a WAW hazard, instruction j tries to write an operand before instruction i writes it.
• The writes are performed in the wrong order, leaving the value written by the earlier instruction.
• Graphically/Example:

i: SUB R1, R4, R3
j: ADD R1, R2, R3

… j i …  (instruction j is a write instruction issued after i; instruction i is a write instruction issued before j)

Write after read (WAR) hazards

• With a WAR hazard, instruction j tries to write an operand before instruction i reads it.
• Instruction i would incorrectly receive a newer value of its operand;
– instead of getting the old value, it could receive some newer, undesired value.
• Graphically/Example:

i: SUB R4, R1, R3
j: ADD R1, R2, R3

… j i …  (instruction j is a write instruction issued after i; instruction i is a read instruction issued before j)

Data Hazards Requiring Stall Cycles

• In some code sequence cases, potential data hazards cannot be handled by bypassing. For example:

LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9

• The LW instruction has the data in clock cycle 4 (its MEM cycle).
• The SUB instruction needs the data of R1 at the beginning of that cycle.
• The hazard is prevented by a hardware pipeline interlock causing a stall cycle.
Data Hazard Even with Forwarding

[Figure, left: lw r1, 0(r2) followed by sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9. The lw’s data is available only at the end of its MEM cycle, too late to forward to the sub’s EX cycle.]

[Figure, right: the same sequence with a one-cycle bubble inserted after the lw; the MEM/WB result can then be forwarded to the sub’s delayed EX cycle, and the bubble propagates through the and and or.]

Hardware Pipeline Interlocks

• A hardware pipeline interlock detects a data hazard and stalls the pipeline until the hazard is cleared.
• The CPI for the stalled instruction increases by the length of the stall.
• For the previous example, without a stall cycle:

LW R1, 0(R1)    IF  ID  EX  MEM WB
SUB R4,R1,R5        IF  ID  EX  MEM WB
AND R6,R1,R7            IF  ID  EX  MEM WB
OR R8, R1, R9               IF  ID  EX  MEM WB

With a stall cycle (stall + forward):

LW R1, 0(R1)    IF  ID  EX    MEM WB
SUB R4,R1,R5        IF  ID    STALL EX  MEM WB
AND R6,R1,R7            IF    STALL ID  EX  MEM WB
OR R8, R1, R9                 STALL IF  ID  EX  MEM WB

Data hazards and the compiler

• The compiler should be able to help eliminate some stalls caused by data hazards
• i.e. the compiler could avoid generating a LOAD instruction that is immediately followed by an instruction that uses the result of the LOAD’s destination register.
• This technique is called “pipeline/instruction scheduling”
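A load-use stall like the one interlocked above can be counted with a small sketch (assumes full forwarding, so only a load feeding the very next instruction stalls; the tuple encoding is illustrative):

```python
# Count load-use stall cycles: with forwarding, a stall is needed only
# when a load's destination is a source of the immediately following
# instruction.
def load_use_stalls(instrs):
    # each instruction: (opcode, destination, sources)
    return sum(1 for prev, cur in zip(instrs, instrs[1:])
               if prev[0] == "LW" and prev[1] in cur[2])

seq = [("LW",  "R1", ("R2",)),
       ("SUB", "R4", ("R1", "R5")),  # uses R1 right after the load
       ("AND", "R6", ("R1", "R7")),
       ("OR",  "R8", ("R1", "R9"))]
print(load_use_stalls(seq))  # 1
```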
Some example situations

• No dependence:
LW R1, 45(R2); ADD R5, R6, R7; SUB R8, R6, R7; OR R9, R6, R7
Action: no hazard possible because no dependence exists on R1 in the immediately following three instructions.

• Dependence requiring a stall:
LW R1, 45(R2); ADD R5, R1, R7; SUB R8, R6, R7; OR R9, R6, R7
Action: comparators detect the use of R1 in the ADD and stall the ADD (and SUB and OR) before the ADD begins EX.

• Dependence overcome by forwarding:
LW R1, 45(R2); ADD R5, R6, R7; SUB R8, R1, R7; OR R9, R6, R7
Action: comparators detect the use of R1 in the SUB and forward the result of the LOAD to the ALU in time for the SUB to begin EX.

• Dependence with accesses in order:
LW R1, 45(R2); ADD R5, R6, R7; SUB R8, R6, R7; OR R9, R1, R7
Action: no action is required because the read of R1 by the OR occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half.

Static Compiler Instruction Scheduling (Re-Ordering) for Data Hazard Stall Reduction

• Many types of stalls resulting from data hazards are very frequent. For example:
A = B + C
produces a stall when loading the second data value.
• Rather than allow the pipeline to stall, the compiler could sometimes schedule the pipeline to avoid stalls.
• Compiler pipeline or instruction scheduling involves rearranging the code sequence (instruction reordering) to eliminate or reduce the number of stall cycles.
• Static = at compilation time, by the compiler; Dynamic = at run time, by hardware in the CPU

Static Compiler Instruction Scheduling Example

• For the code sequence:
a = b + c
d = e - f
(a, b, c, d, e, and f are in memory)
• Assuming loads have a latency of one clock cycle, the following compiler schedule eliminates stalls:

Original code with stalls:      Scheduled code with no stalls:
LW  Rb,b                        LW  Rb,b
LW  Rc,c                        LW  Rc,c
Stall                           LW  Re,e
ADD Ra,Rb,Rc                    ADD Ra,Rb,Rc
SW  Ra,a                        LW  Rf,f
LW  Re,e                        SW  Ra,a
LW  Rf,f                        SUB Rd,Re,Rf
Stall                           SW  Rd,d
SUB Rd,Re,Rf
SW  Rd,d

2 stalls for the original code; no stalls for the scheduled code.

Performance of Pipelines with Stalls

• Hazard conditions in pipelines may make it necessary to stall the pipeline by a number of cycles, degrading performance from the ideal pipelined CPU CPI of 1.

CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
= 1 + Pipeline stall clock cycles per instruction

• If pipelining overhead is ignored and we assume that the stages are perfectly balanced, then the speedup from pipelining is given by:

Speedup = CPI unpipelined / CPI pipelined
= CPI unpipelined / (1 + Pipeline stall cycles per instruction)

• When all instructions in the multicycle CPU take the same number of cycles, equal to the number of pipeline stages, then:

Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
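The two schedules above can be compared with a simple load-use stall counter (a sketch assuming a one-cycle load latency; the tuple encoding is illustrative):

```python
# A load followed immediately by a consumer of its result costs one
# stall cycle; reordering independent loads hides that latency.
def stalls(seq):  # seq: list of (opcode, destination, sources)
    return sum(1 for p, c in zip(seq, seq[1:])
               if p[0] == "LW" and p[1] in c[2])

original = [("LW", "Rb", ()), ("LW", "Rc", ()),
            ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
            ("LW", "Re", ()), ("LW", "Rf", ()),
            ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",))]
scheduled = [("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
             ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
             ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
             ("SW", None, ("Rd",))]
print(stalls(original), stalls(scheduled))  # 2 0
```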
Control Hazards

[Figure: 10: beq r1,r3,36 followed by 14: and r2,r3,r5; 18: or r6,r1,r7; 22: add r8,r1,r9; then 36: xor r10,r1,r11. The three instructions after the branch are fetched before the branch outcome is known.]

Control Hazard on Branches: Three Stage Stall

• When a conditional branch is executed it may change the PC and, without any special measures, leads to stalling the pipeline for a number of cycles until the branch condition is known (the branch is resolved).
– Otherwise the PC may not be correct when needed in IF
• In the current MIPS pipeline, the conditional branch is resolved in stage 4 (the MEM stage), resulting in three stall cycles as shown below:

Branch instruction    IF  ID  EX  MEM WB
Branch successor      stall stall stall IF  ID  EX  MEM WB
Branch successor + 1                    IF  ID  EX  MEM WB
Branch successor + 2                        IF  ID  EX  MEM
Branch successor + 3                            IF  ID  EX
Branch successor + 4                                IF  ID
Branch successor + 5                                    IF

(3 stall cycles)

Assuming we stall or flush the pipeline on a branch instruction:
• Three clock cycles are wasted for every branch in the current MIPS pipeline
• Branch penalty = stage number where the branch is resolved - 1
here Branch penalty = 4 - 1 = 3 cycles

Reducing Branch Stall Cycles

Pipeline hardware measures to reduce branch stall cycles:
1. Find out whether a branch is taken earlier in the pipeline.
2. Compute the taken PC earlier in the pipeline.

In MIPS:
– MIPS branch instructions (BEQZ, BNE) test a register for equality to zero.
– This can be completed in the ID cycle by moving the zero test into that cycle.
– Both PCs (taken and not taken) must be computed early.
– This requires an additional adder, because the current ALU is not usable until the EX cycle.
– The result is just a single-cycle stall on branches.
• Interplay of instruction set design and cycle time.

Pipelined MIPS Datapath

[Figure: modified MIPS pipeline datapath with an extra adder and the zero test moved into the ID stage, so conditional branches are completed in ID. The branch is resolved in stage 2 (ID), so Branch penalty = 2 - 1 = 1.]
Branch Stall Impact

• If CPI = 1 and 30% of instructions are branches, a 3-cycle stall => new CPI = 1.9!
• Two-part solution:
– Determine whether the branch is taken or not sooner, AND
– Compute the taken-branch address earlier
• A MIPS branch tests if a register = 0 or ≠ 0
• MIPS solution:
– Move the zero test to the ID/RF stage
– Add an adder to calculate the new PC in the ID/RF stage
– 1 clock cycle penalty for a branch versus 3

Four Branch Hazard Alternatives

#1: Stall until the branch direction is clear

#2: Predict Branch Not Taken (the most common scheme)
– Execute successor instructions in sequence
– “Squash” instructions in the pipeline if the branch is actually taken
– Advantage of late pipeline state update
– 47% of MIPS branches are not taken on average
– PC+4 is already calculated, so use it to get the next instruction

#3: Predict Branch Taken
– 53% of MIPS branches are taken on average
– But the branch target address hasn’t been calculated yet in MIPS
• MIPS still incurs a 1-cycle branch penalty
• Other machines: the branch target may be known before the outcome
– What happens when we hit a not-taken branch?
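The CPI impact quoted above follows directly (a sketch; the function name is illustrative):

```python
# Effect of branch stalls on CPI: base CPI plus the stall cycles
# contributed by branches.
def cpi_with_branches(base_cpi, branch_freq, penalty):
    return base_cpi + branch_freq * penalty

print(round(cpi_with_branches(1.0, 0.30, 3), 1))  # 1.9 with a 3-cycle penalty
print(round(cpi_with_branches(1.0, 0.30, 1), 1))  # 1.3 with a 1-cycle penalty
```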

Predict Branch Not-Taken Scheme

• Not-taken branch: no stall (this is the most common scheme)
• Taken branch: stall

[Figure: pipeline diagrams for an untaken branch (successors proceed with no stall) and a taken branch (the fetched successor is squashed and the target fetched, costing one stall with the reduced branch penalty of 1).]

• Assuming the MIPS pipeline with reduced branch penalty = 1: stall only when the branch is taken
• Pipeline stall cycles from branches = frequency of taken branches x branch penalty

Four Branch Hazard Alternatives

#4: Delayed Branch
– Define the branch to take place AFTER a following instruction:

branch instruction
sequential successor1
sequential successor2
........                (branch delay of length n)
sequential successorn
branch target if taken

– A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline
– MIPS uses this
Delayed Branch Example

[Figure: with a single branch delay slot, neither the not-taken branch nor the taken branch stalls; the delay-slot instruction is executed in both cases. Assuming branch penalty = 1 cycle.]

Scheduling Branch Delay Slots

A. From before the branch:
add $1,$2,$3
if $2=0 then
  (delay slot)
becomes:
if $2=0 then
  add $1,$2,$3

B. From the branch target:
sub $4,$5,$6
...
add $1,$2,$3
if $1=0 then
  (delay slot)
becomes:
add $1,$2,$3
if $1=0 then
  sub $4,$5,$6

C. From the fall-through:
add $1,$2,$3
if $1=0 then
  (delay slot)
or $7,$8,$9
sub $4,$5,$6
becomes:
add $1,$2,$3
if $1=0 then
  or $7,$8,$9
sub $4,$5,$6

• A is the best choice: it fills the delay slot and reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, it must be okay to execute the moved instruction when the branch goes the other way

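Case A above (filling the slot from before the branch) is mechanical enough to sketch in code. The helper below is a hypothetical illustration, not a real compiler pass: instructions are (dest, srcs, text) tuples and only the register-dependence check from the slide is modeled.

```python
# Case A rescheduling: move the last pre-branch instruction into the
# delay slot, provided the branch condition does not read the register
# that instruction writes.
def fill_delay_slot_from_before(instrs, branch_srcs):
    """Return the rescheduled instruction list, or the original list
    (delay slot left as a nop) when the branch depends on the candidate."""
    *before, branch, slot = instrs
    assert slot[2] == "delay slot"
    candidate = before[-1]
    if candidate[0] not in branch_srcs:   # branch independent of it?
        return before[:-1] + [branch, candidate]
    return instrs

code = [
    ("$1", ("$2", "$3"), "add $1,$2,$3"),
    (None, ("$2",),      "if $2=0 then"),
    (None, (),           "delay slot"),
]
scheduled = fill_delay_slot_from_before(code, branch_srcs=("$2",))
print([i[2] for i in scheduled])   # ['if $2=0 then', 'add $1,$2,$3']
```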
Delayed Branch-delay Slot Scheduling Strategies

The branch-delay slot instruction can be chosen from three cases:

A. An independent instruction from before the branch (common):
   Always improves performance when used. The branch must not depend on the
   rescheduled instruction.

B. An instruction from the target of the branch:
   Improves performance if the branch is taken and may require instruction
   duplication. This instruction must be safe to execute if the branch is not taken.

C. An instruction from the fall-through instruction stream:
   Improves performance when the branch is not taken. The instruction must be
   safe to execute when the branch is taken.

The performance and usability of cases B and C are improved by using a
canceling or nullifying branch.

71

Delayed Branch

• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots are useful in
    computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: as processors go to deeper pipelines and multiple
  issue, the branch delay grows and needs more than one delay slot
  – Delayed branching has lost popularity compared to more expensive but more
    flexible dynamic approaches
  – Growth in available transistors has made dynamic approaches relatively
    cheaper

72
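The 50% figure quoted above is just the product of the two compiler statistics (strictly 48%, which the slide rounds up):

```python
filled = 0.60   # fraction of delay slots the compiler manages to fill
useful = 0.80   # fraction of filled slots doing useful work
usefully_filled = filled * useful
print(f"{usefully_filled:.0%} of slots usefully filled")   # 48%
```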
Evaluating Branch Alternatives

                          Pipeline depth
  Pipeline speedup = -------------------------------------
                     1 + Branch frequency x Branch penalty

Assume: 4% unconditional branch, 6% conditional branch untaken,
10% conditional branch taken

  Scheduling scheme    Branch penalty     CPI    Speedup vs.   Speedup vs.
                                                 unpipelined   stall
  Stall pipeline       3                  1.60   3.1           1.0
  Predict not taken    1x0.04 + 3x0.10    1.34   3.7           1.19
  Predict taken        1x0.14 + 2x0.06    1.26   4.0           1.29
  Delayed branch       0.5                1.10   4.5           1.45

73

Pipeline Performance Example

• Assume the following MIPS instruction mix:

  Type         Frequency
  Arith/Logic  40%
  Load         30%   (25% of loads are followed immediately by an
                      instruction using the loaded value)
  Store        10%
  Branch       20%   (45% of branches are taken)

• What is the resulting CPI for the pipelined MIPS with forwarding and branch
  address calculation in the ID stage when using a branch not-taken scheme?
  (Branch penalty = 1 cycle)

  CPI = Ideal CPI + pipeline stall clock cycles per instruction
      = 1 + stalls by loads + stalls by branches
      = 1 + 0.3 x 0.25 x 1 + 0.2 x 0.45 x 1
      = 1 + 0.075 + 0.09
      = 1.165

74
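The table and the example on these two slides can be recomputed directly. The 5-stage depth is the assumption behind the "speedup vs. unpipelined" column:

```python
# Branch mix from the left slide: 4% unconditional, 6% conditional
# untaken, 10% conditional taken (20% branches overall).
uncond, cond_untaken, cond_taken = 0.04, 0.06, 0.10
depth = 5   # assumed 5-stage pipeline behind the "vs. unpipelined" column

schemes = {  # scheme -> branch stall cycles per instruction
    "stall pipeline":    (uncond + cond_untaken + cond_taken) * 3,
    "predict not taken": uncond * 1 + cond_taken * 3,
    "predict taken":     (uncond + cond_taken) * 1 + cond_untaken * 2,
    "delayed branch":    (uncond + cond_untaken + cond_taken) * 0.5,
}
for name, stall in schemes.items():
    print(f"{name:17s}  CPI={1 + stall:.2f}  speedup={depth / (1 + stall):.1f}")

# Right-hand example: 30% loads (25% load-use), 20% branches (45% taken),
# 1-cycle penalty for each hazard.
cpi_example = 1 + 0.30 * 0.25 * 1 + 0.20 * 0.45 * 1
print(f"example CPI = {cpi_example:.3f}")   # example CPI = 1.165
```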
Pipelining Summary

• Pipelining overlaps the execution of multiple instructions.
• With an ideal pipeline, the CPI is one, and the speedup is equal to the
  number of stages in the pipeline.
• However, several factors prevent us from achieving the ideal speedup,
  including:
  – Not being able to divide the pipeline evenly
  – The time needed to empty and flush the pipeline
  – Overhead needed for pipelining
  – Structural, data, and control hazards
• Just overlap tasks; easy if tasks are independent

75

Pipelining Summary

• Speedup vs. pipeline depth; if ideal CPI is 1, then:

                 Pipeline depth         Clock cycle (unpipelined)
  Speedup = ------------------------ x ---------------------------
             1 + Pipeline stall CPI      Clock cycle (pipelined)

• Hazards limit performance:
  – Structural: need more HW resources
  – Data: need forwarding, compiler scheduling
  – Control: early evaluation & PC, delayed branch, prediction
• Increasing length of pipe increases impact of hazards; pipelining helps
  instruction bandwidth, not latency
• Compilers reduce cost of data and control hazards:
  – Load delay slots
  – Branch delay slots
  – Branch prediction

76
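The summary speedup formula as a helper; the numbers in the call are illustrative assumptions (5 stages, 0.25 stall cycles per instruction, unchanged cycle time), not slide data:

```python
def pipeline_speedup(depth, stall_cpi, clk_unpipelined, clk_pipelined):
    """Speedup = depth / (1 + stall CPI) x clock-cycle ratio, ideal CPI = 1."""
    return depth / (1 + stall_cpi) * (clk_unpipelined / clk_pipelined)

# 5 stages, 0.25 stall cycles per instruction, same cycle time before and
# after pipelining (ratio 1).
print(pipeline_speedup(5, 0.25, 1.0, 1.0))   # 4.0
```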
Example

• Code:
  (1) lw  $1, 40($6)
  (2) add $6, $2, $2
  (3) sw  $6, 50($1)
  (4) add $4, $5, $6
  (5) lw  $6, 10($4)

• A) Identify all the data dependencies in the code given above.
  – $1 in instr.(3) has a data dependency on instr.(1)
  – $6 in instr.(3) has a data dependency on instr.(2)
  – $6 in instr.(4) has a data dependency on instr.(2)
  – $4 in instr.(5) has a data dependency on instr.(4)

77

• B) Identify which dependencies will cause data hazards in the pipeline
  implementation (without forwarding hardware).
  – Instr.(1) loads the value into $1 at cycle 5, but instr.(3) reads $1 at cycle 4.
  – Instr.(2) writes its value to $6 at cycle 6, but instr.(3) reads $6 at cycle 4.
  – Instr.(2) writes its value to $6 at cycle 6, but instr.(4) reads $6 at cycle 5.
  – Instr.(4) writes its value to $4 at cycle 8, but instr.(5) reads $4 at cycle 6.

78
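The dependence list in part A can be reproduced mechanically. A small sketch: each instruction is (number, destination, sources), and each read is matched to the most recent earlier write of that register.

```python
# RAW-dependency finder over the slide's five instructions.
code = [
    (1, "$1", ["$6"]),          # lw  $1, 40($6)
    (2, "$6", ["$2", "$2"]),    # add $6, $2, $2
    (3, None, ["$6", "$1"]),    # sw  $6, 50($1)  (reads data and address)
    (4, "$4", ["$5", "$6"]),    # add $4, $5, $6
    (5, "$6", ["$4"]),          # lw  $6, 10($4)
]

def raw_dependencies(code):
    """Return (consumer, register, producer) triples, pairing each read
    with the most recent earlier write of that register."""
    deps = []
    for i, (n, _, srcs) in enumerate(code):
        for reg in set(srcs):
            for m, dest, _ in reversed(code[:i]):
                if dest == reg:
                    deps.append((n, reg, m))
                    break
    return sorted(deps)

for consumer, reg, producer in raw_dependencies(code):
    print(f"{reg} in instr.({consumer}) depends on instr.({producer})")
```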
Example

• C) Show the code after adding no-ops (stalls) to avoid hazards (needed for
  correctness in the absence of forwarding hardware):

  (1) lw  $1, 40($6)
  (2) add $6, $2, $2
      nop
      nop
  (3) sw  $6, 50($1)
  (4) add $4, $5, $6
      nop
      nop
  (5) lw  $6, 10($4)

• D) How many cycles will it take to execute this code (with no-ops added)?
  – Following the form in (C), it takes 13 cycles to execute this code.

79

• E) Now assume that the architecture has special hardware to implement
  forwarding, so that no-ops are added only in the cases where forwarding does
  not resolve the hazard. How many cycles will it take to execute the code?
  – Instr.(1) forwards the load result ($1) from the MEM/WB register to ID/EX
    as an ALU input at cycle 5.
  – Instr.(2) forwards the add result ($6) from the EX/MEM register to ID/EX
    as a store input at cycle 5.
  – Instr.(2) forwards the add result ($6) from the MEM/WB register to ID/EX
    as an ALU input at cycle 6.
  – Instr.(4) forwards the add result ($4) from the EX/MEM register to ID/EX
    as an ALU input at cycle 7.
  – With forwarding, it takes 9 cycles to execute the code.

80
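The cycle counts in parts D and E can be checked with a small counter. This is a sketch under the slides' assumptions: 5 stages (IF ID EX MEM WB), a register file written in the first half of a cycle and read in the second half, and, in the forwarding case, full EX/MEM and MEM/WB forwarding paths, so only a load feeding the very next instruction would still stall.

```python
def cycles(code, forwarding):
    """code: list of (dest, srcs, is_load). Return total execution cycles
    on a 5-stage pipeline, inserting stalls as dependences require."""
    start = []                           # IF cycle of each instruction
    for i, (_, srcs, _) in enumerate(code):
        c = start[i - 1] + 1 if i else 0
        for p in range(i):
            dest, _, p_load = code[p]
            if dest and dest in srcs:
                if forwarding:
                    # load results leave MEM; ALU results leave EX
                    need = start[p] + 2 if p_load else start[p] + 1
                else:
                    need = start[p] + 3  # wait for WB (split-cycle regfile)
                c = max(c, need)
        start.append(c)
    return start[-1] + 5                 # last instruction finishes WB

code = [
    ("$1", ["$6"], True),                # lw  $1, 40($6)
    ("$6", ["$2"], False),               # add $6, $2, $2
    (None, ["$6", "$1"], False),         # sw  $6, 50($1)
    ("$4", ["$5", "$6"], False),         # add $4, $5, $6
    ("$6", ["$4"], True),                # lw  $6, 10($4)
]
print(cycles(code, forwarding=False))    # 13, matching part D
print(cycles(code, forwarding=True))     # 9, matching part E
```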