Pipelining and Pipelining Hazards

This document summarizes a lecture on pipelining and pipelining hazards in computer architecture. It discusses how pipelining improves throughput by allowing multiple instructions to be processed simultaneously across different stages. However, pipelining can introduce hazards such as structural hazards when resources are busy, data hazards due to dependencies between instructions, and control hazards with branches. These hazards are addressed through techniques like stalling the pipeline, forwarding results, code reordering, and dynamic branch prediction.


CE-820 Spring 2023

Advanced Computer Architecture

Lecture # 07
Pipelining and Pipelining Hazards

Muhammad Imran
[email protected]
Acknowledgement

▪ Content from the following has been used in these lectures
▪ Computer Organization and Design (RISC-V Edition), Patterson and Hennessy
▪ Computer Architecture: A Quantitative Approach, 6th Edition, Hennessy and Patterson

Contents

▪ Pipelining to improve throughput
▪ Pipelining hazards

Pipelining to Improve Throughput
A Laundry Analogy

▪ Pipelining helps execute multiple tasks in parallel
▪ Improves throughput …

Improvement by Pipelining

▪ Pipelining only improves throughput
▪ Individual tasks still take the same amount of time
▪ When the number of tasks is large and the pipeline stages are perfectly balanced,
▪ Improvement in performance ≈ Number of pipeline stages
▪ Example
▪ Suppose each stage in the laundry takes 10 minutes and there are 4 stages
▪ Without pipelining: 4 loads take 40×4 = 160 minutes!
▪ With pipelining: 4 loads take 70 minutes!
▪ With a large number of loads, the improvement approaches 4 times!

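
The laundry numbers above can be checked with a small script (a sketch, not from the lecture; the function names are my own):

```python
# Time for n loads through a k-stage pipeline where every stage takes
# the same time t (perfectly balanced stages).

def unpipelined_time(n_loads, n_stages, stage_time):
    # Each load must finish all stages before the next load starts.
    return n_loads * n_stages * stage_time

def pipelined_time(n_loads, n_stages, stage_time):
    # The first load fills the pipeline; each later load finishes
    # one stage time after the previous one.
    return (n_stages + n_loads - 1) * stage_time

print(unpipelined_time(4, 4, 10))  # 160 minutes
print(pipelined_time(4, 4, 10))    # 70 minutes
```

For a very large number of loads the ratio of the two times approaches 4, the number of stages.
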
Pipelining in RISC-V

▪ Instruction execution stages
1. Fetch instruction from memory.
2. Read registers and decode the instruction.
3. Execute the operation or calculate an address.
4. Access an operand in data memory (if necessary).
5. Write the result into a register (if necessary).
▪ Therefore, we can implement a five-stage pipeline for RISC-V!
▪ RISC-V pipeline stages are not perfectly balanced!

Pipelining in RISC-V

▪ Example of RISC-V instruction execution time!
▪ Assume the multiplexors, control unit, PC access, and sign-extension unit have no delay!
▪ Load takes the longest!
▪ For a single-cycle design, the clock period must equal the load's total execution time!
▪ Among individual stages, the longest time is 200 ps!
▪ The clock period in the pipelined design must be at least 200 ps (+ tclk2q + ts), although some stages take only 100 ps!

Pipelining in RISC-V

▪ Time to execute three instructions
▪ Without pipelining: 2400 ps
▪ After pipelining: 1400 ps
▪ Time between the first and third instruction = 2×200 ps = 400 ps

Pipelining in RISC-V

▪ How about adding 1,000,000 more instructions (to the current 3)?
▪ Each added instruction contributes 200 ps; total execution time with pipelining would be 1400 ps + 200×1,000,000 ps = 200,001,400 ps
▪ For the non-pipelined design, execution time = 2400 ps + 1,000,000×800 ps = 800,002,400 ps
▪ Improvement is about 4 times, equal to the reduction in clock period (800 ps down to 200 ps), given the imbalanced stages!

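
The arithmetic on this slide can be reproduced directly (a sketch using the slide's own constants; the function names are mine):

```python
# Pipelined: 1400 ps for the first three instructions, then one more
# instruction completes every 200 ps clock cycle.
# Unpipelined: 800 ps per instruction.

def pipelined_total(extra_instructions):
    return 1400 + 200 * extra_instructions

def unpipelined_total(extra_instructions):
    return 2400 + 800 * extra_instructions

print(pipelined_total(1_000_000))    # 200001400 ps
print(unpipelined_total(1_000_000))  # 800002400 ps
print(unpipelined_total(1_000_000) / pipelined_total(1_000_000))  # ~4.0
```
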
Pipelining in RISC-V

▪ Implementing RISC-V pipelining
▪ RISC-V instructions are easier to pipeline because of
▪ Fixed instruction size
▪ Easy to fetch and decode!
▪ Fixed location of source/destination operands
▪ Memory operands appear only in loads/stores
▪ The execution stage can be used for address calculation!
▪ Access memory in the last stage!
▪ Allowing memory operands in other instructions would increase the number of pipeline stages / imbalance!
▪ Fewer instruction formats!
▪ The x86 architecture has variable instruction sizes
▪ Hard to pipeline
▪ x86 instructions are translated into RISC-like micro-operations to implement pipelining

Pipelining is a programmer-invisible technique for performance improvement …

Pipelining Hazards
What are hazards?

▪ Situations when the next instruction in the pipeline cannot correctly execute in the following cycle
▪ Three types of hazards
▪ Structural hazards
▪ Data hazards
▪ Control hazards

Structural Hazards

▪ When the hardware cannot execute a combination of instructions in the same cycle
▪ The required hardware resource is busy!
▪ Example
▪ If we had a single memory for instructions and data in RISC-V
▪ A memory access from one instruction and the instruction fetch of another couldn't execute in the same cycle
▪ The RISC-V instruction set was designed to be pipelined
▪ Easier to avoid structural hazards in a pipelined implementation

Data Hazards

▪ When an instruction cannot execute because the data it requires is not yet available
▪ Dependence of one instruction on an earlier instruction in the pipeline
▪ Example
▪ add x1, x2, x3
▪ sub x4, x1, x5
▪ The first instruction writes x1 back only in its last stage, after the second instruction has already needed it!
▪ Without addressing the data hazard, an outdated value of x1 will be read!
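
A RAW dependence like the one above can be detected mechanically. The sketch below is illustrative only (the instruction encoding and hazard window are my own assumptions, not from the lecture): in a classic five-stage pipeline without forwarding, a register written by one instruction is not safely readable by the next two instructions.

```python
# Detect read-after-write (RAW) hazards in a short instruction sequence.
# Each instruction is (dest_reg, [source_regs]); dest is None for stores.

def raw_hazards(instrs, window=2):
    """Return (producer, consumer) index pairs within `window` instructions."""
    hazards = []
    for i, (dest, _) in enumerate(instrs):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(instrs))):
            if dest in instrs[j][1]:
                hazards.append((i, j))
    return hazards

seq = [("x1", ["x2", "x3"]),  # add x1, x2, x3
       ("x4", ["x1", "x5"])]  # sub x4, x1, x5
print(raw_hazards(seq))  # [(0, 1)] -- sub needs x1 before add writes it back
```
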


Addressing Data Hazards

▪ Simple solution
▪ Stall the pipeline!
▪ Wait until the data has been written
▪ Pipeline Stall or Bubble – a pipeline stall (wait) initiated to resolve a hazard
▪ Need to wait 3 cycles, which slows down execution!
▪ The compiler can insert stalls to resolve data hazards!
▪ Can we do better?

Addressing Data Hazards

▪ Forwarding or bypassing
▪ Forward the data to the next instruction as soon as it is available
▪ Do not wait for the write-back stage to complete!
▪ Forwarding only works if the destination stage is later in time than the source stage!

Addressing Data Hazards

▪ Forwarding cannot prevent all pipeline stalls!
▪ Example
▪ ld x1, 0(x2)
▪ sub x4, x1, x5
▪ x1 is not available for forwarding when the second instruction needs it!
▪ Known as a load-use data hazard!

Addressing Data Hazards

▪ Code reordering to prevent stalls
▪ Example
▪ Consider the following C code
▪ a = b + e;
▪ c = b + f;
▪ Compiled to the following RISC-V code
▪ ld x1, 0(x31)   # Load b
▪ ld x2, 8(x31)   # Load e
▪ add x3, x1, x2  # b + e
▪ sd x3, 24(x31)  # Store a
▪ ld x4, 16(x31)  # Load f
▪ add x5, x1, x4  # b + f
▪ sd x5, 32(x31)  # Store c
▪ Which instructions have data hazards?
▪ How many hazards are there with or without forwarding?

Addressing Data Hazards

▪ Code reordering to prevent stalls
▪ Example
▪ Data hazards when we can use forwarding

  Original:          Reordered:
  ld x1, 0(x31)      ld x1, 0(x31)
  ld x2, 8(x31)      ld x2, 8(x31)
  add x3, x1, x2     ld x4, 16(x31)
  sd x3, 24(x31)     add x3, x1, x2
  ld x4, 16(x31)     sd x3, 24(x31)
  add x5, x1, x4     add x5, x1, x4
  sd x5, 32(x31)     sd x5, 32(x31)

▪ How can we reorder the code to avoid stalls?
▪ Simply moving the third ld earlier in the sequence avoids both hazards!
▪ How much does this improve performance?
▪ The new code sequence executes two cycles faster (assuming that forwarding is used)!

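
Counting the load-use stalls before and after reordering can be sketched as follows (assuming forwarding, so only a load immediately followed by a consumer of its result stalls; the dictionary encoding is my own, not from the lecture):

```python
# One bubble per load-use pair: a load whose destination register is a
# source of the very next instruction.

def load_use_stalls(instrs):
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev["op"] == "ld" and prev["dest"] in cur["srcs"]:
            stalls += 1
    return stalls

original = [
    {"op": "ld",  "dest": "x1", "srcs": ["x31"]},        # ld x1, 0(x31)
    {"op": "ld",  "dest": "x2", "srcs": ["x31"]},        # ld x2, 8(x31)
    {"op": "add", "dest": "x3", "srcs": ["x1", "x2"]},   # add x3, x1, x2
    {"op": "sd",  "dest": None, "srcs": ["x3", "x31"]},  # sd x3, 24(x31)
    {"op": "ld",  "dest": "x4", "srcs": ["x31"]},        # ld x4, 16(x31)
    {"op": "add", "dest": "x5", "srcs": ["x1", "x4"]},   # add x5, x1, x4
    {"op": "sd",  "dest": None, "srcs": ["x5", "x31"]},  # sd x5, 32(x31)
]
# Move the third load (index 4) up, as on the slide.
reordered = original[:2] + [original[4]] + original[2:4] + original[5:]
print(load_use_stalls(original), load_use_stalls(reordered))  # 2 0
```
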
Addressing Data Hazards

▪ Forwarding implementation in RISC-V
▪ RISC-V instructions write at most one result!
▪ The result is written in the last stage!
▪ Forwarding is harder if
▪ More than one result must be forwarded per instruction!

Control Hazards

▪ When a conditional branch instruction must decide which instruction should be next
▪ The next instruction cannot be fetched until the branch decision is finalized!
▪ Also known as branch hazards
▪ Two important considerations
▪ When is the branch target known (computed)?
▪ When is the branch decision known (test evaluated)?
▪ These two factors determine the penalties (additional cycles) for control hazards!

Addressing Control Hazards

▪ The branch decision and target address can be finalized in the second stage by adding extra hardware
▪ Still, the pipeline may need to be stalled!
▪ Example

Addressing Control Hazards

▪ What is the drawback of finalizing the branch decision in the ID (second) stage?
▪ It can introduce new data hazards!
▪ If the branch depends on an earlier instruction!
▪ Forwarding to the second stage resolves fewer of those hazards because the destination stage is too early!

Addressing Control Hazards

▪ Prediction
▪ A better solution to control hazards
▪ Predict the outcome of the branch and fetch the next instruction accordingly!
▪ If the prediction is wrong, fetch the right instruction again!
▪ Prediction can be static or dynamic!


Addressing Control Hazards

▪ Static prediction (compile-time solution)
▪ Predict all branches as taken or all as not taken
▪ Last example using prediction
▪ When the prediction is correct!
▪ When the prediction is wrong!

Addressing Control Hazards

▪ Static prediction (compile-time solution)
▪ Predict some branches as taken and some as not taken
▪ For example, a branch instruction at the end of a loop is usually taken
▪ Predict all branches to earlier addresses (backward branches) as taken!
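
The backward-taken heuristic can be expressed in a couple of lines (an illustrative sketch; the addresses are hypothetical):

```python
# Static heuristic: predict backward branches (loops) as taken and
# forward branches as not taken, by comparing target and branch addresses.

def static_predict_taken(branch_pc, target_pc):
    # Target at an earlier address -> likely a loop-closing branch.
    return target_pc < branch_pc

print(static_predict_taken(0x100, 0x0C0))  # True  (backward, loop branch)
print(static_predict_taken(0x100, 0x140))  # False (forward branch)
```
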
Addressing Control Hazards

▪ Dynamic prediction (runtime solution)
▪ Predict based on the observed behavior of each branch instruction!
▪ Keep a history of branches as taken or not taken
▪ Predict based on the prevalent behavior of each branch!
▪ With enough history, such predictors achieve accuracy above 90%!
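
A common dynamic scheme is a 2-bit saturating counter per branch. The minimal model below is an illustration, not the lecture's design (a real branch history table is indexed by PC bits and holds many counters):

```python
# 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken.
# Two consecutive mispredictions are needed to flip the prediction, so a
# single loop exit does not destroy the "taken" prediction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times then not taken once, repeated 3 times:
p = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3
correct = sum(p.predict() == t or p.update(t) for t in [])  # placeholder removed
correct = 0
for t in outcomes:
    correct += (p.predict() == t)
    p.update(t)
print(correct / len(outcomes))  # 0.9 -- only the loop exits mispredict
```
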

▪ Can you think of any other solution to control hazards?


Addressing Control Hazards

▪ Delayed branch (compile-time solution)
▪ Delay the branch decision!
▪ Fill the delay slot with an instruction that is not affected by the branch!
▪ Effective for one-cycle branch delays!
▪ Example
▪ add x1, x2, x4
▪ beq x5, x6, somewhere
▪ Reorder the instructions as
▪ beq x5, x6, somewhere
▪ add x1, x2, x4
▪ Handled by the assembler!
▪ Invisible to the assembly language programmer!

Instruction sets can make pipelining easier or harder …

Knowledge Check!

▪ Tell whether the following code sequence must stall, can avoid stalls with only forwarding, or can execute without stalls or forwarding …
▪ Example 1
▪ ld x10, 0(x10)
▪ add x11, x10, x10
▪ Answer:
▪ Cannot fully avoid the stall (load-use hazard)!
▪ Forwarding reduces the stall by one cycle!

Knowledge Check!

▪ Tell whether the following code sequence must stall, can avoid stalls with only forwarding, or can execute without stalls or forwarding …
▪ Example 2
▪ add x11, x10, x10
▪ addi x12, x10, 5
▪ addi x14, x11, 5
▪ Answer:
▪ The third instruction needs x11 before the first writes it back!
▪ A read-after-write data hazard!
▪ The stall can be avoided by forwarding!

Knowledge Check!

▪ Tell whether the following code sequence must stall, can avoid stalls with only forwarding, or can execute without stalls or forwarding …
▪ Example 3
▪ addi x11, x10, 1
▪ addi x12, x10, 2
▪ addi x13, x10, 3
▪ addi x14, x10, 4
▪ addi x15, x10, 5
▪ Answer:
▪ No stalls, even without forwarding!

Pipelining performance when considering stalls due to hazards …

Performance without Stalls

▪ In general,

  Speedup = (Avg. time per instruction unpipelined) / (Avg. time per instruction pipelined)

▪ With balanced stages and no stalls,

  Avg. time per instruction pipelined = (Avg. time per instruction unpipelined) / (Number of pipeline stages)

  so Speedup = Number of pipeline stages

Performance with Stalls

▪ In general,

  Speedup = (Avg. time per instruction unpipelined) / (Avg. time per instruction pipelined)
          = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)

▪ Stalls increase the CPI above the ideal pipeline CPI!

  CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction

▪ What is the ideal pipeline CPI? It is 1, so

  CPI pipelined = 1 + Pipeline stall cycles per instruction

Performance with Stalls

  Speedup = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)

▪ Ignoring clock skew, the cycle time is the same in the pipelined and multi-cycle implementations!

  Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

▪ If all instructions take the same number of cycles, CPI unpipelined = pipeline depth!

  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

▪ Without stalls, speedup = pipeline depth!

Performance with Stalls

  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

▪ Stall cycles depend on the frequency of instructions causing stalls times the penalty for each such instruction
▪ For instance, for branches:

  Pipeline stall cycles for branches = Branch frequency × Branch penalty

▪ With this equation, we can compare different prediction schemes!

Example: Prediction Performance

▪ MIPS R4000 pipeline
▪ Three pipeline stages before the branch target address is known!
▪ Four pipeline stages until the branch comparison is done!
▪ Assume no stalls on the registers used in the conditional comparison!
▪ Find the effective addition to CPI due to branches, assuming
▪ Unconditional branches: 4%
▪ Conditional branches, untaken: 6%
▪ Conditional branches, taken: 10%
▪ Let's compare three branch handling schemes
▪ Flush pipeline
▪ Predicted taken
▪ Predicted untaken

Example: Prediction Performance

▪ Target address → 3rd stage, condition test → 4th stage
▪ Branch penalties

  Branch scheme       Penalty (unconditional)  Penalty (untaken)  Penalty (taken)
  Flush pipeline               2                       3                 3
  Predicted taken              2                       3                 2
  Predicted untaken            2                       0                 3

  Pipeline stall cycles for branches = Branch frequency × Branch penalty

Example: Prediction Performance

▪ If the base CPI = 1 and branches are the only source of stalls
▪ Stalling the pipeline (flush) is 1.56 times slower than the ideal!
▪ Predicted untaken is 1.38 times slower than the ideal!
▪ Predicted untaken is therefore 1.13 (= 1.56/1.38) times better than stalling the pipeline!
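
These figures follow from the penalty table and branch frequencies on the previous slides; a quick check (a sketch, with my own dictionary encoding of the table):

```python
# Effective CPI = 1 + sum over branch classes of (frequency x penalty).

freqs = {"uncond": 0.04, "cond_untaken": 0.06, "cond_taken": 0.10}

penalties = {
    "flush":             {"uncond": 2, "cond_untaken": 3, "cond_taken": 3},
    "predicted_taken":   {"uncond": 2, "cond_untaken": 3, "cond_taken": 2},
    "predicted_untaken": {"uncond": 2, "cond_untaken": 0, "cond_taken": 3},
}

def effective_cpi(scheme):
    stalls = sum(freqs[c] * penalties[scheme][c] for c in freqs)
    return 1 + stalls

for s in penalties:
    print(s, round(effective_cpi(s), 2))
# flush 1.56, predicted_taken 1.46, predicted_untaken 1.38
```
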
Relevant Reading

▪ Computer Organization and Design (RISC-V Edition), Patterson and Hennessy
▪ Chapter 4, Section 4.5!
▪ Computer Architecture: A Quantitative Approach, 6th Edition, Hennessy and Patterson
▪ Appendix C, Sections C.1 and C.2
