Instruction Level Parallelism: 1. Scoreboard and Tomasulo Algorithms
Definition of ILP
Why?
- Works when real dependences cannot be known at compile time.
- The compiler is simpler.
- Code compiled for one machine runs well on another.
Key Idea:
Allow instructions behind a stall to proceed. This enables out-of-order execution and out-of-order completion (commit). First implemented in the CDC 6600 (1963).
Example:
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14

ADDD must stall on F0 (waiting for DIVD to complete). SUBD does not depend on either instruction, but without dynamic scheduling it would stall behind ADDD.
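The dependences in this example can be checked mechanically. The following is a minimal sketch, assuming a simple three-register instruction format; the names `Instr` and `hazards` are illustrative and do not come from the slides.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str

def hazards(earlier: Instr, later: Instr) -> set:
    """Return the data hazards that `later` has on `earlier` (program order)."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")   # later reads what earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")   # later overwrites what earlier reads
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found

divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F12", "F8", "F14")

print(hazards(divd, addd))  # {'RAW'}
print(hazards(divd, subd))  # set()
print(hazards(addd, subd))  # set()
```

SUBD has no hazard with either earlier instruction, which is why a dynamically scheduled machine can let it proceed while ADDD waits.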
Scoreboard Scheme
- Issue: decode the instruction and check for structural hazards.
- Read Operands: wait until there are no data hazards, then read the operands.
Scoreboard Implications
Out-of-order completion introduces WAR and WAW hazards. Solutions for WAR:
- Queue both the operations and copies of their operands.
- Read registers only during the Read Operands stage.
For WAW, the machine stalls until the other instruction completes. The machine has multiple execution units, and the scoreboard keeps track of dependencies and of the state of operations.
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or a WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
A source operand is available if:
- no earlier issued active instruction will write it, or
- a functional unit is currently writing its value into a register.

When the source operands are available, the scoreboard tells the functional unit to read the operands from the registers and begin execution. RAW hazards are resolved dynamically in this step, and instructions may be sent into execution out of order.
FUs are characterized by:
- Latency: the effective time used to complete one operation.
- Initiation interval: the number of cycles that must elapse between issuing two operations to the same functional unit.
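To illustrate how the two parameters interact, here is a small sketch that computes when back-to-back operations on one unit complete. The function name `finish_cycles` and the numbers used are hypothetical, chosen only for the example.

```python
def finish_cycles(n_ops, latency, initiation_interval, first_start=0):
    """Cycle at which each of n_ops back-to-back operations on one FU completes."""
    return [first_start + k * initiation_interval + latency for k in range(n_ops)]

# Fully pipelined unit: a new operation can start every cycle.
print(finish_cycles(3, latency=4, initiation_interval=1))  # [4, 5, 6]

# Non-pipelined unit: the next operation waits for the previous one to finish.
print(finish_cycles(3, latency=4, initiation_interval=4))  # [4, 8, 12]
```

With the same latency, the initiation interval alone decides how quickly a stream of operations drains through the unit.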
Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction.
WAR Example
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F8,F8,F14

In this case, the scoreboard stalls SUBD in the WB stage until ADDD has read F0 and F8.
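The WAR check made before writing a result can be sketched as follows. This is a minimal sketch, assuming each active functional unit records its destination (Fi), its sources (Fj, Fk), and flags (Rj, Rk) that are true while a source is ready but not yet read. The unit names `Add1` and `Add2` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FUEntry:
    name: str
    Fi: str      # destination register
    Fj: str      # first source register
    Fk: str      # second source register
    Rj: bool     # True: Fj is ready but has not been read yet
    Rk: bool     # True: Fk is ready but has not been read yet

def can_write_result(fu, active):
    """WAR check: no other active unit still has to read fu's destination."""
    for other in active:
        if other is fu:
            continue
        if (other.Fj == fu.Fi and other.Rj) or (other.Fk == fu.Fi and other.Rk):
            return False
    return True

# ADDD F10,F0,F8 waits for F0 (from DIVD) and has not read F8 yet.
addd = FUEntry("Add1", Fi="F10", Fj="F0", Fk="F8", Rj=False, Rk=True)
# SUBD F8,F8,F14 has finished executing and wants to write F8.
subd = FUEntry("Add2", Fi="F8", Fj="F8", Fk="F14", Rj=False, Rk=False)

print(can_write_result(subd, [addd, subd]))  # False
addd.Rj = addd.Rk = False                    # ADDD has now read its operands
print(can_write_result(subd, [addd, subd]))  # True
```

SUBD is held back only for as long as ADDD still needs the old value of F8.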
Scoreboard structure
1. Instruction status
   Indicates which of the four steps the instruction is in (Issue, Read operands, Execution complete, Write result).
2. Functional unit status
   Indicates the state of the functional unit (FU):
   - Busy: whether the unit is busy or not
   - Op: the operation to perform in the unit (+, -, etc.)
   - Fi: destination register
   - Fj, Fk: source register numbers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status
   Indicates which functional unit will write each register. Blank if no pending instruction will write that register.
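The bookkeeping described above can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions, not the CDC 6600 logic: the names `FUStatus`, `try_issue`, and `can_read_operands` are hypothetical, and the register result status is modeled as a dictionary from register name to the unit that will write it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    Fi: Optional[str] = None
    Fj: Optional[str] = None
    Fk: Optional[str] = None
    Qj: Optional[str] = None  # unit producing Fj, None if no write is pending
    Qk: Optional[str] = None
    Rj: bool = False          # Fj ready and not yet read
    Rk: bool = False

def try_issue(fu_name, op, dest, src1, src2, fus, result):
    """Issue only if the unit is free (structural) and dest has no active writer (WAW)."""
    if fus[fu_name].busy or result.get(dest) is not None:
        return False
    qj, qk = result.get(src1), result.get(src2)
    fus[fu_name] = FUStatus(True, op, dest, src1, src2, qj, qk, qj is None, qk is None)
    result[dest] = fu_name
    return True

def can_read_operands(fu):
    """RAW check: both source operands must be available."""
    return fu.Rj and fu.Rk

fus = {"Mult1": FUStatus(), "Add": FUStatus(), "Divide": FUStatus()}
result = {}  # register result status

print(try_issue("Mult1", "MULTD", "F0", "F2", "F4", fus, result))         # True
print(try_issue("Divide", "DIVD", "F10", "F0", "F6", fus, result))        # True
print(can_read_operands(fus["Mult1"]), can_read_operands(fus["Divide"]))  # True False
print(fus["Divide"].Qj)                                                   # Mult1
print(try_issue("Add", "ADDD", "F0", "F8", "F2", fus, result))            # False
```

The last call fails because F0 already has a pending writer. That ADDD is a made-up instruction used only to trigger the WAW stall.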
Scoreboard Example
Instruction sequence:

    LD    F6, 34+R2
    LD    F2, 45+R3
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2

Each snapshot shows three tables:
- Instruction status: Issue, Read operands, Execution complete, Write result.
- Functional unit status for Integer, Mult1, Mult2, Add, and Divide: Time, Busy, Op, Fi (dest), Fj (S1), Fk (S2), and the ready flags.
- Register result status: the FU that will write each of F0, F2, ..., F30.

State recoverable from the snapshots, by clock cycle:
- Clock 1: Integer is busy with Load (Fi = F6, Fk = R2, Rk = Yes); F6 will be written by Integer.
- Clocks 2 to 4: same functional unit state; issue stalls at clocks 3 and 4.
- Clock 5: Integer is busy with Load (Fi = F2, Fk = R3, Rk = Yes); F2 will be written by Integer.
- Clock 6: MULTD is issued to Mult1 (Fi = F0, Fj = F2, Fk = F4).
- Clock 7: SUBD is issued to Add (Fi = F8, Fj = F6, Fk = F2).
- Clock 8: DIVD is issued to Divide (Fi = F10, Fj = F0, Fk = F6, Qj = Mult1, Rj = No, Rk = Yes).
- Clocks 9 to 19: F0 is still to be written by Mult1.
- Clock 22: now DIVD can read its operands, and ADDD can write its result.
- Clock 61: DIVD finishes.
- Clock 62: all register result entries are blank.
The scoreboard achieves a speedup of 2.5 with respect to no dynamic scheduling, whereas the compiler achieves only 1.7 by reorganizing instructions. But:
- No cache
- No forwarding hardware
- Limited to instructions within a basic block
- Small number of functional units (structural hazards)
- Waits for WAR hazards
- Prevents WAW hazards
Tomasulo Algorithm
Invented at IBM for the IBM 360/91, 3 years after the CDC 6600. Same goal: high performance without special compilers.
The control logic and the buffers are distributed with the FUs.
Operand buffers are called reservation stations (RS). Each instruction occupies an entry of a reservation station.
Its operands are replaced by values or by pointers (register renaming), which avoids WAR and WAW hazards.
There are more reservation stations than registers, so the hardware can do better optimizations than a compiler.
Results are dispatched to the other FUs through a Common Data Bus (CDB).
Loads and stores are treated as FUs.
Reservation station components
- Tag: identifies the RS
- Op: the operation to perform on the operands
- Vj, Vk: values of the source operands
- Qj, Qk: pointers to the RS that produce Vj, Vk
- Busy: indicates that the RS is busy
Other components
The RF and the store buffers have a Value (V) field and a Pointer (Q) field. Load buffers have an address field and a busy field. Store buffers also have an address field.
Issue
- Get an instruction I from the queue.
- If it is an FP operation, check whether an RS is free (i.e., check for structural hazards).
- Rename registers.
- WAR resolution: if I writes Rx, which is read by an already issued instruction K, then K already holds the value of Rx or knows which instruction will write it. So the RF entry can be linked to I.
- WAW resolution: since issue is in order, the RF entry can be linked to I.
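The issue step with renaming can be sketched as follows. This is a simplified sketch, not the IBM 360/91 design: it assumes a single pool of reservation stations shared by all operations, and the names `RS`, `Reg`, `issue`, and the register values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RS:
    tag: str
    busy: bool = False
    op: Optional[str] = None
    Vj: Optional[float] = None   # operand values, once known
    Vk: Optional[float] = None
    Qj: Optional[str] = None     # tags of the RS that will produce them
    Qk: Optional[str] = None

@dataclass
class Reg:
    V: float = 0.0
    Q: Optional[str] = None      # tag of the RS that will write this register

def issue(op, dest, src1, src2, stations, regs):
    """Issue one operation; return the RS tag used, or None on a structural hazard."""
    rs = next((s for s in stations if not s.busy), None)
    if rs is None:
        return None
    rs.busy, rs.op = True, op
    a, b = regs[src1], regs[src2]
    # Copy the value if it is available, otherwise remember who will produce it.
    rs.Vj, rs.Qj = (a.V, None) if a.Q is None else (None, a.Q)
    rs.Vk, rs.Qk = (b.V, None) if b.Q is None else (None, b.Q)
    regs[dest].Q = rs.tag        # rename: the register now points to this RS
    return rs.tag

stations = [RS("RS1"), RS("RS2"), RS("RS3")]
regs = {f"F{i}": Reg(V=float(i)) for i in range(0, 32, 2)}

issue("DIVD", "F0", "F2", "F4", stations, regs)
issue("ADDD", "F10", "F0", "F8", stations, regs)
issue("SUBD", "F8", "F8", "F14", stations, regs)

addd = stations[1]
print(addd.Qj, addd.Vk)   # RS1 8.0
print(regs["F8"].Q)       # RS3
print(issue("MULTD", "F6", "F2", "F4", stations, regs))  # None
```

ADDD holds its own copy of F8, so SUBD can later overwrite F8 without a WAR stall. The last call returns None because no reservation station is free.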
Execution
When both operands are ready, execute. If they are not ready, watch the Common Data Bus for results.
Write result
Write on Common Data Bus to all waiting units; mark reservation stations available.
A common data bus is a data + source bus. In the IBM 360/91, Data = 64 bits and Source = 4 bits. The FUs must perform an associative lookup in the RS.
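The broadcast on the common data bus amounts to a tag match in every waiting consumer. A minimal sketch, assuming string tags and two dictionaries standing in for the Q and V fields of the register file (all names here are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RS:
    tag: str
    Vj: Optional[float] = None
    Vk: Optional[float] = None
    Qj: Optional[str] = None
    Qk: Optional[str] = None

    def ready(self):
        return self.Qj is None and self.Qk is None

def broadcast(tag, value, stations, reg_q, reg_v):
    """Put (tag, value) on the bus: every consumer waiting on tag captures value."""
    for rs in stations:
        if rs.Qj == tag:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == tag:
            rs.Vk, rs.Qk = value, None
    for name, q in reg_q.items():
        if q == tag:
            reg_v[name], reg_q[name] = value, None

stations = [RS("Add1", Qj="Mult1", Vk=3.0), RS("Add2", Qj="Mult1", Qk="Add1")]
reg_q = {"F0": "Mult1", "F8": "Add1"}
reg_v = {"F0": 0.0, "F8": 0.0}

broadcast("Mult1", 6.0, stations, reg_q, reg_v)
print(stations[0].ready(), stations[1].ready())  # True False
print(reg_v["F0"], reg_q["F0"])                  # 6.0 None
```

Add2 is still waiting for the result of Add1, so it becomes ready only after a second broadcast.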
Tomasulo:
- Pipelined FUs
- Issue window size = 14
- No issue on structural hazards
- WAR and WAW avoided with renaming
- Results broadcast from the FUs
- Control distributed over the RS

Scoreboard:
- Multiple but not pipelined FUs
- Issue window size = 5
- No issue on structural hazards
- Completion stalls for WAW and WAR hazards
- Results written back to the registers
- Control centralized in the scoreboard
Branch Prediction
The current DLX wastes one cycle on a branch, but other architectures resolve branches several cycles after the IF stage. The branch outcome must be predicted as early as possible (in the ID stage). The performance of branch prediction depends on:
- Accuracy, measured as the percentage of mispredictions.
- Cost of a misprediction, measured as the time wasted executing useless instructions.
Branch History Table (BHT):
- A table of 1-bit values
- Indexed by the lower bits of the PC address
- Says whether or not the branch was taken last time
Problem: in a loop, a 1-bit BHT causes two mispredictions:
1. When we arrive at the end of the loop and must exit, the BHT predicts to stay in the loop.
2. When we re-enter the loop and reach its end for the first time, we must stay in the loop, but the BHT predicts to exit.
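The double misprediction can be reproduced with a single 1-bit entry. A minimal sketch, assuming a loop-closing branch that is taken 9 times and then falls through, with the loop executed 3 times; the iteration counts are arbitrary.

```python
def mispredictions_1bit(outcomes, initial=True):
    """Count mispredictions of a single 1-bit entry over a branch outcome sequence."""
    last, misses = initial, 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken   # remember only the most recent outcome
    return misses

# Loop-closing branch: taken 9 times, then not taken on exit.
loop = [True] * 9 + [False]
print(mispredictions_1bit(loop))      # 1
print(mispredictions_1bit(loop * 3))  # 5
```

The first execution mispredicts only the exit, because the entry is assumed to start as taken. Every later execution mispredicts twice: once on re-entry and once on exit.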
The 2-bit scheme changes the prediction only after mispredicting twice. For each index of the table, the 2 bits hold the state of a state machine. When we arrive at the end of the loop, the prediction does not change.
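One common realization of a 2-bit scheme is a saturating counter. The state machine in the original slide may differ in its transitions, so treat this as a sketch of the idea. It uses the same hypothetical loop as before (9 taken outcomes and one exit, run 3 times).

```python
def mispredictions_2bit(outcomes, state=3):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

loop = [True] * 9 + [False]
print(mispredictions_2bit(loop * 3))  # 3
```

Only the loop exit is mispredicted. The single not-taken outcome moves the counter from 3 to 2, which still predicts taken on re-entry.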
Measurements with a 4096-entry table show that programs have a misprediction rate between 1% and 18% (values reported in the figure: 1%, 18%, 9%, 12%).
Correlating Branches
Basic hypothesis: recent branches are correlated, i.e., the behavior of recently executed branches affects the prediction of the current branch:
L2:
        R3,R1,2
        R3,L1
        r1,r0,r0    ; bb1
        r3,r1,2
        r3,L2
        r2,r0,r0    ; bb2
        r3,r1,r2
        r3,L3
                    ; bb3
L3:
Branch L2 is correlated with the previous branches. If both are not taken, then L2 is taken.
Idea:
Record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table.
The branch is predicted on the basis of the previously executed one by selecting the appropriate 1 bit BHT.
[Figure: the outcome of the previously executed branch selects one of the 1-bit BHTs, which is then indexed by the branch to be predicted.]
(m,n) predictors
In general, an (m,n) predictor records the outcomes of the last m branches and uses them to select among 2^m history tables of n-bit predictors.
Each cell of the predictor represents the state of a 2 bit branch predictor.
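An (m,n) predictor can be sketched as 2^m tables of n-bit saturating counters, with a global history register choosing the table. The class below is an illustrative sketch: the table size, the indexing by `(pc >> 2) % entries`, and the use of saturating counters are assumptions made for the example.

```python
class CorrelatingPredictor:
    def __init__(self, m, n, entries):
        self.m, self.entries = m, entries
        self.max = (1 << n) - 1          # largest value of an n-bit counter
        self.history = 0                 # last m outcomes, most recent in bit 0
        self.tables = [[0] * entries for _ in range(1 << m)]

    def _slot(self, pc):
        return self.tables[self.history], (pc >> 2) % self.entries

    def predict(self, pc):
        table, i = self._slot(pc)
        return table[i] > self.max // 2  # upper half of the range predicts taken

    def update(self, pc, taken):
        table, i = self._slot(pc)
        table[i] = min(table[i] + 1, self.max) if taken else max(table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses

alternating = [True, False] * 10  # a branch whose outcome depends on the previous one
print(count_misses(CorrelatingPredictor(0, 2, 16), 0x40, alternating))  # 10
print(count_misses(CorrelatingPredictor(1, 2, 16), 0x40, alternating))  # 2
```

With m = 0 the predictor degenerates to a plain 2-bit BHT and misses every taken outcome of the alternating pattern. With one bit of history it learns the pattern after two misses.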
Frequency of Mispredictions
[Figure: frequency of mispredictions (axis from 0% to 12%) for three predictors: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT. Benchmarks: doduc, nasa7, gcc, espresso, spice, tomcatv, eqntott, fpppp, matrix300, li.]
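The PC of the fetched instruction is compared, through an associative lookup, against the stored branch addresses. Each entry holds a predicted PC.
- No match: the instruction is not predicted to be a branch; proceed normally.
- Match: the instruction is a branch, and the predicted PC should be used as the next PC.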
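This lookup (a structure commonly called a branch target buffer) can be sketched with a dictionary standing in for the associative memory. The class name, the 4-byte instruction size, and the policy of keeping only taken branches are assumptions made for this example.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}  # branch PC -> predicted target PC

    def next_pc(self, pc, instr_size=4):
        """Return the PC to fetch next: the stored target on a hit, else fall through."""
        return self.entries.get(pc, pc + instr_size)

    def update(self, pc, taken, target):
        """After the branch resolves, keep an entry only for taken branches."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))  # 0x104
btb.update(0x100, True, 0x80)
print(hex(btb.next_pc(0x100)))  # 0x80
```

A real buffer has a fixed number of entries and a replacement policy, which this sketch leaves out.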
- Issue an instruction that depends on a branch before the branch outcome is known.
- Commit is always performed in order.
- A speculative instruction commits only when the branch outcome is known.
- The same holds for exceptions (synchronous or asynchronous), which are also deviations of control flow.
Boosting Tomasulo's algorithm requires a buffer for uncommitted results (the reorder buffer, ROB). Each entry holds:
- Instruction
- Destination
- Value
The ROB has a slot for each issued instruction. When an instruction writes to a register, it writes only to its assigned slot in the ROB. The reorder buffer can be an operand source (like the RS or load buffers) or a destination (like the RF and store buffers).
The RS now only queue instructions for the FUs (to reduce structural hazards). Pointers are now directed toward ROB slots.
1. Issue: get an instruction from the queue. Both the RS and the ROB must have a free slot. Dispatch the operation, indicating the ROB slot to which it must write.
2. Execution: when both operands are ready, execute. If not, watch the CDB.
3. Write Result: write on the CDB and to the ROB.
4. Commit (also called graduation): the instruction at the head of the ROB updates its destination register and is removed. Mispredicted branches flush the ROB.
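In-order commit and the flush on a misprediction can be sketched with a queue. This is a simplified sketch: the names `ROBEntry` and `ReorderBuffer`, the `mispredicted` flag, and the instruction mix are hypothetical, and the reservation stations and the CDB are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    instr: str
    dest: Optional[str]
    value: Optional[float] = None
    done: bool = False
    mispredicted: bool = False   # only meaningful for branches

class ReorderBuffer:
    def __init__(self, size):
        self.size, self.slots = size, deque()

    def issue(self, instr, dest=None):
        if len(self.slots) == self.size:
            return None              # structural hazard: no free slot
        entry = ROBEntry(instr, dest)
        self.slots.append(entry)
        return entry

    def commit(self, regs):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.slots and self.slots[0].done:
            head = self.slots.popleft()
            committed.append(head.instr)
            if head.mispredicted:
                self.slots.clear()   # squash everything issued after the branch
                break
            if head.dest is not None:
                regs[head.dest] = head.value
        return committed

rob, regs = ReorderBuffer(4), {}
a = rob.issue("DIVD", "F0")
b = rob.issue("BNEZ")
c = rob.issue("ADDD", "F10")

c.value, c.done = 1.0, True          # the speculative ADDD finishes first
print(rob.commit(regs))              # []
a.value, a.done = 2.0, True
b.done = b.mispredicted = True
print(rob.commit(regs))              # ['DIVD', 'BNEZ']
print(regs, len(rob.slots))          # {'F0': 2.0} 0
```

The speculative ADDD completed early but never reached the register file, because the mispredicted branch ahead of it flushed the buffer.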