Dynamic Scheduling Using Tomasulo's Algorithm: Lotzi Bölöni
Lotzi Bölöni
EEL 5708
Acknowledgements
All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright 1998-2002, University of California, Berkeley
Dynamic Scheduling
A major limitation of simple pipelining techniques is in-order execution
If an instruction is stalled in the pipeline, all the instructions behind it must wait
Even if there are enough hardware resources to execute them
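The cost of in-order execution can be illustrated with a small timing sketch. It is a toy model under stated assumptions, not a pipeline simulator: one instruction is fetched per cycle, only true (read-after-write) dependences are tracked, functional units are unlimited, and the latencies and the three-instruction program are illustrative choices rather than values from the slides.

```python
def schedule(program, in_order):
    """Return the cycle in which each instruction starts executing.

    program: list of (name, dest, sources, latency) tuples (hypothetical format).
    in_order: if True, an instruction may not start before its predecessor.
    """
    ready = {}        # register -> cycle at which its value is available
    starts = {}
    prev_start = -1
    for slot, (name, dest, sources, latency) in enumerate(program):
        operands_ready = max((ready.get(r, 0) for r in sources), default=0)
        earliest = slot                      # fetched one per cycle
        if in_order:
            earliest = max(earliest, prev_start + 1)
        start = max(earliest, operands_ready)
        starts[name] = start
        ready[dest] = start + latency
        prev_start = start
    return starts


program = [
    ("DIVD", "F0", ["F2", "F4"], 10),    # long-latency operation
    ("ADDD", "F10", ["F0", "F8"], 2),    # depends on DIVD, must wait
    ("SUBD", "F12", ["F8", "F14"], 2),   # independent of both
]

in_order = schedule(program, in_order=True)
dynamic = schedule(program, in_order=False)
print(in_order)
print(dynamic)
```

In this model the independent SUBD is held behind the stalled ADDD when execution is in order, while with dynamic scheduling it starts as soon as it has been fetched.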
Tomasulo's Algorithm
Designed for the IBM 360/91 by Robert Tomasulo
Goal: high performance without special compilers
IBM 360 had only 4 FP registers
Solution: register renaming
Why study? It leads to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...
Tomasulo's Algorithm
Control and buffers are distributed with the Function Units (FUs)
FU buffers are called reservation stations (RSs); they hold pending operands
Registers in instructions are replaced by values or pointers to reservation stations; this is called register renaming
Renaming avoids WAR and WAW hazards
There are more reservation stations than registers, so the hardware can do optimizations compilers can't
Results go to FUs from RSs, not through registers, over the Common Data Bus (CDB), which broadcasts results to all FUs
Loads and stores are treated as FUs with RSs as well
Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
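How renaming removes WAR and WAW hazards can be seen in a minimal sketch. It is a simplification under assumptions: every destination simply gets a fresh tag (the names `RS1`, `RS2`, ... are hypothetical), whereas the real hardware reuses a small pool of reservation stations and copies available operand values at issue. The four-instruction program is an illustrative choice.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh tag; sources read the latest mapping."""
    mapping = {}                                # architectural register -> tag
    fresh = (f"RS{i}" for i in count(1))        # hypothetical tag names
    renamed = []
    for op, dest, src1, src2 in program:
        s1 = mapping.get(src1, src1)            # pending producer, else register
        s2 = mapping.get(src2, src2)
        tag = next(fresh)
        mapping[dest] = tag                     # later readers see this tag
        renamed.append((op, tag, s1, s2))
    return renamed


program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F6", "F0", "F8"),    # reads F8
    ("SUBD", "F8", "F10", "F14"),  # WAR on F8 with ADDD
    ("MULD", "F6", "F10", "F8"),   # WAW on F6 with ADDD
]

renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, SUBD no longer writes the name that ADDD reads, and ADDD and MULD write different names, so only the true dependences (DIVD to ADDD, SUBD to MULD) remain.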
Tomasulo organization
Busy: indicates that the reservation station or FU is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
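The bookkeeping described above can be sketched as a small data structure plus an issue step. This is a sketch under assumptions, not the 360/91 logic: the field names follow the Vj/Vk/Qj/Qk column labels used in the example tables, while the `issue` function, the dictionary-based register file, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first operand, if available
    vk: Optional[float] = None   # value of second operand, if available
    qj: Optional[str] = None     # station that will produce the first operand
    qk: Optional[str] = None     # station that will produce the second operand


def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Issue one instruction into rs; return False if the station is busy."""
    if rs.busy:
        return False                         # structural hazard: stall
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src_j)            # pending writer, or None
    rs.vj = None if rs.qj is not None else regs[src_j]
    rs.qk = reg_status.get(src_k)
    rs.vk = None if rs.qk is not None else regs[src_k]
    reg_status[dest] = rs.name               # this station will write dest
    return True


regs = {"F0": 0.0, "F2": 0.0, "F4": 2.5}     # illustrative register values
reg_status = {"F2": "Load2"}                 # F2 is waiting on Load2
mult1 = ReservationStation("Mult1")
issued = issue(mult1, "MULTD", "F0", "F2", "F4", regs, reg_status)
print(mult1, reg_status)
```

Here the multiply captures the value of F4 immediately, records `Load2` as the producer of its other operand, and marks F0 as pending on `Mult1`.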
Normal data bus: data + destination ("go to" bus)
Common data bus: data + source ("come from" bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (the one that produces the result)
Does the broadcast
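The "come from" behaviour of the common data bus can be sketched as follows. This is a minimal sketch under assumptions: stations are plain dictionaries, the function name `broadcast` is hypothetical, and the values are illustrative. Every listener compares the source tag on the bus against the tag it is waiting for and captures the data on a match.

```python
def broadcast(tag, value, stations, regs, reg_status):
    """Put (tag, value) on the bus; every matching listener captures value."""
    for rs in stations:
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None
    # Registers are written only if they still expect this producer.
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]


mult1 = {"name": "Mult1", "vj": None, "vk": 2.5, "qj": "Load2", "qk": None}
add1 = {"name": "Add1", "vj": 1.0, "vk": None, "qj": None, "qk": "Load2"}
regs = {"F0": 0.0, "F2": 0.0}
reg_status = {"F0": "Mult1", "F2": "Load2"}

broadcast("Load2", 7.0, [mult1, add1], regs, reg_status)
print(mult1, add1, regs, reg_status)
```

A single broadcast wakes up both waiting stations and updates F2, while F0 keeps waiting for `Mult1`.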
[Tables: Tomasulo example, clocks 0 to 3. Reservation station columns: Vj (S1), Vk (S2), Qj (RS for j), Qk (RS for k). Register result status row: the FU that will write each of F0, F2, F4, F6, F8, ..., F30.]
Note: register names are removed (renamed) in the reservation stations
Load1 completing; what is waiting for Load1?
[Tables: Tomasulo example, clocks 4 to 16 and 55 to 57. Same layout: reservation stations with Qj and Qk, and register result status for F0, F2, F4, F6, F8, ..., F30. Values that appear as results complete include M(34+R2), M(45+R3), and M*F4.]
Tomasulo Drawbacks
Complexity
Delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed
Performance limited by Common Data Bus
Multiple CDBs => more FU logic for parallel associative stores
Assume multiply takes 4 clocks
Assume the first load takes 8 clocks (cache miss?), the second load takes 4 clocks (hit)
To be clear, will show clocks for SUBI, BNEZ
In reality, integer instructions run ahead
[Tables: Tomasulo loop example, cycle by cycle. Reservation station columns: Vj (S1), Vk (S2). Register result status (Qi) for F0, F2, F4, F6, F8. R1 = 72 at clocks 6 to 8.]
Tomasulo Summary
Reservation stations: renaming to a larger set of registers + buffering source operands
Prevents registers from becoming a bottleneck
Allows loop unrolling in HW
Not limited to basic blocks (integer unit gets ahead, beyond branches)
Helps cache misses as well
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264