
Instruction Level Parallelism

1. Scoreboard and Tomasulo algorithms

Vittorio Zaccaria Alari @ ST 2001

Definition of ILP

ILP = potential overlap of execution among instructions. Overlapping is possible if there are:

- no structural hazards
- no RAW, WAR, or WAW stalls
- no control stalls

Hardware Schemes to exploit ILP

Why?

- They work when real dependences cannot be known at compile time.
- The compiler is simpler.
- Code compiled for one machine runs well on another.

Key Idea:

- Allow instructions behind a stall to proceed.
- This enables out-of-order execution and completion (commit).
- First implemented in the CDC 6600 (1963).

Example:

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

ADDD must stall on F0 (it waits for DIVD to commit). Without dynamic scheduling SUBD would stall behind it as well, although it depends on neither instruction.
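The dependences in this example can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and the `hazards` helper are hypothetical names, and only register operands are compared.

```python
from typing import NamedTuple, Set


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Classify the register hazards between two instructions in program order."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")  # later reads what earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F12", "F8", "F14")

print(hazards(divd, addd))  # {'RAW'}
print(hazards(divd, subd))  # set()
print(hazards(addd, subd))  # set()
```

SUBD has no hazard with either earlier instruction, which is why a dynamically scheduled machine can let it proceed while ADDD waits.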


Scoreboard Scheme

Similar to the DLX scheme, but the ID stage is split into two parts:

- Issue (decode and check for structural hazards).
- Read Operands (wait until there are no data hazards).

The scoreboard allows instructions without dependencies to execute.

Scoreboard Implications

- Out-of-order completion introduces WAR and WAW hazards.
- Solutions for WAR:
  - Queue both the operation and copies of its operands.
  - Read registers only during the Read Operands stage.
- For WAW, the machine stalls until the other instruction completes.
- There are multiple execution units.
- The scoreboard keeps track of dependencies and of the state of operations.

Four Stages of Scoreboard Control

1. Issue: decode instructions and check for structural hazards.

If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or a WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
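The issue test can be sketched in a few lines. This is an illustration under stated assumptions: `can_issue` and the two dictionaries are hypothetical names, and in-order issue (a later instruction waiting behind a stalled one) is not modelled.

```python
from typing import Dict, Optional


def can_issue(fu: str, dest: str,
              busy: Dict[str, bool],
              register_result: Dict[str, Optional[str]]) -> bool:
    """Return True if an instruction needing `fu` and writing `dest` may issue."""
    structural_hazard = busy.get(fu, False)             # the unit is already in use
    waw_hazard = register_result.get(dest) is not None  # an active instruction writes dest
    return not structural_hazard and not waw_hazard


# State while the first load (LD F6) occupies the Integer unit.
busy = {"Integer": True, "Mult1": False, "Mult2": False, "Add": False, "Divide": False}
register_result = {"F6": "Integer"}

print(can_issue("Integer", "F2", busy, register_result))  # False: structural hazard
print(can_issue("Mult1", "F0", busy, register_result))    # True
print(can_issue("Add", "F6", busy, register_result))      # False: WAW on F6
```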

Four Stages of Scoreboard Control

2. Read Operands: wait until there are no data hazards, then read the operands.

A source operand is available if:
- no earlier issued active instruction will write it, or
- a currently active functional unit is writing its value into the register.

When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. RAW hazards are resolved dynamically in this step, and instructions may be sent into execution out of order.

Four Stages of Scoreboard Control

3. Execution: operate on the operands.

The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.

FUs are characterized by:
- latency (the time needed to complete one operation);
- initiation interval (the number of cycles that must elapse between issuing two operations to the same functional unit).

Four Stages of Scoreboard Control

4. Write Result: finish execution.

Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If there are none, the result is written. If there is a WAR hazard, the scoreboard stalls the instruction.

WAR Example

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14

In this case the scoreboard stalls SUBD in the write-result (WB) stage until ADDD has read F0 and F8.
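The write-result check can be sketched as follows. It assumes the common textbook convention that a ready flag (Rj, Rk) is Yes while an operand is ready but not yet read, and is reset to No once the operand has been read; `can_write_result` and the tuple layout are hypothetical.

```python
from typing import Iterable, Tuple

# (Fj, Rj, Fk, Rk) of one other active functional unit
Sources = Tuple[str, bool, str, bool]


def can_write_result(fi: str, others: Iterable[Sources]) -> bool:
    """True if no other active unit still has to read the old value of `fi`."""
    return all((fj != fi or not rj) and (fk != fi or not rk)
               for fj, rj, fk, rk in others)


# SUBD F8,F8,F14 wants to write F8 while ADDD F10,F0,F8 still waits for F0.
addd_waiting = ("F0", False, "F8", True)         # F8 is ready but not yet read
print(can_write_result("F8", [addd_waiting]))    # False: WAR hazard, SUBD stalls

addd_has_read = ("F0", False, "F8", False)       # ADDD has read its operands
print(can_write_result("F8", [addd_has_read]))   # True
```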


Scoreboard structure

1. Instruction status
2. Functional unit status. Indicates the state of the functional unit (FU):
   - Busy: indicates whether the unit is busy or not
   - Op: the operation to perform in the unit (+, -, etc.)
   - Fi: destination register
   - Fj, Fk: source register numbers
   - Qj, Qk: functional units producing the source registers
   - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status. Indicates which functional unit will write each register. Blank if no pending instruction will write that register.
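One way to picture these tables is as plain records. The sketch below is illustrative only: the `FUStatus` class, `result_written` and the sample state (MULTD and SUBD both waiting for F2 from the Integer unit) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing the sources
    qk: Optional[str] = None
    rj: bool = False           # ready flags
    rk: bool = False

    def ready_to_read(self) -> bool:
        """Read Operands may proceed once both sources are ready."""
        return self.busy and self.rj and self.rk


def result_written(units: Dict[str, FUStatus], producer: str) -> None:
    """Mark operands produced by `producer` as ready in every waiting unit."""
    for fu in units.values():
        if fu.qj == producer:
            fu.qj, fu.rj = None, True
        if fu.qk == producer:
            fu.qk, fu.rk = None, True


def ready_units(units: Dict[str, FUStatus]) -> List[str]:
    return [name for name, fu in units.items() if fu.ready_to_read()]


units = {
    "Mult1": FUStatus(True, "Mult", "F0", "F2", "F4", qj="Integer", rj=False, rk=True),
    "Add": FUStatus(True, "Sub", "F8", "F6", "F2", qk="Integer", rj=True, rk=False),
    "Divide": FUStatus(),
}
register_result = {"F0": "Mult1", "F2": "Integer", "F8": "Add"}

print(ready_units(units))          # []
result_written(units, "Integer")   # the load writes F2
print(ready_units(units))          # ['Mult1', 'Add']
```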


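Scoreboard Example

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  -      -              -                   -
LD    F2  45+ R3  -      -              -                   -
MULTD F0  F2  F4  -      -              -                   -
SUBD  F8  F6  F2  -      -              -                   -
DIVD  F10 F0  F6  -      -              -                   -
ADDD  F6  F8  F2  -      -              -                   -

Functional unit status
(Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status: one entry (FU) per register up to F30, all blank.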

Scoreboard Example Cycle 1

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status (clock 1): F6 = Integer

Scoreboard Example Cycle 2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status (clock 2): F6 = Integer

The Integer unit is busy, so the second load cannot issue: issue stalls.

Scoreboard Example Cycle 3

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status (clock 3): F6 = Integer

Issue stalls.

Scoreboard Example Cycle 4

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status (clock 4): F6 = Integer

Issue stalls.

Scoreboard Example Cycle 5

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status (clock 5): F2 = Integer

In this cycle the second load is issued.

Scoreboard Example Cycle 6

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status (clock 6): F0 = Mult1, F2 = Integer

MULTD is issued but has to wait for F2.

Scoreboard Example Cycle 7

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
-     Divide   No    -     -    -    -    -        -        -    -

Register result status (clock 7): F0 = Mult1, F2 = Integer, F8 = Add

Now SUBD can be issued, but it has to wait for its operands.

Scoreboard Example Cycle 8a

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   -
MULTD F0  F2  F4  6      -              -                   -
SUBD  F8  F6  F2  7      -              -                   -
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 8): F0 = Mult1, F2 = Integer, F8 = Add, F10 = Divide

DIVD is issued, but there is another RAW hazard.

Scoreboard Example Cycle 8b

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      -              -                   -
SUBD  F8  F6  F2  7      -              -                   -
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 8): F0 = Mult1, F8 = Add, F10 = Divide

The load completes, and the operands for MULTD and SUBD are ready.

Scoreboard Example Cycle 9

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              -                   -
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
10    Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
2     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 9): F0 = Mult1, F8 = Add, F10 = Divide

MULTD and SUBD are sent into execution in parallel.

Scoreboard Example Cycle 11

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              11                  -
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
8     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
0     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 11): F0 = Mult1, F8 = Add, F10 = Divide

SUBD finishes execution.

Scoreboard Example Cycle 12

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
7     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 12): F0 = Mult1, F10 = Divide

Read operands for DIVD?

Scoreboard Example Cycle 13

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  13     -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
6     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 13): F0 = Mult1, F6 = Add, F10 = Divide

SUBD has written its result, so the Add unit is free and ADDD can be issued.

Scoreboard Example Cycle 14

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  13     14             -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
5     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
2     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 14): F0 = Mult1, F6 = Add, F10 = Divide

Scoreboard Example Cycle 15

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  13     14             -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
4     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
1     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 15): F0 = Mult1, F6 = Add, F10 = Divide

Scoreboard Example Cycle 16

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
3     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
0     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 16): F0 = Mult1, F6 = Add, F10 = Divide

Scoreboard Example Cycle 17

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
2     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 17): F0 = Mult1, F6 = Add, F10 = Divide

Write result of ADDD? No, there is a WAR hazard: DIVD has not yet read F6.

Scoreboard Example Cycle 18

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              -                   -
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
1     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 18): F0 = Mult1, F6 = Add, F10 = Divide

Scoreboard Example Cycle 19

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              19                  -
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
0     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 19): F0 = Mult1, F6 = Add, F10 = Divide

Scoreboard Example Cycle 20

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              19                  20
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      -              -                   -
ADDD  F6  F8  F2  13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        ?    ?
-     Divide   Yes   Div   F10  F0   F6   -        -        ?    ?

One of the Rj/Rk columns reads Yes for both Add and Divide; which column is not recoverable.

Register result status (clock 20): F6 = Add, F10 = Divide

Scoreboard Example Cycle 21

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              19                  20
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      21             -                   -
ADDD  F6  F8  F2  13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        ?    ?
-     Divide   Yes   Div   F10  F0   F6   -        -        ?    ?

One of the Rj/Rk columns reads Yes for both Add and Divide; which column is not recoverable.

Register result status (clock 21): F6 = Add, F10 = Divide

Scoreboard Example Cycle 22

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              19                  20
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      21             -                   -
ADDD  F6  F8  F2  13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
40    Divide   Yes   Div   F10  F0   F6   -        -        ?    ?

One of the Rj/Rk flags of Divide reads Yes; which one is not recoverable.

Register result status (clock 22): F10 = Divide

DIVD has read its operands, so ADDD can now write its result.

Scoreboard Example Cycle 61

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              19                  20
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      21             61                  -
ADDD  F6  F8  F2  13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
0     Divide   Yes   Div   F10  F0   F6   -        -        ?    ?

One of the Rj/Rk flags of Divide reads Yes; which one is not recoverable.

Register result status (clock 61): F10 = Divide

DIVD finishes execution.

Scoreboard Example Cycle 62

Instruction status
Instruction       Issue  Read operands  Execution complete  Write result
LD    F6  34+ R2  1      2              3                   4
LD    F2  45+ R3  5      6              7                   8
MULTD F0  F2  F4  6      9              19                  20
SUBD  F8  F6  F2  7      9              11                  12
DIVD  F10 F0  F6  8      21             61                  62
ADDD  F6  F8  F2  13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
0     Divide   No    -     -    -    -    -        -        -    -

Register result status (clock 62): all blank


CDC 6600 Scoreboard

- Achieves a speedup of 2.5 with respect to no dynamic scheduling.
- By reorganizing instructions, the compiler achieves only 1.7.

But:

- No cache
- No forwarding hardware
- Limited to instructions in a basic block
- Small number of functional units (structural hazards)
- Waits for WAR hazards
- Prevents WAW hazards

Tomasulo Algorithm

- Invented at IBM 3 years after the CDC 6600, for the IBM 360/91.
- Same goal: performance without special compilers.
- Led to: Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604.

Tomasulo Algorithm Basics

- The control logic and the buffers are distributed with the FUs.
- Operand buffers are called reservation stations (RS).
- Each instruction is an entry of a reservation station.
- Its operands are replaced by values or pointers (register renaming).
- Register renaming makes it possible to avoid WAR and WAW hazards.
- There are more reservation stations than registers (so the hardware can do better optimizations than a compiler).
- Results are dispatched to other FUs through a Common Data Bus.
- Loads and stores are treated as FUs.

Tomasulo Algorithm for an FPU

Reservation Station Components

- Tag: identifies the RS
- Op: the operation to perform on the source operands
- Vj, Vk: values of the source operands
- Qj, Qk: pointers to the RS that produce Vj, Vk
- Busy: indicates that the RS is busy

Other components

- The RF and the store buffers have a Value (V) and a Pointer (Q) field.
- Load buffers have an address field and a busy field.
- Store buffers also have an address field.

The Three Stages of the Tomasulo Algorithm

1. Issue

- Get an instruction I from the queue.
- If it is an FP operation, check whether an RS is empty (i.e., check for structural hazards).
- Rename registers:
  - WAR resolution: if I writes Rx, which is read by an already issued instruction K, then K already knows the value of Rx or knows which instruction will write it. So the RF can be linked to I.
  - WAW resolution: since issue is in order, the RF can be linked to I.
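Renaming at issue can be sketched with two small tables. This is a simplified illustration, not the 360/91 implementation: `issue`, `reg_value`, `reg_tag` and the tags `Div1`, `Add1`, `Add2` are hypothetical names, and the register values are made up.

```python
from typing import Dict, Tuple, Union

Operand = Union[float, str]  # a value (V) or the tag of the producing RS (Q)


def issue(dest: str, src1: str, src2: str, tag: str,
          reg_value: Dict[str, float],
          reg_tag: Dict[str, str]) -> Tuple[Operand, Operand]:
    """Capture each source as a value or a tag, then link `dest` to this RS."""
    def read(reg: str) -> Operand:
        # A pending writer means the operand is a pointer, not a value.
        return reg_tag[reg] if reg in reg_tag else reg_value[reg]

    operands = (read(src1), read(src2))
    reg_tag[dest] = tag  # later readers of dest will wait for this RS
    return operands


reg_value = {"F2": 2.0, "F4": 4.0, "F8": 8.0, "F14": 14.0}
reg_tag: Dict[str, str] = {}

print(issue("F0", "F2", "F4", "Div1", reg_value, reg_tag))    # (2.0, 4.0)
print(issue("F10", "F0", "F8", "Add1", reg_value, reg_tag))   # ('Div1', 8.0)
print(issue("F8", "F8", "F14", "Add2", reg_value, reg_tag))   # (8.0, 14.0)
print(reg_tag)  # {'F0': 'Div1', 'F10': 'Add1', 'F8': 'Add2'}
```

ADDD captured the old value of F8 when it issued, so SUBD may overwrite F8 at any time: the WAR hazard of the scoreboard example disappears.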

The Three Stages of the Tomasulo Algorithm

2. Execution

- When both operands are ready, execute.
- If they are not ready, watch the common data bus for results.

3. Write result

- Write on the Common Data Bus to all waiting units; mark the reservation station available.
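The execute and write-result stages can be sketched as a broadcast over the reservation stations. The `RS` class, `broadcast` function and the tag `Div1` are hypothetical, and the values are made up for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RS:
    tag: str
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None   # operand values
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the RS producing missing operands
    qk: Optional[str] = None

    def ready(self) -> bool:
        """Execution may start when no operand is still pending."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(stations: List[RS], tag: str, value: float) -> None:
    """Write result: every station waiting on `tag` grabs the value from the bus."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None


add1 = RS("Add1", busy=True, op="ADDD", vk=8.0, qj="Div1")
stations = [add1]

print(add1.ready())               # False: still waiting for Div1
broadcast(stations, "Div1", 0.5)  # the divider puts its result on the bus
print(add1.ready(), add1.vj)      # True 0.5
```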


The Common Data Bus

- A common data bus is a data + source bus.
- In the IBM 360/91: data = 64 bits, source = 4 bits.
- FUs must perform an associative lookup in the RS.

Tomasulo (IBM) versus Scoreboard (CDC)

Tomasulo (IBM)                    | Scoreboard (CDC)
Pipelined FUs                     | Multiple but not pipelined FUs
Issue window size = 14            | Issue window size = 5
No issue on structural hazards    | No issue on structural hazards
WAR, WAW avoided with renaming    | Completion stalls for WAW and WAR hazards
Results broadcast from the FUs    | Results written back to the registers
Control distributed over the RS   | Control centralized in the scoreboard

Branch Prediction

- The current DLX wastes one cycle on a branch, but other architectures resolve branches several cycles after the IF stage.
- We need to predict the branch result as soon as possible (in the ID stage).
- The performance of branch prediction depends on:
  - Accuracy, measured as the percentage of mispredictions.
  - Cost of a misprediction, measured as the time wasted executing useless instructions.

Branch History Table

- A table of 1-bit values
- Indexed by the lower bits of the PC address
- Each entry says whether or not the branch was taken last time

Branch History Table

Problem: in a loop, a 1-bit BHT causes two mispredictions:

- When we reach the end of the loop for the last time and must exit, the BHT predicts staying in the loop.
- When we re-enter the loop and reach its end for the first time, we must stay in the loop, but the BHT predicts exiting.
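The two mispredictions per loop execution can be reproduced with a small simulation. This is a sketch with hypothetical names; it models one table entry and assumes a loop branch that is taken 9 times and then falls through.

```python
from typing import Iterable


def one_bit_mispredictions(outcomes: Iterable[bool], last_taken: bool = True) -> int:
    """Count mispredictions of a 1-bit entry that remembers the last outcome."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken  # the single bit always tracks the last outcome
    return misses


loop = [True] * 9 + [False]              # 9 iterations stay in the loop, then exit
print(one_bit_mispredictions(loop))      # 1
print(one_bit_mispredictions(loop * 3))  # 5
```

The first execution of the loop pays only for the exit; every later execution pays twice, once on re-entry and once on exit.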


Dynamic Branch Prediction

- It is a 2-bit scheme in which the prediction changes only after two mispredictions.
- For each index of the table, the 2 bits hold the state of a state machine (next slide).
- When we arrive at the end of the loop, we don't change the prediction.

We can describe the algorithm with an FSM


Branch History Table Accuracy

We have a misprediction when:

- we make a wrong guess for that branch, or
- we get the history of the wrong branch, because the same index can be referenced by two different branches.
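The second case (aliasing) follows directly from indexing with the low PC bits. A minimal sketch, assuming 4-byte aligned instructions and a power-of-two table; `bht_index` and the addresses are hypothetical.

```python
def bht_index(pc: int, entries: int = 4096) -> int:
    """Index a branch history table with the low-order bits of the PC."""
    return (pc >> 2) % entries  # drop the 2 alignment bits, keep the low bits


branch_a = 0x0040_0100
branch_b = branch_a + 4 * 4096  # a different branch, exactly one table size away

print(bht_index(branch_a), bht_index(branch_a + 4))  # 64 65
print(bht_index(branch_a) == bht_index(branch_b))    # True: both share entry 64
```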

Branch History Table Accuracy

Measurements with a 4096-entry table show misprediction rates from 1% to 18%, depending on the program:

Program          Mispredictions
nasa7, tomcatv   1%
eqntott          18%
spice            9%
gcc              12%

A 4096-entry table is about as good as an infinite table (for the Alpha 21164).

Correlating Branches

Basic hypothesis: recent branches are correlated, i.e., the behavior of recently executed branches affects the prediction of the current branch.

Correlating Branches Example

if (a==2) bb1;
L1: if (b==2) bb2;
L2: if (a!=b) bb3;

      subi r3,r1,2
      bnez r3,L1
      add  r1,r0,r0   ; bb1
L1:   subi r3,r2,2
      bnez r3,L2
      add  r2,r0,r0   ; bb2
L2:   sub  r3,r1,r2
      beqz r3,L3
      ...             ; bb3
L3:

The branch at L2 is correlated with the previous branches: if neither of them is taken, the branch at L2 is taken.

Idea:

- Record the m most recently executed branches as taken or not taken.
- Use that pattern to select the proper branch history table.

Example of a simple correlating predictor

The branch is predicted on the basis of the previously executed one by selecting the appropriate 1-bit BHT.

[Figure: two 1-bit branch prediction tables, one used if the last branch was taken and one if it was not taken. The last branch result selects the table entry for the branch to be predicted; the effective branch result updates it.]

(m,n) predictors

In general, an (m,n) predictor records the outcomes of the last m branches and uses them to select among 2^m history tables with n-bit entries.
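A compact sketch of an (m,n) predictor follows. It assumes a single global history register, n-bit saturating counters and 4-byte aligned instructions; the class and function names are hypothetical.

```python
from typing import List


def predictor_bits(m: int, n: int, entries: int) -> int:
    """Storage of an (m,n) predictor: 2^m tables of `entries` n-bit counters."""
    return (2 ** m) * n * entries


class CorrelatingPredictor:
    def __init__(self, m: int, n: int, entries: int) -> None:
        self.m = m
        self.entries = entries
        self.top = (1 << n) - 1   # largest counter value
        self.history = 0          # outcomes of the last m branches, as bits
        self.tables: List[List[int]] = [[0] * entries for _ in range(1 << m)]

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        # The global history selects the table, the PC selects the entry.
        return self.tables[self.history][self._index(pc)] > self.top // 2

    def update(self, pc: int, taken: bool) -> None:
        table, i = self.tables[self.history], self._index(pc)
        table[i] = min(table[i] + 1, self.top) if taken else max(table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


print(predictor_bits(2, 2, 1024))  # 8192
print(predictor_bits(0, 2, 4096))  # 8192
```

Under this accounting a (2,2) predictor with 1024 entries uses the same number of bits as a plain 2-bit table with 4096 entries, so the two can be compared at equal hardware cost.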

Example of a (2,2) correlating branch predictor

Each cell of the predictor holds the state of a 2-bit branch predictor.

Accuracy of different Schemes

[Figure: bar chart of the frequency of mispredictions (0% to 18%) for the benchmarks doducd, nasa7, gcc, espresso, spice, tomcatv, eqntott, fpppp and matrix300, comparing three predictors: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2).]

Address must also be predicted

In the IF stage, access the Branch Target Buffer. A typical entry holds:

- the exact address of a branch
- the predicted PC (only if not sequential)
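The lookup can be sketched with a dictionary standing in for the associative memory. This is an illustration under assumptions: 4-byte instructions, entries kept only for branches whose predicted PC is not sequential, and hypothetical function names and addresses.

```python
from typing import Dict

INSTRUCTION_BYTES = 4  # assumption: fixed 32-bit instructions


def next_fetch_pc(pc: int, btb: Dict[int, int]) -> int:
    """IF stage: use the predicted PC on a hit, the sequential PC otherwise."""
    return btb.get(pc, pc + INSTRUCTION_BYTES)


def update_btb(btb: Dict[int, int], pc: int, taken: bool, target: int) -> None:
    """After the branch resolves, keep an entry only for taken branches."""
    if taken:
        btb[pc] = target
    else:
        btb.pop(pc, None)


btb: Dict[int, int] = {}
print(next_fetch_pc(0x100, btb))       # 260: no entry, fetch 0x104
update_btb(btb, 0x100, True, 0x80)     # the branch at 0x100 jumped to 0x80
print(next_fetch_pc(0x100, btb))       # 128: predicted PC 0x80
```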

Branch Target Buffer structure

[Figure: the PC of the fetched instruction is looked up associatively in the Branch Target Buffer, whose entries pair a branch address with a predicted PC.]

- No: the instruction is not predicted to be a branch; proceed normally.
- Yes: the instruction is a branch, and the predicted PC should be used as the next PC.

Hardware Speculation (Boosting)

- Issue an instruction that depends on a branch before the branch result is known.
- Commit is always made in order.
- A speculative instruction commits only when the branch outcome is known.
- The same holds for deviations of control flow caused by exceptions (synchronous or asynchronous).

Speculative Tomasulo's Algorithm

- Tomasulo with boosting needs a buffer for uncommitted results (the reorder buffer, ROB). Each entry holds: instruction, destination, value.
- The ROB has a slot for each issued instruction.
- When an instruction writes to a register, it writes only to its assigned slot in the ROB.
- The reorder buffer can be an operand source (like the RS or load buffers) or a destination (like the RF and store buffers).

Tomasulo's ROB (cont.)

- RS now only queue instructions for the FUs (to reduce structural hazards).
- Pointers are now directed toward ROB slots.

Four steps of the speculative Tomasulo's Algorithm

1. Issue: get an instruction from the queue. Both an RS and a ROB slot must be free. Dispatch the operation, indicating the slot into which it must write.
2. Execution: when both operands are ready, execute. If not, watch the CDB.
3. Write Result: write on the CDB and into the ROB.
4. Commit (graduation): the completed instruction at the head of the ROB updates its destination register and is removed. Mispredicted branches flush the ROB.
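The commit step can be sketched with a queue. This is a simplified model with hypothetical names (`ROBEntry`, `commit`): results become architecturally visible only from the head of the ROB, in order, and a mispredicted branch at the head discards everything behind it.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional


@dataclass
class ROBEntry:
    instr: str
    dest: Optional[str] = None
    value: Optional[float] = None
    done: bool = False           # result already written to this slot
    mispredicted: bool = False   # only meaningful for branches


def commit(rob: Deque[ROBEntry], regs: Dict[str, float]) -> List[str]:
    """Retire completed instructions from the head of the ROB, in order."""
    committed = []
    while rob and rob[0].done:
        entry = rob.popleft()
        committed.append(entry.instr)
        if entry.mispredicted:
            rob.clear()  # flush all speculative instructions behind the branch
            break
        if entry.dest is not None:
            regs[entry.dest] = entry.value
    return committed


regs: Dict[str, float] = {}
rob = deque([ROBEntry("MULTD", "F0"), ROBEntry("ADDD", "F6", 3.0, done=True)])

print(commit(rob, regs))  # []: ADDD is done but must wait behind MULTD
rob[0].value, rob[0].done = 6.0, True
print(commit(rob, regs))  # ['MULTD', 'ADDD']
print(regs)               # {'F0': 6.0, 'F6': 3.0}
```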

Speculative Tomasulo's algorithm
