Instruction Level Parallelism: 1. Scoreboard and Tomasulo Algorithms

The document discusses instruction level parallelism and hardware techniques for exploiting it, specifically scoreboarding. Scoreboarding allows out-of-order execution by tracking dependencies between instructions and ensuring operands are available before instructions execute. It consists of four stages: issue, read operands, execute, and write result. An example is provided showing the status of instructions and functional units over multiple cycles to demonstrate how scoreboarding enables overlapping execution while avoiding hazards.

Uploaded by

SAM
Copyright
© Attribution Non-Commercial (BY-NC)

Instruction Level Parallelism

1. Scoreboard and Tomasulo algorithms

Vittorio Zaccaria Alari @ ST 2001

Definition of ILP

ILP = potential overlap of execution among instructions. Overlapping is possible if there are:

- No structural hazards
- No RAW, WAR, or WAW stalls
- No control stalls

Hardware Schemes to exploit ILP

Why?

- Works when real dependences cannot be known at compile time
- Keeps the compiler simpler
- Code compiled for one machine runs well on another

Key Idea:

- Allow instructions behind a stall to proceed.
- Enables out-of-order execution and completion (commit).
- First implemented in the CDC 6600 (1963).

Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

ADDD must stall on F0 (waiting for DIVD to commit). Without dynamic scheduling, SUBD would stall behind it as well, even though it depends on neither earlier instruction.

Scoreboard Scheme

Similar to the DLX scheme. The ID stage is split into two parts:

- Issue (decode and check for structural hazards).
- Read Operands (wait until there are no data hazards).

The scoreboard allows instructions without dependencies to execute.

Scoreboard Implications

Out-of-order completion -> WAR and WAW hazards. Solutions for WAR:

- Queue both the operation and copies of its operands.
- Read registers only during the Read Operands stage.

Scoreboard Implications

- For WAW, the machine stalls until the other instruction completes.
- Multiple execution units.
- The scoreboard keeps track of dependencies and of the state of operations.

Four Stages of Scoreboard Control


1. Issue: decode instructions and check for structural hazards.

If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or a WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
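
The issue rule above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the CDC 6600 logic: the names `can_issue`, `issue`, `fu_busy`, and `pending_writer` are hypothetical, and the scoreboard state is reduced to two dictionaries.

```python
def can_issue(unit, dest, fu_busy, pending_writer):
    """Return True if an instruction may issue (hypothetical helper).

    unit           -- name of the functional unit the instruction needs
    dest           -- destination register name
    fu_busy        -- dict: unit name -> True while the unit is busy
    pending_writer -- dict: register -> unit of the active instruction writing it
    """
    structural_hazard = fu_busy.get(unit, False)
    waw_hazard = dest in pending_writer
    return not structural_hazard and not waw_hazard


def issue(unit, dest, fu_busy, pending_writer):
    """Issue the instruction if possible and update the scoreboard state."""
    if not can_issue(unit, dest, fu_busy, pending_writer):
        return False  # issue stalls; no later instruction issues either
    fu_busy[unit] = True
    pending_writer[dest] = unit
    return True
```

With an empty scoreboard, a first load on the Integer unit issues, a second load on the same unit is refused (structural hazard), and any instruction writing the same destination as an active one is refused (WAW hazard).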

Four Stages of Scoreboard Control


2. Read Operands: wait until there are no data hazards, then read operands.

A source operand is available if:

- no earlier issued active instruction will write it, or
- a functional unit is writing its value into a register.

When the source operands are available, the scoreboard tells the functional unit to read the operands from the registers and begin execution. RAW hazards are resolved dynamically in this step, and instructions may be sent into execution out of order.

Four Stages of Scoreboard Control


3. Execution: operate on operands.

The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.

FUs are characterized by:

- Latency (the effective time used to complete one operation).
- Initiation interval (the number of cycles that must elapse between issuing two operations to the same functional unit).

Four Stages of Scoreboard Control

4. Write result: finish execution.

Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If there are none, the unit writes its result. If there is a WAR hazard, the scoreboard stalls the instruction.
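
The WAR check at write-back can be sketched as follows. This is a hedged illustration, not the slides' hardware: `can_write_result` and the dictionary layout are hypothetical, and it assumes the usual textbook bookkeeping in which the flags Rj/Rk are true while a source is ready but not yet read, and are cleared once the operands have been read.

```python
def can_write_result(unit, fu_status):
    """Return True if `unit` may write its destination without a WAR hazard.

    fu_status maps a unit name to a dict with keys Fi, Fj, Fk, Rj, Rk.
    Assumption: Rj/Rk are True while a source is ready but NOT yet read,
    and are reset to False once the operands have been read.
    """
    dest = fu_status[unit]["Fi"]
    for name, other in fu_status.items():
        if name == unit:
            continue
        # Another unit still has to read the old value of `dest`.
        if other["Fj"] == dest and other["Rj"]:
            return False
        if other["Fk"] == dest and other["Rk"]:
            return False
    return True
```

For example, an ADDD that writes F6 must wait while a DIVD with source F6 has not yet read its operands; once the DIVD has read them, the write can proceed.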

WAR Example
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14

In this case, the scoreboard would stall SUBD in the WB stage, waiting for ADDD to read F0 and F8.

Scoreboard structure
1. Instruction status

2. Functional unit status. Indicates the state of the functional unit (FU):

- Busy: indicates whether the unit is busy or not
- Op: the operation to perform in the unit (+, -, etc.)
- Fi: destination register
- Fj, Fk: source register numbers
- Qj, Qk: functional units producing source registers Fj, Fk
- Rj, Rk: flags indicating when Fj, Fk are ready

3. Register result status. Indicates which functional unit will write each register. Blank if no pending instruction will write that register.
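
As a rough illustration of how these fields fit together, the sketch below fills one functional-unit entry at issue time from the register result status. The names `FUStatus`, `fill_on_issue`, and `ready_to_read` are hypothetical, and the register result status is modelled as a plain dictionary; it is a sketch under those assumptions, not the original design.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    """One row of the functional unit status table."""
    busy: bool = False
    op: Optional[str] = None
    Fi: Optional[str] = None   # destination register
    Fj: Optional[str] = None   # first source register
    Fk: Optional[str] = None   # second source register
    Qj: Optional[str] = None   # unit producing Fj, if any
    Qk: Optional[str] = None   # unit producing Fk, if any
    Rj: bool = False           # Fj ready?
    Rk: bool = False           # Fk ready?


def fill_on_issue(fu: FUStatus, op: str, dest: str, src_j: Optional[str],
                  src_k: Optional[str], result_status: Dict[str, str]) -> None:
    """Fill a unit's entry; result_status maps register -> unit that will write it."""
    fu.busy, fu.op = True, op
    fu.Fi, fu.Fj, fu.Fk = dest, src_j, src_k
    fu.Qj = result_status.get(src_j)
    fu.Qk = result_status.get(src_k)
    fu.Rj = fu.Qj is None
    fu.Rk = fu.Qk is None


def ready_to_read(fu: FUStatus) -> bool:
    """The Read Operands stage may proceed only when both sources are ready."""
    return fu.Rj and fu.Rk
```

Issuing MULTD F0,F2,F4 while the Integer unit is still loading F2 gives Qj = Integer, Rj = No, Rk = Yes, so the multiply waits in Read Operands.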

Scoreboard Example
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   -      -              -                   -
LD    F2      45+  R3   -      -              -                   -
MULTD F0      F2   F4   -      -              -                   -
SUBD  F8      F6   F2   -      -              -                   -
DIVD  F10     F0   F6   -      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
-      -      -        -   -        -    -       -    ...  -

Scoreboard Example Cycle 1


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      -              -                   -
LD    F2      45+  R3   -      -              -                   -
MULTD F0      F2   F4   -      -              -                   -
SUBD  F8      F6   F2   -      -              -                   -
DIVD  F10     F0   F6   -      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
1      -      -        -   Integer  -    -       -    ...  -

Scoreboard Example Cycle 2


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              -                   -
LD    F2      45+  R3   -      -              -                   -
MULTD F0      F2   F4   -      -              -                   -
SUBD  F8      F6   F2   -      -              -                   -
DIVD  F10     F0   F6   -      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
2      -      -        -   Integer  -    -       -    ...  -

The Integer pipeline is full: the 2nd load cannot execute, so issue stalls.

Scoreboard Example Cycle 3


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   -
LD    F2      45+  R3   -      -              -                   -
MULTD F0      F2   F4   -      -              -                   -
SUBD  F8      F6   F2   -      -              -                   -
DIVD  F10     F0   F6   -      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
3      -      -        -   Integer  -    -       -    ...  -

Issue stalls

Scoreboard Example Cycle 4


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   -      -              -                   -
MULTD F0      F2   F4   -      -              -                   -
SUBD  F8      F6   F2   -      -              -                   -
DIVD  F10     F0   F6   -      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
4      -      -        -   Integer  -    -       -    ...  -

Issue stalls

Scoreboard Example Cycle 5


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      -              -                   -
MULTD F0      F2   F4   -      -              -                   -
SUBD  F8      F6   F2   -      -              -                   -
DIVD  F10     F0   F6   -      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
5      -      Integer  -   -        -    -       -    ...  -

In this cycle the 2nd load is issued.

Scoreboard Example Cycle 6


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              -                   -
MULTD F0      F2   F4   6      -              -                   -
SUBD  F8      F6   F2   -      -              -                   -
DIVD  F10     F0   F6   -      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
6      Mult1  Integer  -   -        -    -       -    ...  -

Mult is issued but has to wait for F2

Scoreboard Example Cycle 7


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   -
MULTD F0      F2   F4   6      -              -                   -
SUBD  F8      F6   F2   7      -              -                   -
DIVD  F10     F0   F6   -      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
7      Mult1  Integer  -   -        Add  -       -    ...  -

Now, Subd can be issued, but has to wait for operands.

Scoreboard Example Cycle 8a


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   -
MULTD F0      F2   F4   6      -              -                   -
SUBD  F8      F6   F2   7      -              -                   -
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
8      Mult1  Integer  -   -        Add  Divide  -    ...  -

DIVD is issued but there is another RAW hazard

Scoreboard Example Cycle 8b


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      -              -                   -
SUBD  F8      F6   F2   7      -              -                   -
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
8      Mult1  -        -   -        Add  Divide  -    ...  -

The load completes, and the operands for MULTD and SUBD are ready.

Scoreboard Example Cycle 9


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              -                   -
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
10    Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
2     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
9      Mult1  -        -   -        Add  Divide  -    ...  -

MULT and SUB are sent in execution in parallel

Scoreboard Example Cycle 11


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              11                  -
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
8     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
0     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
11     Mult1  -        -   -        Add  Divide  -    ...  -

The SUBD finishes

Scoreboard Example Cycle 12


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   -      -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
7     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
12     Mult1  -        -   -        -    Divide  -    ...  -

Read operands for DIVD?

Scoreboard Example Cycle 13


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   13     -              -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
6     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
13     Mult1  -        -   Add      -    Divide  -    ...  -

SUBD writes results and ADDD can be issued

Scoreboard Example Cycle 14


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   13     14             -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
5     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
2     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
14     Mult1  -        -   Add      -    Divide  -    ...  -

Scoreboard Example Cycle 15


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   13     14             -                   -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
4     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
1     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
15     Mult1  -        -   Add      -    Divide  -    ...  -

Scoreboard Example Cycle 16


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
3     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
0     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
16     Mult1  -        -   Add      -    Divide  -    ...  -

Scoreboard Example Cycle 17


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
2     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
17     Mult1  -        -   Add      -    Divide  -    ...  -

Write result of ADDD? NO, there is a WAR hazard

Scoreboard Example Cycle 18


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              -                   -
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
1     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
18     Mult1  -        -   Add      -    Divide  -    ...  -

Scoreboard Example Cycle 19


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  -
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
0     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
19     Mult1  -        -   Add      -    Divide  -    ...  -

Scoreboard Example Cycle 20


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      -              -                   -
ADDD  F6      F8   F2   13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
20     -      -        -   Add      -    Divide  -    ...  -

Scoreboard Example Cycle 21


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             -                   -
ADDD  F6      F8   F2   13     14             16                  -

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
21     -      -        -   Add      -    Divide  -    ...  -

Scoreboard Example Cycle 22


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             -                   -
ADDD  F6      F8   F2   13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
40    Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
22     -      -        -   -        -    Divide  -    ...  -

DIVD has read its operands (cycle 21), so ADDD can now write its result.

Scoreboard Example Cycle 61


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             61                  -
ADDD  F6      F8   F2   13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
0     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
61     -      -        -   -        -    Divide  -    ...  -

DIVD finishes execution.

Scoreboard Example Cycle 62


Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             61                  62
ADDD  F6      F8   F2   13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No    -     -    -    -    -        -        -    -
-     Mult1    No    -     -    -    -    -        -        -    -
-     Mult2    No    -     -    -    -    -        -        -    -
-     Add      No    -     -    -    -    -        -        -    -
-     Divide   No    -     -    -    -    -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F30
62     -      -        -   -        -    -       -    ...  -

CDC 6600 Scoreboard


- Achieves a speedup of 2.5 with respect to no dynamic scheduling.
- By reorganizing instructions, the compiler achieves only 1.7.

But:

- No cache
- No forwarding hardware
- Limited to instructions in a basic block
- Small number of functional units (structural hazards)
- Waits for WAR hazards
- Prevents WAW hazards

Tomasulo Algorithm

- Invented at IBM, 3 years after the CDC 6600, for the IBM 360/91.
- Same goal: performance without special compilers.
- Led to: Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604.

Tomasulo Algorithm Basics

- The control logic and the buffers are distributed with the FUs.
- Operand buffers are called reservation stations (RS).
- Each instruction is an entry of a reservation station.
- Its operands are replaced by values or pointers (register renaming).

Tomasulo Algorithm Basics

Register renaming makes it possible to:

- Avoid WAR and WAW hazards.
- Use more reservation stations than there are registers (so the hardware can do optimizations that a compiler cannot).

Results are dispatched to the other FUs through a Common Data Bus (CDB). Loads and stores are treated as FUs.

Tomasulo Algorithm for an FPU

Reservation Station Components

- Tag: identifies the RS
- Op: the operation to perform
- Vj, Vk: values of the source operands
- Qj, Qk: pointers to the RS that produce Vj, Vk
- Busy: indicates that the RS is busy

Other components

- The RF and the store buffers have a Value (V) field and a Pointer (Q) field.
- Load buffers have an address field and a busy field.
- Store buffers also have an address field.

The three stages of the Tomasulo Algorithm.

Issue

- Get an instruction I from the queue. If it is an FP operation, check whether an RS is empty (i.e., check for structural hazards).
- Rename registers.
- WAR resolution: if I writes Rx, which is read by an already issued instruction K, then K already knows the value of Rx or knows which instruction will write it. So the RF can be linked to I.
- WAW resolution: since issue is in order, the RF can be linked to I.

The Three Stages of The Tomasulo Algorithms

Execution

When both operands are ready, execute. If they are not ready, watch the Common Data Bus for results.

Write result

Write on Common Data Bus to all waiting units; mark reservation stations available.
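
A minimal sketch of the write-result step, assuming a simplified model in which the Common Data Bus is a function call and the register file update is left out. The class `ReservationStation` and the function `broadcast` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReservationStation:
    tag: str                      # identifies this RS on the CDB
    busy: bool = False
    op: Optional[str] = None
    Vj: Optional[float] = None    # operand values, once known
    Vk: Optional[float] = None
    Qj: Optional[str] = None      # tags of the RS that will produce them
    Qk: Optional[str] = None

    def ready(self) -> bool:
        """An instruction may execute when no operand is still pending."""
        return self.busy and self.Qj is None and self.Qk is None


def broadcast(tag: str, value: float, stations: List[ReservationStation]) -> None:
    """Model one CDB write: every station waiting on `tag` captures the value."""
    for rs in stations:
        if rs.Qj == tag:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == tag:
            rs.Vk, rs.Qk = value, None
        if rs.tag == tag:
            rs.busy = False  # the producing station becomes available again
```

An ADDD waiting on the result of Mult1 becomes ready as soon as Mult1 broadcasts, without going through the register file.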

The Common Data Bus

- A Common Data Bus is a data + source bus.
- In the IBM 360/91: data = 64 bits, source = 4 bits.
- FUs must perform an associative lookup in the RS.

Tomasulo (IBM) versus Scoreboard (CDC)


Tomasulo (IBM):

- Pipelined FUs
- Issue window size = 14
- No issue on structural hazards
- WAR and WAW avoided with renaming
- Results broadcast from the FUs
- Control distributed over the RS

Scoreboard (CDC):

- Multiple but not pipelined FUs
- Issue window size = 5
- No issue on structural hazards
- Completion stalls for WAW and WAR hazards
- Results written back to registers
- Control centralized through the scoreboard

Branch Prediction

The current DLX wastes one cycle, but other architectures compute branches several cycles after the IF stage. We need to predict the branch result as soon as possible (ID stage). The performance of branch prediction depends on:

- Accuracy, measured as the percentage of mispredictions.
- Cost of a misprediction, measured as the time wasted executing useless instructions.

Branch History Table

- Table of 1-bit values
- Indexed by the lower bits of the PC address
- Says whether or not the branch was taken last time

Branch History Table

Problem: in a loop, a 1-bit BHT causes two mispredictions:

1. When we arrive at the end of the loop and must exit. Here the BHT predicts staying in the loop.
2. When we re-enter the loop, reach the end of the first iteration, and must stay in the loop. Here the BHT predicts exiting.
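
The two mispredictions per loop execution can be reproduced with a small simulation. This is a sketch under simplifying assumptions: a single table entry, a loop-closing branch that is taken to stay in the loop, and a hypothetical helper name `mispredictions_1bit`.

```python
def mispredictions_1bit(outcomes, initial=True):
    """Count mispredictions of one 1-bit BHT entry (predict the last outcome)."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
        prediction = taken  # remember only what the branch did last time
    return misses


# Loop-closing branch of a ten-iteration loop: taken nine times, then not taken.
loop = [True] * 9 + [False]
```

Starting from a "taken" prediction, the first execution of `loop` costs one misprediction (the exit); every later execution costs two (the first iteration and the exit), so `mispredictions_1bit(loop * 2)` is 3.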

Dynamic Branch Prediction

It is a 2-bit scheme in which we change the prediction only after mispredicting twice in a row. For each index of the table, the 2 bits encode the state of a state machine (next slide). When we arrive at the end of the loop, we don't change the prediction.

We can describe the algorithm with a FSM
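
The FSM figure itself is not reproduced in this text, so the sketch below assumes the common 2-bit saturating-counter variant (states 0 and 1 predict not taken, states 2 and 3 predict taken); the slide's state machine may differ in its transitions. `TwoBitCounter` and `mispredictions_2bit` are hypothetical names.

```python
class TwoBitCounter:
    """2-bit saturating counter: 0, 1 predict not taken; 2, 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions_2bit(outcomes, state=3):
    """Count mispredictions of a single 2-bit entry over a branch trace."""
    counter = TwoBitCounter(state)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses
```

For a loop branch that is taken nine times and then not taken, each execution of the loop now costs a single misprediction (the exit), because one not-taken outcome is not enough to flip the prediction.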

Branch History Table Accuracy

We have a misprediction when:

- We make a wrong guess for that branch, or
- We get the history of the wrong branch, because the same index can be referenced by two different branches.

Branch History Table Accuracy

It has been measured that, with a 4096-entry table, programs have a misprediction percentage from 1% to 18%:

Benchmark       Mispredictions
nasa7, tomcatv  1%
eqntott         18%
spice           9%
gcc             12%

A 4096-entry table is about as good as an infinite table (for the Alpha 21164).

Correlating Branches

Basic hypothesis: recent branches are correlated, i.e., the behavior of recently executed branches affects the prediction of the current branch.

Correlating Branches Example


    if (a == 2) bb1;
L1: if (b == 2) bb2;
L2: if (a != b) bb3;

    subi r3,r1,2
    bnez r3,L1
    add  r1,r0,r0   ; bb1
L1: subi r3,r2,2
    bnez r3,L2
    add  r2,r0,r0   ; bb2
L2: sub  r3,r1,r2
    beqz r3,L3
    ...             ; bb3
L3:

The branch at L2 is correlated with the previous branches: if both are not taken, then the branch at L2 is taken.

Idea:

Record the m most recently executed branches as taken or not taken. Use that pattern to select the proper branch history table.

Example of a simple correlating predictor

The branch is predicted on the basis of the previously executed one, by selecting the appropriate 1-bit BHT.

[Figure: two branch prediction tables, one used if the last branch was taken and one if the last branch was not taken. Labels: branch to be predicted, last branch result, effective branch result.]

(m,n) predictors

In general, an (m,n) predictor records the outcome of the last m branches to select among 2^m history tables, each with n-bit entries.
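
A hedged sketch of an (m,n) predictor follows. It assumes a single global history register, n-bit saturating counters initialized to zero, and a table indexed by the PC modulo the table size; the names `CorrelatingPredictor` and `count_misses` and the default of 1024 entries are illustrative choices, not taken from the slides.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m history bits select one of 2**m tables of n-bit counters."""

    def __init__(self, m, n, entries=1024):
        self.m = m
        self.entries = entries
        self.history = 0                      # outcomes of the last m branches
        self.max_count = (1 << n) - 1
        self.tables = [[0] * entries for _ in range(1 << m)]

    def predict(self, pc):
        counter = self.tables[self.history][pc % self.entries]
        return counter > self.max_count // 2  # upper half of the range = taken

    def update(self, pc, taken):
        table = self.tables[self.history]
        i = pc % self.entries
        if taken:
            table[i] = min(self.max_count, table[i] + 1)
        else:
            table[i] = max(0, table[i] - 1)
        if self.m:
            mask = (1 << self.m) - 1
            self.history = ((self.history << 1) | int(taken)) & mask


def count_misses(predictor, outcomes, pc=0):
    """Run one branch (at address pc) through the predictor and count misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

On a branch that strictly alternates taken and not taken, a (1,1) predictor mispredicts only the first outcome, while a (0,1) predictor (a plain 1-bit BHT) mispredicts every time.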

Example of a (2,2) correlating branch predictor

Each cell of the predictor represents the state of a 2 bit branch predictor.

Accuracy of different Schemes


[Figure: bar chart of the frequency of mispredictions (0% to 18%) on doducd, nasa7, gcc, espresso, spice, tomcatv, eqntott, fpppp, matrix300, and li, for three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2).]

Address must also be predicted

Access the Branch Target Buffer in the IF stage. Typical entry:

- Exact address of a branch
- Predicted PC (only if not sequential)

Branch Target Buffer structure


[Figure: the PC of the fetched instruction is used for an associative lookup in the buffer, whose entries hold a predicted PC.]

- No: the instruction is not predicted to be a branch; proceed normally.
- Yes: the instruction is a branch, and the predicted PC should be used as the next PC.
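
The lookup can be sketched with a dictionary standing in for the associative memory. This is only an illustration: `BranchTargetBuffer`, `next_pc`, and `update` are hypothetical names, the 4-byte instruction size is an assumption, and capacity and replacement are ignored.

```python
class BranchTargetBuffer:
    """Maps the exact address of a branch to its predicted next PC."""

    def __init__(self):
        self.entries = {}  # branch address -> predicted PC

    def next_pc(self, pc, instr_size=4):
        """Hit: use the predicted PC. Miss: fetch sequentially."""
        return self.entries.get(pc, pc + instr_size)

    def update(self, pc, taken, target):
        """Keep an entry only for branches whose next PC is not sequential."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)
```

An address that was never seen as a taken branch falls through to `pc + 4`; after `update(100, True, 64)` the fetch following address 100 is redirected to 64.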

Branch Target Buffer

Hardware Speculation (Boosting)


- Issue an instruction that depends on a branch before the branch result is known.
- Commit is always made in order.
- A speculative instruction commits only when the branch outcome is known.
- The same holds for exceptions (synchronous or asynchronous deviations of control flow).

Speculative Tomasulo's Algorithm

Tomasulo's algorithm with boosting needs a buffer for uncommitted results (the reorder buffer, ROB). Each entry holds:

- Instruction
- Destination
- Value

The ROB has a slot for each issued instruction. When an instruction writes a register, it writes only to its assigned slot in the ROB. The reorder buffer can be an operand source (like the RS or the load buffers) or a destination (like the RF and the store buffers).

Tomasulo's ROB (cont.)

- The RS now only queue instructions for the FUs (to reduce structural hazards).
- Pointers are now directed toward ROB slots.

Four steps of the speculative Tomasulo's Algorithm

1. Issue: get an instruction from the queue. Both the RS and the ROB must have a free slot. Dispatch the operation, indicating in which ROB slot it must write.
2. Execution: when both operands are ready, execute. If not, watch the CDB.
3. Write result: write on the CDB and to the ROB.
4. Commit (also called graduation): the committed instruction at the head of the ROB updates the destination register and is removed. Mispredicted branches flush the ROB.
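
The commit step can be sketched with a queue. This is a simplified illustration, not the slides' hardware: `ROBEntry` and `commit` are hypothetical names, the `done` and `mispredicted` flags are assumptions added to the three fields listed earlier, and stores and exceptions are ignored.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    instruction: str
    destination: Optional[str]      # None for branches
    value: Optional[float] = None
    done: bool = False              # result has been written to the ROB
    mispredicted: bool = False      # only meaningful for branches


def commit(rob, registers):
    """Retire finished instructions from the head of the ROB, strictly in order."""
    while rob and rob[0].done:
        head = rob.popleft()
        if head.mispredicted:
            rob.clear()             # squash everything issued after the branch
            break
        if head.destination is not None:
            registers[head.destination] = head.value
```

A finished instruction behind an unfinished head stays in the ROB, so registers are updated in program order; when a mispredicted branch reaches the head, the speculative entries after it are discarded without touching the registers.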

Speculative Tomasulo's algorithm
