03 ILP: Speculation and Advanced Topics

Hardware-based speculation combines dynamic branch prediction, dynamic scheduling, and speculation to improve instruction-level parallelism (ILP). It introduces completion and commit stages after execution so that instructions along a predicted path can execute before the branch outcome is known. A reorder buffer holds speculative results until the branch outcome allows them to commit; it also replaces the store buffer and enables out-of-order execution with in-order commit. Multiple-issue processors issue several instructions per cycle, using static or dynamic scheduling, to further improve ILP. VLIW processors rely on the compiler for scheduling, while superscalar processors can schedule dynamically in hardware.


1

ILP: Hardware-Based Speculation and Multiple Issue
PREPARED BY SHAFIA HUSSAIN
2
Hardware-Based Speculation
Dynamic Branch Prediction + Dynamic Scheduling + Speculation

 Control dependences pose a major hurdle to ILP.
 Branch prediction reduces stalls, but in an out-of-order processor, instructions after a branch still wait for its result before executing.
 Speculation goes a step further and executes the instructions along the path suggested by the branch prediction.
 This is accomplished by introducing two separate stages after instruction execution: Completion and Commit.
 An instruction's result is available to dependent instructions after the Completion stage.
 The result is written back to the registers in Commit, after the branch outcome is known.
 If the branch prediction was wrong, none of the completed instructions from the mispredicted path are committed.
3
Reorder Buffer
 A circular buffer named the Reorder Buffer (ROB) is added after the functional units.
 Completed results reside in it until the instructions are no longer speculative, i.e., the branch result is known.
 It has two pointers, head and tail; at the start both point to the same location.
 The tail pointer increments with each new instruction.
 The head pointer increments with each commit.
4
Reorder Buffer
 The ROB entry number is recorded in the reservation stations (RSs) for operands that will be produced by in-flight instructions.
 Each ROB entry contains four fields: the instruction type, the destination field, the value field, and the ready field, which indicates that the instruction has completed execution.
 The reorder buffer also replaces the store buffer: a store writes its result to memory in the commit stage, which is the second step of store execution. (A sketch of the ROB follows.)
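A minimal sketch of the ROB as a circular buffer, in Python; the field names and buffer size are illustrative choices, not taken from the slides:

from dataclasses import dataclass

@dataclass
class ROBEntry:
    itype: str = ""       # instruction type: "alu", "store", or "branch"
    dest: str = ""        # destination register (memory address for stores)
    value: int = 0        # result recorded at completion
    ready: bool = False   # set once execution has completed

class ReorderBuffer:
    def __init__(self, size=16):
        self.entries = [ROBEntry() for _ in range(size)]
        self.head = 0      # oldest entry: next to commit
        self.tail = 0      # next free entry
        self.count = 0

    def allocate(self, itype, dest):
        # Issue: reserve the tail entry; its index is the tag sent to the RS.
        if self.count == len(self.entries):
            return None    # ROB full: issue must stall
        tag = self.tail
        self.entries[tag] = ROBEntry(itype, dest)
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def complete(self, tag, value):
        # Completion: the result becomes visible to dependents via this tag.
        self.entries[tag].value = value
        self.entries[tag].ready = True

    def commit(self):
        # Commit: retire the head entry, in program order, once it is ready.
        if self.count == 0 or not self.entries[self.head].ready:
            return None
        entry = self.entries[self.head]
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry       # caller writes entry.value to entry.dest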
6
Steps of Execution
Issue
 An instruction is issued only when there is an empty slot in both the reservation station and the ROB.
 Operands are provided if they are available in the registers or the ROB.
 The ROB entry number is also sent to the RS, where it is used as a tag to capture the result when it is broadcast on the CDB.
 Issue is in order, so:
 No RAW hazards for register instructions
 No WAR hazards for register and memory instructions
7
Steps of Execution
Execute
 If operands are not available, monitor the CDB.
 A load has two execution steps, both of which complete in this phase.
 For a store, only the effective address computation is completed here; data is written to memory in the commit stage, so a store only needs its base register at this stage.
 RAW hazards through memory are prevented by two restrictions (see the sketch after this list):
 Maintaining program order for the computation of a load's effective address with respect to all earlier stores.
 Not allowing a load to initiate the second step of its execution if any active ROB entry is occupied by a store with the same address.
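A minimal sketch of the second restriction, assuming each ROB entry for a store carries an addr field once its effective address is computed (an illustrative addition to the ROBEntry above):

def load_may_proceed(rob, load_addr):
    # Walk the active ROB entries from head to tail (program order).
    i, remaining = rob.head, rob.count
    while remaining > 0:
        e = rob.entries[i]
        # Per the slide's rule: block the load's memory access if any active
        # store entry targets the same address.
        if e.itype == "store" and getattr(e, "addr", None) == load_addr:
            return False
        i = (i + 1) % len(rob.entries)
        remaining -= 1
    return True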
8
Steps of Execution
Write Result
 The result is written to the CDB with its ROB tag, from where it reaches the ROB and any RS waiting for the result.
 The RS is then marked available.
 For store instructions, if the value to be stored is available, it is written into the Value field of the store's ROB entry; otherwise, the CDB must be monitored until that value is broadcast.
9
Steps of Execution
Commit
 Three scenarios arise when an instruction reaches the head of the ROB (see the sketch after this list):
 Branch
 If the prediction was correct, its execution completes here.
 Otherwise the entire ROB is flushed and execution restarts from the correct instruction.
 A store writes to its memory location if the data is available.
 Other instructions write their results to the registers.
 When an instruction commits, its ROB entry becomes vacant again for new instructions.
 Since commit is in order:
 There are no WAW hazards for registers or memory.
 There are no imprecise exceptions.
 Exceptions are recorded and raised only when the instruction reaches the head of the ROB.
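A minimal sketch of the commit step, reusing the ReorderBuffer above; treating the branch entry's value as a prediction-correct flag is an illustrative simplification:

def commit_step(rob, regfile, memory):
    entry = rob.commit()
    if entry is None:
        return                            # head not ready: wait
    if entry.itype == "branch":
        if not entry.value:               # prediction was wrong
            # Flush all speculative work; fetch restarts at the correct target.
            rob.head = rob.tail = rob.count = 0
    elif entry.itype == "store":
        memory[entry.dest] = entry.value  # stores write memory only at commit
    else:
        regfile[entry.dest] = entry.value # in-order register write-back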
10
Exploiting ILP Using Multiple Issue
 The goal of multiple-issue processors is to decrease CPI below 1 (equivalently, to raise IPC above 1).
 Three major flavors:
 Statically scheduled superscalar processors
 Instruction issue is dynamic (a hazard detection unit decides how many to issue), but execution is in order, so the compiler must schedule code to minimize stalls.
 VLIW (very long instruction word) processors
 Similar to the static superscalar, but without any hazard detection hardware; it is the compiler's job to avoid hazards and stalls.
 A fixed-size packet of instructions is issued.
 Dynamically scheduled superscalar processors
 Dynamic issue and scheduling; may or may not use speculation.
11
Comparison of Different Multi-Issue Processors (comparison table shown as a figure)
12
Multi-issue and Static Scheduling
VLIW
 The advantages of the static superscalar diminish as issue width grows, because instructions execute in order and any hazard stalls the processor.
 A VLIW processor issues a fixed-length packet containing multiple instructions.
 Each functional unit has a dedicated slot in the packet, e.g., integer operations in slots 1 and 2, loads/stores in slot 5.
13
VLIW
 To maximize the benefit of multiple issue, as many slots as possible (ideally all) should be utilized.
 But hazards such as RAW must be avoided, and this is the job of the programmer or compiler.
 ILP can be found by exploiting parallelism within a basic block as well as by loop unrolling.
 If exploiting the parallelism requires scheduling code across branches, a substantially more complex global scheduling algorithm (e.g., trace scheduling) must be used. (Not covered in this course.)
14
Example: C code and its RISC-V assembly (shown as figures); the loop adds the scalar held in f19 to each element of an array addressed through x1.
15
VLIW schedule (one packet per row; empty slots are nops):

Mem1             | Mem2             | FP op1            | FP op2            | Integer/branch
fld f0,0(x1)     | fld f1,-8(x1)    |                   |                   |
fld f2,-16(x1)   | fld f3,-24(x1)   |                   |                   |
fld f4,-32(x1)   | fld f5,-40(x1)   | fadd.d f6,f0,f19  | fadd.d f7,f1,f19  |
                 |                  | fadd.d f8,f2,f19  | fadd.d f9,f3,f19  |
                 |                  | fadd.d f10,f4,f19 | fadd.d f11,f5,f19 |
fsd f6,0(x1)     | fsd f7,-8(x1)    |                   |                   |
fsd f8,-16(x1)   | fsd f9,-24(x1)   |                   |                   | addi x1,x1,-48
fsd f10,16(x1)   | fsd f11,8(x1)    |                   |                   | bne x1,x2,loop
16
VLIW Problems
 Technical
 Code size
 Loop unrolling inflates the code.
 Unused functional-unit slots hold nops (wasted bits).
 Clever encoding helps, e.g., one large immediate field shared by all functional units.
 Lockstep operation
 A stall in one functional unit stalls the whole processor, and stalls can be unpredictable, e.g., a cache miss.
 Logistical
 Binary code compatibility (code is scheduled for one specific implementation).
17
Superscalar
 Multiple instructions are issued dynamically.
 Structural hazards are handled by the processor.
 Statically scheduled versions perform in-order execution:
 If a dependent instruction is stalled, all instructions following it are also stalled.
 Dynamically scheduled versions can use a Tomasulo-like scheme and perform out-of-order execution.
 Of course, multiple functional units are required.
18
Superscalar: Issue Stage
 Issue logic is the most complex part and is the bottleneck.
 Instructions inside an issue packet may depend on each other.
 So dependences must be determined within a cycle, and then all reservation stations must be reserved and updated in parallel.
 Every possible combination of dependent instructions in the same clock cycle must be considered, and the number of combinations grows with the issue width; e.g., in a 4-issue processor the 4th instruction can depend on the 1st, the 3rd on the 2nd, and so on.
 Issuing multiple instructions can be accomplished in two ways, and a modern processor may use both:
 Divide the issue stage into two stages (superpipelining) running at twice the external clock (for 2-issue). But tables still need to be updated for the following instruction, and this cannot easily be extended to 4-issue.
 Widen the issue logic so that it can issue multiple instructions at once: instruction dependences are analyzed in parallel and the reservation stations are updated together.
19
Superscalar: Considerations per Cycle
Issue
 Preallocate n reorder buffer entries while also making sure that reservation stations are available for any combination of instructions.
 To make preallocation succeed, reservation-station demand can be reduced by limiting the number of instructions for any particular functional unit (e.g., only 2 floating-point multiplications per packet).
 If there is still a shortage of reservation stations, the bundle is broken and the remaining instructions are issued in the next packet.
 Analyze all dependences among the instructions (see the sketch after this list):
 If instructions depend on instructions inside the bundle, use the newly assigned reorder buffer entries to update the tables.
 If instructions depend on earlier instructions, use the existing reservation table entries.
 Execution and Commit: multiple instructions can commit per cycle.
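A minimal sketch of intra-packet dependence handling, reusing the ReorderBuffer above; it processes the bundle sequentially for clarity, whereas real hardware resolves all of it in parallel within one cycle, and the instruction format is illustrative:

def issue_packet(packet, rename_map, rob):
    # packet: list of (dest, src1, src2) architectural register names.
    # rename_map: architectural register -> ROB tag of its latest producer.
    issued = []
    for dest, src1, src2 in packet:
        # A source written earlier in this same bundle picks up the ROB tag
        # assigned just below; otherwise the existing table entry is used.
        src_tags = (rename_map.get(src1, src1), rename_map.get(src2, src2))
        tag = rob.allocate("alu", dest)
        if tag is None:            # out of ROB entries: split the bundle;
            break                  # the rest issues in the next packet
        rename_map[dest] = tag     # later bundle members now see this tag
        issued.append((tag, src_tags))
    return issued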
20
Example (worked example shown as figures on slides 20-22)
23
Advanced Techniques for Instruction Delivery and Speculation
 In multiple issue, special arrangements are needed to maintain the issue rate.
 This requires:
 A wider fetch path
 Removing branch stalls
 Advanced speculation techniques
24
Increasing Instruction Fetch Bandwidth:
Using Branch Target Buffer
 Branch prediction reduces stalls but doesn't eliminate them.
 In the standard 5-stage pipeline we still have 1 stall with branch prediction if the branch is taken, because the branch target is computed in the decode stage; so there is still one bubble even when the prediction is correct and the branch is taken.
 A branch target buffer (BTB) is like a cache: it stores the target addresses of taken branches.
 In the fetch stage, the address of the instruction is matched against the BTB entries (before the instruction is even decoded), as sketched below.
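A minimal sketch of a BTB, in Python; a dict stands in for the tagged, cache-like structure:

class BranchTargetBuffer:
    def __init__(self):
        self.table = {}             # PC of a taken branch -> its target PC

    def lookup(self, pc):
        # Fetch stage: on a hit, fetch from the stored target next cycle.
        return self.table.get(pc)   # None: predict fall-through (not taken)

    def update(self, pc, taken, target):
        # After the branch resolves: insert taken branches, evict others.
        if taken:
            self.table[pc] = target
        else:
            self.table.pop(pc, None)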
28
Example
 On average, 0.38 cycles are wasted per branch; the 1 cycle for the branch's own execution is excluded. (The calculation is sketched below.)
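The 0.38 figure is consistent with the standard textbook BTB example, assuming a 90% buffer hit rate, 90% prediction accuracy on hits, a 2-cycle penalty, and that every branch that misses in the buffer pays the same penalty; a sketch of that calculation:

\[
\text{Penalty} = \underbrace{0.90 \times 0.10 \times 2}_{\text{hit, mispredicted}}
               + \underbrace{0.10 \times 2}_{\text{miss}}
               = 0.18 + 0.20 = 0.38 \ \text{cycles per branch}
\]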
Branch Folding
• One variation on the branch target buffer is to store one or more target instructions instead of, or in addition to, the predicted target address.
• Storing the actual target instructions allows us to perform an optimization called branch folding (consider it together with out-of-order execution).
• Branch folding can be used to obtain 0-cycle unconditional branches and sometimes 0-cycle conditional branches (i.e., because the IF stage for the target is effectively skipped: the buffer supplies the target instruction directly).
29
Increasing Instruction Fetch Bandwidth:
Specialized Branch Predictors for Indirect Jumps
 Indirect branches, such as indirect jumps and procedure returns, pose an extra challenge for speculation.
 These jumps use a jump-register instruction in assembly:
 A switch statement is compiled into a jump table of addresses in memory. The index into the table is computed from the input and the table's starting address; the target address is loaded into a register, and jr reg jumps to it.
 Similarly, the jump back from a procedure also uses jr reg.
 In both cases the instruction is the same but the target address varies, which seriously hurts prediction.
30
Increasing Instruction Fetch Bandwidth:
Specialized Branch Predictors for Indirect Jumps
 The procedure-return problem can be improved by adding a small buffer (cache) that acts as a stack for these indirect jumps.
 It contains the most recent procedure return addresses.
 On each call the return address is pushed onto this stack, and on return it is popped.
 So procedure returns use a separate return-address predictor instead of the predictor for direct branches, as sketched below.
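A minimal sketch of a return-address stack, in Python; the fixed depth is illustrative:

class ReturnAddressStack:
    def __init__(self, depth=16):
        self.stack = []
        self.depth = depth                # real predictors are small, fixed-size

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # Predicted target for the return; None if the stack has underflowed.
        return self.stack.pop() if self.stack else None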
31
Increasing Instruction Fetch Bandwidth: Integrated Instruction Fetch Units
 Divide the processor into two ends: a front end and a back end.
 The front end is the instruction fetch unit, so fetch is no longer considered just one stage of the pipeline.
 The interface between the front end and the back end is a buffer that contains the instructions to be issued.
 The fetch unit is now an autonomous unit with the following features:
 Integrated branch prediction: the branch predictor is part of the fetch unit and is constantly predicting branches.
 Instruction prefetch: the unit autonomously manages the prefetching of instructions, integrating it with branch prediction to deliver multiple instructions per clock.
 Instruction memory access and buffering: to maintain bandwidth in a multiple-issue processor, fetch may require jumping to different blocks in the cache; the delay is hidden using prefetching and by including a buffer inside the fetch unit.
32
Speculation: Implementation Issues and Extensions
Renaming using a Merged Register File
 Renaming using the reorder buffer has certain disadvantages (and also advantages, which is why high-end processors still use it):
 Register values are written twice, once into the reorder buffer and again into the register file when the instruction commits, so more power is consumed.
 A source operand may come from two locations, the register file or the reorder buffer, which increases the interconnect needed to deliver data to the reservation stations.
 A merged register file has two sets of registers:
 Architectural registers: visible to the programmer (compiler).
 Physical registers: not visible; used for renaming. They hold speculative results, like the reorder buffer, and in one variant also hold committed values.
33
Renaming using a Merged Register File
 There are two variants (with little difference between them):
 Architectural registers physically exist, and a separate set of extended physical registers holds speculative results for renaming; at commit, values are moved into the architectural registers.
 Architectural registers are purely logical; the physical registers hold both speculative and committed values.
 In both cases the physical registers far outnumber the architectural registers.
 The reorder buffer's function of keeping track of commit order is still present.
34
Renaming using a Merged Register File
 A renaming table maps architectural registers to physical registers and also keeps track of the free physical registers.
 Every destination register of an instruction being issued is assigned a free physical register; this eliminates the WAR and WAW dependences.
 Consequently, following instructions that have a RAW dependence update their source operands to the new mapping likewise.
 So an architectural register can be mapped to multiple physical registers.
 When a later mapping of an architectural register commits, the previous physical register is freed. For example, if r1 is mapped to physical registers r5 and then r6, then when r6 commits we are sure that r5 has no further use, and it can be added to the free list.
35
Example

i1: Add r1,r2,r4
i2: Sub r3,r1,r5
i3: Mul r5,r1,r6
i4: Add r1,r6,r5

 RAW between i2 and i1 for r1
 WAR between i3 and i2 for r5
 WAW between i4 and i1 for r1

Suppose the renaming table is initially:

Physical register | Architectural register
r1                | r1
r2                | r2
r3                | r3
r4                | r4
r5                | r5
r6                | r6
r7 to r12         | free

36
 The code after mapping is:

i1: Add r7,r2,r4
i2: Sub r8,r7,r5
i3: Mul r9,r7,r6
i4: Add r10,r6,r9

Physical register | Architectural register
r1 to r6          | r1 to r6 (unchanged)
r7                | r1
r8                | r3
r9                | r5
r10               | r1
r11, r12          | free

• As we can see, all WAR and WAW dependences are removed and the RAW dependences are maintained.
• Since r1 was already mapped to physical r1 before the first instruction, it has already been used as an operand or destination of some instruction; using it as a new destination without renaming would cause a WAR or WAW hazard.
• As can be seen, an architectural register can be mapped to multiple physical registers, which may cause a shortage if registers go unused but are not freed.
• How do we know when a physical register can be freed?
• The safest and easiest (but not optimal) way is to wait until the latest register for the same architectural register commits; then the previous one can be freed.
• For example, r7 (the earlier mapping of r1) can be freed when r10 (the later mapping of r1) commits, because no instruction depending on r7 can exist after that point. (A sketch of the renaming step follows.)
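A minimal sketch of the renaming step, in Python; running it reproduces the mapping from the example above:

def rename(code, num_phys=12):
    mapping = {f"r{i}": f"r{i}" for i in range(1, 7)}   # arch -> physical
    free = [f"r{i}" for i in range(7, num_phys + 1)]    # free physical regs
    renamed = []
    for op, dest, src1, src2 in code:
        s1 = mapping[src1]          # RAW sources read the latest mapping
        s2 = mapping[src2]
        phys = free.pop(0)          # a fresh physical register for the
        mapping[dest] = phys        # destination removes WAR and WAW hazards
        renamed.append((op, phys, s1, s2))
    return renamed

code = [("Add", "r1", "r2", "r4"),
        ("Sub", "r3", "r1", "r5"),
        ("Mul", "r5", "r1", "r6"),
        ("Add", "r1", "r6", "r5")]
for instr in rename(code):
    print(instr)    # destinations come out as r7, r8, r9, r10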
37
Speculation: Implementation Issues and Extensions
The Challenge of More Issues per Clock
 Once branch prediction is accurate and speculation works well, the next lever is to raise the issue rate.
 Duplicating functional units is straightforward.
 The renaming example again shows that the complexity of issue is the bottleneck (and likewise commit, which is its dual).
38
The Challenge of More Issues per Clock
 The difficulty is that the entire task of finding dependences and rewriting register identifiers must be done in a single cycle.
39
Speculation: Implementation Issues and Extensions
How Much to Speculate
 Speculation has many advantages, but it also has disadvantages that limit how much we speculate:
 It takes time and energy to recover from incorrect speculation.
 To maintain a high execution rate while speculating, the processor needs additional resources, which cost area and power:
 Better branch prediction (more caches/buffers, i.e., predictors)
 Multiple functional units
 If speculation triggers a costly exceptional event (e.g., a cache miss), it can significantly reduce performance.
 Most processors therefore allow only low-cost events (such as a first-level cache miss) to be handled speculatively.
40
Speculation and the Challenge of Energy Efficiency
 Instructions that are speculated but whose results are not needed generate excess work for the processor, wasting energy.
 Undoing the speculation and restoring the state of the processor consumes additional energy.
 However, if speculation lowers the execution time by more than it increases the average power consumption, then the total energy consumed may be less.
 SPEC2000 benchmark results show that misspeculation is high for integer programs and low for floating-point programs, so a cleverer scheme is needed.
41
(Figure: misspeculation rates for SPEC2000 benchmarks)
42
Speculating Through Multiple Branches
 So far we have only seen speculation through a single branch.
 Database and integer programs have clustering of branches, so they may benefit from speculating through multiple branches, and in some cases multiple branches per clock (as of 2017, no processor resolves multiple branches per clock).
 Multiple branches per cycle means multiple predictions per cycle, and hence multiported branch-prediction buffers.
 It also means fetching instructions from non-contiguous locations.
43
Address Aliasing Prediction
 Predicts whether two stores (a WAW hazard), or a load and a store (RAW and WAR hazards), refer to the same memory address.
 If they do not refer to the same address, we can interchange them.
 We don't need to predict the exact addresses, only whether the two addresses match (a sketch follows this list).
 The processor must also have a mechanism to recover after a misprediction, as in the case of mispredicted branches.
 It is a simple form of value prediction (which, after almost 15 years of research, is still not available in any processor).
 In value prediction, we predict the value produced by an instruction and thus eliminate data-dependence restrictions.
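A minimal sketch of a history-based alias predictor, in Python; the 2-bit saturating-counter scheme is an illustrative choice, not from the slides:

class AliasPredictor:
    def __init__(self):
        self.counters = {}      # (load PC, store PC) -> 2-bit counter

    def may_reorder(self, load_pc, store_pc):
        # Predict "no alias" (safe to reorder) while the counter is weak.
        return self.counters.get((load_pc, store_pc), 0) < 2

    def update(self, load_pc, store_pc, aliased):
        # Strengthen toward "alias" when the addresses did match, else decay.
        c = self.counters.get((load_pc, store_pc), 0)
        self.counters[(load_pc, store_pc)] = min(c + 1, 3) if aliased else max(c - 1, 0)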
44
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
 Multithreading exploits thread-level parallelism, but it improves pipeline utilization, which is why it is discussed here.
 Even with speculation and prefetching, there is still a chance that not all stalls can be hidden, e.g., a cache miss.
 Multithreading is a technique in which multiple threads share a processor without the need for a context switch.
 Threads of a process have separate PCs and state (registers) but share the same address space (i.e., code and data segments).
 So in a processor that supports multithreading, there are separate PCs and registers for each thread, but all threads share the functional units.
 Thus, where in-order execution would stall the processor, the stall is hidden by executing instructions from other threads in the meantime.
 For dynamic scheduling, a per-thread renaming table is needed, along with the separate registers and PC.
 Of course, to benefit from multithreading, a program must have multiple threads.
45
Multithreading Types
 Three approaches:
 Fine-grained (static scheduling):
 Switch between threads on each cycle, interleaving them in round-robin fashion (a sketch appears after this list).
 A thread is skipped if it is stalled, whether due to a data dependence or a cache miss.
 No out-of-order execution, i.e., static scheduling, so no dynamic-scheduling hardware (e.g., reservation stations); think of it as multiple 5-stage pipelines in parallel.
 Coarse-grained (static scheduling):
 Switches threads only on costly stalls, such as level-two or level-three cache misses.
 Simultaneous multithreading (SMT):
 Fine-grained multithreading used on a superscalar with dynamic scheduling.
 Fetch issues from one thread at a time, but SMT executes instructions from multiple threads, leaving it to the hardware to associate instruction slots and renamed registers with their proper threads.
 In fine-grained multithreading, only instructions whose operands are available are issued, whereas with dynamic scheduling, instructions whose operands are still being computed can also be issued.
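A minimal sketch of fine-grained thread selection, in Python, assuming each hardware thread exposes an illustrative stalled flag:

def pick_thread(threads, last):
    # Round-robin over hardware threads, skipping any that are stalled
    # (data dependence or cache miss); returns None if all are stalled.
    n = len(threads)
    for step in range(1, n + 1):
        t = (last + step) % n
        if not threads[t].stalled:
            return t
    return None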
47
Multithreading Types
 Fine-grained:
 Increases multi-thread throughput but increases the latency of a single thread.
 Coarse-grained:
 Less likely to slow down the execution of any one thread.
 But it has throughput losses:
 A thread is not switched out on short stalls (e.g., an L1 cache miss).
 A thread is switched only on a stall, so there is a stall (bubble) in the pipeline before every switch.
 No major current processor uses this technique.
48
Simultaneous Multithreading
 The effectiveness of SMT was explored in 2000-2001, assuming the dynamic superscalar would get much wider over the next few years, with:
 Six to eight issues per clock
 Many simultaneous loads and stores
 Large primary caches
 Four to eight contexts with simultaneous issue and commit
 In practice, existing implementations of SMT offer:
 Only two to four contexts, with fetch and issue from only one at a time
 Up to four issues per clock
 The result is that the gain from SMT is correspondingly more modest.
