03 ILP Speculation and Advanced Topics
The goal of multiple-issue processors is to decrease CPI below 1, or equivalently to increase IPC above 1 (a worked example follows this list).
Three major flavors:
Statically scheduled superscalar processors
Instruction issue is dynamic (the hazard detection unit decides how many instructions issue each cycle), but execution is in order, so the compiler must schedule code so that stalls are minimized.
VLIW (very long instruction word) processors
Similar to the static superscalar, but without any hazard detection hardware; it is the compiler's job to avoid hazards and stalls.
A fixed-size packet of instructions is issued each cycle.
Dynamically scheduled superscalar processors
Dynamic issue and dynamic scheduling; may or may not support speculation.
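As a back-of-the-envelope illustration (the numbers here are ours, not the slides'): a 4-issue processor that sustains 2 instructions per clock has IPC = 2 and therefore CPI = 1/IPC = 0.5, which no single-issue pipeline can reach, since its best case is CPI = 1.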
Comparison of different Multi-issue processors
Multi-issue and Static Scheduling: VLIW

(The original slide shows a C loop, its RISC-V assembly, and the resulting VLIW schedule. Each VLIW packet has five slots: Mem1, Mem2, FP Op1, FP Op2, and Integer/branch. The first packet issues fld f0,0(x1) and fld f1,-8(x1) in its two memory slots.)
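The slide's own listing did not survive extraction; it is very likely the classic array-increment loop from Hennessy and Patterson, reconstructed here as an assumption:

    /* C code: add a scalar s to every element of x */
    for (i = 999; i >= 0; i--)
        x[i] = x[i] + s;

    # RISC-V assembly (x1 points to x[999], f2 holds s,
    # x2 holds the end-of-loop sentinel)
    Loop: fld    f0,0(x1)     # f0 = x[i]
          fadd.d f4,f0,f2     # f4 = x[i] + s
          fsd    f4,0(x1)     # x[i] = f4
          addi   x1,x1,-8     # advance to x[i-1]
          bne    x1,x2,Loop   # repeat until done

Unrolling this loop lets the compiler fill several of the five VLIW slots per packet, which is why the schedule can start with two independent fld instructions in the memory slots.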
Increasing Instruction Fetch Bandwidth

• One variation on the branch-target buffer is to store one or more target instructions instead of, or in addition to, the predicted target address.
• Buffering the actual target instructions allows us to perform an optimization called branch folding: the branch itself disappears from the fetched instruction stream.
• Branch folding can be used to obtain 0-cycle unconditional branches and sometimes 0-cycle conditional branches, because the IF stage for the target is skipped: the buffered target instruction is delivered in place of the branch.
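To make this concrete, here is a minimal sketch in C of a branch-target buffer entry extended with the target instruction; the field names, sizes, and indexing are our assumptions, not a real design:

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 1024

    /* One BTB entry: besides the predicted target PC, we also cache the
       instruction word at the target, so fetch can "fold" the branch and
       deliver the target instruction in the same cycle. */
    typedef struct {
        uint64_t tag;          /* PC of the branch (hit check)         */
        uint64_t target_pc;    /* predicted target address             */
        uint32_t target_instr; /* instruction word at target_pc        */
        bool     valid;
    } btb_entry_t;

    static btb_entry_t btb[BTB_ENTRIES];

    /* On a hit for an unconditional branch, hand the buffered target
       instruction to decode instead of the branch: a 0-cycle branch. */
    bool btb_lookup(uint64_t pc, uint64_t *next_pc, uint32_t *fetched) {
        btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->tag == pc) {
            *fetched = e->target_instr;  /* branch folded away         */
            *next_pc = e->target_pc + 4; /* fetch resumes after target */
            return true;
        }
        return false;
    }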
Specialized Branch Predictors for Indirect Jumps

Indirect branches, such as indirect jumps and procedure returns, pose an extra challenge for speculation because the target address is not encoded in the instruction.
These jumps use a jump-register instruction (jr reg) in assembly.
A switch statement is compiled into a jump table, i.e., a table of target addresses in memory. The index into the table is computed from the switch input and the table's starting address, the target address is loaded into a register, and jr reg jumps to the target (see the sketch below).
Similarly, returning from a procedure also uses jr reg.
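Here is a sketch in C of the jump-table idea using an array of function pointers, which is essentially what the compiler generates for a dense switch (the function names are illustrative):

    #include <stdio.h>

    static void case0(void) { puts("case 0"); }
    static void case1(void) { puts("case 1"); }
    static void case2(void) { puts("case 2"); }

    /* The jump table: a table of target addresses in memory. */
    static void (*const jump_table[])(void) = { case0, case1, case2 };

    void dispatch(unsigned idx) {
        if (idx < 3)
            /* Load the target from the table, then jump indirectly
               through the register (jr/jalr at the machine level). */
            jump_table[idx]();
    }

    int main(void) {
        dispatch(1); /* prints "case 1" */
        return 0;
    }

Each call through jump_table[idx] compiles to a load followed by an indirect jump, exactly the load-then-jr pattern described above, and exactly the target an indirect-branch predictor must guess.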
Divide the processor into two ends: a front end and a back end.
The front end is the instruction fetch unit, so fetch is no longer considered just one stage of the pipeline.
The interface between the front end and the back end is a buffer that holds the instructions waiting to be issued.
The fetch unit is now an autonomous unit with the following features:
Integrated branch prediction: the branch predictor is part of the IF unit and is constantly predicting branches.
Instruction prefetch: the unit autonomously manages the prefetching of instructions, integrating it with branch prediction, in order to deliver multiple instructions per clock.
Instruction memory access and buffering: to sustain bandwidth in a multi-issue processor, fetch may have to jump between different blocks in the cache. The resulting delay is hidden by prefetching and by a buffer inside the fetch unit (sketched below).
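A minimal sketch of that decoupling buffer as a fixed-size FIFO in C; the queue size and names are our assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define FETCH_Q_SIZE 32u /* power of two, so wrap-around is cheap */

    /* FIFO that decouples the fetch unit (producer) from issue
       (consumer): the front end keeps it full, and the back end drains
       up to issue-width instructions per cycle. */
    typedef struct {
        uint32_t instr[FETCH_Q_SIZE];
        unsigned head, tail; /* head: next to issue; tail: next free */
    } fetch_queue_t;

    static bool fq_full(const fetch_queue_t *q)  { return q->tail - q->head == FETCH_Q_SIZE; }
    static bool fq_empty(const fetch_queue_t *q) { return q->tail == q->head; }

    /* Front end pushes a fetched instruction word. */
    bool fq_push(fetch_queue_t *q, uint32_t instr) {
        if (fq_full(q)) return false;               /* fetch waits    */
        q->instr[q->tail++ % FETCH_Q_SIZE] = instr;
        return true;
    }

    /* Back end pops the next instruction to issue. */
    bool fq_pop(fetch_queue_t *q, uint32_t *instr) {
        if (fq_empty(q)) return false;              /* issue stalls   */
        *instr = q->instr[q->head++ % FETCH_Q_SIZE];
        return true;
    }

Because the buffer absorbs fetch hiccups (cache-block boundaries, redirects), issue can keep going even in cycles when fetch momentarily cannot.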
Speculation: Implementation Issues and Extensions
Renaming using Merged Register File
Address speculation: predict whether two stores (a potential WAW hazard), or a load and a store (potential RAW and WAR hazards), refer to the same memory address.
If they do not refer to the same address, we can safely interchange them.
We do not need to predict the exact addresses, only whether the two addresses match (a predictor sketch follows below).
The processor must also have a mechanism to recover after a misprediction, just as in the case of mispredicted branches.
This is a simple form of value prediction (which, after almost 15 years of research, is still not available in any processor).
In value prediction we predict the value produced by an instruction and thus eliminate data-dependence restrictions.
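A minimal sketch in C of such a load-store dependence predictor: a table of 2-bit saturating counters indexed by a hash of the load's PC, remembering whether that load recently conflicted with an older store. The structure and names are our assumptions; real designs (e.g., store sets) are more elaborate:

    #include <stdbool.h>
    #include <stdint.h>

    #define PRED_ENTRIES 256

    /* 2-bit saturating counters: values >= 2 mean "this load recently
       aliased with an older store, so do not reorder it". */
    static uint8_t alias_ctr[PRED_ENTRIES];

    static unsigned idx_of(uint64_t load_pc) { return (load_pc >> 2) % PRED_ENTRIES; }

    /* Predict: may this load be moved above earlier, unresolved stores? */
    bool predict_no_alias(uint64_t load_pc) {
        return alias_ctr[idx_of(load_pc)] < 2;
    }

    /* Train once the addresses resolve. If we predicted no-alias but the
       addresses matched, the load and its dependents must be replayed,
       just like recovery from a mispredicted branch. */
    void train(uint64_t load_pc, bool aliased) {
        uint8_t *c = &alias_ctr[idx_of(load_pc)];
        if (aliased) { if (*c < 3) (*c)++; }
        else         { if (*c > 0) (*c)--; }
    }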
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput

Multithreading exploits thread-level parallelism, but it improves pipeline utilization; that is why it is discussed here.
Even with speculation and prefetching, there is still a chance that not all stalls can be hidden, e.g., a cache miss.
Multithreading is a technique in which multiple threads share a processor without the need for a context switch.
Threads of a process have separate PCs and separate state (registers), but they share the same address space (i.e., code and data segments).
So in a processor that supports multithreading, there is a separate PC and register set for each thread, but all threads share the functional units.
Thus, while in-order execution may stall the processor, the stall is hidden by executing instructions from other threads in the meantime.
For dynamic scheduling, a per-thread renaming table is needed along with the separate registers and PC (a sketch of such a hardware context appears below).
Of course, to benefit from multithreading a program must have multiple threads.
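A minimal sketch, as a C struct, of what the hardware must replicate per thread; the field names and sizes are illustrative assumptions:

    #include <stdint.h>

    #define NUM_ARCH_REGS 32
    #define NUM_THREADS   4

    /* Replicated per hardware thread: the PC, the architectural
       registers, and (for dynamic scheduling) the rename map from
       architectural to physical registers. Functional units, caches,
       and the physical register file are shared by all threads. */
    typedef struct {
        uint64_t pc;
        uint64_t arch_regs[NUM_ARCH_REGS];
        uint16_t rename_map[NUM_ARCH_REGS]; /* arch reg -> phys reg */
    } hw_context_t;

    static hw_context_t contexts[NUM_THREADS];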
Multithreading Types

Three approaches:
Fine-grained (static scheduling):
Switches between threads on each cycle (interleaving threads in round-robin fashion).
A thread is skipped in a cycle if it is stalled, whether due to a data dependence or a cache miss (a thread-selection sketch follows this list).
No out-of-order execution, i.e., static scheduling; thus no dynamic scheduling hardware (e.g., reservation stations).
In effect, multiple 5-stage pipelines run in parallel.
Coarse-grained (static scheduling):
Switches threads only on costly stalls, such as level-two or level-three cache misses.
Simultaneous multithreading (SMT):
Fine-grained multithreading applied to a superscalar with dynamic scheduling.
Instead of issuing from only one thread at a time, SMT executes instructions from multiple threads in the same cycle, leaving it up to the hardware to associate instruction slots and renamed registers with their proper threads.
With plain fine-grained multithreading, only instructions whose operands are available can be issued, whereas with dynamic scheduling instructions whose operands are still being computed can also be issued.
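A minimal sketch in C of fine-grained thread selection with stalled-thread skipping; the stall flags and thread count are illustrative assumptions:

    #include <stdbool.h>

    #define NUM_THREADS 4

    /* Set by the pipeline when a thread is blocked (data dependence,
       cache miss, ...), cleared when the stall resolves. */
    static bool stalled[NUM_THREADS];

    /* Round-robin selection: each cycle, start after the thread picked
       last cycle and return the first non-stalled one. Returns -1 if
       every thread is stalled (a true pipeline bubble). */
    int select_thread(int last) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (!stalled[t])
                return t; /* fetch/issue from thread t this cycle */
        }
        return -1;
    }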
Multithreading Types
Fine-grained:
Increases multi-thread throughput, but increases the latency of each individual thread.
Coarse-grained:
Less likely to slow down the execution of any one thread.
But it has throughput losses:
A thread is not switched out on short stalls (e.g., an L1 cache miss), so those stalls go unhidden.
A thread is switched only when it actually stalls, so there is a stall (bubble) in the pipeline before every switch while the new thread's instructions refill the pipeline.
No major current processor uses this technique.
Simultaneous Multithreading

The effectiveness of SMT was explored in 2000-2001, under the assumption that dynamic superscalars would get much wider over the next few years, with:
six to eight issues per clock,
many simultaneous loads and stores,
large primary caches,
four to eight contexts with simultaneous issue and commit from all of them.
In practice, existing implementations of SMT offer:
only two to four contexts, with fetching and issue from only one at a time,
up to four issues per clock.
As a result, the gain from SMT is correspondingly more modest.