03 ILP: Speculation and Advanced Topics

Hardware-based speculation combines dynamic branch prediction, dynamic scheduling, and speculation to improve instruction-level parallelism (ILP). It introduces completion and commit stages after execution so that instructions along a predicted path can execute before the branch outcome is known. A reorder buffer holds speculative results until the branch outcome allows them to commit; it also replaces the store buffer and enables out-of-order execution with in-order commit. Multiple-issue processors issue several instructions per cycle, using static or dynamic scheduling, to further improve ILP. VLIW processors rely on the compiler for scheduling, while superscalar processors can schedule dynamically in hardware.


1

ILP: Hardware-Based Speculation and Multiple Issue
PREPARED BY SHAFIA HUSSAIN
2
Hardware-Based Speculation
Dynamic Branch Prediction + Dynamic Scheduling + Speculation

 Control dependences pose a major hurdle to ILP.
 Branch prediction reduces stalls, but in an out-of-order processor, instructions after a branch still wait for its result before executing.
 Speculation goes a step further and executes the instructions along the path suggested by the branch prediction.
 This is accomplished by introducing two separate stages after instruction execution: Completion and Commit.
 An instruction's result is available to dependent instructions after the Completion stage.
 The result is written back to the registers in Commit, after the branch outcome is known.
 If the branch prediction was wrong, none of the completed instructions from the mispredicted path are committed.
3
Reorder Buffer
 A circular buffer named the Reorder Buffer (ROB) is added after the functional units.
 Completed results reside in it until the instructions are no longer speculative, i.e., the branch result is known.
 It has two pointers, head and tail; at the start both point to the same location.
 The tail pointer increments with each new instruction.
 The head pointer increments with each commit.
4
Reorder Buffer
 The ROB entry number is recorded in the reservation stations (RSs) for operands that will be produced by in-flight instructions.
 Each ROB entry contains four fields: the instruction type, the destination field, the value field, and the ready field, which indicates that the instruction has completed execution.
 The reorder buffer also replaces the store buffer: a store writes its result to memory in the commit stage, which is the second step of store execution. (A sketch of the ROB follows.)
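A minimal sketch of the ROB as a circular buffer, in Python; the field names and buffer size are illustrative choices, not taken from the slides:

from dataclasses import dataclass

@dataclass
class ROBEntry:
    itype: str = ""       # instruction type: "alu", "store", or "branch"
    dest: str = ""        # destination register (memory address for stores)
    value: int = 0        # result recorded at completion
    ready: bool = False   # set once execution has completed

class ReorderBuffer:
    def __init__(self, size=16):
        self.entries = [ROBEntry() for _ in range(size)]
        self.head = 0      # oldest entry: next to commit
        self.tail = 0      # next free entry
        self.count = 0

    def allocate(self, itype, dest):
        # Issue: reserve the tail entry; its index is the tag sent to the RS.
        if self.count == len(self.entries):
            return None    # ROB full: issue must stall
        tag = self.tail
        self.entries[tag] = ROBEntry(itype, dest)
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def complete(self, tag, value):
        # Completion: the result becomes visible to dependents via this tag.
        self.entries[tag].value = value
        self.entries[tag].ready = True

    def commit(self):
        # Commit: retire the head entry, in program order, once it is ready.
        if self.count == 0 or not self.entries[self.head].ready:
            return None
        entry = self.entries[self.head]
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry       # caller writes entry.value to entry.dest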
6
Steps of Execution
Issue
 An instruction is issued only when there is an empty slot in both the reservation station and the ROB.
 Operands are provided if they are available in the registers or the ROB.
 The ROB entry number is also sent to the RS, where it is used as a tag to capture the result when it is broadcast on the CDB.
 Issue is in order, so:
 No RAW hazards for register instructions
 No WAR hazards for register and memory instructions
7
Steps of Execution
Execute
 If operands are not available, monitor the CDB.
 A load has two execution steps, both of which complete in this phase.
 For a store, only the effective address computation is completed here; data is written to memory in the commit stage, so a store only needs its base register at this stage.
 RAW hazards through memory are prevented by two restrictions (see the sketch after this list):
 Maintaining program order for the computation of a load's effective address with respect to all earlier stores.
 Not allowing a load to initiate the second step of its execution if any active ROB entry is occupied by a store with the same address.
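A minimal sketch of the second restriction, assuming each ROB entry for a store carries an addr field once its effective address is computed (an illustrative addition to the ROBEntry above):

def load_may_proceed(rob, load_addr):
    # Walk the active ROB entries from head to tail (program order).
    i, remaining = rob.head, rob.count
    while remaining > 0:
        e = rob.entries[i]
        # Per the slide's rule: block the load's memory access if any active
        # store entry targets the same address.
        if e.itype == "store" and getattr(e, "addr", None) == load_addr:
            return False
        i = (i + 1) % len(rob.entries)
        remaining -= 1
    return True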
8
Steps of Execution
Write Result
 The result is written to the CDB with its ROB tag, from where it reaches the ROB and any RS waiting for the result.
 The RS is then marked available.
 For store instructions, if the value to be stored is available, it is written into the Value field of the store's ROB entry; otherwise, the CDB must be monitored until that value is broadcast.
9
Steps of Execution
Commit
 Three scenarios arise when an instruction reaches the head of the ROB (see the sketch after this list):
 Branch
 If the prediction was correct, its execution completes here.
 Otherwise the entire ROB is flushed and execution restarts from the correct instruction.
 A store writes to its memory location if the data is available.
 Other instructions write their results to the registers.
 When an instruction commits, its ROB entry becomes vacant again for new instructions.
 Since commit is in order:
 There are no WAW hazards for registers or memory.
 There are no imprecise exceptions.
 Exceptions are recorded and raised only when the instruction reaches the head of the ROB.
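A minimal sketch of the commit step, reusing the ReorderBuffer above; treating the branch entry's value as a prediction-correct flag is an illustrative simplification:

def commit_step(rob, regfile, memory):
    entry = rob.commit()
    if entry is None:
        return                            # head not ready: wait
    if entry.itype == "branch":
        if not entry.value:               # prediction was wrong
            # Flush all speculative work; fetch restarts at the correct target.
            rob.head = rob.tail = rob.count = 0
    elif entry.itype == "store":
        memory[entry.dest] = entry.value  # stores write memory only at commit
    else:
        regfile[entry.dest] = entry.value # in-order register write-back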
10
Exploiting ILP Using Multiple Issue
 The goal of multiple-issue processors is to decrease CPI below 1 (equivalently, to raise IPC above 1).
 Three major flavors:
 Statically scheduled superscalar processors
 Instruction issue is dynamic (a hazard detection unit decides how many to issue), but execution is in order, so the compiler must schedule code to minimize stalls.
 VLIW (very long instruction word) processors
 Similar to the static superscalar, but without any hazard detection hardware; it is the compiler's job to avoid hazards and stalls.
 A fixed-size packet of instructions is issued.
 Dynamically scheduled superscalar processors
 Dynamic issue and scheduling; may or may not use speculation.
11
Comparison of Different Multi-Issue Processors (comparison table shown as a figure)
12
Multi-issue and Static Scheduling
VLIW
 The advantages of the static superscalar diminish as issue width grows, because instructions execute in order and any hazard stalls the processor.
 A VLIW processor issues a fixed-length packet containing multiple instructions.
 Each functional unit has a dedicated slot in the packet, e.g., integer operations in slots 1 and 2, loads/stores in slot 5.
13
VLIW
 To maximize the benefit of multiple issue, as many slots as possible (ideally all) should be utilized.
 But hazards such as RAW must be avoided, and this is the job of the programmer or compiler.
 ILP can be found by exploiting parallelism within a basic block as well as by loop unrolling.
 If exploiting the parallelism requires scheduling code across branches, a substantially more complex global scheduling algorithm (e.g., trace scheduling) must be used. (Not covered in this course.)
14
Example: C code and its RISC-V assembly (shown as figures); the loop adds the scalar held in f19 to each element of an array addressed through x1.
15
VLIW schedule (one packet per row; empty slots are nops):

Mem1             | Mem2             | FP op1            | FP op2            | Integer/branch
fld f0,0(x1)     | fld f1,-8(x1)    |                   |                   |
fld f2,-16(x1)   | fld f3,-24(x1)   |                   |                   |
fld f4,-32(x1)   | fld f5,-40(x1)   | fadd.d f6,f0,f19  | fadd.d f7,f1,f19  |
                 |                  | fadd.d f8,f2,f19  | fadd.d f9,f3,f19  |
                 |                  | fadd.d f10,f4,f19 | fadd.d f11,f5,f19 |
fsd f6,0(x1)     | fsd f7,-8(x1)    |                   |                   |
fsd f8,-16(x1)   | fsd f9,-24(x1)   |                   |                   | addi x1,x1,-48
fsd f10,16(x1)   | fsd f11,8(x1)    |                   |                   | bne x1,x2,loop
16
VLIW Problems
 Technical
 Code size
 Loop unrolling inflates the code.
 Unused functional-unit slots hold nops (wasted bits).
 Clever encoding helps, e.g., one large immediate field shared by all functional units.
 Lockstep operation
 A stall in one functional unit stalls the whole processor, and stalls can be unpredictable, e.g., a cache miss.
 Logistical
 Binary code compatibility (code is scheduled for one specific implementation).
17
Superscalar
 Multiple instructions are issued dynamically.
 Structural hazards are handled by the processor.
 Statically scheduled versions perform in-order execution:
 If a dependent instruction is stalled, all instructions following it are also stalled.
 Dynamically scheduled versions can use a Tomasulo-like scheme and perform out-of-order execution.
 Of course, multiple functional units are required.
18
Superscalar: Issue Stage
 Issue logic is the most complex part and is the bottleneck.
 Instructions inside an issue packet may depend on each other.
 So dependences must be determined within a cycle, and then all reservation stations must be reserved and updated in parallel.
 Every possible combination of dependent instructions in the same clock cycle must be considered, and the number of combinations grows with the issue width; e.g., in a 4-issue processor the 4th instruction can depend on the 1st, the 3rd on the 2nd, and so on.
 Issuing multiple instructions can be accomplished in two ways, and a modern processor may use both:
 Divide the issue stage into two stages (superpipelining) running at twice the external clock (for 2-issue). But tables still need to be updated for the following instruction, and this cannot easily be extended to 4-issue.
 Widen the issue logic so that it can issue multiple instructions at once: instruction dependences are analyzed in parallel and the reservation stations are updated together.
19
Superscalar: Considerations per Cycle
Issue
 Preallocate n reorder buffer entries while also making sure that reservation stations are available for any combination of instructions.
 To make preallocation succeed, reservation-station demand can be reduced by limiting the number of instructions for any particular functional unit (e.g., only 2 floating-point multiplications per packet).
 If there is still a shortage of reservation stations, the bundle is broken and the remaining instructions are issued in the next packet.
 Analyze all dependences among the instructions (see the sketch after this list):
 If instructions depend on instructions inside the bundle, use the newly assigned reorder buffer entries to update the tables.
 If instructions depend on earlier instructions, use the existing reservation table entries.
 Execution and Commit: multiple instructions can commit per cycle.
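A minimal sketch of intra-packet dependence handling, reusing the ReorderBuffer above; it processes the bundle sequentially for clarity, whereas real hardware resolves all of it in parallel within one cycle, and the instruction format is illustrative:

def issue_packet(packet, rename_map, rob):
    # packet: list of (dest, src1, src2) architectural register names.
    # rename_map: architectural register -> ROB tag of its latest producer.
    issued = []
    for dest, src1, src2 in packet:
        # A source written earlier in this same bundle picks up the ROB tag
        # assigned just below; otherwise the existing table entry is used.
        src_tags = (rename_map.get(src1, src1), rename_map.get(src2, src2))
        tag = rob.allocate("alu", dest)
        if tag is None:            # out of ROB entries: split the bundle;
            break                  # the rest issues in the next packet
        rename_map[dest] = tag     # later bundle members now see this tag
        issued.append((tag, src_tags))
    return issued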
20
Example (worked example shown as figures on slides 20-22)
23
Advanced Techniques for Instruction Delivery and Speculation
 In multiple issue, special arrangements are needed to maintain the issue rate.
 This requires:
 A wider fetch path
 Removing branch stalls
 Advanced speculation techniques
24
Increasing Instruction Fetch Bandwidth:
Using Branch Target Buffer
 Branch prediction reduces stalls but doesn't eliminate them.
 In the standard 5-stage pipeline we still have 1 stall with branch prediction if the branch is taken, because the branch target is computed in the decode stage; so there is still one bubble even when the prediction is correct and the branch is taken.
 A branch target buffer (BTB) is like a cache: it stores the target addresses of taken branches.
 In the fetch stage, the address of the instruction is matched against the BTB entries (before the instruction is even decoded), as sketched below.
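A minimal sketch of a BTB, in Python; a dict stands in for the tagged, cache-like structure:

class BranchTargetBuffer:
    def __init__(self):
        self.table = {}             # PC of a taken branch -> its target PC

    def lookup(self, pc):
        # Fetch stage: on a hit, fetch from the stored target next cycle.
        return self.table.get(pc)   # None: predict fall-through (not taken)

    def update(self, pc, taken, target):
        # After the branch resolves: insert taken branches, evict others.
        if taken:
            self.table[pc] = target
        else:
            self.table.pop(pc, None)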
28
Example
 On average, 0.38 cycles are wasted per branch; the 1 cycle for the branch's own execution is excluded. (The calculation is sketched below.)
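The 0.38 figure is consistent with the standard textbook BTB example, assuming a 90% buffer hit rate, 90% prediction accuracy on hits, a 2-cycle penalty, and that every branch that misses in the buffer pays the same penalty; a sketch of that calculation:

\[
\text{Penalty} = \underbrace{0.90 \times 0.10 \times 2}_{\text{hit, mispredicted}}
               + \underbrace{0.10 \times 2}_{\text{miss}}
               = 0.18 + 0.20 = 0.38 \ \text{cycles per branch}
\]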
Branch Folding
• One variation on the branch target buffer is to store one or more target instructions instead of, or in addition to, the predicted target address.
• Storing the actual target instructions allows us to perform an optimization called branch folding (consider it together with out-of-order execution).
• Branch folding can be used to obtain 0-cycle unconditional branches and sometimes 0-cycle conditional branches (i.e., because the IF stage for the target is effectively skipped: the buffer supplies the target instruction directly).
29
Increasing Instruction Fetch Bandwidth:
Specialized Branch Predictors for Indirect Jumps
 Indirect branches, such as indirect jumps and procedure returns, pose an extra challenge for speculation.
 These jumps use a jump-register instruction in assembly:
 A switch statement is compiled into a jump table of addresses in memory. The index into the table is computed from the input and the table's starting address; the target address is loaded into a register, and jr reg jumps to it.
 Similarly, the jump back from a procedure also uses jr reg.
 In both cases the instruction is the same but the target address varies, which seriously hurts prediction.
30
Increasing Instruction Fetch Bandwidth:
Specialized Branch Predictors for Indirect Jumps
 The procedure-return problem can be improved by adding a small buffer (cache) that acts as a stack for these indirect jumps.
 It contains the most recent procedure return addresses.
 On each call the return address is pushed onto this stack, and on return it is popped.
 So procedure returns use a separate return-address predictor instead of the predictor for direct branches, as sketched below.
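A minimal sketch of a return-address stack, in Python; the fixed depth is illustrative:

class ReturnAddressStack:
    def __init__(self, depth=16):
        self.stack = []
        self.depth = depth                # real predictors are small, fixed-size

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # Predicted target for the return; None if the stack has underflowed.
        return self.stack.pop() if self.stack else None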
31
Increasing Instruction Fetch Bandwidth: Integrated Instruction Fetch Units
 Divide the processor into two ends: a front end and a back end.
 The front end is the instruction fetch unit, so fetch is no longer considered just one stage of the pipeline.
 The interface between the front end and the back end is a buffer that contains the instructions to be issued.
 The fetch unit is now an autonomous unit with the following features:
 Integrated branch prediction: the branch predictor is part of the fetch unit and is constantly predicting branches.
 Instruction prefetch: the unit autonomously manages the prefetching of instructions, integrating it with branch prediction to deliver multiple instructions per clock.
 Instruction memory access and buffering: to maintain bandwidth in a multiple-issue processor, fetch may require jumping to different blocks in the cache; the delay is hidden using prefetching and by including a buffer inside the fetch unit.
32
Speculation: Implementation Issues and Extensions
Renaming using a Merged Register File
 Renaming using the reorder buffer has certain disadvantages (and also advantages, which is why high-end processors still use it):
 Register values are written twice, once into the reorder buffer and again into the register file when the instruction commits, so more power is consumed.
 A source operand may come from two locations, the register file or the reorder buffer, which increases the interconnect needed to deliver data to the reservation stations.
 A merged register file has two sets of registers:
 Architectural registers: visible to the programmer (compiler).
 Physical registers: not visible; used for renaming. They hold speculative results, like the reorder buffer, and in one variant also hold committed values.
33
Renaming using a Merged Register File
 There are two variants (with little difference between them):
 Architectural registers physically exist, and a separate set of extended physical registers holds speculative results for renaming; at commit, values are moved into the architectural registers.
 Architectural registers are purely logical; the physical registers hold both speculative and committed values.
 In both cases the physical registers far outnumber the architectural registers.
 The reorder buffer's function of keeping track of commit order is still present.
34
Renaming using a Merged Register File
 A renaming table maps architectural registers to physical registers and also keeps track of the free physical registers.
 Every destination register of an instruction being issued is assigned a free physical register; this eliminates the WAR and WAW dependences.
 Consequently, following instructions that have a RAW dependence update their source operands to the new mapping likewise.
 So an architectural register can be mapped to multiple physical registers.
 When a later mapping of an architectural register commits, the previous physical register is freed. For example, if r1 is mapped to physical registers r5 and then r6, then when r6 commits we are sure that r5 has no further use, and it can be added to the free list.
35
Example

i1: Add r1,r2,r4
i2: Sub r3,r1,r5
i3: Mul r5,r1,r6
i4: Add r1,r6,r5

 RAW between i2 and i1 for r1
 WAR between i3 and i2 for r5
 WAW between i4 and i1 for r1

Suppose the renaming table is initially:

Physical register | Architectural register
r1                | r1
r2                | r2
r3                | r3
r4                | r4
r5                | r5
r6                | r6
r7 to r12         | free

36
 The code after mapping is:

i1: Add r7,r2,r4
i2: Sub r8,r7,r5
i3: Mul r9,r7,r6
i4: Add r10,r6,r9

Physical register | Architectural register
r1 to r6          | r1 to r6 (unchanged)
r7                | r1
r8                | r3
r9                | r5
r10               | r1
r11, r12          | free

• As we can see, all WAR and WAW dependences are removed and the RAW dependences are maintained.
• Since r1 was already mapped to physical r1 before the first instruction, it has already been used as an operand or destination of some instruction; using it as a new destination without renaming would cause a WAR or WAW hazard.
• As can be seen, an architectural register can be mapped to multiple physical registers, which may cause a shortage if registers go unused but are not freed.
• How do we know when a physical register can be freed?
• The safest and easiest (but not optimal) way is to wait until the latest register for the same architectural register commits; then the previous one can be freed.
• For example, r7 (the earlier mapping of r1) can be freed when r10 (the later mapping of r1) commits, because no instruction depending on r7 can exist after that point. (A sketch of the renaming step follows.)
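A minimal sketch of the renaming step, in Python; running it reproduces the mapping from the example above:

def rename(code, num_phys=12):
    mapping = {f"r{i}": f"r{i}" for i in range(1, 7)}   # arch -> physical
    free = [f"r{i}" for i in range(7, num_phys + 1)]    # free physical regs
    renamed = []
    for op, dest, src1, src2 in code:
        s1 = mapping[src1]          # RAW sources read the latest mapping
        s2 = mapping[src2]
        phys = free.pop(0)          # a fresh physical register for the
        mapping[dest] = phys        # destination removes WAR and WAW hazards
        renamed.append((op, phys, s1, s2))
    return renamed

code = [("Add", "r1", "r2", "r4"),
        ("Sub", "r3", "r1", "r5"),
        ("Mul", "r5", "r1", "r6"),
        ("Add", "r1", "r6", "r5")]
for instr in rename(code):
    print(instr)    # destinations come out as r7, r8, r9, r10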
37
Speculation: Implementation Issues and Extensions
The Challenge of More Issues per Clock
 Once branch prediction is accurate and speculation works well, the next lever is to raise the issue rate.
 Duplicating functional units is straightforward.
 The renaming example again shows that the complexity of issue is the bottleneck (and likewise commit, which is its dual).
38
The Challenge of More Issues per Clock
 The difficulty is that the entire task of finding dependences and rewriting register identifiers must be done in a single cycle.
39
Speculation: Implementation Issues and Extensions
How Much to Speculate
 Speculation has many advantages, but it also has disadvantages that limit how much we speculate:
 It takes time and energy to recover from incorrect speculation.
 To maintain a high execution rate while speculating, the processor needs additional resources, which cost area and power:
 Better branch prediction (more caches/buffers, i.e., predictors)
 Multiple functional units
 If speculation triggers a costly exceptional event (e.g., a cache miss), it can significantly reduce performance.
 Most processors therefore allow only low-cost events (such as a first-level cache miss) to be handled speculatively.
40
Speculation and the Challenge of Energy Efficiency
 Instructions that are speculated but whose results are not needed generate excess work for the processor, wasting energy.
 Undoing the speculation and restoring the state of the processor consumes additional energy.
 However, if speculation lowers the execution time by more than it increases the average power consumption, then the total energy consumed may be less.
 SPEC2000 benchmark results show that misspeculation is high for integer programs and low for floating-point programs, so a cleverer scheme is needed.
41
(Figure: misspeculation rates for SPEC2000 benchmarks)
42
Speculating Through Multiple Branches
 So far we have only seen speculation through a single branch.
 Database and integer programs have clustering of branches, so they may benefit from speculating through multiple branches, and in some cases multiple branches per clock (as of 2017, no processor resolves multiple branches per clock).
 Multiple branches per cycle means multiple predictions per cycle, and hence multiported branch-prediction buffers.
 It also means fetching instructions from non-contiguous locations.
43
Address Aliasing Prediction
 Predicts whether two stores (a WAW hazard), or a load and a store (RAW and WAR hazards), refer to the same memory address.
 If they do not refer to the same address, we can interchange them.
 We don't need to predict the exact addresses, only whether the two addresses match (a sketch follows this list).
 The processor must also have a mechanism to recover after a misprediction, as in the case of mispredicted branches.
 It is a simple form of value prediction (which, after almost 15 years of research, is still not available in any processor).
 In value prediction, we predict the value produced by an instruction and thus eliminate data-dependence restrictions.
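A minimal sketch of a history-based alias predictor, in Python; the 2-bit saturating-counter scheme is an illustrative choice, not from the slides:

class AliasPredictor:
    def __init__(self):
        self.counters = {}      # (load PC, store PC) -> 2-bit counter

    def may_reorder(self, load_pc, store_pc):
        # Predict "no alias" (safe to reorder) while the counter is weak.
        return self.counters.get((load_pc, store_pc), 0) < 2

    def update(self, load_pc, store_pc, aliased):
        # Strengthen toward "alias" when the addresses did match, else decay.
        c = self.counters.get((load_pc, store_pc), 0)
        self.counters[(load_pc, store_pc)] = min(c + 1, 3) if aliased else max(c - 1, 0)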
44
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
 Multithreading exploits thread-level parallelism, but it improves pipeline utilization, which is why it is discussed here.
 Even with speculation and prefetching, there is still a chance that not all stalls can be hidden, e.g., a cache miss.
 Multithreading is a technique in which multiple threads share a processor without the need for a context switch.
 Threads of a process have separate PCs and state (registers) but share the same address space (i.e., code and data segments).
 So in a processor that supports multithreading, there are separate PCs and registers for each thread, but all threads share the functional units.
 Thus, where in-order execution would stall the processor, the stall is hidden by executing instructions from other threads in the meantime.
 For dynamic scheduling, a per-thread renaming table is needed, along with the separate registers and PC.
 Of course, to benefit from multithreading, a program must have multiple threads.
45
Multithreading Types
 Three approaches:
 Fine-grained (static scheduling):
 Switch between threads on each cycle, interleaving them in round-robin fashion (a sketch appears after this list).
 A thread is skipped if it is stalled, whether due to a data dependence or a cache miss.
 No out-of-order execution, i.e., static scheduling, so no dynamic-scheduling hardware (e.g., reservation stations); think of it as multiple 5-stage pipelines in parallel.
 Coarse-grained (static scheduling):
 Switches threads only on costly stalls, such as level-two or level-three cache misses.
 Simultaneous multithreading (SMT):
 Fine-grained multithreading used on a superscalar with dynamic scheduling.
 Fetch issues from one thread at a time, but SMT executes instructions from multiple threads, leaving it to the hardware to associate instruction slots and renamed registers with their proper threads.
 In fine-grained multithreading, only instructions whose operands are available are issued, whereas with dynamic scheduling, instructions whose operands are still being computed can also be issued.
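A minimal sketch of fine-grained thread selection, in Python, assuming each hardware thread exposes an illustrative stalled flag:

def pick_thread(threads, last):
    # Round-robin over hardware threads, skipping any that are stalled
    # (data dependence or cache miss); returns None if all are stalled.
    n = len(threads)
    for step in range(1, n + 1):
        t = (last + step) % n
        if not threads[t].stalled:
            return t
    return None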
47
Multithreading Types
 Fine-grained:
 Increases multi-thread throughput but increases the latency of a single thread.
 Coarse-grained:
 Less likely to slow down the execution of any one thread.
 But it has throughput losses:
 A thread is not switched out on short stalls (e.g., an L1 cache miss).
 A thread is switched only on a stall, so there is a stall (bubble) in the pipeline before every switch.
 No major current processor uses this technique.
48
Simultaneous Multithreading
 The effectiveness of SMT was explored in 2000-2001, assuming the dynamic superscalar would get much wider over the next few years, with:
 Six to eight issues per clock
 Many simultaneous loads and stores
 Large primary caches
 Four to eight contexts with simultaneous issue and commit
 In practice, existing implementations of SMT offer:
 Only two to four contexts, with fetch and issue from only one at a time
 Up to four issues per clock
 The result is that the gain from SMT is correspondingly more modest.
