Exploiting Instruction-Level Parallelism With Software Approaches
Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Approach    Issue structure   Hazard detection   Scheduling      Examples
VLIW/LIW    static            software           static          Trimedia, i860
EPIC        mostly static     mostly software    mostly static   Itanium
Latencies assumed in the examples:

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU operation               Another FP ALU operation   3
FP ALU operation               Store double               2
Load double                    FP ALU operation           1
Load double                    Store double               0
Loop Example
Add a scalar to an array:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

Iterations of the loop are parallel: there are no dependences between iterations.
Straightforward Conversion
- R1 holds the address of the highest array element
- F2 holds the scalar
- R2 is pre-computed so that 8(R2) is the last element
loop: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar in F2
      S.D    F4, 0(R1)      ; store result
      DADDUI R1, R1, #-8    ; decrement pointer (DW)
      BNE    R1, R2, loop   ; branch if R1 != R2
(Figure: OLD vs. NEW ordering of the loop instructions, annotated with the 2-cycle latency between ADD.D and the dependent S.D.)
Compiler Tasks
loop: L.D    F0, 0(R1)      ; issued in clock cycle 1
      DADDUI R1, R1, #-8    ; 2
      ADD.D  F4, F0, F2     ; 3
      stall                 ; 4
      BNE    R1, R2, loop   ; 5
      S.D    F4, 8(R1)      ; 6
- OK to reorder DADDUI and ADD.D
- OK to reorder S.D and BNE
- OK to reorder DADDUI and S.D, but this requires changing the S.D offset from 0(R1) to 8(R1). This one is difficult, since the dependence through R1 runs in the reverse direction: the S.D must use the value R1 had before the decrement.
Loop Overhead
loop: L.D    F0, 0(R1)      ; issued in clock cycle 1
      DADDUI R1, R1, #-8    ; 2
      ADD.D  F4, F0, F2     ; 3
      stall                 ; 4
      BNE    R1, R2, loop   ; 5
      S.D    F4, 8(R1)      ; 6
Six cycles is the minimum, due to the dependences and pipeline latencies. The actual work of the loop is just 3 instructions: L.D, ADD.D, S.D. The rest is loop overhead.
Loop Unrolling
- Eliminate some of the overhead by unrolling the loop, fully or partially (see the source-level sketch below)
- Need to adjust the loop termination code
- Allows more parallel instructions in a row
- Allows more flexibility in reordering
- Usually requires register renaming
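A source-level sketch of unrolling by four (the compiler actually unrolls at the instruction level, as on the next slide); it assumes the trip count, 1000, is a multiple of 4:

    for (i = 1000; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }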
Unrolled Version
loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, loop

- Assume that the number of iterations is a multiple of 4
- Decrement R1 by 32 for these 4 iterations
- More registers are required to avoid unnecessary dependences
Unroll: a loop of n iterations becomes n/k iterations of a body containing k copies (unrolled) of the original loop body.
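When the number of iterations n is not known to be a multiple of k, the loop can be strip-mined: a first loop runs the n mod k leftover iterations and the unrolled loop handles the rest. A sketch in C for k = 4 (x, s, n, i as in the example; leftover is an illustrative name):

    int leftover = n % 4;
    for (i = n; i > n - leftover; i = i - 1)       /* the n mod 4 leftover iterations */
        x[i] = x[i] + s;
    for (i = n - leftover; i > 0; i = i - 4) {     /* unrolled by 4                   */
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }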
Summary of Example
Version                  Cycles per element   Code size (instructions)
Unscheduled              10                   5
Scheduled                6                    5
Unrolled                 7                    14
Unrolled and Scheduled   3.5                  14
Compiler limitations
Shortfall in registers: unrolling increases the number of simultaneously live values, which may exceed the number of available registers.
Unrolled five times and scheduled for a two-issue pipeline (one integer/memory instruction and one FP instruction per clock cycle):

      Integer instruction        FP instruction         Clock cycle
loop: L.D    F0, 0(R1)                                  1
      L.D    F6, -8(R1)                                 2
      L.D    F10, -16(R1)        ADD.D F4, F0, F2       3
      L.D    F14, -24(R1)        ADD.D F8, F6, F2       4
      L.D    F18, -32(R1)        ADD.D F12, F10, F2     5
      S.D    F4, 0(R1)           ADD.D F16, F14, F2     6
      S.D    F8, -8(R1)          ADD.D F20, F18, F2     7
      S.D    F12, -16(R1)                               8
      DADDI  R1, R1, #-40                               9
      S.D    F16, 16(R1)                                10
      BNE    R1, R2, loop                               11
      S.D    F20, 8(R1)                                 12
Summary of Example
Version                                          Cycles per element   Code size (instructions)
Unscheduled                                      10                   5
Scheduled                                        6                    5
Unrolled (4)                                     7                    14
Unrolled (4) and Scheduled                       3.5                  14
Unrolled (5) and Scheduled in Multi-Issue Pipe   2.4                  17
Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Detecting Parallelism
Loop-level parallelism
Analyzed at the source level; requires recognition of array references, loops, and indices.

    for (k=1; k<=100; k=k+1) {
        A[k] = A[k] + B[k];      /* S1 */
        B[k+1] = C[k] + D[k];    /* S2 */
    }

There is a loop-carried dependence (S2 produces B[k+1], which S1 uses in the next iteration), but no cycle: B[k+1] does not have B[k] as a source.
Transformation
- The two statements can be interchanged.
- The first iteration of the first statement can be computed outside the loop, so that A[k+1] is computed within the loop.
- The last iteration of the second statement must also be computed outside the loop.

    A[1] = A[1] + B[1];
    for (k=1; k<=99; k=k+1) {
        B[k+1] = C[k] + D[k];
        A[k+1] = A[k+1] + B[k+1];
    }
    B[101] = C[100] + D[100];
This exposes the parallelism: the transformed loop has no loop-carried dependence, so its iterations can be executed in parallel.
Recurrences
    for (i=2; i<=100; i=i+1) {
        Y[i] = Y[i-n] + Y[i];
    }
Y[i] depends on itself, but uses the value from an earlier iteration; n is the dependence distance (most often n = 1). The larger n is, the more parallelism is available: with n = 3, for example, the iterations fall into three independent dependence chains that can proceed in parallel.
Finding Dependencies
Important for
- Efficient scheduling
- Determining which loops to unroll
- Eliminating name dependences
Dependencies in Arrays
An array index is affine if it can be written as a*i + b, where i is the loop index and a, b are constants (for a one-dimensional array). An index into a multi-dimensional array is affine if the index in each dimension is affine.
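For instance (a sketch; x, y, idx, n, and s are illustrative names):

    for (i = 0; i < n; i = i + 1) {
        x[3*i + 2] = x[3*i] * 5.0;    /* affine indices: a*i + b with (a=3, b=2) and (a=3, b=0) */
        y[idx[i]]  = y[idx[i]] + s;   /* non-affine: the index comes from another array         */
    }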
For two references with affine indices a*i + b and c*i + d, a dependence is possible only if GCD(a,c) divides (d - b) evenly; if it does not, no dependence exists. (The test is necessary but not sufficient.)
Example
    for (k=1; k<=100; k=k+1) {
        X[2*k+3] = X[2*k] * 5.0;
    }
GCD test: a=2, b=3, c=2, d=0. A dependence requires that GCD(a,c) divide (d-b) evenly: GCD(2,2) = 2 and d-b = -3, and 2 does not divide -3, so there is no dependence.

    k:     1   2   3   4   5   6   7   ...  100
    2k+3:  5   7   9  11  13  15  17   ...  203   (always odd)
    2k:    2   4   6   8  10  12  14   ...  200   (always even)

The two index sets never refer to the same element.
In general, determining whether a dependence exists is NP-complete; there are exact tests for restricted situations.
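A small C sketch of the GCD test for two affine accesses X[a*i + b] and X[c*i + d]; the function names are illustrative, not from any particular compiler:

    /* Euclid's algorithm on magnitudes. */
    static int gcd(int a, int b) {
        if (a < 0) a = -a;
        if (b < 0) b = -b;
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* Returns 0 when no dependence is possible between X[a*i+b] and X[c*i+d];
       returns 1 when a dependence MAY exist (the test is necessary, not sufficient). */
    static int gcd_test_may_depend(int a, int b, int c, int d) {
        int g = gcd(a, c);
        if (g == 0)                /* both coefficients zero: same element only if b == d */
            return b == d;
        return (d - b) % g == 0;
    }

    /* The example above: X[2k+3] = X[2k]*5.0  ->  gcd_test_may_depend(2, 3, 2, 0) returns 0. */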
Dependency Classification
Different dependences are handled differently:
- Anti-dependences and output dependences (name dependences): eliminate by renaming.
- Real (true) dependences: try to reorder so that dependent instructions are separated by the length of the latency.
Example
Find the dependences:
- True dependences
- Output dependences
- Anti-dependences

    for (i=1; i<=100; i=i+1) {
        Y[i] = X[i] / c;     /* S1 */
        X[i] = X[i] + c;     /* S2 */
        Z[i] = Y[i] + c;     /* S3 */
        Y[i] = c - Y[i];     /* S4 */
    }

True dependences: S3 and S4 use the Y[i] written by S1.
Anti-dependences: S1 reads X[i] before S2 writes it; S3 reads Y[i] before S4 writes it.
Output dependence: S1 and S4 both write Y[i].
Example
Eliminate the output dependence (this also eliminates the second anti-dependence): rename the Y written by S1, and its uses, to T.

    for (i=1; i<=100; i=i+1) {
        T[i] = X[i] / c;     /* S1 */
        X[i] = X[i] + c;     /* S2 */
        Z[i] = T[i] + c;     /* S3 */
        Y[i] = c - T[i];     /* S4 */
    }

The remaining anti-dependence is S1 reading X[i] before S2 writes it.
Example
Eliminate the anti-dependence: rename the X written by S2 to S. The final result is a parallel loop that can be unrolled.

    for (i=1; i<=100; i=i+1) {
        T[i] = X[i] / c;     /* S1 */
        S[i] = X[i] + c;     /* S2 */
        Z[i] = T[i] + c;     /* S3 */
        Y[i] = c - T[i];     /* S4 */
    }

Only the true dependences within an iteration remain (S1 to S3 and S1 to S4); there are no loop-carried dependences.
Software Pipelining
Interleaves instructions from different iterations of a loop without unrolling:
- Each iteration of the new loop is made from instructions taken from different iterations of the original loop.
- Software counterpart to Tomasulo's algorithm.
- Start-up and finish-up code is required.
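A source-level sketch of the software-pipelined x[i] = x[i] + s loop (assuming at least three iterations; t_load and t_add are illustrative temporaries standing in for F0 and F4):

    double t_load = x[1000];        /* start-up: load for iteration 1000          */
    double t_add  = t_load + s;     /* start-up: add for iteration 1000           */
    t_load = x[999];                /* start-up: load for iteration 999           */
    for (i = 1000; i > 2; i = i - 1) {
        x[i]   = t_add;             /* store for iteration i                      */
        t_add  = t_load + s;        /* add for iteration i-1                      */
        t_load = x[i - 2];          /* load for iteration i-2                     */
    }
    x[2] = t_add;                   /* finish-up: store for iteration 2           */
    x[1] = t_load + s;              /* finish-up: add and store for iteration 1   */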
Software Pipelining
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Symbolic unrolling: take one instruction from each of three consecutive iterations.

    Iteration i:    L.D   F0, 0(R1)
                    ADD.D F4, F0, F2
                    S.D   F4, 0(R1)
    Iteration i+1:  L.D   F0, 0(R1)
                    ADD.D F4, F0, F2
                    S.D   F4, 0(R1)
    Iteration i+2:  L.D   F0, 0(R1)
                    ADD.D F4, F0, F2
                    S.D   F4, 0(R1)

New loop:

Loop: S.D    F4, 16(R1)    ; store to M[i]
      ADD.D  F4, F0, F2    ; add to M[i-1]
      L.D    F0, 0(R1)     ; load M[i-2]
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Software Pipelining
New loop:

Loop: S.D    F4, 16(R1)    ; store to M[i]
      ADD.D  F4, F0, F2    ; add to M[i-1]
      L.D    F0, 0(R1)     ; load M[i-2]
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Result:
- 1 cycle per instruction
- 1 loop iteration per 5 cycles
- Less code space than unrolling

Rescheduled loop (the load fills the branch delay slot, with its offset adjusted for the decrement):

Loop: S.D    F4, 16(R1)    ; store to M[i]
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2    ; add to M[i-1]
      BNE    R1, R2, Loop
      L.D    F0, 8(R1)     ; load M[i-2]
In the software-pipelined loop, the overhead still includes the branch and counter-update instructions, which are not easy to overlap.

Global Code Scheduling
Finding the shortest possible sequence requires identifying the critical path, the longest sequence of dependent instructions. Moving code across branches is complex: trace exits and re-entrances require much bookkeeping.
Trace Scheduling
Advantages
- Eliminates some hard decisions in global code scheduling
- Good for code such as scientific programs with intensive loops and predictable behavior
Disadvantages
significant overhead in compensation code when trace must be exited
Superblocks
Similar to trace scheduling, but a superblock has only ONE entry point. When the superblock is exited, a duplicated copy of the remaining code is used (tail duplication).
Chapter 4 Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Review
Techniques: loop unrolling, software pipelining, trace scheduling, global code scheduling.
Problem: these techniques increase parallelism only when branch behavior is known (predictable).
Hardware Options
Instruction set change:
- Conditional instructions
  Example: conditional move
      CMOVZ R1, R2, R3       ; R1 <- R2 if R3 = 0
- Predicated instructions
  Example: predicated load
      LWC R1, 9(R2), R3      ; R1 <- M[R2+9] if R3 != 0
Conditional Moves
Can be used to eliminate some branches
if (A==0) {S=T;}
Let A, S, T be assigned to registers R1, R2, R3.

Code without a conditional move:
        BNEZ  R1, L
        ADDU  R2, R3, R0
L:
Using a conditional move:
        CMOVZ R2, R3, R1

The control dependence is converted into a data dependence.
Conditional Moves
Useful for conversions such as absolute value A = abs (B)
if (B < 0) { A = -B; } else { A = B; }
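In C this is just a conditional expression, which a compiler with conditional moves can compile without a branch (a sketch):

    A = (B < 0) ? -B : B;     /* can be compiled to a conditional move instead of a branch */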
Conditional Moves
- Useful for short sequences
- Not efficient for branches that guard large blocks of code
- The simplest form of predicated instruction
Predication
Execution of an instruction is controlled by a predicate.
- When the predicate is false, the instruction becomes a no-op.
- Full predication: all instructions can be predicated.
- Full predication allows conversion (if-conversion) of large blocks of branch-dependent code, as sketched below.
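A source-level sketch of the idea behind if-conversion (the names are illustrative); a fully predicated ISA lets the compiler guard each instruction of either branch with a predicate rather than computing a select:

    int if_converted(int a, int s, int t, int u) {
        int p = (a == 0);    /* the predicate, computed once                     */
        s = p ? t : u;       /* both paths become straight-line code; the branch */
                             /* is replaced by a data-dependent selection        */
        return s;
    }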
Examples
Support conditional moves:
MIPS, Alpha, PowerPC, SPARC, Intel x86
Compiler Speculation
To speculate ambitiously, must have
1. The ability to find instructions that can be speculatively moved and not affect the program data flow.
2. The ability to ignore exceptions in speculated instructions, until it is certain they should occur.
3. The ability to speculatively interchange loads and stores, which may have address conflicts.
The last two require hardware support.
Exception Categories
Those that are handled and then normally resumed (page fault, I/O request, etc) (Resuming exceptions)
Those that indicate program error (overflow, memory protection fault, etc) (Terminating exceptions)
Approach 1.
HW and OS cooperatively ignore exceptions for speculated instructions: resuming exceptions are handled as usual, while terminating exceptions are ignored for speculative instructions.
Resuming exceptions
Handling an exception for a speculative instruction causes a performance penalty, but not incorrect program behavior.
Terminating exceptions
In this approach, HW and OS return an undefined value for any exception that would normally cause termination. Because termination SHOULD result when such exceptions occur, an INCORRECT program that would have terminated now simply continues with wrong results; correct programs are unaffected. Used for "fast mode" in some processors.
Approach 2.
Same as approach 1, but add instructions to check for terminating exceptions.
Example:
        LD     R1, 0(R3)     ; load A
        sLD    R14, 0(R2)    ; speculative, no termination
        BNEZ   R1, L1        ; test A
        SPECCK 0(R2)         ; check for speculation exception
        J      L2            ; skip else
L1:     DADDI  R14, R1, #4   ; else clause
L2:     SD     R14, 0(R3)    ; store A
- Both correct and incorrect programs execute without error.
- Requires additional checking instructions.
Approach 3.
- Track exceptions as they occur, but postpone terminating exceptions.
- Requires a poison bit for each register and a speculation bit for each instruction.
- A terminating exception causes the poison bit of the result register to be set.
- Speculative instructions that use a poisoned result pass the poison bit on to their result.
- A non-speculative instruction that uses a poisoned result causes termination.
- Stores cannot be speculative, since memory locations cannot have poison bits.
Approach 3.
Example
Example:
        LD    R1, 0(R3)      ; load A
        sLD   R14, 0(R2)     ; speculative; set poison bit for R14 on exception
        BEQZ  R1, L1         ; test A
        DADDI R14, R1, #4    ; else clause
L1:     SD    R14, 0(R3)     ; store A; if the poison bit is set, fault
Approach 4.
A hardware mechanism buffers the results of speculative instructions until they are known to be no longer speculative (as in hardware-based speculation with a reorder buffer).
Itanium Implementation
- Functional units and instruction issue
- Performance
Register Model
- 128 general-purpose (integer) registers
- 128 floating-point registers
- 64 1-bit predicate registers
- 8 branch registers
- Other registers for system control, memory mapping, performance counters, and communication with the OS
Integer Registers
- R0-R31 are always accessible.
- R32-R127 are implemented as a register stack: each procedure is allocated a set of these registers.
- The CFM (current frame marker) points to the set of registers belonging to the current procedure.
Instruction Format
VLIW approach
- Implicit parallelism among operations in an instruction
- Fixed formatting of the operation fields
Instruction Groups
- A sequence of consecutive instructions with no register dependences (there may be some memory dependences).
- Boundaries between groups are indicated with a stop.
Instruction Bundles
Each bundle is 128 bits of encoded instructions:
- a 5-bit template field, which specifies what type of execution unit each instruction requires (and where stops occur), and
- three 41-bit instruction slots.
Execution Slots
Execution unit slot   Instruction type   Description       Example instructions
I-unit                A                  Integer ALU       add, sub, and, or, ...
I-unit                I                  Non-integer ALU   integer and multimedia shifts, bit tests, ...
M-unit                A                  Integer ALU       add, sub, and, or, ...
M-unit                M                  Memory access     loads/stores, integer/FP
F-unit                F                  Floating point    floating-point instructions
B-unit                B                  Branches          conditional branches
L+X                   L+X                Extended          extended immediates, stops, nops
Templates
Template   Slot 0   Slot 1   Slot 2
0          M        I        I
1          M        I        I
2          M        I        I
3          M        I        I
4          M        L        X
5          M        L        X
8          M        M        I
9          M        M        I
10         M        M        I
11         M        M        I
12         M        F        I
13         M        F        I
14         M        M        F
15         M        M        F
16         M        I        B
17         M        I        B
18         M        B        B
19         M        B        B
22         B        B        B
23         B        B        B
24         M        M        B
25         M        M        B
28         M        F        B
29         M        F        B

Templates also encode the positions of stops; the remaining 8 of the 32 possible template values are reserved.
Exercise
Loop example using MIPS form of instructions:
loop: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar in F2
      S.D    F4, 0(R1)      ; store result
      DADDUI R1, R1, #-8    ; decrement pointer (DW)
      BNE    R1, R2, loop   ; branch if R1 != R2
Let's see if we can unroll this loop and map it to IA-64 bundles.
loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      L.D    F18, -32(R1)
      ADD.D  F20, F18, F2
      S.D    F20, -32(R1)
      L.D    F22, -40(R1)
      ADD.D  F24, F22, F2
      S.D    F24, -40(R1)
      L.D    F26, -48(R1)
      ADD.D  F28, F26, F2
      S.D    F28, -48(R1)
      DADDI  R1, R1, #-56
      BNE    R1, R2, loop
Reordered (loads first, then adds, then stores interleaved with DADDI and BNE):

loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      L.D    F22, -40(R1)
      L.D    F26, -48(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      ADD.D  F20, F18, F2
      ADD.D  F24, F22, F2
      ADD.D  F28, F26, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      S.D    F12, -16(R1)
      S.D    F16, -24(R1)
      S.D    F20, -32(R1)
      DADDI  R1, R1, #-56
      S.D    F24, 16(R1)    ; 16-56 = -40
      BNE    R1, R2, loop
      S.D    F28, 8(R1)     ; 8-56 = -48
Latencies
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU operation               Another FP ALU operation   3
FP ALU operation               Store double               2
Load double                    FP ALU operation           1
Load double                    Store double               0

Instruction types for bundle mapping: L.D and S.D are type M, ADD.D is type F, DADDI is type I.
Scheduled Code
Template     Slot 0              Slot 1              Slot 2              Cycle
8: M M I     L.D F0, 0(R1)       L.D F6, -8(R1)                          1
9: M M I     L.D F10, -16(R1)    L.D F14, -24(R1)                        2
14: M M F    L.D F18, -32(R1)    ...                 ADD.D F4, F0, F2    3
...
Each instruction is 41 bits: a 4-bit major opcode, 31 bits of operand and modifier fields, and a 6-bit predicate register field. The major opcode, together with the 5-bit bundle template, determines the major operation.
Type-A Instructions
Instruction type: A

Example instructions   Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
Add, sub, and, or      9                   3           0
Shift left and add     7                   3           0                2-bit shift count
ALU immediates         9                   2           8
Add immediate          3                   2           14
Add immediate          0                   2           22
Compare                4                   2           0                2 predicate register destinations
Compare immediate                                                       2 predicate register destinations
Type-I Instructions
Instruction type: I (29 formats)

Example instructions   Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
Shift R/L variable     9                   3           0                Used by multimedia instructions
Test bit               6                   3                            2 predicate register destinations
Move to BR             6                   1                            Branch register specifier
Type-M Instructions
Instruction type: M (46 formats)

Example instructions                                                    Extra opcode bits   GPRs/FPRs   Immediate bits     Other/comment
Integer/FP load and store, line prefetch                                10                  2           0                  Speculative/nonspeculative
Integer/FP load and store, line prefetch, post-increment by immediate                                                      Speculative/nonspeculative
Integer/FP load prefetch and register post-increment                    10                                                 Speculative/nonspeculative
Integer/FP speculation check                                                                            21 in two fields
Type-B Instructions
Instruction type: B

Example instructions                 Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
PC-relative branch, counted branch   7                   0           21
PC-relative call                                                     21               1 branch register
Type-F Instructions
Instruction type: F (15 formats)

Example instructions   Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
FP arithmetic          2                   4
FP compare             2                   2                            2 6-bit predicate registers
Type-L + X Instructions
Instruction type: L+X (4 formats)

Example instructions   Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
Move immediate long    2                   1           64               Takes 2 slots
Predication
Almost all instructions can be predicated
- The predicate register is specified in the last 6 bits of the instruction.
- Predicate registers are set with compare or test instructions.

Compare:
- 10 possible comparison tests
- Two predicate registers as destinations, written with the result and its complement, or with a logical function that combines the tests and the complement
- Multiple tests can be done with one instruction
Speculation Support
Control speculation support:
Deferred exceptions for speculated instructions
Equivalent of poison bits
Deferred Exceptions
Support to indicate an exception on a speculative instruction
- GPRs have a NaT (Not a Thing) bit, making the registers 65 bits long.
- FPRs use NaTVal (Not a Thing Value): a significand of 0 and an out-of-range exponent.
- NaTs and NaTVals are propagated by speculative instructions that don't reference memory; FP instructions use status registers to record exceptions for this purpose.
Memory Reference Speculation
- When the instruction USING the speculatively loaded value is executed, the ALAT (advanced load address table) is checked.
- ld.c: check used if only the load is speculative; it only causes a reload of the value.
- chk.a: check used if other speculative code used the loaded value; it specifies the address of a fix-up routine that re-executes the code sequence.
Itanium Processor
- First implementation of IA-64 (2001)
- 800 MHz clock
- Multi-issue: up to 6 issues per clock cycle; up to 3 branches and 2 memory references
- 3-level cache hierarchy: L1 is split data/instruction caches; L2 is a unified on-chip cache; L3 is a unified off-chip cache (but in the same cartridge)
Functional Units
Nine functional units: 2 I-units, 2 M-units, 3 B-units, 2 F-units.
- All functional units are pipelined.
- Bypassing paths (forwarding) are implemented; a bypass between different units has a 1-cycle delay.

Instruction                        Latency
Integer load                       1
Floating-point load                9
Correctly predicted taken branch   0-3
Mispredicted branch                9
Integer ALU operation              0
FP arithmetic                      4
Itanium Multi-Issue
- Instruction issue window of 2 bundles at a time, so up to 6 instructions can be issued at once:

  [ template | inst 1 | inst 2 | inst 3 ]   [ template | inst 1 | inst 2 | inst 3 ]

- NOPs and predicated instructions with false predicates are not issued.
- If one or more instructions cannot be issued because a functional unit is unavailable, the bundle can be split.
Itanium Pipeline
10 Stages
Front end (IPG, Fetch, Rotate):
- Prefetches up to 32 bytes per clock
- Can hold up to 8 bundles
- Branch prediction: multilevel adaptive predictor

Instruction delivery (EXP, REN)

Operand delivery (WLD, REG):
- Accesses the register file, performs register bypassing
- Register scoreboard; checks predicate dependences

Execution (EXE, DET, WRB):
- Executes instructions in the ALU and load-store units
- Detects exceptions and posts NaTs
- Write-back
Itanium Performance
Integer benchmarks
- Comparison processors: Pentium 4 and Alpha 21264.
- Itanium shows the best performance only for the mcf benchmark.
- Geometric mean: Itanium performance is about 60% of the Pentium 4's.
Itanium Performance
FP benchmarks
- Itanium has the best performance for 8 of the 16 benchmarks.
- Geometric means: Itanium performance is about 108% of the Pentium 4's and about 120% of the Alpha 21264's.
Conclusions
- Multi-issue processors achieve high performance only with a large investment in silicon area and hardware complexity.
- In general, there is no clear winner between hardware and software approaches to ILP.
- Software helps for conditional instructions and speculative load support.
- Hardware helps for scoreboard-type scheduling, dynamic branch prediction, and local checking of speculated-load correctness.