Exploiting Instruction-Level Parallelism With Software Approaches
Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Approach    Issue structure   Hazard detection   Scheduling      Examples
VLIW/LIW    static            software           static          Trimedia, i860
EPIC        mostly static     mostly software    mostly static   Itanium
Latencies assumed in the examples:

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU operation               Another FP ALU operation   3
FP ALU operation               Store double               2
Load double                    FP ALU operation           1
Load double                    Store double               0
Loop Example
Add a scalar to an array:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

Iterations of the loop are parallel: there are no dependences between iterations.
Straightforward Conversion
- R1 holds the address of the highest array element
- F2 holds the scalar
- R2 is pre-computed so that 8(R2) is the last element
loop: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar in F2
      S.D    F4, 0(R1)      ; store result
      DADDUI R1, R1, #-8    ; decrement pointer (DW)
      BNE    R1, R2, loop   ; branch if R1 != R2
(Figure: OLD vs. NEW ordering of the loop instructions, annotated with the 2-cycle latency between ADD.D and the dependent S.D.)
Compiler Tasks
loop: L.D    F0, 0(R1)      ; issued in clock cycle 1
      DADDUI R1, R1, #-8    ; 2
      ADD.D  F4, F0, F2     ; 3
      stall                 ; 4
      BNE    R1, R2, loop   ; 5
      S.D    F4, 8(R1)      ; 6
- OK to reorder DADDUI and ADD.D
- OK to reorder S.D and BNE
- OK to reorder DADDUI and S.D, but this requires changing the S.D offset from 0(R1) to 8(R1). This one is difficult, since the dependence through R1 runs in the reverse direction: the S.D must use the value R1 had before the decrement.
Loop Overhead
loop: L.D    F0, 0(R1)      ; issued in clock cycle 1
      DADDUI R1, R1, #-8    ; 2
      ADD.D  F4, F0, F2     ; 3
      stall                 ; 4
      BNE    R1, R2, loop   ; 5
      S.D    F4, 8(R1)      ; 6
Six cycles is the minimum, due to the dependences and pipeline latencies. The actual work of the loop is just 3 instructions: L.D, ADD.D, S.D. The rest is loop overhead.
Loop Unrolling
- Eliminate some of the overhead by unrolling the loop, fully or partially (see the source-level sketch below)
- Need to adjust the loop termination code
- Allows more parallel instructions in a row
- Allows more flexibility in reordering
- Usually requires register renaming
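A source-level sketch of unrolling by four (the compiler actually unrolls at the instruction level, as on the next slide); it assumes the trip count, 1000, is a multiple of 4:

    for (i = 1000; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }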
Unrolled Version
loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, loop

- Assume that the number of iterations is a multiple of 4
- Decrement R1 by 32 for these 4 iterations
- More registers are required to avoid unnecessary dependences
Unroll: a loop of n iterations becomes n/k iterations of a body containing k copies (unrolled) of the original loop body.
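When the number of iterations n is not known to be a multiple of k, the loop can be strip-mined: a first loop runs the n mod k leftover iterations and the unrolled loop handles the rest. A sketch in C for k = 4 (x, s, n, i as in the example; leftover is an illustrative name):

    int leftover = n % 4;
    for (i = n; i > n - leftover; i = i - 1)       /* the n mod 4 leftover iterations */
        x[i] = x[i] + s;
    for (i = n - leftover; i > 0; i = i - 4) {     /* unrolled by 4                   */
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }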
Summary of Example
Version                  Cycles per element   Code size (instructions)
Unscheduled              10                   5
Scheduled                6                    5
Unrolled                 7                    14
Unrolled and Scheduled   3.5                  14
Compiler limitations
Shortfall in registers: unrolling increases the number of simultaneously live values, which may exceed the number of available registers.
Unrolled five times and scheduled for a two-issue pipeline (one integer/memory instruction and one FP instruction per clock cycle):

      Integer instruction        FP instruction         Clock cycle
loop: L.D    F0, 0(R1)                                  1
      L.D    F6, -8(R1)                                 2
      L.D    F10, -16(R1)        ADD.D F4, F0, F2       3
      L.D    F14, -24(R1)        ADD.D F8, F6, F2       4
      L.D    F18, -32(R1)        ADD.D F12, F10, F2     5
      S.D    F4, 0(R1)           ADD.D F16, F14, F2     6
      S.D    F8, -8(R1)          ADD.D F20, F18, F2     7
      S.D    F12, -16(R1)                               8
      DADDI  R1, R1, #-40                               9
      S.D    F16, 16(R1)                                10
      BNE    R1, R2, loop                               11
      S.D    F20, 8(R1)                                 12
Summary of Example
Version                                          Cycles per element   Code size (instructions)
Unscheduled                                      10                   5
Scheduled                                        6                    5
Unrolled (4)                                     7                    14
Unrolled (4) and Scheduled                       3.5                  14
Unrolled (5) and Scheduled in Multi-Issue Pipe   2.4                  17
Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Detecting Parallelism
Loop-level parallelism
Analyzed at the source level; requires recognition of array references, loops, and indices.

    for (k=1; k<=100; k=k+1) {
        A[k] = A[k] + B[k];      /* S1 */
        B[k+1] = C[k] + D[k];    /* S2 */
    }

There is a loop-carried dependence (S2 produces B[k+1], which S1 uses in the next iteration), but no cycle: B[k+1] does not have B[k] as a source.
Transformation
- The two statements can be interchanged.
- The first iteration of the first statement can be computed outside the loop, so that A[k+1] is computed within the loop.
- The last iteration of the second statement must also be computed outside the loop.

    A[1] = A[1] + B[1];
    for (k=1; k<=99; k=k+1) {
        B[k+1] = C[k] + D[k];
        A[k+1] = A[k+1] + B[k+1];
    }
    B[101] = C[100] + D[100];
This exposes the parallelism: the transformed loop has no loop-carried dependence, so its iterations can be executed in parallel.
Recurrences
    for (i=2; i<=100; i=i+1) {
        Y[i] = Y[i-n] + Y[i];
    }
Y[i] depends on itself, but uses the value from an earlier iteration; n is the dependence distance (most often n = 1). The larger n is, the more parallelism is available: with n = 3, for example, the iterations fall into three independent dependence chains that can proceed in parallel.
Finding Dependencies
Important for
- Efficient scheduling
- Determining which loops to unroll
- Eliminating name dependences
Dependencies in Arrays
An array index is affine if it can be written as a*i + b, where i is the loop index and a, b are constants (for a one-dimensional array). An index into a multi-dimensional array is affine if the index in each dimension is affine.
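For instance (a sketch; x, y, idx, n, and s are illustrative names):

    for (i = 0; i < n; i = i + 1) {
        x[3*i + 2] = x[3*i] * 5.0;    /* affine indices: a*i + b with (a=3, b=2) and (a=3, b=0) */
        y[idx[i]]  = y[idx[i]] + s;   /* non-affine: the index comes from another array         */
    }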
For two references with affine indices a*i + b and c*i + d, a dependence is possible only if GCD(a,c) divides (d - b) evenly; if it does not, no dependence exists. (The test is necessary but not sufficient.)
Example
    for (k=1; k<=100; k=k+1) {
        X[2*k+3] = X[2*k] * 5.0;
    }
GCD test: a=2, b=3, c=2, d=0. A dependence requires that GCD(a,c) divide (d-b) evenly: GCD(2,2) = 2 and d-b = -3, and 2 does not divide -3, so there is no dependence.

    k:     1   2   3   4   5   6   7   ...  100
    2k+3:  5   7   9  11  13  15  17   ...  203   (always odd)
    2k:    2   4   6   8  10  12  14   ...  200   (always even)

The two index sets never refer to the same element.
In general, determining whether a dependence exists is NP-complete; there are exact tests for restricted situations.
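A small C sketch of the GCD test for two affine accesses X[a*i + b] and X[c*i + d]; the function names are illustrative, not from any particular compiler:

    /* Euclid's algorithm on magnitudes. */
    static int gcd(int a, int b) {
        if (a < 0) a = -a;
        if (b < 0) b = -b;
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* Returns 0 when no dependence is possible between X[a*i+b] and X[c*i+d];
       returns 1 when a dependence MAY exist (the test is necessary, not sufficient). */
    static int gcd_test_may_depend(int a, int b, int c, int d) {
        int g = gcd(a, c);
        if (g == 0)                /* both coefficients zero: same element only if b == d */
            return b == d;
        return (d - b) % g == 0;
    }

    /* The example above: X[2k+3] = X[2k]*5.0  ->  gcd_test_may_depend(2, 3, 2, 0) returns 0. */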
Dependency Classification
Different dependences are handled differently:
- Anti-dependences and output dependences (name dependences): eliminate by renaming.
- Real (true) dependences: try to reorder so that dependent instructions are separated by the length of the latency.
Example
Find the dependences:
- True dependences
- Output dependences
- Anti-dependences

    for (i=1; i<=100; i=i+1) {
        Y[i] = X[i] / c;     /* S1 */
        X[i] = X[i] + c;     /* S2 */
        Z[i] = Y[i] + c;     /* S3 */
        Y[i] = c - Y[i];     /* S4 */
    }

True dependences: S3 and S4 use the Y[i] written by S1.
Anti-dependences: S1 reads X[i] before S2 writes it; S3 reads Y[i] before S4 writes it.
Output dependence: S1 and S4 both write Y[i].
Example
Eliminate the output dependence (this also eliminates the second anti-dependence): rename the Y written by S1, and its uses, to T.

    for (i=1; i<=100; i=i+1) {
        T[i] = X[i] / c;     /* S1 */
        X[i] = X[i] + c;     /* S2 */
        Z[i] = T[i] + c;     /* S3 */
        Y[i] = c - T[i];     /* S4 */
    }

The remaining anti-dependence is S1 reading X[i] before S2 writes it.
Example
Eliminate the anti-dependence: rename the X written by S2 to S. The final result is a parallel loop that can be unrolled.

    for (i=1; i<=100; i=i+1) {
        T[i] = X[i] / c;     /* S1 */
        S[i] = X[i] + c;     /* S2 */
        Z[i] = T[i] + c;     /* S3 */
        Y[i] = c - T[i];     /* S4 */
    }

Only the true dependences within an iteration remain (S1 to S3 and S1 to S4); there are no loop-carried dependences.
Software Pipelining
Interleaves instructions from different iterations of a loop without unrolling:
- Each iteration of the new loop is made from instructions taken from different iterations of the original loop.
- Software counterpart to Tomasulo's algorithm.
- Start-up and finish-up code is required.
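A source-level sketch of the software-pipelined x[i] = x[i] + s loop (assuming at least three iterations; t_load and t_add are illustrative temporaries standing in for F0 and F4):

    double t_load = x[1000];        /* start-up: load for iteration 1000          */
    double t_add  = t_load + s;     /* start-up: add for iteration 1000           */
    t_load = x[999];                /* start-up: load for iteration 999           */
    for (i = 1000; i > 2; i = i - 1) {
        x[i]   = t_add;             /* store for iteration i                      */
        t_add  = t_load + s;        /* add for iteration i-1                      */
        t_load = x[i - 2];          /* load for iteration i-2                     */
    }
    x[2] = t_add;                   /* finish-up: store for iteration 2           */
    x[1] = t_load + s;              /* finish-up: add and store for iteration 1   */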
Software Pipelining
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Symbolic unrolling: take one instruction from each of three consecutive iterations.

    Iteration i:    L.D   F0, 0(R1)
                    ADD.D F4, F0, F2
                    S.D   F4, 0(R1)
    Iteration i+1:  L.D   F0, 0(R1)
                    ADD.D F4, F0, F2
                    S.D   F4, 0(R1)
    Iteration i+2:  L.D   F0, 0(R1)
                    ADD.D F4, F0, F2
                    S.D   F4, 0(R1)

New loop:

Loop: S.D    F4, 16(R1)    ; store to M[i]
      ADD.D  F4, F0, F2    ; add to M[i-1]
      L.D    F0, 0(R1)     ; load M[i-2]
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Software Pipelining
New loop:

Loop: S.D    F4, 16(R1)    ; store to M[i]
      ADD.D  F4, F0, F2    ; add to M[i-1]
      L.D    F0, 0(R1)     ; load M[i-2]
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Result:
- 1 cycle per instruction
- 1 loop iteration per 5 cycles
- Less code space than unrolling

Rescheduled loop (the load fills the branch delay slot, with its offset adjusted for the decrement):

Loop: S.D    F4, 16(R1)    ; store to M[i]
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2    ; add to M[i-1]
      BNE    R1, R2, Loop
      L.D    F0, 8(R1)     ; load M[i-2]
In the software-pipelined loop, the overhead still includes the branch and counter-update instructions, which are not easy to overlap.

Global Code Scheduling
Finding the shortest possible sequence requires identifying the critical path, the longest sequence of dependent instructions. Moving code across branches is complex: trace exits and re-entrances require much bookkeeping.
Trace Scheduling
Advantages
- Eliminates some hard decisions in global code scheduling
- Good for code such as scientific programs with intensive loops and predictable behavior
Disadvantages
significant overhead in compensation code when trace must be exited
Superblocks
Similar to trace scheduling, but a superblock has only ONE entry point. When the superblock is exited, a duplicated copy of the remaining code is used (tail duplication).
Chapter 4 Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Review
Techniques: loop unrolling, software pipelining, trace scheduling, global code scheduling.
Problem: these techniques increase parallelism only when branch behavior is known (predictable).
Hardware Options
Instruction set change:
- Conditional instructions
  Example: conditional move
      CMOVZ R1, R2, R3       ; R1 <- R2 if R3 = 0
- Predicated instructions
  Example: predicated load
      LWC R1, 9(R2), R3      ; R1 <- M[R2+9] if R3 != 0
Conditional Moves
Can be used to eliminate some branches
if (A==0) {S=T;}
Let A, S, T be assigned to registers R1, R2, R3.

Code without a conditional move:
        BNEZ  R1, L
        ADDU  R2, R3, R0
L:
Using a conditional move:
        CMOVZ R2, R3, R1

The control dependence is converted into a data dependence.
Conditional Moves
Useful for conversions such as absolute value A = abs (B)
if (B < 0) { A = -B; } else { A = B; }
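In C this is just a conditional expression, which a compiler with conditional moves can compile without a branch (a sketch):

    A = (B < 0) ? -B : B;     /* can be compiled to a conditional move instead of a branch */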
Conditional Moves
- Useful for short sequences
- Not efficient for branches that guard large blocks of code
- The simplest form of predicated instruction
Predication
Execution of an instruction is controlled by a predicate.
- When the predicate is false, the instruction becomes a no-op.
- Full predication: all instructions can be predicated.
- Full predication allows conversion (if-conversion) of large blocks of branch-dependent code, as sketched below.
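A source-level sketch of the idea behind if-conversion (the names are illustrative); a fully predicated ISA lets the compiler guard each instruction of either branch with a predicate rather than computing a select:

    int if_converted(int a, int s, int t, int u) {
        int p = (a == 0);    /* the predicate, computed once                     */
        s = p ? t : u;       /* both paths become straight-line code; the branch */
                             /* is replaced by a data-dependent selection        */
        return s;
    }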
Examples
Support conditional moves:
MIPS, Alpha, PowerPC, SPARC, Intel x86
Compiler Speculation
To speculate ambitiously, must have
1. The ability to find instructions that can be speculatively moved and not affect the program data flow.
2. The ability to ignore exceptions in speculated instructions, until it is certain they should occur.
3. The ability to speculatively interchange loads and stores, which may have address conflicts.
The last two require hardware support.
Exception Categories
Those that are handled and then normally resumed (page fault, I/O request, etc) (Resuming exceptions)
Those that indicate program error (overflow, memory protection fault, etc) (Terminating exceptions)
Approach 1.
HW and OS cooperatively ignore exceptions for speculated instructions: resuming exceptions are handled as usual, while terminating exceptions are ignored for speculative instructions.
Resuming exceptions
Handling an exception for a speculative instruction causes a performance penalty, but not incorrect program behavior.
Terminating exceptions
In this approach, HW and OS return an undefined value for any exception that would normally cause termination. Because termination SHOULD result when such exceptions occur, an INCORRECT program that would have terminated now simply continues with wrong results; correct programs are unaffected. Used for "fast mode" in some processors.
Approach 2.
Same as approach 1, but add instructions to check for terminating exceptions.
Example:
        LD     R1, 0(R3)     ; load A
        sLD    R14, 0(R2)    ; speculative, no termination
        BNEZ   R1, L1        ; test A
        SPECCK 0(R2)         ; check for speculation exception
        J      L2            ; skip else
L1:     DADDI  R14, R1, #4   ; else clause
L2:     SD     R14, 0(R3)    ; store A
- Both correct and incorrect programs execute without error.
- Requires additional checking instructions.
Approach 3.
- Track exceptions as they occur, but postpone terminating exceptions.
- Requires a poison bit for each register and a speculation bit for each instruction.
- A terminating exception causes the poison bit of the result register to be set.
- Speculative instructions that use a poisoned result pass the poison bit on to their result.
- A non-speculative instruction that uses a poisoned result causes termination.
- Stores cannot be speculative, since memory locations cannot have poison bits.
Approach 3.
Example
Example:
        LD    R1, 0(R3)      ; load A
        sLD   R14, 0(R2)     ; speculative; set poison bit for R14 on exception
        BEQZ  R1, L1         ; test A
        DADDI R14, R1, #4    ; else clause
L1:     SD    R14, 0(R3)     ; store A; if the poison bit is set, fault
Approach 4.
A hardware mechanism buffers the results of speculative instructions until they are known to be no longer speculative (as in hardware-based speculation with a reorder buffer).
Itanium Implementation
- Functional units and instruction issue
- Performance
Register Model
- 128 general-purpose (integer) registers
- 128 floating-point registers
- 64 1-bit predicate registers
- 8 branch registers
- Other registers for system control, memory mapping, performance counters, and communication with the OS
Integer Registers
- R0-R31 are always accessible.
- R32-R127 are implemented as a register stack: each procedure is allocated a set of these registers.
- The CFM (current frame marker) points to the set of registers belonging to the current procedure.
Instruction Format
VLIW approach
- Implicit parallelism among operations in an instruction
- Fixed formatting of the operation fields
Instruction Groups
- A sequence of consecutive instructions with no register dependences (there may be some memory dependences).
- Boundaries between groups are indicated with a stop.
Instruction Bundles
Each bundle is 128 bits of encoded instructions:
- a 5-bit template field, which specifies what type of execution unit each instruction requires (and where stops occur), and
- three 41-bit instruction slots.
Execution Slots
Execution unit slot   Instruction type   Description       Example instructions
I-unit                A                  Integer ALU       add, sub, and, or, ...
I-unit                I                  Non-integer ALU   integer and multimedia shifts, bit tests, ...
M-unit                A                  Integer ALU       add, sub, and, or, ...
M-unit                M                  Memory access     loads/stores, integer/FP
F-unit                F                  Floating point    floating-point instructions
B-unit                B                  Branches          conditional branches
L+X                   L+X                Extended          extended immediates, stops, nops
Templates
Template   Slot 0   Slot 1   Slot 2
0          M        I        I
1          M        I        I
2          M        I        I
3          M        I        I
4          M        L        X
5          M        L        X
8          M        M        I
9          M        M        I
10         M        M        I
11         M        M        I
12         M        F        I
13         M        F        I
14         M        M        F
15         M        M        F
16         M        I        B
17         M        I        B
18         M        B        B
19         M        B        B
22         B        B        B
23         B        B        B
24         M        M        B
25         M        M        B
28         M        F        B
29         M        F        B

Templates also encode the positions of stops; the remaining 8 of the 32 possible template values are reserved.
Exercise
Loop example using MIPS form of instructions:
loop: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar in F2
      S.D    F4, 0(R1)      ; store result
      DADDUI R1, R1, #-8    ; decrement pointer (DW)
      BNE    R1, R2, loop   ; branch if R1 != R2
Let's see if we can unroll this loop and map it to IA-64 bundles.
loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      L.D    F18, -32(R1)
      ADD.D  F20, F18, F2
      S.D    F20, -32(R1)
      L.D    F22, -40(R1)
      ADD.D  F24, F22, F2
      S.D    F24, -40(R1)
      L.D    F26, -48(R1)
      ADD.D  F28, F26, F2
      S.D    F28, -48(R1)
      DADDI  R1, R1, #-56
      BNE    R1, R2, loop
Reordered (loads first, then adds, then stores interleaved with DADDI and BNE):

loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      L.D    F22, -40(R1)
      L.D    F26, -48(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      ADD.D  F20, F18, F2
      ADD.D  F24, F22, F2
      ADD.D  F28, F26, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      S.D    F12, -16(R1)
      S.D    F16, -24(R1)
      S.D    F20, -32(R1)
      DADDI  R1, R1, #-56
      S.D    F24, 16(R1)    ; 16-56 = -40
      BNE    R1, R2, loop
      S.D    F28, 8(R1)     ; 8-56 = -48
Latencies
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU operation               Another FP ALU operation   3
FP ALU operation               Store double               2
Load double                    FP ALU operation           1
Load double                    Store double               0

Instruction types for bundle mapping: L.D and S.D are type M, ADD.D is type F, DADDI is type I.
Scheduled Code
Template     Slot 0              Slot 1              Slot 2              Cycle
8: M M I     L.D F0, 0(R1)       L.D F6, -8(R1)                          1
9: M M I     L.D F10, -16(R1)    L.D F14, -24(R1)                        2
14: M M F    L.D F18, -32(R1)    ...                 ADD.D F4, F0, F2    3
...
Each instruction is 41 bits: a 4-bit major opcode, 31 bits of operand and modifier fields, and a 6-bit predicate register field. The major opcode, together with the 5-bit bundle template, determines the major operation.
Type-A Instructions
Instruction type: A

Example instructions   Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
Add, sub, and, or      9                   3           0
Shift left and add     7                   3           0                2-bit shift count
ALU immediates         9                   2           8
Add immediate          3                   2           14
Add immediate          0                   2           22
Compare                4                   2           0                2 predicate register destinations
Compare immediate                                                       2 predicate register destinations
Type-I Instructions
Instruction type: I (29 formats)

Example instructions   Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
Shift R/L variable     9                   3           0                Used by multimedia instructions
Test bit               6                   3                            2 predicate register destinations
Move to BR             6                   1                            Branch register specifier
Type-M Instructions
Instruction type: M (46 formats)

Example instructions                                                    Extra opcode bits   GPRs/FPRs   Immediate bits     Other/comment
Integer/FP load and store, line prefetch                                10                  2           0                  Speculative/nonspeculative
Integer/FP load and store, line prefetch, post-increment by immediate                                                      Speculative/nonspeculative
Integer/FP load prefetch and register post-increment                    10                                                 Speculative/nonspeculative
Integer/FP speculation check                                                                            21 in two fields
Type-B Instructions
Instruction type: B

Example instructions                 Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
PC-relative branch, counted branch   7                   0           21
PC-relative call                                                     21               1 branch register
Type-F Instructions
Instruction type: F (15 formats)

Example instructions   Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
FP arithmetic          2                   4
FP compare             2                   2                            2 6-bit predicate registers
Type-L + X Instructions
Instruction type: L+X (4 formats)

Example instructions   Extra opcode bits   GPRs/FPRs   Immediate bits   Other/comment
Move immediate long    2                   1           64               Takes 2 slots
Predication
Almost all instructions can be predicated
- The predicate register is specified in the last 6 bits of the instruction.
- Predicate registers are set with compare or test instructions.

Compare:
- 10 possible comparison tests
- Two predicate registers as destinations, written with the result and its complement, or with a logical function that combines the tests and the complement
- Multiple tests can be done with one instruction
Speculation Support
Control speculation support:
Deferred exceptions for speculated instructions
Equivalent of poison bits
Deferred Exceptions
Support to indicate an exception on a speculative instruction
- GPRs have a NaT (Not a Thing) bit, making the registers 65 bits long.
- FPRs use NaTVal (Not a Thing Value): a significand of 0 and an out-of-range exponent.
- NaTs and NaTVals are propagated by speculative instructions that don't reference memory; FP instructions use status registers to record exceptions for this purpose.
Memory Reference Speculation
- When the instruction USING the speculatively loaded value is executed, the ALAT (advanced load address table) is checked.
- ld.c: check used if only the load is speculative; it only causes a reload of the value.
- chk.a: check used if other speculative code used the loaded value; it specifies the address of a fix-up routine that re-executes the code sequence.
Itanium Processor
- First implementation of IA-64 (2001)
- 800 MHz clock
- Multi-issue: up to 6 issues per clock cycle; up to 3 branches and 2 memory references
- 3-level cache hierarchy: L1 is split data/instruction caches; L2 is a unified on-chip cache; L3 is a unified off-chip cache (but in the same cartridge)
Functional Units
Nine functional units: 2 I-units, 2 M-units, 3 B-units, 2 F-units.
- All functional units are pipelined.
- Bypassing paths (forwarding) are implemented; a bypass between different units has a 1-cycle delay.

Instruction                        Latency
Integer load                       1
Floating-point load                9
Correctly predicted taken branch   0-3
Mispredicted branch                9
Integer ALU operation              0
FP arithmetic                      4
Itanium Multi-Issue
- Instruction issue window of 2 bundles at a time, so up to 6 instructions can be issued at once:

  [ template | inst 1 | inst 2 | inst 3 ]   [ template | inst 1 | inst 2 | inst 3 ]

- NOPs and predicated instructions with false predicates are not issued.
- If one or more instructions cannot be issued because a functional unit is unavailable, the bundle can be split.
Itanium Pipeline
10 Stages
Front end (IPG, Fetch, Rotate):
- Prefetches up to 32 bytes per clock
- Can hold up to 8 bundles
- Branch prediction: multilevel adaptive predictor

Instruction delivery (EXP, REN)

Operand delivery (WLD, REG):
- Accesses the register file, performs register bypassing
- Register scoreboard; checks predicate dependences

Execution (EXE, DET, WRB):
- Executes instructions in the ALU and load-store units
- Detects exceptions and posts NaTs
- Write-back
Itanium Performance
Integer benchmarks
- Comparison processors: Pentium 4 and Alpha 21264.
- Itanium shows the best performance only for the mcf benchmark.
- Geometric mean: Itanium performance is about 60% of the Pentium 4's.
Itanium Performance
FP benchmarks
- Itanium has the best performance for 8 of the 16 benchmarks.
- Geometric means: Itanium performance is about 108% of the Pentium 4's and about 120% of the Alpha 21264's.
Conclusions
- Multi-issue processors achieve high performance only with a large investment in silicon area and hardware complexity.
- In general, there is no clear winner between hardware and software approaches to ILP.
- Software helps for conditional instructions and speculative load support.
- Hardware helps for scoreboard-type scheduling, dynamic branch prediction, and local checking of speculated-load correctness.