
15-213

“The course that gives CMU its Zip!”

Code Optimization
Sept. 25, 2003

Topics
 Machine-Independent Optimizations
 Machine-Dependent Optimizations
 Code Profiling

class10.ppt
Harsh Reality
There’s more to performance than asymptotic complexity

Constant factors matter too!
 Easily see 10:1 performance range depending on how code is written
 Must optimize at multiple levels:
 algorithm, data representations, procedures, and loops

Must understand system to optimize performance
 How programs are compiled and executed
 How to measure program performance and identify bottlenecks
 How to improve performance without destroying code modularity and generality


Limitations of Optimizing Compilers
Operate under fundamental constraint
 Must not cause any change in program behavior under any possible condition
 Often prevents the compiler from making optimizations that would only affect behavior under pathological conditions

Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
 e.g., data ranges may be more limited than variable types suggest

Most analysis is performed only within procedures
 Whole-program analysis is too expensive in most cases

Most analysis is based only on static information
 Compiler has difficulty anticipating run-time inputs

When in doubt, the compiler must be conservative
Machine-Independent Optimizations
Optimizations that you or compiler should do regardless of processor / compiler

Code Motion
 Reduce frequency with which computation performed
 If it will always produce same result
 Especially moving code out of loop

Before:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After:
for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}


Compiler-Generated Code Motion
 Most compilers do a good job with array code + simple loop structures

Source:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

Code Generated by GCC (expressed in C):
for (i = 0; i < n; i++) {
    int ni = n*i;
    int *p = a+ni;
    for (j = 0; j < n; j++)
        *p++ = b[j];
}

    imull %ebx,%eax            # i*n
    movl 8(%ebp),%edi          # a
    leal (%edi,%eax,4),%edx    # p = a+i*n (scaled by 4)
    # Inner Loop
.L40:
    movl 12(%ebp),%edi         # b
    movl (%edi,%ecx,4),%eax    # b+j (scaled by 4)
    movl %eax,(%edx)           # *p = b[j]
    addl $4,%edx               # p++ (scaled by 4)
    incl %ecx                  # j++
    jl .L40                    # loop if j<n


Reduction in Strength
 Replace costly operation with simpler one
 Shift, add instead of multiply or divide
   16*x  -->  x << 4
 Utility machine dependent
 Depends on cost of multiply or divide instruction
 On Pentium II or III, integer multiply only requires 4 CPU cycles
 Recognize sequence of products

Before:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After:
int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}


Share Common Subexpressions
 Reuse portions of expressions
 Compilers often not very sophisticated in exploiting arithmetic properties

Before (3 multiplications: i*n, (i-1)*n, (i+1)*n):
/* Sum neighbors of i,j */
up    = val[(i-1)*n + j  ];
down  = val[(i+1)*n + j  ];
left  = val[i*n     + j-1];
right = val[i*n     + j+1];
sum = up + down + left + right;

After (1 multiplication: i*n):
int inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

Code generated for the original version computes all three products:
    leal -1(%edx),%ecx     # i-1
    imull %ebx,%ecx        # (i-1)*n
    leal 1(%edx),%eax      # i+1
    imull %ebx,%eax        # (i+1)*n
    imull %ebx,%edx        # i*n


Time Scales
Absolute Time
 Typically use nanoseconds
 10^-9 seconds
 Time scale of computer instructions

Clock Cycles
 Most computers controlled by high frequency clock signal
 Typical Range
 100 MHz
» 10^8 cycles per second
» Clock period = 10 ns
 2 GHz
» 2 × 10^9 cycles per second
» Clock period = 0.5 ns
 Fish machines: 550 MHz (1.8 ns clock period)
Cycles Per Element
 Convenient way to express performance of program that operates on vectors or lists
 Length = n
 T = CPE*n + Overhead

[Plot: Cycles vs. Elements (0–200); vsum1 is a line of slope (CPE) 4.0, vsum2 a line of slope 3.5 (both sketched below).]
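
vsum1 and vsum2 are only named in the plot; as a rough sketch (assuming CS:APP-style versions, where vsum2 processes two elements per iteration), they might look like this:

/* Illustrative sketch, not shown in the deck: vsum1 handles one
 * element per iteration, vsum2 two per iteration. */
void vsum1(int *a, int *b, int *c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void vsum2(int *a, int *b, int *c, int n)
{
    int i;
    for (i = 0; i < n - 1; i += 2) {   /* two elements per iteration */
        c[i]   = a[i]   + b[i];
        c[i+1] = a[i+1] + b[i+1];
    }
    for (; i < n; i++)                 /* finish any leftover element */
        c[i] = a[i] + b[i];
}
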
Vector Abstract Data Type (ADT)

[Diagram: a vector object holds a length field and a data array of elements 0 … length–1.]

Procedures
vec_ptr new_vec(int len)
 Create vector of specified length
int get_vec_element(vec_ptr v, int index, int *dest)
 Retrieve vector element, store at *dest
 Return 0 if out of bounds, 1 if successful
int *get_vec_start(vec_ptr v)
 Return pointer to start of vector data
 Similar to array implementations in Pascal, ML, Java
 E.g., always do bounds checking (a sketch of one possible implementation follows)

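The slides give only the interface; as a minimal sketch (assuming a CS:APP-style representation, with illustrative struct layout and field names), the ADT might be implemented like this:

/* Minimal sketch of a possible vector representation (assumed). */
typedef struct {
    int len;      /* number of elements */
    int *data;    /* element storage */
} vec_rec, *vec_ptr;

/* Retrieve element index into *dest; bounds-checked as the slide
 * describes: returns 0 if out of bounds, 1 on success. */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}
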
Optimization Example
void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

Procedure
 Compute sum of all elements of integer vector
 Store result at destination location
 Vector data structure and operations defined via abstract data type

Pentium II/III Performance: Clock Cycles / Element
 42.06 (Compiled -g)    31.25 (Compiled -O2)
Understanding Loop
void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
  loop:                              /* one iteration */
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))
        goto loop;
  done:
    ;
}

Inefficiency
 Procedure vec_length called every iteration
 Even though result always the same
Move vec_length Call Out of Loop
void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

Optimization
 Move call to vec_length out of inner loop
 Value does not change from one iteration to next
 Code motion
 CPE: 20.66 (Compiled -O2)
 vec_length requires only constant time, but significant overhead
Optimization Blocker: Procedure Calls
Why couldn’t the compiler move vec_length out of the inner loop?
 Procedure may have side effects
 Alters global state each time called
 Function may not return same value for given arguments
 Depends on other parts of global state
 Procedure lower could interact with strlen (sketched below)

Why doesn’t the compiler look at the code for vec_length?
 Interprocedural optimization is not used extensively due to cost

Warning:
 Compiler treats procedure call as a black box
 Weak optimizations in and around them
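The lower routine referred to above is not shown in this deck; the following is a minimal sketch, modeled on the CS:APP lower1 example, of a loop the compiler cannot clean up: because the body writes to s, the compiler cannot prove that strlen(s) stays constant, so it cannot move the call out of the loop.

#include <string.h>

/* Sketch of a lower-casing routine (CS:APP-style lower1).
 * strlen(s) is re-evaluated on every iteration. */
void lower1(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');   /* convert to lowercase */
}
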
Reduction in Strength
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

Optimization
 Avoid procedure call to retrieve each vector element
 Get pointer to start of array before loop
 Within loop just do pointer reference
 Not as clean in terms of data abstraction
 CPE: 6.00 (Compiled -O2)
 Procedure calls are expensive!
 Bounds checking is expensive
Eliminate Unneeded Memory Refs
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

Optimization
 Don’t need to store in destination until end
 Local variable sum held in register
 Avoids 1 memory read, 1 memory write per iteration
 CPE: 2.00 (Compiled -O2)
 Memory references are expensive!
Detecting Unneeded Memory Refs.

Combine3:
.L18:
    movl (%ecx,%edx,4),%eax
    addl %eax,(%edi)
    incl %edx
    cmpl %esi,%edx
    jl .L18

Combine4:
.L24:
    addl (%eax,%edx,4),%ecx
    incl %edx
    cmpl %esi,%edx
    jl .L24

Performance
 Combine3
 5 instructions in 6 clock cycles
 addl must read and write memory
 Combine4
 4 instructions in 2 clock cycles
Optimization Blocker: Memory Aliasing
Aliasing
 Two different memory references specify single location

Example
 v: [3, 2, 17]
 combine3(v, get_vec_start(v)+2) --> ?
 combine4(v, get_vec_start(v)+2) --> ?
 With dest aliased to v[2], combine3 updates *dest on every iteration and leaves v = [3, 2, 10], while combine4 accumulates in a register and leaves v = [3, 2, 22] (traced in the sketch below)

Observations
 Easy to have happen in C
 Since allowed to do address arithmetic
 Direct access to storage structures
 Get in habit of introducing local variables
 Accumulating within loops
 Your way of telling compiler not to check for aliasing
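A self-contained sketch (not the deck's vector code) makes the difference concrete; sum_to_dest follows the combine3 pattern and sum_in_temp the combine4 pattern, and they leave different values behind when dest aliases the last array element.

#include <stdio.h>

void sum_to_dest(int *a, int n, int *dest)   /* combine3 pattern */
{
    int i;
    *dest = 0;
    for (i = 0; i < n; i++)
        *dest += a[i];
}

void sum_in_temp(int *a, int n, int *dest)   /* combine4 pattern */
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    *dest = sum;
}

int main(void)
{
    int v1[3] = {3, 2, 17}, v2[3] = {3, 2, 17};
    sum_to_dest(v1, 3, &v1[2]);        /* dest aliases v1[2] */
    sum_in_temp(v2, 3, &v2[2]);        /* dest aliases v2[2] */
    printf("%d %d\n", v1[2], v2[2]);   /* prints: 10 22 */
    return 0;
}
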
General Forms of Combining
void abstract_combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP data[i];
    *dest = t;
}

Data Types
 Use different declarations for data_t
 int, float, double

Operations
 Use different definitions of OP and IDENT (one possible instantiation is sketched below)
 + / 0
 * / 1
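As a concrete (assumed) instantiation in the CS:APP style: the names data_t, OP, and IDENT come from the slide, while the typedef and macro values below are illustrative.

/* Integer sum version of the abstract combining code. */
typedef int data_t;
#define IDENT 0
#define OP +

/* A double-precision product would instead use:
 *   typedef double data_t;
 *   #define IDENT 1
 *   #define OP *
 */
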
Machine-Independent Opt. Results
Optimizations
 Reduce function calls and memory references within loop

Method            Integer         Floating Point
                   +      *        +       *
Abstract -g      42.06  41.86    41.44  160.00
Abstract -O2     31.25  33.25    31.25  143.00
Move vec_length  20.66  21.25    21.15  135.00
data access       6.00   9.00     8.00  117.00
Accum. in temp    2.00   4.00     3.00    5.00

Performance Anomaly
 Computing FP product of all elements exceptionally slow
 Very large speedup when accumulate in temporary
 Caused by quirk of IA32 floating point
 Memory uses 64-bit format, registers use 80
 Benchmark data caused overflow of 64 bits, but not 80
Machine-Independent Opt. Summary
Code Motion
 Compilers are good at this for simple loop/array structures
 Don’t do well in presence of procedure calls and memory aliasing

Reduction in Strength
 Shift, add instead of multiply or divide
 compilers are (generally) good at this
 Exact trade-offs machine-dependent
 Keep data in registers rather than memory
 compilers are not good at this, since concerned with aliasing

Share Common Subexpressions
 compilers have limited algebraic reasoning capabilities
Modern CPU Design
[Block diagram: an Instruction Control unit (Fetch Control, Instruction Cache, Instruction Decode, Register File, Retirement Unit) translates instructions into operations and issues them to the Execution unit’s functional units — Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, and Store — which exchange data with the Data Cache; operation results, register updates, and branch-prediction feedback (“Prediction OK?”) flow back to instruction control.]
CPU Capabilities of Pentium III
Multiple Instructions Can Execute in Parallel
 1 load
 1 store
 2 integer (one may be branch)
 1 FP Addition
 1 FP Multiplication or Division
Some Instructions Take > 1 Cycle, but Can be Pipelined
 Instruction                  Latency   Cycles/Issue
 Load / Store                    3           1
 Integer Multiply                4           1
 Integer Divide                 36          36
 Double/Single FP Multiply       5           2
 Double/Single FP Add            3           1
 Double/Single FP Divide        38          38
Instruction Control
[Block diagram: the Instruction Control portion of the CPU — Fetch Control, Instruction Cache, Instruction Decode, Register File, and Retirement Unit — fetching instructions and issuing operations to the execution units.]

Grabs Instruction Bytes From Memory
 Based on current PC + predicted targets for predicted branches
 Hardware dynamically guesses whether branches taken/not taken and (possibly) branch target

Translates Instructions Into Operations
 Primitive steps required to perform instruction
 Typical instruction requires 1–3 operations

Converts Register References Into Tags
 Abstract identifier linking destination of one operation with sources of later operations
Translation Example
Version of Combine4
 Integer data, multiply operation

Translation of First Iteration

.L24:                          # Loop:
    imull (%eax,%edx,4),%ecx   # t *= data[i]
    incl %edx                  # i++
    cmpl %esi,%edx             # i:length
    jl .L24                    # if < goto Loop

imull (%eax,%edx,4),%ecx  →  load (%eax,%edx.0,4) → t.1
                             imull t.1, %ecx.0 → %ecx.1
incl %edx                 →  incl %edx.0 → %edx.1
cmpl %esi,%edx            →  cmpl %esi, %edx.1 → cc.1
jl .L24                   →  jl-taken cc.1
Translation Example #1

imull (%eax,%edx,4),%ecx  →  load (%eax,%edx.0,4) → t.1
                             imull t.1, %ecx.0 → %ecx.1

 Split into two operations
 load reads from memory to generate temporary result t.1
 Multiply operation just operates on registers
 Operands
 Register %eax does not change in loop. Values will be retrieved from register file during decoding
 Register %ecx changes on every iteration. Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, …
» Register renaming
» Values passed directly from producer to consumers
Translation Example #2

incl %edx  →  incl %edx.0 → %edx.1

 Register %edx changes on each iteration. Rename as %edx.0, %edx.1, %edx.2, …
Translation Example #3

cmpl %esi,%edx  →  cmpl %esi, %edx.1 → cc.1

 Condition codes are treated similarly to registers
 Assign tag to define connection between producer and consumer
Translation Example #4

jl .L24  →  jl-taken cc.1

 Instruction control unit determines destination of jump
 Predicts whether it will be taken and the target
 Starts fetching instruction at predicted destination
 Execution unit simply checks whether or not prediction was OK
 If not, it signals instruction control
 Instruction control then “invalidates” any operations generated from misfetched instructions
 Begins fetching and decoding instructions at correct target
Visualizing Operations

load (%eax,%edx,4) → t.1
imull t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1

[Dataflow diagram: each operation is a node placed on a time axis; arcs carry %edx.0, %ecx.0, t.1, and cc.1 between operations.]

Operations
 Vertical position denotes time at which executed
 Cannot begin operation until operands available
 Height denotes latency
Operands
 Arcs shown only for operands that are passed within execution unit
Visualizing Operations (cont.)

load (%eax,%edx,4) → t.1
iaddl t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1

[Dataflow diagram as before, with the add node shorter because its latency is 1.]

Operations
 Same as before, except that add has latency of 1
3 Iterations of Combining Product
[Pipeline diagram: dataflow for iterations 1–3 of the integer product; the incl/cmpl/jl/load operations of successive iterations overlap in time, while the imull operations (producing %ecx.1, %ecx.2, %ecx.3) form a serial 4-cycle chain.]

Unlimited Resource Analysis
 Assume operation can start as soon as operands available
 Operations for multiple iterations overlap in time

Performance
 Limiting factor becomes latency of integer multiplier
 Gives CPE of 4.0

4 Iterations of Combining Sum
[Pipeline diagram: dataflow for iterations 1–4 of the integer sum; with addl latency 1, a new iteration starts every cycle, requiring 4 integer operations in flight at once.]

Unlimited Resource Analysis / Performance
 Can begin a new iteration on each clock cycle
 Should give CPE of 1.0
 Would require executing 4 integer operations in parallel
Combining Sum: Resource Constraints
[Pipeline diagram: iterations 4–8 of the integer sum with only two integer functional units; some operations are delayed even though their operands are available.]

 Only have two integer functional units
 Some operations delayed even though operands available
 Set priority based on program order

Performance
 Sustain CPE of 2.0
Loop Unrolling
void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+2] + data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

Optimization
 Combine multiple iterations into single loop body
 Amortizes loop overhead across multiple iterations
 Finish extras at end
 Measured CPE = 1.33
Visualizing Unrolled Loop
 Loads can pipeline, since don’t have dependencies
 Only one set of loop control operations

load (%eax,%edx.0,4) → t.1a
iaddl t.1a, %ecx.0c → %ecx.1a
load 4(%eax,%edx.0,4) → t.1b
iaddl t.1b, %ecx.1a → %ecx.1b
load 8(%eax,%edx.0,4) → t.1c
iaddl t.1c, %ecx.1b → %ecx.1c
iaddl $3,%edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1

[Dataflow diagram: the three loads pipeline, while the three adds form a chain producing %ecx.1a, %ecx.1b, %ecx.1c.]
Executing with Loop Unrolling
[Pipeline diagram: iterations 3 and 4 of the unrolled loop; each iteration performs three adds plus one set of loop-control operations.]

Predicted Performance
 Can complete iteration in 3 cycles
 Should give CPE of 1.0

Measured Performance
 CPE of 1.33
 One iteration every 4 cycles
Effect of Unrolling

Unrolling Degree    1     2     3     4     8     16
Integer Sum        2.00  1.50  1.33  1.50  1.25  1.06
Integer Product    4.00
FP Sum             3.00
FP Product         5.00

 Only helps integer sum for our examples
 Other cases constrained by functional unit latencies
 Effect is nonlinear with degree of unrolling
 Many subtle effects determine exact scheduling of operations
Parallel Loop Unrolling
void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}

Code Version
 Integer product

Optimization
 Accumulate in two different products
 Can be performed simultaneously
 Combine at end
 2-way parallelism

Performance
 CPE = 2.0
 2X performance
Dual Product Computation
Computation
((((((1 * x0) * x2) * x4) * x6) * x8) * x10) *
((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

[Diagram: two independent multiplication trees, one over the even-indexed elements and one over the odd-indexed elements, combined by a final multiply.]

Performance
 N elements, D cycles/operation
 (N/2+1)*D cycles
 ~2X performance improvement
Requirements for Parallel Computation
Mathematical
 Combining operation must be associative & commutative
 OK for integer multiplication
 Not strictly true for floating point (see the sketch below)
» OK for most applications

Hardware
 Pipelined functional units
 Ability to dynamically extract parallelism from code

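The floating-point caveat can be seen with a one-line experiment; this small, self-contained example (not from the slides) shows that regrouping a floating-point sum changes the result, which is why the compiler cannot reassociate FP combining on its own.

#include <stdio.h>

int main(void)
{
    double a = 1e20, b = -1e20, c = 3.14;
    printf("%g\n", (a + b) + c);   /* prints 3.14 */
    printf("%g\n", a + (b + c));   /* prints 0: c is absorbed by b */
    return 0;
}
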
Visualizing Parallel Loop
 Two multiplies within loop no longer have data dependency
 Allows them to pipeline

load (%eax,%edx.0,4) → t.1a
imull t.1a, %ecx.0 → %ecx.1
load 4(%eax,%edx.0,4) → t.1b
imull t.1b, %ebx.0 → %ebx.1
iaddl $2,%edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1

[Dataflow diagram: the two imull chains, producing %ecx.1 and %ebx.1, proceed independently.]
Executing with Parallel Loop
[Pipeline diagram: iterations 1–3 of the two-way parallel product; the two multiply chains overlap so a multiply completes every 2 cycles.]

Predicted Performance
 Can keep 4-cycle multiplier busy performing two simultaneous multiplications
 Gives CPE of 2.0
Summary: Results for Pentium III

Method            Integer         Floating Point
                   +      *        +       *
Abstract -g      42.06  41.86    41.44  160.00
Abstract -O2     31.25  33.25    31.25  143.00
Move vec_length  20.66  21.25    21.15  135.00
data access       6.00   9.00     8.00  117.00
Accum. in temp    2.00   4.00     3.00    5.00
Pointer           3.00   4.00     3.00    5.00
Unroll 4          1.50   4.00     3.00    5.00
Unroll 16         1.06   4.00     3.00    5.00
2X2               1.50   2.00     2.00    2.50
4X4               1.50   2.00     1.50    2.50
8X4               1.25   1.25     1.50    2.00
Theoretical Opt.  1.00   1.00     1.00    2.00
Worst : Best      39.7   33.5     27.6    80.0
Limitations of Parallel Execution
Need Lots of Registers
 To hold sums/products
 Only 6 usable integer registers
 Also needed for pointers, loop conditions
 8 FP registers
 When not enough registers, must spill temporaries onto
stack
 Wipes out any performance gains
 Not helped by renaming
 Cannot reference more operands than instruction set allows
 Major drawback of IA32 instruction set

Register Spilling Example
Example
 8 X 8 integer product
 7 local variables share 1 register
 See that locals are being stored on the stack
 E.g., at -8(%ebp)

.L165:
    imull (%eax),%ecx
    movl -4(%ebp),%edi
    imull 4(%eax),%edi
    movl %edi,-4(%ebp)
    movl -8(%ebp),%edi
    imull 8(%eax),%edi
    movl %edi,-8(%ebp)
    movl -12(%ebp),%edi
    imull 12(%eax),%edi
    movl %edi,-12(%ebp)
    movl -16(%ebp),%edi
    imull 16(%eax),%edi
    movl %edi,-16(%ebp)

    addl $32,%eax
    addl $8,%edx
    cmpl -32(%ebp),%edx
    jl .L165
Results for Alpha Processor

Method            Integer         Floating Point
                   +      *        +       *
Abstract -g      40.14  47.14    52.07   53.71
Abstract -O2     25.08  36.05    37.37   32.02
Move vec_length  19.19  32.18    28.73   32.73
data access       6.26  12.52    13.26   13.01
Accum. in temp    1.76   9.01     8.08    8.01
Unroll 4          1.51   9.01     6.32    6.32
Unroll 16         1.25   9.01     6.33    6.22
4X2               1.19   4.69     4.44    4.45
8X4               1.15   4.12     2.34    2.01
8X8               1.11   4.24     2.36    2.08
Worst : Best      36.2   11.4     22.3    26.7

 Overall trends very similar to those for Pentium III
 Even though very different architecture and compiler
Results for Pentium 4 Processor

Method            Integer         Floating Point
                   +      *        +       *
Abstract -g      35.25  35.34    35.85   38.00
Abstract -O2     26.52  30.26    31.55   32.00
Move vec_length  18.00  25.71    23.36   24.25
data access       3.39  31.56    27.50   28.35
Accum. in temp    2.00  14.00     5.00    7.00
Unroll 4          1.01  14.00     5.00    7.00
Unroll 16         1.00  14.00     5.00    7.00
4X2               1.02   7.00     2.63    3.50
8X4               1.01   3.98     1.82    2.00
8X8               1.63   4.50     2.42    2.31
Worst : Best      35.2    8.9     19.7    19.0

 Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0)
 Clock runs at 2.0 GHz
 Not an improvement over 1.0 GHz P3 for integer *
 Avoids FP multiplication anomaly
Machine-Dependent Opt. Summary
Loop Unrolling
 Some compilers do this automatically
 Generally not as clever as what can achieve by hand

Exposing Instruction-Level Parallelism
 Generally helps, but extent of improvement is machine dependent

Warning:
 Benefits depend heavily on particular machine
 Best if performed by compiler
 But GCC on IA32/Linux is not very good
 Do only for performance-critical parts of code
Important Tools
Observation
 Generating assembly code
 Lets you see what optimizations compiler can make
 Understand capabilities/limitations of particular compiler

Measurement
 Accurately compute time taken by code
 Most modern machines have built-in cycle counters
 Using them to get reliable measurements is tricky
» Chapter 9 of the CS:APP textbook
 Profile procedure calling frequencies
 Unix tool gprof

Code Profiling Example
Task
 Count word frequencies in text document
 Produce sorted list of words from most frequent to least
Steps
 Convert strings to lowercase
 Apply hash function
 Read words and insert into hash table
 Mostly list operations
 Maintain counter for each unique word
 Sort results
Data Set
 Collected works of Shakespeare
 946,596 total words, 26,596 unique
 Initial implementation: 9.2 seconds

Shakespeare’s
most frequent words
29,801 the
27,529 and
21,029 I
20,957 to
18,514 of
15,370 a
14,010 you
12,936 my
11,722 in
11,519 that
Code Profiling
Augment Executable Program with Timing Functions
 Computes (approximate) amount of time spent in each function
 Time computation method
 Periodically (~ every 10ms) interrupt program
 Determine what function is currently executing
 Increment its timer by interval (e.g., 10ms)
 Also maintains counter for each function indicating number of times called

Using
gcc -O2 -pg prog.c -o prog
./prog
 Executes in normal fashion, but also generates file gmon.out
gprof prog
 Generates profile information based on gmon.out
Profiling Results

  %    cumulative    self                 self      total
 time     seconds  seconds      calls   ms/call   ms/call   name
86.60        8.21     8.21          1   8210.00   8210.00   sort_words
 5.80        8.76     0.55     946596      0.00      0.00   lower1
 4.75        9.21     0.45     946596      0.00      0.00   find_ele_rec
 1.27        9.33     0.12     946596      0.00      0.00   h_add

Call Statistics
 Number of calls and cumulative time for each function

Performance Limiter
 Using inefficient sorting algorithm
 Single call uses 87% of CPU time
Code Optimizations
[Bar chart: CPU seconds (0–10) for each version — Initial, Quicksort, Iter First, Iter Last, Big Table, Better Hash, Linear Lower — with each bar broken down into Sort, List, Lower, Hash, and Rest components.]

 First step: Use more efficient sorting function
 Library function qsort (a usage sketch follows below)
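The deck does not show the qsort-based replacement; the following is a hedged sketch of how sort_words might call the C library's qsort, with the word_rec record type and field names invented for illustration.

#include <stdlib.h>

/* Hypothetical record type for illustration only */
typedef struct {
    char *word;
    int   count;
} word_rec;

/* Comparison callback: sort by descending count */
static int cmp_count_desc(const void *a, const void *b)
{
    const word_rec *wa = a;
    const word_rec *wb = b;
    return wb->count - wa->count;
}

/* Replace the hand-written sort with the library qsort */
void sort_words(word_rec *words, size_t nwords)
{
    qsort(words, nwords, sizeof(word_rec), cmp_count_desc);
}
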
Further Optimizations
[Bar chart: CPU seconds (0–2) for the same versions and components, after the switch to qsort.]

 Iter first: Use iterative function to insert elements into linked list
 Causes code to slow down
 Iter last: Iterative function, places new entry at end of list
 Tend to place most common words at front of list
 Big table: Increase number of hash buckets
 Better hash: Use more sophisticated hash function
 Linear lower: Move strlen out of loop (sketched below)
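The "linear lower" change is not shown in the deck; a minimal sketch, assuming a lower2-style rewrite of the lower1 routine sketched earlier, simply hoists the strlen call so the loop no longer re-scans the string on every iteration.

#include <string.h>

/* Sketch of the strlen-hoisted version (illustrative) */
void lower2(char *s)
{
    size_t i;
    size_t len = strlen(s);   /* computed once, outside the loop */
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
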
Profiling Observations
Benefits
 Helps identify performance bottlenecks
 Especially useful when have complex system with many
components

Limitations
 Only shows performance for data tested
 E.g., linear lower did not show big gain, since words are
short
 Quadratic inefficiency could remain lurking in code
 Timing mechanism fairly crude
 Only works for programs that run for > 3 seconds

Role of Programmer
How should I write my programs, given that I have a good,
optimizing compiler?
Don’t: Smash Code into Oblivion
 Hard to read, maintain, & assure correctness
Do:
 Select best algorithm
 Write code that’s readable & maintainable
 Procedures, recursion, without built-in constant limits
 Even though these factors can slow down code
 Eliminate optimization blockers
 Allows compiler to do its job

Focus on Inner Loops
 Do detailed optimizations where code will be executed repeatedly
 Will get most performance gain here
