Code Optimization
15-213: "The Course That Gives CMU Its Zip!"
Sept. 25, 2003
Topics
Machine-Independent Optimizations
Machine-Dependent Optimizations
Code Profiling
Harsh Reality
There’s more to performance than asymptotic complexity
Code Motion
Reduce frequency with which computation performed
If it will always produce same result
Especially moving code out of loop
Before:
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After:
    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
Clock Cycles
Most computers controlled by high frequency clock signal
Typical Range
100 MHz
» 10^8 cycles per second
» Clock period = 10 ns
2 GHz
» 2 × 10^9 cycles per second
» Clock period = 0.5 ns
Fish machines: 550 MHz (1.8 ns clock period)
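Worked example: a routine that takes 10,000 cycles needs 10,000 × 1.8 ns = 18 μs on a 550 MHz Fish machine, but only 10,000 × 0.5 ns = 5 μs at 2 GHz.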
Cycles Per Element
Convenient way to express performance of a program that operates on vectors or lists
Length = n
T = CPE*n + Overhead
[Plot: cycles (0-1000) vs. number of elements (0-200). vsum1 has slope 4.0 cycles/element; vsum2 has slope 3.5.]
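Worked example: with CPE = 3.5 and, say, 500 cycles of overhead (the overhead figure is illustrative), summing n = 1000 elements costs roughly 3.5 × 1000 + 500 = 4000 cycles, about 2 μs on a 2 GHz machine.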
Vector Abstract Data Type (ADT)
[Diagram: vector object with a length field and a data array indexed 0 … length–1.]
Procedures
vec_ptr new_vec(int len)
Create vector of specified length
int get_vec_element(vec_ptr v, int index, int *dest)
Retrieve vector element, store at *dest
Return 0 if out of bounds, 1 if successful
int *get_vec_start(vec_ptr v)
Return pointer to start of vector data
Similar to array implementations in Pascal, ML, Java
E.g., always do bounds checking
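A minimal C sketch of how this ADT might be implemented (the struct layout, field names, and allocation strategy are assumptions; only the three procedure signatures come from the slide):

    #include <stdlib.h>

    typedef struct {
        int len;     /* number of elements */
        int *data;   /* array of elements */
    } vec_rec, *vec_ptr;

    /* Create vector of specified length, zero-initialized */
    vec_ptr new_vec(int len)
    {
        vec_ptr v = malloc(sizeof(vec_rec));
        if (!v)
            return NULL;
        v->len = len;
        v->data = calloc(len, sizeof(int));
        return v;
    }

    /* Retrieve element; return 0 if out of bounds, 1 if successful */
    int get_vec_element(vec_ptr v, int index, int *dest)
    {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

    /* Return pointer to start of vector data */
    int *get_vec_start(vec_ptr v)
    {
        return v->data;
    }

    /* Return number of elements */
    int vec_length(vec_ptr v)
    {
        return v->len;
    }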
Optimization Example
void combine1(vec_ptr v, int *dest)
{
int i;
*dest = 0;
for (i = 0; i < vec_length(v); i++) {
int val;
get_vec_element(v, i, &val);
*dest += val;
}
}
Procedure
Compute sum of all elements of integer vector
Store result at destination location
Vector data structure and operations defined via abstract data type
Understanding Loop
void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
  loop:                            /* one iteration */
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))
        goto loop;
  done:
    ;
}
Inefficiency
Procedure vec_length called every iteration
Even though result always the same
Move vec_length Call Out of Loop
void combine2(vec_ptr v, int *dest)
{
int i;
int length = vec_length(v);
*dest = 0;
for (i = 0; i < length; i++) {
int val;
get_vec_element(v, i, &val);
*dest += val;
}
}
Optimization
Move call to vec_length out of inner loop
Value does not change from one iteration to next
Code motion
CPE: 20.66 (Compiled -O2)
vec_length requires only constant time, but significant overhead
Optimization Blocker: Procedure Calls
Why couldn’t the compiler move vec_length out of the inner loop?
Procedure may have side effects
Alters global state each time called
Function may not return same value for given arguments
Depends on other parts of global state
Procedure lower could interact with strlen
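A hypothetical sketch of the problem (the lencnt counter is invented for illustration; vec_ptr is as in the earlier ADT sketch): if vec_length were written with a side effect, hoisting the call out of the loop would change the program's behavior.

    int lencnt = 0;              /* global state */

    int vec_length(vec_ptr v)
    {
        lencnt++;                /* side effect on every call */
        return v->len;
    }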
Warning:
Compiler treats procedure call as a black box
Weak optimizations in and around them
Reduction in Strength
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}
Optimization
Avoid procedure call to retrieve each vector element
Get pointer to start of array before loop
Within loop just do pointer reference
Not as clean in terms of data abstraction
CPE: 6.00 (Compiled -O2)
Procedure calls are expensive!
Bounds checking is expensive
Eliminate Unneeded Memory Refs
void combine4(vec_ptr v, int *dest)
{
int i;
int length = vec_length(v);
int *data = get_vec_start(v);
int sum = 0;
for (i = 0; i < length; i++)
sum += data[i];
*dest = sum;
}
Optimization
Don’t need to store in destination until end
Local variable sum held in register
Avoids 1 memory read and 1 memory write per iteration
CPE: 2.00 (Compiled -O2)
Memory references are expensive!
Detecting Unneeded Memory Refs.
Combine3:
    .L18:
        movl (%ecx,%edx,4),%eax
        addl %eax,(%edi)
        incl %edx
        cmpl %esi,%edx
        jl .L18

Combine4:
    .L24:
        addl (%eax,%edx,4),%ecx
        incl %edx
        cmpl %esi,%edx
        jl .L24
Performance
Combine3
5 instructions in 6 clock cycles
addl must read and write memory
Combine4
4 instructions in 2 clock cycles
Optimization Blocker: Memory Aliasing
Aliasing
Two different memory references specify single location
Example
v: [3, 2, 17]
combine3(v, get_vec_start(v)+2) --> v becomes [3, 2, 10]
  (each += reads the partially computed sum back out of v[2])
combine4(v, get_vec_start(v)+2) --> v becomes [3, 2, 22]
  (sum accumulated in a register; v[2] written once at the end)
Observations
Easy to have happen in C
Since allowed to do address arithmetic
Direct access to storage structures
Get in habit of introducing local variables
Accumulating within loops
Your way of telling compiler not to check for aliasing
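A runnable sketch of the two behaviors, using plain arrays in place of the vector ADT (sum3 and sum4 are hypothetical stand-ins for combine3 and combine4):

    #include <stdio.h>

    /* combine3-style: accumulates through *dest */
    static void sum3(int *data, int n, int *dest)
    {
        int i;
        *dest = 0;
        for (i = 0; i < n; i++)
            *dest += data[i];     /* rereads *dest each time */
    }

    /* combine4-style: accumulates in a local variable */
    static void sum4(int *data, int n, int *dest)
    {
        int i;
        int sum = 0;
        for (i = 0; i < n; i++)
            sum += data[i];
        *dest = sum;
    }

    int main(void)
    {
        int a[3] = {3, 2, 17};
        int b[3] = {3, 2, 17};
        sum3(a, 3, &a[2]);            /* dest aliases a[2] */
        printf("sum3: %d\n", a[2]);   /* prints 10 */
        sum4(b, 3, &b[2]);
        printf("sum4: %d\n", b[2]);   /* prints 22 */
        return 0;
    }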
General Forms of Combining
void abstract_combine4(vec_ptr v, data_t *dest)
{
int i;
int length = vec_length(v);
data_t *data = get_vec_start(v);
data_t t = IDENT;
for (i = 0; i < length; i++)
t = t OP data[i];
*dest = t;
}
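Here data_t, OP, and IDENT are compile-time parameters. A plausible instantiation for integer product (the macro style is an assumption, matching the slide's naming):

    typedef int data_t;
    #define IDENT 1      /* identity element for the operation */
    #define OP *         /* combining operation */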
Method            Int +   Int *   FP +    FP *
Abstract -g       42.06   41.86   41.44   160.00
Abstract -O2      31.25   33.25   31.25   143.00
Move vec_length   20.66   21.25   21.15   135.00
data access        6.00    9.00    8.00   117.00
Accum. in temp     2.00    4.00    3.00     5.00

Performance Anomaly
Computing FP product of all elements exceptionally slow
Very large speedup when accumulating in a temporary
Caused by quirk of IA32 floating point
  Memory uses 64-bit format, registers use 80
  Benchmark data caused overflow of 64 bits, but not 80
Machine-Independent Opt. Summary
Code Motion
Compilers are good at this for simple loop/array structures
Don’t do well in presence of procedure calls and memory aliasing
Reduction in Strength
Shift, add instead of multiply or divide (see the sketch below)
compilers are (generally) good at this
Exact trade-offs machine-dependent
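A minimal sketch of both patterns (fragments; the variable names are invented):

    /* Multiply by a power of two becomes a shift */
    x = i * 8;                   /* compiler can emit x = i << 3; */

    /* Multiply by the loop index becomes a running sum,
       as in the code motion example earlier */
    for (i = 0, ni = 0; i < n; i++, ni += n)
        a[ni] = b[i];            /* instead of a[n*i] = b[i] */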
Keep data in registers rather than memory
compilers are not good at this, since concerned with aliasing
Modern CPU Design
[Block diagram: an Instruction Control unit (Fetch Control, Instruction Cache, Instruction Decode, Register File, Retirement Unit) issues operations to the Execution unit, which exchanges data with the Data Cache.]
CPU Capabilities of Pentium III
Multiple Instructions Can Execute in Parallel
1 load
1 store
2 integer (one may be branch)
1 FP Addition
1 FP Multiplication or Division
Some Instructions Take > 1 Cycle, but Can be Pipelined
Instruction Latency Cycles/Issue
Load / Store 3 1
Integer Multiply 4 1
Integer Divide 36 36
Double/Single FP Multiply 5 2
Double/Single FP Add 3 1
Double/Single FP Divide 38 38
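Worked example from the table: FP multiply has latency 5 and issues every 2 cycles, so a single chain of dependent multiplies runs at 5 cycles per element, while independent multiplies (e.g., two parallel accumulators) could in principle sustain one every 2 cycles.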
Instruction Control
[Same block diagram as before: the Instruction Control unit translates IA32 instructions into operations for the execution unit.]

    .L24:                         # Loop:
        imull (%eax,%edx,4),%ecx  # t *= data[i]
        incl %edx                 # i++
        cmpl %esi,%edx            # i:length
        jl .L24                   # if < goto Loop

translates into the operation sequence

    load (%eax,%edx.0,4)  → t.1
    imull t.1, %ecx.0     → %ecx.1
    incl %edx.0           → %edx.1
    cmpl %esi, %edx.1     → cc.1
    jl-taken cc.1
Translation Example #1
    imull (%eax,%edx,4),%ecx

becomes:

    load (%eax,%edx.0,4)  → t.1
    imull t.1, %ecx.0     → %ecx.1
Split into two operations
load reads from memory to generate temporary result t.1
Multiply operation just operates on registers
Operands
Register %eax does not change in loop. Values will be retrieved from register file during decoding
Register %ecx changes on every iteration. Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, …
» Register renaming
» Values passed directly from producer to consumers
Translation Example #2
    incl %edx

becomes:

    incl %edx.0 → %edx.1
Translation Example #3
    cmpl %esi,%edx

becomes:

    cmpl %esi, %edx.1 → cc.1

Condition codes are renamed the same way as registers (cc.1, cc.2, …).
Translation Example #4
    jl .L24

becomes:

    jl-taken cc.1

The instruction control unit predicts the branch taken; this operation checks the condition codes to verify that prediction.
Visualizing Operations
    load (%eax,%edx.0,4)  → t.1
    imull t.1, %ecx.0     → %ecx.1
    incl %edx.0           → %edx.1
    cmpl %esi, %edx.1     → cc.1
    jl-taken cc.1

[Dataflow diagram: the operations placed on a vertical time axis, with arcs from producers to consumers; the load feeds the imull, whose box is the tallest.]

Operations
Vertical position denotes time at which executed
Cannot begin operation until operands available
Height denotes latency
Operands
Arcs shown only for operands that are passed within the execution unit
Visualizing Operations (cont.)
    load (%eax,%edx.0,4)  → t.1
    iaddl t.1, %ecx.0     → %ecx.1
    incl %edx.0           → %edx.1
    cmpl %esi, %edx.1     → cc.1
    jl-taken cc.1

[Dataflow diagram: same shape as before, but the add box is only one cycle tall.]

Operations
Same as before, except that add has latency of 1
3 Iterations of Combining Product
[Pipelined execution diagram: three overlapping iterations (i = 0, 1, 2); the load, incl, cmpl, and jl operations of later iterations overlap with earlier ones, but each imull must wait for the previous product.]

Unlimited Resource Analysis
Assume an operation can start as soon as its operands are available
Operations for multiple iterations overlap in time

Performance
Limiting factor becomes latency of integer multiplier
Gives CPE of 4.0
Iterations of Combining Sum

[Pipelined execution diagram: overlapping iterations of the sum loop; each iteration requires 4 integer ops plus a load.]

[Second diagram, iterations 4-8: with resource constraints, some operations must be delayed even though their operands are available.]

Performance
Only have two integer functional units
Sustained CPE of 2.0, matching the measured value for combine4
Loop Unrolling
void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+1] + data[i+2];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

Optimization
Combine multiple iterations into single loop body
Amortizes loop overhead across multiple iterations
Finish extras at end
Measured CPE = 1.33
Visualizing Unrolled Loop
[Dataflow diagram of the unrolled loop body: one load feeding each of the three adds.]

Loads can pipeline, since they have no data dependencies between them
Executing with Loop Unrolling
[Pipelined execution diagram: iterations 3 and 4 (i = 6, 9) of the unrolled loop; the three loads and three adds of each iteration overlap with the next iteration.]

Predicted Performance
Can complete an iteration in 3 cycles
Should give CPE of 1.0

Measured Performance
CPE of 1.33
One iteration every 4 cycles
Effect of Unrolling
Unrolling Degree   1      2      3      4      8      16
Integer Sum        2.00   1.50   1.33   1.50   1.25   1.06
Integer Product    4.00
FP Sum             3.00
FP Product         5.00

Only the integer sum benefits; the product and FP combinations are bound by functional-unit latency (4.00, 3.00, and 5.00 respectively) and do not improve with unrolling alone.
Parallel Loop Unrolling
void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}

Code Version
Integer product

Optimization
Accumulate in two different products
Can be performed simultaneously
Combine at end
2-way parallelism

Performance
CPE = 2.0
2X performance
Dual Product Computation
Computation
    ((((((1 * x0) * x2) * x4) * x6) * x8) * x10) *
    ((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

[Tree diagram: two independent product chains, one over the even-indexed elements and one over the odd-indexed elements, joined by a final multiply.]

Performance
N elements, D cycles/operation
(N/2+1)*D cycles
~2X performance improvement
Requirements for Parallel Computation
Mathematical
Combining operation must be associative & commutative
OK for integer multiplication
Not strictly true for floating point (see the example after this list)
» OK for most applications
Hardware
Pipelined functional units
Ability to dynamically extract parallelism from code
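A small illustration of the floating-point caveat (the values are chosen to force rounding):

    #include <stdio.h>

    int main(void)
    {
        /* FP addition is not associative: the large value
           absorbs the small addend under one grouping only */
        double a = 1e20, b = -1e20, c = 1.0;
        printf("%g\n", (a + b) + c);   /* prints 1 */
        printf("%g\n", a + (b + c));   /* prints 0 */
        return 0;
    }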
Visualizing Parallel Loop
The two multiplies within the loop no longer have a data dependency, which allows them to pipeline.

[Dataflow diagram: the two imull chains, accumulating into %ecx and %ebx, proceed in parallel.]

    load (%eax,%edx.0,4)   → t.1a
    imull t.1a, %ecx.0     → %ecx.1
    load 4(%eax,%edx.0,4)  → t.1b
    imull t.1b, %ebx.0     → %ebx.1
    iaddl $2,%edx.0        → %edx.1
    cmpl %esi, %edx.1      → cc.1
    jl-taken cc.1
Executing with Parallel Loop
[Pipelined execution diagram: iterations of the parallel loop overlap; the two product chains execute concurrently.]
Summary: Results for Pentium III
Register Spilling Example
Example
8 X 8 integer product
7 local variables share 1 register
See that locals are being stored on the stack
E.g., at -8(%ebp)

    .L165:
        imull (%eax),%ecx
        movl -4(%ebp),%edi
        imull 4(%eax),%edi
        movl %edi,-4(%ebp)
        movl -8(%ebp),%edi
        imull 8(%eax),%edi
        movl %edi,-8(%ebp)
        movl -12(%ebp),%edi
        imull 12(%eax),%edi
        movl %edi,-12(%ebp)
        movl -16(%ebp),%edi
        imull 16(%eax),%edi
        movl %edi,-16(%ebp)
        …
        addl $32,%eax
        addl $8,%edx
        cmpl -32(%ebp),%edx
        jl .L165
Results for Alpha Processor
Method            Int +   Int *   FP +    FP *
Abstract -g       40.14   47.14   52.07   53.71
Abstract -O2      25.08   36.05   37.37   32.02
Move vec_length   19.19   32.18   28.73   32.73
data access        6.26   12.52   13.26   13.01
Accum. in temp     1.76    9.01    8.08    8.01
Unroll 4           1.51    9.01    6.32    6.32
Unroll 16          1.25    9.01    6.33    6.22
4X2                1.19    4.69    4.44    4.45
8X4                1.15    4.12    2.34    2.01
8X8                1.11    4.24    2.36    2.08
Worst : Best       36.2    11.4    22.3    26.7
Results for Pentium 4 Processor
Method            Int +   Int *   FP +    FP *
Abstract -g       35.25   35.34   35.85   38.00
Abstract -O2      26.52   30.26   31.55   32.00
Move vec_length   18.00   25.71   23.36   24.25
data access        3.39   31.56   27.50   28.35
Accum. in temp     2.00   14.00    5.00    7.00
Unroll 4           1.01   14.00    5.00    7.00
Unroll 16          1.00   14.00    5.00    7.00
4X2                1.02    7.00    2.63    3.50
8X4                1.01    3.98    1.82    2.00
8X8                1.63    4.50    2.42    2.31
Worst : Best       35.2     8.9    19.7    19.0
Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0)
Clock runs at 2.0 GHz
Not an improvement over 1.0 GHz P3 for integer *
Avoids FP multiplication anomaly
Machine-Dependent Opt. Summary
Loop Unrolling
Some compilers do this automatically
Generally not as clever as what can achieve by hand
Warning:
Benefits depend heavily on particular machine
Best if performed by compiler
But GCC on IA32/Linux is not very good
Do only for performance-critical parts of code
Important Tools
Observation
Generating assembly code
Lets you see what optimizations compiler can make
Understand capabilities/limitations of particular compiler
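For example (the file name is illustrative):

    gcc -O2 -S combine.c     # writes the generated assembly to combine.s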
Measurement
Accurately compute time taken by code
Most modern machines have built-in cycle counters
Using them to get reliable measurements is tricky
» Chapter 9 of the CS:APP textbook
Profile procedure calling frequencies
Unix tool gprof
Code Profiling Example
Task
Count word frequencies in text document
Produce sorted list of words from most frequent to least
Steps
Convert strings to lowercase
Apply hash function
Read words and insert into hash table
Mostly list operations
Maintain counter for each unique word
Sort results
Data Set
Collected works of Shakespeare
946,596 total words, 26,596 unique
Initial implementation: 9.2 seconds
Shakespeare’s most frequent words
29,801  the
27,529  and
21,029  I
20,957  to
18,514  of
15,370  a
14,010  you
12,936  my
11,722  in
11,519  that
Code Profiling
Augment Executable Program with Timing Functions
Computes (approximate) amount of time spent in each function
Time computation method
Periodically (~ every 10ms) interrupt program
Determine what function is currently executing
Increment its timer by interval (e.g., 10ms)
Also maintains counter for each function indicating number of times called
Using
gcc -O2 -pg prog.c -o prog
./prog
Executes in normal fashion, but also generates file gmon.out
gprof prog
Generates profile information based on gmon.out
Profiling Results
% cumulative self self total
time seconds seconds calls ms/call ms/call name
86.60 8.21 8.21 1 8210.00 8210.00 sort_words
5.80 8.76 0.55 946596 0.00 0.00 lower1
4.75 9.21 0.45 946596 0.00 0.00 find_ele_rec
1.27 9.33 0.12 946596 0.00 0.00 h_add
Call Statistics
Number of calls and cumulative time for each function
Performance Limiter
Using inefficient sorting algorithm
Single call uses 87% of CPU time
Code Optimizations

[Stacked bar chart: CPU seconds (0-10) for successive versions (Initial, Quicksort, Iter First, Iter Last, Big Table, Better Hash, Linear Lower), each bar broken into Sort, List, Lower, Hash, and Rest components.]
Further Optimizations
[The same chart rescaled to 0-2 CPU seconds, showing the versions after the sort fix.]
Iter first: Use iterative function to insert elements into linked list
Causes code to slow down
Iter last: Iterative function, places new entry at end of list
Tend to place most common words at front of list
Big table: Increase number of hash buckets
Better hash: Use more sophisticated hash function
Linear lower: Move strlen out of loop
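A sketch of the linear-lower fix (the function names follow the CS:APP textbook's lower1 example; the bodies are a plausible reconstruction):

    #include <string.h>

    /* Before: strlen is re-evaluated on every iteration; since the
       loop writes to s, the compiler cannot hoist the call, so the
       overall cost is quadratic in the string length */
    void lower1(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

    /* After: move the strlen call out of the loop (code motion) */
    void lower2(char *s)
    {
        int i;
        int len = strlen(s);
        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }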
Profiling Observations
Benefits
Helps identify performance bottlenecks
Especially useful when have complex system with many components
Limitations
Only shows performance for data tested
E.g., linear lower did not show big gain, since words are short
Quadratic inefficiency could remain lurking in code
Timing mechanism fairly crude
Only works for programs that run for > 3 seconds
Role of Programmer
How should I write my programs, given that I have a good, optimizing compiler?
Don’t: Smash Code into Oblivion
Hard to read, maintain, & assure correctness
Do:
Select best algorithm
Write code that’s readable & maintainable
Procedures, recursion, without built-in constant limits
Even though these factors can slow down code
Eliminate optimization blockers
Allows compiler to do its job