Code Optimization
15-213: "The Course That Gives CMU Its Zip!"
Sept. 25, 2003
Topics
Machine-Independent Optimizations
Machine-Dependent Optimizations
Code Profiling
Harsh Reality
There’s more to performance than asymptotic complexity
Code Motion
Reduce frequency with which computation performed
If it will always produce same result
Especially moving code out of loop
Before:
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After:
    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
Clock Cycles
Most computers controlled by high frequency clock signal
Typical Range
100 MHz
» 10^8 cycles per second
» Clock period = 10 ns
2 GHz
» 2 × 10^9 cycles per second
» Clock period = 0.5 ns
Fish machines: 550 MHz (1.8 ns clock period)
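Worked example: a routine that takes 10,000 cycles needs 10,000 × 1.8 ns = 18 μs on a 550 MHz Fish machine, but only 10,000 × 0.5 ns = 5 μs at 2 GHz.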
Cycles Per Element
Convenient way to express performance of a program that operates on vectors or lists
Length = n
T = CPE*n + Overhead
[Plot: cycles (0-1000) vs. number of elements (0-200). vsum1 has slope 4.0 cycles/element; vsum2 has slope 3.5.]
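Worked example: with CPE = 3.5 and, say, 500 cycles of overhead (the overhead figure is illustrative), summing n = 1000 elements costs roughly 3.5 × 1000 + 500 = 4000 cycles, about 2 μs on a 2 GHz machine.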
Vector Abstract Data Type (ADT)
[Diagram: vector object with a length field and a data array indexed 0 … length–1.]
Procedures
vec_ptr new_vec(int len)
Create vector of specified length
int get_vec_element(vec_ptr v, int index, int *dest)
Retrieve vector element, store at *dest
Return 0 if out of bounds, 1 if successful
int *get_vec_start(vec_ptr v)
Return pointer to start of vector data
Similar to array implementations in Pascal, ML, Java
E.g., always do bounds checking
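A minimal C sketch of how this ADT might be implemented (the struct layout, field names, and allocation strategy are assumptions; only the three procedure signatures come from the slide):

    #include <stdlib.h>

    typedef struct {
        int len;     /* number of elements */
        int *data;   /* array of elements */
    } vec_rec, *vec_ptr;

    /* Create vector of specified length, zero-initialized */
    vec_ptr new_vec(int len)
    {
        vec_ptr v = malloc(sizeof(vec_rec));
        if (!v)
            return NULL;
        v->len = len;
        v->data = calloc(len, sizeof(int));
        return v;
    }

    /* Retrieve element; return 0 if out of bounds, 1 if successful */
    int get_vec_element(vec_ptr v, int index, int *dest)
    {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

    /* Return pointer to start of vector data */
    int *get_vec_start(vec_ptr v)
    {
        return v->data;
    }

    /* Return number of elements */
    int vec_length(vec_ptr v)
    {
        return v->len;
    }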
Optimization Example
void combine1(vec_ptr v, int *dest)
{
int i;
*dest = 0;
for (i = 0; i < vec_length(v); i++) {
int val;
get_vec_element(v, i, &val);
*dest += val;
}
}
Procedure
Compute sum of all elements of integer vector
Store result at destination location
Vector data structure and operations defined via abstract data type
Understanding Loop
void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
  loop:                            /* one iteration */
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))
        goto loop;
  done:
    ;
}
Inefficiency
Procedure vec_length called every iteration
Even though result always the same
Move vec_length Call Out of Loop
void combine2(vec_ptr v, int *dest)
{
int i;
int length = vec_length(v);
*dest = 0;
for (i = 0; i < length; i++) {
int val;
get_vec_element(v, i, &val);
*dest += val;
}
}
Optimization
Move call to vec_length out of inner loop
Value does not change from one iteration to next
Code motion
CPE: 20.66 (Compiled -O2)
vec_length requires only constant time, but significant overhead
Optimization Blocker: Procedure Calls
Why couldn’t the compiler move vec_length out of the inner loop?
Procedure may have side effects
Alters global state each time called
Function may not return same value for given arguments
Depends on other parts of global state
Procedure lower could interact with strlen
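A hypothetical sketch of the problem (the lencnt counter is invented for illustration; vec_ptr is as in the earlier ADT sketch): if vec_length were written with a side effect, hoisting the call out of the loop would change the program's behavior.

    int lencnt = 0;              /* global state */

    int vec_length(vec_ptr v)
    {
        lencnt++;                /* side effect on every call */
        return v->len;
    }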
Warning:
Compiler treats procedure call as a black box
Weak optimizations in and around them
Reduction in Strength
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}
Optimization
Avoid procedure call to retrieve each vector element
Get pointer to start of array before loop
Within loop just do pointer reference
Not as clean in terms of data abstraction
CPE: 6.00 (Compiled -O2)
Procedure calls are expensive!
Bounds checking is expensive
Eliminate Unneeded Memory Refs
void combine4(vec_ptr v, int *dest)
{
int i;
int length = vec_length(v);
int *data = get_vec_start(v);
int sum = 0;
for (i = 0; i < length; i++)
sum += data[i];
*dest = sum;
}
Optimization
Don’t need to store in destination until end
Local variable sum held in register
Avoids 1 memory read and 1 memory write per iteration
CPE: 2.00 (Compiled -O2)
Memory references are expensive!
Detecting Unneeded Memory Refs.
Combine3:
    .L18:
        movl (%ecx,%edx,4),%eax
        addl %eax,(%edi)
        incl %edx
        cmpl %esi,%edx
        jl .L18

Combine4:
    .L24:
        addl (%eax,%edx,4),%ecx
        incl %edx
        cmpl %esi,%edx
        jl .L24
Performance
Combine3
5 instructions in 6 clock cycles
addl must read and write memory
Combine4
4 instructions in 2 clock cycles
Optimization Blocker: Memory Aliasing
Aliasing
Two different memory references specify single location
Example
v: [3, 2, 17]
combine3(v, get_vec_start(v)+2) --> v becomes [3, 2, 10]
  (each += reads the partially computed sum back out of v[2])
combine4(v, get_vec_start(v)+2) --> v becomes [3, 2, 22]
  (sum accumulated in a register; v[2] written once at the end)
Observations
Easy to have happen in C
Since allowed to do address arithmetic
Direct access to storage structures
Get in habit of introducing local variables
Accumulating within loops
Your way of telling compiler not to check for aliasing
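A runnable sketch of the two behaviors, using plain arrays in place of the vector ADT (sum3 and sum4 are hypothetical stand-ins for combine3 and combine4):

    #include <stdio.h>

    /* combine3-style: accumulates through *dest */
    static void sum3(int *data, int n, int *dest)
    {
        int i;
        *dest = 0;
        for (i = 0; i < n; i++)
            *dest += data[i];     /* rereads *dest each time */
    }

    /* combine4-style: accumulates in a local variable */
    static void sum4(int *data, int n, int *dest)
    {
        int i;
        int sum = 0;
        for (i = 0; i < n; i++)
            sum += data[i];
        *dest = sum;
    }

    int main(void)
    {
        int a[3] = {3, 2, 17};
        int b[3] = {3, 2, 17};
        sum3(a, 3, &a[2]);            /* dest aliases a[2] */
        printf("sum3: %d\n", a[2]);   /* prints 10 */
        sum4(b, 3, &b[2]);
        printf("sum4: %d\n", b[2]);   /* prints 22 */
        return 0;
    }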
General Forms of Combining
void abstract_combine4(vec_ptr v, data_t *dest)
{
int i;
int length = vec_length(v);
data_t *data = get_vec_start(v);
data_t t = IDENT;
for (i = 0; i < length; i++)
t = t OP data[i];
*dest = t;
}
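Here data_t, OP, and IDENT are compile-time parameters. A plausible instantiation for integer product (the macro style is an assumption, matching the slide's naming):

    typedef int data_t;
    #define IDENT 1      /* identity element for the operation */
    #define OP *         /* combining operation */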
Method            Int +   Int *   FP +    FP *
Abstract -g       42.06   41.86   41.44   160.00
Abstract -O2      31.25   33.25   31.25   143.00
Move vec_length   20.66   21.25   21.15   135.00
data access        6.00    9.00    8.00   117.00
Accum. in temp     2.00    4.00    3.00     5.00

Performance Anomaly
Computing FP product of all elements exceptionally slow
Very large speedup when accumulating in a temporary
Caused by quirk of IA32 floating point
  Memory uses 64-bit format, registers use 80
  Benchmark data caused overflow of 64 bits, but not 80
Machine-Independent Opt. Summary
Code Motion
Compilers are good at this for simple loop/array structures
Don’t do well in presence of procedure calls and memory aliasing
Reduction in Strength
Shift, add instead of multiply or divide (see the sketch below)
compilers are (generally) good at this
Exact trade-offs machine-dependent
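A minimal sketch of both patterns (fragments; the variable names are invented):

    /* Multiply by a power of two becomes a shift */
    x = i * 8;                   /* compiler can emit x = i << 3; */

    /* Multiply by the loop index becomes a running sum,
       as in the code motion example earlier */
    for (i = 0, ni = 0; i < n; i++, ni += n)
        a[ni] = b[i];            /* instead of a[n*i] = b[i] */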
Keep data in registers rather than memory
compilers are not good at this, since concerned with aliasing
Modern CPU Design
[Block diagram: an Instruction Control unit (Fetch Control, Instruction Cache, Instruction Decode, Register File, Retirement Unit) issues operations to the Execution unit, which exchanges data with the Data Cache.]
CPU Capabilities of Pentium III
Multiple Instructions Can Execute in Parallel
1 load
1 store
2 integer (one may be branch)
1 FP Addition
1 FP Multiplication or Division
Some Instructions Take > 1 Cycle, but Can be Pipelined
Instruction Latency Cycles/Issue
Load / Store 3 1
Integer Multiply 4 1
Integer Divide 36 36
Double/Single FP Multiply 5 2
Double/Single FP Add 3 1
Double/Single FP Divide 38 38
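Worked example from the table: FP multiply has latency 5 and issues every 2 cycles, so a single chain of dependent multiplies runs at 5 cycles per element, while independent multiplies (e.g., two parallel accumulators) could in principle sustain one every 2 cycles.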
Instruction Control
[Same block diagram as before: the Instruction Control unit translates IA32 instructions into operations for the execution unit.]

    .L24:                         # Loop:
        imull (%eax,%edx,4),%ecx  # t *= data[i]
        incl %edx                 # i++
        cmpl %esi,%edx            # i:length
        jl .L24                   # if < goto Loop

translates into the operation sequence

    load (%eax,%edx.0,4)  → t.1
    imull t.1, %ecx.0     → %ecx.1
    incl %edx.0           → %edx.1
    cmpl %esi, %edx.1     → cc.1
    jl-taken cc.1
Translation Example #1
    imull (%eax,%edx,4),%ecx

becomes:

    load (%eax,%edx.0,4)  → t.1
    imull t.1, %ecx.0     → %ecx.1
Split into two operations
load reads from memory to generate temporary result t.1
Multiply operation just operates on registers
Operands
Register %eax does not change in loop. Values will be retrieved from register file during decoding
Register %ecx changes on every iteration. Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, …
» Register renaming
» Values passed directly from producer to consumers
Translation Example #2
    incl %edx

becomes:

    incl %edx.0 → %edx.1
Translation Example #3
    cmpl %esi,%edx

becomes:

    cmpl %esi, %edx.1 → cc.1

Condition codes are renamed the same way as registers (cc.1, cc.2, …).
Translation Example #4
    jl .L24

becomes:

    jl-taken cc.1

The instruction control unit predicts the branch taken; this operation checks the condition codes to verify that prediction.
Visualizing Operations
    load (%eax,%edx.0,4)  → t.1
    imull t.1, %ecx.0     → %ecx.1
    incl %edx.0           → %edx.1
    cmpl %esi, %edx.1     → cc.1
    jl-taken cc.1

[Dataflow diagram: the operations placed on a vertical time axis, with arcs from producers to consumers; the load feeds the imull, whose box is the tallest.]

Operations
Vertical position denotes time at which executed
Cannot begin operation until operands available
Height denotes latency
Operands
Arcs shown only for operands that are passed within the execution unit
Visualizing Operations (cont.)
    load (%eax,%edx.0,4)  → t.1
    iaddl t.1, %ecx.0     → %ecx.1
    incl %edx.0           → %edx.1
    cmpl %esi, %edx.1     → cc.1
    jl-taken cc.1

[Dataflow diagram: same shape as before, but the add box is only one cycle tall.]

Operations
Same as before, except that add has latency of 1
3 Iterations of Combining Product
[Pipelined execution diagram: three overlapping iterations (i = 0, 1, 2); the load, incl, cmpl, and jl operations of later iterations overlap with earlier ones, but each imull must wait for the previous product.]

Unlimited Resource Analysis
Assume an operation can start as soon as its operands are available
Operations for multiple iterations overlap in time

Performance
Limiting factor becomes latency of integer multiplier
Gives CPE of 4.0
Iterations of Combining Sum

[Pipelined execution diagram: overlapping iterations of the sum loop; each iteration requires 4 integer ops plus a load.]

[Second diagram, iterations 4-8: with resource constraints, some operations must be delayed even though their operands are available.]

Performance
Only have two integer functional units
Sustained CPE of 2.0, matching the measured value for combine4
Loop Unrolling
void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+1] + data[i+2];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

Optimization
Combine multiple iterations into single loop body
Amortizes loop overhead across multiple iterations
Finish extras at end
Measured CPE = 1.33
Visualizing Unrolled Loop
[Dataflow diagram of the unrolled loop body: one load feeding each of the three adds.]

Loads can pipeline, since they have no data dependencies between them
Executing with Loop Unrolling
[Pipelined execution diagram: iterations 3 and 4 (i = 6, 9) of the unrolled loop; the three loads and three adds of each iteration overlap with the next iteration.]

Predicted Performance
Can complete an iteration in 3 cycles
Should give CPE of 1.0

Measured Performance
CPE of 1.33
One iteration every 4 cycles
Effect of Unrolling
Unrolling Degree   1      2      3      4      8      16
Integer Sum        2.00   1.50   1.33   1.50   1.25   1.06
Integer Product    4.00
FP Sum             3.00
FP Product         5.00

Only the integer sum benefits; the product and FP combinations are bound by functional-unit latency (4.00, 3.00, and 5.00 respectively) and do not improve with unrolling alone.
Parallel Loop Unrolling
void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}

Code Version
Integer product

Optimization
Accumulate in two different products
Can be performed simultaneously
Combine at end
2-way parallelism

Performance
CPE = 2.0
2X performance
Dual Product Computation
Computation
    ((((((1 * x0) * x2) * x4) * x6) * x8) * x10) *
    ((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

[Tree diagram: two independent product chains, one over the even-indexed elements and one over the odd-indexed elements, joined by a final multiply.]

Performance
N elements, D cycles/operation
(N/2+1)*D cycles
~2X performance improvement
Requirements for Parallel Computation
Mathematical
Combining operation must be associative & commutative
OK for integer multiplication
Not strictly true for floating point (see the example after this list)
» OK for most applications
Hardware
Pipelined functional units
Ability to dynamically extract parallelism from code
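A small illustration of the floating-point caveat (the values are chosen to force rounding):

    #include <stdio.h>

    int main(void)
    {
        /* FP addition is not associative: the large value
           absorbs the small addend under one grouping only */
        double a = 1e20, b = -1e20, c = 1.0;
        printf("%g\n", (a + b) + c);   /* prints 1 */
        printf("%g\n", a + (b + c));   /* prints 0 */
        return 0;
    }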
Visualizing Parallel Loop
The two multiplies within the loop no longer have a data dependency, which allows them to pipeline.

[Dataflow diagram: the two imull chains, accumulating into %ecx and %ebx, proceed in parallel.]

    load (%eax,%edx.0,4)   → t.1a
    imull t.1a, %ecx.0     → %ecx.1
    load 4(%eax,%edx.0,4)  → t.1b
    imull t.1b, %ebx.0     → %ebx.1
    iaddl $2,%edx.0        → %edx.1
    cmpl %esi, %edx.1      → cc.1
    jl-taken cc.1
Executing with Parallel Loop
[Pipelined execution diagram: iterations of the parallel loop overlap; the two product chains execute concurrently.]
Summary: Results for Pentium III
Register Spilling Example
Example
8 X 8 integer product
7 local variables share 1 register
See that locals are being stored on the stack
E.g., at -8(%ebp)

    .L165:
        imull (%eax),%ecx
        movl -4(%ebp),%edi
        imull 4(%eax),%edi
        movl %edi,-4(%ebp)
        movl -8(%ebp),%edi
        imull 8(%eax),%edi
        movl %edi,-8(%ebp)
        movl -12(%ebp),%edi
        imull 12(%eax),%edi
        movl %edi,-12(%ebp)
        movl -16(%ebp),%edi
        imull 16(%eax),%edi
        movl %edi,-16(%ebp)
        …
        addl $32,%eax
        addl $8,%edx
        cmpl -32(%ebp),%edx
        jl .L165
Results for Alpha Processor
Method            Int +   Int *   FP +    FP *
Abstract -g       40.14   47.14   52.07   53.71
Abstract -O2      25.08   36.05   37.37   32.02
Move vec_length   19.19   32.18   28.73   32.73
data access        6.26   12.52   13.26   13.01
Accum. in temp     1.76    9.01    8.08    8.01
Unroll 4           1.51    9.01    6.32    6.32
Unroll 16          1.25    9.01    6.33    6.22
4X2                1.19    4.69    4.44    4.45
8X4                1.15    4.12    2.34    2.01
8X8                1.11    4.24    2.36    2.08
Worst : Best       36.2    11.4    22.3    26.7
Results for Pentium 4 Processor
Method            Int +   Int *   FP +    FP *
Abstract -g       35.25   35.34   35.85   38.00
Abstract -O2      26.52   30.26   31.55   32.00
Move vec_length   18.00   25.71   23.36   24.25
data access        3.39   31.56   27.50   28.35
Accum. in temp     2.00   14.00    5.00    7.00
Unroll 4           1.01   14.00    5.00    7.00
Unroll 16          1.00   14.00    5.00    7.00
4X2                1.02    7.00    2.63    3.50
8X4                1.01    3.98    1.82    2.00
8X8                1.63    4.50    2.42    2.31
Worst : Best       35.2     8.9    19.7    19.0
Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0)
Clock runs at 2.0 GHz
Not an improvement over 1.0 GHz P3 for integer *
Avoids FP multiplication anomaly
Machine-Dependent Opt. Summary
Loop Unrolling
Some compilers do this automatically
Generally not as clever as what can achieve by hand
Warning:
Benefits depend heavily on particular machine
Best if performed by compiler
But GCC on IA32/Linux is not very good
Do only for performance-critical parts of code
Important Tools
Observation
Generating assembly code
Lets you see what optimizations compiler can make
Understand capabilities/limitations of particular compiler
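For example (the file name is illustrative):

    gcc -O2 -S combine.c     # writes the generated assembly to combine.s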
Measurement
Accurately compute time taken by code
Most modern machines have built-in cycle counters
Using them to get reliable measurements is tricky
» Chapter 9 of the CS:APP textbook
Profile procedure calling frequencies
Unix tool gprof
Code Profiling Example
Task
Count word frequencies in text document
Produce sorted list of words from most frequent to least
Steps
Convert strings to lowercase
Apply hash function
Read words and insert into hash table
Mostly list operations
Maintain counter for each unique word
Sort results
Data Set
Collected works of Shakespeare
946,596 total words, 26,596 unique
Initial implementation: 9.2 seconds
Shakespeare’s most frequent words
29,801  the
27,529  and
21,029  I
20,957  to
18,514  of
15,370  a
14,010  you
12,936  my
11,722  in
11,519  that
Code Profiling
Augment Executable Program with Timing Functions
Computes (approximate) amount of time spent in each function
Time computation method
Periodically (~ every 10ms) interrupt program
Determine what function is currently executing
Increment its timer by interval (e.g., 10ms)
Also maintains counter for each function indicating number of times called
Using
gcc -O2 -pg prog.c -o prog
./prog
Executes in normal fashion, but also generates file gmon.out
gprof prog
Generates profile information based on gmon.out
Profiling Results
% cumulative self self total
time seconds seconds calls ms/call ms/call name
86.60 8.21 8.21 1 8210.00 8210.00 sort_words
5.80 8.76 0.55 946596 0.00 0.00 lower1
4.75 9.21 0.45 946596 0.00 0.00 find_ele_rec
1.27 9.33 0.12 946596 0.00 0.00 h_add
Call Statistics
Number of calls and cumulative time for each function
Performance Limiter
Using inefficient sorting algorithm
Single call uses 87% of CPU time
Code Optimizations

[Stacked bar chart: CPU seconds (0-10) for successive versions (Initial, Quicksort, Iter First, Iter Last, Big Table, Better Hash, Linear Lower), each bar broken into Sort, List, Lower, Hash, and Rest components.]
Further Optimizations
[The same chart rescaled to 0-2 CPU seconds, showing the versions after the sort fix.]
Iter first: Use iterative function to insert elements into linked list
Causes code to slow down
Iter last: Iterative function, places new entry at end of list
Tend to place most common words at front of list
Big table: Increase number of hash buckets
Better hash: Use more sophisticated hash function
Linear lower: Move strlen out of loop
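A sketch of the linear-lower fix (the function names follow the CS:APP textbook's lower1 example; the bodies are a plausible reconstruction):

    #include <string.h>

    /* Before: strlen is re-evaluated on every iteration; since the
       loop writes to s, the compiler cannot hoist the call, so the
       overall cost is quadratic in the string length */
    void lower1(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

    /* After: move the strlen call out of the loop (code motion) */
    void lower2(char *s)
    {
        int i;
        int len = strlen(s);
        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }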
Profiling Observations
Benefits
Helps identify performance bottlenecks
Especially useful when have complex system with many components
Limitations
Only shows performance for data tested
E.g., linear lower did not show big gain, since words are short
Quadratic inefficiency could remain lurking in code
Timing mechanism fairly crude
Only works for programs that run for > 3 seconds
Role of Programmer
How should I write my programs, given that I have a good, optimizing compiler?
Don’t: Smash Code into Oblivion
Hard to read, maintain, & assure correctness
Do:
Select best algorithm
Write code that’s readable & maintainable
Procedures, recursion, without built-in constant limits
Even though these factors can slow down code
Eliminate optimization blockers
Allows compiler to do its job