1 - Performance
Performance Concepts
Performance Perspectives
Purchasing perspective
Given a collection of machines, which has the
- Best performance ?
- Least cost ?
- Best performance / cost ?
Design perspective
Faced with design options, which has the
- Best performance improvement ?
- Least cost ?
- Best performance / cost ?
Both require
basis for comparison
metric for evaluation
Plane        DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200
Definitions
Performance is typically in units-per-second
bigger is better
If we are primarily concerned with response time
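$\text{performance} = \dfrac{1}{\text{execution\_time}}$
"X is n times faster than Y" means
$\dfrac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = \dfrac{\text{Performance}_X}{\text{Performance}_Y} = n$
Example
Time of Concorde vs. Boeing 747?
Concorde is 6.5 hours / 3 hours = 2.2 times faster
(equivalently, 1350 mph / 610 mph = 2.2)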
Throughput of Concorde vs. Boeing 747 ?
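The two comparisons can be checked numerically. The sketch below is illustrative only: the dictionary layout and the helper names `times_faster` and `throughput_pmph` are assumptions, and the figures are the ones from the DC-to-Paris table.

```python
# Figures from the DC-to-Paris table; all names here are illustrative.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def times_faster(perf_x, perf_y):
    """Return n such that X is n times faster than Y (bigger performance is better)."""
    return perf_x / perf_y


def throughput_pmph(plane):
    """Passenger-miles per hour = passengers x speed."""
    return plane["passengers"] * plane["mph"]


b747, concorde = planes["Boeing 747"], planes["Concorde"]

# Response time view: performance = 1 / execution_time
speed_ratio = times_faster(1 / concorde["hours"], 1 / b747["hours"])

# Throughput view: performance = passenger-miles per hour
throughput_ratio = times_faster(throughput_pmph(b747), throughput_pmph(concorde))

print(f"Concorde is {speed_ratio:.1f} times faster in flight time")
print(f"Boeing 747 has {throughput_ratio:.1f} times the throughput")
```

Which plane "wins" depends on the metric: the Concorde has the shorter response time, while the Boeing 747 delivers more passenger-miles per hour.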
Benchmarks
Evaluation Tools
Benchmarks, traces and mixes
Macrobenchmarks and suites
Microbenchmarks
Traces
Workloads
Simulation at many levels
ISA, microarchitecture, RTL, gate circuit
Trade fidelity for simulation rate (Levels of abstraction)
Other metrics
Area, clock frequency, power, cost,
Analysis
Queuing theory, back-of-the-envelope
Rules of thumb, basic laws and principles
Taken from Northwestern University
Benchmarks
Microbenchmarks
Measure one performance dimension
- Cache bandwidth
- Memory bandwidth
- Procedure call overhead
- FP performance
Insight into the underlying performance factors
Not a good predictor of application performance
Macrobenchmarks
Application execution time
- Measures overall performance, but on just one application
- Need application suite
Why Do Benchmarks?
How we evaluate differences
Different systems
Changes to a single system
Provide a target
Benchmarks should represent a large class of important programs
Improving benchmark performance should help many programs
For better or worse, benchmarks shape a field
Good ones accelerate progress
good target for development
Bad benchmarks hurt progress
Help real programs vs. sell machines/papers?
Inventions that help real programs don't help the benchmark
SPEC CINT2000
tpC
Basis of Evaluation
Actual target workload
- Pros: representative
- Cons: very specific, non-portable, difficult to run or measure, hard to identify cause
Full application benchmarks
- Pros: portable, widely used, improvements useful in reality
- Cons: less representative
Small kernel benchmarks
- Pros: easy to run, early in design cycle
- Cons: easy to fool
Microbenchmarks
- Pros: identify peak capability and potential bottlenecks
Performance
In this exercise, you should evaluate the difference in performance
between two CPU architectures: CISC (Complex Instruction Set
Computing) and RISC (Reduced Instruction Set Computing). Overall,
CISC instructions are more complex than RISC instructions, so a CISC
CPU requires fewer instructions to perform the same task.
However, a CISC instruction, since it is more complex, takes longer to
complete than a RISC instruction. Assume that a certain task
requires P CISC instructions or 2P RISC instructions, and that a CISC
instruction takes 8T ns to complete, while a RISC instruction takes 2T ns.
Under this assumption, which architecture has the better performance?
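One way to work the exercise is sketched below. It reads the problem as P CISC instructions at 8T ns each versus 2P RISC instructions at 2T ns each, and assumes instructions execute one after another with no overlap. P and T are symbolic in the exercise, so the values used here are placeholders that cancel out of the ratio.

```python
from fractions import Fraction

# Placeholder values: P and T are symbolic, any positive numbers give the same ratio.
P, T = 1, 1

cisc_time = P * 8 * T      # P instructions x 8T ns each  = 8PT ns
risc_time = 2 * P * 2 * T  # 2P instructions x 2T ns each = 4PT ns

# "X is n times faster than Y" means n = time_Y / time_X
risc_speedup = Fraction(cisc_time, risc_time)

print(f"CISC: {cisc_time}PT ns, RISC: {risc_time}PT ns")
print(f"RISC is {risc_speedup} times faster under these assumptions")
```

Under this reading the RISC machine finishes in half the time, even though it executes twice as many instructions.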
Performance
Sometimes software optimization may dramatically improve the
performance of a computer system.
Assume that the CPU can execute a multiplication in 10 ns, and
execute a subtraction in 1 ns.
How long will it take the CPU to calculate the result of
d = a x b - a x c?
Could you optimize the equation to take less time?
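A possible answer, sketched under the assumption that the operations run sequentially and that only multiplications and subtractions cost time: factoring the expression as d = a x (b - c) removes one multiplication. The constant and variable names are illustrative.

```python
MUL_NS = 10  # one multiplication, from the exercise
SUB_NS = 1   # one subtraction, from the exercise

# d = a*b - a*c : two multiplications and one subtraction
naive_ns = 2 * MUL_NS + 1 * SUB_NS

# d = a*(b - c) : one subtraction and one multiplication
factored_ns = 1 * SUB_NS + 1 * MUL_NS

speedup = naive_ns / factored_ns

# Sanity check that both forms compute the same value (sample integers).
a, b, c = 7, 5, 3
same_result = (a * b - a * c) == (a * (b - c))

print(f"original: {naive_ns} ns, factored: {factored_ns} ns, speedup: {speedup:.2f}")
```

The factored form takes 11 ns instead of 21 ns, a speedup of roughly 1.9 from a purely algebraic rewrite.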
$\dfrac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = n$
Since performance is the reciprocal of execution time, the following relationship
holds:
$n = \dfrac{\text{Performance}_X}{\text{Performance}_Y} = \dfrac{1 / \text{ExecutionTime}_X}{1 / \text{ExecutionTime}_Y} = \dfrac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X}$
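$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
$T = I \times CPI \times C$
T: execution time per program, in seconds
I: number of instructions executed
CPI: average CPI per program
C: CPU clock cycle
$\text{CPU time} = 10{,}000{,}000 \times 2.5 \times 5 \times 10^{-9} = 0.125$ seconds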
T = I x CPI x C
Number of instructions I depends on:
- Program used
- Compiler
- ISA
CPI (average CPI) depends on:
- Program used
- Compiler
- ISA
- CPU organization
Clock cycle C depends on:
- CPU organization
- Technology (VLSI)
                                     Number of instructions I   CPI   Clock cycle C
Program                              X                          X
Compiler                             X                          X
Instruction Set Architecture (ISA)   X                          X
Organization (CPU design)                                       X     X
Technology (VLSI)                                                     X
Performance Example
Returning to the previous example: a program is run with the following
parameters:
Number of executed instructions: 10,000,000
Average CPI for the program: 2.5 cycles/instruction
CPU clock speed: 200 MHz (clock cycle = 1 / (200 x 10^6) = 5 x 10^-9 seconds)
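The calculation for these parameters can be sketched as follows. The function name `cpu_time` and its argument names are assumptions made for illustration. The formula is T = I x CPI x C, with C taken as the reciprocal of the clock rate.

```python
def cpu_time(instruction_count, cpi, clock_hz):
    """CPU time in seconds: T = I x CPI x C, where C = 1 / clock rate."""
    cycle_time = 1.0 / clock_hz
    return instruction_count * cpi * cycle_time


# Parameters from the example: 10,000,000 instructions, CPI 2.5, 200 MHz clock.
t = cpu_time(instruction_count=10_000_000, cpi=2.5, clock_hz=200e6)
print(f"CPU time = {t:.3f} seconds")
```

Halving any one of the three factors (instruction count, CPI, or cycle time) halves the CPU time, which is why all three matter when comparing designs.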
Performance Example
What is the increase (Speedup)?
$\text{Speedup} = \dfrac{\text{Old execution time}}{\text{New execution time}}$
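Given n instruction classes, i = 1, 2, ..., n:
C_i = number of executed instructions of class i
CPI_i = cycles per instruction for class i
Then:
$CPI = \dfrac{\text{CPU clock cycles}}{I}$
Where:
$\text{CPU clock cycles} = \sum_{i=1}^{n} CPI_i \times C_i$ and $I = \sum_{i=1}^{n} C_i$
Example: for a given CPU design with three instruction classes
Class i   1   2   3
CPI_i     1   2   3
First instruction sequence (counts 2, 1, 2):
CPU clock cycles = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI = clock cycles / number of instructions = 10 / 5 = 2
Second instruction sequence (counts 4, 1, 1):
CPU clock cycles = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI = 9 / 6 = 1.5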
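The two calculations above (10 cycles over 5 instructions, and 9 cycles over 6 instructions) can be reproduced with a short sketch. The class CPIs of 1, 2 and 3 and the instruction counts are read off the arithmetic shown. The names `CLASS_CPI`, `total_cycles` and `average_cpi` are illustrative assumptions.

```python
# Cycles per instruction for each instruction class (classes 1, 2, 3).
CLASS_CPI = {1: 1, 2: 2, 3: 3}


def total_cycles(counts):
    """CPU clock cycles = sum over classes of CPI_i x C_i."""
    return sum(CLASS_CPI[cls] * n for cls, n in counts.items())


def average_cpi(counts):
    """CPI = CPU clock cycles / instruction count."""
    return total_cycles(counts) / sum(counts.values())


seq1 = {1: 2, 2: 1, 3: 2}  # instruction counts per class, first sequence
seq2 = {1: 4, 2: 1, 3: 1}  # instruction counts per class, second sequence

for name, seq in (("first", seq1), ("second", seq2)):
    print(f"{name}: {total_cycles(seq)} cycles, CPI = {average_cpi(seq)}")
```

The second sequence executes more instructions (6 versus 5) yet needs fewer cycles, so instruction count alone does not determine which sequence is faster.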
Then, with F_i = C_i / I the fraction of executed instructions in class i:
$CPI = \sum_{i=1}^{n} CPI_i \times F_i$

Op       Freq (F_i)   CPI_i   CPI_i x F_i   % Time
ALU      50%          1       0.5           23% = 0.5/2.2
Load     20%          5       1.0           45% = 1.0/2.2
Store    10%          3       0.3           14% = 0.3/2.2
Branch   20%          2       0.4           18% = 0.4/2.2
                              Sum = 2.2

CPI = .5 x 1 + .2 x 5 + .1 x 3 + .2 x 2 = .5 + 1 + .3 + .4 = 2.2
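The instruction-mix calculation can be sketched in a few lines. The frequencies and per-class CPIs are those of the ALU/Load/Store/Branch example. The variable names are illustrative assumptions.

```python
# Instruction mix: operation -> (frequency F_i, cycles per instruction CPI_i)
mix = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}

# Average CPI is the frequency-weighted sum of the per-class CPIs.
cpi = sum(freq * class_cpi for freq, class_cpi in mix.values())

# Share of execution time spent in each class: (CPI_i x F_i) / CPI
time_share = {op: freq * class_cpi / cpi for op, (freq, class_cpi) in mix.items()}

print(f"CPI = {cpi:.1f}")
for op, share in time_share.items():
    print(f"{op:<7}{share:.0%} of execution time")
```

Loads are only 20% of the instructions but account for about 45% of the time, which is the kind of imbalance this table is meant to expose.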
Performance metrics
(Measures)
Execution time: workload, SPEC, etc.
Application
Programming language
Compiler
ISA
Datapath
Control
Function units
Transistors, wires, pins
$\text{Speedup}(E) = \dfrac{\text{Performance w/ E}}{\text{Performance w/o E}}$
Summary
$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
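Speedup1 = S1 = 10
Percentage 1 = F1 = 20%
Speedup2 = S2 = 15
Percentage 2 = F2 = 15%
Speedup3 = S3 = 30
Percentage 3 = F3 = 10%
All the improvements are used in the new design, but each affects a different part of
the code.
$\text{Speedup} = \dfrac{1}{\left(1 - \sum_i F_i\right) + \sum_i \dfrac{F_i}{S_i}}$
$\text{Speedup} = \dfrac{1}{(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30} = \dfrac{1}{.55 + .0333} = \dfrac{1}{.5833} = 1.71$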
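The combined speedup can be computed with a small helper. This is a sketch assuming, as stated, that each improvement affects a different (non-overlapping) part of the execution time. The function name `combined_speedup` and its input format are illustrative.

```python
def combined_speedup(enhancements):
    """Overall speedup for (fraction, speedup) pairs that affect disjoint parts.

    Speedup = 1 / ((1 - sum F_i) + sum F_i / S_i)
    """
    affected = sum(fraction for fraction, _ in enhancements)
    if affected > 1.0:
        raise ValueError("fractions of disjoint parts cannot exceed 1")
    unaffected = 1.0 - affected
    improved = sum(fraction / speedup for fraction, speedup in enhancements)
    return 1.0 / (unaffected + improved)


# F1 = 20% with S1 = 10, F2 = 15% with S2 = 15, F3 = 10% with S3 = 30
overall = combined_speedup([(0.20, 10), (0.15, 15), (0.10, 30)])
print(f"Overall speedup = {overall:.2f}")
```

Even with per-part speedups of 10 to 30, the overall gain stays below 2 because 55% of the execution time is left untouched.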
A graphical view
Before:
Execution time without the improvements: 1
Part of execution time   Unchanged   F1 = .2    F2 = .15   F3 = .1
Speedup                  none        S1 = 10    S2 = 15    S3 = 30
Time after improvement   unchanged   .2 / 10    .15 / 15   .1 / 30