1 - Performance

The document discusses computer architecture performance from multiple perspectives including purchasing, design, and metrics. It covers topics such as benchmarks, performance evaluation tools, factors that affect CPU performance like the number of instructions, clock speed, and CPI. Formulas for calculating execution time and speedup are also provided.

Uploaded by

Yesid Soto Cobos

Computer Architecture II Performance

Chapter 1, Hennessy & Patterson


Augusto Salazar
Departamento de Ingeniería de Sistemas
Universidad del Norte
[email protected]

Taken from Northwestern University

Performance Concepts


Performance Perspectives
Purchasing perspective
Given a collection of machines, which has the
- Best performance ?
- Least cost ?
- Best performance / cost ?
Design perspective
Faced with design options, which has the
- Best performance improvement ?
- Least cost ?
- Best performance / cost ?
Both require
basis for comparison
metric for evaluation

Our goal: understand cost & performance implications of architectural choices

Two Notions of Performance


Plane        DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200

Which has higher performance?


Execution time (response time, latency, ...)
Time to do a task
Throughput (bandwidth, ...)
Tasks per unit of time
Response time and throughput are often in opposition

Definitions
Performance is typically in units-per-second
bigger is better
If we are primarily concerned with response time
$$\text{performance} = \frac{1}{\text{execution\_time}}$$

"X is n times faster than Y" means

$$n = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$

Example
Time of Concorde vs. Boeing 747?
Concorde is 1350 mph / 610 mph
= 2.2 times faster
= 6.5 hours / 3 hours
Throughput of Concorde vs. Boeing 747?
Concorde is 178,200 pmph / 286,700 pmph = 0.62 times as fast
Boeing is 286,700 pmph / 178,200 pmph = 1.6 times faster

Boeing is 1.6 times (60%) faster in terms of throughput
Concorde is 2.2 times (120%) faster in terms of flying time
We will focus primarily on execution time for a single job
Lots of instructions in a program => instruction throughput is important!
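The two comparisons above can be checked with a few lines of Python. This is only a sketch: the figures are the ones from the plane table, while the dictionary layout and the helper names (`throughput_pmph`, `times_faster`) are assumptions made for illustration.

```python
# Figures from the plane table; "pmph" means passenger-miles per hour.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane):
    """Throughput = speed x passengers carried."""
    return plane["mph"] * plane["passengers"]


def times_faster(slow_time, fast_time):
    """'X is n times faster than Y': n = ExecutionTime_Y / ExecutionTime_X."""
    return slow_time / fast_time


# Latency view: the Concorde finishes the trip sooner.
latency_ratio = times_faster(planes["Boeing 747"]["hours"],
                             planes["Concorde"]["hours"])

# Throughput view: the Boeing 747 moves more passenger-miles per hour.
throughput_ratio = (throughput_pmph(planes["Boeing 747"])
                    / throughput_pmph(planes["Concorde"]))

print(f"Concorde is {latency_ratio:.1f}x faster in flying time")
print(f"Boeing 747 is {throughput_ratio:.1f}x faster in throughput")
```

The same machine can win on one metric and lose on the other, which is why the metric has to be chosen before the comparison is made.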

Benchmarks


Evaluation Tools
Benchmarks, traces and mixes
Macrobenchmarks and suites
Microbenchmarks
Traces
Workloads
Simulation at many levels
ISA, microarchitecture, RTL, gate circuit
Trade fidelity for simulation rate (Levels of abstraction)
Other metrics
Area, clock frequency, power, cost, ...
Analysis
Queuing theory, back-of-the-envelope
Rules of thumb, basic laws and principles

Benchmarks
Microbenchmarks
Measure one performance dimension
- Cache bandwidth
- Memory bandwidth
- Procedure call overhead
- FP performance
Insight into the underlying performance factors
Not a good predictor of application performance
Macrobenchmarks
Application execution time
- Measures overall performance, but on just one application
- Need application suite


Why Do Benchmarks?
How we evaluate differences
Different systems
Changes to a single system
Provide a target
Benchmarks should represent a large class of important programs
Improving benchmark performance should help many programs
For better or worse, benchmarks shape a field
Good ones accelerate progress
good target for development
Bad benchmarks hurt progress
help real programs vs. sell machines/papers?
Inventions that help real programs don't help the benchmark

Popular Benchmark Suites


Desktop
SPEC CPU2000 - CPU intensive, integer & floating-point applications
SPECviewperf, SPECapc - Graphics benchmarks
SysMark, Winstone, Winbench
Embedded
EEMBC - Collection of kernels from 6 application areas
Dhrystone - Old synthetic benchmark
Servers
SPECweb, SPECfs
TPC-C - Transaction processing system
TPC-H, TPC-R - Decision support system
TPC-W - Transactional web benchmark
Parallel Computers
SPLASH - Scientific applications & kernels
Most markets have specific benchmarks
for design and marketing.

SPEC CINT2000


tpC


Basis of Evaluation
Actual Target Workload
Pros: representative
Cons: very specific; non-portable; difficult to run or measure; hard to identify cause

Full Application Benchmarks
Pros: portable; widely used; improvements useful in reality
Cons: less representative

Small Kernel Benchmarks
Pros: easy to run, early in design cycle
Cons: easy to fool

Microbenchmarks
Pros: identify peak capability and potential bottlenecks
Cons: peak may be a long way from application performance

Programs to Evaluate Processor Performance


(Toy) Benchmarks
10-100 lines
e.g., sieve, puzzle, quicksort
Synthetic Benchmarks
attempt to match average frequencies of real workloads
e.g., Whetstone, Dhrystone
Kernels
Time-critical excerpts

Now it's your turn

Download at least two benchmark apps on your cellphone and run the tests.
Compare your results with those obtained by the members of your group and try to make sense of the results based on the HW and SW specifications of each phone.
Homework: Write a report (in English, at least two pages) and deliver it by next class.

Processor Design Metrics


Performance
In this exercise, you should evaluate the difference in performance between two CPU architectures: CISC (Complex Instruction Set Computing) and RISC (Reduced Instruction Set Computing). In general, CISC instructions are more complex than RISC instructions, so a CISC CPU requires fewer instructions to perform the same task.
However, a CISC instruction, since it is more complex, takes longer to complete than a RISC instruction. Assume that a certain task requires P CISC instructions or 2P RISC instructions, and that a CISC instruction takes 8T ns to complete, while a RISC instruction takes 2T ns. Under this assumption,

Which has better performance?
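One way to approach the exercise is to compute the total task time for each architecture as instruction count times time per instruction. The sketch below assumes the reading "P CISC instructions at 8T ns each versus 2P RISC instructions at 2T ns each"; the concrete values chosen for P and T are arbitrary placeholders, since only the ratio matters.

```python
def task_time_ns(instruction_count, ns_per_instruction):
    """Total time = number of instructions x time per instruction."""
    return instruction_count * ns_per_instruction


# Arbitrary placeholder values (assumptions); the ratio does not depend on them.
P = 1_000_000   # instructions the CISC machine needs for the task
T = 1.0         # base time unit in ns

cisc_ns = task_time_ns(P, 8 * T)        # P instructions at 8T ns each = 8PT
risc_ns = task_time_ns(2 * P, 2 * T)    # 2P instructions at 2T ns each = 4PT

print(f"CISC: {cisc_ns:.0f} ns, RISC: {risc_ns:.0f} ns")
print(f"RISC is {cisc_ns / risc_ns:.1f}x faster under these assumptions")
```

Under these assumptions the RISC machine executes twice as many instructions but still finishes in half the time.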

Performance
Sometimes software optimization may dramatically improve the
performance of a computer system.
Assume that the CPU can execute a multiplication in 10 ns, and
execute a subtraction in 1 ns.
How long will it take the CPU to calculate the result of
d = a x b - a x c?
Could you rewrite the expression so that it takes less time?
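A possible answer, sketched in Python: count the operations in each form of the expression and multiply by the stated latencies. The sketch assumes operations execute strictly one after another with no overlap, and the helper name `cost_ns` is hypothetical.

```python
MUL_NS = 10  # latency of one multiplication (from the slide)
SUB_NS = 1   # latency of one subtraction (from the slide)


def cost_ns(multiplications, subtractions):
    """Time for a purely sequential mix of operations (no overlap assumed)."""
    return multiplications * MUL_NS + subtractions * SUB_NS


# d = a*b - a*c uses two multiplications and one subtraction.
original_ns = cost_ns(multiplications=2, subtractions=1)

# Factoring out a gives d = a*(b - c): one multiplication, one subtraction.
factored_ns = cost_ns(multiplications=1, subtractions=1)

# Both forms compute the same value (checked here on sample integers).
a, b, c = 7, 5, 3
assert a * b - a * c == a * (b - c)

print(f"a*b - a*c : {original_ns} ns")
print(f"a*(b - c) : {factored_ns} ns")
```

The factored form removes the expensive operation rather than the cheap one, which is where the saving comes from.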

Performance measurement and reporting


What does "A is faster than B" mean?
A desktop user could say that a program runs in less time
A server user could say that more tasks are completed per hour

What is the user interested in improving?
The desktop user is interested in reducing response time (runtime)
The data center user is interested in increasing throughput, i.e., the number of completed tasks per unit time

Performance measurement and reporting


Performance and runtime
"X is faster than Y" means that the execution time (response time) is lower on X than on Y
"X is n times faster than Y" means:

$$n = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X}$$

Since performance is the reciprocal of execution time, the following relationship holds:

$$n = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = \frac{1 / \text{Performance}_Y}{1 / \text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$

Formula for runtime


A program consists of a number of instructions to be executed (I)

Each instruction takes an average number of clock cycles to complete (CPI)
Measured in cycles / instruction

The CPU has a fixed clock cycle time (C)
C = 1 / clock speed
Measured in seconds / cycle

CPU Execution Time


Runtime is the product of these 3 parameters

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

$$T = I \times CPI \times C$$

T: execution time per program, in seconds
I: number of instructions executed
CPI: average CPI for the program
C: CPU clock cycle time

CPU Run Time

The following are the execution parameters of a program running on a computer:
Number of executed instructions: 10,000,000
Average CPI for the program: 2.5 cycles / instruction
CPU clock speed: 200 MHz (clock cycle: 5 x 10^-9 s)
What is the runtime for this program?

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

CPU time = Instructions x CPI x Clock cycle
= 10,000,000 x 2.5 x 1 / clock speed
= 10,000,000 x 2.5 x 5 x 10^-9
= 0.125 seconds

T = I x CPI x C
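The calculation can be reproduced with a small helper. This is a minimal sketch: the function name `cpu_time_seconds` and its parameter names are assumptions for illustration, and the clock cycle time C is derived from the clock speed as C = 1 / clock speed.

```python
def cpu_time_seconds(instructions, cpi, clock_hz):
    """T = I x CPI x C, with C = 1 / clock speed."""
    clock_cycle_s = 1.0 / clock_hz
    return instructions * cpi * clock_cycle_s


# Parameters of the example: 10,000,000 instructions, CPI 2.5, 200 MHz clock.
t = cpu_time_seconds(instructions=10_000_000, cpi=2.5, clock_hz=200e6)
print(f"CPU time: {t:.3f} s")
```

Passing the clock speed rather than the cycle time keeps the inputs in the units a datasheet usually gives.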

CPU Run Time

CPU time = Instructions x CPI x Clock cycle

T = I x CPI x C

Number of instructions (I)
Depends on:
Program used
Compiler
ISA

CPI (average CPI)
Depends on:
Program used
Compiler
ISA
CPU organization

Clock cycle (C)
Depends on:
CPU organization
Technology (VLSI)

Factors that affect CPU performance


$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                                       Instructions (I)   CPI   Clock cycle (C)
Program                                X                  X
Compiler                               X                  X
Instruction Set Architecture (ISA)     X                  X
Organization (CPU design)                                 X     X
Technology (VLSI)                                               X

Performance Example
Returning to the previous example, a program is run with the following parameters:
Number of executed instructions: 10,000,000
Average CPI for the program: 2.5 cycles / instruction
CPU clock speed: 200 MHz

The same program is run again with these changes:
A new compiler is used:
Number of executed instructions: 9,500,000
Average CPI for the program: 3.0
A faster CPU is used. Clock speed: 300 MHz

Performance Example
What is the speedup?

$$\text{Speedup} = \frac{\text{ExecutionTime}_{old}}{\text{ExecutionTime}_{new}} = \frac{I_{old} \times CPI_{old} \times C_{old}}{I_{new} \times CPI_{new} \times C_{new}}$$

$$\text{Speedup} = \frac{10{,}000{,}000 \times 2.5 \times 5 \times 10^{-9}}{9{,}500{,}000 \times 3 \times 3.33 \times 10^{-9}} = \frac{0.125}{0.095}$$

= 1.32, or 32% faster after the changes
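The speedup can be recomputed from the two parameter sets. The sketch below is illustrative only; the helper names are assumptions, and each configuration is passed as (instruction count, CPI, clock speed in Hz).

```python
def cpu_time_seconds(instructions, cpi, clock_hz):
    """T = I x CPI x C, with C = 1 / clock speed."""
    return instructions * cpi / clock_hz


def speedup(old_time, new_time):
    """Speedup = old execution time / new execution time."""
    return old_time / new_time


old_t = cpu_time_seconds(10_000_000, 2.5, 200e6)  # original compiler and CPU
new_t = cpu_time_seconds(9_500_000, 3.0, 300e6)   # new compiler, faster CPU

s = speedup(old_t, new_t)
print(f"old: {old_t:.3f} s, new: {new_t:.3f} s, speedup: {s:.2f}")
```

Note that the new compiler raises the CPI from 2.5 to 3.0; the overall gain comes from the lower instruction count and the faster clock outweighing that increase.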

Types of instructions and CPI


Given:
A program with n types (classes) of instructions
Executed on a CPU with the following characteristics:

i = 1, 2, ..., n
C_i = Number of instructions of type i
CPI_i = Cycles per instruction for type i

Then:

$$CPI = \frac{\text{CPU clock cycles}}{\text{Number of instructions } (I)}$$

Where:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} CPI_i \times C_i \qquad I = \sum_{i=1}^{n} C_i$$

Types of instructions and CPI


For a CPU design, an instruction set has the following 3 classes:

Class   CPI
A       1
B       2
C       3

Two sequences of code have the following number of instructions:

Code Sequence   Number of instructions per class
                A   B   C
1               2   1   2
2               4   1   1

Types of instructions and CPI


CPU cycles for Sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI for Sequence 1 = Clock cycles / Number of instructions = 10 / 5 = 2

CPU cycles for Sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for Sequence 2 = 9 / 6 = 1.5

$$\text{CPU clock cycles} = \sum_{i=1}^{n} CPI_i \times C_i$$

CPI = CPU cycles / I
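The per-class calculation above can be written as a short sketch. The class CPIs and instruction counts are the ones from the example; the dictionary layout and function names are assumptions for illustration.

```python
# CPI of each instruction class, from the example.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts):
    """CPU clock cycles = sum over classes of CPI_i x C_i."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts):
    """CPI = CPU clock cycles / I, where I is the total instruction count."""
    return total_cycles(counts) / sum(counts.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
    print(f"{name}: {total_cycles(seq)} cycles, CPI = {average_cpi(seq)}")
```

Sequence 2 executes more instructions (6 versus 5) yet needs fewer cycles, so instruction count alone does not determine which sequence is faster.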

Instruction Frequency and CPI

Given a program with n types (classes) of instructions with the following characteristics:
i = 1, 2, ..., n
C_i = Number of instructions of type i
CPI_i = Average number of cycles per instruction for type i
F_i = Frequency or fraction of instructions of type i = C_i / I

Then:

$$CPI = \sum_{i=1}^{n} CPI_i \times F_i$$

Fraction of the total execution time for instruction type i:

$$\frac{CPI_i \times F_i}{CPI}$$
Frequency: Example with RISC

Base machine (Reg / Reg)

Op       Freq (F_i)   CPI_i   CPI_i x F_i   % Time
ALU      50%          1       0.5           23% = 0.5/2.2
Load     20%          5       1.0           45% = 1.0/2.2
Store    10%          3       0.3           14% = 0.3/2.2
Branch   20%          2       0.4           18% = 0.4/2.2
                              Sum = 2.2

$$CPI = \sum_{i=1}^{n} CPI_i \times F_i \qquad \text{\% Time}_i = \frac{CPI_i \times F_i}{CPI}$$

CPI = .5 x 1 + .2 x 5 + .1 x 3 + .2 x 2
= .5 + 1 + .3 + .4
= 2.2
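The weighted-CPI calculation can be sketched as follows. The frequencies and per-class CPIs are taken from the CPI sum above (.5 x 1 + .2 x 5 + .1 x 3 + .2 x 2); the variable names are assumptions for illustration.

```python
# Instruction mix of the base machine: op -> (frequency F_i, CPI_i).
mix = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}

# CPI = sum of CPI_i x F_i over all instruction types.
cpi = sum(freq * cycles for freq, cycles in mix.values())

# Fraction of execution time spent in each type = CPI_i x F_i / CPI.
time_share = {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

print(f"CPI = {cpi:.1f}")
for op, share in time_share.items():
    print(f"{op:<7}{share:.0%}")
```

Loads are only 20% of the instructions but account for the largest share of execution time, which makes them the first candidate for optimization.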

Performance metrics
System levels (top to bottom) and the metrics used at each:
Application, programming language, compiler
- Execution time: workload, SPEC, etc.
ISA
- (Millions of) instructions per second: MIPS
- (Millions of) floating-point operations per second: MFLOPS
Datapath, control
- Megabytes per second
Function units, transistors, wires, pins
- Cycles per second (clock rate)

Amdahl's Law: Make the Common Case Fast


Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected. Then:

$$\text{ExTime}(\text{with } E) = \left((1 - F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)$$

$$\text{Speedup}(\text{with } E) = \frac{\text{ExTime}(\text{without } E)}{\left((1 - F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)} = \frac{1}{(1 - F) + \frac{F}{S}}$$

Performance improvement is limited by how much the improved feature is used
Invest resources where time is spent.
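Amdahl's Law for a single enhancement fits in one function. The sketch below is illustrative; the function name and the sample values of F and S are assumptions, not figures from the slides.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the task is sped up by `factor`.

    Speedup = 1 / ((1 - F) + F / S)
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Hypothetical case: half of the task is made twice as fast.
print(f"F=0.5, S=2    -> {amdahl_speedup(0.5, 2):.2f}")

# Even a huge factor cannot beat 1 / (1 - F): the unimproved half remains.
print(f"F=0.5, S=1e9  -> {amdahl_speedup(0.5, 1e9):.2f}")
```

The second call shows the limit stated above: with F = 0.5 the overall speedup approaches 2 no matter how large S becomes.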

Summary
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Time is the measure of computer performance!
Good products are created when we have:
Good benchmarks
Good ways to summarize performance
Without good benchmarks and summaries, the choice is between improving the product for real programs and improving the product to get more sales; sales almost always wins
Remember Amdahl's Law: speedup is limited by the unimproved part of the program

Amdahl's Law with multiple improvements

The following improvements are proposed, each with the fraction of execution time it affects:
Speedup 1 = S1 = 10     Percentage 1 = F1 = 20%
Speedup 2 = S2 = 15     Percentage 2 = F2 = 15%
Speedup 3 = S3 = 30     Percentage 3 = F3 = 10%

All the improvements are used in the new design, but each affects a different part of the code

$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$

What is the resulting speedup?

Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30]
= 1 / [.55 + .0333]
= 1 / .5833
= 1.71
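The multi-improvement form can be checked with a short sketch. It assumes, as the slide states, that each improvement affects a different (non-overlapping) part of the code; the function name and the list-of-pairs input format are assumptions for illustration.

```python
def amdahl_speedup_multi(improvements):
    """Speedup = 1 / ((1 - sum(F_i)) + sum(F_i / S_i)).

    `improvements` is a list of (F_i, S_i) pairs whose fractions
    must not overlap, so their sum cannot exceed 1.
    """
    total_fraction = sum(f for f, _ in improvements)
    if total_fraction > 1.0:
        raise ValueError("fractions must sum to at most 1")
    unaffected = 1.0 - total_fraction
    improved = sum(f / s for f, s in improvements)
    return 1.0 / (unaffected + improved)


# (F_i, S_i) pairs from the example: 20% x10, 15% x15, 10% x30.
result = amdahl_speedup_multi([(0.20, 10), (0.15, 15), (0.10, 30)])
print(f"Speedup = {result:.2f}")
```

Although the individual factors are 10, 15, and 30, the unaffected 55% of the execution time dominates the denominator and keeps the overall speedup below 2.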

A graphical view
Before:
Execution time without the improvements: 1

Unaffected fraction: .55     F1 = .2     F2 = .15     F3 = .1

S1 = 10, S2 = 15, S3 = 30: F1 is divided by 10, F2 by 15, F3 by 30; the unaffected fraction is unchanged

After:
Unaffected fraction: .55     F1 / 10 = .02     F2 / 15 = .01     F3 / 30 = .00333

Execution time with the improvements: .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
