
CSE 213

Computer Architecture

Lecture 2: Computer Performance

Military Institute of Science and Technology
Performance

• Performance is the key to understanding the underlying motivation for the hardware and its organization
• Why is some hardware better than others for different programs?
• What factors of system performance are hardware related? (e.g., do we need a new machine, or a new operating system?)
Performance

What do we measure?
Define performance….

• How much faster is the Concorde compared to the 747?


• How much bigger is the Boeing 747 than the Douglas DC-8?
Defining Performance

Computer Performance: TIME, TIME, TIME!!!

• Response Time (elapsed time, latency): how long does it take to complete a task, start to finish? (Individual user concerns…)
  • E.g., how long must I wait for the database query?

• The individual user is more interested in response time: as the user of a smartphone/laptop, the one that responds faster is better!

• Response time is the total time required by the computer to complete a task, including:
  disk access, memory access, I/O activities, OS overheads, CPU execution time, etc.
Computer Performance: TIME, TIME, TIME!!!

• Throughput: total work done per unit time (per hour, day, etc.) (Systems manager concerns…)
  • How many jobs can the machine run at once?
  • What is the average execution rate?
  • How much work is getting done?
Response Time and Throughput
• If we upgrade a machine with a new processor, what do we increase?
Relative Performance

• Performance is defined as the reciprocal of execution time: Performance = 1 / Execution time
• "X is n times faster than Y" means Performance_X / Performance_Y = Execution time_Y / Execution time_X = n
Execution Time
• Elapsed Time
• counts everything (disk and memory accesses, waiting for I/O, running other
programs, etc.) from start to finish
• a useful number, but often not good for comparison purposes
Elapsed time = CPU time + wait time (I/O, other programs, etc.)

• CPU time
• doesn't count waiting for I/O or time spent running other programs
• can be divided into user CPU time and system CPU time (OS calls)
CPU time = user CPU time + system CPU time

• Our focus:
• user CPU time (CPU execution time or, simply, execution time)
• time spent executing the lines of code that are in our program
• For brevity, user CPU time is referred to simply as CPU time in the rest of these notes.
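
To make the elapsed-time vs. CPU-time distinction concrete, here is a minimal Python sketch; the timer calls are standard library, and the summation is just a placeholder workload of my own.

```python
import time

t_wall0 = time.perf_counter()      # wall-clock (elapsed) timer
t_cpu0 = time.process_time()       # user + system CPU time of this process

sum(range(10_000_000))             # placeholder workload

elapsed = time.perf_counter() - t_wall0
cpu = time.process_time() - t_cpu0
print(f"elapsed: {elapsed:.3f} s, CPU: {cpu:.3f} s, wait: {elapsed - cpu:.3f} s")
```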
Execution Time
CPU Clocking
Clock cycle: the speed of a computer processor, or CPU, is determined by the clock cycle, which is the amount of time between two pulses of an oscillator. The clock rate is the inverse of the clock cycle time; for example, a 2 GHz clock has a cycle time of 0.5 ns.
CPU Time

Performance Equation - I:
CPU time = CPU clock cycles × clock cycle time = CPU clock cycles / clock rate
Example

• Our favorite program runs in 10 seconds on computer A, which has a 2 GHz clock.
• We are trying to help a computer designer build a new machine B, that
will run this program in 6 seconds. The designer can use new (or
perhaps more expensive) technology to substantially increase the clock
rate, but has informed us that this increase will affect the rest of the CPU
design, causing machine B to require 1.2 times as many clock cycles as
machine A for the same program.

• What clock rate should we tell the designer to target?
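
A minimal Python sketch of the arithmetic this example calls for, using the relation CPU time = clock cycles / clock rate; the variable names are my own.

```python
# Machine A: clock cycles = CPU time x clock rate
cpu_time_a = 10            # seconds
clock_rate_a = 2e9         # 2 GHz
cycles_a = cpu_time_a * clock_rate_a          # 20e9 clock cycles

# Machine B: 1.2x as many cycles, must finish in 6 seconds
cycles_b = 1.2 * cycles_a                     # 24e9 clock cycles
cpu_time_b = 6             # seconds
clock_rate_b = cycles_b / cpu_time_b

print(f"target clock rate for B: {clock_rate_b / 1e9:.1f} GHz")   # 4.0 GHz
```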


CPU Time Example
Instruction Count and CPI

Performance Equation - II:
CPU clock cycles = instruction count × average CPI, so
CPU time = instruction count × CPI × clock cycle time = instruction count × CPI / clock rate
Factors Influencing Performance

Execution time = clock cycle time x number of instrs x avg CPI

• Clock cycle time: manufacturing process (how fast is each transistor), how much work gets done in each pipeline stage (more on this later)

• Number of instrs: the quality of the compiler and the instruction set architecture

• CPI: the nature of each instruction and the quality of the architecture implementation
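As a hedged illustration of the equation above, here is a short Python sketch; the clock rate, instruction count, and CPI values are made up purely for illustration.

```python
def execution_time(clock_cycle_time_s, instruction_count, avg_cpi):
    """Execution time = clock cycle time x number of instructions x average CPI."""
    return clock_cycle_time_s * instruction_count * avg_cpi

# Illustrative values only (not from the slides)
t = execution_time(clock_cycle_time_s=0.5e-9,   # 2 GHz clock
                   instruction_count=10e9,
                   avg_cpi=1.5)
print(f"execution time: {t:.2f} s")              # 7.50 s
```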
CPU Time Example
Self Help

• Suppose we have two implementations of the same instruction set architecture (ISA). For some program:
  • machine A has a clock cycle time of 10 ns and a CPI of 2.0
  • machine B has a clock cycle time of 20 ns and a CPI of 1.2

• Which machine is faster for this program, and by how much?

• If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
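
A quick sketch of the first question in Python; since both machines run the same program for the same ISA, the instruction count cancels out of the ratio.

```python
instrs = 1.0                        # same instruction count on both machines

time_a = instrs * 2.0 * 10e-9       # CPI_A x cycle time_A
time_b = instrs * 1.2 * 20e-9       # CPI_B x cycle time_B

print(f"time_B / time_A = {time_b / time_a:.2f}")   # 1.20 -> A is 1.2x faster
```

That same observation answers the second question: the instruction count is the quantity that stays identical across two machines with the same ISA running the same compiled program.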
CPI Example
• A compiler designer is trying to decide between two code sequences for
a particular machine.
• Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C.

• Which code sequence has the most instructions? Which sequence will be
faster? How much? What is the CPI for each sequence?
CPI Example
• Which code sequence has the most instructions?
• Which sequence will be faster?
• What is the CPI for each sequence?
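
The instruction-mix table for this example did not survive extraction, so the sketch below uses hypothetical per-class CPIs and instruction counts (my own numbers, not the slide's) just to show how the three questions are answered.

```python
cpi = {"A": 1, "B": 2, "C": 3}              # assumed cycles per instruction class

# Hypothetical instruction counts per class for the two code sequences
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

def stats(mix):
    instrs = sum(mix.values())
    cycles = sum(mix[c] * cpi[c] for c in mix)
    return instrs, cycles, cycles / instrs   # count, total cycles, average CPI

for name, mix in (("sequence 1", seq1), ("sequence 2", seq2)):
    i, c, avg = stats(mix)
    print(f"{name}: {i} instructions, {c} cycles, CPI = {avg:.2f}")

# On the same clock, the sequence with fewer total cycles is the faster one,
# even if it executes more instructions.
```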


Self Help

• Two different compilers are being tested for a 500 MHz machine with
three different classes of instructions: Class A, Class B, and Class C,
which require 1, 2 and 3 cycles (respectively). Both compilers are used
to produce code for a large piece of software.
• Compiler 1 generates code with 5 billion Class A instructions, 1 billion
Class B instructions, and 1 billion Class C instructions.
• Compiler 2 generates code with 10 billion Class A instructions, 1 billion
Class B instructions, and 1 billion Class C instructions.

• Which sequence will be faster according to MIPS?


• Which sequence will be faster according to execution time?
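
A sketch of the arithmetic for this exercise, using only the figures given in the problem statement above.

```python
clock_rate = 500e6                       # 500 MHz
cpi = {"A": 1, "B": 2, "C": 3}           # cycles per instruction class

compilers = {
    "compiler 1": {"A": 5e9, "B": 1e9, "C": 1e9},
    "compiler 2": {"A": 10e9, "B": 1e9, "C": 1e9},
}

for name, mix in compilers.items():
    instrs = sum(mix.values())
    cycles = sum(mix[c] * cpi[c] for c in mix)
    exec_time = cycles / clock_rate                  # seconds
    mips = instrs / (exec_time * 1e6)                # million instructions / second
    print(f"{name}: execution time = {exec_time:.0f} s, MIPS = {mips:.0f}")

# compiler 2 looks better by MIPS (400 vs 350), but compiler 1 produces the
# faster code by execution time (20 s vs 30 s).
```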
Example
Benchmarks

• Performance is best determined by running a real application
  • use programs typical of the expected workload
  • or, typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.

• Benchmark suites
  • Each vendor announces a SPEC rating for their system
  • a measure of execution time for a fixed collection of programs
  • it is a function of a specific CPU, memory system, I/O system, operating system, and compiler, which enables easy comparison of different systems

• The key is coming up with a collection of relevant programs

SPEC (Standard Performance Evaluation Corporation)
• Sponsored by industry but independent and self-managed – trusted by
code developers and machine vendors
• Clear guides for testing, see www.spec.org
• Regular updates (benchmarks are dropped and new ones added
periodically according to relevance)
• Specialized benchmarks for particular classes of applications
SPEC CPU

• The 2006 version includes 12 integer and 17 floating-point applications

• The SPEC rating specifies how much faster a system is, compared to
a baseline machine – a system with SPEC rating 600 is 1.5 times
faster than a system with SPEC rating 400

• Note that this rating incorporates the behavior of all 29 programs – this
may not necessarily predict performance for your favorite program!

Summary

• Performance is specific to a particular program
  • total execution time is a consistent summary of performance
• For a given architecture, performance increases come from:
  • increases in clock rate (without adverse CPI effects)
  • improvements in processor organization that lower CPI
  • compiler enhancements that lower CPI and/or instruction count
Important Trends

• Running out of ideas to improve single thread performance

• Power wall makes it harder to add complex features

• Power wall makes it harder to increase frequency

Power Wall

• The energy of a pulse during the logic transition 0 → 1 → 0 or 1 → 0 → 1:
  Energy ∝ Capacitive load × Voltage²
• The energy of a single transition:
  Energy ∝ ½ × Capacitive load × Voltage²
• The power required per transistor:
  Power ∝ ½ × Capacitive load × Voltage² × Frequency switched
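
A small sketch of how the last proportionality behaves; the capacitance, voltage, and frequency values are illustrative only, not real process parameters.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic switching power per transistor: ~ 1/2 x C x V^2 x f."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency

# Illustrative values only
p_old = dynamic_power(capacitive_load=1.0, voltage=1.2, frequency=2.0e9)
p_new = dynamic_power(capacitive_load=1.0, voltage=1.0, frequency=3.0e9)

# Raising frequency 1.5x while lowering voltage from 1.2 V to 1.0 V
# leaves the power almost unchanged -- one reason voltage scaling mattered.
print(f"power ratio new/old = {p_new / p_old:.2f}")   # ~1.04
```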
Power Wall

Amdahl's Law

■ Gene Amdahl [AMDA67]
■ Deals with the potential speedup of a program using multiple processors compared to a single processor
■ Illustrates the problems facing industry in the development of multi-core machines
■ Software must be adapted to a highly parallel execution environment to exploit the power of parallel processing
■ Can be generalized to evaluate and design technical improvements in a computer system
Amdahl's Law

Amdahl's law states that in parallelization:
❑ P is the proportion of a system or program that can be made parallel
❑ 1 − P is the proportion that remains serial
❑ Then the maximum speedup that can be achieved using N processors is 1 / ((1 − P) + P/N)

If N tends to infinity, the maximum speedup tends to 1 / (1 − P).

If a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, and there is no limit on the number of processors, calculate the minimum execution time of the program.
Amdahl's Law

If a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, and there is no limit on the number of processors, calculate the minimum execution time of the program.

Answer:
The one-hour sequential part cannot be sped up, while the remaining 19 hours (P = 0.95) of execution time can be parallelized. Regardless of how many processors are devoted to the parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence the theoretical speedup is limited to at most 20 times (1/(1 − P) = 20).
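A minimal Python sketch of this example, following the formula from the previous slide; the function and variable names are mine.

```python
def amdahl_speedup(p, n):
    """Speedup with parallel fraction p on n processors: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

total_hours = 20.0
serial_hours = 1.0
p = (total_hours - serial_hours) / total_hours     # 0.95

for n in (2, 10, 100, 10_000):
    s = amdahl_speedup(p, n)
    print(f"n = {n:>6}: speedup = {s:5.2f}, time = {total_hours / s:5.2f} h")

# As n grows, the run time approaches the 1-hour serial part and the speedup
# approaches its limit of 1 / (1 - p) = 20.
```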
Amdahl's Law
Little's Law
■ Fundamental and simple relation with broad applications
■ Can be applied to almost any system that is statistically in steady state, and in which there is no leakage
■ Queuing system
  ■ If the server is idle, an arriving item is served immediately; otherwise it joins a queue
  ■ There can be a single queue for a single server or for multiple servers, or multiple queues, one for each of multiple servers
Little's Law
■ The average number of items in a queuing system equals the average rate at which items arrive multiplied by the average time an item spends in the system
■ The relationship requires very few assumptions
■ Because of its simplicity and generality it is extremely useful

Number of items in the system (L) = the rate at which items enter and leave the system (A: arrival rate / departure rate / throughput / λ) × the average amount of time an item spends in the system (W: lead time)

L = A × W

Little's Law

You are estimating the number of threads your server needs to execute client requests efficiently, and you initially start 4 threads on the server. The request arrival rate is 4 requests/sec, and each request takes a fixed amount of time to complete, as listed below. The arrival rate is fixed and every new request has a fixed service time.

Request_1 = 0.2 sec; Request_2 = 0.9 sec; Request_3 = 0.6 sec; Request_4 = 0.5 sec

What improvement factor should you consider to maximize your thread utilization?

Little's Law

Since each request is assigned to its own thread, no request waits, so the average response time is (0.2 + 0.9 + 0.6 + 0.5)/4 = 0.55 sec.

The request arrival rate is 4 requests/sec.

Applying Little's law, the number of threads required on the server to serve the requests is:

Required threads on server = request arrival rate × average response time = 4 × 0.55 = 2.2

So only about 2 threads are needed for an arrival rate of 4 req/sec with the given service times. With only 2 threads, 2 of the requests in each cycle of arrivals would have to wait, because there are only 2 resources (threads) for 4 arriving requests, and this would increase the response time. Conversely, with the 4 threads originally started, roughly 2 threads on the server will always remain idle.
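
A quick check of this arithmetic in Python, using only the numbers from the problem statement.

```python
arrival_rate = 4.0                           # requests per second
service_times = [0.2, 0.9, 0.6, 0.5]         # seconds per request (no waiting)
avg_response = sum(service_times) / len(service_times)   # 0.55 s

# Little's law: items in the system = arrival rate x time in the system
threads_busy = arrival_rate * avg_response
print(f"threads busy on average: {threads_busy:.1f}")     # 2.2

started_threads = 4
print(f"threads idle on average: {started_threads - threads_busy:.1f}")  # ~1.8
```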
