Introduction
Uniprocessor Performance?
Microprocessor Performance
11-01-2023
Trends:
Running out of ideas to improve single-thread performance
Power wall makes it harder to add complex features
Power wall makes it harder to increase frequency
Additional performance provided by: more cores, occasional spikes in
frequency, accelerators
Important Trends
Historical contributions to performance:
1. Better processes (faster devices) ∼20%
2. Better circuits/pipelines ∼15%
3. Better organization/architecture ∼15%
In the future, (2) will help little and (1) will eventually disappear!
Today, one can expect only a 20% annual improvement; the improvement is
even lower if the program is not multi-threaded
Manufacturing ICs
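$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{\left(1 + \dfrac{\text{Defects per area} \times \text{Die area}}{2}\right)^{2}}$$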
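As a rough numeric illustration of the dies-per-wafer and yield formulas, the sketch below evaluates both in Python. The function names and the wafer, die, and defect values are hypothetical, chosen only for illustration, and the yield model is assumed to be the common textbook form with a squared denominator. The die count ignores losses at the wafer edge, as the approximation does.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Approximate die count: wafer area divided by die area (ignores edge loss)."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    return wafer_area / die_area_cm2


def die_yield(defects_per_cm2: float, die_area_cm2: float) -> float:
    """Assumed yield model: 1 / (1 + defects_per_area * die_area / 2)^2."""
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2


# Hypothetical numbers: a 30 cm wafer with 0.5 defects per cm^2.
for area in (1.0, 2.0, 4.0):
    dies = dies_per_wafer(30.0, area)
    y = die_yield(0.5, area)
    print(f"die area {area} cm^2: ~{dies:.0f} dies, yield {y:.2f}, ~{dies * y:.0f} good dies")
```

Larger dies lose twice: fewer fit on the wafer, and a smaller fraction of them work.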
Transistor density increases by 35% per year and die size increases by 10-20%
per year… functionality improvements!
Wire delays do not scale down at the same rate as transistor delays
DRAM density increases by 40-60% per year, but latency has fallen by only 33% in
10 years (the memory wall!); bandwidth improves twice as fast as latency
decreases
Systems software (OS, compiler) → assembly code:
    lw  $15, 0($2)
    add $16, $15, $14
    add $17, $15, $13
    lw  $18, 0($12)
    lw  $19, 0($17)
    add $20, $18, $19
    sw  $20, 0($16)
Assembler → machine code for the hardware:
    000000101100000
    110100000100010
    …
Performance Metrics
• Possible measures:
response time – time elapsed between start and end of a program
throughput – amount of work done in a fixed time
Relative Performance
CPU Clocking
• Operation of digital hardware governed by a constant-rate clock
[Figure: clock waveform marking the clock period; within each clock cycle, data transfer and computation are followed by a state update]
CPU Time
Performance improved by
• Reducing number of clock cycles
• Increasing clock rate
• Hardware designer must often trade off clock rate against cycle count
CPU Time
$$\text{CPU execution time for a program} = \text{CPU clock cycles for a program} \times \text{Clock cycle time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$
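The trade-off between clock rate and cycle count can be made concrete with a small sketch. The two designs and all their numbers are hypothetical: design B is assumed to clock twice as fast as design A but to need 1.2 times as many cycles for the same program.

```python
def cpu_time(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical design A: 20 billion cycles at 2 GHz.
cycles_a, rate_a = 20e9, 2e9
# Hypothetical design B: 1.2x the cycles of A, at 4 GHz.
cycles_b, rate_b = 1.2 * cycles_a, 4e9

time_a = cpu_time(cycles_a, rate_a)
time_b = cpu_time(cycles_b, rate_b)
print(f"A: {time_a:.1f} s, B: {time_b:.1f} s, speedup of B over A: {time_a / time_b:.2f}x")
```

Doubling the clock rate gives less than a 2x speedup here because part of the gain is spent on the extra cycles.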
Example
2. The same program is converted into 2 billion x86 instructions; the x86
processor is implemented such that each instruction completes in an
average of 6 cycles and the clock speed is 1.5 GHz
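The execution time for this case follows from instruction count × CPI / clock rate. A minimal sketch, with a hypothetical helper name:

```python
def execution_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Execution time in seconds = instructions * cycles per instruction / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Values from the example: 2 billion instructions, CPI of 6, 1.5 GHz clock.
x86_time = execution_time(2e9, 6, 1.5e9)
print(f"x86 execution time: {x86_time:.1f} s")
```

With these inputs the program takes 12 billion cycles, or 8 seconds at 1.5 GHz.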
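$$CPI = \frac{\text{Clock cycles}}{\text{Instruction count}} = \sum_{i=1}^{n} \left( CPI_i \times \frac{\text{Instruction count}_i}{\text{Instruction count}} \right)$$
where the ratio of Instruction count_i to Instruction count is the relative frequency of instruction class i.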
CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C
Class               A   B   C
CPI for class       1   2   3
IC in sequence 1    2   1   2
IC in sequence 2    4   1   1
Sequence 1: IC = 5, Clock cycles = 2×1 + 1×2 + 2×3 = 10, Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6, Clock cycles = 4×1 + 1×2 + 1×3 = 9, Avg. CPI = 9/6 = 1.5
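The per-sequence clock cycles and average CPI can be checked mechanically from the table. The sketch below uses the table's values; the dictionary and function names are hypothetical.

```python
# CPI of each instruction class, from the table.
CPI_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for each compiled sequence, from the table.
SEQUENCES = {
    "sequence 1": {"A": 2, "B": 1, "C": 2},
    "sequence 2": {"A": 4, "B": 1, "C": 1},
}


def clock_cycles(counts: dict) -> int:
    """Total cycles = sum over classes of (class CPI * class instruction count)."""
    return sum(CPI_PER_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Average CPI = total cycles / total instruction count."""
    return clock_cycles(counts) / sum(counts.values())


for name, counts in SEQUENCES.items():
    ic = sum(counts.values())
    print(f"{name}: IC = {ic}, cycles = {clock_cycles(counts)}, avg CPI = {average_cpi(counts):.1f}")
```

Sequence 2 executes more instructions yet takes fewer cycles, so instruction count alone does not predict performance.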
Performance Summary
The BIG Picture
• Performance depends on (IC = instruction count, CPI = cycles per instruction, Tc = clock cycle time)
• Algorithm: affects IC, possibly CPI
• Programming language: affects IC, CPI
• Compiler: affects IC, CPI
• Instruction set architecture: affects IC, CPI, Tc
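The factors in the list above multiply together in the standard CPU time equation, restated here as a reminder of why each layer matters:

$$\text{CPU time} = IC \times CPI \times T_c = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

A change at any layer helps only if the product falls; reducing IC while raising CPI by the same factor leaves CPU time unchanged.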
Amdahl’s Law
Amdahl’s Law: the performance improvement from an enhancement is
limited by the fraction of time the enhancement is in use
$$20 = \frac{80}{n} + 20$$
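A short sketch shows why no finite n satisfies the equation above. It assumes this reading of the numbers: 80 units of time are sped up by a factor n, 20 units are unaffected, and the target total is 20. The function name is hypothetical.

```python
def improved_time(affected: float, unaffected: float, factor: float) -> float:
    """Amdahl's Law: only the affected part of the time shrinks by `factor`."""
    return affected / factor + unaffected


# Assumed reading: 80 units affected by the enhancement, 20 units unaffected.
for n in (2, 10, 100, 1_000_000):
    t = improved_time(80, 20, n)
    print(f"n = {n}: time = {t:.4f}, overall speedup = {100 / t:.3f}x")
```

The total approaches 20 as n grows but never reaches it, so the overall speedup stays below 5x however large the enhancement.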
Conclusion
Cost/performance is improving
• Due to underlying technology development
Hierarchical layers of abstraction
• In both hardware and software
Instruction set architecture
• The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
• Use parallelism to improve performance