Intro

Hardware performance has historically improved through faster processors and better architecture, but power constraints now limit frequency scaling. Multicore designs and specialized accelerators are necessary to continue performance gains. Programmers must optimize for parallelism, data locality, and efficient use of hardware resources to achieve good performance on modern processors.

Uploaded by

vineet

11-01-2023

Introduction

Why study Hardware?

➢ Decline of Moore’s Law (#transistors per silicon chip doubles every 18-24 months)

➢ Proliferation of multi-core processors

➢ Emergence of new platforms (e.g., cell phones, automobiles)

Hardware knowledge helps programmers (chip/OS/compiler) to write better code


Uniprocessor Performance?

50% improvement every year!

Constrained by power wall
What contributes to this improvement?

Microprocessor Performance

Power Consumption Trends

➢ Dynamic power ∝ activity × capacitance × voltage² × frequency
Q. What is the effect of Moore’s Law scaling on the dynamic power equation?
➢ Voltage and frequency are somewhat constant now, while capacitance per transistor is decreasing and number of transistors (activity) is increasing
➢ Leakage power is also rising (function of #transistors and voltage)

Summary and Important Trends

Summary:
➢ Increasing frequency led to power wall in early 2000s
➢ Frequency has stagnated since then
➢ End of voltage scaling in early 2010s

Trends:
➢ Running out of ideas to improve single thread performance
➢ Power wall makes it harder to add complex features
➢ Power wall makes it harder to increase frequency
➢ Additional performance provided by: more cores, occasional spikes in frequency, accelerators

Important Trends
Historical contributions to performance:
1. Better processes (faster devices) ∼20%
2. Better circuits/pipelines ∼ 15%
3. Better organization/architecture ∼ 15%

In the future, (2) will help little and (1) will eventually disappear!

             Pentium   P-Pro     P-II      P-III     P-4       Itanium   Montecito
Year         1993      1995      1997      1999      2000      2002      2005
Transistors  3.1M      5.5M      7.5M      9.5M      42M       300M      1720M
Clock speed  60 MHz    200 MHz   300 MHz   500 MHz   1500 MHz  800 MHz   1800 MHz

Moore’s Law in action

At this point, adding transistors
to a core yields little benefit

What Does This Mean to a Programmer

Today, one can expect only a 20% annual improvement; the improvement is even lower if the program is not multi-threaded

➢ A program needs many threads
➢ The threads need efficient synchronization and communication
➢ Data placement in the memory hierarchy is important
➢ Accelerators should be used when possible

Challenges for Hardware Designers

Find efficient ways to

➢ improve single-thread performance and energy
➢ improve data sharing
➢ boost programmer productivity
➢ manage the memory system
➢ build accelerators for important kernels
➢ provide security

Manufacturing ICs

Yield: proportion of working dies per wafer

Intel® Core 10th Gen
300mm wafer, 506 chips, 10nm technology
Each chip is 11.4 × 10.7 mm

$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{\left(1 + \dfrac{\text{Defects per area} \times \text{Die area}}{2}\right)^{2}}$$

• Nonlinear relation to area and defect rate
• Wafer cost and area are fixed
• Defect rate determined by manufacturing process
• Die area determined by architecture and circuit design
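
The cost relations on this slide can be tried out numerically. The sketch below is only illustrative: it assumes the yield model 1 / (1 + defects per area × die area / 2)², and the wafer cost and defect density passed in the usage lines are made-up values, not figures from the slides. The function names are hypothetical.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Area-ratio estimate; ignores partial dies lost at the wafer edge."""
    wafer_area_mm2 = math.pi * (wafer_diameter_mm / 2) ** 2
    return wafer_area_mm2 / die_area_mm2


def die_yield(defects_per_mm2: float, die_area_mm2: float) -> float:
    """Assumed yield model: 1 / (1 + defects * area / 2)^2."""
    return 1.0 / (1.0 + defects_per_mm2 * die_area_mm2 / 2.0) ** 2


def cost_per_die(wafer_cost: float, wafer_diameter_mm: float,
                 die_area_mm2: float, defects_per_mm2: float) -> float:
    """Cost per die = cost per wafer / (dies per wafer * yield)."""
    dies = dies_per_wafer(wafer_diameter_mm, die_area_mm2)
    return wafer_cost / (dies * die_yield(defects_per_mm2, die_area_mm2))


if __name__ == "__main__":
    die_area = 11.4 * 10.7  # mm^2, die size quoted on the slide
    # Wafer cost (5000) and defect density (0.001 per mm^2) are assumptions.
    print("dies per wafer (area ratio):", dies_per_wafer(300, die_area))
    print("yield:", die_yield(0.001, die_area))
    print("cost per die:", cost_per_die(5000, 300, die_area, 0.001))
```

The area ratio gives a somewhat higher die count than the 506 chips quoted for the wafer, since square dies cannot tile the round edge completely.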

Processor Technology Trends

➢ Shrinking of transistor sizes: 250nm (1997) → 130nm (2002) → 70nm (2008) → 35nm (2014) → 10nm (2019) → now transitioning to 7nm

➢ Transistor density increases by 35% per year and die size increases by 10-20% per year… functionality improvements!

➢ Transistor speed improves linearly with size (complex equation involving voltages, resistances, capacitances)

➢ Wire delays do not scale down at the same rate as transistor delays

Memory and IO Technology Trends

➢ DRAM density increases by 40-60% per year, latency has reduced by 33% in 10 years (the memory wall!), bandwidth improves twice as fast as latency decreases

➢ Disk density improves by 100% every year, latency improvement similar to DRAM

➢ Networks: primary focus on bandwidth; 10Mb → 100Mb in 10 years; 100Mb → 1Gb in 5 years

The HW/SW Interface

Application software:
    a[i] = b[i] + c;

↓ Compiler

Systems software (OS, compiler):
    lw   $15, 0($2)
    add  $16, $15, $14
    add  $17, $15, $13
    lw   $18, 0($12)
    lw   $19, 0($17)
    add  $20, $18, $19
    sw   $20, 0($16)

↓ Assembler

Hardware:
    000000101100000
    110100000100010
    …

Performance Metrics

• Possible measures:
  ➢ response time – time elapsed between start and end of a program
  ➢ throughput – amount of work done in a fixed time

• How are response time and throughput affected by

Replacing the processor with a faster version?
Adding more processors?

• Note: we will be primarily concerned with response time

Relative Performance

• Define Performance = 1/Execution Time

• “X is n times faster than Y”

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

➢ Example: time taken to run a program
  ➢ 10s on A, 15s on B
  ➢ Execution Time_B / Execution Time_A = 15s / 10s = 1.5
  ➢ So A is 1.5 times faster than B
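
As a quick check of the relative-performance definition, here is a minimal sketch using the A/B timings from the example. The helper names `performance` and `times_faster` are hypothetical, chosen only for illustration.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s: float, time_y_s: float) -> float:
    """Return n such that 'X is n times faster than Y'."""
    return performance(time_x_s) / performance(time_y_s)


if __name__ == "__main__":
    # 10 s on A, 15 s on B (values from the slide example)
    print("A vs B:", times_faster(10.0, 15.0))
```

Because performance is a reciprocal, the ratio of performances equals the inverse ratio of execution times, which is why the slower machine's time ends up in the numerator.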

CPU Clocking
• Operation of digital hardware governed by a constant-rate clock

[Figure: clock waveform marking the clock period, clock cycles, data transfer and computation, and state update]

➢ Clock period: duration of a clock cycle
  ➢ e.g., 250ps = 0.25ns = 250×10⁻¹² s

➢ Clock frequency (rate): cycles per second
  ➢ e.g., 4.0GHz = 4000MHz = 4.0×10⁹ Hz
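
Clock period and clock rate are reciprocals of each other, so the two examples above describe the same clock. A small sketch of the conversion follows; the function names are illustrative only.

```python
def clock_rate_hz(clock_period_s: float) -> float:
    """Clock rate (cycles per second) from clock period (seconds per cycle)."""
    return 1.0 / clock_period_s


def clock_period_s(rate_hz: float) -> float:
    """Clock period (seconds per cycle) from clock rate (cycles per second)."""
    return 1.0 / rate_hz


if __name__ == "__main__":
    print("250 ps period in Hz:", clock_rate_hz(250e-12))
    print("4.0 GHz rate in seconds per cycle:", clock_period_s(4.0e9))
```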

CPU Time

$$\text{CPU execution time for a program} = \text{CPU clock cycles for a program} \times \text{Clock cycle time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

Performance improved by
• Reducing number of clock cycles
• Increasing clock rate
• Hardware designer must often trade off clock rate against cycle count

Example: A program runs on

• Computer A: 2GHz clock, 10s CPU time
• Designing computer B:
  • Aim for 6s CPU time
  • Can have a faster clock, but the faster clock affects the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A to execute the program
• How fast must Computer B's clock be?
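
One way to work this example is to recover the cycle count on A from its CPU time and clock rate, scale it by 1.2 for B, and divide by B's target time. The sketch below does exactly that; the function and parameter names are made up for illustration.

```python
def required_clock_rate_hz(time_a_s: float, rate_a_hz: float,
                           cycle_ratio_b_over_a: float,
                           target_time_b_s: float) -> float:
    """Clock rate B needs so that its CPU time hits the target."""
    cycles_a = time_a_s * rate_a_hz          # CPU time = cycles / clock rate
    cycles_b = cycle_ratio_b_over_a * cycles_a
    return cycles_b / target_time_b_s


if __name__ == "__main__":
    # A: 2 GHz, 10 s; B: 1.2x the cycles, 6 s target (values from the slide)
    rate_b = required_clock_rate_hz(10.0, 2e9, 1.2, 6.0)
    print("Computer B clock rate (GHz):", rate_b / 1e9)
```

With the slide's numbers, A executes 20×10⁹ cycles, B needs 24×10⁹ cycles, and 24×10⁹ cycles in 6 s corresponds to a 4 GHz clock.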

Instruction Count (IC) and Cycles per Instruction (CPI)

Clock cycles = Instruction count × Cycles per instruction

CPU time = Instruction count × Cycles per instruction × Clock cycle time

• Instruction Count for a program
  • Determined by program, ISA and compiler
• Average cycles per instruction
  • Determined by CPU hardware
  • If different instructions have different CPI: Average CPI affected by instruction mix

Example

Which of the following two systems is better?

1. A program is converted into 4 billion MIPS instructions by a compiler; the MIPS processor is implemented such that each instruction completes in an average of 1.5 cycles and the clock speed is 1 GHz

2. The same program is converted into 2 billion x86 instructions; the x86 processor is implemented such that each instruction completes in an average of 6 cycles and the clock speed is 1.5 GHz
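
The comparison can be worked directly from CPU time = Instruction count × CPI × Clock cycle time, taking "better" to mean lower CPU time for this program. A minimal sketch, with a hypothetical helper name:

```python
def cpu_time_s(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count * CPI * clock cycle time (= 1 / clock rate)."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    mips_time = cpu_time_s(4e9, 1.5, 1.0e9)   # system 1 from the slide
    x86_time = cpu_time_s(2e9, 6.0, 1.5e9)    # system 2 from the slide
    print("System 1 (MIPS) CPU time in s:", mips_time)
    print("System 2 (x86) CPU time in s:", x86_time)
```

System 1 needs 4×10⁹ × 1.5 / 10⁹ = 6 s and system 2 needs 2×10⁹ × 6 / (1.5×10⁹) = 8 s, so system 1 is faster despite executing twice as many instructions at a lower clock rate.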

CPI for Different Instruction Classes

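• If different instruction classes take different numbers of cycles

$$\text{Clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$

• Weighted average CPI

$$\text{CPI} = \frac{\text{Clock cycles}}{\text{Instruction count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}}\right)$$

where IC_i / Instruction count is the relative frequency of class i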

CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C

Class              A    B    C
CPI for class      1    2    3
IC in sequence 1   2    1    2
IC in sequence 2   4    1    1

➢ Sequence 1: IC = 5
  ➢ Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  ➢ Avg. CPI = 10/5 = 2.0

➢ Sequence 2: IC = 6
  ➢ Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  ➢ Avg. CPI = 9/6 = 1.5
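
The two sequences can be checked with a short script that sums CPI × IC per class and divides by the total instruction count. The dictionary layout and function names below are assumptions made for this sketch.

```python
def total_cycles(cpi_by_class: dict, ic_by_class: dict) -> int:
    """Clock cycles = sum over classes of (CPI of class * IC of class)."""
    return sum(cpi_by_class[c] * ic_by_class[c] for c in ic_by_class)


def average_cpi(cpi_by_class: dict, ic_by_class: dict) -> float:
    """Weighted average CPI = clock cycles / total instruction count."""
    return total_cycles(cpi_by_class, ic_by_class) / sum(ic_by_class.values())


if __name__ == "__main__":
    cpi = {"A": 1, "B": 2, "C": 3}    # CPI per class, from the table
    seq1 = {"A": 2, "B": 1, "C": 2}   # IC in sequence 1
    seq2 = {"A": 4, "B": 1, "C": 1}   # IC in sequence 2
    for name, seq in (("Sequence 1", seq1), ("Sequence 2", seq2)):
        print(name, "cycles:", total_cycles(cpi, seq), "avg CPI:", average_cpi(cpi, seq))
```

Sequence 2 executes more instructions but fewer cycles, so at the same clock rate it finishes sooner.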

Performance Summary
The BIG Picture

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

• Performance depends on
  • Algorithm: affects IC, possibly CPI
  • Programming language: affects IC, CPI
  • Compiler: affects IC, CPI
  • Instruction set architecture: affects IC, CPI, Tc (clock cycle time)

Power and Energy

➢ Total power = dynamic power + leakage power
➢ Dynamic power ∝ activity × capacitance × voltage² × frequency
➢ Leakage power ∝ voltage
➢ Energy (J) = power (W) × time (s)

Example: A 1 GHz processor takes 100 seconds to execute a program, while consuming 70 W of dynamic power and 30 W of leakage power. Does the program consume less energy in Turbo boost mode when the frequency is increased to 1.2 GHz?

Normal mode energy = 100 W × 100 s = 10000 J

Turbo mode energy = (70 × 1.2 + 30) W × (100/1.2) s = 9500 J
Note: Frequency only impacts dynamic power, not leakage power. We assume that the program’s CPI is unchanged when frequency is changed, i.e., execution time varies linearly with cycle time
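
The Turbo boost arithmetic can be reproduced with a few lines. This sketch encodes the same assumptions as the note: dynamic power scales linearly with frequency, leakage power is unaffected, and execution time scales inversely with frequency. The function name and the `freq_scale` parameter are illustrative.

```python
def energy_joules(dynamic_w: float, leakage_w: float, time_s: float,
                  freq_scale: float = 1.0) -> float:
    """Energy = power * time under a frequency scaling factor.

    Assumes dynamic power is proportional to frequency, leakage power is
    constant, and execution time is inversely proportional to frequency.
    """
    power_w = dynamic_w * freq_scale + leakage_w
    return power_w * (time_s / freq_scale)


if __name__ == "__main__":
    print("Normal mode (J):", energy_joules(70.0, 30.0, 100.0))
    print("Turbo mode (J):", energy_joules(70.0, 30.0, 100.0, freq_scale=1.2))
```

Under these assumptions the dynamic energy is unchanged (more power for less time) while the leakage energy shrinks with the shorter run, which is where the 500 J saving comes from.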

Amdahl’s Law
Amdahl’s Law: the performance improvement from an enhancement is limited by the fraction of time the enhancement comes into play

Example: Suppose a program runs in 100 sec on a machine, with multiply operations responsible for 80 sec of this time. How much do you have to improve the speed of multiplication if you want the program to run five times faster?

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

$$20 = \frac{80}{n} + 20$$

This requires 80/n = 0, which no finite n satisfies, so improving multiplication alone cannot make the program five times faster.

• Architecture design is very bottleneck-driven – make the common case fast, do not waste resources on a component that has little impact on overall performance/power
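
The Amdahl's Law example above can be explored with a small sketch that solves for the required improvement factor n, given the affected and unaffected portions of the run time. The function names are hypothetical, and returning `None` for an unreachable target is a design choice of this sketch.

```python
from typing import Optional


def time_after_improvement(affected_s: float, unaffected_s: float, n: float) -> float:
    """Execution time after speeding up the affected part by a factor n."""
    return affected_s / n + unaffected_s


def required_improvement(affected_s: float, unaffected_s: float,
                         target_s: float) -> Optional[float]:
    """Factor n needed to reach target_s, or None if no finite n works."""
    remaining_s = target_s - unaffected_s
    if remaining_s <= 0:
        return None  # the unaffected time alone already uses up the target
    return affected_s / remaining_s


if __name__ == "__main__":
    # 100 s total, 80 s in multiply (values from the slide example)
    print("5x faster (20 s target):", required_improvement(80.0, 20.0, 20.0))
    print("4x faster (25 s target):", required_improvement(80.0, 20.0, 25.0))
```

The five-times target returns `None` because the 20 s of non-multiply work already equals the target time, while a four-times target (25 s) would need multiplication to be 16 times faster.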

Conclusion

Cost/performance is improving
• Due to underlying technology development
Hierarchical layers of abstraction
• In both hardware and software
Instruction set architecture
• The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
• Use parallelism to improve performance
