Administrivia: ECE 252 / CPS 220 Advanced Computer Architecture I
© 2009 by Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
ECE 252 / CPS 220 Lecture Notes: Introduction

What is This Course All About?
State-of-the-art computer hardware design
Topics
• Microarchitecture of single core microprocessors
• Memory system architecture
• Multithreaded processors
• Multicore processors
Fundamentals, current systems, and future systems
Will read from: classic papers, brand-new papers, textbook

Course Goals and Activities
Course Goals
+ Understand how current processors work
+ Understand how to evaluate/compare processors
+ Learn how to use simulator to perform experiments
+ Learn research skills by performing term project
+ Learn how to critically read research papers
Course Activities:
• Will loosely follow textbook
• Major emphasis on cutting-edge issues
• Students will read and discuss many research papers
• Term project

What You Should Expect from Course
Things NOT to expect in this course:
• 100% of class = me lecturing to you
• Homework sets and exams where every question is either quantitative or has a single correct answer
Things to expect in this course:
• Active discussions/arguments about architecture ideas
• Essay questions
• Being asked to explain, discuss, defend, and argue
• Questions with multiple possible answers

What I Expect You to Know Already
Courses you should have taken already
• Basic architecture (ECE 152 / CPS 104 or equivalent)
• Programming in C/C++/Java (our simulator is in C)
• Basic OS (ECE 153 / CPS 110): not critical, but helpful
Topics you should remember fondly (I will not cover these in any detail in this course)
• Instruction sets, computer arithmetic, assembly programming, memory, I/O
Topics that will be briefly reviewed but that you should’ve seen before
• Pipelining, caches, virtual memory

Course Components
Reading Materials
• Computer Architecture: A Quantitative Approach by Hennessy and Patterson, 4th Edition
• (optional) Modern Processor Design by Shen and Lipasti
• Recent research papers (on course website)
Homework
• 4 to 6 homework assignments, performed in groups of 2
Term Project
• Groups of 2 or 3
Exams
• Midterm and final exam

Term Project
This is a semester-long research project
• Do not expect to do whole thing in last week, because E[project grade] < B
• I will suggest a bunch of possible project ideas, but many students choose to pursue their own ideas
• Project proposals due TBD
You may “combine” this project with a project from another class, but you MUST consult with me first
You must absolutely, positively reference prior work
• Please ask me if you have ANY questions
• Not knowing != valid excuse

A Friendly Warning
This is not an easy class.
Seriously.
If you’re an ECE undergrad: consider programming in C
If you’re a CS grad: consider having to think about circuits
Please see me if you think you might be getting in over your head!

What is Computer Architecture?
The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the dataflow and controls, the logic design, and the physical implementation.
–Gene Amdahl, IBM Journal of R&D, Apr 1964

The Role of the Microarchitect
architect: defines the hardware/software interface
microarchitect: defines the hardware implementation
• usually the same person as the architect
Two very important questions in this course:
What goals are we (microarchitects!) trying to achieve?
And what units do we use to measure our success?
Hint: how do you decide which computer to buy?
• desktop? laptop? smart phone? mp3 player?
• is a Dell box better/worse than an iMac?

Applications -> Requirements -> Designs
• scientific: weather prediction, molecular modeling
  • need: large memory, floating-point arithmetic
  • examples: CRAY XT4, IBM BlueGene/L
• commercial: inventory, payroll, web serving, e-commerce
  • need: integer arithmetic, high I/O
  • examples: SUN SPARCcenter, Enterprise, AlphaServer GS320
• desktop: multimedia, games, entertainment
  • need: high data bandwidth, graphics
  • examples: Intel Core2 Quad, AMD Opteron QuadCore, IBM Power6
• mobile: laptops, netbooks, tablet PCs
  • need: low power (battery), decent performance
  • examples: Intel Celeron, AMD Turion, Intel Atom
• embedded: cell phones, automobile engines, door knobs
  • need: low power (battery + heat), low cost
  • examples: ARM core, Intel Atom

• “if you build it, they will come”
is speed the only goal?
• power: heat dissipation + battery life + utility bill
• cost
• reliability
• etc.

• parameters change and change relative to one another!
• and that’s not even including “exotic” nanotechnologies
• or, for that matter, less exotic technologies like Flash memory
designs change even if requirements fixed
... but requirements are not fixed

Examples of Changing Designs
example I: caches
• 1970: 10K transistors, DRAM faster than logic -> bad idea
• 1990: 1M transistors, logic faster than DRAM -> good idea
• will caches ever be a bad idea again?
example II: out-of-order execution
• 1985: 100K transistors + no precise interrupts -> bad idea
• 1995: 2M transistors + precise interrupts -> good idea
• 2005: 500M transistors + 4GHz clock -> bad idea?
• 2009: >1B transistors + multiple cores -> ???
semiconductor technology is an incredible driving force

Moore’s Law
“Cramming More Components onto Integrated Circuits”
–G.E. Moore, Electronics, 1965
• observation: (DRAM) transistor density doubles annually
  • became known as “Moore’s Law”
  • wrong: density doubles every 18 months (had only 4 data points)
• corollaries
  • cost per transistor halves annually (18 months)
  • power per transistor decreases with scaling
  • speed increases with scaling
  • reliability starting to decrease with scaling

• wrong! “performance” used to double every ~2 years
• self-fulfilling prophecy (Moore’s Curve)
  • 2X every 2 years = ~3% increase per month
  • 3% per month used to judge performance features
  • if feature adds 9 months to schedule...
  • ...it should add at least 30% to performance (1.03^9 ≈ 1.30 → 30%)
  • e.g., Intel Itanium: under Moore’s Curve in a big way
performance improvements have slowed down in past few years
• architects haven’t figured out how to use the extra transistors to improve performance of single core without melting the chip --> multicore chips at lower frequencies

Clock Frequency    0.2–2MHz   2–20MHz   20M–1GHz   2GHz?
IPC (per core)     < 0.1      0.1–0.9   0.9–2.0    2.0?
MIPS/MFLOPS        < 0.2      0.2–20    20–2,000   100,000?
Number of cores    1          1         1          64?
some perspective: 1971–2001 performance improved 35,000X!!!
• what if cars improved at this rate?
• 1971: 60 MPH & 10 MPG, 2001: 2,100,000 MPH & 350,000 MPG
• but... what if cars crashed as often as computers did?

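The Moore’s Curve arithmetic (about 3% per month, so a 9-month slip must buy about 30%) can be checked with a few lines of Python. This is only a sketch: the function names `monthly_growth` and `required_speedup` are illustrative and not part of the course materials.

```python
# Sketch of the "Moore's Curve" arithmetic; all names here are illustrative.

def monthly_growth(doubling_period_months: float) -> float:
    """Per-month growth factor if performance doubles every given period."""
    return 2.0 ** (1.0 / doubling_period_months)


def required_speedup(delay_months: float, monthly_rate: float = 0.03) -> float:
    """Speedup a feature must deliver to justify slipping the schedule."""
    return (1.0 + monthly_rate) ** delay_months


if __name__ == "__main__":
    g = monthly_growth(24)  # 2X every 2 years
    print(f"monthly growth: {100 * (g - 1):.1f}%")
    print(f"9-month feature must add: {100 * (required_speedup(9) - 1):.0f}%")
    # 30 years of doubling every 2 years is 2^15, the same order of
    # magnitude as the 35,000X quoted for 1971-2001.
    print(f"30 years at 2X per 2 years: {2 ** (30 // 2):,}X")
```

Doubling every 24 months works out to just under 3% per month, which is where the rule of thumb comes from.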
Performance
Much of the focus of this course is on improving performance
Topics:
• performance metrics
• CPU performance equation
• benchmarks and benchmarking
• reporting averages
• Amdahl’s Law
• Little’s Law
• concepts
  • balance
  • tradeoffs
  • bursty behavior (average and peak performance)

Readings
Hennessy & Patterson
• Chapter 1
R. P. Colwell et al. “Instruction Sets and Beyond: Computers, Complexity, and Controversy.” IEEE Computer, 18(9), 1985.

Performance Metric II: MFLOPS
MFLOPS (millions of floating-point operations per second)
• (FP ops / execution time) x 10^-6
• like MIPS, but counts only FP operations
  • FP ops have longest latencies anyway (problem #1)
  • FP ops are the same across machines (problem #2)
– may have been valid in 1980 (most programs were FP)
  • most programs today are “integer” i.e., light on FP
  • load from memory takes longer than FP divide (prob #1)
  • Cray doesn’t implement divide, Motorola has SQRT, SIN, COS (#2)

CPU Performance Equation
processor performance = seconds / program
• separate into three components (for single core)
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$$
• instructions / program: architecture (ISA), compiler-designer
• cycles / instruction: implementation (micro-architecture), processor-designer
• seconds / cycle: realization (physical layout), circuit-designer

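To make the three-factor equation concrete, here is a minimal Python sketch that multiplies the factors out and derives MIPS and MFLOPS from the result. The workload numbers (2 billion instructions, CPI of 1.5, 2 GHz clock, 300 million FP ops) are made-up assumptions for illustration, and the function names are hypothetical.

```python
# Sketch of the CPU performance equation; names and numbers are illustrative.

def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """seconds/program = instructions/program x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle


def mips(instructions: float, seconds: float) -> float:
    """Millions of instructions per second."""
    return instructions / seconds * 1e-6


def mflops(fp_ops: float, seconds: float) -> float:
    """Millions of floating-point operations per second."""
    return fp_ops / seconds * 1e-6


if __name__ == "__main__":
    # Hypothetical workload, chosen only to exercise the formulas.
    t = execution_time(instructions=2e9, cpi=1.5, clock_hz=2e9)
    print(f"execution time: {t:.2f} s")
    print(f"MIPS:   {mips(2e9, t):.0f}")
    print(f"MFLOPS: {mflops(3e8, t):.0f}")
```

Because the three factors multiply, improving one (say, a faster clock) helps only if it does not cost more in another (say, a higher CPI).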
CPU Back-of-the-Envelope Calculation
base machine
• 43% ALU ops (1 cycle), 21% loads (1 cycle), 12% stores (2 cycles), 24% branches (2 cycles)
• note: pretending latency is 1 because of pipelining
Q: should 1-cycle stores be implemented if it slows clock 15%?
• old CPI = 0.43 + 0.21 + (0.12 x 2) + (0.24 x 2) = 1.36
• new CPI = 0.43 + 0.21 + 0.12 + (0.24 x 2) = 1.24
• speedup = (P x 1.36 x T) / (P x 1.24 x 1.15T) = 0.95
• A: no, speedup < 1 means the new design is slower

Actually Measuring Performance
how are execution-time & CPI actually measured?
• execution time: time (Unix cmd): wall-clock, CPU, system
• CPI = (CPU time * clock frequency) / # instructions
• more useful? CPI breakdown (compute, memory stall, etc.)
  • so we know what the performance problems are (what to fix)
measuring CPI breakdown
• hardware event counters (built into core)
  • calculate CPI using instruction frequencies/event costs

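The back-of-the-envelope calculation can be reproduced in a few lines. The sketch below uses the instruction mix and cycle counts from the slide; the names `BASE_MIX`, `FAST_STORE_MIX`, `cpi`, and `speedup` are illustrative choices, not from the course simulator.

```python
# Sketch of the 1-cycle-store tradeoff; data from the slide, names illustrative.

# instruction class -> (fraction of dynamic instructions, cycles each)
BASE_MIX = {
    "alu": (0.43, 1),
    "load": (0.21, 1),
    "store": (0.12, 2),
    "branch": (0.24, 2),
}

# Same mix, but with stores taking 1 cycle instead of 2.
FAST_STORE_MIX = dict(BASE_MIX, store=(0.12, 1))


def cpi(mix: dict) -> float:
    """Average cycles per instruction: sum of frequency x cycle cost."""
    return sum(freq * cycles for freq, cycles in mix.values())


def speedup(old_cpi: float, new_cpi: float, clock_slowdown: float) -> float:
    """Old time / new time; instruction count P and base cycle time T cancel."""
    return old_cpi / (new_cpi * clock_slowdown)


if __name__ == "__main__":
    old, new = cpi(BASE_MIX), cpi(FAST_STORE_MIX)
    print(f"old CPI = {old:.2f}, new CPI = {new:.2f}")
    print(f"speedup = {speedup(old, new, clock_slowdown=1.15):.2f}")
```

A speedup below 1 means the CPI gain from 1-cycle stores does not pay for the 15% slower clock.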
Benchmarks: Toys, Kernels, Synthetics
toy benchmarks: little programs that no one really runs
• e.g., fibonacci, 8 queens
– little value, what real programs do these represent?
  • scary fact: used to prove the value of RISC in early 80’s
kernels: important (frequently executed) pieces of real programs
• e.g., Livermore loops, Linpack (inner product)
+ good for focusing on individual features, but not big picture
– over-emphasize target feature (for better or worse)
synthetic benchmarks: programs made up for benchmarking
• e.g., Whetstone, Dhrystone
• toy kernels++, which programs do these represent?

Benchmarks: Real Programs
real programs
+ only accurate way to characterize performance
– requires considerable work (porting)
Standard Performance Evaluation Corporation (SPEC)
• http://www.spec.org
• collects, standardizes and distributes benchmark suites
• consortium made up of industry leaders
• SPEC CPU (CPU intensive benchmarks)
  • SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
• other benchmark suites
  • SPECjvm, SPECmail, SPECweb, SPEComp

Reporting Average Performance
averages: one of the things architects frequently get wrong
+ pay attention now and you won’t get them wrong
important things about averages (i.e., means)
• ideally proportional to execution time (ultimate metric)
  • Arithmetic Mean (AM) for times
  • Harmonic Mean (HM) for rates (IPCs)
  • Geometric Mean (GM) for ratios (speedups)
• there is no such thing as the average program
• use average when absolutely necessary

What Does The Mean Mean?
arithmetic mean (AM): average execution times of N programs
• $\frac{1}{N}\sum_{i=1}^{N} \text{time}(i)$
harmonic mean (HM): average IPCs of N programs
• arithmetic mean cannot be used for rates (e.g., IPCs)
  • 30 MPH for 1 mile + 90 MPH for 1 mile != avg. 60 MPH
• $N / \sum_{i=1}^{N} \frac{1}{\text{rate}(i)}$
geometric mean (GM): average speedups of N programs
• $\sqrt[N]{\prod_{i=1}^{N} \text{speedup}(i)}$
what if programs run at different frequencies within workload?
• “weighting”
• weighted AM = $\frac{1}{N}\sum_{i=1}^{N} w(i) \cdot \text{time}(i)$ (a true mean only when the weights w(i) sum to N)

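The three means, and why the arithmetic mean fails for rates, can be seen in a short Python sketch. The function names are illustrative. The weighted mean here normalizes by the sum of the weights, which is an assumption on my part; it agrees with dividing by N whenever the weights sum to N.

```python
# Sketch of AM / HM / GM; function names are illustrative.
import math


def arithmetic_mean(times):
    """For execution times."""
    return sum(times) / len(times)


def harmonic_mean(rates):
    """For rates such as IPC or MPH over equal amounts of work."""
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean(ratios):
    """For ratios such as speedups."""
    return math.prod(ratios) ** (1.0 / len(ratios))


def weighted_arithmetic_mean(times, weights):
    """Weighted AM, normalized by the total weight (an assumption, see lead-in)."""
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)


if __name__ == "__main__":
    # 1 mile at 30 MPH plus 1 mile at 90 MPH: 2 miles in 1/30 + 1/90 hours.
    print(f"AM of speeds: {arithmetic_mean([30, 90]):.0f} MPH (misleading)")
    print(f"HM of speeds: {harmonic_mean([30, 90]):.0f} MPH (distance / time)")
```

The harmonic mean matches total distance divided by total time, which is why it is the right average for rates measured over equal work.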
Little’s Law
Key Relationship between latency and bandwidth:
Average number in system = arrival rate * mean holding time
Possibly the most useful equation I know
• Useful in design of computers, software, industrial processes, etc.
Example:
• How big of a wine cellar should we build?

System Balance
each system component produces & consumes data
• make sure data supply and demand is balanced
• X demand >= X supply ⇒ computation is “X-bound”
  • e.g., memory bound, CPU-bound, I/O-bound
• goal: be bound everywhere at once (why?)
• X can be bandwidth or latency
  • X is bandwidth ⇒ buy more bandwidth
  • X is latency ⇒ much tougher problem

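Little’s Law is a single multiplication, so a small sketch is enough to show how it connects bandwidth (arrival rate) and latency (holding time). The slides do not give numbers for the wine cellar, so every figure below is a hypothetical assumption, as is the function name `littles_law_occupancy`.

```python
# Sketch of Little's Law; all numbers below are hypothetical.

def littles_law_occupancy(arrival_rate: float, holding_time: float) -> float:
    """Average number in system = arrival rate x mean holding time.

    The two arguments must use the same time unit (per week and weeks, etc.).
    """
    return arrival_rate * holding_time


if __name__ == "__main__":
    # Wine cellar: assume 2 bottles bought per week, each aged 5 years.
    bottles = littles_law_occupancy(arrival_rate=2, holding_time=5 * 52)
    print(f"cellar size: {bottles:.0f} bottles")

    # Memory system: assume one request issued every 2 ns (0.5 per ns)
    # and a 100 ns miss latency.
    in_flight = littles_law_occupancy(arrival_rate=0.5, holding_time=100)
    print(f"outstanding requests to sustain that bandwidth: {in_flight:.0f}")
```

Read the other way around, a memory system that can track only a few outstanding requests cannot sustain high bandwidth at long latency, which is one way to see why latency-bound problems are the tougher case.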
Performance in the Real World
A paper comparing performance of RISC vs. CISC and trying to show that RISC is not obviously better
• “Instruction Sets and Beyond: Computers, Complexity, and Controversy” by Colwell et al., IEEE Computer, 1985.

Roadmap for Rest of Semester
Primary topics for rest of course
• Pipelined processors
• Multiple-issue (superscalar), in-order processors
• Hardware managed out-of-order instruction execution
• Static (compiler) instruction scheduling, VLIW, EPIC
• Advanced cache/memory issues
• Multithreaded processors
• Intro to multicore chips and multi-chip multiprocessors
Advanced topics
• Power-efficiency, fault tolerance, security, virtual machines, grid processors, nanocomputing