Performance and Processor Design
Outline
14.1 Introduction
14.2 Important Trends Affecting Performance Issues
14.3 Why Performance Monitoring and Evaluation are Needed
14.4 Performance Measures
14.5 Performance Evaluation Techniques
14.5.1 Tracing and Profiling
14.5.2 Timings and Microbenchmarks
14.5.3 Application-Specific Evaluation
14.5.4 Analytic Models
14.5.5 Benchmarks
14.5.6 Synthetic Programs
14.5.7 Simulation
14.5.8 Performance Monitoring
14.6 Bottlenecks and Saturation
14.7 Feedback Loops
14.7.1 Negative Feedback
14.7.2 Positive Feedback
• Trace
– Record of system activity, typically a log of user or application
requests to the operating system
– Characterizes a system’s execution environment
– Can be manipulated to test “what if” scenarios
– Standard traces
• Can be used to compare systems that execute in a similar
environment
• Standard traces are difficult to obtain because
– Traces proprietary to installation where recorded
– Subtle differences between environments can make an impact
on performance, hindering the portability of traces
• Profile
– Record of system activity in kernel mode (e.g., process
scheduling and I/O management)
– Indicate which primitives are most heavily used and should be
optimized
• Timing
– Raw performance measure (e.g., cycles per second or
instructions per second)
– Quick comparisons between hardware
– Comparisons between members of the same family of computers
(e.g., Pentiums)
• Microbenchmark
– Measures the time required to perform a specific operating
system operation (e.g., process creation)
– Also used for hardware operations (e.g., read/write bandwidth)
– Only used for measuring small aspects of system performance,
not the system’s performance as a whole
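As a minimal illustration of a microbenchmark, the sketch below times one operating system operation, process creation via fork/exit/wait, and reports the mean cost per iteration. It assumes a POSIX system; the iteration count is arbitrary.

```python
import os
import time

def time_process_creation(iterations=200):
    """Microbenchmark: mean time to fork a child and wait for it (POSIX only)."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:           # child: exit immediately, doing no work
            os._exit(0)
        os.waitpid(pid, 0)     # parent: reap the child before the next iteration
    return (time.perf_counter() - start) / iterations

print(f"Process creation: {time_process_creation() * 1e6:.1f} microseconds per iteration")
```

As the bullets above note, a result like this characterizes only one small aspect of the system, not its performance as a whole.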
• Microbenchmark suites
– Programs that contain a number of microbenchmarks to test
different instructions and operations of a system
– lmbench
• Compare system performance between different UNIX platforms
• Several limitations
– Timings too coarse (used a software clock) for some tests
– Statistics reporting was not uniform
– hbench
• Analyzes the relationship between operating system primitives and
hardware components
• Corrected some of the limitations of lmbench
• Vector-based methodology
– System vector
• Microbenchmark test for each primitive
• Vector consists of the results of these tests
– Application vector
• Profile the system when running the target application
• Vector consists of the demand on each primitive
– Performance of the system obtained by
• For each element in the system vector, multiply it by the element in
the application vector that corresponds to the same primitive
• Sum the results
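A minimal sketch of the calculation just described: multiply each system-vector element (microbenchmark cost per primitive) by the corresponding application-vector element (demand on that primitive) and sum the products. The primitive names and numbers are hypothetical.

```python
# Hypothetical system vector: microbenchmark cost of each primitive (seconds per call).
system_vector = {
    "context_switch": 4e-6,
    "page_fault":     2e-4,
    "read_4kb":       9e-6,
}

# Hypothetical application vector: how often the profiled application used each primitive.
application_vector = {
    "context_switch": 120_000,
    "page_fault":       3_500,
    "read_4kb":       250_000,
}

# Predicted time the application spends in these primitives: elementwise product, then sum.
predicted_time = sum(system_vector[p] * application_vector[p] for p in system_vector)
print(f"Predicted time in measured primitives: {predicted_time:.3f} s")
```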
• Hybrid methodology
– Combines the vector-based methodology with a trace
– Useful for systems whose execution environment depends not only
on the target application, but also on the stream of user requests
(e.g., a Web server)
• Kernel program
– A simple algorithm (e.g., matrix inversion) or an entire program
– Executed “on paper” using manufacturer’s timings
– Useful for consumers who have not yet purchased a system
– Not commonly used anymore
• Analytic models
– Mathematical representations of computer systems
– Examples: those of queuing theory and Markov processes
– Pros
• A large body of results exist that can be applied to new models
• Can be relatively fast and accurate
– Cons
• Systems often too complex to model exactly
• Evaluator must be a skilled mathematician
– Must use other techniques to validate results
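As one concrete example of an analytic model from queuing theory, the sketch below evaluates the standard M/M/1 formulas for utilization, mean queue length, and mean response time. The arrival and service rates are made-up values for illustration.

```python
def mm1_metrics(arrival_rate: float, service_rate: float):
    """Classic M/M/1 results: rho = lambda/mu, L = rho/(1 - rho), W = 1/(mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("Model is valid only when the arrival rate is below the service rate")
    utilization = arrival_rate / service_rate
    mean_queue_length = utilization / (1.0 - utilization)
    mean_response_time = 1.0 / (service_rate - arrival_rate)
    return utilization, mean_queue_length, mean_response_time

# Illustrative rates: 80 requests/s arriving at a server that can handle 100 requests/s.
rho, L, W = mm1_metrics(80.0, 100.0)
print(f"utilization={rho:.0%}, mean queue length={L:.1f}, mean response time={W * 1000:.0f} ms")
```

As the slide notes, results such as these should still be validated with other techniques, such as measurement or simulation.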
• Synthetic programs
– Programs constructed for a specific purpose (not a real program)
• To test a specific component
• To approximate the instruction mix of an application or group of
applications
– Useful for isolating the performance of specific components, but
not the system as a whole
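A toy synthetic program in the sense described above: it does no useful work, but issues a made-up mix of arithmetic, memory, and file-system operations intended to approximate a hypothetical application's instruction mix.

```python
import random
import tempfile

def synthetic_workload(operations=100_000, seed=0):
    """Issue an artificial mix of CPU, memory, and file-system activity (proportions are invented)."""
    rng = random.Random(seed)
    data = [0] * 1024
    with tempfile.TemporaryFile() as scratch:
        for _ in range(operations):
            r = rng.random()
            if r < 0.70:                              # 70% arithmetic
                _ = (r * 12345.678) ** 0.5
            elif r < 0.95:                            # 25% memory traffic
                data[rng.randrange(len(data))] += 1
            else:                                     # 5% file-system writes
                scratch.write(b"x" * 512)

synthetic_workload()
```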
• Simulation
– Computerized model of a system
– Useful in performance projection
– Results of a simulation must be validated
– Two types
• Event driven simulators – controlled by events made to occur
according to a probability distribution
• Script-driven simulators – controlled by data carefully manipulated
to reflect the system’s anticipated environment
– Common errors
• Bugs in the simulator
• Deliberate omissions (due to complexity)
• Imprecise modeling
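The sketch below is a toy event-driven simulator of a single server: arrival and service-completion events are generated from exponential probability distributions, in the spirit of the event-driven simulators described above. All parameters are illustrative, and as noted, a real simulator's results would need validation against measurements.

```python
import random

def simulate_single_server(arrival_rate, service_rate, num_jobs, seed=0):
    """Toy event-driven simulation of one queue and one server."""
    rng = random.Random(seed)
    clock = 0.0             # simulated time of the current arrival event
    server_free_at = 0.0    # simulated time at which the server becomes idle
    total_time_in_system = 0.0
    for _ in range(num_jobs):
        clock += rng.expovariate(arrival_rate)                    # next arrival event
        start = max(clock, server_free_at)                        # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)    # completion event
        total_time_in_system += server_free_at - clock
    return total_time_in_system / num_jobs

# Illustrative run; the result should roughly match the analytic 1/(mu - lambda) = 50 ms.
print(simulate_single_server(arrival_rate=80.0, service_rate=100.0, num_jobs=100_000))
```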
• Performance monitoring
– Can locate inefficiencies in a system that administrators or
developers can remove
– Software monitors
• Windows Task Manager and Linux proc file system
• Might distort results because these programs require system
resources
– Hardware monitors
• Use counting registers
• Record events such as TLB misses, clock ticks and memory
operations
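A minimal software-monitor sketch in the spirit of the Linux proc file system mentioned above: it samples the aggregate counters in /proc/stat twice and reports CPU utilization over the interval. The /proc/stat field order (user, nice, system, idle, ...) is standard on Linux; the sampling interval is arbitrary. Like any software monitor, running it consumes some of the resources it measures.

```python
import time

def read_cpu_times():
    """Return the aggregate per-state CPU time counters from the first line of /proc/stat."""
    with open("/proc/stat") as stat:
        fields = stat.readline().split()   # "cpu  user nice system idle iowait ..."
    return [int(value) for value in fields[1:]]

def cpu_utilization(interval=1.0):
    """Sample twice and report the fraction of non-idle CPU time in between."""
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    idle, total = deltas[3], sum(deltas)   # the fourth field is idle time
    return 1.0 - idle / total if total else 0.0

print(f"CPU utilization: {cpu_utilization():.1%}")
```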
• Bottleneck
– Resource that performs its designated task slowly relative to
other resources
– Degrades system performance
– Arrival rate > service rate
– Removing a bottleneck might not increase performance if there
are other bottlenecks
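A small sketch of spotting a bottleneck from measured rates, using the rule above: a resource whose arrival rate meets or exceeds its service rate is saturated, and the most heavily demanded resource limits overall throughput. The resource names and rates are hypothetical.

```python
# Hypothetical per-resource arrival and service rates (requests per second).
resources = {
    "cpu":  {"arrival_rate": 90.0,  "service_rate": 200.0},
    "disk": {"arrival_rate": 110.0, "service_rate": 100.0},
    "nic":  {"arrival_rate": 40.0,  "service_rate": 500.0},
}

for name, rates in resources.items():
    demand = rates["arrival_rate"] / rates["service_rate"]
    note = "  <- arrival rate exceeds service rate; its queue grows without bound" if demand >= 1.0 else ""
    print(f"{name}: demand {demand:.0%}{note}")

# The resource with the highest demand is the likely bottleneck.
bottleneck = max(resources, key=lambda n: resources[n]["arrival_rate"] / resources[n]["service_rate"])
print("Likely bottleneck:", bottleneck)
```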
• Saturated resource
– Processes competing for use of the resource interfere with each
other’s execution
– Thrashing occurs when memory is saturated
• Feedback loop
– Technique in which information about the current state of the
system can affect arriving requests
– Negative feedback implies resource is saturated
– Positive feedback implies resource is underutilized
• Negative feedback
– Arrival rate at a resource might decrease as a result of negative
feedback
– Examples:
• Multiple print servers
• Print servers with long queues cause negative feedback
• Jobs go to other print servers
– Contributes to system stability
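A minimal sketch of the print-server example above: the dispatcher routes each new job to the server with the shortest queue, so a server that reports a long queue sees its arrival rate fall, which is the stabilizing effect of negative feedback. Server names and queue lengths are invented.

```python
# Hypothetical current queue lengths (jobs waiting) at three print servers.
print_queues = {"printer_a": 12, "printer_b": 3, "printer_c": 7}

def dispatch(job, queues):
    """Negative feedback: long queues discourage new arrivals, so pick the shortest queue."""
    target = min(queues, key=queues.get)
    queues[target] += 1
    print(f"{job} -> {target} (queue is now {queues[target]} jobs)")

for job in ("report.pdf", "slides.ps", "memo.txt"):
    dispatch(job, print_queues)
```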
• Positive feedback
– Arrival rate at a resource might increase as a result of positive
feedback
– Might be misleading
• E.g., low processor utilization might cause the scheduler to
admit more processes to that processor’s queue
• Low utilization might be due to thrashing
• Admitting more processes causes more thrashing and worse
performance
• Designers must be cautious of these types of situations
• RISC
– Few instructions
– Complexity in the software
– Instruction decoding is hardwired
– All instructions are a fixed size (typically, one machine word)
– All instructions require nearly the same amount of execution time
– Many general purpose registers
• RISC performance gains vs. CISC
– Better use of pipelines
– Delayed branching
– Common instructions execute fast
– Fewer memory accesses
• Modern processors
– Stray from traditional RISC and CISC designs
– Include anything that increases performance
– Common names: post-RISC, second generation RISC and fast
instruction set computing (FISC)
– Some do not agree that RISC and CISC designs have converged
• RISC convergence to CISC
– Superscalar architecture
– Out of order execution (OOO)
– Branch prediction
– On-chip floating point and vector support
– Additional, infrequently used instructions