Scientific Supercomputing Trends
CPU Performance and Technology Trends, Parallelism in Microprocessor Generations
Computer System Peak FLOP Rating History/Near Future
The Goal of Parallel Processing (Why?)
Elements of Parallel Computing
Factors Affecting Parallel System Performance
Parallel Architectures History
Parallel Programming Models
Flynn's 1972 Classification of Computer Architecture
Shared Address Space Parallel Architectures
Message-Passing Multicomputers: Message-Passing Programming Tools
Data Parallel Systems
Dataflow Architectures
Systolic Architectures: Matrix Multiplication Systolic Array Example
PCA Chapter 1.1, 1.2
Processor = programmable computing element that runs stored programs written using a pre-defined instruction set; Processing Elements = PEs = Processors.

Parallel Processing Performance and Scalability Goals:
Maximize the performance enhancement from parallelism: maximize speedup, by minimizing parallelization overheads and balancing the workload on processors (see the sketch below).
Scalability of performance to larger systems/problems.
Both goals depend on the concurrency and communication characteristics of the parallel algorithms for a given computational problem (represented by dependency graphs).
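To make the speedup goal concrete, here is a minimal sketch in C (with invented example numbers, not figures from the lecture) of the standard definitions Speedup(p) = T(1)/T(p) and Efficiency(p) = Speedup(p)/p:

    /* Hypothetical example: speedup and efficiency on p processors. */
    #include <stdio.h>

    int main(void) {
        double t1 = 100.0;  /* assumed execution time on 1 processor (seconds)  */
        double tp = 16.0;   /* assumed execution time on p processors (seconds) */
        int    p  = 8;      /* number of processors                             */

        double speedup    = t1 / tp;     /* 6.25: less than p due to overheads   */
        double efficiency = speedup / p; /* ~0.78: overheads and load imbalance  */

        printf("Speedup = %.2f, Efficiency = %.2f\n", speedup, efficiency);
        return 0;
    }

Speedup below p (efficiency below 1) is exactly what the goals above target: it reflects parallelization overheads and imbalanced workload.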
[Figure: Generic scalable parallel architecture: processing nodes, each with processor(s) P and cache ($), connected by an interconnection network]

Processing Nodes:
Each processing node contains one or more processing elements (PEs) or processor(s), memory system, plus communication assist: (Network interface and communication controller)
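A hypothetical sketch (all names invented for illustration, not an API from the text) modeling the generic processing node just described, with its PEs, memory system, and communication assist:

    /* Illustrative model of a processing node. */
    #include <stddef.h>

    struct pe {                      /* one processing element (processor) */
        int id;
    };

    struct comm_assist {             /* network interface + communication controller */
        int network_port;
    };

    struct processing_node {
        int                num_pes;       /* one or more PEs per node      */
        struct pe         *pes;
        unsigned char     *local_memory;  /* node memory system            */
        size_t             mem_bytes;
        struct comm_assist ca;            /* connects node to the network  */
    };

    int main(void) {
        struct pe pes[2] = { {0}, {1} };
        struct processing_node node = { 2, pes, 0, 0, {0} };
        (void)node;  /* a parallel machine = many such nodes + network */
        return 0;
    }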
Scientific/Engineering computing: CFD, Biology, Chemistry, Physics, ...
General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming
Mainstream multithreaded programs are similar to parallel programs
Moore's Law still alive
Technology Trends:
Number of transistors on chip growing rapidly. Clock rates expected to continue to go up, but only slowly. Actual performance returns diminishing due to deeper pipelines. Increased transistor density allows integrating multiple processor cores per chip, creating Chip Multiprocessors (CMPs), even for mainstream computing applications (desktop/laptop...).
+ multi-tasking (multiple independent programs)
Architecture Trends:
Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited. Increased clock rates require deeper pipelines with longer latencies and higher CPIs. Coarser-level parallelism (at the task or thread level, TLP), as utilized in multiprocessor systems, is the most viable approach to further improve performance. This is the main motivation for the development of multi-core chip multiprocessors (CMPs).
Economics:
The increased utilization of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost. Today's microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs. Commercial System Area Networks (SANs) offer an alternative to more costly custom networks.
Computational and memory demands exceed the capabilities of even the fastest current uniprocessor systems
GigaFLOP (GFLOP) = 10^9 FLOPS; TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS
However, even the fastest current single-microprocessor systems still cannot meet the needed computational demands, as shown above.
Currently: large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (or have already replaced?) vector supercomputers that utilize custom processors.
CPU execution time (in seconds):

T = I x CPI x C = I x (CPIexecution + M x k) x C

TC = total program execution clock cycles
f = clock rate
C = CPU clock cycle time = 1/f
I = instructions executed count
CPI = cycles per instruction
CPIexecution = CPI with ideal memory
M = memory stall cycles per memory access
k = memory accesses per instruction

Performance factors (I, CPIexecution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.
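A quick worked instance of this equation, as a C sketch with assumed example values (not numbers from the slides):

    /* T = I x (CPIexecution + M x k) x C, with invented example values. */
    #include <stdio.h>

    int main(void) {
        double I       = 1e9;      /* instructions executed                 */
        double cpi_exe = 1.2;      /* CPI with ideal memory                 */
        double M       = 1.0;      /* memory stall cycles per memory access */
        double k       = 0.3;      /* memory accesses per instruction       */
        double f       = 2e9;      /* clock rate: 2 GHz                     */
        double C       = 1.0 / f;  /* clock cycle time = 1/f = 0.5 ns       */

        double cpi = cpi_exe + M * k;  /* 1.2 + 1.0 x 0.3 = 1.5             */
        double T   = I * cpi * C;      /* 1e9 x 1.5 x 0.5e-9 = 0.75 s       */

        printf("CPI = %.2f, T = %.2f s\n", cpi, T);
        return 0;
    }

Note how memory stalls (M x k) raise the effective CPI above CPIexecution, and how a higher clock rate f shrinks C but may raise CPI through deeper pipelines, as discussed above.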
[Figure: Performance of supercomputers vs. year, 1965-1995 (log scale)]
[Figure: Microprocessor clock frequency (1-1,000 MHz, log scale) vs. year, 1987-2005: 386, 486, Pentium, Pentium II, MPC750, 604/604+, Alpha 21064A, 21066, 21164, 21164A, 21264, 21264S]

Reality Check: Clock frequency scaling is slowing down! (Did silicon finally hit the wall?)
Why?
1- Power leakage
2- Clock distribution delays
Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle)