
Introduction to Parallel Processing

Parallel Computer Architecture: Definition & Broad issues involved


A Generic Parallel Computer Architecture

The Need And Feasibility of Parallel Computing


Scientific Supercomputing Trends
CPU Performance and Technology Trends, Parallelism in Microprocessor Generations
Computer System Peak FLOP Rating History/Near Future

Why?

The Goal of Parallel Processing
Elements of Parallel Computing
Factors Affecting Parallel System Performance
Parallel Architectures History
Parallel Programming Models
Flynn's 1972 Classification of Computer Architecture

Current Trends In Parallel Architectures


Modern Parallel Architecture Layered Framework

Shared Address Space Parallel Architectures
Message-Passing Multicomputers: Message-Passing Programming Tools
Data Parallel Systems
Dataflow Architectures
Systolic Architectures: Matrix Multiplication Systolic Array Example
PCA Chapter 1.1, 1.2

EECC756 - Shaaban
#1 lec # 1 Spring 2011 3-8-2011

Parallel Computer Architecture


A parallel computer (or multiple processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks, exploiting Thread-Level Parallelism (TLP), i.e. parallel processing.
Broad issues involved:
Task = Computation done on one processor

The concurrency and communication characteristics of parallel algorithms for a given computational problem (represented by dependency graphs)

Computing Resources and Computation Allocation:


The number of processing elements (PEs), computing power of each element and amount/organization of physical memory used. What portions of the computation and data are allocated or mapped to each PE.

Data access, Communication and Synchronization


How the processing elements cooperate and communicate. How data is shared/transmitted between processors. Abstractions and primitives for cooperation/communication and synchronization. The characteristics and performance of parallel system network (System interconnects).

Parallel Processing Performance and Scalability Goals:
1- Maximize the performance enhancement of parallelism: maximize speedup, by minimizing parallelization overheads and balancing the workload on processors.
2- Scalability of performance to larger systems/problems.

Processor = programmable computing element that runs stored programs written using a pre-defined instruction set. Processing Elements = PEs = Processors.

EECC756 - Shaaban
#2 lec # 1 Spring 2011 3-8-2011

A Generic Parallel Computer Architecture


Figure: A generic parallel computer architecture: processing nodes (each with one or more processors P, cache $, and memory Mem, plus a communication assist CA, i.e. the network interface, custom or industry standard) connected by a parallel machine network (custom or industry standard), with the operating system and parallel programming environments layered on top.


One or more processing elements or processors per node: custom or commercial microprocessors. Single or multiple processors per chip (2-8 cores per chip). Homogeneous or heterogeneous.

Processing Nodes:

Each processing node contains one or more processing elements (PEs) or processor(s), memory system, plus communication assist: (Network interface and communication controller)

Parallel machine network (System Interconnects).


The function of a parallel machine network is to efficiently (i.e. at reduced communication cost) transfer information (data, results ..) from a source node to a destination node, as needed to allow cooperation among parallel processing nodes to solve large computational problems divided into a number of parallel computational tasks.
Parallel Computer = Multiple Processor System

EECC756 - Shaaban
#3 lec # 1 Spring 2011 3-8-2011

The Need And Feasibility of Parallel Computing


Application demands: More computing cycles/memory needed
Driving Force

Scientific/Engineering computing: CFD, Biology, Chemistry, Physics, ... General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming. Mainstream multithreaded programs are similar to parallel programs.
Moore's Law still alive

Technology Trends:

Number of transistors on chip growing rapidly. Clock rates expected to continue to go up, but only slowly. Actual performance returns are diminishing due to deeper pipelines. Increased transistor density allows integrating multiple processor cores per chip, creating Chip-Multiprocessors (CMPs), even for mainstream computing applications (desktop/laptop ..).
+ multi-tasking (multiple independent programs)

Architecture Trends:

Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited. Increased clock rates require deeper pipelines with longer latencies and higher CPIs. Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor systems, is the most viable approach to further improve performance; this is the main motivation for the development of multi-core chip-multiprocessors (CMPs).

Economics:


The increased utilization of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost. Today's microprocessors offer high performance and have multiprocessor support, eliminating the need for designing expensive custom PEs. Commercial System Area Networks (SANs) offer an alternative to custom, more costly networks.

EECC756 - Shaaban
#4 lec # 1 Spring 2011 3-8-2011

Why is Parallel Processing Needed?

Challenging Applications in Applied Science/Engineering



Traditional Driving Force For HPC/Parallel Processing Astrophysics Atmospheric and Ocean Modeling Such applications have very high Bioinformatics 1- computational and 2- memory Biomolecular simulation: Protein folding requirements that cannot be met Computational Chemistry with single-processor architectures. Computational Fluid Dynamics (CFD) Many applications contain a large degree of computational parallelism Computational Physics Computer vision and image understanding Data Mining and Data-intensive Computing Engineering analysis (CAD/CAM) Global climate modeling and forecasting Material Sciences Military applications Driving force for High Performance Computing (HPC) Quantum chemistry and multiple processor system development VLSI design .

EECC756 - Shaaban

#5 lec # 1 Spring 2011 3-8-2011

Why is Parallel Processing Needed?

Scientific Computing Demands


Driving force for HPC and multiple processor system development
(Memory Requirement)

Computational and memory demands exceed the capabilities of even the fastest current uniprocessor systems

3-5 GFLOPS for uniprocessor

GFLOP = 10^9 FLOPS; TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS

EECC756 - Shaaban
#6 lec # 1 Spring 2011 3-8-2011

Scientific Supercomputing Trends


Proving ground and driver for innovative architecture and advanced high performance computing (HPC) techniques: Market is much smaller relative to commercial (desktop/server) segment. Dominated by costly vector machines starting in the 1970s through the 1980s. Microprocessors have made huge gains in the performance needed for such applications:
High clock rates. (Bad: higher CPI?)
Multiple pipelined floating point units.
Instruction-level parallelism.
Effective use of caches.
Multiple processor cores/chip (2 cores 2002-2005, 4 end of 2006, 6-12 cores 2011)

Enabled with high transistor density/chip

However even the fastest current single microprocessor systems still cannot meet the needed computational demands. As shown in last slide
Currently: Large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (replaced?) vector supercomputers that utilize custom processors.

EECC756 - Shaaban
#7 lec # 1 Spring 2011 3-8-2011

Uniprocessor Performance Evaluation


CPU Performance benchmarking is heavily program-mix dependent. Ideal performance requires a perfect machine/program match. Performance measures:

Total CPU time = T = TC / f = TC x C = I x CPI x C = I x (CPI_execution + M x k) x C

(in seconds)

TC = Total program execution clock cycles
f = clock rate
C = CPU clock cycle time = 1/f
I = Instructions executed count
CPI = Cycles per instruction
CPI_execution = CPI with ideal memory
M = Memory stall cycles per memory access
k = Memory accesses per instruction

MIPS Rating = I / (T x 10^6) = f / (CPI x 10^6) = f x I / (TC x 10^6)


(in million instructions per second)

Throughput Rate: Wp = 1/T = f / (I x CPI) = (MIPS) x 10^6 / I


(in programs per second)

Performance factors (I, CPI_execution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.

T = I x CPI x C
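As a concrete illustration, the following minimal C sketch (with made-up workload numbers, not values from the lecture) plugs the definitions above into the CPU-time, MIPS-rating, and throughput formulas:

#include <stdio.h>

int main(void) {
    /* Hypothetical workload parameters, for illustration only */
    double I     = 2.0e9;    /* instructions executed                   */
    double CPIex = 1.2;      /* CPI assuming an ideal memory system     */
    double M     = 0.05;     /* memory stall cycles per memory access   */
    double k     = 0.3;      /* memory accesses per instruction         */
    double f     = 2.0e9;    /* clock rate in Hz                        */
    double C     = 1.0 / f;  /* clock cycle time in seconds             */

    double CPI  = CPIex + M * k;    /* effective CPI                    */
    double T    = I * CPI * C;      /* total CPU time in seconds        */
    double MIPS = I / (T * 1e6);    /* = f / (CPI x 10^6)               */
    double Wp   = 1.0 / T;          /* throughput in programs/second    */

    printf("CPI = %.3f, T = %.3f s, MIPS = %.1f, Wp = %.3f programs/s\n",
           CPI, T, MIPS, Wp);
    return 0;
}

Varying CPI_execution, M, or k in this sketch shows directly how memory stalls inflate the effective CPI and the total CPU time.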

EECC756 - Shaaban
#8 lec # 1 Spring 2011 3-8-2011

Single CPU Performance Trends


The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance. This is even more true with the development of cost-effective multi-core microprocessors that support TLP at the chip level.
Figure: Performance (log scale, 0.1 to 100) vs. year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors, with custom-processor and commodity-processor categories marked.

EECC756 - Shaaban
#9 lec # 1 Spring 2011 3-8-2011

Microprocessor Frequency Trend


Figure: Microprocessor clock frequency (MHz, log scale) and gate delays per clock vs. year, 1987-2005, for Intel, IBM PowerPC, and DEC processors (386, 486, Pentium, Pentium Pro, Pentium II, 601/603/604, MPC750, 21064A, 21066, 21164, 21164A, 21264, 21264S). Historically, processor frequency scaled by 2X per generation while the number of gate delays per clock was reduced by about 25%, leading to deeper pipelines with more stages (e.g. the Intel Pentium 4E has 30+ pipeline stages). This is no longer the case.

Reality Check: Clock frequency scaling is slowing down! (Did silicon finally hit the wall?) Why? 1- Power leakage 2- Clock distribution delays. Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).

Solution: Exploit TLP at the chip level, Chip-multiprocessor (CMPs)

T = I x CPI x C

EECC756 - Shaaban
#10 lec # 1 Spring 2011 3-8-2011

Transistor Count Growth Rate


Enabling Technology for Chip-Level Thread-Level Parallelism (TLP)

~ 800,000x transistor density increase in the last 38 years

Currently > 2 Billion


Moore's Law: 2X transistors/chip every 1.5 years (circa 1970) still holds

Enables Thread-Level Parallelism (TLP) at the chip level: Chip-Multiprocessors (CMPs) + Simultaneous Multithreaded (SMT) processors.
(Figure data point: Intel 4004, 2300 transistors)

Solution

One billion transistors/chip was reached in 2005, two billion in 2008-9, now ~ three billion. Transistor count grows faster than clock rate: currently ~ 40% per year. Single-threaded uniprocessors do not efficiently utilize the increased transistor count.

Limited ILP, increased size of cache

EECC756 - Shaaban
#11 lec # 1 Spring 2011 3-8-2011

Parallelism in Microprocessor VLSI Generations


Figure: Transistor count per microprocessor (1,000 to 100,000,000, log scale) vs. year, 1970-2005, showing the parallelism exploited in each microprocessor generation:
Bit-level parallelism (not pipelined, CPI >> 1; multiple micro-operations per cycle, multi-cycle non-pipelined): i4004, i8008, i8080, i8086, i80286. Single thread per chip.
Instruction-level parallelism (ILP): single-issue pipelined (CPI = 1), then superscalar/VLIW (CPI < 1): i80386, R2000, R3000, R10000, Pentium.
Thread-level parallelism (TLP?): Simultaneous Multithreading (SMT, e.g. Intel's Hyper-Threading) and Chip-Multiprocessors (CMPs), e.g. IBM Power 4/5, Intel Pentium D, Core Duo, AMD Athlon 64 X2, Dual Core Opteron, Sun UltraSparc T1 (Niagara).

Chip-level TLP/parallel processing is even more important due to the slowing clock rate increase.
ILP = Instruction-Level Parallelism, TLP = Thread-Level Parallelism

Improving microprocessor generation performance by exploiting more levels of parallelism

EECC756 - Shaaban
#12 lec # 1 Spring 2011 3-8-2011

Current Dual-Core Chip-Multiprocessor Architectures


Three dual-core chip-multiprocessor organizations:

1- Single die, shared L2 cache (shared L2 or L3): on-chip crossbar/switch; cores communicate using the shared cache (lowest communication latency). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe (Core 2), i7, Sun UltraSparc T1 (Niagara), AMD Phenom.

2- Single die, private caches, shared system interface: cores communicate using on-chip interconnects (shared system interface). Examples: AMD Dual Core Opteron, Athlon 64 X2, Intel Itanium2 (Montecito).

3- Two dice, shared package, private caches, private system interface: cores communicate over the external Front Side Bus (FSB) (highest communication latency). Examples: Intel Pentium D, Intel quad core (two dual-core chips).

Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615

EECC756 - Shaaban
#13 lec # 1 Spring 2011 3-8-2011

Microprocessors Vs. Vector Processors

Uniprocessor Performance: LINPACK


Figure: LINPACK uniprocessor performance (MFLOPS, log scale from 1 to 10,000; 1 GFLOP = 10^9 FLOPS) vs. year, 1975-2000, for matrix sizes n = 100 and n = 1,000, comparing CRAY vector processors (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) with microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, DEC Alpha, HP 9000/750, DEC Alpha AXP, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200).

Now about 5-20 GFLOPS per microprocessor core.

EECC756 - Shaaban
#14 lec # 1 Spring 2011 3-8-2011

Parallel Performance: LINPACK


Current Top LINPACK Performance (since ~ Nov. 2010): now about 2,566,000 GFlop/s = 2,566 TeraFlops = 2.566 PetaFlops, achieved by Tianhe-1A (@ National Supercomputing Center in Tianjin, China) with 186,368 processor cores: 14,336 Intel Xeon X5670 6-core processors @ 2.9 GHz + 7,168 Nvidia Tesla M2050 (8-core?) GPUs.

Figure: Parallel LINPACK performance (GFLOPS, log scale from 0.1 to 10,000; 1 TeraFLOP = 10^12 FLOPS = 1000 GFLOPS) vs. year, 1985-1996, showing MPP peak and CRAY peak systems: Xmp/416(4), Ymp/832(8), C90(16), T932(32), iPSC/860, nCUBE/2(1024), CM-2, CM-200, CM-5, Delta, Paragon XP/S, Paragon XP/S MP (1024), Paragon XP/S MP (6768), T3D, ASCI Red.

Current ranking of top 500 parallel supercomputers in the world is found at: www.top500.org

EECC756 - Shaaban
#15 lec # 1 Spring 2011 3-8-2011

Why is Parallel Processing Needed?

LINPACK Performance Trends


Figure (left): Uniprocessor LINPACK performance (MFLOPS, 1 to 10,000; 1 GFLOP = 10^9 FLOPS) vs. year, 1975-2000, for CRAY vector processors and microprocessors at n = 100 and n = 1,000 (same data as the uniprocessor figure two slides back).
Figure (right): Parallel system LINPACK performance (GFLOPS, 0.1 to 10,000; 1 TeraFLOP = 10^12 FLOPS = 1000 GFLOPS) vs. year, 1985-1996, for MPP peak and CRAY peak systems (same data as the previous slide).

Uniprocessor Performance vs. Parallel System Performance

EECC756 - Shaaban
#16 lec # 1 Spring 2011 3-8-2011

Computer System Peak FLOP Rating History


Current Top Peak FP Performance (since ~ Nov. 2010): now about 4,701,000 GFlop/s = 4,701 TeraFlops = 4.701 PetaFlops, achieved by Tianhe-1A (@ National Supercomputing Center in Tianjin, China) with 186,368 processor cores: 14,336 Intel Xeon X5670 6-core processors @ 2.9 GHz + 7,168 Nvidia Tesla M2050 (8-core?) GPUs.

Figure: Computer system peak FLOP rating history (TeraFLOP = 10^12 FLOPS = 1000 GFLOPS; PetaFLOP = 10^15 FLOPS = 1000 TeraFLOPS), with Tianhe-1A currently at the top.

Current ranking of top 500 parallel supercomputers in the world is found at: www.top500.org

EECC756 - Shaaban
#17 lec # 1 Spring 2011 3-8-2011

November 2005

Source (and for current list): www.top500.org

EECC756 - Shaaban
#18 lec # 1 Spring 2011 3-8-2011

TOP500 Supercomputers

32nd List (November 2008): The Top 10



Source (and for current list): www.top500.org

EECC756 - Shaaban
#19 lec # 1 Spring 2011 3-8-2011

TOP500 Supercomputers

34th List (November 2009): The Top 10



Source (and for current list): www.top500.org

EECC756 - Shaaban
#20 lec # 1 Spring 2011 3-8-2011

TOP500 Supercomputers

36th List (November 2010): The Top 10



Current List

Source (and for current list): www.top500.org

EECC756 - Shaaban
#21 lec # 1 Spring 2011 3-8-2011

The Goal of Parallel Processing


Goal of applications in using parallel machines: Maximize Speedup over single processor performance
Parallel Speedup, Speedup_p:

Speedup (p processors) = Performance (p processors) / Performance (1 processor)

For a fixed problem size (input data set), performance = 1/time, so:

Fixed problem size parallel speedup:
Speedup_fixed problem (p processors) = Time (1 processor) / Time (p processors)

Ideal speedup = number of processors = p
Very hard to achieve, due to parallelization overheads: communication cost, dependencies ... + load imbalance

EECC756 - Shaaban
#22 lec # 1 Spring 2011 3-8-2011

The Goal of Parallel Processing


Parallel processing goal is to maximize parallel speedup (fixed problem size parallel speedup, in terms of time):

Speedup = Time(1) / Time(p) < Sequential Work on one processor / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

where the Max is taken over all processors (i.e. the processor with the maximum execution time), and Synch Wait Time, Comm Cost, and Extra Work are the parallelization overheads.

Ideal Speedup = p = number of processors


Very hard to achieve: Implies no parallelization overheads and perfect load balance among all processors.

Maximize parallel speedup by:
1- Balancing computations (and overheads) on processors: every processor does the same amount of work and incurs the same amount of overheads.
2- Minimizing communication cost and other overheads associated with each step of parallel program creation and execution.

Performance Scalability:

Achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased.
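As a concrete illustration of these definitions, here is a minimal C sketch (with hypothetical measured times, not data from the lecture) that computes the fixed-problem-size speedup from one-processor and p-processor execution times and compares it with the ideal speedup of p:

#include <stdio.h>

int main(void) {
    /* Hypothetical measured execution times in seconds, for illustration only */
    double time_1   = 100.0;                      /* Time(1 processor)          */
    int    procs[]  = {1, 2, 4, 8};
    double time_p[] = {100.0, 52.0, 27.5, 15.0};  /* Time(p) for p = 1, 2, 4, 8 */

    for (int i = 0; i < 4; i++) {
        double speedup = time_1 / time_p[i];      /* Speedup = Time(1)/Time(p)  */
        printf("p = %d: speedup = %.2f (ideal = %d)\n",
               procs[i], speedup, procs[i]);
    }
    return 0;
}

The gap between the printed speedup and the ideal value of p is exactly the effect of the parallelization overheads and load imbalance discussed above.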

EECC756 - Shaaban
#23 lec # 1 Spring 2011 3-8-2011

Elements of Parallel Computing


Figure: Elements of parallel computing (HPC driving force): Computing Problems drive Parallel Algorithms and Data Structures (dependency analysis, task dependency graphs); Mapping assigns parallel computations (tasks) to processors of the Parallel Hardware Architecture (processing nodes/network) with Operating System support; Programming (applications software, high-level languages) and Binding (compile, load) produce the executable code; Performance Evaluation (e.g. parallel speedup) assesses the result.

EECC756 - Shaaban
#24 lec # 1 Spring 2011 3-8-2011

Elements of Parallel Computing


1 Computing Problems: (Driving Force)
Numerical Computing: Science and engineering numerical problems demand intensive integer and floating-point computations.
Logical Reasoning: Artificial intelligence (AI) demands logic inferences, symbolic manipulations, and large space searches.

2 Parallel Algorithms and Data Structures:
Special algorithms and data structures are needed to specify the computations and communication present in computing problems (from dependency analysis).
Most numerical algorithms are deterministic, using regular data structures.
Symbolic processing may use heuristics or non-deterministic searches.
Parallel algorithm development requires interdisciplinary interaction.

EECC756 - Shaaban
#25 lec # 1 Spring 2011 3-8-2011

Elements of Parallel Computing


3 Hardware Resources:
A- Computing power: Processors, memory, and peripheral devices (processing nodes) form the hardware core of a computer system.
B- Communication/connectivity: Processor connectivity (system interconnects, network) and memory organization influence the system architecture.

4 Operating Systems:
Manages the allocation of resources to running processes.
Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.

Parallelism exploitation possible at: 1- algorithm design, 2- program writing, 3- compilation, and 4- run time.
EECC756 - Shaaban
#26 lec # 1 Spring 2011 3-8-2011

Elements of Parallel Computing


5 System Software Support
Needed for the development of efficient programs in high-level languages (HLLs): assemblers, loaders, portable parallel programming languages/libraries, user interfaces and tools.

6 Compiler Support

Approaches to parallel programming

(a) Implicit Parallelism Approach


Parallelizing compiler: can automatically detect parallelism in sequential source code and transform it into parallel constructs/code. Source code is written in conventional sequential languages.

(b) Explicit Parallelism Approach:


Programmer explicitly specifies parallelism using: a sequential compiler (conventional sequential HLL) and a low-level library of the target parallel computer, or a concurrent (parallel) HLL.
Concurrency-preserving compiler: the compiler in this case preserves the parallelism explicitly specified by the programmer. It may perform some program flow analysis, dependence checking, and limited optimizations for parallelism detection.
Illustrated next

EECC756 - Shaaban
#27 lec # 1 Spring 2011 3-8-2011

Approaches to Parallel Programming


(a) Implicit Parallelism: The programmer writes source code in sequential languages (C, C++, FORTRAN, LISP ..); a parallelizing compiler automatically detects parallelism in the sequential source code and transforms it into parallel constructs/code, producing parallel object code that is executed by the runtime system.

(b) Explicit Parallelism: The programmer explicitly specifies parallelism using parallel constructs, writing source code in concurrent dialects of C, C++, FORTRAN, LISP ..; a concurrency-preserving compiler produces concurrent object code that is executed by the runtime system.

EECC756 - Shaaban
#28 lec # 1 Spring 2011 3-8-2011

Factors Affecting Parallel System Performance


Parallel Algorithm Related:
i.e Inherent Parallelism

Available concurrency and profile, grain size, uniformity, patterns.


Dependencies between computations represented by dependency graph

Type of parallelism present: Functional and/or data parallelism. Required communication/synchronization, uniformity and patterns. Data size requirements. Communication to computation ratio (C-to-C ratio, lower is better).

Parallel program Related:


Programming model used. Resulting data/code memory requirements, locality and working set characteristics. Parallel task grain size. Assignment (mapping) of tasks to processors: Dynamic or static. Cost of communication/synchronization primitives.

Hardware/Architecture related:
Total CPU computational power available (+ number of processors). Types of computation modes supported (hardware parallelism). Shared address space vs. message passing. Communication network characteristics (topology, bandwidth, latency). Memory hierarchy properties.

Concurrency = Parallelism

EECC756 - Shaaban
#29 lec # 1 Spring 2011 3-8-2011

Sequential Execution on one processor


Figure: A task dependency graph of seven tasks A-G (Task = computation run on one processor), the sequential execution on one processor (time axis 0-21), and a possible parallel execution schedule on two processors P0, P1 with communication (Comm) and idle time shown.

Assume the computation time for each task A-G = 3, the communication time between parallel tasks = 1, and that communication can overlap with computation.
Sequential time on one processor: T1 = 21. Parallel time on two processors: T2 = 16.
Speedup on two processors = T1/T2 = 21/16 = 1.3
What would the speedup be with 3 processors? 4 processors? 5?

A simple parallel execution example

EECC756 - Shaaban
#30 lec # 1 Spring 2011 3-8-2011

Evolution of Computer Architecture

Figure: Scalar sequential processing (non-pipelined, then limited pipelining) evolves via lookahead into functional parallelism (I/E overlap, multiple functional units) and pipelining (pipelined, single or multiple issue); vector machines appear as implicit vector (memory-to-memory) and explicit vector (register-to-register) vector/data parallel designs; parallel machines split into SIMD (processor array, associative processor; data parallel) and MIMD (shared-memory multiprocessor and message-passing multicomputer), leading to Massively Parallel Processors (MPPs) and computer clusters.

I/E: Instruction Fetch and Execute
SIMD: Single Instruction stream over Multiple Data streams
MIMD: Multiple Instruction streams over Multiple Data streams

EECC756 - Shaaban

Parallel Architectures History


Historically, parallel architectures were tied to parallel programming models: divergent architectures, with no predictable pattern of growth.

Figure: Application software and system software layered over divergent architecture families: Systolic Arrays, Dataflow, SIMD / Data Parallel Architectures, Message Passing, and Shared Memory.

More on this next lecture

EECC756 - Shaaban
#32 lec # 1 Spring 2011 3-8-2011

Parallel Programming Models


Programming methodology used in coding parallel applications. Specifies: 1- communication and 2- synchronization.
Examples:
However, a good way to utilize multi-core processors for the masses!

Multiprogramming: or Multi-tasking (not true parallel processing!)


No communication or synchronization at program level. A number of independent programs running on different processors in the system.

Shared memory address space (SAS):


Parallel program threads or tasks communicate implicitly using a shared memory address space (shared data in memory).

Message passing:
Explicit point to point communication (via send/receive pairs) is used between parallel program tasks using messages.

Data parallel:
More regimented: global actions on data (i.e. the same operations over all elements of an array or vector). Can be implemented with a shared address space or with message passing.

EECC756 - Shaaban
#33 lec # 1 Spring 2011 3-8-2011

Flynn's 1972 Classification of Computer Architecture


(Taxonomy)
Instruction Stream = Thread of Control or Hardware Context

(a) Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines or uniprocessors.
(b) Single Instruction stream over Multiple Data streams (SIMD): Vector computers, arrays of synchronized processing elements (data parallel systems).
(c) Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution.
(d) Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers: shared memory multiprocessors (tightly coupled processors) and multicomputers with unshared, distributed memory that use message passing instead (e.g. clusters; loosely coupled processors).

Classified according to the number of instruction streams (threads) and the number of data streams in the architecture.

EECC756 - Shaaban
#34 lec # 1 Spring 2011 3-8-2011

Flynn's Classification of Computer Architecture


(Taxonomy) CU = Control Unit, PE = Processing Element, M = Memory

Figure: Block diagrams of the four classes:
SISD (Single Instruction stream over a Single Data stream): conventional sequential machines or uniprocessors.
SIMD (Single Instruction stream over Multiple Data streams): vector computers; shown here, an array of synchronized processing elements.
MISD (Multiple Instruction streams and a Single Data stream): systolic arrays for pipelined execution.
MIMD (Multiple Instruction streams over Multiple Data streams): parallel computers or multiprocessor systems; a distributed memory multiprocessor system is shown.

Classified according to the number of instruction streams (threads) and the number of data streams in the architecture.

EECC756 - Shaaban
#35 lec # 1 Spring 2011 3-8-2011

Current Trends In Parallel Architectures


The extension of computer architecture to support communication and cooperation:
OLD: Instruction Set Architecture (ISA) (conventional or sequential)
NEW: Communication Architecture

Defines:
1- Critical abstractions, boundaries, and primitives (interfaces).
2- Organizational structures that implement the interfaces (hardware or software), i.e. the implementation of interfaces.

Compilers, libraries and OS are important bridges today


i.e. software abstraction layers

More on this next lecture

EECC756 - Shaaban
#36 lec # 1 Spring 2011 3-8-2011

Modern Parallel Architecture Layered Framework


Figure: Layered framework. Parallel applications (CAD, database, scientific modeling, multiprogramming) sit on top of programming models (shared address, message passing, data parallel); below them, the communication abstraction marks the user/system boundary (user space: compilation or library; system space: operating systems support); below the hardware/software boundary (ISA) are the communication hardware and the physical communication medium.

Hardware: Processing Nodes & Interconnects. More on this next lecture.

EECC756 - Shaaban
#37 lec # 1 Spring 2011 3-8-2011

Shared Address Space (SAS) Parallel Architectures


(in shared address space)

Any processor can directly reference any memory location


Communication occurs implicitly as result of loads and stores

Convenient:

Communication is implicit via loads/stores

Location transparency. Similar programming model to time-sharing on uniprocessors, except that processes run on different processors. Good throughput on multiprogrammed workloads (i.e. multi-tasking).

Naturally provided on a wide range of platforms


Wide range of scale: few to hundreds of processors

Popularly known as shared memory machines or model


Ambiguous: Memory may be physically distributed among processing nodes. i.e Distributed shared memory multiprocessors
Sometimes called Tightly-Coupled Parallel Computers

EECC756 - Shaaban
#38 lec # 1 Spring 2011 3-8-2011

Shared Address Space (SAS) Parallel Programming Model


Process: virtual address space plus one or more threads of control. Portions of the address spaces of processes are shared.

Figure: Virtual address spaces for a collection of processes (P0 .. Pn) communicating via shared addresses: each process has a private portion of its address space (P0 private .. Pn private) and a shared portion mapped to common physical addresses in the machine physical address space, so loads and stores to the shared portion communicate between processes.

In SAS: Communication is implicit via loads/stores. Ordering/synchronization is explicit, using synchronization primitives. Writes to a shared address are visible to the other threads (in other processes too).

Natural extension of the uniprocessor model: conventional memory operations are used for communication (thus communication is implicit via loads/stores), while special atomic operations are needed for synchronization, i.e. for event ordering and mutual exclusion, using locks, semaphores, etc. (thus synchronization is explicit). The OS uses shared memory to coordinate processes.
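As a minimal illustrative sketch of the SAS model in C with POSIX threads (an assumption of this example; the lecture does not prescribe a particular threads library): the threads communicate implicitly by storing to and loading from a shared variable, while a mutex lock provides the explicit synchronization (mutual exclusion) described above.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long shared_sum = 0;                               /* shared data: implicit communication */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* explicit synchronization primitive  */

static void *worker(void *arg) {
    long my_part = (long)arg;            /* private data of this thread          */
    pthread_mutex_lock(&lock);           /* mutual exclusion on the shared data  */
    shared_sum += my_part;               /* plain load/store to a shared address */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);  /* main thread reads the shared result */
    return 0;
}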

EECC756 - Shaaban

#39 lec # 1 Spring 2011 3-8-2011

Models of Shared-Memory Multiprocessors


1- The Uniform Memory Access (UMA) Model:
All physical memory is shared by all processors. All processors have equal access (i.e. equal memory bandwidth and access latency) to all memory addresses. Also referred to as Symmetric Memory Processors (SMPs).

2- Distributed memory or Non-Uniform Memory Access (NUMA) Model:
Shared memory is physically distributed locally among processors. Access latency to remote memory is higher.

3- The Cache-Only Memory Architecture (COMA) Model:
A special case of a NUMA machine where all distributed main memory is converted to caches. No memory hierarchy at each processor.

EECC756 - Shaaban
#40 lec # 1 Spring 2011 3-8-2011

Models of Shared-Memory Multiprocessors


Figure: Block diagrams of the three models. P: Processor, M or Mem: Memory, C: Cache, D: Cache directory.
1- Uniform Memory Access (UMA) model, or Symmetric Memory Processors (SMPs): processors and I/O controllers (I/O devices) connected to shared memory modules through an interconnect (bus, crossbar, or multistage network).
2- Distributed memory or Non-Uniform Memory Access (NUMA) model: nodes of processor + cache ($) + local memory (M) connected by an interconnect/network.
3- Cache-Only Memory Architecture (COMA): nodes of processor + cache (C) + cache directory (D) connected by a network.

EECC756 - Shaaban
#41 lec # 1 Spring 2011 3-8-2011

Uniform Memory Access (UMA) Example: Intel Pentium Pro Quad


4-way SMP, circa 1997.

Figure: Four P-Pro modules (each with a CPU, 256-KB L2 cache, interrupt controller, and bus interface) share the P-Pro front side bus (64-bit data, 36-bit address, 66 MHz), which also connects PCI bridges (PCI buses and I/O cards) and the memory controller/MIU with 1-, 2-, or 4-way interleaved DRAM.

Bus-based Symmetric Memory Processors (SMPs): a single Front Side Bus (FSB) is shared among the processors, which severely limits scalability to only ~ 2-4 processors.
All coherence and multiprocessing glue is in the processor module. Highly integrated, targeted at high volume. Computing node used in Intel's ASCI-Red MPP.

EECC756 - Shaaban
#42 lec # 1 Spring 2011 3-8-2011

Non-Uniform Memory Access (NUMA) Example: AMD 8-way Opteron Server Node
Circa 2003

Dedicated point-to-point interconnects (HyperTransport links) used to connect processors alleviating the traditional limitations of FSB-based SMP systems. Each processor has two integrated DDR memory channel controllers: memory bandwidth scales up with number of processors. NUMA architecture since a processor can access its own memory at a lower latency than access to remote memory directly connected to other processors in the system.

Total 16 processor cores when dual core Opteron processors used (32 cores with quad core processors)

EECC756 - Shaaban
#43 lec # 1 Spring 2011 3-8-2011

Distributed Shared-Memory Multiprocessor System Example: Cray T3E (circa 1995-1999), a NUMA MPP example
External I/O

MPP = Massively Parallel Processor System


Figure: Each T3E node contains a processor (P) with cache ($), memory (Mem), and a memory controller and network interface (Mem ctrl and NI, the communication assist CA), connected through X/Y/Z switches to a 3D torus point-to-point network, plus external I/O. A more recent Cray MPP example: the Cray X1E supercomputer.

Scales up to 2048 processors, DEC Alpha EV6 microprocessor (COTS). Custom 3D torus point-to-point network, 480 MB/s links. The memory controller generates communication requests for non-local references. No hardware mechanism for coherence (SGI Origin etc. provide this).
Example of Non-uniform Memory Access (NUMA)

EECC756 - Shaaban
#44 lec # 1 Spring 2011 3-8-2011

Message-Passing Multicomputers
Comprised of multiple autonomous computers (computing nodes) connected via a suitable network (industry standard System Area Network, SAN, or proprietary network). Each node consists of one or more processors, local memory, attached storage and I/O peripherals, and a Communication Assist (CA). Local memory is only accessible by local processors in a node (no shared memory among nodes). Inter-node communication is carried out explicitly by message passing through the connection network via send/receive operations.
Thus communication is explicit

Process communication achieved using a message-passing programming environment (e.g. PVM, MPI). Portable, platform-independent
Programming model more removed or abstracted from basic hardware operations

Include:
A number of commercial Massively Parallel Processor systems (MPPs). Computer clusters that utilize commodity off-the-shelf (COTS) components.

Also called Loosely-Coupled Parallel Computers

EECC756 - Shaaban
#45 lec # 1 Spring 2011 3-8-2011

Message-Passing Abstraction
Figure: Process P (the sender) executes Send (X, Q, t), transmitting data X from address X in its local process address space to recipient process Q with tag t; process Q executes Receive (Y, P, t), which matches on sender P and tag t and stores the data at address Y in Q's local process address space. The recipient blocks (waits) until the message is received (blocking receive).

Send specifies the buffer to be transmitted and the receiving process; Receive specifies the sending process and the application storage to receive into. Communication is explicit via sends/receives.
Memory-to-memory copy is possible, but the processes must be named.
Optional tag on send and matching rule on receive. The user process names local data and entities in process/tag space too.
In the simplest form, the send/receive match achieves implicit pairwise synchronization, i.e. event ordering: ordering of computations according to data dependencies. Thus synchronization is implicit.
Many possible overheads: copying, buffer management, protection ...
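A minimal MPI sketch in C of this send/receive abstraction (illustrative only; MPI is one of the message-passing tools named later in the lecture): rank 0 plays the sender P, rank 1 the recipient Q, the message tag plays the role of t, and the blocking MPI_Recv provides the pairwise synchronization described above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, X = 42, Y = 0, tag = 7;            /* tag t used for matching         */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                             /* sender process P                */
        MPI_Send(&X, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {                      /* recipient process Q             */
        /* Blocking receive: Q waits until a message from rank 0 with a matching
           tag arrives, which gives the implicit pairwise synchronization. */
        MPI_Recv(&Y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process Q received Y = %d\n", Y);
    }

    MPI_Finalize();
    return 0;
}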

EECC756 - Shaaban
#46 lec # 1 Spring 2011 3-8-2011

Message-Passing Example: Intel Paragon (circa 1993)

Figure: An Intel Paragon node is a 2-way SMP: two i860 processors (each with an L1 cache) on a 64-bit, 50 MHz memory bus with a memory controller, 4-way interleaved DRAM, and a communication assist (CA: driver, DMA, and network interface NI). Nodes connect over 8-bit, 175 MHz, bidirectional links to a 2D grid point-to-point network with a processing node attached to every switch. Shown: Sandia's Intel Paragon XP/S-based supercomputer.

EECC756 - Shaaban
#47 lec # 1 Spring 2011 3-8-2011

Message-Passing Example: IBM SP-2


MPP, circa 1994-1998.

Figure: An IBM SP-2 node is made out of an essentially complete RS6000 workstation: Power 2 CPU with L2 cache, memory bus, memory controller, and 4-way interleaved DRAM. The network interface card (NIC, containing an i860, DMA, and the NI, forming the communication assist CA) sits on the MicroChannel I/O bus, so its bandwidth is limited by the I/O bus. Nodes connect through a general (multi-stage) interconnection network formed from 8-port switches.

EECC756 - Shaaban
#48 lec # 1 Spring 2011 3-8-2011

MPP = Massively Parallel Processor System


Message-Passing MPP Example:

IBM Blue Gene/L


System Location: Lawrence Livermore National Laboratory Networks: 3D Torus point-to-point network Global tree 3D point-to-point network (both proprietary)
Node Board (32 chips, 4x4x2) 16 Compute Cards Compute Card (2 chips, 2x1x1) Chip (2 processors) 90/180 GF/s 8 GB DDR 2.8/5.6 GF/s 4 MB 5.6/11.2 GF/s 0.5 GB DDR
2.8 Gflops peak per processor core

Circa 2005

(2 processors/chip) (2 chips/compute card) (16 compute cards/node board) (32 node boards/tower) (64 tower) = 128k = 131072 (0.7 GHz PowerPC 440) processors (64k nodes)
System (64 cabinets, 64x32x32)

Cabinet (32 Node boards, 8x8x16)

180/360 TF/s 16 TB DDR 2.9/5.7 TF/s 256 GB DDR

Design Goals: - High computational power efficiency - High computational density per volume

LINPACK Performance: 280,600 GFLOPS = 280.6 TeraFLOPS = 0.2806 Peta FLOP Top Peak FP Performance: Now about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 Peta FLOP

EECC756 - Shaaban
#49 lec # 1 Spring 2011 3-8-2011

Message-Passing Programming Tools


Message-passing programming environments include:

Message Passing Interface (MPI):
Provides a standard for writing concurrent message-passing programs.
MPI implementations include parallel libraries used by existing programming languages (C, C++).

Parallel Virtual Machine (PVM):
Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
PVM support software executes on each machine in a user-configurable pool, and provides a computational environment of concurrent applications.
User programs written for example in C, Fortran, or Java are provided access to PVM through calls to PVM library routines.

Both MPI and PVM are examples of the explicit parallelism approach to parallel programming.
Both MPI and PVM are portable (platform-independent) and allow the user to explicitly specify parallelism

EECC756 - Shaaban
#50 lec # 1 Spring 2011 3-8-2011

Data Parallel Systems (SIMD in Flynn's taxonomy)


Programming model (data parallel):
Similar operations are performed in parallel on each element of a data structure. Logically, a single thread of control performs sequential or parallel steps. Conceptually, a processor is associated with each data element.

Architectural model:
Array of many simple processors, each with little memory; the processors do not sequence through instructions themselves but are attached to a control processor that issues the instructions. Specialized and general communication, global synchronization.

Figure: A control processor driving a grid of processing elements (PE = Processing Element).

Example machines: Thinking Machines CM-1, CM-2 (and CM-5); Maspar MP-1 and MP-2. Other data parallel architectures: vector machines.

All PEs are synchronized (same instruction or operation in a given cycle).
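As an illustrative sketch of the data parallel style in C (using OpenMP as a stand-in; OpenMP is an assumption of this example and is not mentioned in the lecture), the same operation is applied to every element of an array, with the iterations divided among threads and an implicit global synchronization at the end of the loop:

#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Data parallel step: the same operation is applied to all elements.
       Iterations are distributed across threads, and there is an implicit
       barrier (global synchronization) at the end of the parallel loop. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}

The same element-wise operation could equally be expressed in a message-passing style by giving each process a block of the arrays, as the earlier programming-models slide notes.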

EECC756 - Shaaban
#51 lec # 1 Spring 2011 3-8-2011

Dataflow Architectures
Represent computation as a graph of essential data dependencies (a dependency graph for the entire computation/program): a non-Von Neumann architecture (not program-counter based).
A logical processor sits at each node, activated by the availability of operands. Messages (tokens, i.e. data or results) carrying the tag of the next instruction are sent to the next processor; the tag is compared with others in a matching store, and a match fires execution.

Example dataflow graph: a = (b + 1) x (b - c); d = c x e; f = a x d.

Figure: One node of a dataflow machine: token store, then waiting/matching, then instruction fetch (from the program store), then execute, then form token, then the network, with a token queue and token matching/token distribution connecting nodes through the network.

Research dataflow machine prototypes include: the MIT Tagged-Token Architecture and the Manchester Dataflow Machine.

The Tomasulo approach of dynamic instruction execution utilizes a dataflow-driven execution engine: the data dependency graph for a small window of instructions is constructed dynamically when instructions are issued in order of the program, and the execution of an issued instruction is triggered by the availability of its operands (the data it needs) over the CDB.

Tokens = copies of computation results
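A minimal C sketch (purely illustrative, not from the lecture) of data-driven execution for the example graph above: each operation fires only when all of its operands are available, regardless of program order.

#include <stdio.h>
#include <stdbool.h>

/* Evaluate the dataflow graph  a = (b+1) x (b-c);  d = c x e;  f = a x d
   by repeatedly firing any node whose operand tokens are available. */
int main(void) {
    double b = 4.0, c = 1.0, e = 2.0;           /* input tokens                    */
    double t1 = 0, t2 = 0, a = 0, d = 0, f = 0; /* results produced by each node   */
    bool ready[5] = {false, false, false, false, false};
    bool fired = true;

    while (fired) {                             /* keep firing until nothing can   */
        fired = false;
        if (!ready[0]) { t1 = b + 1.0; ready[0] = true; fired = true; }             /* b + 1 */
        if (!ready[1]) { t2 = b - c;   ready[1] = true; fired = true; }             /* b - c */
        if (!ready[2] && ready[0] && ready[1]) { a = t1 * t2; ready[2] = true; fired = true; }
        if (!ready[3]) { d = c * e;    ready[3] = true; fired = true; }             /* c x e */
        if (!ready[4] && ready[2] && ready[3]) { f = a * d;  ready[4] = true; fired = true; }
    }
    printf("a = %.1f, d = %.1f, f = %.1f\n", a, d, f);  /* a = 15, d = 2, f = 30    */
    return 0;
}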

EECC756 - Shaaban
#52 lec # 1 Spring 2011 3-8-2011

Example of Flynn's Taxonomy's MISD (Multiple Instruction Streams, Single Data Stream):

Systolic Architectures
Replace a single processor with an array of regular processing elements; orchestrate data flow for high throughput with less memory access.

Figure: Memory (M) feeding a linear array of processing elements (PE = Processing Element, M = Memory).

Different from linear pipelining: nonlinear array structure, multidirectional data flow, and each PE may have (small) local instruction and data memory.
Different from SIMD: each PE may do something different.
Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs); represent algorithms directly by chips connected in a regular pattern.

A possible example of MISD in Flynn's Classification of Computer Architecture

EECC756 - Shaaban
#53 lec # 1 Spring 2011 3-8-2011

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication (C = A X B)

Processors are arranged in a 2-D grid; each processor accumulates one element of the product. The inputs are aligned (skewed) in time.

Figure (T=0): The rows of A (row i staged as a_i,2 a_i,1 a_i,0) are queued to enter the grid from the left, and the columns of B (column j staged as b_2,j b_1,j b_0,j) from the top, skewed so that matching elements meet at the proper processor.

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

EECC756 - Shaaban
#54 lec # 1 Spring 2011 3-8-2011

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product (alignments in time).

Figure (T=1): a0,0 and b0,0 have entered the top-left processor, which computes the first partial product a0,0*b0,0.

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

EECC756 - Shaaban
#55 lec # 1 Spring 2011 3-8-2011

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product (alignments in time).

Figure (T=2): The top-left processor has accumulated a0,0*b0,0 + a0,1*b1,0; its right neighbor holds a0,0*b0,1 and its lower neighbor holds a1,0*b0,0.

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

EECC756 - Shaaban
#56 lec # 1 Spring 2011 3-8-2011

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product (alignments in time).

Figure (T=3): The top-left processor completes C00 = a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0; the other active processors hold the partial sums a0,0*b0,1 + a0,1*b1,1, a0,0*b0,2, a1,0*b0,0 + a1,1*b1,0, a1,0*b0,1, and a2,0*b0,0.

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

EECC756 - Shaaban
#57 lec # 1 Spring 2011 3-8-2011

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product (alignments in time).

Figure (T=4): C01 and C10 are now also complete (C00 finished at T=3); the remaining processors continue accumulating their partial sums.

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

EECC756 - Shaaban
#58 lec # 1 Spring 2011 3-8-2011

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product (alignments in time).

Figure (T=5): C02, C11, and C20 are now complete as well; only C12, C21, and C22 are still accumulating.

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

EECC756 - Shaaban
#59 lec # 1 Spring 2011 3-8-2011

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product (alignments in time).

Figure (T=6): C12 and C21 are complete; only C22 in the bottom-right processor is still accumulating.

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

EECC756 - Shaaban
#60 lec # 1 Spring 2011 3-8-2011

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product (alignments in time).

Figure (T=7): C22 completes and the entire 3x3 product is done after 7 time steps.

On one processor the multiplication takes O(n^3) steps (t = 27?); speedup = 27/7 = 3.85

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
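To tie the snapshots above together, here is a small C sketch (an illustrative simulation, not code from the lecture or from the cited example source): values of A flow rightward through the grid, values of B flow downward, inputs are skewed in time, and each processing element accumulates one element of C. For n = 3 all results are in place after 3n - 2 = 7 steps.

#include <stdio.h>

#define N 3

/* a-value fed into row i of the array at time step t (zero-padded otherwise) */
static double a_in(const double A[N][N], int i, int t) {
    return (t >= i && t < i + N) ? A[i][t - i] : 0.0;
}
/* b-value fed into column j of the array at time step t */
static double b_in(const double B[N][N], int j, int t) {
    return (t >= j && t < j + N) ? B[t - j][j] : 0.0;
}

int main(void) {
    double A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double a_reg[N][N] = {{0}}, b_reg[N][N] = {{0}}, C[N][N] = {{0}};

    for (int t = 0; t < 3 * N - 2; t++) {       /* 7 steps for N = 3              */
        double a_new[N][N], b_new[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a_new[i][j] = (j == 0) ? a_in(A, i, t) : a_reg[i][j - 1]; /* A moves right */
                b_new[i][j] = (i == 0) ? b_in(B, j, t) : b_reg[i - 1][j]; /* B moves down  */
            }
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a_reg[i][j] = a_new[i][j];
                b_reg[i][j] = b_new[i][j];
                C[i][j] += a_reg[i][j] * b_reg[i][j];   /* PE(i,j) accumulates C[i][j] */
            }
    }
    for (int i = 0; i < N; i++)
        printf("%6.1f %6.1f %6.1f\n", C[i][0], C[i][1], C[i][2]);
    return 0;
}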

EECC756 - Shaaban
#61 lec # 1 Spring 2011 3-8-2011
