
CS 213: Parallel Processing Architectures

Laxmi Narayan Bhuyan
http://www.cs.ucr.edu/~bhuyan

PARALLEL PROCESSING ARCHITECTURES - CS 213 SYLLABUS - Winter 2008
INSTRUCTOR: L.N. Bhuyan (http://www.engr.ucr.edu/~bhuyan/)
PHONE: (951) 827-2347
E-mail: [email protected]
LECTURE TIME: TR 12:40pm-2:00pm
PLACE: HMNSS 1502
OFFICE HOURS: W 2:00-4:00 or by appointment

References:

John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers.
Research papers to be made available in class.

COURSE OUTLINE:

Introduction to Parallel Processing: Flynn's classification, SIMD and MIMD operations, shared memory vs. message passing multiprocessors, distributed shared memory
Shared Memory Multiprocessors: SMP and CC-NUMA architectures, cache coherence protocols, consistency protocols, data pre-fetching, CC-NUMA memory management, SGI 4700 multiprocessor, chip multiprocessors, network processors (IXP and Cavium)
Interconnection Networks: static and dynamic networks, switching techniques, Internet techniques
Message Passing Architectures: message passing paradigms, Grid architecture, workstation clusters, user-level software
Multiprocessor Scheduling: scheduling and mapping, Internet web servers, P2P, content-aware load balancing

PREREQUISITE: CS 203A

GRADING:
Project I - 20 points
Project II - 30 points
Test 1 - 20 points
Test 2 - 30 points

Possible Projects

1. Experiments with the SGI Altix 4700 supercomputer: algorithm design and FPGA offloading
2. I/O scheduling on the SGI
3. Chip Multiprocessor (CMP) design, analysis, and simulation
4. P2P using PlanetLab

Note: 2 students/group. Expect submission of a paper to a conference.

Useful Web Addresses

SGI Altix 4700: http://www.sgi.com/products/servers/altix/4000/ and http://www.sgi.com/products/rasc/
Wisconsin Computer Architecture Page - simulators: http://www.cs.wisc.edu/~arch/www/tools.html
SimpleScalar: www.simplescalar.com (look for multiprocessor extensions)
NepSim: http://www.cs.ucr.edu/~yluo/nepsim/
Working in a cluster environment:
  Beowulf Cluster: www.beowulf.org
  MPI: www-unix.mcs.anl.gov/mpi
Application benchmarks: http://www-flash.stanford.edu/apps/SPLASH/

Parallel Computers
Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
- Almasi and Gottlieb, Highly Parallel Computing, 1989

Questions about parallel computers:

How large a collection?
How powerful are the processing elements?
How do they cooperate and communicate?
How are data transmitted? What type of interconnection?
What are the HW and SW primitives for the programmer?
Does it translate into performance?

Parallel Processors Myth


The dream of computer architects since the 1950s: replicate processors to add performance, vs. designing a faster processor
Led to innovative organizations tied to particular programming models, since uniprocessors can't keep going
  e.g., uniprocessors must eventually stop getting faster due to the limit of the speed of light
Has it happened? Killer micros! Parallelism moved to the instruction level; microprocessor performance doubles every 1.5 years!
In the 1990s, parallel computer companies went out of business: Thinking Machines, Kendall Square, ...

What Level of Parallelism?

Bit-level parallelism: 1970 to ~1985
  4-bit, 8-bit, 16-bit, 32-bit microprocessors

Instruction-level parallelism (ILP): ~1985 through today
  Pipelining
  Superscalar
  VLIW
  Out-of-order execution
  Limits to the benefits of ILP?

Process-level or thread-level parallelism: mainstream for general-purpose computing?
  Servers are parallel
  High-end desktop dual-processor PC soon?? (or just sell the socket?)

Why Multiprocessors?

1. Microprocessors as the fastest CPUs
  Collecting several is much easier than redesigning one

2. Complexity of current microprocessors
  Do we have enough ideas to sustain 2X/1.5yr?
  Can we deliver such complexity on schedule?

3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS)

4. Emergence of embedded and server markets driving microprocessors in addition to desktops
  Embedded functional parallelism
  Network processors exploiting packet-level parallelism
  SMP servers and clusters of workstations for multiple users
  Less demand for parallel computing

Amdahl's Law and Parallel Computers

Amdahl's Law (f: fraction of the original program that is sequential):
  Speedup = 1 / [f + (1-f)/n] = n / [1 + (n-1)f],
  where n = number of processors

A portion f is sequential => limits parallel speedup
  Speedup <= 1/f

Example: What fraction can be sequential to get 80X speedup from 100 processors? Assume either 1 processor or all 100 are fully used.
  80 = 1 / [f + (1-f)/100] => f ≈ 0.0025

Only 0.25% sequential! => Must be a highly parallel program.
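
To double-check the arithmetic above, here is a minimal sketch in C (an illustrative example, not part of the course materials) that solves Amdahl's Law for the sequential fraction f needed to reach a target speedup; rearranging Speedup = 1 / [f + (1-f)/n] gives f = (n/Speedup - 1) / (n - 1).

/* Amdahl's Law: what sequential fraction f allows a target speedup on n processors? */
#include <stdio.h>

static double speedup(double f, int n)
{
    /* Speedup = 1 / [f + (1-f)/n] */
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void)
{
    int n = 100;             /* number of processors */
    double target = 80.0;    /* desired speedup */

    /* Solve target = 1 / [f + (1-f)/n] for f */
    double f = ((double)n / target - 1.0) / (n - 1.0);

    printf("sequential fraction f = %.4f\n", f);             /* ~0.0025, i.e., about 0.25% sequential */
    printf("check: speedup(f, %d) = %.1f\n", n, speedup(f, n));
    return 0;
}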

Popular Flynn Categories

SISD (Single Instruction Single Data)
  Uniprocessors

MISD (Multiple Instruction Single Data)
  ???; multiple processors on a single data stream

SIMD (Single Instruction Multiple Data)
  Examples: Illiac-IV, CM-2
  Simple programming model, low overhead, flexibility, all custom integrated circuits
  (Phrase reused by Intel marketing for media instructions ~ vector)

MIMD (Multiple Instruction Multiple Data)
  Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
  Flexible
  Uses off-the-shelf micros

MIMD is the current winner: concentrate on the major design emphasis, <= 128-processor MIMD machines

Classification of Parallel Processors

SIMD - EX: Illiac IV and MasPar

MIMD - true multiprocessors
  1. Message Passing Multiprocessor - Interprocessor communication through explicit message passing, via send and receive operations.
     EX: IBM SP2, Cray XD1, and clusters
  2. Shared Memory Multiprocessor - All processors share the same address space. Interprocessor communication through load/store operations to a shared memory.
     EX: SMP servers, SGI Origin, HP V-Class, Cray T3E

Their advantages and disadvantages?

More Message Passing Computers

Cluster: computers connected over a high-bandwidth local area network (Ethernet or Myrinet), used as a parallel computer

Network of Workstations (NOW): a homogeneous cluster - computers of the same type

Grid: computers connected over a wide-area network

Another Classification for MIMD Computers

Centralized Memory: Shared memory located at a centralized location - may consist of several interleaved modules, all the same distance from any processor
  Symmetric Multiprocessor (SMP), Uniform Memory Access (UMA)

Distributed Memory: Memory is distributed to each processor - improves scalability
  (a) Message passing architectures - No processor can directly access another processor's memory
  (b) Hardware Distributed Shared Memory (DSM) multiprocessor - Memory is distributed, but the address space is shared
      Non-Uniform Memory Access (NUMA)
  (c) Software DSM - A layer of the OS, built on top of a message passing multiprocessor, gives a shared memory view to the programmer

Data Parallel Model


Operations can be performed in parallel on each element of a large regular data structure, such as an array
One Control Processor (CP) broadcasts to many PEs: the CP reads an instruction from the control memory, decodes it, and broadcasts control signals to all PEs
A condition flag per PE allows individual PEs to skip an operation
Data are distributed across the PE memories
Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
Data parallel programming languages lay out data across processors

Data Parallel Model


Vector processors have similar ISAs, but no data placement restriction
SIMD led to data parallel programming languages
Advancing VLSI led to single-chip FPUs and whole fast processors, making SIMD less attractive
The SIMD programming model led to the Single Program Multiple Data (SPMD) model
  All processors execute an identical program
Data parallel programming languages are still useful; do communication all at once: "bulk synchronous" phases in which all processors communicate after a global barrier (a minimal MPI sketch follows below)
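
As a hedged illustration of the bulk-synchronous SPMD style described above (an assumed example, not taken from the slides), the following C/MPI sketch runs an identical program on every rank, finishes a local compute phase at a global barrier, and then does the communication all at once with a collective reduction.

/* Bulk-synchronous SPMD sketch: compute phase, global barrier, bulk communication. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = rank + 1.0;      /* stand-in for a local compute phase */

    MPI_Barrier(MPI_COMM_WORLD);    /* global barrier ends the compute phase */

    double sum = 0.0;               /* bulk communication: everyone exchanges at once */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", nprocs, sum);

    MPI_Finalize();
    return 0;
}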

SIMD Programming: High Performance Fortran (HPF)

Single Program Multiple Data (SPMD)
FORALL construct, similar to a fork (an equivalent C/OpenMP sketch appears after this slide):
  FORALL (I=1:N)
    A(I) = B(I) + C(I)
  END FORALL

Data mapping in HPF:
  1. To reduce interprocessor communication
  2. To balance load among processors

http://www.npac.syr.edu/hpfa/
http://www.crpc.rice.edu/HPFF/
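
For comparison with the HPF FORALL above, here is a minimal data-parallel sketch in C with OpenMP (an assumed equivalent, not part of the HPF material): the pragma distributes the independent iterations across threads much as FORALL distributes them across processors.

/* C/OpenMP counterpart of FORALL (I=1:N) A(I) = B(I) + C(I) */
#include <omp.h>

#define N 1024

void add_arrays(double A[N], const double B[N], const double C[N])
{
    /* Each thread handles a chunk of the independent iterations. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i];
}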

Major MIMD Styles


1. Centralized shared memory ("Uniform Memory Access" time, or "Shared Memory Processor")
2. Decentralized memory (a memory module with each CPU)
  Advantages: scalability, more memory bandwidth, lower local memory latency
  Drawbacks: longer remote communication latency, more complex software model
  Two types: shared memory and message passing

Symmetric Multiprocessor (SMP)


Memory: centralized, with uniform memory access time (UMA) and bus interconnect
Examples: Sun Enterprise 5000, SGI Challenge, Intel SystemPro

Decentralized Memory versions


1. Shared memory with "Non-Uniform Memory Access" time (NUMA)
2. Message passing "multicomputer" with a separate address space per processor
  Can invoke software with Remote Procedure Call (RPC)
  Often via a library, such as MPI (Message Passing Interface) - a minimal sketch follows below
  Also called "synchronous communication" since the communication causes synchronization between the 2 processes
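
Below is a minimal sketch of the message-passing style using MPI (an assumed example, not from the slides): each rank owns a separate address space, data moves only through explicit send and receive calls, and the blocking pair also synchronizes the two processes. Run with at least two ranks, e.g., mpirun -np 2.

/* Explicit message passing between two ranks with separate address spaces. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;   /* the data exists only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}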

Distributed Directory MPs

Communication Models

Shared memory
  Processors communicate with a shared address space
  Easy on small-scale machines
  Advantages:
    Model of choice for uniprocessors, small-scale MPs
    Ease of programming
    Lower latency
    Easier to use hardware-controlled caching

Message passing
  Processors have private memories, communicate via messages
  Advantages:
    Less hardware, easier to design
    Good scalability
    Focuses attention on costly non-local operations
Virtual Shared Memory (VSM)

Shared Address/Memory Multiprocessor Model


Communicate via Load and Store
Oldest and most popular model

Based on timesharing: processes on multiple processors vs. sharing single processor process: a virtual address space and ~ 1 thread of control
Multiple processes can overlap (share), but ALL threads share a process address space

Writes to shared address space by one thread are visible to reads of other threads
Usual model: share code, private stack, some shared heap, some private heap

