Course Material: Lecture 5: Introduction To Parallel Systems and Computing
Lecture 5: Introduction to Parallel
Systems and Computing
Dr Jameel Ahmad
Department of Computer Science
UMT
Content
• Motivate and define parallel computations
• Design of parallel algorithms
• Overview of different classes of parallel systems
• Overview of different programming concepts
• Historic and current parallel systems
• Applications demanding HPC
  – Research within this area at the department

Goal
The goal of the course is to give basic knowledge about
  – parallel computer hardware architectures
  – design of parallel algorithms
  – parallel programming paradigms and languages
  – compiler techniques for automatic parallelization and vectorization
  – areas of application in parallel computing
This includes knowledge about central ideas and classification systems, machines with shared and distributed memory, data and functional parallelism, parallel programming languages, scheduling algorithms, analysis of dependencies, and the different tools that support the development of parallel programs.
Course evaluation vt-11
• Assignment 2 was too difficult
• Look for a new book

Scientific Computing: 87 vs 2K9
• 1987
  – Minisupercomputers (1-20 Mflop/s): Alliant, Convex, DEC
  – Parallel vector processors (PVP) (20-2000 Mflop/s)
• 2002
  – PCs (lots of them)
  – RISC workstations (500-4000 Mflop/s): DEC, HP, IBM, SGI, Sun
  – RISC-based symmetric multiprocessors (10-400 Gflop/s): IBM, Sun, SGI
  – Parallel vector processors (10-36000! Gflop/s): Fujitsu, Hitachi, NEC
  – Highly parallel processors (1-10000 Gflop/s): HP, IBM, NEC, Fujitsu, Hitachi
  – Earth Simulator: 5120 vector CPUs, 36 teraflop
• 2004 – IBM's Blue Gene (65k CPUs), 136 teraflop
• 2005/6/7 – IBM's Blue Gene (128k CPUs; 208k in 2007), 480 teraflop
• 2008 – IBM's Roadrunner, Cell, 1.1 petaflop
• 2009 – Cray XT5 (224,162 cores), 1.75 petaflop
• 2010 – Tianhe-1A, 2.57 petaflop, NVIDIA GPUs
• 2011 – Fujitsu K computer, SPARC64 (705,024 cores), 10.5 petaflop
Communication time = α + βk (α: per-message startup latency; β: per-unit transfer time; k: message size)
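As a rough worked example of this model (the numbers below are assumed, not measured): with a startup latency of α = 10 µs and a per-byte cost of β = 1 ns, a 1 MB message costs about 10 µs + 1 ms ≈ 1.01 ms, so the startup term only matters for small messages. A minimal C sketch of the same calculation:

    /* Hedged sketch of the linear communication-cost model t = alpha + beta*k.
       The constants below are illustrative assumptions, not measured values. */
    #include <stdio.h>

    double comm_time(double alpha, double beta, double k_bytes) {
        return alpha + beta * k_bytes;   /* startup latency + per-byte cost */
    }

    int main(void) {
        double alpha = 10e-6;   /* assumed startup latency: 10 microseconds  */
        double beta  = 1e-9;    /* assumed per-byte transfer time: 1 ns/byte */
        printf("1 KB message: %.2e s\n", comm_time(alpha, beta, 1e3));
        printf("1 MB message: %.2e s\n", comm_time(alpha, beta, 1e6));
        return 0;
    }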
Load Balancing
(Figure: a 2D array of blocks, colored by the processor that owns them, for three different mappings; the resulting work per processor is listed below each mapping. See the ownership sketch after the Flynn's Taxonomy table.)
• Row block mapping — Proc.: 0 1 2 3; Nr: 13 22 10 3
• Column block mapping — Proc.: 0 1 2 3; Nr: 4 13 19 12
• Block-cyclic mapping — Proc.: 0 1 2 3; Nr: 11 12 12 14

Flynn's Taxonomy

                                   Number of Data Streams
                                   Single                  Multiple
  Number of           Single      SISD (von Neumann)      SIMD (vector, array)
  Instruction
  Streams             Multiple    MISD (?)                MIMD (multiple micros)

• Flynn does not describe modernities like
  – Pipelining (MISD?)
  – Memory model
  – Interconnection network
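Returning to the load-balancing mappings above, a minimal sketch (a 1D analogue of the 2D distributions on the slide; n, p, and the block size b are illustrative values) of how each mapping decides which processor owns element i:

    /* Hedged sketch: which processor owns row i under block, cyclic, and
       block-cyclic mappings. n = rows, p = processors, b = block size. */
    #include <stdio.h>

    int owner_block(int i, int n, int p)        { return i / ((n + p - 1) / p); }
    int owner_cyclic(int i, int p)              { return i % p; }
    int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }

    int main(void) {
        int n = 16, p = 4, b = 2;
        for (int i = 0; i < n; i++)
            printf("row %2d: block->%d  cyclic->%d  block-cyclic->%d\n",
                   i, owner_block(i, n, p), owner_cyclic(i, p),
                   owner_block_cyclic(i, b, p));
        return 0;
    }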
Paradigms
• A paradigm is a model of the world that is used to formulate a computer solution to a problem

Synchronous paradigms: Vector/Array
• Each processor is allotted a very small operation
• Pipeline parallelism
• Good when operations can be broken down into fine-grained steps
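As a small, hedged illustration of such fine-grained work (the SAXPY operation y = a·x + y is a standard textbook example, not taken from the slides): each iteration is one tiny multiply-add, exactly the kind of step that vector/array units and arithmetic pipelines stream through.

    /* Minimal sketch of a fine-grained, vectorizable operation: y = a*x + y.
       Every iteration is one small independent multiply-add. Sizes illustrative. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        float a = 2.0f, x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

        for (int i = 0; i < N; i++)     /* one tiny operation per element */
            y[i] = a * x[i] + y[i];

        for (int i = 0; i < N; i++) printf("%.1f ", y[i]);
        printf("\n");
        return 0;
    }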
Synchronous paradigms: SIMD
• Data parallel!
• All processors do the same thing at the same time, or are idle
• Phase 1:
  – Data partitioning and distribution
• Phase 2:
  – Data parallel work
• Good for large regular data structures

Asynchronous paradigms: MIMD
• The processors work independently of each other
• Must be synchronized
  – Message passing
  – Mutual exclusion (locks)
• Best for coarse-grained problems
• Shared memory
  – Virtually and physically shared
  – UMA, NUMA, COMA, CC-NUMA
• Distributed memory
  – Highly parallel systems, NOWs, COWs
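A hedged MIMD sketch (not from the lecture): independent POSIX threads that must synchronize with a lock (mutual exclusion) whenever they touch shared data. Compile with: cc demo.c -pthread

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    100000

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;                          /* each thread runs independently */
        for (int i = 0; i < NITER; i++) {
            pthread_mutex_lock(&lock);      /* mutual exclusion on shared state */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITER);
        return 0;
    }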
Shared Memory Architectures
• All processors have access to a global address space
  – UMA, NUMA
• Access to the shared memory can be over a bus or a switched network
• The hardware does not scale well to massively parallel levels
(Figure: processors P connected to a common memory through a bus/switching network.)

Distributed Memory Architectures
• Each node has its own local memory (no shared address space)
• The processors communicate with each other over a network by passing messages
• The network topology can be static or dynamic
• The hardware scales well; programming is more difficult than with shared memory
• Computations are much faster than communication
(Figure: processor-memory nodes connected by a network. Example topologies: mesh, ring, linear array, 2D torus, 3D mesh, 3D torus, tree, fat tree, hypercube, star, Vulcan switch, cube-connected cycles, omega, crossbar, etc.)
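A minimal message-passing sketch under the distributed-memory model described above (assuming an MPI installation; run with e.g. mpirun -np 2): each rank has its own private copy of the variable, and data moves between nodes only through explicit send/receive calls.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            if (rank == 0) {
                value = 42;                       /* lives only in rank 0's memory */
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 1 received %d over the network\n", value);
            }
        }
        MPI_Finalize();
        return 0;
    }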
SPMD: Single Program, Multiple Data
• Asynchronous data parallel processing
• The software equivalent of SIMD
• All processors execute the same program, but on different data and asynchronously

Control vs Data Parallelism
• Control parallelism (instruction parallelism)
  – uses parallelism in the control structures of a program
  – independent parts of a program execute in parallel
• Data parallelism
  – one processor per data element (or block of data)
  – each processor needs separate data memory
  – millions of processors can be applied to large problems
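A hedged SPMD sketch (the problem, a sum over 0..N-1 with an assumed N, is purely illustrative): every rank runs this same program, picks its own block of the data from its rank number, and the partial results are combined with a reduction. Run with e.g. mpirun -np 4.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000   /* illustrative global problem size */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Block decomposition of 0..N-1: each rank sums its own chunk. */
        int chunk = (N + size - 1) / size;
        int lo = rank * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        long local = 0, total = 0;
        for (int i = lo; i < hi; i++) local += i;

        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum 0..%d = %ld\n", N - 1, total);

        MPI_Finalize();
        return 0;
    }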