Lecture 5: Introduction to Parallel Systems and Computing

Dr Jameel Ahmad
Department of Computer Science
UMT

Content
• Motivate and define parallel computations
• Design of parallel algorithms
• Overview of different classes of parallel systems
• Overview of different programming concepts
• Historic and current parallel systems
• Applications demanding HPC
  – Research within this area at the department

Goal
The goal of the course is to give basic knowledge about
  – parallel computer hardware architectures
  – design of parallel algorithms
  – parallel programming paradigms and languages
  – compiler techniques for automatic parallelization and vectorization
  – areas of application in parallel computing
This includes knowledge about central ideas and classification of systems, machines with shared and distributed memory, data and functional parallelism, parallel programming languages, scheduling algorithms, analysis of dependencies, and different tools supporting the development of parallel programs.
Course evaluation vt-11
• Assignment 2 too difficult
• Look for a new book

Scientific Computing 87 vs 2K9
• 1987
  – Minisupercomputers (1-20 Mflop/s): Alliant, Convex, DEC
  – Parallel vector processors (PVP) (20-2000 Mflop/s)
• 2002
  – PCs (lots of them)
  – RISC workstations (500-4000 Mflop/s): DEC, HP, IBM, SGI, Sun
  – RISC-based symmetric multiprocessors (10-400 Gflop/s): IBM, Sun, SGI
  – Parallel vector processors (10-36000! Gflop/s): Fujitsu, Hitachi, NEC
  – Highly parallel processors (1-10000 Gflop/s): HP, IBM, NEC, Fujitsu, Hitachi
  – Earth Simulator, 5120 vector CPUs, 36 teraflop
• 2004 – IBM's Blue Gene project (65k CPUs), 136 teraflop
• 2005/6/7 – IBM's Blue Gene project (128k CPUs; 208k in 2007), 480 teraflop
• 2008 – IBM's Roadrunner, Cell, 1.1 petaflop
• 2009 – Cray XT5 (224162 cores), 1.75 petaflop
• 2010 – Tianhe-1A, 2.57 petaflop, NVIDIA GPUs
• 2011 – Fujitsu K computer, SPARC64 (705024 cores), 10.5 petaflop

[Photos: Blue Gene (LLNL), Roadrunner (LANL), Jaguar (Oak Ridge NL), K computer]

History at the department/HPC2N
• 1986: IBM 3090VF600
  – Shared memory, 6 processors with vector unit
• 1987: Intel iPSC/2: 32-128 nodes
  – Distributed memory MIMD, hypercube with 64 nodes (i386 + 4M per node)
  – 16 nodes with a vector board
• 199X: Alliant FX2800
  – Shared memory MIMD machine, 17 i860 processors
• 1996: IBM SP
  – 64 thin nodes, 2 high nodes with 4 processors each
• 1997: SGI Onyx2
  – 10 MIPS R10000
• 1998: 2-way POWER3
• 1999: Small Linux cluster
• 2001: Better POWER3
• 2002: Large Linux cluster, Seth (120 dual Athlon processors), Wolfkit SCI
• 2003: SweGrid Linux cluster, Ingrid, 100 nodes with Pentium 4
• 2004: 384-CPU cluster (Opteron), Sarek, 1.7 Tflops peak, 79% HP-Linpack
• 2008: Linux cluster Akka, 5376 cores, 10.7 TB RAM, 46 teraflop HP-Linpack, ranked 39 on the Top 500 (June 2008)
• 2012: Linux cluster Abisko, 15264 cores (318 nodes with 4 AMD 12-core Interlagos)

Scientific applications (research at the department)
• BLAS/LAPACK
  – BLAS-2, matrix-vector operations
  – BLAS-3, matrix-matrix operations (see the sketch below)
  – LAPACK
• Linear algebra + eigenvalue problems
  – ScaLAPACK
• Nonlinear optimization
• Neural networks
• Development environments
  – CONLAB/CONLAB compiler
• Functional languages
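
As a small illustration of the BLAS-3 level mentioned above, the sketch below calls a matrix-matrix multiply through the C BLAS interface. It is a minimal example only: it assumes a CBLAS implementation (e.g. OpenBLAS) is installed, and the matrix sizes and values are invented for the illustration.

/* Minimal BLAS-3 sketch: C = alpha*A*B + beta*C via cblas_dgemm.
   Assumes a CBLAS implementation is available (link with e.g. -lopenblas). */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    const int m = 2, k = 3, n = 2;
    double A[2*3] = {1, 2, 3,
                     4, 5, 6};          /* m x k, row-major */
    double B[3*2] = {7, 8,
                     9, 10,
                     11, 12};           /* k x n, row-major */
    double C[2*2] = {0, 0, 0, 0};       /* m x n result */

    /* BLAS-3: one call performs O(m*n*k) work on matrix blocks,
       which is what makes it attractive for parallel machines. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C, n);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
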
The Demand for Speed!
• Grand Challenge Problems
• Simulations of different kinds
  – Differential equations (over time)
• Deep Blue
• Data analyses
• Cryptography

Example of applications
• Global atmospheric circulation
• Weather prediction
  – Discretization on a lattice
• Earthquakes

Technical applications
• VLSI design
  – Simulation: different gates on one level can be tested in parallel, as they act independently
  – Placement: move blocks randomly to minimize an objective function, e.g. cable length
  – Cable routing
• Design
  – Simulate flows around objects like cars, aeroplanes, boats
  – Strength-of-materials (hållfasthet) computations
  – Heat distribution

More Applications
• Simulate atom bombs (ASCI)
• Scientific visualization
  – Show large data sets graphically
• Signal and image analysis
• Reservoir modeling
  – Oil in Norway, for example
• Remote analysis of e.g. the Earth
  – Satellite data: adaptation, analysis, cataloguing
• Movies and commercials
  – Star Wars etc.
• Searching on the Internet
• etc, etc, etc, etc ...
Parallel computations!
A collection of processors that communicate and cooperate to solve a large problem fast.
[Figure: processors connected via a communication medium]

Motive & Goal
• Manufacturing
  – Physical laws limit the speed of the processors
  – Moore's law
  – Price/Performance
    • Cheaper to take many cheap and relatively fast processors than to develop one super-fast processor
    • Possible to use fewer kinds of circuits but use more of them
• Use
  – Decrease wall clock time
  – Solve bigger problems

Why we're building parallel systems
Up to now, performance increases have been attributable to the increasing density of transistors.
But there are inherent problems.

A little physics lesson
Smaller transistors = faster processors.
Faster processors = increased power consumption.
Increased power consumption = increased heat.
Increased heat = unreliable processors.
Solution
Move away from single-core systems to multicore processors.
"core" = central processing unit (CPU)
Introducing parallelism!!!

Why we need to write parallel programs
Running multiple instances of a serial program often isn't very useful.
Think of running multiple instances of your favorite game.
What you really want is for it to run faster.

Approaches to the serial problem
Rewrite serial programs so that they're parallel.
Write translation programs that automatically convert serial programs into parallel programs.
This is very difficult to do.
Success has been limited.

More problems
Some coding constructs can be recognized by an automatic program generator and converted to a parallel construct.
However, it's likely that the result will be a very inefficient program.
Sometimes the best parallel solution is to step back and devise an entirely new algorithm.
Can all problems be solved in parallel?
• Dig a hole in the ground: can it be parallelized? No.
• Dig a ditch: can it be parallelized? Yes.
• Data dependency: can you put a brick anywhere, anytime? No.

Design of parallel programs
• Data partitioning
  – distribute data on the different processors
• Granularity
  – size of the parallel parts
• Load balancing
  – make all processors have the same load
• Synchronization
  – cooperate to produce the result

Parallel program design, example
• Game-of-life on a 2D net (see W-A page 190)
• With max 4 processors: coarse-grained decomposition, small amount of communication
• With max 16 processors: fine-grained decomposition, a lot of communication
• Communication time = α + βk (α = message startup time, β = transfer time per element, k = message size); see the sketch below

Load Balancing
Goal: all processors should do the same amount of work.
Look at the mapping example below (after the cost sketch):
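
To make the cost formula above concrete, here is a minimal sketch in C that compares the boundary-exchange time of a coarse-grained and a fine-grained decomposition under the linear model T = α + βk. The values of α and β, the grid size, and the per-processor boundary sizes are invented for illustration only.

/* Sketch: linear communication-cost model T = alpha + beta*k.
   alpha = per-message startup latency, beta = time per transferred element,
   k = number of elements in the message. All numbers below are made up. */
#include <stdio.h>

static double comm_time(double alpha, double beta, int k) {
    return alpha + beta * (double)k;
}

int main(void) {
    const double alpha = 1e-5;   /* assumed startup latency [s]  */
    const double beta  = 1e-8;   /* assumed cost per element [s] */
    const int n = 1024;          /* assumed side of the 2D grid  */

    /* Coarse-grained: 4 processors, each exchanges ~2 boundary rows of n cells. */
    double coarse = 2 * comm_time(alpha, beta, n);

    /* Fine-grained: 16 processors, each exchanges ~4 boundaries of n/4 cells. */
    double fine = 4 * comm_time(alpha, beta, n / 4);

    printf("coarse-grained boundary exchange: %g s per step\n", coarse);
    printf("fine-grained   boundary exchange: %g s per step\n", fine);
    /* More, smaller messages pay the startup cost alpha more often. */
    return 0;
}
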
Load Balancing
The resulting per-processor work counts (Nr) differ depending on the mapping:
• Row block mapping:      Proc.: 0 1 2 3   Nr: 13 22 10 3
• Column block mapping:   Proc.: 0 1 2 3   Nr: 4 13 19 12
• Block-cyclic mapping:   Proc.: 0 1 2 3   Nr: 11 12 12 14
[Figure: the grid cells labelled with their owning processor under each mapping; a small mapping sketch follows after Flynn's taxonomy below]

Flynn's Taxonomy
                               Single data stream    Multiple data streams
Single instruction stream      SISD (von Neumann)    SIMD (vector, array)
Multiple instruction streams   MISD (?)              MIMD (multiple micros)

• Flynn does not describe modernities like
  – Pipelining (MISD?)
  – Memory model
  – Interconnection network
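
The difference between the block and block-cyclic row mappings above can be shown in a few lines of code. The sketch below is an illustration, not taken from the lecture: it prints which processor owns each row of an n-row grid under the two schemes, with arbitrary processor count and block size.

/* Sketch: which processor owns row i under a block vs. a block-cyclic mapping.
   p = number of processors, b = block size; values are arbitrary examples. */
#include <stdio.h>

/* Block mapping: contiguous chunks of rows per processor. */
static int block_owner(int i, int n, int p) {
    int rows_per_proc = (n + p - 1) / p;   /* ceiling division */
    return i / rows_per_proc;
}

/* Block-cyclic mapping: blocks of b rows dealt out round-robin. */
static int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}

int main(void) {
    const int n = 16, p = 4, b = 2;
    printf("row : block  block-cyclic\n");
    for (int i = 0; i < n; i++)
        printf("%3d : %5d  %12d\n",
               i, block_owner(i, n, p), block_cyclic_owner(i, b, p));
    /* If the work is concentrated in a few neighbouring rows, the cyclic
       scheme spreads it over all processors, as in the counts above. */
    return 0;
}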

Paradigms
A model of the world that is used to formulate a computer solution to a problem.

Synchronous paradigms: Vector/Array
• Each processor is allotted a very small operation
• Pipeline parallelism
• Good when operations can be broken down into fine-grained steps
Synchronous paradigms: SIMD
• Data parallel!
• All processors do the same thing at the same time, or are idle
• Phase 1:
  – Data partitioning and distribution
• Phase 2:
  – Data parallel work
• Good for large regular data structures

Asynchronous paradigms: MIMD
• The processors work independently of each other
• Must be synchronized
  – Message passing
  – Mutual exclusion (locks), see the sketch below
• Best for coarse-grained problems
• Shared memory
  – Virtually and physically shared
  – UMA, NUMA, COMA, CC-NUMA
• Distributed memory
  – Highly parallel systems, NOWs, COWs
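
As a minimal illustration of the mutual-exclusion point above, the sketch below uses POSIX threads to let several MIMD-style threads update a shared counter safely. The thread count and iteration count are arbitrary; this is an illustration, not part of the original lecture.

/* Sketch: MIMD-style synchronization with a mutex (POSIX threads).
   Compile with: cc sketch.c -pthread. Numbers are arbitrary. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);    /* mutual exclusion */
        counter++;                    /* shared-memory update */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITER);
    return 0;
}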

Shared Memory Architectures
• All processors have access to a global address space
  – UMA, NUMA
• Access to the shared memory can be by a bus or a switched network
• The hardware does not scale well to massively parallel levels
[Figure: processors P connected to a shared Memory over a bus/switching network]

Distributed Memory Architectures
• Each node has its own local memory (no shared address space)
• The processors communicate with each other over a network by using messages
• The network topology can be static or dynamic
• The hardware scales well; programming is more difficult than with shared memory
• Computations are much faster than communication
[Figure: memory + processor nodes connected by a network; example topologies: mesh, ring, linear array, 2D-torus, 3D-mesh, 3D-torus, tree, fat tree, hypercube, star, Vulcan switch, cube-connected cycles, omega, crossbar, etc.]
SPMD, Single Program Multiple Data
• Asynchronous data parallel processing
• Software equivalent to SIMD
• Execute the same program, but on different data, asynchronously (see the MPI sketch below)

Control vs Data parallelism
• Control parallelism (instruction parallelism)
  – use parallelism in the control structures of a program
  – independent parts of a program execute in parallel
• Data parallelism
  – one processor per data element (block of data)
  – each processor needs separate data memory
  – millions of processors can be applied to large problems
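
Here is a minimal MPI sketch of the SPMD style described above: every process runs the same program, but each rank works on its own slice of the data. The problem size and the work (a partial sum) are invented for illustration.

/* Sketch: SPMD with MPI - same program on every process, data split by rank.
   Compile with mpicc, run with e.g. mpirun -np 4 ./a.out. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many are we? */

    /* Each rank sums its own block of 0..n-1 (illustrative data split). */
    const long n = 1000000;
    long lo = rank * n / size, hi = (rank + 1) * n / size;
    double local = 0.0, total = 0.0;
    for (long i = lo; i < hi; i++)
        local += (double)i;

    /* Combine the partial results on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum 0..%ld = %.0f\n", n - 1, total);

    MPI_Finalize();
    return 0;
}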

Parallel Programming – Implicitly
• Old Fortran, C, ...
  – Lots of dependencies between different parts of the program
  – The compiler must find all dependencies
  – The compiler restructures the program to identify more parallelism
  – Advantage: backwards compatible with existing programs
• New languages and extensions give more parallelism (a small OpenMP sketch follows below)
  – Fortran 90
  – HPF
  – OpenMP
  – MPI
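
As an illustration of how such an extension exposes parallelism, the sketch below parallelizes a loop with an OpenMP directive. The loop body and array size are invented; only the #pragma line and the reduction clause are OpenMP-specific.

/* Sketch: explicit parallelism with an OpenMP directive.
   Compile with e.g. cc -fopenmp sketch.c. Array size is arbitrary. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* The directive tells the compiler this loop is safe to run in parallel;
       the reduction clause handles the shared accumulation of 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("threads available: %d, sum = %.1f\n", omp_get_max_threads(), sum);
    return 0;
}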
