Parallel Computing
Introduction
In the simplest sense, parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem:
o To be run using multiple CPUs
o A problem is broken into discrete parts that can be solved concurrently
Parallel computing has long been used to model difficult problems in many areas of
science and engineering. For example:
o Geology, Seismology
o Mechanical Engineering - from prosthetics to spacecraft
o Electrical Engineering, Circuit Design, Microelectronics
o Computer Science, Mathematics
Why use parallel computing? The main reasons include:
Solve larger problems: Many problems are so large and/or complex that it is impractical
or impossible to solve them on a single computer, especially given limited computer
memory. For example:
o "Grand Challenge" problems requiring PetaFLOPS and PetaBytes of computing
resources.
o Web search engines/databases processing millions of transactions per second
Provide concurrency: A single compute resource can only do one thing at a time.
Multiple computing resources can be doing many things simultaneously. For example,
the Access Grid provides a global collaboration network where people from around the
world can meet and conduct work "virtually".
Use of non-local resources: Using compute resources on a wide area network, or even
the Internet when local compute resources are scarce. For example:
o SETI@home: over 1.3 million users and 3.2 million computers in nearly every country
in the world
o Folding@home: uses over 450,000 CPUs globally
Limits to serial computing: Both physical and practical reasons pose significant
constraints to simply building ever faster serial computers:
o Transmission speeds - the speed of a serial computer is directly dependent upon
how fast data can move through hardware. Absolute limits are the speed of light
(30 cm/nanosecond) and the transmission limit of copper wire (9 cm/nanosecond).
Increasing speeds necessitate increasing proximity of processing elements.
o Limits to miniaturization - processor technology is allowing an increasing number
of transistors to be placed on a chip. However, even with molecular or atomic-level
components, a limit will be reached on how small components can be.
o Economic limitations - it is increasingly expensive to make a single processor
faster. Using a larger number of moderately fast commodity processors to achieve
the same (or better) performance is less expensive.
o Current computer architectures are increasingly relying upon hardware level
parallelism to improve performance:
Multiple execution units
Pipelined instructions
Multi-core
Concepts Related to Parallel Computing
von Neumann Architecture
Named after the Hungarian mathematician John von Neumann, who first authored the
general requirements for an electronic computer in his 1945 papers.
Since then, virtually all computers have followed this basic design, differing from earlier
computers which were programmed through "hard wiring".
Well, parallel computers still follow this basic design, just multiplied in units. The basic,
fundamental architecture remains the same.
Flynn's Classical Taxonomy classifies these architectures along the dimensions of
instruction and data streams:
o SISD - Single Instruction, Single Data
o SIMD - Single Instruction, Multiple Data
o MISD - Multiple Instruction, Single Data
o MIMD - Multiple Instruction, Multiple Data
Parallel Terminology
Like everything else, parallel computing has its own "jargon". Some of the more commonly
used terms associated with parallel computing are listed below. Most of these will be
discussed in more detail later.
Supercomputing / High Performance Computing (HPC)
Using the world's fastest and largest computers to solve large problems.
Node
A standalone "computer in a box". Usually comprised of multiple
CPUs/processors/cores. Nodes are networked together to comprise a supercomputer.
CPU / Socket / Processor / Core
This varies, depending upon who you talk to. In the past, a CPU (Central Processing Unit)
was a singular execution component for a computer. Then, multiple CPUs were
incorporated into a node. Then, individual CPUs were subdivided into multiple "cores",
each being a unique execution unit. CPUs with multiple cores are sometimes called
"sockets" - vendor dependent. The result is a node with multiple CPUs, each containing
multiple cores. The nomenclature is confused at times. Wonder why?
Task
A logically discrete section of computational work. A task is typically a program or
program-like set of instructions that is executed by a processor. A parallel program
consists of multiple tasks running on multiple processors.
Pipelining
Breaking a task into steps performed by different processor units, with inputs streaming
through, much like an assembly line; a type of parallel computing.
Shared Memory
From a strictly hardware point of view, describes a computer architecture where all
processors have direct (usually bus based) access to common physical memory. In a
programming sense, it describes a model where parallel tasks all have the same
"picture" of memory and can directly address and access the same logical memory
locations regardless of where the physical memory actually exists.
Symmetric Multi-Processor (SMP)
Hardware architecture where multiple processors share a single address space and
access to all resources; shared memory computing.
Distributed Memory
In hardware, refers to network based memory access for physical memory that is not
common. As a programming model, tasks can only logically "see" local machine memory
and must use communications to access memory on other machines where other tasks
are executing.
Communications
Parallel tasks typically need to exchange data. There are several ways this can be
accomplished, such as through a shared memory bus or over a network; however, the
actual event of data exchange is commonly referred to as communications regardless of
the method employed.
Synchronization
The coordination of parallel tasks in real time, very often associated with
communications. Often implemented by establishing a synchronization point within an
application where a task may not proceed further until another task(s) reaches the same
or logically equivalent point.
Synchronization usually involves waiting by at least one task, and can therefore cause a
parallel application's wall clock execution time to increase.
Granularity
In parallel computing, granularity is a qualitative measure of the ratio of computation to
communication.
Observed Speedup
Observed speedup of a code which has been parallelized, defined as:
    wall-clock time of serial execution
    -------------------------------------
    wall-clock time of parallel execution
One of the simplest and most widely used indicators for a parallel program's
performance.
Parallel Overhead
The amount of time required to coordinate parallel tasks, as opposed to doing useful
work. Parallel overhead can include factors such as task start-up time, synchronizations,
data communications, software overhead imposed by parallel libraries and operating
systems, and task termination time.
Massively Parallel
Refers to the hardware that comprises a given parallel system - having many processors.
The meaning of "many" keeps increasing, but currently, the largest parallel computers
can be comprised of processors numbering in the hundreds of thousands.
Embarrassingly Parallel
Solving many similar, but independent tasks simultaneously; little to no need for
coordination between the tasks.
Scalability
Refers to a parallel system's (hardware and/or software) ability to demonstrate a
proportionate increase in parallel speedup with the addition of more processors.
Factors that contribute to scalability include the hardware (particularly memory-CPU
bandwidths and network communications), the application algorithm, the related parallel
overhead, and the characteristics of the specific application and coding.
Shared Memory
General Characteristics:
Shared memory parallel computers vary widely, but generally have in common the
ability for all processors to access all memory as global address space.
Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other
processors.
Shared memory machines can be divided into two main classes based upon memory
access times: UMA and NUMA.
Advantages:
Global address space provides a user-friendly programming perspective to memory.
Data sharing between tasks is both fast and uniform due to the proximity of memory to
CPUs.
Disadvantages:
Primary disadvantage is the lack of scalability between memory and CPUs. Adding more
CPUs can geometrically increase traffic on the shared memory-CPU path and, for
cache coherent systems, geometrically increase traffic associated with cache/memory
management.
Programmer responsibility for synchronization constructs that ensure "correct" access
of global memory.
Expense: it becomes increasingly difficult and expensive to design and produce shared
memory machines with ever increasing numbers of processors.
Distributed Memory
General Characteristics:
Like shared memory systems, distributed memory systems vary widely but share a
common characteristic. Distributed memory systems require a communication network
to connect inter-processor memory.
Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
Because each processor has its own local memory, it operates independently. Changes
it makes to its local memory have no effect on the memory of other processors. Hence,
the concept of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task of
the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
Advantages:
Memory is scalable with the number of processors. Increase the number of processors
and the size of memory increases proportionately.
Each processor can rapidly access its own memory without interference and without
the overhead incurred with trying to maintain cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
The programmer is responsible for many of the details associated with data
communication between processors.
It may be difficult to map existing data structures, based on global memory, to this
memory organization.
Non-uniform memory access (NUMA) times
Hybrid Distributed-Shared Memory
The largest and fastest computers in the world today employ both shared and
distributed memory architectures.
The shared memory component can be a cache coherent SMP machine and/or graphics
processing units (GPU).
The distributed memory component is the networking of multiple SMP/GPU machines,
which know only about their own memory - not the memory on another machine.
Therefore, network communications are required to move data from one SMP/GPU to
another.
Current trends seem to indicate that this type of memory architecture will continue to
prevail and increase at the high end of computing for the foreseeable future.
Advantages and Disadvantages: whatever is common to both shared and distributed
memory architectures.
Parallel Programming Models
Which model to use? This is often a combination of what is available and personal
choice. There is no "best" model, although there certainly are better implementations
of some models over others.
The following sections describe each of the models mentioned above, and also discuss
some of their actual implementations.
Shared Memory Model
In this programming model, tasks share a common address space, which they read and
write to asynchronously.
Various mechanisms such as locks / semaphores may be used to control access to the
shared memory.
An advantage of this model from the programmer's point of view is that the notion of
data "ownership" is lacking, so there is no need to specify explicitly the communication
of data between tasks. Program development can often be simplified.
An important disadvantage in terms of performance is that it becomes more difficult to
understand and manage data locality.
o Keeping data local to the processor that works on it conserves memory accesses,
cache refreshes and bus traffic that occurs when multiple processors use the
same data.
o Unfortunately, controlling data locality is hard to understand and beyond the
control of the average user.
Implementations:
Native compilers and/or hardware translate user program variables into actual memory
addresses, which are global. On stand-alone SMP machines, this is straightforward.
On distributed shared memory machines, such as the SGI Origin, memory is physically
distributed across a network of machines, but made global through specialized
hardware and software.
Threads Model
In this programming model, a single process can have multiple, concurrent execution
paths ("threads"). Each thread has its own local data but also shares the resources,
including the memory, of its parent process.
Implementations: the two most common are POSIX Threads (Pthreads), a library-based
standard, and OpenMP, which is compiler-directive based. A minimal Pthreads sketch
follows.
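The following is a minimal sketch of the threads model in C, assuming standard POSIX
Threads; the array size, thread count and work performed are purely illustrative.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double a[N];              /* shared array, visible to all threads */

/* Each thread fills its own contiguous block of the shared array. */
static void *worker(void *arg)
{
    long id = (long)arg;
    long chunk = N / NTHREADS;
    long start = id * chunk;
    long end   = (id == NTHREADS - 1) ? N : start + chunk;

    for (long i = start; i < end; i++)
        a[i] = i * 0.5;
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}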
Distributed Memory / Message Passing Model
In this model, a set of tasks use their own local memory during computation and
exchange data by explicitly sending and receiving messages.
Implementations:
From a programming perspective, message passing implementations usually comprise a
library of subroutines. Calls to these subroutines are embedded in source code. The
programmer is responsible for determining all parallelism.
Historically, a variety of message passing libraries have been available since the 1980s.
These implementations differed substantially from each other making it difficult for
programmers to develop portable applications.
In 1992, the MPI Forum was formed with the primary goal of establishing a standard
interface for message passing implementations.
Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was
released in 1996.
MPI is now the "de facto" industry standard for message passing, replacing virtually all
other message passing implementations used for production work. MPI
implementations exist for virtually all popular parallel computing platforms, although
not all implementations include everything in both MPI-1 and MPI-2.
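As an illustration of the message passing model, here is a minimal C sketch using
standard MPI point-to-point calls to send one value from task 0 to task 1. It assumes an
MPI installation (built with mpicc) and at least two processes (e.g., mpirun -np 2); the
value sent is arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        double value = 3.14159;
        /* Task 0 sends one double to task 1; a matching receive is required. */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double value;
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}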
Data Parallel Model
Implementations:
Programming with the data parallel model is usually accomplished by writing a program
with data parallel constructs. The constructs can be calls to a data parallel subroutine
library or compiler directives recognized by a data parallel compiler.
Fortran 90 and 95 (F90, F95): ISO/ANSI standard extensions to Fortran 77.
o Contains everything that is in Fortran 77
o New source code format; additions to character set
o Additions to program structure and commands
o Variable additions - methods and arguments
o Pointers and dynamic memory allocation added
o Array processing (arrays treated as objects) added
o Recursive and new intrinsic functions added
o Many other new features
Implementations are available for most common parallel platforms.
Compiler Directives: Allow the programmer to specify the distribution and alignment of
data. Fortran implementations are available for most common parallel platforms.
Distributed memory implementations of this model usually require the compiler to
produce object code with calls to a message passing library (MPI) for data distribution.
All message passing is done invisibly to the programmer.
Hybrid Model
A hybrid model combines more than one of the previously described programming
models. Currently, a common example of a hybrid model is the combination of the
message passing model (MPI) with the threads model (OpenMP).
o Threads perform computationally intensive kernels using local, on-node data
o Communications between processes on different nodes occurs over the network
using MPI
This hybrid model lends itself well to the increasingly common hardware environment
of clustered multi/many-core machines.
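A minimal sketch of this MPI + OpenMP combination in C, assuming both an MPI library
and OpenMP compiler support: each process sums part of a series with its own OpenMP
threads (on-node), then the per-process results are combined with MPI (across nodes).
The series being summed is purely illustrative.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;

    /* Request thread support from MPI; MPI calls here are made only
       outside parallel regions, so FUNNELED is sufficient. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;

    /* On-node parallelism: OpenMP threads work on local data. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0);

    /* Off-node parallelism: combine per-process results with MPI. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}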
Another similar and increasingly popular example of a hybrid model is using MPI with
GPU (Graphics Processing Unit) programming.
o GPUs perform computationally intensive kernels using local, on-node data
o Communications between processes on different nodes occurs over the network
using MPI
SPMD (Single Program Multiple Data) Model
All tasks execute their copy of the same program simultaneously, but each task may use
different data. SPMD programs usually have the necessary logic programmed into them
to allow different tasks to branch or conditionally execute only those parts of the program
they are designed to execute. That is, tasks do not necessarily have to execute the entire
program - perhaps only a portion of it.
The SPMD model, using message passing or hybrid programming, is probably the most
commonly used parallel programming model for multi-node clusters.
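A tiny C/MPI sketch of the SPMD idea: every task runs the same executable but branches
on its rank, so each executes only its own portion of the program. The division of roles
shown here is purely illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Same program everywhere, but each task executes only its branch. */
    if (rank == 0)
        printf("task %d: doing the I/O and coordination portion\n", rank);
    else
        printf("task %d: doing a computation portion\n", rank);

    MPI_Finalize();
    return 0;
}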
Automatic vs. Manual Parallelization
Designing and developing parallel programs has characteristically been a very manual
process. The programmer is typically responsible for both identifying and actually
implementing parallelism.
Very often, manually developing parallel codes is a time consuming, complex, error-prone and iterative process.
For a number of years now, various tools have been available to assist the programmer
with converting serial programs into parallel programs. The most common type of tool
used to automatically parallelize a serial program is a parallelizing compiler or preprocessor.
A parallelizing compiler generally works in two different ways:
o Fully Automatic
o The compiler analyzes the source code and identifies opportunities for
parallelism.
o The analysis includes identifying inhibitors to parallelism and possibly a
cost weighting on whether or not the parallelism would actually improve
performance.
o Loops (do, for) are the most frequent target for automatic
parallelization.
o Programmer Directed
o Using "compiler directives" or possibly compiler flags, the programmer
explicitly tells the compiler how to parallelize the code.
o May be used in conjunction with some degree of automatic parallelization
(a simple directive-based sketch appears after the caveats below).
If you are beginning with an existing serial code and have time or budget constraints,
then automatic parallelization may be the answer. However, there are several important
caveats that apply to automatic parallelization:
o Wrong results may be produced
o Performance may actually degrade
o Much less flexible than manual parallelization
o Limited to a subset (mostly loops) of code
o May actually not parallelize code if the analysis suggests there are inhibitors or
the code is too complex
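As a simple illustration of the programmer-directed approach mentioned above, the
following C sketch uses an OpenMP compiler directive to tell the compiler that a loop's
iterations are independent. It assumes a compiler with OpenMP support (built with an
-fopenmp-style flag); the arrays and the computation are illustrative.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        b[i] = i;

    /* The directive tells the compiler this loop's iterations are
       independent and may be divided among threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[10] = %f\n", a[10]);
    return 0;
}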
The remainder of this section applies to the manual method of developing parallel codes.
Understand the Problem and the Program
Undoubtedly, the first step in developing parallel software is to understand the
problem that you wish to solve in parallel. If you are starting with a serial program, this
necessitates understanding the existing code also.
Before spending time in an attempt to develop a parallel solution for a problem,
determine whether or not the problem is one that can actually be parallelized.
o Example of Parallelizable Problem:
Calculate the potential energy for each of several thousand
independent conformations of a molecule. When done, find the
minimum energy conformation.
o Example of a Non-parallelizable Problem:
Calculation of the Fibonacci series by use of the formula F(n) = F(n-1) + F(n-2).
The calculation of the F(n) value uses those of both F(n-1) and F(n-2), which
must be computed first. These three terms cannot be calculated independently
and therefore not in parallel.
Partitioning
One of the first steps in designing a parallel program is to break the problem into
discrete "chunks" of work that can be distributed to multiple tasks. This is known as
decomposition or partitioning.
There are two basic ways to partition computational work among parallel tasks: domain
decomposition and functional decomposition.
Domain Decomposition:
In this type of partitioning, the data associated with a problem is decomposed. Each
parallel task then works on a portion of the data.
There are different ways to partition data, for example one-dimensional or
two-dimensional distributions; a simple one-dimensional block decomposition sketch
follows.
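As a sketch of a one-dimensional block decomposition, the following C helper computes
the index range each task would own. The function name block_range and the example
sizes are hypothetical, not part of any standard API.

#include <stdio.h>

/* One-dimensional BLOCK decomposition of n elements over ntasks tasks.
   Each task computes the half-open index range [mystart, myend) it owns.
   Handles the remainder when n is not evenly divisible. */
static void block_range(int n, int ntasks, int taskid,
                        int *mystart, int *myend)
{
    int chunk = n / ntasks;
    int rem   = n % ntasks;

    /* The first 'rem' tasks each get one extra element. */
    *mystart = taskid * chunk + (taskid < rem ? taskid : rem);
    *myend   = *mystart + chunk + (taskid < rem ? 1 : 0);
}

int main(void)
{
    int start, end;
    for (int t = 0; t < 4; t++) {
        block_range(10, 4, t, &start, &end);
        printf("task %d owns indices [%d, %d)\n", t, start, end);
    }
    return 0;
}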
Functional Decomposition:
In this approach, the focus is on the computation that is to be performed rather than on
the data manipulated by the computation. The problem is decomposed according to the
work that must be done. Each task then performs a portion of the overall work.
Functional decomposition lends itself well to problems that can be split into different
tasks. For example:
Ecosystem Modeling
Each program calculates the population of a given group, where each group's growth
depends on that of its neighbors. As time progresses, each process calculates its current
state, then exchanges information with the neighbor populations. All tasks then progress
to calculate the state at the next time step.
Signal Processing
An audio signal data set is passed through four distinct computational filters. Each filter
is a separate process. The first segment of data must pass through the first filter before
progressing to the second. When it does, the second segment of data passes through the
first filter. By the time the fourth segment of data is in the first filter, all four tasks are
busy.
Climate Modeling
Each model component can be thought of as a separate task. Arrows represent exchanges
of data between components during computation: the atmosphere model generates wind
velocity data that are used by the ocean model, the ocean model generates sea surface
temperature data that are used by the atmosphere model, and so on.
Communications
Who Needs Communications?
The need for communications between tasks depends upon your problem. Some problems
can be decomposed into tasks that need virtually no data sharing (embarrassingly parallel
problems); most parallel applications, however, do require tasks to share data with each
other.
Factors to Consider:
There are a number of important factors to consider when designing your program's
inter-task communications:
Cost of communications
o Inter-task communication virtually always implies overhead.
o Machine cycles and resources that could be used for computation are instead used
to package and transmit data.
Efficiency of communications
o Very often, the programmer will have a choice with regard to factors that can
affect communications performance. Only a few are mentioned here.
o Which implementation for a given model should be used? Using the Message
Passing Model as an example, one MPI implementation may be faster on a given
hardware platform than another.
o What type of communication operations should be used? As mentioned
previously, asynchronous communication operations can improve overall program
performance.
o Network media - some platforms may offer more than one network for
communications. Which one is best?
Overhead and Complexity
Finally, realize that this is only a partial list of things to consider!!!
Synchronization
Types of Synchronization:
Barrier
o Usually implies that all tasks are involved
o Each task performs its work until it reaches the barrier. It then stops, or "blocks".
o When the last task reaches the barrier, all tasks are synchronized.
o What happens from here varies. Often, a serial section of work must be done. In
other cases, the tasks are automatically released to continue their work.
Lock / semaphore
o Can involve any number of tasks
o Typically used to serialize (protect) access to global data or a section of code.
Only one task at a time may use (own) the lock / semaphore / flag.
o The first task to acquire the lock "sets" it. This task can then safely (serially)
access the protected data or code.
o Other tasks can attempt to acquire the lock but must wait until the task that owns
the lock releases it.
o Can be blocking or non-blocking (a minimal locking sketch in C appears after this
list)
Synchronous communication operations
o Involves only those tasks executing a communication operation
o When a task performs a communication operation, some form of coordination is
required with the other task(s) participating in the communication. For example,
before a task can perform a send operation, it must first receive an
acknowledgment from the receiving task that it is OK to send.
o Discussed previously in the Communications section.
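Below is a minimal sketch of the lock idea, assuming POSIX threads in C: each thread must
acquire the mutex before touching the shared counter, which serializes access to the
protected data. The counter and iteration counts are illustrative.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared ("global") data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* only one task may own the lock */
        counter++;                       /* serialized access to shared data */
        pthread_mutex_unlock(&lock);     /* release so other tasks can proceed */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, increment, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected 400000)\n", counter);
    return 0;
}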
Data Dependencies
Definition:
A dependence exists between program statements when the order of statement execution
affects the results of the program.
A data dependence results from multiple use of the same location(s) in storage by
different tasks.
Dependencies are important to parallel programming because they are one of the
primary inhibitors to parallelism.
Examples:
Loop carried data dependence:
DO 500 J = MYSTART,MYEND
A(J) = A(J-1) * 2.0
500 CONTINUE
The value of A(J-1) must be computed before the value of A(J), therefore A(J) exhibits a
data dependency on A(J-1). Parallelism is inhibited.
If Task 2 has A(J) and task 1 has A(J-1), computing the correct value of A(J)
necessitates:
o Distributed memory architecture - task 2 must obtain the value of A(J-1) from task
1 after task 1 finishes its computation
o Shared memory architecture - task 2 must read A(J-1) after task 1 updates it
Loop independent data dependence:
task 1          task 2
------          ------
X = 2           X = 4
  .               .
  .               .
Y = X**2        Y = X**3
As with the previous example, parallelism is inhibited. The value of Y is dependent on:
o Distributed memory architecture - if or when the value of X is communicated
between the tasks.
o Shared memory architecture - which task last stores the value of X.
Although all data dependencies are important to identify when designing parallel
programs, loop carried dependencies are particularly important since loops are possibly
the most common target of parallelization efforts.
Load Balancing
Load balancing refers to the practice of distributing work among tasks so that all tasks
are kept busy all of the time. It can be considered a minimization of task idle time.
Load balancing is important to parallel programs for performance reasons. For example,
if all tasks are subject to a barrier synchronization point, the slowest task will determine
the overall performance.
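One common way to reduce idle time when the cost per iteration varies is dynamic
assignment of work. A sketch in C/OpenMP, where the schedule(dynamic) clause hands
out small chunks of iterations to threads as they become free; the work() function is a
hypothetical, deliberately unbalanced workload.

#include <omp.h>
#include <stdio.h>

/* Hypothetical work item whose cost varies strongly with i. */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k < (i % 1000) * 1000; k++)
        s += 1.0;
    return s;
}

int main(void)
{
    double total = 0.0;

    /* schedule(dynamic) hands out iterations in small chunks as threads
       become free, instead of one fixed block per thread, so no thread
       sits idle while others finish expensive iterations. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int i = 0; i < 10000; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}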
Granularity
Computation / Communication Ratio: periods of computation are typically separated from
periods of communication by synchronization events. In fine-grain parallelism, relatively
small amounts of computational work are done between communication events; in
coarse-grain parallelism, relatively large amounts of computational work are done between
communication events.
Which is Best?
The most efficient granularity is dependent on the algorithm and the hardware
environment in which it runs.
In most cases the overhead associated with communications and synchronization is high
relative to execution speed so it is advantageous to have coarse granularity.
Fine-grain parallelism can help reduce overheads due to load imbalance.
I/O
The Bad News: I/O operations are generally regarded as inhibitors to parallelism.
Amdahl's Law:
Amdahl's Law states that potential program speedup is defined by the fraction of code (P)
that can be parallelized:
speedup = 1 / (1 - P)

Introducing the number of processors performing the parallel fraction of work, the
relationship can be modeled by:

speedup = 1 / (P/N + S)

where P = parallel fraction, N = number of processors and S = serial fraction.

                          speedup
    N          P = .50      P = .90      P = .99
    10           1.82         5.26         9.17
    100          1.98         9.17        50.25
    1000         1.99         9.91        90.99
    10000        1.99         9.91        99.02
    100000       1.99         9.99        99.90
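A small C program that evaluates the formula above and reproduces the table values (the
loop bounds and output formatting are incidental):

#include <stdio.h>

/* speedup = 1 / (P/N + S), where P is the parallel fraction,
   S = 1 - P the serial fraction, and N the number of processors. */
static double speedup(double P, double N)
{
    return 1.0 / (P / N + (1.0 - P));
}

int main(void)
{
    const double P[] = { 0.50, 0.90, 0.99 };
    const double N[] = { 10, 100, 1000, 10000, 100000 };

    for (int i = 0; i < 5; i++) {
        printf("N = %8.0f:", N[i]);
        for (int j = 0; j < 3; j++)
            printf("  %7.2f", speedup(P[j], N[i]));
        printf("\n");
    }
    return 0;
}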
Certain problems, however, demonstrate increased performance by increasing the problem
size. For example, suppose a computation has the following timings:

    2D Grid Calculations:   85 seconds   (85%)
    Serial fraction:        15 seconds   (15%)
We can increase the problem size by doubling the grid dimensions and halving the time
step. This results in four times the number of grid points and twice the number of time
steps. The timings then look like:
    2D Grid Calculations:  680 seconds   (97.84%)
    Serial fraction:        15 seconds    (2.16%)
Problems that increase the percentage of parallel time with their size are more scalable
than problems with a fixed percentage of parallel time.
Complexity:
In general, parallel applications are much more complex than corresponding serial
applications, perhaps an order of magnitude. Not only do you have multiple instruction
streams executing at the same time, but you also have data flowing between them.
The costs of complexity are measured in programmer time in virtually every aspect of
the software development cycle:
o Design
o Coding
o Debugging
o Tuning
o Maintenance
Adhering to "good" software development practices is essential when working
with parallel applications - especially if somebody besides you will have to work with
the software.
Portability:
Thanks to standardization in several APIs, such as MPI, POSIX threads, HPF and
OpenMP, portability issues with parallel programs are not as serious as in years past.
However...
All of the usual portability issues associated with serial programs apply to parallel
programs. For example, if you use vendor "enhancements" to Fortran, C or C++,
portability will be a problem.
Even though standards exist for several APIs, implementations will differ in a number of
details, sometimes to the point of requiring code modifications in order to effect
portability.
Operating systems can play a key role in code portability issues.
Hardware architectures are characteristically highly variable and can affect portability.
Resource Requirements:
The primary intent of parallel programming is to decrease execution wall clock time,
however in order to accomplish this, more CPU time is required. For example, a parallel
code that runs in 1 hour on 8 processors actually uses 8 hours of CPU time.
The amount of memory required can be greater for parallel codes than serial codes, due
to the need to replicate data and for overheads associated with parallel support libraries
and subsystems.
For short running parallel programs, there can actually be a decrease in performance
compared to a similar serial implementation. The overhead costs associated with setting
up the parallel environment, task creation, communications and task termination can
comprise a significant portion of the total execution time for short runs.
Scalability:
Parallel Examples
Array Processing
This example demonstrates calculations on two-dimensional array elements, where the
computation on each array element is independent of the others. The serial program
calculates one element at a time:
do j = 1,n
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
Array Processing - Parallel Solution 1
Each task executes the portion of the loop corresponding to the data it owns; here each
task computes columns mystart through myend of the array:
do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
1.7.2 Task-Farming (or Master/Slave)
In the task-farming (master/slave) paradigm, a master process decomposes the problem
into small tasks and distributes them among slave processes, gathering the partial results
to produce the final result. With static load-balancing, the distribution of tasks to the
slaves can be done once or in a cyclic way. Figure 1.4 presents a schematic representation
of this first approach.
The other way is to use a dynamically load-balanced master/slave paradigm,
which can be more suitable when the number of tasks exceeds the number of available processors, or when the number of tasks is unknown at the start of the application, or when the execution times are not predictable, or when we are dealing
with unbalanced problems. An important feature of dynamic load-balancing is the
ability of the application to adapt itself to changing conditions of the system, not
just the load of the processors, but also a possible reconfiguration of the system
resources. Due to this characteristic, this paradigm can respond quite well to the
failure of some processors, which simplifies the creation of robust applications that
are capable of surviving the loss of slaves or even the master.
At an extreme, this paradigm can also enclose some applications that are based
on a trivial decomposition approach: the sequential algorithm is executed simultaneously
on different processors but with different data inputs. In such applications
there are no dependencies between different runs so there is no need for communication
or coordination between the processes.
This paradigm can achieve high computational speedups and an interesting degree of scalability. However, for a large number of processors the centralized control
of the master process can become a bottleneck to the applications. It is, however,
possible to enhance the scalability of the paradigm by extending the single master
to a set of masters, each of them controlling a different group of slave processes.
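Below is a sketch in C/MPI of the dynamically load-balanced master/slave structure
described above: the master hands out tasks one at a time, gives each slave a new task as
soon as it returns a result, and finally tells the slaves to terminate. The task itself
(do_task) and the tag names are placeholders.

#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

/* Hypothetical per-task computation. */
static double do_task(int task) { return (double)task * task; }

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                       /* master */
        MPI_Status st;
        double result, sum = 0.0;
        int next = 0, active = 0;

        /* Seed every slave with one task. */
        for (int s = 1; s < nprocs && next < NTASKS; s++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD);

        /* Collect results; hand out remaining tasks as slaves become free. */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            sum += result;
            active--;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            }
        }
        /* Tell all slaves to terminate. */
        for (int s = 1; s < nprocs; s++)
            MPI_Send(&next, 1, MPI_INT, s, TAG_STOP, MPI_COMM_WORLD);
        printf("sum = %f\n", sum);
    } else {                               /* slave: get task, compute, reply */
        MPI_Status st;
        int task;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double r = do_task(task);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}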
Figure 1.4 A static master/slave structure (the master distributes tasks to the slaves,
collects their results via communications, and then terminates).
1.7.3 Single-Program Multiple-Data (SPMD)
The SPMD paradigm is the most commonly used paradigm. Each process executes
basically the same piece of code but on a different part of the data. This involves
the splitting of application data among the available processors. This type of parallelism is also referred to as geometric parallelism, domain decomposition, or data
parallelism. Figure 1.5 presents a schematic representation of this paradigm.
Many physical problems have an underlying regular geometric structure, with
spatially limited interactions. This homogeneity allows the data to be distributed
uniformly across the processors, where each one will be responsible for a defined
spatial area. Processors communicate with neighbouring processors and the communication load will be proportional to the size of the boundary of the element,
while the computation load will be proportional to the volume of the element. It
may also be required to perform some global synchronization periodically among
all the processes. The communication pattern is usually highly structured and extremely predictable. The data may initially be self-generated by each process or
may be read from the disk during the initialization stage.
SPMD applications can be very efficient if the data is well distributed by the
processes and the system is homogeneous. If the processes present different workloads
or capabilities, then the paradigm requires the support of some load-balancing
scheme able to adapt the data distribution layout during run-time execution.
This paradigm is highly sensitive to the loss of some process. Usually, the loss
of a single process is enough to cause a deadlock in the calculation in which none
of the processes can advance beyond a global synchronization point.
Figure 1.5 Basic structure of a SPMD program (data is distributed, each process
alternates calculate and exchange phases, and the results are collected).
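Below is a C/MPI skeleton loosely following the structure of Figure 1.5: data is distributed
(here, self-generated per process), and each process repeatedly exchanges boundary values
with its neighbours and then calculates on its local block. The smoothing operation, block
size and step count are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

#define LOCAL 8                  /* interior points owned by each process */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Local data plus one "ghost" cell on each side for neighbour values. */
    double u[LOCAL + 2];
    for (int i = 0; i < LOCAL + 2; i++)
        u[i] = 0.0;
    for (int i = 1; i <= LOCAL; i++)
        u[i] = rank * LOCAL + i;         /* self-generated initial data */

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 10; step++) {
        /* Exchange boundary values with neighbouring processes. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[LOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Calculate: simple local smoothing using the ghost cells. */
        double v[LOCAL + 2];
        for (int i = 1; i <= LOCAL; i++)
            v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
        for (int i = 1; i <= LOCAL; i++)
            u[i] = v[i];
    }

    printf("rank %d: u[1] = %f\n", rank, u[1]);
    MPI_Finalize();
    return 0;
}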
1.7.4 Data Pipelining
This is a more fine-grained parallelism, which is based on a functional decomposition
approach: the tasks of the algorithm, which are capable of concurrent operation,
are identified and each processor executes a small part of the total algorithm. The
pipeline is one of the simplest and most popular functional decomposition paradigms.
Figure 1.6 presents the structure of this model.
Processes are organized in a pipeline - each process corresponds to a stage of the
pipeline and is responsible for a particular task. The communication pattern can
be very simple since the data flows between the adjacent stages of the pipeline. For
this reason, this type of parallelism is also sometimes referred to as data-flow
parallelism. The communication may be completely asynchronous. The efficiency of this
paradigm is directly dependent on the ability to balance the load across the stages
of the pipeline. The robustness of this paradigm against reconfigurations of the
system can be achieved by providing multiple independent paths across the stages.
This paradigm is often used in data reduction or image processing applications.
Figure 1.6 Data pipeline structure (input flows through Process 1 / Phase A,
Process 2 / Phase B, and Process 3 / Phase C to the output).
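Here is a sketch in C/MPI of the pipeline structure: each MPI process acts as one stage,
receiving a data segment from the previous stage, applying its "filter", and passing the
result on. The filter function and segment count are placeholders.

#include <mpi.h>
#include <stdio.h>

#define NSEGMENTS 16

/* Hypothetical per-stage filter: each stage just adds its stage number. */
static double filter(double x, int stage) { return x + stage + 1; }

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Data segments stream through the stages: each process receives a
       segment from the previous stage, applies its filter, and passes
       the result to the next stage. */
    for (int s = 0; s < NSEGMENTS; s++) {
        double seg;

        if (rank == 0)
            seg = s;                        /* first stage generates the input */
        else
            MPI_Recv(&seg, 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        seg = filter(seg, rank);

        if (rank < nprocs - 1)
            MPI_Send(&seg, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("segment %d leaves the pipeline as %f\n", s, seg);
    }

    MPI_Finalize();
    return 0;
}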
1.7.5 Divide and Conquer
The divide and conquer approach is well known in sequential algorithm development.
A problem is divided up into two or more subproblems. Each of these
subproblems is solved independently and their results are combined to give the final
result. Often, the smaller problems are just smaller instances of the original
problem, giving rise to a recursive solution. Processing may be required to divide
the original problem or to combine the results of the subproblems. In parallel divide
and conquer, the subproblems can be solved at the same time, given sufficient
parallelism. The splitting and recombining process also makes use of some parallelism,
but these operations require some process communication. However, because the
subproblems are independent, no communication is necessary between processes
working on different subproblems.
We can identify three generic computational operations for divide and conquer:
split, compute, and join. The application is organized in a sort of virtual tree: some
of the processes create subtasks and have to combine the results of those to produce
an aggregate result. The tasks are actually computed by the compute processes at
the leaf nodes of the virtual tree. Figure 1.7 presents this execution.
Figure 1.7 Divide and conquer as a virtual tree (the main problem is split into
sub-problems, which are computed and then joined to form the final result).
The task-farming paradigm can be seen as a slightly modified, degenerated form
of divide and conquer; i.e., where problem decomposition is performed before tasks
are submitted, the split and join operations are done only by the master process, and
all the other processes are only responsible for the computation.
In the divide and conquer model, tasks may be generated during runtime and
may be added to a single job queue on the manager processor or distributed through
several job queues across the system.
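A sketch in C of the split/compute/join pattern, using OpenMP tasks for a recursive array
sum: ranges above a cutoff are split and the halves are solved as parallel tasks, the leaves
compute serially, and taskwait performs the join. The cutoff and data are illustrative.

#include <omp.h>
#include <stdio.h>

#define N      1000000
#define CUTOFF 10000

static double a[N];

/* Divide and conquer: split the range, solve the halves in parallel
   (compute at the leaves), and join the partial results. */
static double sum(int lo, int hi)
{
    if (hi - lo <= CUTOFF) {             /* compute: small enough for serial work */
        double s = 0.0;
        for (int i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;        /* split */
    double s1, s2;
    #pragma omp task shared(s1)
    s1 = sum(lo, mid);
    #pragma omp task shared(s2)
    s2 = sum(mid, hi);
    #pragma omp taskwait                 /* join */
    return s1 + s2;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    double total;
    #pragma omp parallel
    #pragma omp single
    total = sum(0, N);

    printf("total = %f (expected %d)\n", total, N);
    return 0;
}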
The programming paradigms can be mainly characterized by two factors: decomposition and distribution of the parallelism. For instance, in geometric parallelism both the decomposition and distribution are static. The same happens with
the functional decomposition and distribution of data pipelining. In task farming, the work is statically decomposed but dynamically distributed. Finally, in the
divide and conquer paradigm both decomposition and distribution are dynamic.
1.7.6 Speculative Parallelism
This paradigm is employed when it is quite difficult to obtain parallelism through
any one of the previous paradigms. Some problems have complex data dependencies,
which reduces the possibilities of exploiting the parallel execution. In these cases,
an appropriate solution is to execute the problem in small parts but use some
speculation or optimistic execution to facilitate the parallelism.
In some asynchronous problems, like discrete-event simulation [17], the system
will attempt the look-ahead execution of related activities in an optimistic assumption
that such concurrent executions do not violate the consistency of the problem
execution. Sometimes they do, and in such cases it is necessary to rollback to some
previous consistent state of the application.
Another use of this paradigm is to employ different algorithms for the same
problem; the first one to give the final solution is the one that is chosen.
1.7.7 Hybrid Models
The boundaries between the paradigms can sometimes be fuzzy and, in some applications,
there could be the need to mix elements of different paradigms. Hybrid
methods that include more than one basic paradigm are usually observed in some
large-scale parallel applications. These are situations where it makes sense to mix
data and task parallelism simultaneously or in di_erent parts of the same program.