03-Task Decomposition and Mapping
Alexandre David
Overview
Introduction to parallel algorithms
Decomposition techniques
Task interactions
Load balancing
Introduction
Parallel algorithms have the added dimension of
concurrency.
Typical tasks:
Identify concurrent work.
Map them to processors.
Distribute inputs, outputs, and other data.
Manage shared resources.
Synchronize the processors.
Decomposing problems
Decomposition into concurrent tasks.
No unique solution.
Different sizes.
Decomposition illustrated as a directed graph:
Nodes = tasks.
Edges = dependency.
Many solutions are often possible but few will yield good performance and be
scalable. We have to consider the computational and storage resources
needed to solve the problems.
Task size here means the amount of work to do. It can be larger, smaller, or unknown; unknown sizes are typical of search algorithms, for example.
Dependency: all the results from incoming edges are required by the task at the current node.
We will not consider tools for automatic decomposition; they work fairly well only for highly structured programs or portions of programs.
Matrix-vector multiplication example: N tasks, 1 task/row.
The question is: How to decompose this into concurrent tasks? Different tasks
may generate intermediate results that will be used by other tasks.
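As a sketch of the one-task-per-row decomposition (Python's multiprocessing.Pool is just one possible way to map the tasks to processes; the names row_task and matvec are illustrative):

from multiprocessing import Pool

def row_task(args):
    """One task: dot product of one matrix row with the vector."""
    row, b = args
    return sum(a * x for a, x in zip(row, b))

def matvec(A, b, processes=4):
    """Row-wise decomposition: N tasks, one per row of A."""
    with Pool(processes) as pool:
        return pool.map(row_task, [(row, b) for row in A])

if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    b = [1, 1]
    print(matvec(A, b))  # [3, 7, 11]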
A solution
Measure of concurrency? Number of processors? Optimal?
Another Solution
Better or worse?
Granularity
Number and size of tasks.
Fine-grained: many small tasks.
Coarse-grained: few large tasks.
Related: degree of concurrency.
(Nb. of tasks executable in parallel).
Maximal degree of concurrency.
Average degree of concurrency.
Matrix-vector multiplication: N tasks, 3 tasks/row.
Granularity
What is the average degree of concurrency if we take the varying amounts of work into account?
Critical path = longest directed path between any pair of start and finish nodes.
Critical path length = sum of the weights of the nodes along this path.
Average degree of concurrency = total amount of work / critical path length.
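These definitions can be computed directly on a task dependency graph. A small sketch (the node weights and edges below are made up for illustration, not taken from the figure):

def critical_path_length(weights, edges):
    """Longest weighted path in a task dependency DAG.
    weights: {task: work}, edges: list of (pred, succ)."""
    succs = {t: [] for t in weights}
    indeg = {t: 0 for t in weights}
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
    # Longest path ending at each node, processed in topological order.
    longest = {t: weights[t] for t in weights}
    ready = [t for t in weights if indeg[t] == 0]
    while ready:
        u = ready.pop()
        for v in succs[u]:
            longest[v] = max(longest[v], longest[u] + weights[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(longest.values())

# Illustrative task graph: four leaf tasks feeding two intermediate tasks and one final task.
weights = {1: 10, 2: 10, 3: 10, 4: 10, 5: 6, 6: 11, 7: 7}
edges = [(1, 5), (2, 5), (3, 6), (4, 6), (5, 7), (6, 7)]
cp = critical_path_length(weights, edges)
print(cp, sum(weights.values()) / cp)  # critical path length, average degree of concurrency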
Database example
Critical path: 3 nodes.
Critical path length = 27.
Average degree of concurrency = 63/27 ≈ 2.33.
(The other decomposition gives 1.88.)
Exercise
For each of the task dependency graphs (a), (b), (c), and (d), determine:
Maximum degree of concurrency.
Critical path length.
Maximum possible speedup.
Minimum number of processes needed to reach this speedup.
Maximum speedup if we limit the number of processes to 2, 4, and 8.
Mapping example
Notice that the mapping keeps the same process across the dependency from the previous stage: we avoid interaction by keeping dependent tasks on the same process.
Decomposition techniques
Recursive decomposition.
Divide-and-conquer.
Data decomposition.
Large data structure.
Exploratory decomposition.
Search algorithms.
Speculative decomposition.
Dependent choices in computations.
Recursive decomposition
Problem solvable by divide-and-conquer:
Decompose into sub-problems.
Do it recursively.
Quicksort example
Recursion tree from the figure: each pivot (5, 3, 9, 7, 10, 11) splits its sub-array into independent sub-tasks (elements smaller than the pivot and the rest).
Minimal number example: find the minimum of {4, 9, 1, 7, 8, 11, 2, 12} by recursive decomposition (sketched below).
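A serial sketch of this recursive decomposition; each recursive call works on an independent half, so the two calls are tasks that could run in parallel:

def rec_min(a):
    """Recursive decomposition: the two halves are independent tasks."""
    if len(a) == 1:
        return a[0]
    mid = len(a) // 2
    left = rec_min(a[:mid])    # task 1
    right = rec_min(a[mid:])   # task 2, independent of task 1
    return min(left, right)    # combine step

print(rec_min([4, 9, 1, 7, 8, 11, 2, 12]))  # 1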
Data decomposition
2 steps:
Partition the data.
Induce partition into tasks.
How to partition data?
Partition output data:
Independent sub-outputs (sketched after this list).
Partition input data:
Local computations, followed by combination.
1-D, 2-D, 3-D block decomposition.
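A minimal sketch of output-data partitioning for matrix multiplication: the output C is partitioned into blocks and each block induces one independent task (1x1 blocks here just to keep the example tiny; the function name block_task is illustrative):

def block_task(A, B, r0, r1, c0, c1):
    """One task under output partitioning: compute the block
    C[r0:r1, c0:c1] from the rows of A and columns of B it needs."""
    n = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(c0, c1)]
            for i in range(r0, r1)]

# 2x2 grid of output blocks of C = A*B; each call is an independent task.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
blocks = [block_task(A, B, i, i + 1, j, j + 1) for i in range(2) for j in range(2)]
print(blocks)  # [[[19]], [[22]], [[43]], [[50]]]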
We can partition further for the tasks. Notice the dependency between tasks.
What is the task dependency graph?
Linear combination of the intermediate results.
Owner-compute rule
Process assigned to some data
is responsible for all computations associated with it.
Input data decomposition:
All computations done on the (partitioned) input data
are done by the process.
Output data decomposition:
All computations for the (partitioned) output data are
done by the process.
Exploratory decomposition
Model-checker example
From the model (syntax) we explore the states (semantics).
Suitable for search algorithms: partition the search space into smaller parts and search them in parallel (see the sketch below). The solution is found by a tree-search technique.
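A toy sketch of exploratory decomposition on the 8-queens puzzle (not the model-checker example): the search space is partitioned by the choice made for the first row, and the parts are explored concurrently with a process pool.

from concurrent.futures import ProcessPoolExecutor

def queens_from(first_col, n=8):
    """One task: search all boards whose first-row queen is in first_col."""
    def extend(cols):
        row = len(cols)
        if row == n:
            return list(cols)
        for c in range(n):
            if all(c != pc and abs(c - pc) != row - pr
                   for pr, pc in enumerate(cols)):
                found = extend(cols + [c])
                if found:
                    return found
        return None
    return extend([first_col])

if __name__ == "__main__":
    # Partition the search space by the first decision; explore the parts in parallel.
    with ProcessPoolExecutor() as ex:
        for solution in ex.map(queens_from, range(8)):
            if solution:
                print(solution)
                break  # work done on the other parts may be wasted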
Performance anomalies
Work depends on the order of the search!
Speculative decomposition
Dependencies between tasks are not known a priori.
How to identify independent tasks?
Conservative approach: identify tasks that are
guaranteed to be independent.
Optimistic approach: schedule tasks even if we are not sure they are independent; roll back later if needed (see the sketch below).
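A minimal sketch of the optimistic approach: both branches are started speculatively before the (slow) condition is known, and the unused result is simply discarded. The function names are illustrative, and the branches here are pure functions, so no real roll-back of side effects is needed.

from concurrent.futures import ThreadPoolExecutor

def slow_condition():
    # Stand-in for a long computation that decides which branch is taken.
    return sum(range(10**6)) % 2 == 0

def branch_true():
    return "result of branch A"

def branch_false():
    return "result of branch B"

if __name__ == "__main__":
    with ThreadPoolExecutor() as ex:
        # Speculation: start both branches before the condition is known.
        fut_true = ex.submit(branch_true)
        fut_false = ex.submit(branch_false)
        cond = slow_condition()
        result = fut_true.result() if cond else fut_false.result()
        print(result)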
So far
Decomposition techniques.
Identify tasks.
Analyze with task dependency & interaction graphs.
Map tasks to processes.
Now properties of tasks that affect a good mapping.
Task generation, size of tasks, and size of data.
Task generation
Static task generation.
Tasks are known beforehand.
Applies to well-structured problems.
Dynamic task generation.
Tasks generated on-the-fly.
Tasks & task dependency graph not available
beforehand.
Task sizes
Relative amount of time for completion.
Uniform: same size for all tasks.
Matrix multiplication.
Non-uniform.
Optimization & search problems.
The color of each pixel is determined as the weighted average of its original
value and the values of the neighboring pixels. Decompose into regions, 1
task/region. Pattern is a 2-D mesh. Regular pattern.
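A sketch of the region decomposition, with plain (unweighted) averaging for brevity: each task smooths its strip of rows and also reads the boundary rows of the neighbouring regions, which is exactly where the inter-task interaction comes from (2-D blocks instead of strips would give the 2-D mesh pattern of the slide).

def smooth_region(img, r0, r1):
    """One task: smooth rows r0..r1-1; it reads boundary rows of neighbouring regions."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(r0, r1):
        row = []
        for j in range(w):
            neigh = [img[i][j]]
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    neigh.append(img[ni][nj])
            row.append(sum(neigh) / len(neigh))
        out.append(row)
    return out

# Two regions (strips of rows); each call is one task.
img = [[0, 0, 0, 0], [0, 8, 8, 0], [0, 8, 8, 0], [0, 0, 0, 0]]
top, bottom = smooth_region(img, 0, 2), smooth_region(img, 2, 4)
print(top + bottom)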
Read-write interactions.
Read & modify data of other tasks.
Mapping techniques
Static mapping.
NP-complete problem for non-uniform tasks.
Large data compared to computation.
Dynamic mapping.
Dynamically generated tasks.
Task size unknown.
Example: Matrix*Matrix
In the case of n*n matrix multiplication, a 1-D decomposition allows at most n processes, a 2-D decomposition at most n^2 processes.
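A small sketch of the 2-D case: with q*q processes, each process owns an (n/q)*(n/q) block of the output; the function and the rank numbering below are illustrative.

def block_owner_2d(i, j, n, q):
    """Process that owns output element C[i, j] under a q x q
    (2-D) block decomposition of an n x n matrix."""
    br, bc = i // (n // q), j // (n // q)  # block coordinates
    return br * q + bc                     # process rank

n, q = 8, 4                          # up to q*q = 16 processes for 2-D blocks
print(block_owner_2d(5, 2, n, q))    # owner of C[5, 2] -> block (2, 1) -> rank 9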
Imbalance problem
If the amount of computation associated with data varies
a lot then block decomposition leads to imbalances.
Example: LU factorization (or Gaussian elimination).
Exercise on LU-decomposition.
LU factorization
Non-singular (invertible) square matrix A.
A = L*U.
Useful for solving linear equations.
LU factorization
In practice we work on A.
N steps
LU algorithm
Proc LU(A)
begin
  for k := 1 to n-1 do
    for j := k+1 to n do
      A[j,k] := A[j,k] / A[k,k]            /* normalization: becomes L[j,k] */
    endfor
    for j := k+1 to n do
      for i := k+1 to n do
        A[i,j] := A[i,j] - A[i,k]*A[k,j]   /* update with L[i,k]*U[k,j] */
      endfor
    endfor
  endfor
end
After step k, row k from the diagonal onwards holds U[k,k..n], column k below the diagonal holds L[k+1..n,k], and the trailing submatrix is the remaining A.
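A compact NumPy version of the same in-place algorithm (no pivoting, so it assumes nonzero pivots), just to make the update pattern concrete:

import numpy as np

def lu_in_place(A):
    """Doolittle LU factorization without pivoting, in place on A,
    following the pseudocode above (assumes nonzero pivots)."""
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                               # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])   # update trailing block
    return A

A = np.array([[4., 3.], [6., 3.]])
LU = lu_in_place(A.copy())
L = np.tril(LU, -1) + np.eye(2)
U = np.triu(LU)
print(np.allclose(L @ U, A))  # True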
Decomposition
Exercise:
Task dependency graph?
Mapping to 3 & 4 processes?
Block-Cyclic Distributions
Block-cyclic distribution reduces the amount of idling because all processes get a sampling of tasks from all parts of the matrix.
But the loss of locality may result in performance penalties and leads to a higher degree of interaction. Choose a good value for the block size to find a compromise.
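A one-dimensional sketch of the idea: rows are grouped into blocks and the blocks are dealt out cyclically, so every process gets rows from all parts of the matrix; the block size is the parameter to tune.

def block_cyclic_owner(i, block_size, p):
    """Process owning row i when rows are dealt out in blocks of
    block_size to p processes in a round-robin (cyclic) fashion."""
    return (i // block_size) % p

n, p, b = 16, 4, 2
print([block_cyclic_owner(i, b, p) for i in range(n)])
# [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]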
Randomized distributions
Graph partitioning
For sparse data structures and data dependent
interaction patterns.
Numerical simulations. Discretize the problem and
represent it as a mesh.
Sparse matrix: assign equal number of nodes to
processes & minimize interaction.
Example: simulation of dispersion of a water contaminant
in Lake Superior.
Discretization
Random partitioning.
Minimum edge cut from a graph point of view. Keep locality of data with
processes to minimize interaction.
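A tiny illustration of the objective: the edge cut counts the edges whose endpoints are mapped to different processes, i.e. the interactions that remain after partitioning (the graph and partitions below are made up).

def edge_cut(edges, part):
    """Number of edges whose endpoints are assigned to different processes.
    Graph partitioning tries to minimize this while keeping parts equally sized."""
    return sum(1 for u, v in edges if part[u] != part[v])

# Toy mesh: 4 nodes in a square, partitioned onto 2 processes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(edge_cut(edges, {0: 0, 1: 0, 2: 1, 3: 1}))  # 2
print(edge_cut(edges, {0: 0, 1: 1, 2: 0, 3: 1}))  # 4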
Mapping onto 8 processes.
Hierarchical mappings
Combine several mapping techniques in a structured
(hierarchical) way.
Task mapping of a binary tree (quicksort) does not use
all processors.
Mapping based on task dependency graph (hierarchy)
& block.
Replication is useful when the cost of interaction is greater than the cost of replicating the computation. Replicating data is like caching: good for read-only accesses.
"Processing power is cheap, memory access is expensive" also applies at a larger scale, with communicating processes.
Collective communication, such as broadcast, is used for this. However, depending on the communication pattern, a custom collective communication may be better.
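For instance, with mpi4py a broadcast replicates read-only data on all processes (a sketch assuming mpi4py and an MPI installation are available; the data is illustrative):

# Run with e.g.: mpiexec -n 4 python bcast_example.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Process 0 owns the replicated (read-only) data; everyone gets a copy.
data = {"coefficients": [1.0, 2.0, 3.0]} if rank == 0 else None
data = comm.bcast(data, root=0)
print(f"process {rank} has {data}")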