Parallel Random Access Machine (PRAM)

PRAM Algorithms
Arvind Krishnamurthy
Fall 2004

- Collection of numbered processors accessing shared memory cells
- Each processor could have local memory (registers)
- Each processor can access any shared memory cell in unit time
- Input is stored in shared memory cells; output also needs to be stored in shared memory
- PRAM instructions execute in 3-phase cycles:
  - Read (if any) from a shared memory cell
  - Local computation (if any)
  - Write (if any) to a shared memory cell
Different variations:
- Exclusive Read Exclusive Write (EREW) PRAM: no two processors are allowed to read or write the same shared memory cell simultaneously
- Concurrent Read Exclusive Write (CREW): simultaneous reads allowed, but only one processor can write
- Concurrent Read Concurrent Write (CRCW), with several rules for resolving concurrent writes:
  - Priority CRCW: processors are assigned fixed distinct priorities; the highest-priority writer wins
  - Arbitrary CRCW: one of the writes wins, chosen arbitrarily
  - Common CRCW: all processors are allowed to complete the write if and only if all the values to be written are equal
Complexity issues:
- Time complexity = O(log n)
- Total number of steps = n * log n = O(n log n)

Optimal parallel algorithm:
- Total number of steps in the parallel algorithm is equal to the number of steps in a sequential algorithm
- Use n/log n processors instead of n
- Have a local phase followed by a global phase
- Local phase: each processor computes the maximum over log n values
[Figure: PRAM architecture — a control unit and processors P0, P1, ..., Pp, each with a private memory, all connected to a global shared memory.]
- Let there be n processors and 2n inputs
- PRAM model: EREW
- Construct a tournament where values are compared

[Figure: tournament tree over processors P0 through P7 — winners advance pairwise (P0-P2 and P4-P6, then P0-P4), leaving P0 with the maximum.]
- Processors execute these 3-phase PRAM instructions synchronously

Tournament details:
- Processor k is active in step j if (k % 2^j) == 0
- At each step: compare two inputs, take the max of the inputs, and write the result into shared memory
- Details: each processor needs to know who its parent is and whether it is a left or right child, and writes to the appropriate input field
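The synchronous rounds above can be sketched sequentially — a Python illustration, not the slide's own code, assuming n processors, 2n inputs, and n a power of two:

```python
def tournament_max(a):
    """EREW tournament maximum: n processors, 2n inputs.

    Round 0: processor k writes the max of its two inputs.
    Round j >= 1: processor k is active iff k % 2**j == 0 and takes the
    max of its value and the value held by processor k + 2**(j-1).
    """
    n = len(a) // 2
    mem = [max(a[2 * k], a[2 * k + 1]) for k in range(n)]  # round 0
    j = 1
    while (1 << j) <= n:
        step = 1 << j
        for k in range(0, n, step):        # processors active in step j
            mem[k] = max(mem[k], mem[k + step // 2])
        j += 1
    return mem[0]
```

Each pass of the inner loop stands for one synchronous parallel step; no cell is read and written by two different processors in the same step, matching the EREW restriction.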
Time Optimality

Example: n = 16
- Number of processors, p = n/log n = 4
- Divide the 16 elements into four groups of four each
- Local phase: each processor computes the maximum of its four local elements
- Global phase: performed amongst the maximums computed by the four processors

Analysis:
- Local phase uses a simple sequential algorithm; time for the local phase = O(log n)
- Global phase: take the (n/log n) local maximums and compute the global maximum using the tournament algorithm
- Time for the global phase = O(log (n/log n)) = O(log n)
Finding Maximum: CRCW Algorithm
Given n elements A[0..n-1], find the maximum.
With n^2 processors, each processor (i,j) compares A[i] and A[j], for 0 <= i, j <= n-1.
FAST-MAX(A):
    n <- length[A]
    for i <- 0 to n-1, in parallel
        do m[i] <- true
    for i <- 0 to n-1 and j <- 0 to n-1, in parallel
        do if A[i] < A[j]
               then m[i] <- false
    for i <- 0 to n-1, in parallel
        do if m[i] = true
               then max <- A[i]
    return max
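As a quick check of the logic, FAST-MAX can be simulated sequentially (Python for illustration; the nested loop plays the role of the n^2 processors acting in one parallel step):

```python
def fast_max(a):
    """Sequential simulation of FAST-MAX on a common-CRCW PRAM.

    m[i] stays true iff A[i] is never beaten by any A[j]; every surviving
    processor then writes its (equal) maximum value to mx concurrently,
    which the common-CRCW rule permits.
    """
    n = len(a)
    m = [True] * n                 # all elements start as candidates
    for i in range(n):             # n^2 comparisons: one parallel step
        for j in range(n):
            if a[i] < a[j]:
                m[i] = False
    mx = None
    for i in range(n):             # concurrent write of the maximum
        if m[i]:
            mx = a[i]
    return mx
```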
Example: A = [5, 6, 9, 2, 9]. Processor (i,j) sets entry (i,j) to T iff A[i] < A[j]:

                A[j]
             5   6   9   2   9     m
  A[i]  5    F   T   T   F   T     F
        6    F   F   T   F   T     F
        9    F   F   F   F   F     T
        2    T   T   T   F   T     F
        9    F   F   F   F   F     T
max = 9

Broadcast and reduction:
- Broadcast of 1 value to p processors in log p time
- Reduction of p values to 1 in log p time
- Takes advantage of associativity of +, *, min, max, etc.
- Example: add-reduction of [1, 3, 1, 0, 4, -6, 3, 2]

The running time of FAST-MAX is O(1).
Note: there may be multiple maximum values, so their processors will write to max concurrently; since they all write the same (maximum) value, the common-CRCW rule suffices. Its work = n^2 * O(1) = O(n^2).
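A sequential sketch of the log-time tree reduction (Python illustration; `reduce_log` is our name, and each pass of the inner loop stands for one synchronous round of parallel combines):

```python
def reduce_log(vals, op):
    """Tree reduction of p values in ceil(log2 p) rounds for any
    associative op; in each round, pairs `stride` apart combine."""
    vals = list(vals)
    p = len(vals)
    stride = 1
    while stride < p:
        for i in range(0, p - stride, 2 * stride):  # one synchronous round
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]
```

With p processors this takes log p rounds, exactly the reduction bound quoted above.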
Scan (or Parallel Prefix)

What if you want to compute partial sums?

Definition: the parallel prefix operation takes a binary associative operator ⊕ and an array of n elements
    [a0, a1, a2, ..., a(n-1)]
and produces the array
    [a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ a(n-1))]

Example: add-scan of [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14].

Can be implemented in O(n) time by a serial algorithm: the obvious n-1 applications of the operator will work.

Prefix Sum in Parallel

Algorithm:
1. Pairwise sum
2. Recursively prefix
3. Pairwise sum again to fill in the remaining positions

Example (n = 16):
    Input:               1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
    1. Pairwise sum:        3     7    11    15    19      23      27      31
    2. Recursive prefix:    3    10    21    36    55      78     105     136
    3. Result:           1  3  6 10 15 21 28 36 45  55  66  78  91 105 120 136
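The three steps can be written as a short recursion — a Python sketch, assuming the length is a power of two as in the n = 16 example:

```python
def prefix_scan(a, op):
    """3-step parallel prefix: (1) pairwise combine, (2) recursive prefix
    on the n/2 sums, (3) one more pass to produce all n prefixes."""
    n = len(a)
    if n == 1:
        return [a[0]]
    x = [op(a[2 * j], a[2 * j + 1]) for j in range(n // 2)]   # step 1
    y = prefix_scan(x, op)                                    # step 2
    s = [None] * n                                            # step 3
    for j in range(n):
        if j % 2 == 1:
            s[j] = y[j // 2]            # odd positions come straight from y
        elif j == 0:
            s[j] = a[0]
        else:
            s[j] = op(y[j // 2 - 1], a[j])  # even positions need one combine
    return s
```

Both combining passes are fully parallel, which is where the O(log n) step complexity comes from.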
Implementing Scans

Tree summation in 2 phases:

Up sweep (leaves to root):
- get values L and R from left and right child
- save L in local variable Mine
- compute Tmp = L + R and pass to parent

Down sweep (root to leaves):
- get value Tmp from parent (the root starts with 0)
- send Tmp to left child
- send Tmp + Mine to right child

In short:
- Up sweep: mine = left; tmp = left + right
- Down sweep: tmp = parent (root is 0); right = tmp + mine
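The up-sweep/down-sweep pair can be sketched in place on an array of length 2^k (a Python illustration producing the exclusive scan; `mine` records each internal node's saved left-child sum, keyed by the node's array slot and level):

```python
def exclusive_scan(a):
    """Two-phase tree scan: up sweep accumulates sums into the right
    endpoint of each subtree; down sweep pushes prefixes back down,
    sending Tmp left and Tmp + Mine right.  Requires len(a) = 2^k."""
    n = len(a)
    t = list(a)
    mine = {}
    stride = 1
    while stride < n:                              # up sweep
        for i in range(2 * stride - 1, n, 2 * stride):
            mine[(i, stride)] = t[i - stride]      # save left child ("Mine")
            t[i] = t[i - stride] + t[i]            # Tmp = L + R goes to parent
        stride *= 2
    t[n - 1] = 0                                   # root starts with 0
    stride //= 2
    while stride >= 1:                             # down sweep
        for i in range(2 * stride - 1, n, 2 * stride):
            tmp = t[i]                             # value from parent
            t[i - stride] = tmp                    # Tmp to left child
            t[i] = tmp + mine[(i, stride)]         # Tmp + Mine to right child
        stride //= 2
    return t
```

Shifting the exclusive result by one and combining with the input recovers the inclusive scan.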
E.g., Using Scans for Array Compression

Given an array of n elements
    [a0, a1, a2, ..., a(n-1)]
and an array of flags
    [1, 0, 1, 1, 0, 0, 1, ...],
compress the flagged elements:
    [a0, a2, a3, a6, ...]

Compute a prescan, i.e., a scan that doesn't include the element at position i in the sum:
    [0, 1, 1, 2, 3, 3, 3, ...]
This gives the index of the i-th element in the compressed array.
[Figure: tree summation in 2 phases — up-sweep sums and down-sweep prefix values on an example tree.]
If the flag for this element is 1, write it into the result array at the
given position
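Putting the prescan to work, a compression sketch (Python; the serial loop computing `idx` stands in for the parallel prescan, and the final loop is the parallel scatter):

```python
def compress(a, flags):
    """Array compression: the exclusive prescan of the flags gives each
    flagged element its index in the compressed output."""
    idx = [0] * len(flags)                    # exclusive prescan of flags
    for i in range(1, len(flags)):
        idx[i] = idx[i - 1] + flags[i - 1]
    result = [None] * (idx[-1] + flags[-1])   # total count of flagged elements
    for i in range(len(a)):                   # "forall i, in parallel"
        if flags[i] == 1:
            result[idx[i]] = a[i]
    return result
```

The scatter writes are exclusive — distinct flagged elements get distinct indices — so this runs on an EREW PRAM.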
E.g., Fibonacci via Matrix Multiply Prefix

F(n+1) = F(n) + F(n-1), which in matrix form is

    [ F(n+1) ]   [ 1  1 ] [ F(n)   ]
    [ F(n)   ] = [ 1  0 ] [ F(n-1) ]

Can compute all F(n) by matmul_prefix on n copies of the matrix
    [ 1 1 ]   [ 1 1 ]         [ 1 1 ]
    [ 1 0 ] , [ 1 0 ] , ... , [ 1 0 ]
then select the upper-left entry of each prefix product.
(Slide source: Alan Edelman, MIT)

Pointer Jumping — List Ranking

Given a singly linked list L with n objects, compute, for each object in L, its distance from the end of the list.
Formally, suppose next is the pointer field:
    d[i] = 0                  if next[i] = nil
    d[i] = d[next[i]] + 1     if next[i] != nil
A serial algorithm takes Θ(n).
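A sketch of the idea (Python; the serial running product stands in for a parallel matmul-prefix, which is legitimate because matrix multiplication is associative):

```python
def matmul2(p, q):
    """2x2 matrix product."""
    return ((p[0][0] * q[0][0] + p[0][1] * q[1][0], p[0][0] * q[0][1] + p[0][1] * q[1][1]),
            (p[1][0] * q[0][0] + p[1][1] * q[1][0], p[1][0] * q[0][1] + p[1][1] * q[1][1]))

def fib_prefix(n):
    """Prefix products of n copies of [[1,1],[1,0]]; the upper-left entry
    of the k-th prefix product is F(k+1)."""
    M = ((1, 1), (1, 0))
    prods = [M]
    for _ in range(n - 1):        # serial stand-in; any scan algorithm works
        prods.append(matmul2(prods[-1], M))
    return [p[0][0] for p in prods]
```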
List-Ranking EREW Algorithm

LIST-RANK(L)  (in O(lg n) time)
    for each processor i, in parallel
        do if next[i] = nil
               then d[i] <- 0
               else d[i] <- 1
    while there exists an object i such that next[i] != nil
        do for each processor i, in parallel
               do if next[i] != nil
                      then d[i] <- d[i] + d[next[i]]
                           next[i] <- next[next[i]]
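LIST-RANK can be simulated with synchronous steps (Python; within each step all reads happen before any writes, matching the PRAM semantics):

```python
def list_rank(next_):
    """Pointer-jumping list ranking; next_[i] is the successor index or
    None.  Each while-iteration is one synchronous step, so the number of
    iterations is O(log n): surviving pointer chains halve every step."""
    n = len(next_)
    next_ = list(next_)
    d = [0 if next_[i] is None else 1 for i in range(n)]
    while any(nx is not None for nx in next_):
        # read everything first, then "write" by rebinding both arrays
        new_d = [d[i] if next_[i] is None else d[i] + d[next_[i]] for i in range(n)]
        new_next = [None if next_[i] is None else next_[next_[i]] for i in range(n)]
        d, next_ = new_d, new_next
    return d
```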
Recap

PRAM algorithms covered so far:
- Finding max on EREW and CRCW models
- Time-optimal algorithms: number of steps in the parallel program is equal to the number of steps in the best sequential algorithm
  - Always qualified with the maximum number of processors that can be used to achieve the parallelism
- Reduction operation:
  - Takes a sequence of values and applies an associative operator on the sequence to distill a single value
  - Associative operator can be +, max, min, etc.
  - Can be performed in O(log n) time with up to O(n/log n) procs
- Broadcast operation: send a single value to all processors
  - Also can be performed in O(log n) time with up to O(n/log n) procs
[Figure: pointer jumping on a six-object list. (a) initial d values (1 for every object except 0 for the last); (b)-(d) d values and next pointers after each pointer-jumping step, ending with each object's distance from the end of the list.]
Scan Operation

- Used to compute partial sums
- Definition: the parallel prefix operation takes a binary associative operator ⊕ and an array of n elements [a0, a1, ..., a(n-1)] and produces the array [a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ a(n-1))]

Scan(a, n):
    if (n == 1) { s[0] = a[0]; return s; }
    for (j = 0 .. n/2-1)
        x[j] = a[2*j] ⊕ a[2*j+1];
    y = Scan(x, n/2);
    for odd j in {1 .. n-1}
        s[j] = y[j/2];
    for even j in {2 .. n-1}
        s[j] = y[j/2 - 1] ⊕ a[j];
    s[0] = a[0];
    return s;
Work-Time Paradigm

Associate two complexity measures with a parallel algorithm:
- S(n): step complexity of the algorithm — the total number of steps it takes
- W(n): work complexity of the algorithm — the total number of operations it performs
- Wj(n): number of operations the algorithm performs in step j; W(n) = sum of Wj(n) over j = 1 .. S(n)

Recurrences for Scan

Scan(a, n):
    if (n == 1) { s[0] = a[0]; return s; }
    for (j = 0 .. n/2-1)
        x[j] = a[2*j] ⊕ a[2*j+1];
    y = Scan(x, n/2);
    for odd j in {1 .. n-1}
        s[j] = y[j/2];
    for even j in {2 .. n-1}
        s[j] = y[j/2 - 1] ⊕ a[j];
    s[0] = a[0];
    return s;

Can use recurrences to compute W(n) and S(n):
    W(n) = 1 + n/2 + W(n/2) + n/2 + n/2 + 1 = 2 + 3n/2 + W(n/2)
    S(n) = 1 + 1 + S(n/2) + 1 + 1 = S(n/2) + 4
Solving: W(n) = O(n); S(n) = O(log n)
Brent's Scheduling Principle

A parallel algorithm with step complexity S(n) and work complexity W(n) can be simulated on a p-processor PRAM in no more than TC(n,p) = W(n)/p + S(n) parallel steps.
- S(n) can be thought of as the length of the critical path
- Some such schedule exists; an online algorithm is needed for dynamically allocating different numbers of processors at different steps of the program
- No need to give the actual schedule; just design a parallel algorithm and give its W(n) and S(n) complexity measures

Goals:
- Design algorithms with W(n) = TS(n), the running time of the sequential algorithm; such algorithms are called work-efficient algorithms
- Also make sure that S(n) = poly-log(n)
- Speedup = TS(n) / TC(n,p)
Application of Brent's Schedule to Scan

Scan complexity measures: W(n) = O(n), S(n) = O(log n), and TC(n,p) = W(n)/p + S(n).

If p equals 1:
- TC(n,p) = O(n) + O(log n) = O(n)
- Speedup = TS(n) / TC(n,p) = 1

If p equals n:
- TC(n,p) = O(log n)
- Speedup = TS(n) / TC(n,p) = n/log n

If p equals n/log n:
- TC(n,p) = O(log n)
- Speedup = n/log n

Scalable up to n/log n processors.
Segmented Operations

Inputs = ordered pairs (operand, boolean), e.g. (x, T) or (x, F).
A change of segment is indicated by switching T/F.

The segmented operator ⊕2 (e.g. +2) combines a running pair with the next pair; a T flag on the right operand starts a new segment:

     ⊕2         (y, T)      (y, F)
    (x, T)      (y, T)      (x ⊕ y, T)
    (x, F)      (y, T)      (x ⊕ y, F)

Result: an independent scan within each segment.
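A sketch of the segmented combiner and a serial scan built on it (Python illustration; `seg_op` follows the table above, with True playing the role of T — the combiner is associative, so any scan algorithm can use it):

```python
def seg_op(p, q, op):
    """Segmented-scan combiner on (value, flag) pairs: a True flag on the
    right operand starts a new segment; otherwise the values combine and
    the left flag is kept."""
    (x, fx), (y, fy) = p, q
    if fy:
        return (y, True)
    return (op(x, y), fx)

def segmented_scan(pairs, op):
    """Serial inclusive scan using the segmented operator."""
    out = [pairs[0]]
    for pr in pairs[1:]:
        out.append(seg_op(out[-1], pr, op))
    return out
```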
Parallel Prefix on a List

A prefix computation is defined as:
- Input: <x1, x2, ..., xn>
- A binary associative operation ⊕
- Output: <y1, y2, ..., yn>
such that:
- y1 = x1
- yk = y(k-1) ⊕ xk for k = 2, 3, ..., n; i.e., yk = x1 ⊕ x2 ⊕ ... ⊕ xk

Suppose <x1, x2, ..., xn> are stored in order in a linked list.
Define the notation: [i,j] = xi ⊕ x(i+1) ⊕ ... ⊕ xj
List Prefix Operations

LIST-PREFIX(L)
    for each processor i, in parallel
        do y[i] <- x[i]
    while there exists an object i such that prev[i] != nil
        do for each processor i, in parallel
               do if prev[i] != nil
                      then y[prev[i]] <- y[i] ⊕ y[prev[i]]
                           prev[i] <- prev[prev[i]]

Questions:
- What is S(n)?
- What is W(n)?
- What is the speedup on n/log n processors?
Announcements

Readings:
- Lecture notes from Sid Chatterjee and Jans Prins
- Prefix scan applications paper by Guy Blelloch
- Lecture notes from Ranade (for list ranking algorithms)

Homework:
- First theory homework will be on the website tonight
- To be done individually
- TA office hours will be posted on the website soon

[Figure: list prefix example values.]
Optimizing List Prefix

Goal: achieve W(n) = O(n)

Sketch of algorithm:
1. Select a set of list elements that are non-adjacent
2. Eliminate the selected elements from the list
3. Repeat steps 1 and 2 until only one element remains
4. Fill in values for the elements eliminated in preceding steps, in the reverse order of their elimination

[Figure: eliminate some elements; perform list prefix on the remainder; integrate the eliminated elements back in.]
Randomized List Ranking

Elimination step:
- Each processor is assigned O(log n) elements: processor j is assigned elements j*log n .. (j+1)*log n - 1
- Each processor marks the head of its queue as a candidate
- Each processor flips a coin and stores the result along with the candidate
- A candidate is eliminated if its coin is a HEAD and if it so happens that the previous element is not a TAIL or was not a candidate

[Figure: three successive elimination rounds (#1, #2, #3) on an example list.]
Find Root — CREW Algorithm

- Suppose a forest of binary trees, in which each node i has a pointer parent[i]
- Find the identity of the tree of each node
- Assume that each node is associated with a processor
- Assume that each node i has a field root[i]

FIND-ROOTS(F)
1.  for each processor i, in parallel
2.      do if parent[i] = nil
3.             then root[i] <- i
4.  while there exists a node i such that parent[i] != nil
5.      do for each processor i, in parallel
6.             do if parent[i] != nil
7.                    then root[i] <- root[parent[i]]
8.                         parent[i] <- parent[parent[i]]
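FIND-ROOTS can be simulated with synchronous steps (Python; in the real CREW algorithm the reads of root[parent[i]] may be concurrent, which the simulation makes harmless by reading everything before writing):

```python
def find_roots(parent):
    """Pointer jumping over a forest; parent[i] is i's parent index or
    None for a root.  Each loop iteration is one synchronous step: parent
    pointers jump toward the roots, and root labels propagate down."""
    n = len(parent)
    parent = list(parent)
    root = [i if parent[i] is None else None for i in range(n)]
    while any(p is not None for p in parent):
        new_root = [root[i] if parent[i] is None else root[parent[i]] for i in range(n)]
        new_parent = [None if parent[i] is None else parent[parent[i]] for i in range(n)]
        root, parent = new_root, new_parent
    return root
```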
[Figure: pointer-jumping example — parent pointers and root values after successive iterations.]

Analysis

Complexity measures:
- What is W(n)?
- What is S(n)?
- Termination detection: when do we stop?
- All the writes are exclusive
- But the read in line 7 is concurrent, since several nodes may have the same node as parent
Find Roots — CREW vs. EREW

How fast can n nodes in a forest determine their roots using only exclusive reads? Ω(lg n).
- Argument: with exclusive reads, a given piece of information can only be copied to one other memory location in each step, so the number of locations containing a given piece of information at most doubles at each step. In a forest with one tree of n nodes, the root identity is stored in one place initially; after the first step it is stored in at most two places, after the second in at most four, ..., so lg n steps are needed for it to be stored in n places.
- So CREW: O(lg d), where d is the maximum tree depth, and EREW: Ω(lg n).
- If d = 2^o(lg n), CREW outperforms any EREW algorithm.
- If d = Θ(lg n), then CREW runs in O(lg lg n), and EREW is much slower.

Euler Tours

- Technique for fast processing of tree data
- Euler circuit of a directed graph: a directed cycle that traverses each edge exactly once
- Represent a tree by an Euler circuit of its directed version
Using Euler Tours

- Trees = balanced parentheses: the parentheses subsequence corresponding to a subtree is balanced

Depth of tree vertices:
- Input:
  - L[i] = position of the incoming edge into i in the Euler tour
  - R[i] = position of the outgoing edge from i in the Euler tour

forall i in 1..n {
    A[L[i]] = 1;
    A[R[i]] = -1;
}
B = EXCL-SCAN(A, +);
forall i in 1..n
    Depth[i] = B[L[i]];

Example:
    Parenthesis version:  (()(()()))
    Scan input:    1  1 -1  1  1 -1  1 -1 -1 -1
    Scan output:   0  1  2  1  2  3  2  3  2  1
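The depth computation can be checked directly (a Python sketch; `tree_depths` is our name, and the serial loop stands in for EXCL-SCAN):

```python
def tree_depths(L, R, m):
    """Depths from Euler-tour positions: +1 at each incoming-edge position,
    -1 at each outgoing-edge position, then an exclusive +-scan; the depth
    of vertex i is the scan value at L[i].  m is the tour length."""
    A = [0] * m
    n = len(L)
    for i in range(n):            # "forall i in 1..n", in parallel
        A[L[i]] = 1
        A[R[i]] = -1
    B = [0] * m                   # exclusive scan of A
    for j in range(1, m):
        B[j] = B[j - 1] + A[j - 1]
    return [B[L[i]] for i in range(n)]
```

The test below encodes the tree of the example (()(()())): a root with two children, the second of which has two children of its own.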
Divide and Conquer

Just as in sequential algorithms:
- Divide problems into sub-problems
- Solve sub-problems recursively
- Combine sub-solutions to produce the solution

Example: planar convex hull
- Given a set of points sorted by x-coordinate
- Find the smallest convex polygon that contains the points

Overall approach:
- Take the set of points and divide the set into two halves
- Assume that a recursive call computes the convex hull of each half
- Conquer stage: take the two convex hulls and merge them to obtain the convex hull for the entire set

Complexity:
- W(n) = 2*W(n/2) + merge_cost
- S(n) = S(n/2) + merge_cost
- If merge_cost is O(log n), then S(n) is O(log^2 n)
- Merge can be sequential; parallelism comes from the recursive subtasks
[Figures: convex hull example — dividing the point set, the two recursive hulls, and the merge step.]
Merge Operation

Challenge: finding the upper and lower common tangents
- A simple algorithm takes O(n)
- We need a better algorithm

Insight: resort to binary search
- Consider the simpler problem of finding a tangent from a point to a polygon
- Extend this to tangents from one polygon to another polygon
- More details in the Preparata and Shamos book on Computational Geometry (Lemma 3.1)