This document discusses parallel algorithms and their analysis. It begins by introducing parallel processors and algorithms that can leverage multiple processors working simultaneously. It then discusses different models of parallelism like SIMD and MIMD. A key model discussed is the PRAM model where multiple processors share memory. Examples are given of parallel search, finding the maximum value, and computing AND/OR of arrays. These examples show how algorithms can be parallelized to run in poly-log time, defining a new complexity class called NC.
Chapter 14: Parallel Algorithms
• As processing power continues to become cheaper, it is natural to build machines with multiple processors
  – parallel processors can execute multiple programs simultaneously instead of merely concurrently
• Can we also write a program such that parallel processors can work on that single program in parallel through some form of cooperation?
  – Yes, the solution will use a parallel algorithm
  – Questions:
    • How can we parallelize an algorithm?
    • How can we handle multiple processors accessing memory at the same time?
    • How will the parallel algorithm impact the problem's computational complexity?
    • What problems can be parallelized so that they obtain a speedup?

Parallelism vs. Sequentiality
• Suppose a sequential algorithm has a complexity of W(n) in the worst case
  – If we run this algorithm in parallel on p processors, the best we can hope for is a complexity of W(n) / p
• Will we achieve this maximum speedup?
  – Probably not, as any data dependencies will cause some processors to idle until the needed data becomes available
• Example from the book:
  – We want to put our shoes and socks on
  – Sequential algorithm: put on left sock, put on right sock, put on left shoe, put on right shoe
  – With 4 processors (hands), we cannot accomplish these 4 tasks in 1 time unit because the socks must be put on before the shoes
• We similarly could not insert the kth value in insertion sort before first inserting the first k-1 values in the list

Models of Parallelism
• SIMD: Single instruction, multiple data
  – Issue a single instruction to be carried out on different processors, where each processor handles a different datum
  – Within SIMD, we can subdivide models based on how processors communicate
    • Nearest-neighbor schemes:
      – Hypercube: has 2^d nodes where each node connects directly to exactly d neighbors (see figure 14.2a for a hypercube with d = 3)
      – Bounded degree: a node connects directly to d neighbors, but the size of the network is not restricted by d (see figure 14.3)
    • PRAMs (covered next)
• MIMD: Multiple instruction, multiple
data
  – Each processor receives its own instruction and data to work on (or, more commonly, each processor receives its own process and data to work on)

PRAMs
• PRAM: Parallel random access machine
  – A machine of p general-purpose processors
  – They all share RAM
  – Each processor may have its own local memory (e.g., cache, RAM or registers), but all communication between processors takes place in the shared RAM
  – Each processor knows its own id (processor id, or pid)
  – All processors are synchronized by one control unit to read, execute and write at the same time
  – Shared memory can accommodate concurrent reads; we will consider how writes are handled later
• PRAMs are impractical machines because of the complex nature of processor-to-processor communication and concurrent memory accesses
• They are, however, convenient theoretical machines for proving the complexity of parallel algorithms
  – Therefore, while hypercube, bounded-degree and MIMD machines are used in practice, we won't bother with those architectures for our consideration of parallel algorithms

Parallel Algorithms for PRAMs
• The basic strategy for parallelizing an algorithm is as follows:
  – Load the data (an array in our examples) into shared memory
  – Each processor can access a given array value concurrently (each processor will be interested in data from the array based on its pid)
  – Each processor performs its operation
  – Each processor writes its result to the array concurrently
    • Note: as long as each processor is writing to a different memory location in shared memory, the RAM can handle all writes concurrently (truly concurrent reads/writes of the same location are not possible in practice)
  – Communication of a previous result to another processor is performed through shared memory
  – As time goes on, fewer processors may be needed, although in practice each processor could continue to execute as long as it does not affect the result of a processor that is still needed
• Figure 14.3 demonstrates this idea, where each processor starts with data
pid and pid+1 (processors are even-numbered in this figure) and, as time goes on, fewer processors are used

Example: Parallel Search
• Search has improved from Θ(n) to Θ(n^(1/2)) to Θ(log n)
• With parallelism, search can be reduced to Θ(1):
  – In parallel:
    • for all processors, copy a[pid] to M[pid]
    • if (M[pid] == target) write pid to some variable k
  – If we assume only 1 element will equal target, then the location of that element is written to k, so that a[k] stores target
  – This algorithm uses n processors and is Θ(1)
• So, even though the array may be unsorted, we have improved over all previous search algorithms
• A problem with this algorithm arises if target appears in multiple array locations, because the PRAM cannot accommodate multiple writes of different values to k at the same time
• We will visit this problem of handling multiple writes to the same memory location later in this chapter

Parallel Tournament for Maximum
• Here we see a parallel algorithm to determine the maximum item in an array
  – The algorithm is based on the tournament idea from chapter 5; here, each tournament of a given round is performed by a different processor

    incr = 1
    while (incr < n)
        temp0 = M[pid]
        temp1 = M[pid + incr]
        if (temp0 < temp1)
            M[pid] = M[pid + incr]
        incr = incr * 2

• For n array items, this algorithm uses n / 2 processors and takes Θ(log n) time
• Each processor takes 2 array elements from positions M[pid] and M[pid + incr], finds the max and copies it into M[pid]
  – For instance: P0 looks at locations 0 and 1, P1 looks at locations 1 and 2, etc.
  – In the next iteration, P0 compares the items at 0 (max of 0, 1) and 2 (max of 2, 3)
  – Eventually, processor 0 compares the items at 0 and n / 2 for the max
• It should be easy to see that the above algorithm iterates log n times, so the complexity is 4 log n + 1 operations, or Θ(log n) comparisons

Formal Analysis
• The book presents the following theorem:
  – At the end of the tth iteration of the while loop, incr = 2^t and each cell M[i] for 0 <= i <= n - incr
contains the maximum of M[i] … M[i+incr-1]
  – The book proves this by induction
  – We do not need to be so formal to see that incr = 2^t: since we multiply incr by 2 each iteration, if incr starts at 1, then after t iterations incr = 2^t
  – By induction, we can see that the second result holds as follows:
    • Suppose that at the end of iteration t-1, M[i] contains max(M[i], …, M[i+incr/2-1]), where incr = 2^t
    • The next iteration then compares M[i+incr/2] and M[i], and by our assumption M[i+incr/2] stores max(M[i+incr/2], …, M[i+incr-1])
    • So M[i] becomes whichever is greater, and thus at the end of iteration t, M[i] is max(M[i], …, M[i+incr-1])

Example

Variations of the Parallel Tournament
• The "parallel tournament" approach is also called the "binary fan-in technique" – see figure 14.3
• We can use this approach and apply it to other problems:
  – Given n boolean values in an array, we can compute the
    • AND of the array by replacing M[i] = max(M[i], M[i+incr]) with M[i] = M[i] AND M[i+incr]
    • OR of the array by replacing M[i] = max(M[i], M[i+incr]) with M[i] = M[i] OR M[i+incr]
  – Given n int values, we can sum the items by replacing M[i] = max(M[i], M[i+incr]) with M[i] = M[i] + M[i+incr]
• All of these algorithms require n processors and perform the operation in Θ(log n) with concurrent reads but no write problems

A New Class: NC
• The class NC (poly-log time complexity) consists of those problems that can be solved by a PRAM with O(n^k) processors in O(log^m n) time
  – That is, if the problem can be solved using a PRAM whose number of processors is polynomial in n (n^2, n^3, n^6, etc.) and whose run time is O(log^m n) for some constant m, then the problem is in NC
• We have seen that finding the max, summation, array-AND and array-OR are all in NC
• What about other problems like sorting?
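The binary fan-in technique above can be sketched as a short sequential simulation. This is illustrative Python, not PRAM code: the inner loop runs the "parallel" comparisons of one round one after another, and the `op` parameter stands in for max, AND, OR or +.

```python
# Sequential simulation of binary fan-in: each round, active cells combine
# M[pid] with M[pid + incr]; incr doubles, so any associative operation
# finishes in ceil(log2 n) rounds.
def fan_in(values, op):
    M = list(values)
    n = len(M)
    incr = 1
    while incr < n:
        # Cells pid = 0, 2*incr, 4*incr, ... act "in parallel" this round.
        for pid in range(0, n - incr, 2 * incr):
            M[pid] = op(M[pid], M[pid + incr])
        incr *= 2
    return M[0]

print(fan_in([5, 1, 9, 3, 7, 2], max))                          # 9 (maximum)
print(fan_in([True, True, False, True], lambda a, b: a and b))  # False (array-AND)
print(fan_in([1, 2, 3, 4, 5], lambda a, b: a + b))              # 15 (summation)
```

With one processor per pair of cells, each round's combines are independent writes to distinct locations, which is why the slide's CREW PRAM needs no special write handling here.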
We will explore sorting later
  – Note: it may not be practical to have n^k processors (even when k = 1) for large input sizes
    • In practice, most parallel machines have no more than 1024 processors, often far fewer

Parallel Matrix Multiplication
• To perform matrix multiplication on two NxN matrices:

    x11 x12 x13     y11 y12 y13
    x21 x22 x23  *  y21 y22 y23
    x31 x32 x33     y31 y32 y33

  For instance:
    z11 = x11*y11 + x12*y21 + x13*y31
    z32 = x31*y12 + x32*y22 + x33*y32

  – we need to perform n multiplications and n-1 additions per element in the resulting matrix
  – The resulting matrix is NxN, so we have n^2 items
  – So matrix multiplication needs a total of n^3 multiplications and n^2 * (n-1) additions, making this an Θ(n^3) algorithm sequentially
• Parallel solutions:
  – Assume we have n^2 processors: each processor produces the result of one matrix entry, taking n multiplications and n-1 additions, or roughly 2n operations (Θ(n))
  – Assume we have n^3 processors: each processor does 1 multiplication, and then the additions for each entry can be done by binary fan-in in log n time on n processors, requiring log n + 1 operations (Θ(log n))

Parallel Reads vs. Writes
• All PRAMs can perform concurrent reads – that is, any number of processors can read the same memory location at the same time
• PRAMs can also perform multiple writes to different memory locations, but what about concurrent writes to the same location?
• The basic form of PRAM disallows parallel writes to the same memory location
  – This PRAM is known as a CREW PRAM (concurrent reads, exclusive writes)
• However, there are stronger PRAM models (known as CRCW), each of which relaxes the restriction more and more, resulting in more powerful but harder-to-implement PRAMs:
  – Common-write: processors can write to the same memory location concurrently as long as they all write the same value
  – Arbitrary-write: when multiple processors write to the same memory location, one is arbitrarily chosen
  – Priority-write: when multiple processors write to the same memory location, only the processor with the lowest pid is selected to write

Using a Priority-Write PRAM
• As an example, we implement insertion sort using a priority-write PRAM with n processors as follows:

    for (i = 1; i < n; i++)
        copy element a[j] to M[j] for all j between 0 and i-1
        temp = a[i]
        k = i                          // default position if no processor writes k
        if (M[pid] < temp)
            a[pid] = M[pid]            // element stays put
        else
            a[pid+1] = M[pid]          // element shifts right
            k = pid
        a[k] = temp

  – Notice that we must use a priority-write PRAM so that the proper location is written into k in the else branch: the lowest pid whose element is >= temp wins, giving the insertion point
• The loop iterates n - 1 times
  – Within each iteration, each processor performs up to 4 operations, or a total of 4 * (n - 1) parallel steps
  – So with n processors, we can perform a sort in Θ(n)

One Iteration of Parallel Insertion Sort
This process is repeated for i from 1 to n-1, so it takes 4 * (n - 1) operations if there are n processors

Boolean OR on n bits
• We want to OR together n different bits stored in n array elements
  – Notice that if any of the n bits is true, then the OR results in true
  – We can use this observation and the ability to handle concurrent writes to solve Boolean OR of n bits in Θ(1) time
• Use an n-processor PRAM that permits common-writes
  – Initialize a boolean variable k to false; each processor reads its bit a[pid]
  – If a[pid] == 1, then write true to k
  – k is then the result of the OR
• With 2 operations, this is Θ(1), requiring n processors
  – Note: without the ability to perform common-writes, this will not work!
• Could we use a similar approach for AND?
  – Yes: initialize k to true and have each processor write false to k if a[pid] is false

Finding Maximum in Θ(1)
• By using the common-write PRAM, we can now solve this problem in constant time
  – We use n^2 processors, where each processor is denoted p(i,j)
• Assign n processors p(i,0) … p(i,n-1) to array location i
  – Each processor p(i,j) compares the two array values a[i] and a[j]
  – If a[i] < a[j], then write 1 to b[i]; if a[j] < a[i], then write 1 to b[j]
    • There will be multiple concurrent writes to the same array location, but all writes are a 1, so this is permissible
  – Now assign a processor to each element in b (n total processors)
    • if (b[pid] == 0) then write pid to k
  – a[k] is the maximum item
• Why?
Because b[pid] == 0 only if a[pid] never lost a comparison, so a[pid] is the maximum in the array
• This algorithm takes 4 parallel operations, so it is Θ(1), requiring n^2 processors

Example of Parallel Max in Θ(1)

Parallel Merge
• We now focus on merging 2 subarrays in parallel, in support of a parallel MergeSort
  – Given two subarrays of k elements each, both already sorted
  – We want to combine them into a single sorted array of n = 2k elements
  – We will use 2k processors, each assigned to one value in one of the two subarrays
    • Assume the first subarray has elements 0…k-1 and the second has elements k…2k-1
    • Processor i (0 <= i <= 2k-1) will determine where to place element a[i] (if 0 <= i <= k-1) or b[i] (if k <= i <= 2k-1) in the merged array
  – We perform an "in-place merge" in parallel so that we do not need any additional array space
• We make the following observation:
  – An element a[x] will be placed into the array at position x + b(a[x]), where b(a[x]) tells us how many elements in b are less than a[x]; likewise, an element b[x] will be placed at position a(b[x]) + x, where a(b[x]) tells us how many elements in a are less than b[x]
  – How can we determine a(b[x]) or b(a[x])?
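The placement rule above can be sketched in Python: `bisect_left` counts how many elements of the other subarray are smaller than a given value (values assumed distinct, as the text assumes), which is exactly b(a[x]) or a(b[x]). For clarity this sketch writes into a separate output array rather than performing the text's in-place merge.

```python
# Each loop iteration plays the role of one processor: it computes its
# element's final position independently of all the others.
from bisect import bisect_left

def parallel_merge(a, b):
    # a and b are sorted lists with distinct values
    merged = [None] * (len(a) + len(b))
    for x, val in enumerate(a):
        merged[x + bisect_left(b, val)] = val   # x + b(a[x])
    for x, val in enumerate(b):
        merged[bisect_left(a, val) + x] = val   # a(b[x]) + x
    return merged

print(parallel_merge([1, 4, 6], [2, 3, 5]))  # [1, 2, 3, 4, 5, 6]
```

Since every position is computed from the inputs alone, no two "processors" ever write the same location, which is why a CREW PRAM suffices.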
Using Binary Search
• First, we assume that no two elements in the two combined subarrays share a value
  – This allows us to use a CREW PRAM
• A processor can find its element's proper position in the merged array using binary search
  – The following positions a[x] into the merged array:
    • position = findLocationUsingBinarySearch(b, a[x])
    • a[x + position] = a[x]
  – It should be obvious that this code is bounded by Θ(log n); in fact, it will take log n + 2 operations in the worst case, since binary search can require up to log n + 1 comparisons
  – We can therefore merge the two sorted subarrays into one combined sorted array using n processors in log n time, where each processor performs the above steps
• Pseudocode is given in figure 14.8

Parallel MergeSort
• Now that we have a parallel merge using n processors, we can rewrite MergeSort
  – Recall that in MergeSort we used recursion to divide the array in half and then merged the two sorted subarrays into a single array
    • We used recursion to keep track of the two subarrays to merge
  – Now that we have a parallel merge and parallel processors to combine any two subarrays, we can keep track of the various subarrays to merge on different processors
  – So we no longer need the recursive step; instead, we can replace it with iteration:

    for (k = 1; k < n; k = 2*k)
        for each i in 0, 2k, 4k, …, i < n do in parallel
            Pi executes merge(M, i, k)
            // that is, Pi merges the two subarrays i…i+k-1 and i+k…i+2k-1

Example and Analysis
• Each parallel merge is in Θ(log n)
• Since we double the size of the subarrays being merged at each iteration, it takes log n levels of merging, so we have log n * log n total parallel steps
  – log n * log n = log^2 n, or (log n)^2
• More precisely, the complexity is ½(log n + 1)(log n + 2) - 1
• This complexity is between Θ(log n) and Θ(n)
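The iterative (non-recursive) parallel MergeSort above can be sketched as a sequential simulation: the subarray width k doubles each pass, and all merges within one pass are independent, so on a PRAM they would run on separate processors.

```python
# Sequential simulation of iterative parallel MergeSort: pass t merges
# sorted runs of width k = 2^t; the merges inside a pass are independent.
def parallel_mergesort(values):
    M = list(values)
    n = len(M)
    k = 1
    while k < n:
        for i in range(0, n, 2 * k):        # these merges could run in parallel
            left, right = M[i:i + k], M[i + k:i + 2 * k]
            # standard two-way merge of the two sorted runs
            out, li, ri = [], 0, 0
            while li < len(left) and ri < len(right):
                if left[li] <= right[ri]:
                    out.append(left[li]); li += 1
                else:
                    out.append(right[ri]); ri += 1
            M[i:i + 2 * k] = out + left[li:] + right[ri:]
        k *= 2
    return M

print(parallel_mergesort([5, 3, 8, 1, 9, 2, 7, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```

The while loop runs ⌈log n⌉ times and each pass costs Θ(log n) of parallel merge time, matching the (log n)^2 bound above.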
CREW vs. CRCW PRAMs
• It is worth pointing out that any algorithm that is solvable using a CRCW PRAM can also be solved using a CREW PRAM
  – The CREW PRAM may require more time than the CRCW PRAM, but the problem can still be solved
  – How much more time? It has been shown that the difference is no more than a factor of log n
    • Recall that we could find the maximum array element in Θ(log n) using an n-processor CREW PRAM and in Θ(1) using an n^2-processor CRCW PRAM
• Therefore, any problem solvable by a CRCW PRAM in NC can also be solved by a CREW PRAM in NC, and so the class NC is independent of the type of PRAM
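As a closing illustration, the Θ(1) common-write maximum recalled above can be simulated sequentially. The nested loops stand in for the n^2 processors p(i,j); every concurrent writer stores the same value 1, which is exactly what a common-write CRCW PRAM permits.

```python
# Sequential simulation of the common-write CRCW maximum: b[i] is set to 1
# whenever a[i] loses a comparison; the element whose flag stays 0 never lost.
def crcw_max(a):
    n = len(a)
    b = [0] * n
    for i in range(n):          # all n*n comparisons happen "in parallel"
        for j in range(n):
            if a[i] < a[j]:
                b[i] = 1        # a[i] lost, so it cannot be the maximum
    k = b.index(0)              # the pid with b[pid] == 0 writes itself to k
    return a[k]

print(crcw_max([4, 9, 1, 7]))  # 9
```

On the CREW side, the same result comes from the Θ(log n) tournament with n/2 processors, illustrating the at-most-log-n gap between the two models.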