Notes 02
1 Overview
To discuss parallel algorithms and analyze their performance, we need a model that defines the
basic operations and their costs. We start with a review of the widely accepted Random Access
Machine (RAM) model used for sequential algorithms. Currently, there is no single model that
has been accepted for parallel algorithms, so we cover a few commonly accepted models, including
shared memory models, local memory models, modular memory models, and some mixed models.
There are many models of parallel computation, and a given algorithm may run faster on one
model than on another.
2.1 The RAM Model
The Random Access Machine (RAM) model is highly accurate at predicting the runtime of
sequential algorithms and is therefore a widely accepted model for this purpose. The model
consists of a single processor attached to a random access memory. Memory accesses and
primitive operations can each be executed in constant time. Figure 1.1 shows a basic diagram
of the RAM model.
Time Complexity: We determine the time complexity of a sequential algorithm based on the
number of basic operations and memory accesses performed.
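As a small illustration, the following Python sketch sums an array; the comments count the unit-cost operations the RAM model charges for, giving an O(n) runtime.

```python
def array_sum(a):
    """Sum n numbers on the RAM model.

    Each iteration performs one memory access (reading an element)
    and one addition, both assumed to take constant time, so the
    total running time is O(n).
    """
    total = 0
    for x in a:       # n memory accesses
        total += x    # n additions
    return total

print(array_sum([3, 1, 4, 1, 5]))  # 14
```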
2.2 Shared Memory Models
The shared memory model consists of many processors that can access a single shared memory.
Figure 1.2 shows a diagram of this model with p processors.
One type of shared memory model is the Parallel Random Access Machine (PRAM) model. This
model is synchronous, meaning that every processor executes one operation per unit of time on
the same clock. Every processor can access any shared memory location in one unit of time.
In reality, a shared memory access takes longer than a primitive operation, so the unit-time
memory access is made possible in a somewhat forced manner: one unit of time is defined as the
time it takes for a memory access. If a processor executes an operation that is faster than this,
it still waits for the full unit of time to pass before moving forward.
There are several variations of the PRAM model, based on whether different processors are
allowed to read from or write to the same shared memory location at the same time:
– Common CRCW
– Arbitrary CRCW
– Priority CRCW
– Sum CRCW
The three broad variations of PRAM are EREW (Exclusive Read, Exclusive Write), CREW
(Concurrent Read, Exclusive Write), and CRCW (Concurrent Read, Concurrent Write). EREW
allows only one processor at a time to read from or write to a given memory location. CREW
allows multiple processors to read a memory location at the same time, but only one processor
may write to it. CRCW allows multiple processors to read or write to a memory location
simultaneously. Allowing concurrent reads of a memory location does not cause a problem, but
with concurrent writes the stored value depends on which processor's data is written. This leads
to several variations of CRCW as well. Common CRCW allows concurrent writes only if all of
the processors attempting to write simultaneously are writing the same value. Arbitrary CRCW
lets one arbitrarily selected processor succeed while the rest fail. Priority CRCW assigns an ID
to each processor, and the processor with the smallest ID writes successfully to the memory
location. Sum CRCW stores the sum of all of the values written by the processors into the
memory location.
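One way to picture the four write-resolution rules is the small Python sketch below (the function names and the (processor ID, value) representation are our own choices, not a standard API):

```python
import random

# Each entry of `writes` is a (processor_id, value) pair issued by a
# different processor in the same synchronous step; each function
# returns the value that ends up stored in the memory location.

def common_crcw(writes):
    values = {value for _, value in writes}
    assert len(values) == 1, "Common CRCW requires all values to agree"
    return values.pop()

def arbitrary_crcw(writes):
    _, value = random.choice(writes)   # any one writer may succeed
    return value

def priority_crcw(writes):
    _, value = min(writes)             # smallest processor ID wins
    return value

def sum_crcw(writes):
    return sum(value for _, value in writes)

step = [(3, 5), (1, 2), (7, 5)]
print(priority_crcw(step))  # 2   (processor 1 has the smallest ID)
print(sum_crcw(step))       # 12  (5 + 2 + 5)
```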
The list of PRAM models above is written in order from least to most powerful. For example, we
can look at a program that sums n elements and compare its runtime on the CREW model with
that on the CRCW Sum model:
With the CREW model, the fastest runtime we could achieve is O(log n), since pairs of numbers
must be added together repeatedly until a single sum is obtained. This can be seen in the
following image:
If we look at the CRCW Sum model, addition is handled by the hardware, so all of the processors
can concurrently write to the same location, which would then contain the sum of all of the
elements. The runtime here is O(1).
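The pairwise summation can be sketched as follows; the loop runs sequentially here, but each round only adds disjoint pairs, so on a CREW PRAM every round would take one unit of time and the parallel runtime is the number of rounds, O(log n):

```python
import math

def crew_tree_sum(a):
    """Pairwise (tree) summation.  Each round adds disjoint pairs, so
    on a CREW PRAM all additions in a round could happen in one unit
    of time; the parallel runtime is the number of rounds."""
    vals = list(a)
    rounds = 0
    while len(vals) > 1:
        # one parallel round: combine disjoint pairs
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
        rounds += 1
    return vals[0], rounds

total, rounds = crew_tree_sum(range(8))
print(total, rounds)            # 28 3
print(math.ceil(math.log2(8)))  # 3 rounds, as expected
```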
2.3 Local Memory Models
Local memory models consist of many processors, each with its own memory, connected through
an interconnection network. The processors do their own work and send and receive messages
to one another. Figure 1.3 shows a graphic of this:
2.3.1 BSP Model
An example of a local memory model is the Bulk-Synchronous Parallel (BSP) model. This model
is asynchronous, meaning that each processor operates on its own clock. Typically, each processor
performs a bulk of operations and then synchronizes with the others. This can be faster because
the processors do not need to wait on one another between individual operations. Each processor
can only access its own local memory, and communication takes place through the interconnection
network. The model also scales well, since processors can easily be added to or removed from
the network.
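A minimal sketch of the BSP pattern, using Python threads and a barrier; the shared partial list stands in for message passing, which a true local memory machine would use instead:

```python
import threading

P = 4                      # number of processors
data = list(range(16))     # input, split into one chunk per processor
partial = [0] * P          # stand-in for messages between processors
barrier = threading.Barrier(P)

def worker(pid):
    # Superstep 1: a bulk of purely local computation
    chunk = data[pid * 4:(pid + 1) * 4]
    partial[pid] = sum(chunk)
    barrier.wait()         # bulk synchronization between supersteps
    # Superstep 2: processor 0 combines the partial results
    if pid == 0:
        print("total:", sum(partial))   # total: 120

threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```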
Google’s MapReduce is another example of a local memory model. MapReduce was used by Google
to index the Internet.
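A sequential Python sketch of the MapReduce pattern, using the classic word-count example (in an actual deployment the map and reduce tasks run in parallel across many machines):

```python
from collections import defaultdict

def map_phase(document):
    # emit a (key, value) pair for every word in the document
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # combine all pairs that share the same key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cat sat", "the cat ran"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(reduce_phase(pairs))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```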
2.4 Modular Memory Models
Modular memory models have multiple processors and multiple memory modules. The processors
and modules are connected through an interconnection network. Figure 1.4 shows a diagram of a
modular memory model:
An example of a modular memory model is the Graphics Processing Unit (GPU).
2.5 Mixed Models
Some models combine the concepts from multiple models. The PEM model and the GPU are a
couple of examples.
2.5.1 PEM Model
The Parallel External Memory (PEM) model combines memory modules and shared memory.
Figure 1.5 depicts the PEM model:
2.5.2 GPU
The GPU is an example of a mixed model. The GPU implements both modular memory and
shared memory. Figure 1.6 depicts a GPU:
All of these memory models define primitive operations and their costs. For example, the
Bulk-Synchronous Parallel model counts rounds of communication, while the PRAM model does
not need to.
3 Interconnection Networks
There are different types of interconnection networks that can connect processors to allow
communication.
3.1 Bus/Ethernet
A bus or Ethernet is the simplest connection between processors: essentially, a single wire. The
term 'bus' is typically used when the connection is inside a computer, and 'Ethernet' when the
connection is between computers. Figure 1.7 is a diagram of this connection.
3.2 Linear Array/Ring
The linear array consists of p processors connected to one another linearly. See Figure 1.8:
A ring is a linear array in which the two processors at the ends are also connected to one
another. See Figure 1.9:
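A quick sketch of the neighbor structure: in a linear array the end processors have only one neighbor, while in a ring the indices wrap around:

```python
def linear_array_neighbors(i, p):
    # the ends of the array (i = 0 and i = p - 1) have one neighbor
    return [j for j in (i - 1, i + 1) if 0 <= j < p]

def ring_neighbors(i, p):
    return [(i - 1) % p, (i + 1) % p]  # the ends wrap around

print(linear_array_neighbors(0, 8))  # [1]
print(ring_neighbors(0, 8))          # [7, 1]
```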
3.3 2-D Mesh/Torus
The two-dimensional mesh is the linear array extended to two dimensions. Each processor can
now be connected to up to four other processors. See Figure 1.10:
The torus is the 2-D mesh with the processors on opposite edges connected to one another, so
that every processor has exactly four connections. See Figure 1.11:
Viewed as a 3-D depiction, the torus actually takes the shape of a donut.
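The same idea in two dimensions: a torus neighbor computation differs from the mesh only in that coordinates wrap around modulo the side length:

```python
# Neighbors of processor (r, c) in an n x n mesh vs. torus.
STEPS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def mesh_neighbors(r, c, n):
    return [(r + dr, c + dc) for dr, dc in STEPS
            if 0 <= r + dr < n and 0 <= c + dc < n]

def torus_neighbors(r, c, n):
    return [((r + dr) % n, (c + dc) % n) for dr, dc in STEPS]  # edges wrap

print(len(mesh_neighbors(0, 0, 4)))   # 2  (a corner of the mesh)
print(len(torus_neighbors(0, 0, 4)))  # 4  (every torus node has 4)
```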
3.4 Hypercube
In an n-dimensional hypercube, two processors are connected if the Hamming distance between
their IDs is one. For a 3-dimensional cube, we can label each node of the network with a binary
string, as seen in Figure 1.12. The Hamming distance between two nodes is the number of bit
positions in which their labels differ, so two nodes are connected exactly when their IDs differ
in a single bit. This construction scales to n dimensions.
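Since flipping a single bit of an ID produces a neighbor at Hamming distance one, the hypercube's edges can be generated with bitwise XOR, as in the following sketch:

```python
def hypercube_edges(n):
    """Edges of the n-dimensional hypercube on 2**n processors:
    i and j are connected iff their IDs differ in exactly one bit."""
    edges = []
    for i in range(2 ** n):
        for d in range(n):
            j = i ^ (1 << d)   # flip one bit -> Hamming distance 1
            if i < j:          # record each edge once
                edges.append((i, j))
    return edges

edges = hypercube_edges(3)
print(len(edges))    # 12 edges in the 3-D cube
print(edges[:3])     # [(0, 1), (0, 2), (0, 4)]
```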