
Chapter Five

Basic Communication Operations

By: Hatem Moharram

Addison Wesley:
An Introduction to Parallel Computing 2nd Ed
In most parallel algorithms, processes need to exchange
data with other processes.

This exchange of data can significantly impact the efficiency
of parallel programs by introducing interaction delays during
their execution.

It takes roughly ts + tw m time for a simple exchange of an
m-word message between two processes running on different
nodes of an interconnection network.

ts : the latency or startup time for the data transfer.

tw : the per-word transfer time, which is inversely
proportional to the available bandwidth between the nodes.
4.1 One-to-All Broadcast and
All-to-One Reduction
one-to-all broadcast: A single process sends identical data to
all other processes or to a subset of them.

The source process has the data of size m that needs to
be broadcast.

At the termination of the procedure, there are p copies
of the initial data – one belonging to each process.
The dual of one-to-all broadcast is all-to-one reduction.

all-to-one reduction operation: each of the p
participating processes starts with a buffer M
containing m words.

The data from all processes are combined through an
associative operator and accumulated at a single
destination process into one buffer of size m.

Reduction can be used to find the sum, product, maximum,
or minimum of sets of numbers.

One-to-all broadcast and all-to-one reduction are used in
several important parallel algorithms including matrix-vector
multiplication, Gaussian elimination, shortest paths,
and vector inner product.


4.1.1 Ring or Linear Array
A naive way to perform one-to-all broadcast is to
sequentially send p - 1 messages from the source to the
other p - 1 processes.

This is inefficient because:
1- The source process becomes a bottleneck.
2- The communication network is underutilized because
only the connection between a single pair of nodes is
used at a time.
recursive doubling technique: The source process first sends
the message to another process. Now both these processes
can simultaneously send the message to two other
processes that are still waiting for the message.

By continuing this procedure until all the processes have
received the data, the message can be broadcast in log p
steps.

Figure: one-to-all broadcast on an eight-node ring. Node 0
is the source of the broadcast. Each message transfer step
is shown by a numbered, dotted arrow from the source of the
message to its destination. The number on an arrow indicates
the time step during which the message is transferred.
Notes:
- The message is first sent to the farthest node (4) from the
source (0).

- The message recipients are selected in this manner at each
step to avoid congestion on the network.
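The recursive doubling steps above can be sketched as a small simulation (a hypothetical Python sketch, not part of the original slides; the function name and the way steps are recorded are my assumptions):

```python
def ring_broadcast(p, source=0):
    """Simulate recursive-doubling one-to-all broadcast on a p-node ring.

    Returns a dict mapping each node to the step at which it receives
    the message (0 for the source).  p must be a power of two."""
    assert p & (p - 1) == 0, "p must be a power of two"
    recv_step = {source: 0}          # node -> step at which it has the data
    step, dist = 1, p // 2           # the first send goes to the farthest node
    while dist >= 1:
        # every node that already has the message sends 'dist' hops away
        for node in list(recv_step):
            recv_step[(node + dist) % p] = step
        step, dist = step + 1, dist // 2
    return recv_step

steps = ring_broadcast(8)
print(steps)  # all 8 nodes are covered in log2(8) = 3 steps
```

Matching the figure described above, node 4 receives in step 1, nodes 2 and 6 in step 2, and the remaining odd nodes in step 3.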
Reduction on a linear array can be performed by simply
reversing the direction and the sequence of communication:

1- each odd numbered node sends its buffer to the even numbered node just before
itself, where the contents of the two buffers are combined into one.

2- there are now four buffers left to be reduced on nodes 0, 2, 4, and 6, respectively. In
the second step, the contents of the buffers on nodes 0 and 2 are accumulated on
node 0 and those on nodes 6 and 4 are accumulated on node 4.

3- node 4 sends its buffer to node 0, which computes the final result of the
reduction.
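The three reduction steps above can be simulated as follows (an illustrative Python sketch with sum as the associative operator; the function name is an assumption):

```python
def linear_array_reduce(values):
    """Simulate all-to-one sum reduction on a linear array of p nodes
    (p a power of two) by reversing the broadcast communication pattern."""
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    buf = list(values)
    dist = 1
    while dist < p:
        # nodes whose label is an odd multiple of 'dist' send their buffer
        # to the node 'dist' positions before them, which combines the two
        for node in range(dist, p, 2 * dist):
            buf[node - dist] += buf[node]
        dist *= 2
    return buf[0]   # the final result is accumulated on node 0

print(linear_array_reduce([0, 1, 2, 3, 4, 5, 6, 7]))  # 28
```

In the first pass nodes 1, 3, 5, 7 send to 0, 2, 4, 6; then 2 and 6 send to 0 and 4; finally node 4 sends to node 0, exactly as in steps 1 to 3 above.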
Example 4.1 Matrix-vector multiplication
Consider the problem of multiplying an n x n matrix A with
an n x 1 vector x on an n x n mesh of nodes to yield an n x 1
result vector y.

- each element of the matrix belongs to a different process,

- the vector is distributed among the processes in the
topmost row of the mesh,

- the result vector is generated on the leftmost column of
processes.

Each process needs the element of the vector residing in the
topmost process of its column.

- Each column of nodes performs a one-to-all broadcast of
the vector elements with the topmost process of the column
as the source.

- This is done by treating each column of the n x n mesh as
an n-node linear array, and simultaneously applying the
linear array broadcast procedure described previously to all
n columns.
- each process multiplies its matrix element with the result
of the broadcast.

- each row of processes needs to add its results to generate
the corresponding element of the product vector.

- This is accomplished by performing all-to-one reduction on
each row of the process mesh with the first process of each
row as the destination of the reduction operation.
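The broadcast-multiply-reduce structure of Example 4.1 can be sketched sequentially (an illustrative Python sketch of the data flow, not a parallel implementation; the function name is an assumption):

```python
def mesh_matvec(A, x):
    """Sketch of matrix-vector multiplication on an n x n process mesh:
    (1) one-to-all broadcast of x[j] down each column j,
    (2) local multiply A[i][j] * x[j] on each process (i, j),
    (3) all-to-one sum reduction along each row i."""
    n = len(A)
    # after the columnwise broadcast, process (i, j) holds A[i][j] * x[j]
    partial = [[A[i][j] * x[j] for j in range(n)] for i in range(n)]
    # the rowwise reduction accumulates y[i] on the leftmost process of row i
    return [sum(row) for row in partial]

A = [[1, 2], [3, 4]]
x = [5, 6]
print(mesh_matvec(A, x))  # [17, 39]
```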
4.1.2 Mesh
We can regard each row and column of a square mesh of p
nodes as a linear array of √p nodes.
A linear array communication operation can be performed in
two phases on a mesh.

- In the first phase, the operation is performed along one or
all rows by treating the rows as linear arrays.

- In the second phase, the columns are treated similarly.
Consider the problem of one-to-all broadcast on a two-
dimensional square mesh with √p rows and √p columns.

1- a one-to-all broadcast is performed from the source to
the remaining (√p - 1) nodes of the same row.
2- Once all the nodes in a row of the mesh have acquired the
data, they initiate a one-to-all broadcast in their respective
columns.

Figure: one-to-all broadcast on a mesh with node 0 at the
bottom-left corner as the source. Steps 1 and 2 correspond
to the first phase, and steps 3 and 4 correspond to the
second phase.
- We can use a similar procedure for one-to-all broadcast on
a three-dimensional mesh as well: rows of p^(1/3) nodes in
each of the three dimensions of the mesh would be treated
as linear arrays.

- As in the case of a linear array, reduction can be performed
on two- and three-dimensional meshes by simply reversing
the direction and the order of messages.
4.1.3 Hypercube
A hypercube with 2^d nodes can be regarded as a d-
dimensional mesh with two nodes in each dimension.

The mesh algorithm can be extended to the hypercube,
except that the process is now carried out in d steps – one
in each dimension.

- communication starts along the highest dimension and
proceeds along successively lower dimensions in
subsequent steps.
- Note that the source and the destination nodes in the three
communication steps of the algorithm are identical to the
ones in the broadcast algorithm on a linear array.

Figure: one-to-all broadcast on a three-dimensional
hypercube with node 0 as the source.
Unlike a linear array, the hypercube broadcast would not suffer from
congestion if node 0 started out by sending the message to node 1 in
the first step, followed by nodes 0 and 1 sending messages to nodes 2
and 3, respectively, and finally nodes 0, 1, 2, and 3 sending messages
to nodes 4, 5, 6, and 7, respectively.
4.1.4 Balanced Binary Tree
The hypercube algorithm for one-to-all broadcast maps
naturally onto a balanced binary tree in which each leaf is a
processing node and intermediate nodes serve only as
switching units.
4.1.5 Detailed Algorithms

The basic communication pattern for one-to-all broadcast is
identical on all four interconnection networks considered in
this section.
Algorithm 4.1
- shows a one-to-all broadcast procedure on a 2^d-node
network when node 0 is the source of the broadcast.

- The procedure is executed at all the nodes; at each node,
the value of my_id is the label of that node.

- Let X be the message to be broadcast, which initially
resides at the source node 0.

- The procedure performs d communication steps, one along
each dimension of a hypothetical hypercube.

- communication proceeds from the highest to the lowest
dimension.

- The loop counter i indicates the current dimension of the
hypercube in which communication is taking place.

- Only the nodes with zero in the i least significant bits of
their labels participate in communication along dimension i.

- the nodes with a zero at bit position i send the data, and
the nodes with a one at bit position i receive it.
NOTES:
- Algorithm 4.1 works only if node 0 is the source of the
broadcast. For an arbitrary source, we must relabel the
nodes of the hypothetical hypercube by XORing the label of
each node with the label of the source node before we
apply this procedure.

- Algorithm 4.2 relabels the source node to 0, and relabels
the other nodes relative to the source. After this
relabeling, Algorithm 4.1 can be applied to perform the
broadcast.

- Algorithm 4.3 gives a procedure to perform an all-to-one
reduction on a hypothetical d-dimensional hypercube such
that the final result is accumulated on node 0.
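The slides reference Algorithms 4.1 and 4.2 without reproducing them; the following is a hypothetical Python reconstruction of the broadcast with XOR relabeling, consistent with the description above (the function name and the dict-based message store are my assumptions, not the book's pseudocode):

```python
def hypercube_broadcast(d, source, data):
    """Simulate one-to-all broadcast on a 2^d-node hypercube from an
    arbitrary source: node labels are virtually relabeled by XORing
    with the source label, so the source behaves like node 0."""
    p = 1 << d
    X = {source: data}                       # node -> received message
    mask = p - 1                             # all d bits set
    for i in range(d - 1, -1, -1):           # highest dimension first
        mask ^= 1 << i                       # clear bit i of the mask
        for my_id in range(p):
            vid = my_id ^ source             # virtual (relabeled) id
            # only nodes whose i least significant virtual bits are zero
            # participate; those with virtual bit i == 0 are the senders
            if vid & mask == 0 and vid & (1 << i) == 0 and my_id in X:
                partner = my_id ^ (1 << i)   # neighbor along dimension i
                X[partner] = X[my_id]
    return X

result = hypercube_broadcast(3, source=5, data="msg")
print(sorted(result))  # all 8 node labels; each holds "msg"
```

Setting source=0 makes vid equal to my_id, which recovers the plain Algorithm 4.1 behavior.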
4.1.6 Cost Analysis
Assume that p processes participate in the operation and
the data to be broadcast or reduced contains m words. The
broadcast or reduction procedure involves log p point-to-
point simple message transfers, each at a time cost of
ts + tw m.

The total time taken by the procedure is

T = (ts + tw m) log p.
4.2 All-to-All Broadcast and Reduction
All-to-all broadcast is a generalization of one-to-all
broadcast in which all p nodes simultaneously initiate a
broadcast.

A process sends the same m-word message to every other
process, but different processes may broadcast different
messages.

All-to-all broadcast is used in matrix operations, including
matrix multiplication and matrix-vector multiplication.
The dual of all-to-all broadcast is all-to-all reduction, in
which every node is the destination of an all-to-one
reduction.
One way to perform an all-to-all broadcast is to perform p
one-to-all broadcasts, one starting at each node.

If performed naively, on some architectures this approach
may take up to p times as long as a one-to-all broadcast.

It is possible to use the communication links in the
interconnection network more efficiently by performing all p
one-to-all broadcasts simultaneously: all messages traversing
the same path at the same time are concatenated into a
single message whose size is the sum of the sizes of the
individual messages.
4.2.1 Linear Array and Ring
Each node first sends to one of its neighbors the data it
needs to broadcast. In subsequent steps, it forwards the
data received from one of its neighbors to its other
neighbor.
All-to-all reduction can be performed by reversing the
direction and sequence of the messages.

The only additional step required is that upon receiving a
message, a node must combine it with the local copy of the
message that has the same destination as the received
message before forwarding the combined message to the
next neighbor.
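The forwarding pattern described above can be sketched as a simulation (an illustrative Python sketch; the function name and list representation are my assumptions):

```python
def ring_all_to_all_broadcast(messages):
    """Simulate all-to-all broadcast on a p-node ring: in each of the
    p - 1 steps, every node sends to its right neighbor the message it
    received in the previous step (initially, its own message)."""
    p = len(messages)
    collected = [[m] for m in messages]      # node i starts with its own message
    in_transit = list(messages)              # message each node will send next
    for _ in range(p - 1):
        # all p links are used simultaneously: node i sends to node (i + 1) % p
        received = [in_transit[(i - 1) % p] for i in range(p)]
        for i in range(p):
            collected[i].append(received[i])
        in_transit = received                # forward what was just received
    return collected

out = ring_all_to_all_broadcast(["m0", "m1", "m2", "m3"])
print(out[0])  # node 0 ends with all four messages
```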
4.2.2 Mesh
Like one-to-all broadcast, the all-to-all broadcast
algorithm for the 2-D mesh is based on the
linear array algorithm, treating rows and
columns of the mesh as linear arrays.

Communication takes place in two phases.
In the first phase, each row of the mesh
performs an all-to-all broadcast using the
procedure for the linear array.

All nodes collect √p messages
corresponding to the √p nodes of their
respective rows. Each node consolidates
this information into a single message of
size m√p, and proceeds to the second
communication phase of the algorithm.
The second communication phase is a columnwise all-to-all
broadcast of the consolidated messages.

By the end of this phase, each node obtains all p pieces of
m-word data that originally resided on different nodes.
4.2.3 Hypercube
The hypercube algorithm for all-to-all broadcast is an
extension of the mesh algorithm to log p dimensions. The
procedure requires log p steps. Communication takes place
along a different dimension of the p-node hypercube in each
step. In every step, pairs of nodes exchange their data and
double the size of the message to be transmitted in the next
step by concatenating the received message with their
current data.
As usual, the algorithm for all-to-all reduction can be
derived by reversing the order and direction of messages in
all-to-all broadcast.

Furthermore, instead of concatenating the messages, the
reduction operation needs to select the appropriate subsets
of the buffer to send out and accumulate received messages
in each iteration.
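The hypercube exchange-and-double pattern can be sketched as a simulation (an illustrative Python sketch; the function name and list representation are my assumptions):

```python
def hypercube_all_to_all(messages):
    """Simulate all-to-all broadcast on a p-node hypercube (p a power
    of two): log p steps, with a pairwise exchange along one dimension
    per step, the message size doubling each step."""
    p = len(messages)
    assert p & (p - 1) == 0, "p must be a power of two"
    buf = [[m] for m in messages]       # node i holds a growing message list
    dim = 1
    while dim < p:
        # each node exchanges its whole buffer with its partner along
        # the current dimension and concatenates the two buffers
        buf = [buf[i] + buf[i ^ dim] for i in range(p)]
        dim <<= 1
    return buf

out = hypercube_all_to_all([f"m{i}" for i in range(8)])
print(len(out[0]))  # 8 – every node has every message after log2(8) = 3 steps
```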
4.2.4 Cost Analysis
1- On a ring or a linear array,
all-to-all broadcast involves p - 1 steps of communication
between nearest neighbors.

Each step, involving a message of size m, takes time
ts + tw m.

Therefore, the time taken by the entire operation is

T = (ts + tw m)(p - 1).
2- On a mesh,
the first phase of √p simultaneous all-to-all broadcasts
(each among √p nodes) concludes in time

(ts + tw m)(√p - 1).

The number of nodes participating in each all-to-all
broadcast in the second phase is also √p, but the size of
each message is now m√p.

Therefore, this phase takes time

(ts + tw m√p)(√p - 1)

to complete.

The time for the entire all-to-all broadcast on a p-node two-
dimensional square mesh is the sum of the times spent in
the individual phases, which is

T = 2 ts (√p - 1) + tw m (p - 1).
3- On a p-node hypercube,
the size of each message exchanged in the i-th of the log p
steps is 2^(i-1) m.

It takes a pair of nodes time ts + 2^(i-1) tw m to send and
receive messages from each other during the i-th step.

Hence, the time to complete the entire procedure is:

T = Σ_{i=1}^{log p} (ts + 2^(i-1) tw m) = ts log p + tw m (p - 1).
4.3 All-Reduce and Prefix-Sum Operations
The communication pattern of all-to-all broadcast can be
used to perform some other operations as well.

1- all-reduce operation: each node starts with a buffer of
size m, and the final results of the operation are identical
buffers of size m on each node, formed by combining the
original p buffers using an associative operator.

An all-reduce is equivalent to performing an all-to-one
reduction followed by a one-to-all broadcast of the result.

This operation is different from all-to-all reduction, in
which p simultaneous all-to-one reductions take place, each
with a different destination for the result.

A simple method to perform all-reduce is therefore to
perform an all-to-one reduction followed by a one-to-all
broadcast.

There is a faster way to perform all-reduce by using the
communication pattern of all-to-all broadcast.

Ex.: Assume that each integer in parentheses in the figure
denotes a number to be added that originally resided at the
node with that integer label.

To perform reduction, we follow the communication steps of
the all-to-all broadcast procedure, but at the end of each
step, add the two numbers instead of concatenating the two
messages.

At the termination of the reduction procedure, each node
holds the sum (0 + 1 + 2 + ··· + 7).

Unlike all-to-all broadcast, the size of the messages
transferred in the reduction operation does not grow from
step to step; in this example each message has only one word.

The total communication time for all log p steps is

T = (ts + tw m) log p.
2- prefix sums (the scan operation)
Given p numbers n0, n1, ..., np-1 (one on each node), the
problem is to compute the sums

sk = n0 + n1 + ··· + nk

for all k between 0 and p - 1.

Ex.: if the original sequence of numbers is <3, 1, 4, 0, 2>, then the
sequence of prefix sums is <3, 4, 8, 8, 10>.

Instead of starting with a single number, each node could
start with a buffer or vector of size m, and the m-word result
would be the sum of the corresponding elements of the buffers.
At the end of a communication step, the content of an
incoming message is added to the result buffer only if the
message comes from a node with a smaller label than that
of the recipient node.
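The rule above – keep a running total to exchange, but fold an incoming message into the result only when it comes from smaller labels – can be sketched as a simulation (an illustrative Python sketch; names are my assumptions). The test input extends the slide's five-number example to a power-of-two length:

```python
def hypercube_prefix_sums(values):
    """Simulate prefix sums on a p-node hypercube using the all-to-all
    broadcast pattern.  Each node keeps two buffers: 'total', which is
    exchanged with partners, and 'result', updated only when the
    incoming partial sum comes from nodes with smaller labels."""
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    result = list(values)                # running prefix sum on each node
    total = list(values)                 # running sum of the node's block
    dim = 1
    while dim < p:
        incoming = [total[i ^ dim] for i in range(p)]
        for i in range(p):
            # add to the result only if the message came from a
            # partner with a smaller label
            if (i ^ dim) < i:
                result[i] += incoming[i]
            total[i] += incoming[i]      # the total always accumulates
        dim <<= 1
    return result

print(hypercube_prefix_sums([3, 1, 4, 0, 2, 5, 7, 6]))
```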
4.4 Scatter and Gather
1- the scatter operation (one-to-all personalized
communication): a single node sends a unique message of
size m to every other node.

One-to-all personalized communication is different from
one-to-all broadcast in that the source node starts with p
unique messages, one destined for each node.

Unlike one-to-all broadcast, one-to-all personalized
communication does not involve any duplication of data.
The communication patterns of one-to-all broadcast and
scatter are identical. Only the size and the contents of the
messages are different.

There is a total of log p communication steps,
corresponding to the log p dimensions of the hypercube.

Figure: the scatter operation on an eight-node hypercube.
2- gather operation (concatenation): the dual of one-to-all
personalized communication, in which a single node collects
a unique message from each node.

A gather operation is different from an all-to-one reduction
in that it does not involve any combination or reduction of
data.

The gather operation is simply the reverse of scatter. Each
node starts with an m-word message.

In the first step, every odd numbered node sends its buffer
to the even numbered neighbor behind it, which
concatenates the received message with its own buffer.

Only the even numbered nodes participate in the next
communication step, which results in the nodes whose labels
are multiples of four gathering more data and doubling the
sizes of their data.

The process continues similarly, until node 0 has gathered
the entire data.
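The halving pattern of the scatter operation on a hypercube can be sketched as a simulation (an illustrative Python sketch; names are my assumptions). Running the steps in reverse, with concatenation instead of splitting, gives the gather operation:

```python
def hypercube_scatter(node0_data, d):
    """Simulate scatter on a 2^d-node hypercube: node 0 starts with
    all p messages and, in each of the d steps, sends across the
    current (highest remaining) dimension the half of its data that
    is destined for the partner's sub-cube, keeping the other half."""
    p = 1 << d
    assert len(node0_data) == p
    held = {0: list(node0_data)}             # node -> messages it currently holds
    for i in range(d - 1, -1, -1):           # highest dimension first
        for node in list(held):
            half = len(held[node]) // 2
            partner = node ^ (1 << i)
            # the upper half is destined for nodes in the partner's sub-cube
            held[partner] = held[node][half:]
            held[node] = held[node][:half]
    return {node: msgs[0] for node, msgs in held.items()}

out = hypercube_scatter([f"m{i}" for i in range(8)], d=3)
print(out)  # node i receives message m_i
```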
