Parallel Breadth First Search: CS 420 Final Project
Problem
Implement a parallel Breadth-First Search to compute the distances to all the nodes in a
graph starting from node zero. Use a level-synchronous algorithm that synchronizes
after computing the distances to all nodes at depth d before visiting nodes at depth d+1.
The graph can be generated using a random graph generator, such as the one available
with Graph500.
Introduction
Challenges
2. The difficulty of effectively allocating tasks among all the nodes to balance their
workload and compute time at each level.
Implementation
1. Sequential BFS:
The sequential implementation involved no parallelism and used a single thread
on a single processing core to traverse the entire graph. It served as the baseline
for comparison with the parallel codes. This code was compiled with different
optimization flags to observe their effect on the BFS running time. A minimal
sketch of the level-synchronous loop it implements is given below.
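As a point of reference, the following is a minimal sketch of the level-synchronous
traversal described above (the function names and the CSR graph layout are illustrative
choices, not the exact project code): all vertices at depth d are expanded before any
vertex at depth d+1 is visited.

    /* Minimal level-synchronous sequential BFS from vertex 0 over a graph
     * stored in CSR form (row_ptr/col_idx).  Illustrative sketch only. */
    #include <stdio.h>
    #include <stdlib.h>

    void bfs_levels(int n, const int *row_ptr, const int *col_idx, int *depth) {
        int *frontier = malloc(n * sizeof(int));
        int *next     = malloc(n * sizeof(int));
        for (int v = 0; v < n; v++) depth[v] = -1;   /* -1 = unvisited */

        depth[0] = 0;
        frontier[0] = 0;
        int fsize = 1, level = 0;

        while (fsize > 0) {                  /* one iteration per BFS level */
            int nsize = 0;
            for (int i = 0; i < fsize; i++) {
                int u = frontier[i];
                for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                    int v = col_idx[e];
                    if (depth[v] == -1) {    /* first time v is reached */
                        depth[v] = level + 1;
                        next[nsize++] = v;
                    }
                }
            }
            /* swap frontiers: all of level d is finished before d+1 starts */
            int *tmp = frontier; frontier = next; next = tmp;
            fsize = nsize;
            level++;
        }
        free(frontier);
        free(next);
    }

    int main(void) {
        /* tiny example graph: edges 0-1, 0-2, 1-3, 2-3 (undirected, CSR) */
        int row_ptr[] = {0, 2, 4, 6, 8};
        int col_idx[] = {1, 2, 0, 3, 0, 3, 1, 2};
        int depth[4];
        bfs_levels(4, row_ptr, col_idx, depth);
        for (int v = 0; v < 4; v++)
            printf("vertex %d: depth %d\n", v, depth[v]);
        return 0;
    }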
2. OpenMP BFS:
The sequential code was parallelized with OpenMP to use the multiple threads of a
single computation node through shared-memory parallelism. This code involves no
message passing, so the only performance bottleneck of the parallel BFS was the
data races resulting from multi-threaded access to the shared visited/depth data.
The number of threads, the scheduling clause, and the proc_bind clause were varied
to study their effect on the running time and thus the scalability of the code; a
sketch of the parallelized level expansion is given below.
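The sketch below illustrates one way the level expansion can be parallelized with the
clauses discussed above; the helper name, the compare-and-swap used to resolve the
visited-flag race, and the particular chunk size are our assumptions rather than the
project's exact code.

    #include <omp.h>

    /* Expand one BFS level in parallel over the current frontier (CSR layout
     * as in the sequential sketch).  Returns the size of the next frontier. */
    int expand_level_omp(const int *row_ptr, const int *col_idx,
                         const int *frontier, int fsize,
                         int *next, int *depth, int level) {
        int count = 0;   /* shared counter for the next frontier */

        /* schedule, chunk size and proc_bind are the knobs varied in the study */
        #pragma omp parallel for schedule(dynamic, 64) proc_bind(spread)
        for (int i = 0; i < fsize; i++) {
            int u = frontier[i];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int v = col_idx[e];
                /* claim v exactly once even if two threads see it unvisited */
                if (__sync_bool_compare_and_swap(&depth[v], -1, level + 1)) {
                    int pos;
                    #pragma omp atomic capture
                    pos = count++;        /* reserve a slot in the next frontier */
                    next[pos] = v;
                }
            }
        }
        return count;
    }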
3. MPI + OpenMP BFS:
a. Load balancing by splitting the vertices of the graph between the nodes:
The total number of vertices is split equally among the processors, and each
processor holds the neighbor lists of its own vertices. At each level, the
neighboring vertices that need to be traversed at the next level must be passed
to the processors that own them, so each processor has to communicate with all
the others. This exchange was implemented in three ways: Isend/Irecv, Alltoallv,
and Ialltoallv (a sketch of the Alltoallv exchange is given below). Both strong
scaling and weak scaling studies were performed for each of them.
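The sketch below shows one plausible shape of the per-level Alltoallv exchange for this
vertex-partitioned scheme; the helper name and the buffer handling are illustrative
assumptions, not the project's exact routine.

    #include <mpi.h>
    #include <stdlib.h>

    /* sendbuf holds the discovered vertices already grouped by destination
     * rank; sendcounts[r] is how many go to rank r.  Returns the vertices
     * this rank must expand at the next level. */
    int *exchange_frontier(const int *sendbuf, const int *sendcounts,
                           int nprocs, MPI_Comm comm, int *recv_total) {
        int *recvcounts = malloc(nprocs * sizeof(int));
        int *sdispls    = malloc(nprocs * sizeof(int));
        int *rdispls    = malloc(nprocs * sizeof(int));

        /* every rank first learns how much it will receive from every other rank */
        MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

        sdispls[0] = rdispls[0] = 0;
        for (int r = 1; r < nprocs; r++) {
            sdispls[r] = sdispls[r - 1] + sendcounts[r - 1];
            rdispls[r] = rdispls[r - 1] + recvcounts[r - 1];
        }
        *recv_total = rdispls[nprocs - 1] + recvcounts[nprocs - 1];

        int *recvbuf = malloc((*recv_total > 0 ? *recv_total : 1) * sizeof(int));
        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                      recvbuf, recvcounts, rdispls, MPI_INT, comm);

        free(recvcounts); free(sdispls); free(rdispls);
        return recvbuf;   /* next-level frontier owned by this rank */
    }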
c. Local list of nodes kept on each node and synchronization at each level:
Instead of partitioning based on the vertices, the graph is partitioned based on
the edges. Each processor holds the list of edges that it will process during the
BFS traversal. At the end of each level, the nodes use Allgatherv to generate the
list of vertices to be traversed in the next level, and they also use Allgatherv
to synchronize the vertices just visited together with their depths (a sketch of
this gather step is given below). Only a strong scaling study could be performed
for this scheme because of its memory requirements.
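The following sketch shows how such an end-of-level Allgatherv step can look; the
helper name and buffer handling are illustrative assumptions, and the same pattern
would be repeated to synchronize the depth updates.

    #include <mpi.h>
    #include <stdlib.h>

    /* Gather the vertices discovered this level by every rank so that all
     * ranks hold the identical next frontier; this is also why each rank
     * must keep the whole vertex list in memory. */
    int *gather_next_frontier(const int *local, int local_count,
                              int nprocs, MPI_Comm comm, int *total) {
        int *counts = malloc(nprocs * sizeof(int));
        int *displs = malloc(nprocs * sizeof(int));

        /* share how many vertices each rank discovered this level */
        MPI_Allgather(&local_count, 1, MPI_INT, counts, 1, MPI_INT, comm);

        displs[0] = 0;
        for (int r = 1; r < nprocs; r++)
            displs[r] = displs[r - 1] + counts[r - 1];
        *total = displs[nprocs - 1] + counts[nprocs - 1];

        int *global = malloc((*total > 0 ? *total : 1) * sizeof(int));
        MPI_Allgatherv(local, local_count, MPI_INT,
                       global, counts, displs, MPI_INT, comm);

        free(counts); free(displs);
        return global;   /* identical next frontier on every rank */
    }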
Results
1. Sequential BFS
Our results from our sequential implementation of the BFS graph traversal were
expected and in-line with results generated from previous homework assignments. The
-O1 optimization flag produces an optimized image in a short amount of time, -O2
optimization performs all supported optimizations within the given architecture that does
not involve space-speed tradeoff, and -O3 performs even more optimizations prioritizing
speed over size. From -O1 to -O2 we observe an 18% speed increase and from -O2 to
-O3 we observe a 4% speed increase. As optimization level increases so does the
speed of BFS which is the expected result.
2. OpenMP BFS
In our OpenMP BFS results we observe a strong relationship between increasing the
number of threads and the time taken for execution; as the number of threads increases
the time that it took to traverse the graph using BFS decreased. This trend is present for
unscheduled, as well as static and dynamic scheduling.
When the scheduling policy within the BFS code was varied, dynamic scheduling tended
to perform faster than static scheduling. Because the code is parallelized over vertices
and each vertex has a variable amount of work, the load balancing provided by dynamic
scheduling improves the speed of our implementation even with its increased scheduling
overhead.
The chunk size was defined as (N / t * num_threads), where N is the number of vertices
to be processed and t is a varying constant that controls the chunk size. When varying
the chunk size with a large number of threads under dynamic scheduling, we observed a
strong trend towards larger chunk sizes resulting in faster runtimes. The opposite result
occurred with a small number of threads, as is to be expected: with fewer threads it is
less beneficial to use dynamic load balancing with its higher scheduling overhead.
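For concreteness, the fragment below shows how such a chunk size can be fed into the
schedule clause; note that we read the expression as N divided by (t * num_threads),
which is our interpretation of the formula above, and the loop body is elided.

    #include <omp.h>

    /* Run one level with a chunk size derived from the frontier size N and
     * the tuning constant t (parenthesization is our assumption). */
    void run_level(int N, int t) {
        int chunk = N / (t * omp_get_max_threads());
        if (chunk < 1) chunk = 1;
        #pragma omp parallel for schedule(dynamic, chunk)
        for (int i = 0; i < N; i++) {
            /* expand frontier[i] as in the earlier OpenMP sketch */
        }
    }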
No significant difference in execution time was observed between spread and master for
the proc_bind clause, but in general spread was faster than master and benefited more
from a smaller chunk size.
3. MPI + OpenMP BFS
A large variation in the BFS time is observed for the different types of message passing
methods - Alltoallv(Ialltoallv) vs Isend/Irecv. Time difference between the codes
employing Alltoallv and Ialltoallv is minimal, owing to the fact that there is not much
computational work overlap with the message passing (due to the nature of BFS
algorithm). Surprisingly the non-blocking functions seems to be less efficient than the its
blocking counterpart. Further this result is consistent with both vector and neighbor
splitting. More in-depth understanding of the exact functioning and the various
processes involved in the non-blocking communication is required to explain the
observed trend. At this point, it can be hypothesized that the non-blocking
communication requires more work than the blocking version, and the gain achieved by
the communication-computation overlap is not large enough to offset this.
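The sketch below shows the non-blocking pattern we are describing: with MPI_Ialltoallv
the exchange is started early, but in a level-synchronous BFS little independent work
fits between the start of the exchange and the MPI_Wait, so the potential overlap is
small. The buffer setup is assumed to be done as in the earlier Alltoallv sketch.

    #include <mpi.h>

    /* Non-blocking variant of the per-level frontier exchange. */
    void exchange_frontier_nb(const int *sendbuf, const int *sendcounts,
                              const int *sdispls, int *recvbuf,
                              const int *recvcounts, const int *rdispls,
                              MPI_Comm comm) {
        MPI_Request req;
        MPI_Ialltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                       recvbuf, recvcounts, rdispls, MPI_INT, comm, &req);

        /* only bookkeeping for the finished level can be overlapped here */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* frontier needed before next level */
    }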
On the other hand, the BFS code employing Isend/Irecv took around two orders of
magnitude more time than the Alltoallv-based version. Since each process has to
communicate with all the other processes during the BFS traversal, there is heavy
congestion when a large number of processes is involved. In this scenario it is much
better to use MPI's built-in collective communication functions such as Alltoallv.
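For comparison, a sketch of the point-to-point variant is shown below; it assumes the
per-destination counts have already been exchanged, and it makes the congestion argument
concrete: every rank posts a send to and a receive from every other rank at every level,
so the network carries on the order of P*P messages per level.

    #include <mpi.h>
    #include <stdlib.h>

    /* Point-to-point frontier exchange: one Irecv and one Isend per peer. */
    void exchange_frontier_p2p(int rank, int nprocs,
                               int **sendbufs, const int *sendcounts,
                               int **recvbufs, const int *recvcounts,
                               MPI_Comm comm) {
        MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));
        int nreq = 0;
        for (int r = 0; r < nprocs; r++) {
            if (r == rank) continue;
            MPI_Irecv(recvbufs[r], recvcounts[r], MPI_INT, r, 0, comm, &reqs[nreq++]);
            MPI_Isend(sendbufs[r], sendcounts[r], MPI_INT, r, 0, comm, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }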
Although we observe a large variation between the different message passing functions,
the times taken by the BFS codes employing vertex splitting and neighbor-list splitting
are similar. This is a direct result of the graph properties of the Graph500 benchmark:
the neighbors are not concentrated around a few particular vertices but are spread
evenly over all the vertices, so the work remains almost balanced at every stage even
when plain vertex splitting is employed. This is also evident from the small number of
vertices that are moved between processors during load balancing, and from the fact
that these transfers only occur between adjacent processors.
The local division of the edges is also much slower than the Alltoallv and Ialltoallv
codes (for both vertex and neighbor-list splitting), although it is faster than the
Isend/Irecv code. In addition, this kind of splitting has another serious disadvantage:
since each processor has to maintain the entire vertex list for updating, the method
does not scale in memory as the number of processors increases. This is the reason for
the absence of this method from the weak scaling study.
Based on our results for the MPI and OpenMP BFS implementations, level-by-level BFS
does not scale well with multi-node message passing. We believe this is because the time
spent on the message passing needed to synchronize all the nodes outweighs the actual
computation performed by each node: assigning and reporting the depth of each vertex is
a trivial computation compared to the message passing as the number of vertices and
edges of the graph grows. This line of thought, together with the heavy branching in the
code, is also what led us to exclude GPUs when first investigating the problem.
Conclusion