Parallel Breadth First Search: CS 420 Final Project
Problem
Implement a parallel Breadth-First Search to compute the distances to all the nodes in a
graph starting from node zero. Use a level-synchronous algorithm that synchronizes
after computing the distances to all nodes at depth d before visiting nodes at depth d+1.
The graph can be generated using a random graph generator, such as the one available
with Graph500.
Introduction
Challenges
2. The difficulty of effectively allocating tasks among all the nodes to balance their
workload and compute time at each level.
Implementation
1. Sequential BFS:
The sequential implementation involved no parallelism and used a single thread
on a single processing core to traverse the entire graph. It served as the baseline
for comparison with the parallel codes. This code was compiled with different
optimization flags to observe their effect on the BFS running time. A minimal
sketch of the level-synchronous loop it implements is given below.
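As a point of reference, the following is a minimal sketch of the level-synchronous
traversal described above (the function names and the CSR graph layout are illustrative
choices, not the exact project code): all vertices at depth d are expanded before any
vertex at depth d+1 is visited.

    /* Minimal level-synchronous sequential BFS from vertex 0 over a graph
     * stored in CSR form (row_ptr/col_idx).  Illustrative sketch only. */
    #include <stdio.h>
    #include <stdlib.h>

    void bfs_levels(int n, const int *row_ptr, const int *col_idx, int *depth) {
        int *frontier = malloc(n * sizeof(int));
        int *next     = malloc(n * sizeof(int));
        for (int v = 0; v < n; v++) depth[v] = -1;   /* -1 = unvisited */

        depth[0] = 0;
        frontier[0] = 0;
        int fsize = 1, level = 0;

        while (fsize > 0) {                  /* one iteration per BFS level */
            int nsize = 0;
            for (int i = 0; i < fsize; i++) {
                int u = frontier[i];
                for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                    int v = col_idx[e];
                    if (depth[v] == -1) {    /* first time v is reached */
                        depth[v] = level + 1;
                        next[nsize++] = v;
                    }
                }
            }
            /* swap frontiers: all of level d is finished before d+1 starts */
            int *tmp = frontier; frontier = next; next = tmp;
            fsize = nsize;
            level++;
        }
        free(frontier);
        free(next);
    }

    int main(void) {
        /* tiny example graph: edges 0-1, 0-2, 1-3, 2-3 (undirected, CSR) */
        int row_ptr[] = {0, 2, 4, 6, 8};
        int col_idx[] = {1, 2, 0, 3, 0, 3, 1, 2};
        int depth[4];
        bfs_levels(4, row_ptr, col_idx, depth);
        for (int v = 0; v < 4; v++)
            printf("vertex %d: depth %d\n", v, depth[v]);
        return 0;
    }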
2. OpenMP BFS:
The sequential code was parallelized with OpenMP to use the multiple threads of a
single computation node through shared-memory parallelism. This code involves no
message passing, so the only performance bottleneck of the parallel BFS was the
data races resulting from multi-threaded access to the shared visited/depth data.
The number of threads, the scheduling clause, and the proc_bind clause were varied
to study their effect on the running time and thus the scalability of the code; a
sketch of the parallelized level expansion is given below.
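The sketch below illustrates one way the level expansion can be parallelized with the
clauses discussed above; the helper name, the compare-and-swap used to resolve the
visited-flag race, and the particular chunk size are our assumptions rather than the
project's exact code.

    #include <omp.h>

    /* Expand one BFS level in parallel over the current frontier (CSR layout
     * as in the sequential sketch).  Returns the size of the next frontier. */
    int expand_level_omp(const int *row_ptr, const int *col_idx,
                         const int *frontier, int fsize,
                         int *next, int *depth, int level) {
        int count = 0;   /* shared counter for the next frontier */

        /* schedule, chunk size and proc_bind are the knobs varied in the study */
        #pragma omp parallel for schedule(dynamic, 64) proc_bind(spread)
        for (int i = 0; i < fsize; i++) {
            int u = frontier[i];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int v = col_idx[e];
                /* claim v exactly once even if two threads see it unvisited */
                if (__sync_bool_compare_and_swap(&depth[v], -1, level + 1)) {
                    int pos;
                    #pragma omp atomic capture
                    pos = count++;        /* reserve a slot in the next frontier */
                    next[pos] = v;
                }
            }
        }
        return count;
    }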
3. MPI + OpenMP BFS:
a. Load balancing by splitting the vertices of the graph between the nodes:
The total number of vertices is split equally among the processors, and each
processor holds the neighbor lists of its own vertices. At each level, the
neighboring vertices that need to be traversed at the next level must be passed
to the processors that own them, so each processor has to communicate with all
the others. This exchange was implemented in three ways: Isend/Irecv, Alltoallv,
and Ialltoallv (a sketch of the Alltoallv exchange is given below). Both strong
scaling and weak scaling studies were performed for each of them.
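The sketch below shows one plausible shape of the per-level Alltoallv exchange for this
vertex-partitioned scheme; the helper name and the buffer handling are illustrative
assumptions, not the project's exact routine.

    #include <mpi.h>
    #include <stdlib.h>

    /* sendbuf holds the discovered vertices already grouped by destination
     * rank; sendcounts[r] is how many go to rank r.  Returns the vertices
     * this rank must expand at the next level. */
    int *exchange_frontier(const int *sendbuf, const int *sendcounts,
                           int nprocs, MPI_Comm comm, int *recv_total) {
        int *recvcounts = malloc(nprocs * sizeof(int));
        int *sdispls    = malloc(nprocs * sizeof(int));
        int *rdispls    = malloc(nprocs * sizeof(int));

        /* every rank first learns how much it will receive from every other rank */
        MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

        sdispls[0] = rdispls[0] = 0;
        for (int r = 1; r < nprocs; r++) {
            sdispls[r] = sdispls[r - 1] + sendcounts[r - 1];
            rdispls[r] = rdispls[r - 1] + recvcounts[r - 1];
        }
        *recv_total = rdispls[nprocs - 1] + recvcounts[nprocs - 1];

        int *recvbuf = malloc((*recv_total > 0 ? *recv_total : 1) * sizeof(int));
        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                      recvbuf, recvcounts, rdispls, MPI_INT, comm);

        free(recvcounts); free(sdispls); free(rdispls);
        return recvbuf;   /* next-level frontier owned by this rank */
    }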
c. Local list of nodes kept on each node and synchronization at each level:
Instead of partitioning based on the vertices, the graph is partitioned based on
the edges. Each processor holds the list of edges that it will process during the
BFS traversal. At the end of each level, the nodes use Allgatherv to generate the
list of vertices to be traversed in the next level, and they also use Allgatherv
to synchronize the vertices just visited together with their depths (a sketch of
this gather step is given below). Only a strong scaling study could be performed
for this scheme because of its memory requirements.
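The following sketch shows how such an end-of-level Allgatherv step can look; the
helper name and buffer handling are illustrative assumptions, and the same pattern
would be repeated to synchronize the depth updates.

    #include <mpi.h>
    #include <stdlib.h>

    /* Gather the vertices discovered this level by every rank so that all
     * ranks hold the identical next frontier; this is also why each rank
     * must keep the whole vertex list in memory. */
    int *gather_next_frontier(const int *local, int local_count,
                              int nprocs, MPI_Comm comm, int *total) {
        int *counts = malloc(nprocs * sizeof(int));
        int *displs = malloc(nprocs * sizeof(int));

        /* share how many vertices each rank discovered this level */
        MPI_Allgather(&local_count, 1, MPI_INT, counts, 1, MPI_INT, comm);

        displs[0] = 0;
        for (int r = 1; r < nprocs; r++)
            displs[r] = displs[r - 1] + counts[r - 1];
        *total = displs[nprocs - 1] + counts[nprocs - 1];

        int *global = malloc((*total > 0 ? *total : 1) * sizeof(int));
        MPI_Allgatherv(local, local_count, MPI_INT,
                       global, counts, displs, MPI_INT, comm);

        free(counts); free(displs);
        return global;   /* identical next frontier on every rank */
    }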
Results
1. Sequential BFS
Our results from our sequential implementation of the BFS graph traversal were
expected and in-line with results generated from previous homework assignments. The
-O1 optimization flag produces an optimized image in a short amount of time, -O2
optimization performs all supported optimizations within the given architecture that does
not involve space-speed tradeoff, and -O3 performs even more optimizations prioritizing
speed over size. From -O1 to -O2 we observe an 18% speed increase and from -O2 to
-O3 we observe a 4% speed increase. As optimization level increases so does the
speed of BFS which is the expected result.
2. OpenMP BFS
In our OpenMP BFS results we observe a strong relationship between increasing the
number of threads and the time taken for execution; as the number of threads increases
the time that it took to traverse the graph using BFS decreased. This trend is present for
unscheduled, as well as static and dynamic scheduling.
When the scheduling policy within the BFS code was varied, dynamic scheduling tended
to perform faster than static scheduling. Because the code is parallelized over vertices
and each vertex has a variable amount of work, the load balancing provided by dynamic
scheduling improves the speed of our implementation even with its increased scheduling
overhead.
The chunk size was defined as (N / t * num_threads), where N is the number of vertices
to be processed and t is a varying constant that controls the chunk size. When varying
the chunk size with a large number of threads under dynamic scheduling, we observed a
strong trend towards larger chunk sizes resulting in faster runtimes. The opposite result
occurred with a small number of threads, as is to be expected: with fewer threads it is
less beneficial to use dynamic load balancing with its higher scheduling overhead.
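For concreteness, the fragment below shows how such a chunk size can be fed into the
schedule clause; note that we read the expression as N divided by (t * num_threads),
which is our interpretation of the formula above, and the loop body is elided.

    #include <omp.h>

    /* Run one level with a chunk size derived from the frontier size N and
     * the tuning constant t (parenthesization is our assumption). */
    void run_level(int N, int t) {
        int chunk = N / (t * omp_get_max_threads());
        if (chunk < 1) chunk = 1;
        #pragma omp parallel for schedule(dynamic, chunk)
        for (int i = 0; i < N; i++) {
            /* expand frontier[i] as in the earlier OpenMP sketch */
        }
    }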
No significant difference in execution time was observed between spread and master for
the proc_bind clause, but in general spread was faster than master and benefited more
from a smaller chunk size.
3. MPI + OpenMP BFS
A large variation in the BFS time is observed for the different types of message passing
methods - Alltoallv(Ialltoallv) vs Isend/Irecv. Time difference between the codes
employing Alltoallv and Ialltoallv is minimal, owing to the fact that there is not much
computational work overlap with the message passing (due to the nature of BFS
algorithm). Surprisingly the non-blocking functions seems to be less efficient than the its
blocking counterpart. Further this result is consistent with both vector and neighbor
splitting. More in-depth understanding of the exact functioning and the various
processes involved in the non-blocking communication is required to explain the
observed trend. At this point, it can be hypothesized that the non-blocking
communication requires more work than the blocking version, and the gain achieved by
the communication-computation overlap is not large enough to offset this.
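The sketch below shows the non-blocking pattern we are describing: with MPI_Ialltoallv
the exchange is started early, but in a level-synchronous BFS little independent work
fits between the start of the exchange and the MPI_Wait, so the potential overlap is
small. The buffer setup is assumed to be done as in the earlier Alltoallv sketch.

    #include <mpi.h>

    /* Non-blocking variant of the per-level frontier exchange. */
    void exchange_frontier_nb(const int *sendbuf, const int *sendcounts,
                              const int *sdispls, int *recvbuf,
                              const int *recvcounts, const int *rdispls,
                              MPI_Comm comm) {
        MPI_Request req;
        MPI_Ialltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                       recvbuf, recvcounts, rdispls, MPI_INT, comm, &req);

        /* only bookkeeping for the finished level can be overlapped here */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* frontier needed before next level */
    }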
On the other hand, the BFS code employing Isend/Irecv took around two orders of
magnitude more time than the Alltoallv-based version. Since each process has to
communicate with all the other processes during the BFS traversal, there is heavy
congestion when a large number of processes is involved. In this scenario it is much
better to use MPI's built-in collective communication functions such as Alltoallv.
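For comparison, a sketch of the point-to-point variant is shown below; it assumes the
per-destination counts have already been exchanged, and it makes the congestion argument
concrete: every rank posts a send to and a receive from every other rank at every level,
so the network carries on the order of P*P messages per level.

    #include <mpi.h>
    #include <stdlib.h>

    /* Point-to-point frontier exchange: one Irecv and one Isend per peer. */
    void exchange_frontier_p2p(int rank, int nprocs,
                               int **sendbufs, const int *sendcounts,
                               int **recvbufs, const int *recvcounts,
                               MPI_Comm comm) {
        MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));
        int nreq = 0;
        for (int r = 0; r < nprocs; r++) {
            if (r == rank) continue;
            MPI_Irecv(recvbufs[r], recvcounts[r], MPI_INT, r, 0, comm, &reqs[nreq++]);
            MPI_Isend(sendbufs[r], sendcounts[r], MPI_INT, r, 0, comm, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }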
Although we observe a large variation between the different message passing functions,
the times taken by the BFS codes employing vertex splitting and neighbor-list splitting
are similar. This is a direct result of the graph properties of the Graph500 benchmark:
the neighbors are not concentrated around a few particular vertices but are spread
evenly over all the vertices, so the work remains almost balanced at every stage even
when plain vertex splitting is employed. This is also evident from the small number of
vertices that are moved between processors during load balancing, and from the fact
that these transfers only occur between adjacent processors.
The local division of the edges is also much slower than the Alltoallv and Ialltoallv
codes (for both vertex and neighbor-list splitting), although it is faster than the
Isend/Irecv code. In addition, this kind of splitting has another serious disadvantage:
since each processor has to maintain the entire vertex list for updating, the method
does not scale in memory as the number of processors increases. This is the reason for
the absence of this method from the weak scaling study.
Based on our results for the MPI and OpenMP BFS implementations, level-by-level BFS
does not scale well with multi-node message passing. We believe this is because the time
spent on the message passing needed to synchronize all the nodes outweighs the actual
computation performed by each node: assigning and reporting the depth of each vertex is
a trivial computation compared to the message passing as the number of vertices and
edges of the graph grows. This line of thought, together with the heavy branching in the
code, is also what led us to exclude GPUs when first investigating the problem.
Conclusion