Parallel and Distributed Computing Overview

CSE4001 Parallel and Distributed Computing

Digital Assignment-2(ETH)

Name: Nakul Jadeja


Reg no.: 19BCE0660

School of Computer Science and Engineering


Course Code: CSE4001

Slot: E2
Winter Semester 2021-22

Professor: Narayanamoorthi M
Q1) Assume that a program generates large quantities of floating-
point data that is stored in an array. To determine the distribution
of the data, we can make a histogram: divide the range of the data
into equal-sized subintervals, or bins, determine the number of
measurements in each bin, and plot a bar graph showing the relative
sizes of the bins. Use MPI or OpenMP to implement the above and find
the max, min, and mean of the data.

CODE: -
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main()
{
    int arr[10] = {1, 6, 4, 2, 3, 7, 9, 4, 2, 10};
    int r0 = 0, r1 = 0;                 /* bin counts for [0-5] and [6-10] */
    int maxv = arr[0], minv = arr[0], sum = 0;
    printf("19BCE0660 NAKUL JADEJA\n");
    omp_set_num_threads(4);             /* use a few threads */
    /* Each thread counts into private copies; OpenMP combines them */
    #pragma omp parallel for reduction(+:r0,r1,sum) \
                             reduction(max:maxv) reduction(min:minv)
    for (int i = 0; i < 10; i++) {
        if (arr[i] >= 0 && arr[i] <= 5)
            r0++;
        else if (arr[i] >= 6 && arr[i] <= 10)
            r1++;
        sum += arr[i];
        if (arr[i] > maxv) maxv = arr[i];
        if (arr[i] < minv) minv = arr[i];
    }
    printf("In range [0-5]: %d\n", r0);
    printf("In range [6-10]: %d\n", r1);
    printf("Max: %d  Min: %d  Mean: %.2f\n", maxv, minv, sum / 10.0);
    return 0;
}
SCREENSHOT OF THE OUTPUT: -

Q2) Discuss the following, with example illustrations:

High performance computing
Answer: -
High performance computing (HPC) is the ability to process data and
perform complex calculations at high speeds. To put it into
perspective, a laptop or desktop with a 3 GHz processor can perform
around 3 billion calculations per second.
Why is HPC important?
It is through data that ground-breaking scientific discoveries are
made, game-changing innovations are fuelled, and quality of life is
improved for billions of people around the globe. HPC is the
foundation for scientific, industrial, and societal advancements. As
technologies like the Internet of Things (IoT), artificial intelligence
(AI), and 3-D imaging evolve, the size and amount of data that
organizations have to work with is growing exponentially. For many
purposes, such as streaming a live sporting event, tracking a
developing storm, testing new products, or analysing stock trends, the
ability to process data in real time is crucial. To keep a step ahead of
the competition, organizations need lightning-fast, highly reliable IT
infrastructure to process, store, and analyze massive amounts of data.
How does HPC work?
HPC solutions have three main components:
1. Compute
2. Network
3. Storage
To build a high-performance computing architecture, compute servers
are networked together into a cluster. Software programs and
algorithms are run simultaneously on the servers in the cluster. The
cluster is networked to the data storage to capture the output.
Together, these components operate seamlessly to complete a diverse
set of tasks. To operate at maximum performance, each component
must keep pace with the others. For example, the storage component
must be able to feed and ingest data to and from the compute servers
as quickly as it is processed. Likewise, the networking components
must be able to support the high-speed transportation of data between
compute servers and the data storage. If one component cannot keep
up with the rest, the performance of the entire HPC infrastructure
suffers.
HPC use cases
Deployed on premises, at the edge, or in the cloud, HPC solutions are
used for a variety of purposes across multiple industries. Examples
include:
• Research labs. HPC is used to help scientists find sources of
renewable energy, understand the evolution of our universe,
predict and track storms, and create new materials.
• Media and entertainment. HPC is used to edit feature films,
render mind-blowing special effects, and stream live events
around the world.
• Oil and gas. HPC is used to more accurately identify where to
drill for new wells and to help boost production from existing
wells.
• Artificial intelligence and machine learning. HPC is used to
detect credit card fraud, provide self-guided technical support,
teach self-driving vehicles, and improve cancer screening
techniques.
• Financial services. HPC is used to track real-time stock trends
and automate trading.
• Manufacturing. HPC is used to design new products, simulate test
scenarios, and make sure that parts are kept in stock so that
production lines aren’t held up.
• Healthcare. HPC is used to help develop cures for diseases like
diabetes and cancer and to enable faster, more accurate patient
diagnosis.
Cluster computing: -
Cluster computing means that many computers connected over a network
perform like a single entity. Each computer connected to the network
is called a node. Cluster computing solves complicated problems by
providing faster computational speed and enhanced data integrity. The
connected computers execute operations together, creating the
impression of a single system (a virtual machine); this is termed the
transparency of the system. The technology operates on the principles
of distributed systems, with a LAN as the connection unit. Cluster
computing has the following features:
• All the connected computers are the same kind of machines
• They are tightly connected through dedicated network
connections
• All the computers share a common home directory.
A cluster’s hardware configuration differs based on the selected
networking technologies. Clusters are categorized as open and closed.
In an open cluster, all the nodes need IPs and are accessed through
the internet or web, which raises security concerns. In a closed
cluster, the nodes are concealed behind a gateway node, which offers
increased protection.

A) Cluster Load Balancing

Load-balancing clusters are employed in situations of heavy network
and internet utilization, where they act as the fundamental factor.
This clustering technique offers increased network capacity and
enhanced performance. The nodes stay cohesive across all instances,
and every node is fully aware of the requests present in the network,
so incoming work can be distributed evenly among them.
B) High Availability Clusters
These are also termed failover clusters. Computers often face failure
issues, and high availability addresses the growing dependency on
computers, which hold crucial responsibilities in many organizations
and applications. In this approach, redundant computer systems take
over whenever a component malfunctions.
C) High-Performance Clusters
This networking approach uses supercomputer-class hardware to resolve
complex computational problems. Along with managing I/O-intensive
applications like web services, high-performance clusters are
employed in computational models of climate and in vehicle-crash
simulations.
Cluster Computing Architecture
• A cluster is a kind of parallel/distributed processing network
designed as an array of interconnected individual computers, with
the computer systems operating collectively as a single standalone
system.
• A node is a single-processor or multiprocessor system with memory,
input and output functions, and an operating system.
• In general, two or more nodes are connected on a single line, or
every node might be connected individually through a LAN connection.
Advantages of Cluster Computing
There are numerous advantages to implementing cluster computing.
A few of them are as follows:

• Cost efficacy – Even though mainframe computers are extremely
stable, cluster computing is used more in practice because it is
cost-effective and economical. These systems also provide better
performance than mainframe computer networks.

• Processing speed – Cluster computing systems offer the same
processing speed as mainframe computers, and their speed can rival
that of supercomputers.

Grid computing: -
Grid computing is the practice of leveraging multiple computers,
often geographically distributed but connected by networks, to work
together to accomplish joint tasks. It is typically run on a “data grid,”
a set of computers that directly interact with each other to coordinate
jobs.
How Does Grid Computing Work?
Grid computing works by running specialized software on every
computer that participates in the data grid. The software acts as the
manager of the entire system and coordinates various tasks across the
grid. Specifically, the software assigns subtasks to each computer so
they can work simultaneously on their respective subtasks. After the
completion of subtasks, the outputs are gathered and aggregated to
complete a larger-scale task. The software lets each computer
communicate over the network with the other computers so they can
share information on what portion of the subtasks each computer is
running, and how to consolidate and deliver outputs.

How is Grid Computing Used?


Grid computing is especially useful when different subject matter
experts need to collaborate on a project but do not necessarily have
the means to immediately share data and computing resources in a
single site. By joining forces despite the geographical distance, the
distributed teams are able to leverage their own resources that
contribute to a bigger effort. This means that all computing resources
do not have to work on the same specific task, but can work on
subtasks that collectively make up the end goal. For example, one
research team might analyze weather patterns in the North Atlantic
region while another analyzes the South Atlantic region, and the
results can be combined to deliver a complete picture of Atlantic
weather patterns.

Q3) Write a C program using MPI to simulate the scatter and gather
operations.
An introduction to MPI_Scatter
MPI_Scatter is a collective routine that is very similar to
MPI_Bcast. MPI_Scatter involves a designated root process sending
data to all processes in a communicator. The primary difference
between MPI_Bcast and MPI_Scatter is small but important: MPI_Bcast
sends the same piece of data to all processes, while MPI_Scatter
sends chunks of an array to different processes. Check out the
illustration below for further clarification.
In the illustration, MPI_Bcast takes a single data element at the root
process (the red box) and copies it to all other processes. MPI_Scatter
takes an array of elements and distributes the elements in the order of
process rank. The first element (in red) goes to process zero, the
second element (in green) goes to process one, and so on. Although
the root process (process zero) contains the entire array of data,
MPI_Scatter will copy the appropriate element into the receiving
buffer of the process. Here is what the function prototype of
MPI_Scatter looks like.
MPI_Scatter(
void* send_data,
int send_count,
MPI_Datatype send_datatype,
void* recv_data,
int recv_count,
MPI_Datatype recv_datatype,
int root,
MPI_Comm communicator)
Yes, the function looks big and scary, but let’s examine it in more
detail. The first parameter, send_data, is an array of data that resides
on the root process. The second and third parameters, send_count and
send_datatype, dictate how many elements of a specific MPI Datatype
will be sent to each process. If send_count is one and send_datatype is
MPI_INT, then process zero gets the first integer of the array, process
one gets the second integer, and so on. If send_count is two, then
process zero gets the first and second integers, process one gets the
third and fourth, and so on. In practice, send_count is often equal to
the number of elements in the array divided by the number of
processes. The receiving parameters of the function prototype are
nearly identical in respect to the sending parameters. The recv_data
parameter is a buffer of data that can hold recv_count elements that
have a datatype of recv_datatype. The last parameters, root and
communicator, indicate the root process that is scattering the array of
data and the communicator in which the processes reside.
An introduction to MPI_Gather
MPI_Gather is the inverse of MPI_Scatter. Instead of spreading
elements from one process to many processes, MPI_Gather takes
elements from many processes and gathers them to one single
process. This routine is highly useful to many parallel algorithms,
such as parallel sorting and searching. Below is a simple illustration
of this algorithm.

Similar to MPI_Scatter, MPI_Gather takes elements from each process
and gathers them to the root process. The elements are ordered by the
rank of the process from which they were received. The function
prototype for MPI_Gather is identical to that of MPI_Scatter.
MPI_Gather(
void* send_data,
int send_count,
MPI_Datatype send_datatype,
void* recv_data,
int recv_count,
MPI_Datatype recv_datatype,
int root,
MPI_Comm communicator)
In MPI_Gather, only the root process needs to have a valid receive
buffer. All other calling processes can pass NULL for recv_data.
Also, don’t forget that the recv_count parameter is the count of
elements received per process, not the total summation of counts from
all processes. This can often confuse beginning MPI programmers.
Computing average of numbers with MPI_Scatter and MPI_Gather
The program takes the following steps:

• Generate a random array of numbers on the root process (process 0).
• Scatter the numbers to all processes, giving each process an equal
amount of numbers.
• Each process computes the average of their subset of the numbers.
• Gather all averages to the root process. The root process then
computes the average of these numbers to get the final average.
CODE: -
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#include <assert.h>

// Creates an array of random numbers. Each number has a value from 0-1
float *create_rand_nums(int num_elements) {
  float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
  assert(rand_nums != NULL);
  int i;
  for (i = 0; i < num_elements; i++) {
    rand_nums[i] = (rand() / (float)RAND_MAX);
  }
  return rand_nums;
}

// Computes the average of an array of numbers
float compute_avg(float *array, int num_elements) {
  float sum = 0.f;
  int i;
  for (i = 0; i < num_elements; i++) {
    sum += array[i];
  }
  return sum / num_elements;
}

int main(int argc, char** argv) {
  if (argc != 2) {
    fprintf(stderr, "Usage: avg num_elements_per_proc\n");
    exit(1);
  }

  int num_elements_per_proc = atoi(argv[1]);
  // Seed the random number generator to get different results each time
  srand(time(NULL));

  MPI_Init(NULL, NULL);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Create a random array of elements on the root process. Its total
  // size will be the number of elements per process times the number
  // of processes
  float *rand_nums = NULL;
  if (world_rank == 0) {
    rand_nums = create_rand_nums(num_elements_per_proc * world_size);
  }

  // For each process, create a buffer that will hold a subset of the
  // entire array
  float *sub_rand_nums =
      (float *)malloc(sizeof(float) * num_elements_per_proc);
  assert(sub_rand_nums != NULL);

  // Scatter the random numbers from the root process to all processes
  // in the MPI world
  MPI_Scatter(rand_nums, num_elements_per_proc, MPI_FLOAT, sub_rand_nums,
              num_elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

  // Compute the average of your subset
  float sub_avg = compute_avg(sub_rand_nums, num_elements_per_proc);

  // Gather all partial averages down to the root process
  float *sub_avgs = NULL;
  if (world_rank == 0) {
    sub_avgs = (float *)malloc(sizeof(float) * world_size);
    assert(sub_avgs != NULL);
  }
  MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0,
             MPI_COMM_WORLD);

  // Now that we have all of the partial averages on the root, compute
  // the total average of all numbers. Since we are assuming each process
  // computed an average across an equal amount of elements, this
  // computation will produce the correct answer.
  if (world_rank == 0) {
    float avg = compute_avg(sub_avgs, world_size);
    printf("Avg of all elements is %f\n", avg);
    // Compute the average across the original data for comparison
    float original_data_avg =
        compute_avg(rand_nums, num_elements_per_proc * world_size);
    printf("Avg computed across original data is %f\n", original_data_avg);
  }

  // Clean up
  if (world_rank == 0) {
    free(rand_nums);
    free(sub_avgs);
  }
  free(sub_rand_nums);

  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
}
SCREENSHOT OF THE OUTPUT: -
Q4) Write a C program using OpenMP or MPI to perform radix sort.
Answer)
Among the many problems in computer science, sorting is probably the
most common. Due to its applicability across domains, sorting
algorithms are used in the majority of business and scientific
applications. Many serial sorting algorithms have been developed over
the years, and many have become standards, fine-tuned for specific
scenarios. Yet, because of this wide range of applications, the need
for optimization and speed has always been present. With the
emergence of multi-core processors, many of the serial algorithms
have been successfully parallelized and some new algorithms were
developed. As a result, processing time has been reduced with varying
success. The reality is that some algorithms are simply easier to
parallelize than others.
The goal here is to look at the details of implementing a parallel
sorting algorithm and to analyze where the efficiency and bottlenecks
come from, as they relate to the overall performance common to most
parallel algorithms.
The two algorithm implementations selected for this purpose:
1. Serial Radix Sort
2. Parallel Radix Sort
Implementation
Both algorithms are implemented in the C programming language. The
parallel version can use the Message Passing Interface (MPI) for
inter-process communication; the implementation below uses OpenMP
threads on shared memory instead.
Serial radix sort was implemented as follows:
• For each pass scan g consecutive bits from LSD
• Store keys in 2^g buckets according to g bits
• Count how many keys each bucket has
• Compute exclusive prefix sum for each bucket
• Assign starting address according to prefix sums
• Examine g bits to determine bucket and move key to that bucket
As a rule, any parallel sort implementation involves running multiple
sections of a regular sort algorithm on multiple processors in
parallel. When each processor is done, the results are communicated
between the processors, and this pattern may be repeated multiple
times until the whole sequence is completely sorted. Parallel radix
sort is similar to the above with a few exceptions. First, keys must
be stored and moved across different processors; MPI can be used for
this inter-process communication (the OpenMP version below moves keys
between per-thread buckets in shared memory instead). As a result,
each processor can end up with a varying number of keys to manage
after each pass, which requires additional bookkeeping to keep track
of these moves. The steps of the parallel implementation are:
• Split the initial problem set into multiple subsets and assign them
to different processors
• Count the number of keys per bucket by scanning g bits every pass
(local operation)
• Move keys within each processor to the appropriate buckets (local
operation)
• Exchange bucket counts between processors and move keys to their
destination processors (communication step)
CODE: -
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

#define s 4   /* number of buckets (one OpenMP thread per bucket) */

int N;
int limit, c = 0;
int input[100000], pInput[100000];
int pivotpoints[s + 1];   /* +1 slot so refining() stays in bounds */
int sublists[100000], indices[100000];
int counts[s], countsoffset[s], histbin[100000] = {0};

/* Assign each key to a bucket via binary search over the pivot
 * points, and record its position within that bucket. */
void bucketcount() {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++) {
        int index = s / 2 - 1, jump = s / 4;
        int element = input[i];
        while (jump >= 1) {
            index = (element < pivotpoints[index]) ? (index - jump)
                                                   : (index + jump);
            jump /= 2;
        }
        index = (element < pivotpoints[index]) ? index : (index + 1);
        sublists[i] = index;
        #pragma omp critical
        {
            /* Both statements must sit inside the critical section,
             * otherwise two threads can claim the same slot. */
            indices[i] = counts[index];
            counts[index]++;
        }
    }
}

/* (Unused in this run) Refine pivot points toward equal-sized buckets. */
void refining() {
    int elemsneeded = N / s, range;
    for (int i = 0; i < s; i++) {
        range = N / s * (i + 1);
        while (counts[i] >= elemsneeded) {
            pivotpoints[i + 1] += (elemsneeded / counts[i]) * range;
            elemsneeded = N / s;
            counts[i] -= elemsneeded;
        }
        elemsneeded -= counts[i];
        pivotpoints[i + 1] += range;
    }
}

/* Move every key to its bucket in pInput using the precomputed
 * bucket offsets and within-bucket positions. */
void bucketsort() {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++) {
        int newpos = countsoffset[sublists[i]] + indices[i];
        pInput[newpos] = input[i];
    }
}

/* Count key occurrences, one thread per bucket. Buckets cover
 * disjoint value ranges, so the histogram updates do not race. */
void histogram_keys() {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < s; i++) {
        for (int j = 0; j < counts[i]; j++)
            histbin[pInput[countsoffset[i] + j]]++;
    }
}

/* After the exclusive prefix sum, histbin[v] is the final rank of
 * the first key with value v; place each key and bump the rank. */
void rank_and_permute() {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < s; i++) {
        for (int j = 0; j < counts[i]; j++) {
            int value = pInput[countsoffset[i] + j];
            int position = histbin[value];
            histbin[value]++;
            input[position] = value;
        }
    }
}

int main() {
    clock_t seconds1, seconds2, t;
    int i, x = 0, temp;
    printf("Enter limit\n");
    scanf("%d", &limit);               /* maximum value of an element */
    printf("Enter the value of N\n");
    scanf("%d", &N);
    for (i = 0; i < N; i++)
        input[i] = rand() % limit;     /* generate the elements */
    seconds1 = clock();
    /* Initial pivot points: split [0, limit) into s equal ranges */
    for (i = 0; i < s; i++) {
        c += limit / s;
        pivotpoints[i] = c;
    }
    omp_set_num_threads(s);            /* one thread per bucket */
    printf("Max number of threads: %i\n", omp_get_max_threads());
    bucketcount();
    for (i = 1; i < s; i++)
        countsoffset[i] = countsoffset[i - 1] + counts[i - 1];
    bucketsort();
    histogram_keys();
    /* Exclusive prefix sum over the histogram */
    i = 0;
    while (i <= limit) {
        temp = histbin[i];
        histbin[i] = x;
        x += temp;
        i++;
    }
    rank_and_permute();
    /* Final (sorted) output */
    for (i = 0; i < N; i++)
        printf("%d ", input[i]);
    seconds2 = clock();
    t = seconds2 - seconds1;
    double time_taken = (double)t / CLOCKS_PER_SEC;
    printf("\ntime taken in seconds: %lf\n", time_taken);
    return 0;
}

PERFORMANCE ANALYSIS
SCREENSHOT OF THE OUTPUT: -
THE END
