

MPP-Message Passing Programming


Jaya Vignesh Madana Gopal
May 8, 2017

Contents
1. Introduction
    1.1. Task
2. Program
    2.1. Concept of our program
    2.2. Algorithm of the program
3. Critical ideas implemented in the program
    3.1. Derived Datatype
    3.2. Communication for distributing and gathering arrays read from the image
    3.3. Communication for halo swapping
4. Results and Performance
5. Further enhancements and modifications in the code
6. Conclusion
Source code for the MPI program
Submitting jobs on the ARCHER supercomputer

1. Introduction

1.1. Task
The task is to:

- write a working MPI code for the image processing problem that uses a two-dimensional
  domain decomposition and non-blocking communication for halo swapping
- calculate the average pixel value of the reconstructed image during the iterative loop and
  print it out at appropriate intervals
- use a delta parameter to terminate the calculation when the image is sufficiently
  accurate
- show that the performance of the code improves as the number of processes is increased

2. Program
2.1. Concept of our program

A 2D virtual topology of processes is created in the MPI environment. We use the provided
source file pgmio.c and header file pgmio.h to read the edge image and convert it to an array.
Since it is not straightforward to use MPI_Scatter and MPI_Gather on a 2D virtual topology, we
divide the whole array into blocks and distribute them to the processes with Send and Receive
routines. Because C stores arrays in row-major order, the data belonging to one block is not
contiguous in memory, so we use derived datatypes: a block datatype for distributing and
gathering the arrays, and a column datatype for halo swapping. Each block datatype covers the
entire chunk of data that a process needs as input. The column datatype is used to swap the two
vertical sides of a block, since a column is not contiguous in C's memory layout; the other two
sides of the block, i.e. the rows, are contiguous, so no new datatype is needed for them.
Non-blocking communication is used for halo swapping because each process tries to send and
receive at the same time. The run time of the entire program and the run time of the parallel
section are measured to assess performance. At regular intervals the average pixel value is
calculated and printed to show the improvement in picture quality, and the iterations are
terminated when sufficient accuracy is reached. The algorithm of the entire program, including
the image reconstruction, is given below.
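
The topology setup itself is compact; the following lines, condensed from the full listing at the
end of this report, show how the 2D Cartesian communicator and the four neighbour ranks are
obtained (P and rank are declared as in the listing):

    int dims[2] = {0, 0};      /* let MPI_Dims_create choose the grid, e.g. 2 x 2 for P = 4 */
    int period[2] = {0, 0};    /* non-cyclic in both directions */
    int up, down, left, right;
    MPI_Comm cart;

    MPI_Dims_create(P, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, period, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_shift(cart, 0, 1, &up, &down);      /* neighbours in the row direction */
    MPI_Cart_shift(cart, 1, 1, &left, &right);   /* neighbours in the column direction */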

2.2. Algorithm of the program


The image can be reconstructed by repeated iterations of the form
    new(i,j) = (1/4) * ( old(i-1,j) + old(i+1,j) + old(i,j-1) + old(i,j+1) - edge(i,j) )

where old and new are the image values at the beginning and end of each iteration. The image
array is initialised to pure white by setting all of its values to 255. All the arrays used in
the computation are declared with extra rows and columns for the halos.

1. Create a 2D topology of processors in MPI based on the number of processors given as input.
2. Read the edges data file into the masterbuf array using pgmread.
3. Find the size of the chunks to be sent, MP x NP, where MP = M/PX and NP = N/PY and PX and PY
   are the numbers of processors in the x and y directions of the virtual topology (a worked
   example follows the algorithm).
4. Divide and distribute the masterbuf array as a buf array for each processor using
   Send-Receive routines.
5. Loop over i = 1, MP; j = 1, NP:
       edge(i,j) = buf(i-1,j-1)
6. End loop.
7. Set the entire old array to 255, including the halos.
8. Begin loop over iterations:
   - Find the average pixel value of the picture.
   - Swap the halos with the neighbouring processors using non-blocking communication routines.
   - Loop over i = 1, MP; j = 1, NP:
         new(i,j) = (1/4) * ( old(i-1,j) + old(i+1,j) + old(i,j-1) + old(i,j+1) - edge(i,j) )
   - End loop.
   - Set the old array equal to new, without copying the halos.
9. End loop over iterations.
10. Copy the old array back into the buf array, excluding the halos.
11. Send the individual buf arrays of each processor to the rank 0 processor, which reconstructs
    the masterbuf array using Send-Receive routines.
12. Write out the reconstructed image by passing masterbuf to pgmwrite.
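
As a concrete example of step 3: for the 192 x 128 image run on P = 4 processes, MPI_Dims_create
returns a 2 x 2 grid, so MP = 192/2 = 96 and NP = 128/2 = 64, and each process works on a
(MP+2) x (NP+2) = 98 x 66 array once the halo row and column on each side are added.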

3. Critical ideas implemented in the program


3.1. Derived Datatype
In our program we use one of the MPI derived datatypes, the vector datatype, which is specified as

    int MPI_Type_vector(int count, int blocklength, int stride,
                        MPI_Datatype oldtype, MPI_Datatype *newtype)

where count is the number of blocks, blocklength is the number of elements in each block, stride
is the number of elements between the starts of consecutive blocks, oldtype is the datatype of
the input elements, and newtype is the name given to the newly created datatype.

For the block datatype sent to the individual processors, count is MP, blocklength is NP, and
stride is N (the row length of the masterbuf array). This gives a single unit consisting of
MP x NP elements.

For the column datatype used for halo swapping, count is MP, blocklength is 1, and stride is
NP+2, which accounts for the halo columns in the buf array. This gives a single unit consisting
of MP elements.
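
In the listing these two datatypes are created and committed as follows (block and col are the
names used in the code):

    MPI_Datatype block, col;

    /* block of MP x NP floats taken out of the M x N masterbuf array */
    MPI_Type_vector(MP, NP, N, MPI_FLOAT, &block);
    MPI_Type_commit(&block);

    /* one column of MP floats inside the (MP+2) x (NP+2) halo-padded arrays */
    MPI_Type_vector(MP, 1, NP+2, MPI_FLOAT, &col);
    MPI_Type_commit(&col);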

3.2. Communication for distributing and gathering arrays read from the image
Since the array is read by the rank 0 processor and then sent to the other processors, which
simply receive it, a synchronous send can be used. The first block, which rank 0 processes
itself, is copied directly into the local buf array of rank 0, and the remaining blocks are sent
to the corresponding processors. Similarly, when the processed arrays are gathered, a synchronous
send can again be used: every processor other than rank 0 sends its processed array to rank 0,
which receives it. As in the distribution phase, rank 0's own block of the local buf array is
copied straight into the reconstructed masterbuf array.
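
A condensed view of the distribution phase from the listing; rank 0 sends one block datatype per
destination, and each receiving process stores the chunk as MP*NP contiguous floats in its local
buf array:

    if (rank == 0)
    {
        rankdes = 1;
        for (i = 0; i < dims[0]; i++)
        {
            for (j = 0; j < dims[1]; j++)
            {
                if (i != 0 || j != 0)   /* rank 0 keeps its own block */
                {
                    MPI_Ssend(&masterbuf[i*MP][j*NP], 1, block, rankdes, tag, cart);
                    rankdes++;
                }
            }
        }
    }
    else
    {
        MPI_Recv(&buf[0][0], MP*NP, MPI_FLOAT, 0, tag, cart, &status0);
    }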

3.3. Communication for halo swapping


After every iteration the borders of the block of data held by each processor need to be
updated, so halo swapping is used to let each processor exchange border information with its
neighbours. In our 2D virtual topology each processor may need to communicate with up to four
neighbours: up, down, left and right. Because each of these processors tries to send and receive
data at the same time, non-blocking communication is used to avoid deadlock.
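
The halo exchange in the listing posts all eight non-blocking operations and then waits on them;
the left and right sides use the col datatype, while the top and bottom rows are contiguous and
are sent as NP plain floats:

    /* left/right sides: strided columns, described by the 'col' datatype */
    MPI_Issend(&old[1][NP], 1, col, right, tag, cart, &req[0]);
    MPI_Irecv(&old[1][0], 1, col, left, tag, cart, &req[4]);
    MPI_Issend(&old[1][1], 1, col, left, tag, cart, &req[1]);
    MPI_Irecv(&old[1][NP+1], 1, col, right, tag, cart, &req[5]);

    /* top/bottom sides: contiguous rows of NP floats */
    MPI_Issend(&old[MP][1], NP, MPI_FLOAT, down, tag, cart, &req[2]);
    MPI_Irecv(&old[0][1], NP, MPI_FLOAT, up, tag, cart, &req[6]);
    MPI_Issend(&old[1][1], NP, MPI_FLOAT, up, tag, cart, &req[3]);
    MPI_Irecv(&old[MP+1][1], NP, MPI_FLOAT, down, tag, cart, &req[7]);

    /* all halos must be in place before the update loop runs */
    MPI_Waitall(8, req, status);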

4. Results and Performance


Using our program we could reconstruct the grayscale picture from its edge image. The results of
the reconstruction are shown below.

[Figure: 192 x 128 image file before and after reconstruction using MPI]

[Figure: 768 x 768 image file before and after reconstruction using MPI]

To test the performance of the code on different numbers of processors, we increase the number
of iterations to 1000k for the 192 x 128 image and 100k for the 768 x 768 image, and measure the
execution time of the serial and parallel sections for different processor counts. Since run
times below one second are too short to analyse meaningfully, the iteration counts were tuned so
that the execution times fall in the single-digit-seconds range.
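
Speedup is measured relative to the 4-process run, the smallest configuration tested, so the
4-process row in each table has a speedup of 1. For example, the total speedup of the 10-process
run on the 768 x 768 image is 16.4807 s / 7.651293 s ≈ 2.154, as reported in the second table.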

No of        No of        Total run    Parallel run   Total          Parallel
processors   iterations   time (s)     time (s)       speedup        speedup
 4           1000000      8.0549       8.0075         1              1
 6           1000000      6.8810       6.8692         1.170600203    1.165710709
10           1000000      5.2876       5.0365         1.523356532    1.589893775
16           1000000      5.6319       5.6203         1.430227809    1.42474601
24           1000000      5.4876       5.4746         1.467836577    1.462663939

Table for 1000k iterations of processing the 192 x 128 image

No of        No of        Total run    Parallel run   Total          Parallel
processors   iterations   time (s)     time (s)       speedup        speedup
 4           100000       16.4807      16.322236      1              1
 6           100000       11.894573    11.724157      1.38556466     1.392188453
10           100000       7.651293     7.189195       2.153975805    2.270384375
16           100000       5.1311004    4.960184       3.211923119    3.290651315
24           100000       3.424896     3.140217       4.812029329    5.19780512

Table for 100k iterations of processing the 768 x 768 image

[Figure: Total and parallel run time (s) vs. number of processors for reconstructing the 192 x 128 .pgm image at 1000k iterations]

[Figure: Total and parallel speedup vs. number of processors for reconstructing the 192 x 128 .pgm image at 1000k iterations]
[Figure: Total and parallel run time (s) vs. number of processors for reconstructing the 768 x 768 .pgm image at 100k iterations]

[Figure: Total and parallel speedup vs. number of processors for reconstructing the 768 x 768 .pgm image at 100k iterations]

The graphs show that for the smaller image there is no significant further speedup beyond about
10 processors, whereas the speedup for the larger image keeps increasing roughly linearly. This
suggests that MPI on supercomputers with very large processor counts is most useful when
processing large amounts of data.

5. Further enhancements and modifications in the code


Instead of writing extensive separate Send and Receive routines, MPI_Sendrecv could be used to
make the program simpler and more elegant. Due to lack of time and a series of errors that were
difficult to debug, I reverted to the more familiar routines, which worked. Using MPI_Sendrecv
would also reduce the amount of code and might improve performance, which could be analysed
further. It would also be worth examining the impact of declaring only the column datatype,
sending each block as a specific number of columns, and reusing the same datatype for the halo
swaps.
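
As an illustration of the suggested change (a sketch only, not part of the submitted program),
one of the left-right exchanges written with MPI_Sendrecv combines the matching send and receive
in a single call and leaves deadlock avoidance to MPI:

    /* send my rightmost real column to the right neighbour while receiving
       the left halo from the left neighbour, in one call */
    MPI_Sendrecv(&old[1][NP], 1, col, right, tag,
                 &old[1][0],  1, col, left,  tag,
                 cart, MPI_STATUS_IGNORE);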

6. Conclusion
We were able to write a working MPI program that runs parallel processes by creating a virtual
2D topology of processors, reconstructs a grayscale image from its edges, and exits when
sufficient accuracy is reached. We then analysed the speedup and execution time by running the
program on different numbers of processors, and observed that for larger data sets using more
processors in MPI is worthwhile, since the speedup scales more efficiently. The syntax of MPI is
quite different from plain C and produced many errors that were not resolved until the last
minute, which limited the time available for more extensive tests. MPI is difficult until one
gains enough practice to understand the types of errors and their causes; nevertheless, once
proficient, one can put it to good use.

Source code for the MPI program


#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// header file necessary for MPI executions
#include <mpi.h>

// header file for image read and write functions
#include "pgmio.h"

// dimensions of the image
#define M 192
#define N 128

// number of processors
#define P 4

// stopping criterion
#define CRITERION 0.1

#define MAXITER 1500
#define PRINTFREQ 200

int main(int argc, char **argv)
{
  // initialising MPI
  MPI_Init(NULL, NULL);

  // declaring time variables
  double stime, etime;
  double fr_stime, fr_etime;
  double fwr_stime, fwr_etime;
  double p_stime, p_etime;
  stime = MPI_Wtime();

  // declaring initial variables
  float masterbuf[M][N];
  int i, j, iter;
  char *filename;
  int rank, size;

  // number of dimensions
  int ndims = 2;
  // dimensions of the virtual topology
  int dims[2];
  // y direction - number of rows
  dims[0] = 0;
  // x direction - number of columns
  dims[1] = 0;

  int tag = 0;
  int rankdes, ranksrc = 0;

  // non-cyclic: period is zero in both directions
  int period[2];
  period[0] = 0;
  period[1] = 0;

  // keep the original ranks
  int reorder = 0;

  int up, down, left, right;
  int MP, NP;

  // pixel variables
  float pixavg = 0;
  float pixavg_sum = 0;
  float pixsum = 0;

  // stopping criterion variables
  double del = 0;
  float delmax = 0;
  float delmax_tot = 1;
  int stoploop = 0;

  MPI_Status status[8], status0;
  MPI_Request req[8];
  MPI_Datatype block;
  MPI_Datatype col;

  // number of processors (size) and rank of each processor
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // checking that the required number of processors was allocated
  if (size != P)
  {
    if (rank == 0) printf("ERROR: size = %d, P = %d\n", size, P);
    MPI_Finalize();
    exit(-1);
  }

  // 2D topology creation
  MPI_Comm cart;   // new communicator for the Cartesian topology
  MPI_Dims_create(P, ndims, dims);
  MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, period, reorder, &cart);

  // new ranks of the processors
  MPI_Comm_rank(cart, &rank);

  // neighbouring processors for halo swapping
  MPI_Cart_shift(cart, 0, 1, &up, &down);
  MPI_Cart_shift(cart, 1, 1, &left, &right);

  // chunk dimensions
  MP = M / dims[0];
  NP = N / dims[1];

  // arrays for parallel processing
  float buf[MP][NP];
  float old[MP+2][NP+2];
  float new[MP+2][NP+2];
  float edge[MP+2][NP+2];

  // derived datatypes
  MPI_Type_vector(MP, NP, N, MPI_FLOAT, &block);    // block of size MP x NP
  MPI_Type_commit(&block);
  MPI_Type_vector(MP, 1, NP+2, MPI_FLOAT, &col);    // column of size MP
  MPI_Type_commit(&col);

  // reading the image file
  if (rank == 0)
  {
    printf("Processing %d x %d image on %d processes\n", M, N, P);
    printf("Number of iterations = %d\n", MAXITER);
    filename = "edge192x128.pgm";
    printf("<%s>\n", filename);
    fr_stime = MPI_Wtime();
    pgmread(filename, &masterbuf[0][0], M, N);
    fr_etime = MPI_Wtime();
    printf("\n");
  }

  // parallel start time
  MPI_Barrier(cart);
  p_stime = MPI_Wtime();

  // distributing the masterbuf array to all processors as smaller buf arrays
  if (rank == 0)
  {
    rankdes = 1;
    // copying the buf array for the root processor from the masterbuf array
    for (i = 0; i < MP; i++)
    {
      for (j = 0; j < NP; j++)
      {
        buf[i][j] = masterbuf[i][j];
      }
    }

    // distributing the remaining parts of the masterbuf array as buf arrays
    // to the other processors
    for (i = 0; i < dims[0]; i++)
    {
      for (j = 0; j < dims[1]; j++)
      {
        if (i != 0 || j != 0)
        {
          MPI_Ssend(&masterbuf[i*MP][j*NP], 1, block, rankdes, tag, cart);
          rankdes++;
        }
      }
    }
  }
  // other processors receiving their buf array
  else
  {
    MPI_Recv(&buf[0][0], MP*NP, MPI_FLOAT, 0, tag, cart, &status0);
  }

  // copying the buf array into the edge array to perform the operations
  for (i = 1; i < MP+1; i++)
  {
    for (j = 1; j < NP+1; j++)
    {
      edge[i][j] = buf[i-1][j-1];
    }
  }

  // initialising the old array to white (255), including the halos
  for (i = 0; i < MP+2; i++)
  {
    for (j = 0; j < NP+2; j++)
    {
      old[i][j] = 255.0;
    }
  }

  // main iteration loop
  for (iter = 1; iter <= MAXITER; iter++)
  {
    // printing the iteration number and average pixel value
    if (iter % PRINTFREQ == 0)
    {
      pixsum = 0;
      for (i = 1; i < MP+1; i++)
      {
        for (j = 1; j < NP+1; j++)
        {
          pixsum += old[i][j];
        }
      }
      // local average over this processor's MP x NP interior points
      pixavg = pixsum / (MP*NP);
      MPI_Reduce(&pixavg, &pixavg_sum, 1, MPI_FLOAT, MPI_SUM, 0, cart);
      if (rank == 0)
      {
        printf("Iteration %d\n", iter);
        printf("Avg pixel value = %f\n", pixavg_sum/size);
      }
    }

    // non-blocking communication between the processors for halo swapping
    MPI_Issend(&old[1][NP], 1, col, right, tag, cart, &req[0]);
    MPI_Irecv(&old[1][0], 1, col, left, tag, cart, &req[4]);

    MPI_Issend(&old[1][1], 1, col, left, tag, cart, &req[1]);
    MPI_Irecv(&old[1][NP+1], 1, col, right, tag, cart, &req[5]);

    MPI_Issend(&old[MP][1], NP, MPI_FLOAT, down, tag, cart, &req[2]);
    MPI_Irecv(&old[0][1], NP, MPI_FLOAT, up, tag, cart, &req[6]);

    MPI_Issend(&old[1][1], NP, MPI_FLOAT, up, tag, cart, &req[3]);
    MPI_Irecv(&old[MP+1][1], NP, MPI_FLOAT, down, tag, cart, &req[7]);

    // waiting for the send and receive requests to complete
    MPI_Waitall(8, req, status);

    // the main update which reconstructs the image
    for (i = 1; i < MP+1; i++)
    {
      for (j = 1; j < NP+1; j++)
      {
        new[i][j] = 0.25*(old[i-1][j]+old[i+1][j]+old[i][j-1]+old[i][j+1]-edge[i][j]);

        del = fabs(new[i][j]-old[i][j]);
        if (del > delmax)
        {
          delmax = del;
        }
      }
    }

    // copying the reconstructed values into the old array
    for (i = 1; i < MP+1; i++)
    {
      for (j = 1; j < NP+1; j++)
      {
        old[i][j] = new[i][j];
      }
    }

    // stopping criterion, checked every MAXITER/5 iterations
    if (iter != 0 && iter % (MAXITER/5) == 0)
    {
      MPI_Allreduce(&delmax, &delmax_tot, 1, MPI_FLOAT, MPI_MAX, cart);
      if (delmax_tot <= CRITERION)
      {
        if (rank == 0)
        {
          printf("iter=%d, delmax=%f is lower than criterion=%f.\n",
                 iter, delmax_tot, CRITERION);
        }
        stoploop = 1;
      }
      delmax_tot = 0;
      delmax = 0;
    }

    // leave the iteration loop once the image is sufficiently accurate
    if (stoploop == 1)
    {
      break;
    }
  }

  if (rank == 0)
  {
    printf("%d iterations\n", iter-1);
  }

  // copying the reconstructed values back into the buf array
  for (i = 1; i < MP+1; i++)
  {
    for (j = 1; j < NP+1; j++)
    {
      buf[i-1][j-1] = old[i][j];
    }
  }

  // all processors except the root processor send their arrays back to the root
  if (rank != 0)
  {
    MPI_Ssend(&buf[0][0], MP*NP, MPI_FLOAT, 0, tag, cart);
  }

  // root processor gathering the buf arrays and reconstructing the masterbuf array
  if (rank == 0)
  {
    ranksrc = 1;
    // copying the root processor's buf array into the masterbuf array
    for (i = 0; i < MP; i++)
    {
      for (j = 0; j < NP; j++)
      {
        masterbuf[i][j] = buf[i][j];
      }
    }

    // receiving the buf arrays from the other processors into the masterbuf array
    for (i = 0; i < dims[0]; i++)
    {
      for (j = 0; j < dims[1]; j++)
      {
        if (i != 0 || j != 0)
        {
          MPI_Recv(&masterbuf[i*MP][j*NP], 1, block, ranksrc, tag, cart, &status0);
          ranksrc++;
        }
      }
    }
  }

  // parallel end time: the barrier waits for all processors to reach this point
  MPI_Barrier(cart);
  p_etime = MPI_Wtime();

  // writing the reconstructed image
  if (rank == 0)
  {
    filename = "image192x128.pgm";
    printf("<%s>\n", filename);
    fwr_stime = MPI_Wtime();
    pgmwrite(filename, &masterbuf[0][0], M, N);
    fwr_etime = MPI_Wtime();
    etime = MPI_Wtime();   // end of program time
    printf("Total run time = %f  Parallel run time = %f\n",
           etime-stime, p_etime-p_stime);
    printf("File read time = %f  File write time = %f\n",
           fr_etime-fr_stime, fwr_etime-fwr_stime);
  }

  // exiting the MPI communication environment
  MPI_Finalize();
}

Submitting jobs on the ARCHER supercomputer


The necessary information for submitting a job on ARCHER is provided in the links below.
http://www.archer.ac.uk/training/course-material/2017/04/MPP_Soton/Exercises/ARCHER-MPI-cribsheet.pdf
http://www.archer.ac.uk/documentation/user-guide/batch.php
