HPC Project: MPI
Contents
1. Introduction
    1.1. Task
2. Program
    2.1. Concept of our program
    2.2. Algorithm of the program
3. Critical ideas implemented in the program
    3.1. Derived Datatype
    3.2. Communication for distributing and gathering arrays read from the image
    3.3. Communication for halo swapping
4. Results and Performance
5. Further Enhancements and modifications in code
6. Conclusion
Source code for the MPI program
Submitting jobs on the ARCHER supercomputer
1. Introduction
1.1. Task
The task is to write:
- a working MPI code for the image-processing problem that uses a two-dimensional domain decomposition and non-blocking communication for halo swapping;
- code to calculate the average pixel value of the reconstructed image during the iterative loop and print it out at appropriate intervals;
- a delta parameter to terminate the calculation when the image is sufficiently accurate;
- and to show that the performance of the code improves as we increase the number of processes.
2. Program
2.1. Concept of our program
A 2D virtual topology is created in the MPI environment to distribute the work over the processors. We use the provided source code pgmio.c and header file pgmio.h to read the edge image and convert it into an array. Since it is not easy to use MPI_Scatter and MPI_Gather on a 2D virtual topology, we divide the whole array into blocks and distribute them to the processors with send and receive routines. Because C stores arrays in row-major order, the data belonging to one block is not contiguous in memory, so we use derived datatypes: a block datatype for distributing and gathering the arrays, and a column datatype for halo swapping. Each block datatype describes the entire chunk of data that one processor requires as input, whereas the column datatype is used to swap two sides of a block, since a column is not contiguous in C's memory layout; the other two sides of the block, i.e. the rows, are contiguous in memory, so no new datatype needs to be specified for them. Non-blocking communication is used for halo swapping, as each processor tries to send and receive at the same time. The run time of the entire program and the run time of the parallel section are measured to assess the performance. At suitable intervals the average pixel value is calculated and printed out to show the improvement in the quality of the picture, and the iterations are terminated when sufficient accuracy is obtained. The topology setup is sketched below, and the algorithm of the entire program, including the image reconstruction, is given in the next subsection.
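For reference, a condensed, self-contained sketch of the topology setup follows; the calls mirror those in the source listing at the end of the report, and the image size used here (192 x 128) is only an example value.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int M = 192, N = 128;                /* image size (example values)      */
        int P, rank, up, down, left, right;
        int dims[2]   = {0, 0};              /* let MPI_Dims_create choose a grid */
        int period[2] = {0, 0};              /* non-periodic in both directions   */
        int reorder   = 0;
        MPI_Comm cart;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        /* 2D virtual topology and the rank of this process inside it */
        MPI_Dims_create(P, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, period, reorder, &cart);
        MPI_Comm_rank(cart, &rank);

        /* ranks of the four neighbours; MPI_PROC_NULL at the domain boundary */
        MPI_Cart_shift(cart, 0, 1, &up,   &down);
        MPI_Cart_shift(cart, 1, 1, &left, &right);

        /* local chunk size (assumes M and N divide exactly, as in the report) */
        int MP = M / dims[0];
        int NP = N / dims[1];
        if (rank == 0)
            printf("process grid %d x %d, chunk %d x %d\n", dims[0], dims[1], MP, NP);

        MPI_Finalize();
        return 0;
    }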
2.2. Algorithm of the program
In the algorithm below, old and new are the image values at the beginning and end of each iteration. The image array is initialised to pure white by setting all of its values to 255. All the arrays used in the computation are declared with extra rows and columns for the halos.
3. Find the size of the chunks to be sent, MP x NP, where MP = M/PX and NP = N/PY, and PX and PY are the numbers of processors in the x and y directions of the virtual topology respectively.
4. Divide the masterbuf array and distribute it as a buf array to each processor using send-receive routines.
5. Loop over i = 1, MP; j = 1, NP:
       edge_{i,j} = buf_{i-1,j-1}
6. End loop.
7. Set the entire old array, including the halos, to 255.
8. Begin loop over iterations:
   - Find the average pixel value of the picture.
   - Swap the halos with the neighbouring processors using non-blocking communication routines.
   - Loop over i = 1, MP; j = 1, NP (see the sketch after this list):
         new_{i,j} = (old_{i-1,j} + old_{i+1,j} + old_{i,j-1} + old_{i,j+1} - edge_{i,j}) / 4
   - End loop.
   - Set the old array equal to new, without copying the halos.
9. End loop over iterations.
10. Copy the old array back into the buf array, excluding the halos.
11. Send the individual buf arrays of each processor to the rank 0 processor and reconstruct the masterbuf array using send-receive routines.
12. Write out the reconstructed image by passing masterbuf to pgmwrite.
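As an illustration of the update in step 8, the sweep over the local chunk can be written as the following condensed sketch. The variable names (old, new, edge, MP, NP) follow the appended listing; the halo swap, average-pixel calculation and termination test are omitted here.

    /* One sweep of the step-8 update. Each array carries one halo row/column
       on every side, as in the appended listing: indices 0 and MP+1 (or NP+1)
       are halos, the interior runs from 1 to MP (or NP).                      */
    static void jacobi_sweep(int MP, int NP,
                             float old[MP+2][NP+2],
                             float new[MP+2][NP+2],
                             float edge[MP+2][NP+2])
    {
        /* new = quarter of (sum of the four old neighbours minus the edge value) */
        for (int i = 1; i <= MP; i++)
            for (int j = 1; j <= NP; j++)
                new[i][j] = 0.25f * (old[i-1][j] + old[i+1][j]
                                     + old[i][j-1] + old[i][j+1]
                                     - edge[i][j]);

        /* copy the interior of new back into old, leaving the halos untouched */
        for (int i = 1; i <= MP; i++)
            for (int j = 1; j <= NP; j++)
                old[i][j] = new[i][j];
    }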
3. Critical ideas implemented in the program
3.1. Derived Datatype
The derived datatypes are created with MPI_Type_vector(count, blocklength, stride, oldtype, newtype), where count is the number of blocks, blocklength is the number of elements in each block, stride is the number of elements between the starts of consecutive blocks, oldtype is the datatype of the input elements, and newtype is the name given to the new datatype created.
For our block datatype, which is sent to the individual processors, the count is MP, the blocklength is NP, and the stride is N in the masterbuf array. This gives a single unit consisting of MP x NP elements.
For our column datatype, which is used for halo swapping, the count is MP, the blocklength is 1, and the stride is NP+2 (including the halo spaces) in the buf array. This gives a single unit consisting of MP elements.
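The corresponding calls, condensed from the appended listing into a small helper (MP x NP is the local chunk size and N the full image width), are:

    #include <mpi.h>

    /* Build the two derived datatypes used in the program:
       - block: MP x NP floats taken out of the M x N masterbuf (row stride N)
       - col:   MP floats forming one column of the (MP+2) x (NP+2) buf array
                with halos (stride NP+2 between consecutive column elements)  */
    static void make_types(int MP, int NP, int N,
                           MPI_Datatype *block, MPI_Datatype *col)
    {
        MPI_Type_vector(MP, NP, N, MPI_FLOAT, block);
        MPI_Type_commit(block);

        MPI_Type_vector(MP, 1, NP + 2, MPI_FLOAT, col);
        MPI_Type_commit(col);
    }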
3.2. Communication for distributing and gathering arrays read from the image
Since the array is read by the rank 0 processor and then sent to the other processors, which simply receive it, synchronous sends can be used. We copy the first block, which is processed by the rank 0 processor itself, into rank 0's local buf array, and send the remaining blocks to the corresponding processors. Similarly, when gathering the processed arrays on the rank 0 processor, synchronous sends can be used, since all the processors other than rank 0 send their processed arrays to rank 0 and rank 0 receives them. As with the distribution, rank 0's own block of the local buf array is copied directly into the reconstructed masterbuf array.
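The distribution step therefore reduces to the loop below, condensed from the appended listing into a helper function; the gathering at the end of the program mirrors it with the roles of MPI_Ssend and MPI_Recv reversed.

    #include <mpi.h>

    /* rank 0 keeps its own block of masterbuf and sends every other MP x NP
       block (as one 'block' derived datatype) to the corresponding rank;
       every other rank receives its block as MP*NP contiguous floats.       */
    static void distribute(int rank, int MP, int NP, int M, int N,
                           const int dims[2], MPI_Datatype block,
                           int tag, MPI_Comm cart,
                           float masterbuf[M][N], float buf[MP][NP])
    {
        MPI_Status status0;

        if (rank == 0)
        {
            int rankdes = 1;

            /* copy rank 0's own block */
            for (int i = 0; i < MP; i++)
                for (int j = 0; j < NP; j++)
                    buf[i][j] = masterbuf[i][j];

            /* synchronous sends of the remaining blocks, one per destination */
            for (int i = 0; i < dims[0]; i++)
                for (int j = 0; j < dims[1]; j++)
                    if (i != 0 || j != 0)
                    {
                        MPI_Ssend(&masterbuf[i*MP][j*NP], 1, block,
                                  rankdes, tag, cart);
                        rankdes++;
                    }
        }
        else
        {
            MPI_Recv(&buf[0][0], MP*NP, MPI_FLOAT, 0, tag, cart, &status0);
        }
    }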
4. Results and Performance
[Figure: 192 x 128 image file before and after reconstruction using MPI]
[Figure: 768 x 768 image file before and after reconstruction using MPI]
To test the performance of our code on different numbers of processors, we increase the number of iterations to 1000k for the 192 x 128 image and 100k for the 768 x 768 image, and measure the execution time of the serial and parallel sections for different numbers of processors. Since any time below one second is too short to analyse reliably, we tune the iteration counts to obtain single-digit execution times.
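The timing follows the pattern below (as in the appended listing): the serial file read and write are timed on rank 0 only, while the parallel section is bracketed by barriers so that every process starts and stops the clock at the same point.

    /* condensed timing pattern from the listing; cart is the Cartesian
       communicator, p_stime and p_etime are doubles                     */
    MPI_Barrier(cart);
    p_stime = MPI_Wtime();

    /* ... distribution, iterations and gathering ... */

    MPI_Barrier(cart);
    p_etime = MPI_Wtime();

    if (rank == 0)
        printf("Parallel run time = %f s\n", p_etime - p_stime);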
[Figure: total and parallel run time (s) vs. number of processors for reconstructing 192 x 128.pgm at 1000k iterations]
[Figure: total and parallel speedup vs. number of processors for reconstructing 192 x 128.pgm at 1000k iterations]
[Figure: total and parallel run time (s) vs. number of processors for reconstructing 768 x 768.pgm at 100k iterations]
[Figure: total and parallel speedup vs. number of processors for reconstructing 768 x 768.pgm at 100k iterations]
We can see from the graphs that, for the small image, there is no further significant speedup once the count rises beyond about 10 processors, whereas the speedup for the bigger image keeps increasing roughly linearly. MPI on supercomputers with very large numbers of processors can therefore be very useful for processing large amounts of data.
6. Conclusion
We successfully wrote a working MPI program that runs parallel processes on a virtual 2D topology of processors to reconstruct a greyscale image from its edges, and exits when sufficient accuracy is reached. We then analysed the speedup and execution time by running the program on different numbers of processors, and observed that for larger data sets using more processors in MPI is worthwhile, as the speedup scales more efficiently. The MPI interface is quite different from plain C and produced many errors that were not resolved until the last minute, which limited the time available for further extensive tests. MPI is difficult until one gains enough practice to recognise the types of errors and their causes; nevertheless, once proficient, one can put it to good use.
92 MPI_Finalize();
93 exit(-1);
94 }
95
96 //2D topology creation
97 MPI_Comm cart; //new name for communication environment
98 MPI_Dims_create(P, ndims, dims);
99 MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, period, reorder, &cart);
100
101 //new ranks of processors
102 MPI_Comm_rank(cart, &rank);
103
104 //for communication with neighbouring processors
105 MPI_Cart_shift(cart,0,1,&up, &down);
106 MPI_Cart_shift(cart,1,1,&left, &right);
107
108 //chunk dimensions
109 MP=M/dims[0];
110 NP=N/dims[1];
111
112 //arrays for parallel processing
113 float buf[MP][NP];
114 float old[MP+2][NP+2];
115 float new[MP+2][NP+2];
116 float edge[MP+2][NP+2];
117
118 //Derived datatypes
119 MPI_Type_vector(MP,NP,N,MPI_FLOAT,&block); //block of size MP x NP
120 MPI_Type_commit(&block);
121 MPI_Type_vector(MP,1,(NP+2),MPI_FLOAT,&col); //column of size MP
122 MPI_Type_commit(&col);
123
124 //Reading the image file
125 if(rank == 0)
126 {
127 printf("Processing %d x %d image on %d processes\n", M, N, P);
128 printf("Number of iterations = %d\n", MAXITER);
129 filename = "edge192x128.pgm";
130 printf("<%s>\n", filename);
131 fr_stime=MPI_Wtime();
132 pgmread(filename, &masterbuf[0][0], M, N);
133 fr_etime=MPI_Wtime();
134 printf("\n");
135 }
136
137 //Parallel start time
138 MPI_Barrier(cart);
139 p_stime=MPI_Wtime();
140
141 //Distributing the masterbuf array to all processors as smaller buf arrays
142 if(rank==0)
143 {
144 rankdes=1;
145 //copying the buf array for the root processor from masterbuf array
146 for(i=0;i<MP;i++)
147 {
148 for(j=0;j<NP;j++)
149 {
150 buf[i][j]=masterbuf[i][j];
151 }
152 }
153
154 //distributing the remaining parts of masterbuf array as buf arrays to other processors
155 for(i=0;i<dims[0];i++)
156 {
157 for(j=0;j<dims[1];j++)
158 {
159 if(i!=0||j!=0)
160 {
161 MPI_Ssend(&masterbuf[i*MP][j*NP],1,block,rankdes,tag,cart);
162 rankdes++;
163 }
164 }
165 }
166 }
167
168 //other processors receiving their buf array
169 else
170 {
171 MPI_Recv(&buf[0][0],MP*NP,MPI_FLOAT,0,tag,cart,&status0);
172 }
173
174 //copying the buf array to edge array to perform operations
258 MPI_Allreduce(&delmax,&delmax_tot,1,MPI_FLOAT,MPI_MAX,cart);
259 if(delmax_tot <= CRITERION)
260 {
261 if(rank==0)
262 {
263 printf("iter=%d, delmax=%f is lower than criterion=%f.\n",iter,delmax_tot,CRITERION);
264 }
265 stoploop=1;
266 }
267 delmax_tot=0;
268 delmax=0;
269 }
270
271
272 }
273
274 if (rank==0)
275 {
276 printf("%d iterations\n", iter-1);
277 }
278
279 //copying back the reconstructed values into buf array
280 for (i=1;i<MP+1;i++)
281 {
282 for (j=1;j<NP+1;j++)
283 {
284 buf[i-1][j-1]=old[i][j];
285 }
286 }
287
288 //All processors except root processor sending their arrays back to root processor
289 if(rank!=0)
290 {
291 MPI_Ssend(&buf[0][0], MP*NP, MPI_FLOAT,0,tag,cart);
292 }
293
294 //Root processor gathering the buf arrays and reconstructing the masterbuf array
295 if(rank==0)
296 {
297 ranksrc=1;
298 //copying the root processor's buf array into masterbuf array
299 for(i=0;i<MP;i++)
300 {
301 for(j=0;j<NP;j++)
302 {
303 masterbuf[i][j]=buf[i][j];
304 }
305 }
306
307 //Receiving the buf arrays from other processors into masterbuf array
308 for(i=0;i<dims[0];i++)
309 {
310 for(j=0;j<dims[1];j++)
311 {
312 if(i!=0||j!=0)
313 {
314 MPI_Recv(&masterbuf[i*MP][j*NP],1,block,ranksrc,tag,cart,&status0);
315 ranksrc++;
316 }
317 }
318 }
319 }
320
321 //Calculating parallel end time by using barrier to wait for all the processors to reach here
322 MPI_Barrier(cart);
323 p_etime=MPI_Wtime();
324
325
326 //Writing the reconstructed image
327 if (rank == 0)
328 {
329 filename="image192x128.pgm";
330 printf("<%s>\n", filename);
331 fwr_stime=MPI_Wtime();
332 pgmwrite(filename, &masterbuf[0][0], M, N);
333 fwr_etime=MPI_Wtime();
334 etime=MPI_Wtime();//end of program time
335 printf("Total run time = %f\nParallel run time = %f\n",etime-stime,p_etime-p_stime);
336 printf("file read time = %f\nfile write time = %f\n",fr_etime-fr_stime,fwr_etime-fwr_stime);
337 }
338
339 //exiting the mpi communication environment
340 MPI_Finalize(); }