Early Prediction of the Cost of HPC Application Execution in the Cloud
Abstract—Even if clouds are not fit for high-end HPC applications, they could be profitably used to bring the power of economic and scalable parallel computing to the masses. But this requires both simple development environments, able to exploit cloud scalability, and the capability to easily predict the cost of HPC application runs.

This paper presents a framework built on top of a cloud-aware programming platform (mOSAIC) for the development of bag-of-tasks scientific applications. The framework integrates a cloud-based simulation environment able to predict the behavior of the developed applications. Simulations enable the developer to predict, at an early development stage, performance and cloud resource usage, and so the infrastructure lease cost on a public cloud.

The paper sketches the framework organization and discusses the approach followed for application development. Moreover, some validation tests of prediction results are presented.

I. INTRODUCTION

At least in theory, clouds could be profitably used to bring the power of economic and scalable parallel computing to the masses. The main obstacles to this process are the substantial differences between the “traditional” and the cloud-based paradigm, and the lack of adequate development tools to support the porting of legacy applications to the cloud.

Furthermore, users/developers of scientific codes are not inclined to tolerate the moderate performance losses due to the systematic use of virtualization and, above all, to the use in cloud data centers of networks designed mainly for scalability, not for performance [1]. The high variance of response times due to multitenancy and to loads hidden from the user's view and control, along with always-possible transient failures of the cloud infrastructure, does the rest. As a matter of fact, cloud computing is inherently unfit for high-end scientific applications, which in the near future are likely to still be executed on purposely-designed and dedicated HPC systems.

However, there is a wide range of applications widely used in science, engineering and for commercial purposes that have highly variable response times, are moderately CPU-intensive, are not immediately suitable for GPU computing and are made up of loosely coupled tasks, so that computation easily dwarfs communication times. We think that this class of “para-scientific” applications is an almost ideal candidate for execution on the cloud. The major advantage is economic: the cost for leasing a small set of virtual cores can be very low, especially if there are relaxed time constraints for obtaining the results. A wise choice among provider offerings often makes it possible to acquire the computing resources needed at very low cost (see for example the EC2 Spot Instances offer [2]). This enables any organization to run parallel code whenever needed, at a low cost, without investing capital in rapidly obsolescing parallel hardware. The second important issue is cloud elasticity, which allows the number of virtual cores to be scaled in/out on the fly (i.e., while the application is running), based on the particular job requirements, paying just for the resources actually used. In other words, cloud computing is also a great opportunity for everyone to experiment with and exploit parallel computing at low cost, using a comfortable pay-as-you-go model.

We think that the final step to make clouds fully advantageous for sporadic scientific users is providing simple tools to predict the performance behavior of their applications, allowing them to make a tradeoff between performance and leasing costs. In a previous paper [3] we proposed the use of a cloud-enabled programming platform. This platform makes it possible to develop cloud applications on top of a cloud-aware programming framework (mOSAIC [4], [5]) by exploiting the bag-of-tasks programming paradigm. The bag-of-tasks (BOT) paradigm, also known as master-worker or processor farm, is widely understood, and ubiquitous in small and medium-scale scientific computing.

Moreover, in the past we worked on the performance prediction of cloud applications developed on top of the mOSAIC framework [6]. Already-available tools enable us to predict the performance of a cloud application without running it on (paid) cloud resources. In this paper, we discuss the enrichment of the bag-of-tasks framework with performance prediction capabilities, allowing the automatic generation of the application simulation models.

The remainder of this paper is structured as follows. In the next section we examine related work. Section III illustrates the rationale and the architecture of the framework we have implemented for the development of bag-of-tasks applications in the cloud. Section IV presents our approach to early performance prediction and Section V shows some of our validation tests. The paper closes with our conclusions and plans for future research.
Fig. 1. Architecture of the BOT framework
IV. PERFORMANCE PREDICTION OF BAG-OF-TASKS APPLICATIONS

In the previous section we described our bag-of-tasks framework for the development of scientific applications in the cloud. A key point is that such applications are fully defined by the number of instances of the mOSAIC components mentioned above and by their interconnections. Moving from these premises, we have devised a performance model of the application that can be used to predict its behavior and to tune its performance.

Our performance model is process-based [27], [28], [29], in that it is described through a set of discrete-event simulation components whose temporal behavior is described as a process. Event management and discrete-event actions/reactions are modeled in terms of process synchronization primitives. The simulated components have been developed by exploiting the JADES simulation library [29], which allows the description of process-oriented simulations in Java.

A noteworthy feature of the solution devised is that simulation models expressed through the JADES library can be easily evaluated through the mJADES platform [30]. mJADES is a recently-developed system that supports the distribution of multiple JADES simulations on cloud resources. The mJADES simulation system is based on a Java-based modular architecture. The mJADES simulation manager produces simulation tasks from simulation jobs, and schedules them to be executed concurrently on multiple instances of the simulation core, a process-oriented discrete-event simulation engine based on the JADES simulation library. The outputs from the runs are handed on to a simulation analyzer, whose task is to compute aggregates and to generate reports for the final user. mJADES has been developed as a collection of mOSAIC components, and so the evaluation of the models can be performed on any mOSAIC platform, whether it is the one used by the application under study or a different one.

A. Bag-of-Tasks Simulation Model

To model a bag-of-tasks application, we developed a simulation component for each of the core components making up the bag-of-tasks framework (Splitter, Worker and Merger), and for each component devoted to managing the cloud application execution (HTTPgw, Orchestrator), communication and storage (Queue and KVstore).

The management components of the bag-of-tasks framework (HTTPgw and Orchestrator) have the role of managing the cloud application execution, offering an interface to end users (HTTPgw) and orchestrating the execution, forwarding the messages to the right splitter when multiple bags of tasks are executed on the same resources and starting/stopping the Splitter-Worker-Merger triple.

Our simulation model is driven by a workload generator, which is in charge of generating the sequence of requests the users issue over time. At the start of the simulation, it begins sending out messages to the Orchestrator according to the chosen workload, described in a configuration file. Currently two simple workload models are available: a set of requests with fixed inter-arrival time, and a Poisson arrival model that generates messages with exponentially distributed random inter-arrival times. Moreover, it is also possible to start multiple concurrent workloads, mimicking the load generated by multiple users.
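As an illustration of the two workload models, the minimal sketch below draws either a fixed or an exponentially distributed inter-arrival gap; the class and field names are hypothetical and do not correspond to the actual workload generator code.

```java
import java.util.Random;

/**
 * Minimal sketch of an inter-arrival time generator (hypothetical class,
 * not the framework's workload generator).
 */
public class ArrivalSketch {
    private final Random rng = new Random();
    private final double meanInterArrival; // e.g., 1.0 / SEND_FREQ
    private final boolean poisson;         // false = fixed inter-arrival time

    public ArrivalSketch(double meanInterArrival, boolean poisson) {
        this.meanInterArrival = meanInterArrival;
        this.poisson = poisson;
    }

    /** Time to wait before issuing the next job request. */
    public double nextGap() {
        if (!poisson) {
            return meanInterArrival;               // fixed inter-arrival workload
        }
        // Poisson arrivals: exponential gap via inverse-transform sampling
        return -meanInterArrival * Math.log(1.0 - rng.nextDouble());
    }
}
```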
The Orchestrator model is fairly straightforward: it continuously receives Jobs from a queue (which mimics an HTTP channel) and forwards them to the Splitter. At present the model is very simple, since it is based on the assumption of a fixed application configuration, which cannot be dynamically altered (i.e., we cannot start a new set of Split-Work-Merge components during the simulation). So the Orchestrator has just the role of routing the messages to the Splitter instances. We aim at improving the Orchestrator model in the future.

The Splitter, Merger and Worker models behave in a similar way, receiving and forwarding messages from/to the internal queues according to the described bag-of-tasks pattern. The Merger process, after collecting all the intermediate job Result messages, sends a job Result message to a special simulator component, the report generator, which gathers the results and produces the final simulation reports.
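To make the receive-and-forward behavior concrete, the sketch below expresses a Worker's loop with plain java.util.concurrent queues instead of the JADES process primitives; the Task and Result records are placeholders, so this only illustrates the message flow, not the simulated component itself.

```java
import java.util.concurrent.BlockingQueue;

public class WorkerSketch implements Runnable {
    // Placeholder message types standing in for the framework's Task/Result messages.
    public record Task(int jobId, int taskId) {}
    public record Result(int jobId, int taskId) {}

    private final BlockingQueue<Task> taskQueue;     // fed by the Splitter
    private final BlockingQueue<Result> resultQueue; // drained by the Merger

    public WorkerSketch(BlockingQueue<Task> tasks, BlockingQueue<Result> results) {
        this.taskQueue = tasks;
        this.resultQueue = results;
    }

    @Override
    public void run() {
        try {
            while (true) {
                Task t = taskQueue.take();            // receive a Task from the internal queue
                // ...here the simulated Worker would consume WORK_TIME of vCPU time...
                resultQueue.put(new Result(t.jobId(), t.taskId())); // forward the intermediate Result
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // stop when the simulation ends
        }
    }
}
```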
The computational resources consumed by our core components (i.e., the Orchestrator, Splitter, Merger and Worker processes) are taken into account by means of a component simulating CPU resource sharing.
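The CPU-sharing component itself is not listed in the paper; as a rough illustration of the idea, the sketch below stretches a process's vCPU demand by the number of processes mapped to the same vCPU and rounds the result up to whole scheduler time slices. This processor-sharing approximation is an assumption made here for illustration only.

```java
/**
 * Illustrative processor-sharing approximation (an assumption, not the actual
 * CPU-sharing component): the elapsed time of a CPU burst grows with the
 * number of framework processes allocated to the same vCPU.
 */
public class CpuSharingSketch {
    private final double timeSlice; // vCPU_SLICE: scheduler preemption time

    public CpuSharingSketch(double timeSlice) {
        this.timeSlice = timeSlice;
    }

    /**
     * @param cpuDemand     vCPU time requested by the process (e.g., WORK_TIME)
     * @param residentProcs number of processes mapped to this vCPU by the allocation map
     * @return elapsed (simulated) time, quantized to whole time slices
     */
    public double elapsedTime(double cpuDemand, int residentProcs) {
        double stretched = cpuDemand * Math.max(1, residentProcs);
        return Math.ceil(stretched / timeSlice) * timeSlice;
    }
}
```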
Listing 1. The Queue Simulation Process
public class Queue extends it.unisannio.ing.perflab.jades.core.Process {
At simulation start-up it is necessary to provide the actual number of available virtual CPUs (vCPUs) and the allocation to vCPUs of the framework components involved in a run.

The response time of the queues (i.e., the time needed to notify a message to a process, once it has been published on the queue) is modeled as a function of the number of queued messages and of the size of the messages (according to the iLDS model [31]). Listing 1 shows the code of the process simulating the behavior of the communication queues.

Figure 2 sketches the proposed simulation model. It should be noted how closely it resembles the structure of the bag-of-tasks application.

As previously pointed out, even if the actual behavior of the framework strictly depends on the specific algorithm to be implemented, in any case the core bag-of-tasks components receive messages from queues and forward them to other queues, consuming a suitable amount of CPU time. Starting from this consideration, we can fully describe a bag-of-tasks instance by means of two sets of parameters.

The application instance parameters represent the values under the control of the framework user. They describe the application and the BOT framework configuration to be simulated, as follows:

• virtual CPUs (vCPU_N): vCPUs available to the mOSAIC platform;
• vCPU time slice (vCPU_SLICE): vCPU scheduler preemption time;
• workers (WORKERS): number of Worker instances;
• splitters (SPLITTERS): number of Splitter instances;
• jobs (JOBS): number of jobs generated and submitted;
• send job frequency (SEND_FREQ): job send rate;
• tasks (TASKS): number of tasks generated by the Splitter;
• worker overhead (WORK_TIME): estimate of the vCPU time required by a Worker to process a Task;
• allocation map: allocation matrix describing the allocation of the framework processes to the available vCPUs.

The framework tuning parameters are used to model a specific framework instance, taking into account the overhead introduced by the framework itself, by the platform and by any underlying software layer. After they have been estimated for a framework instance, they do not vary between different simulation runs. In the following section we will describe a methodology for the evaluation of such values. The parameters are:

• HTTP overhead (HTTP_OH): the communication delay introduced by an HTTP communication channel;
• Orchestrator overhead (ORCH_OH): the orchestrator overhead (vCPU time) introduced to process each submitted job and to forward the descriptor to the Splitter;
• Splitter overhead (SPLIT_OH): overhead (vCPU time) introduced to execute a single split operation, to create and to forward a Task;
• Merger overhead (MERGE_OH): overhead (vCPU time) introduced to execute a single merge operation;
• Job Descriptor message (JOB_MSG_SIZE): size of the Job Descriptor messages;
• Result message (RESULT_MSG_SIZE): size of the Result messages;
• Task message (TASK_MSG_SIZE): size of the Task messages;
• HTTP channel: beta0, beta1, beta2 parameters (see Listing 1) to model the HTTP channel;
• bag-of-tasks queues: beta0, beta1, beta2 parameters to model the framework internal queues (a sketch of how these parameters could be used follows this list).
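As a sketch of how the beta0, beta1, beta2 parameters could enter the queue model, the method below assumes a simple load-dependent form in which the notification delay grows linearly with the number of queued messages and with the message size; the exact functional form used by the iLDS model [31] and by the Queue process of Listing 1 may differ.

```java
/**
 * Hedged sketch of a load-dependent queue delay (assumed linear form;
 * the actual iLDS model [31] may use a different expression).
 */
public class QueueDelaySketch {
    private final double beta0; // fixed per-message overhead
    private final double beta1; // additional delay per queued message
    private final double beta2; // additional delay per unit of message size

    public QueueDelaySketch(double beta0, double beta1, double beta2) {
        this.beta0 = beta0;
        this.beta1 = beta1;
        this.beta2 = beta2;
    }

    /** Time needed to notify a message to a process once it has been published. */
    public double responseTime(int queuedMessages, double messageSize) {
        return beta0 + beta1 * queuedMessages + beta2 * messageSize;
    }
}
```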
Fig. 2. Bag-Of-Tasks simulation model
Fig. 4. Comparison between skeletal application and simulated completion times
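Once values for the instance and tuning parameters above are available, even a naive back-of-envelope calculation gives a first feel for completion time, and hence for lease cost, before the simulator is run. The sketch below ignores queueing delays, message transfer times and CPU contention, and the hourly vCPU price is a placeholder; it is the simulation model that actually accounts for these effects.

```java
/**
 * Naive back-of-envelope estimate (illustrative assumption only): serial split
 * and merge phases plus perfectly balanced workers, with no queueing delays
 * or CPU contention. The discrete-event simulator captures what this ignores.
 */
public class CostEnvelopeSketch {
    public static double jobTimeSeconds(int tasks, int workers, double workTime,
                                        double splitOh, double mergeOh, double orchOh) {
        double splitPhase = tasks * splitOh;                                 // SPLIT_OH per Task
        double workPhase  = Math.ceil((double) tasks / workers) * workTime; // waves of WORK_TIME
        double mergePhase = tasks * mergeOh;                                 // MERGE_OH per Result
        return orchOh + splitPhase + workPhase + mergePhase;
    }

    public static double leaseCost(double totalSeconds, int vCpuCount, double pricePerVcpuHour) {
        double billedHours = Math.ceil(totalSeconds / 3600.0); // assume hourly billing
        return billedHours * vCpuCount * pricePerVcpuHour;     // placeholder price
    }

    public static void main(String[] args) {
        // Hypothetical numbers, for illustration only.
        double t = jobTimeSeconds(100, 8, 60.0, 0.05, 0.05, 0.02);
        System.out.printf("estimated job time: %.1f s, estimated cost: %.2f USD%n",
                t, leaseCost(t, 8, 0.05));
    }
}
```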
provided by the mOSAIC platform. On the other hand, we have developed a set of simulation components for JADES. These simulation components correspond one-to-one to the mOSAIC components of the BOT framework. Besides presenting the approach, we have discussed the outcome of our preliminary performance tests used to evaluate the simulation accuracy.

In our tests, we offered some examples of usage under different workloads and using different amounts of resources, in order to show how it will be possible for a developer to predict the behavior of their application, without running it on (paid) cloud resources, but simply by using the associated simulation environment.

Our future research work will focus on the extensive testing of the framework and simulation components, by collecting measurements on real-world scientific codes running in private and commercial cloud environments. We also plan to implement alternative frameworks for additional programming paradigms.

REFERENCES

[1] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” SIGCOMM Comput. Commun. Rev., vol. 38, no. 4, pp. 63–74, Aug. 2008.
[2] Amazon EC2 Spot Instances. [Online]. Available: https://aws.amazon.com/ec2/purchasing-options/spot-instances/
[3] A. De Benedictis, M. Rak, M. Turtur, and U. Villano, “Cloud-aware development of scientific applications,” in 23rd IEEE International WETICE Conference (WETICE-2014), June 2014, pp. 149–154.
[4] D. Petcu, C. Craciun, M. Neagul, S. Panica, B. D. Martino, S. Venticinque, M. Rak, and R. Aversa, “Architecturing a sky computing platform,” in ServiceWave Workshops, ser. Lecture Notes in Computer Science, M. Cezon and Y. Wolfsthal, Eds., vol. 6569. Springer, 2010, pp. 1–13.
[5] D. Petcu and M. Rak, “Open-source cloudware support for the portability of applications using cloud infrastructure services,” in Cloud Computing, ser. Computer Communications and Networks, Z. Mahmood, Ed. Springer London, 2013, pp. 323–341.
[6] A. Cuomo, M. Rak, and U. Villano, “Performance prediction of cloud applications through benchmarking and simulation,” International Journal of Computational Science and Engineering, in press, 2014.
[7] C. A. Lee, “A perspective on scientific cloud computing,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC ’10. New York, NY, USA: ACM, 2010, pp. 451–459.
[8] A. Iosup, S. Ostermann, M. Yigitbasi, R. Prodan, T. Fahringer, and D. H. J. Epema, “Performance analysis of cloud computing services for many-tasks scientific computing,” Parallel and Distributed Systems, IEEE Transactions on, vol. 22, no. 6, pp. 931–945, June 2011.
[9] W. Lu, J. Jackson, and R. Barga, “Azureblast: A case study of developing science applications on the cloud,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC ’10. New York, NY, USA: ACM, 2010, pp. 413–420.
[10] D. Agarwal and S. Prasad, “Azurebot: A framework for bag-of-tasks applications on the azure cloud platform,” in Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), 2013 IEEE 27th International, May 2013, pp. 2139–2146.
[11] E. Mocanu, V. Galtier, and N. Tapus, “Generic and fault-tolerant bag-of-tasks framework based on javaspace technology,” in Systems Conference (SysCon), 2012 IEEE International, March 2012, pp. 1–6.
[12] G. Galante and L. Bona, “Constructing elastic scientific applications using elasticity primitives,” in Computational Science and Its Applications – ICCSA 2013, ser. Lecture Notes in Computer Science, B. Murgante, S. Misra, M. Carlini, C. Torre, H.-Q. Nguyen, D. Taniar, B. Apduhan, and O. Gervasi, Eds. Springer Berlin Heidelberg, 2013, vol. 7975, pp. 281–294.
[13] A. Raveendran, T. Bicer, and G. Agrawal, “A framework for elastic execution of existing MPI programs,” in Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, May 2011, pp. 940–947.
[14] Y. Gong, B. He, and J. Zhong, “An overview of cmpi: Network performance aware mpi in the cloud,” in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’12. New York, NY, USA: ACM, 2012, pp. 297–298.
[15] ——, “Network performance aware MPI collective communication operations in the cloud,” Parallel and Distributed Systems, IEEE Transactions on, 2013.
[16] R. Aversa, A. Mazzeo, N. Mazzocca, and U. Villano, “Heterogeneous system performance prediction and analysis using PS,” IEEE Concurrency, vol. 6, no. 3, pp. 20–29, Jul./Sep. 1998.
[17] B. Di Martino, E. Mancini, M. Rak, R. Torella, and U. Villano, “Cluster systems and simulation: from benchmarking to off-line performance prediction,” Concurrency and Computation: Practice and Experience, vol. 19, no. 11, pp. 1549–1562, 2007.
[18] S. Achour, M. Ammar, B. Khmili, and W. Nasri, “MPI-PERF-SIM: Towards an automatic performance prediction tool of MPI programs on hierarchical clusters,” in Parallel, Distributed and Network-Based Processing (PDP), 2011 19th Euromicro International Conference on. IEEE, 2011, pp. 207–211.
[19] P. Clauss, M. Stillwell, S. Genaud, F. Suter, H. Casanova, and M. Quinson, “Single node on-line simulation of MPI applications with SMPI,” in Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International. IEEE, 2011, pp. 664–675.
[20] A. Mos and J. Murphy, “A framework for performance monitoring, modelling and prediction of component oriented distributed systems,” in Proceedings of the 3rd International Workshop on Software and Performance. ACM, 2002, pp. 235–236.
[21] H. Koziolek, “Performance evaluation of component-based software systems: A survey,” Performance Evaluation, vol. 67, no. 8, pp. 634–658, 2010.
[22] A. Li, X. Yang, S. Kandula, and M. Zhang, “Cloudcmp: comparing public cloud providers,” in Proceedings of the 10th Annual Conference on Internet Measurement. ACM, 2010, pp. 1–14.
[23] R. Calheiros, R. Ranjan, A. Beloglazov, C. De Rose, and R. Buyya, “Cloudsim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms,” Software: Practice and Experience, vol. 41, no. 1, pp. 23–50, 2011.
[24] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[25] A. Videla and J. J. W. Williams, RabbitMQ in Action: Distributed Messaging for Everyone. Shelter Island, NY: Manning, 2012.
[26] Riak. [Online]. Available: http://basho.com/riak/
[27] K. Perumalla and R. Fujimoto, “Efficient large-scale process-oriented parallel simulations,” in Proc. of the 30th Winter Simulation Conference, 1998, pp. 459–466.
[28] H. Schwetman, “CSIM19: a powerful tool for building system models,” in Proc. of the 33rd Winter Simulation Conference. IEEE, 2001, pp. 250–255.
[29] A. Cuomo, M. Rak, and U. Villano, “Process-oriented discrete-event simulation in Java with continuations: quantitative performance evaluation,” in Proc. of the International Conference on Simulation and Modeling Methodologies, Technologies and Applications (SIMULTECH). SciTePress, 2012, pp. 87–96.
[30] M. Rak, A. Cuomo, and U. Villano, “mJADES: Concurrent simulation in the cloud,” in CISIS, L. Barolli, F. Xhafa, S. Vitabile, and M. Uehara, Eds. IEEE, 2012, pp. 853–860.
[31] M. Rak, R. Aversa, B. D. Martino, and A. Sgueglia, “Web services resilience evaluation using LDS load dependent server models,” JCM, vol. 5, no. 1, pp. 39–49, 2010.
[32] G. Aversano, M. Rak, and U. Villano, “The mOSAIC benchmarking framework: Development and execution of custom cloud benchmarks,” Scalable Computing: Practice and Experience, vol. 14, no. 1, 2013.