WRENCH: A Framework for Simulating Workflow Management Systems

Henri Casanova∗ , Suraj Pandey∗ , James Oeth§ , Ryan Tanaka∗ , Frédéric Suter‡ , Rafael Ferreira da Silva§
∗ Information and Computer Sciences, University of Hawaii, Honolulu, HI, USA
§ Information Sciences Institute, University of Southern California, Marina Del Rey, CA, USA
‡ IN2P3 Computing Center, CNRS, Villeurbanne, France

{henric,surajp,ryanyt}@hawaii.edu, {rafsilva,oeth}@isi.edu, [email protected]

Abstract—Scientific workflows are used routinely in numerous scientific domains, and Workflow Management Systems (WMSs) have been developed to orchestrate and optimize workflow executions on distributed platforms. WMSs are complex software systems that interact with complex software infrastructures. Most WMS research and development activities rely on empirical experiments conducted with full-fledged software stacks on actual hardware platforms. Such experiments, however, are limited to the hardware and software infrastructures at hand and can be labor- and/or time-intensive. As a result, relying solely on real-world experiments impedes WMS research and development. An alternative is to conduct experiments in simulation.

In this work we present WRENCH, a WMS simulation framework, whose objectives are (i) accurate and scalable simulations; and (ii) easy simulation software development. WRENCH achieves its first objective by building on the SimGrid framework. While SimGrid is recognized for the accuracy and scalability of its simulation models, it only provides low-level simulation abstractions and thus large software development efforts are required when implementing simulators of complex systems. WRENCH thus achieves its second objective by providing high-level and directly re-usable simulation abstractions on top of SimGrid. After describing and giving rationales for WRENCH's software architecture and APIs, we present a case study in which we apply WRENCH to simulate the Pegasus production WMS. We report on ease of implementation, simulation accuracy, and simulation scalability so as to determine to which extent WRENCH achieves its two above objectives. We also draw both qualitative and quantitative comparisons with a previously proposed workflow simulator.

Index Terms—Scientific Workflows, Workflow Management Systems, Simulation, Distributed Computing

I. INTRODUCTION

Scientific workflows have become mainstream in support of research and development activities in numerous scientific domains [1]. Consequently, several Workflow Management Systems (WMSs) have been developed [2]–[7] that allow scientists to execute workflows on distributed platforms that can accommodate executions at various scales. WMSs handle the logistics of workflow executions and make decisions regarding resource selection, data management, and computation scheduling, the goal being to optimize some performance metric (e.g., latency [8], [9], throughput [10], [11], jitter [12], reliability [13]–[15], power consumption [16], [17]). WMSs are complex software systems that interact with complex software infrastructures and can thus employ a wide range of designs and algorithms.

In spite of active WMS development and use in production, which has entailed solving engineering challenges, fundamental questions remain unanswered in terms of system designs and algorithms. Although there are theoretical underpinnings for most of these questions, theoretical results often make assumptions that do not hold with production hardware and software infrastructures. Further, the specifics of the design of a WMS can impose particular constraints on what solutions can be implemented effectively, and these constraints are typically not considered in available theoretical results. Consequently, current research that aims at improving and evolving the state of the art, although sometimes informed by theory, is mostly done via “real-world” experiments: designs and algorithms are implemented, evaluated, and selected based on experiments conducted for a particular WMS implementation with particular workflow configurations on particular platforms. As a corollary, from the WMS user's perspective, quantifying accurately how a WMS would perform for a particular workflow configuration on a particular platform entails actually executing that workflow on that platform.

Unfortunately, real-world experiments have limited scope, which impedes WMS research and development. This is because they are confined to the application and platform configurations available at hand, and thus cover only a small subset of the relevant scenarios that may be encountered in practice. Furthermore, exclusively relying on real-world experiments makes it difficult or even impossible to investigate hypothetical scenarios (e.g., “What if the network had a different topology?”, “What if there were 10 times more compute nodes but they had half as many cores?”). Real-world experiments, especially when large-scale, are often not fully reproducible due to shared networks and compute resources, and due to transient or idiosyncratic behaviors (maintenance schedules, software upgrades, and particular software (mis)configurations). Running real-world experiments is also time-consuming, thus possibly making it difficult to obtain statistically significant numbers of experimental results. Real-world experiments are driven by WMS implementations that often impose constraints on workflow executions. Furthermore, WMSs are typically not monolithic but instead reuse CyberInfrastructure (CI) components that impose their own overheads and constraints on workflow execution. Exploring what lies beyond these constraints via real-world executions, e.g., for research and development purposes, typically entails unacceptable software (re-)engineering costs. Finally, running real-world experiments can also be labor-intensive. This is due to the need to install and execute many full-featured software stacks, including actual scientific workflow implementations, which is often not deemed worthwhile for “just testing out” ideas.

An alternative to conducting WMS research via real-world experiments is to use simulation, i.e., implement a software artifact that models the functional and performance behaviors of software and hardware stacks of interest. Simulation is used in many computer science domains and can address the limitations of real-world experiments outlined above. Several simulation frameworks have been developed that target the parallel and distributed computing domain [18]–[34]. Some simulation frameworks have also been developed specifically for the scientific workflow domain [11], [35]–[40].

We claim that advances in simulation capabilities in the field have made it possible to simulate WMSs that execute large workflows on large-scale platforms accessible via diverse CI services in a way that is accurate (via validated simulation models), scalable (fast execution and low memory footprint), and expressive (ability to describe arbitrary platforms, complex WMSs, and complex software infrastructures). In this work, we build on the existing open-source SimGrid simulation framework [33], [41], which has been one of the drivers of the above advances and whose simulation models have been extensively validated [42]–[46], to develop a WMS simulation framework called WRENCH [47]. More specifically, this work makes the following contributions:

1) We justify the need for WRENCH and explain how it improves on the state of the art.
2) We describe the high-level simulation abstractions provided by WRENCH that (i) make it straightforward to implement full-fledged simulated versions of complex WMS systems; and (ii) make it possible to instantiate simulation scenarios with only a few lines of code.
3) Via a case study with the Pegasus [2] production WMS, we evaluate the ease-of-use, accuracy, and scalability of WRENCH, and compare it with a previously proposed simulator, WorkflowSim [35].

This paper is organized as follows. Section II discusses related work. Section III outlines the design of WRENCH and describes how its APIs are used to implement simulators. Section IV presents our case study. Finally, Section V concludes with a brief summary of results and a discussion of future research directions.

II. RELATED WORK

Many simulation frameworks have been developed for parallel and distributed computing research and development. They span domains such as HPC [18]–[21], Grid [22]–[24], Cloud [25]–[27], Peer-to-peer [28], [29], or Volunteer Computing [30]–[32]. Some frameworks have striven to be applicable across some or all of the above domains [33], [34]. Two conflicting concerns are accuracy (the ability to capture the behavior of a real-world system with as little bias as possible) and scalability (the ability to simulate large systems with as few CPU cycles and bytes of RAM as possible). The aforementioned simulation frameworks achieve different compromises between these two concerns by using various simulation models. At one extreme are discrete event models that simulate the “microscopic” behavior of hardware/software systems (e.g., by relying on packet-level network simulation for communication [48], or on cycle-accurate CPU simulation [49] or emulation for computation). In this case, the scalability challenge can be handled by using Parallel Discrete Event Simulation [50], i.e., the simulation itself is a parallel application that requires a parallel platform whose scale is at least commensurate to that of the simulated platform. At the other extreme are analytical models that capture “macroscopic” behaviors (e.g., transfer times as data sizes divided by bottleneck bandwidths, compute times as numbers of operations divided by compute speeds). While these models are typically more scalable, they must be developed with care so that they are accurate. In previous work, it has been shown that several available simulation frameworks use macroscopic models that can exhibit high inaccuracy [43].

A number of simulators have been developed that target scientific workflows. Some of them are stand-alone simulators [11], [35]–[37]. Others are integrated with a particular WMS to promote more faithful simulation and code re-use [38], [39] or to execute simulations at runtime to guide on-line scheduling decisions made by the WMS [40].

The authors in [39] conduct a critical analysis of the state of the art of workflow simulators. They observe that many of these simulators do not capture the details of underlying infrastructures and/or use naive simulation models. This is the case with custom simulators such as those in [36], [37], [40]. But it is also the case with workflow simulators built on top of generic simulation frameworks that provide convenient user-level abstractions but fail to model the details of the underlying infrastructure, e.g., the simulators in [11], [35], [38], which build on the CloudSim [25] or GroudSim [24] frameworks. These frameworks have been shown to lack in their network modeling capabilities [43]. As a result, some authors readily recognize that their simulators are likely only valid when network effects play a small role in workflow executions (i.e., when workflows are not data-intensive).

To overcome the above limitations, in [39] the authors have improved the network model in GroudSim and also use a separate simulator, DISSECT-CF [27], for simulating cloud infrastructures accurately. Both [39] and [27] acknowledge that the popular SimGrid [33], [41] simulation framework offers compelling capabilities, both in terms of scalability and simulation accuracy. But one of their reasons for not considering SimGrid is that, because it is low-level, using it to implement a simulator of a complex system, such as a WMS and the CI services it uses, would be too labor-intensive. In this work, we address this issue by developing a simulation framework that provides convenient, reusable, high-level abstractions but that builds on SimGrid so as to benefit from its scalable and accurate simulation models. Furthermore, unlike [38], [39], we do not focus on integration with any specific WMS. The argument in [39] is that stand-alone simulators, such as that in [35], are disconnected from real-world WMSs because they abstract away much of the complexity of these systems. Instead, our proposed framework does capture low-level system details (and simulates them well thanks to SimGrid), but provides high-level enough abstractions to implement faithful simulations of complex WMSs with minimum effort, which we demonstrate via a case study with the Pegasus WMS [2].

Also related to this work is previous research that has not focused on providing simulators or simulation frameworks per se, but instead on WMS simulation methodology. In particular, several authors have investigated methods for injecting realistic stochastic noise in simulated WMS executions [35], [51]. These techniques can be adopted by most of the aforementioned frameworks, including the one proposed in this work.
III. WRENCH

A. Objective and Intended Users

WRENCH's objective is to make it possible to study WMSs in simulation in a way that is accurate (faithful modeling of real-world executions), scalable (low computation and memory footprints on a single computer), and expressive (ability to simulate arbitrary WMS, workflow, and platform scenarios with minimal software engineering effort). WRENCH is not a simulator but a simulation framework that is distributed as a C++ library. It provides high-level reusable abstractions for developing simulated WMS implementations and simulators for the execution of these implementations. There are two categories of WRENCH users:

1. Users who implement simulated WMSs – These users are engaged in WMS research and development activities and need an “in simulation” version of their current or intended WMS. Their goals typically include evaluating how their WMS behaves over hypothetical experimental scenarios and comparing competing algorithm and system design options. For these users, WRENCH provides the WRENCH Developer API (described in Section III-D), which eases WMS development by removing the typical difficulties involved when developing, either in real-world or in simulation mode, a system comprised of distributed components that interact both synchronously and asynchronously. To this end, WRENCH makes it possible to implement a WMS as a single thread of control that interacts with simulated CI services via high-level APIs and must react to a small set of asynchronous events.

2. Users who execute simulated WMSs – These users simulate how given WMSs behave for particular workflows on particular platforms. Their goals include comparing different WMSs, determining how a given WMS would behave for various workflow configurations, comparing different platform and resource provisioning options, determining performance bottlenecks, engaging in pedagogic activities centered on distributed computing and workflow issues, etc. These users can develop simulators via the WRENCH User API (described in Section III-E), which makes it possible to build a full-fledged simulator with only a few lines of code.

Users in the first category above often also belong to the second category. That is, after implementing a simulated WMS these users typically instantiate simulators for several experimental scenarios to evaluate their WMS.

Fig. 1: The four layers in the WRENCH architecture from bottom to top: simulation core, simulated core services, simulated WMS implementations, and simulators.

B. Software Architecture Overview

Figure 1 depicts WRENCH's software architecture. At the bottom layer is the Simulation Core, which simulates low-level software and hardware stacks using the simulation abstractions and models provided by SimGrid (see Section III-C). The next layer implements simulated CI services that are commonly found in current distributed platforms and used by production WMSs. At the time of this writing, WRENCH provides services in 4 categories: compute services that provide access to compute resources to execute workflow tasks; storage services that provide access to storage resources for storing workflow data; network monitoring services that can be queried to determine network distances; and data registry services that can be used to track the location of (replicas of) workflow data. Each category includes multiple service implementations, so as to capture specifics of currently available CI services used in production. For instance, in its current version WRENCH provides a “batch-scheduled cluster” compute service, a “cloud” compute service, and a “bare-metal” compute service. The layer above in the software architecture consists of simulated WMSs, which interact with CI services using the WRENCH Developer API (see Section III-D). These WMS implementations, which can simulate production WMSs or WMS research prototypes, are not included as part of the WRENCH distribution, but are implemented as stand-alone projects. One such project is the simulated Pegasus implementation used for our case study in Section IV. Finally, the top layer consists of simulators that configure and instantiate particular CI services and particular WMSs on a given simulated hardware platform, that launch the simulation, and that analyze the simulation outcome. These simulators use the WRENCH User API (see Section III-E). Here again, these simulators are not part of WRENCH, but are implemented as stand-alone projects.

C. Simulation Core

WRENCH's simulation core is implemented using SimGrid's S4U API, which provides all necessary abstractions and models to simulate computation, I/O, and communication activities on arbitrary hardware platform configurations. These platform configurations are defined by XML files that specify network topologies and endpoints, compute resources, and storage resources [52].

At its most fundamental level, SimGrid provides a Concurrent Sequential Processes (CSP) model: a simulation consists of sequential threads of control that consume hardware resources. These threads of control can implement arbitrary code, exchange messages via a simulated network, perform computation on simulated (multicore) hosts, and perform I/O on simulated storage devices. In addition, SimGrid provides a virtual machine abstraction that includes a migration feature. Therefore, SimGrid provides all the base abstractions necessary to implement the classes of distributed systems that are relevant to scientific workflow executions. However, these abstractions are low-level, and a common criticism of SimGrid is that implementing a simulation of a complex system requires a large software engineering effort. A WMS executing a workflow using several CI services is a complex system, and WRENCH builds on top of SimGrid to provide high-level abstractions so that implementing this complex system is not labor-intensive.

We have selected SimGrid for WRENCH for the following reasons. SimGrid has been used successfully in many distributed computing domains (cluster, peer-to-peer, grid, cloud, volunteer computing, etc.), and thus can be used to simulate WMSs that execute over a wide range of platforms. SimGrid is open source and freely available, has been stable for many years, is actively developed, has a sizable user community, and has provided simulation results for over 350 research publications since its inception. SimGrid has also been the object of many invalidation and validation studies [42]–[46], and its simulation models have been shown to provide compelling advantages over other simulation frameworks in terms of both accuracy and scalability [33]. Finally, most SimGrid simulations can be executed in minutes on a standard laptop computer, making it possible to perform large numbers of simulations quickly with minimal compute resource expenses. To the best of our knowledge, among comparable available simulation frameworks (as reviewed in Section II), SimGrid is the only one to offer all the above desirable characteristics.
We have selected SimGrid for WRENCH for the following ready task on some number of cores at some compute service
reasons. SimGrid has been used successfully in many dis- and copy all produced files to some storage service, or it could
tributed computing domains (cluster, peer-to-peer, grid, cloud, decide to just copy a file between storage services and then
volunteer computing, etc.), and thus can be used to simulate update a data location service to keep track of the location of
WMSs that execute over a wide range of platforms. SimGrid this new file replica. It is the responsibility of the developer
is open source and freely available, has been stable for many to implement all decision-making algorithms employed by the
years, is actively developed, has a sizable user community, WMS. At the end of the iteration, the WMS simply waits for
and has provided simulation results for over 350 research a workflow execution event to which it can react if need be.
publications since its inception. SimGrid has also been the Most common events are job completions/failures and data
object of many invalidation and validation studies [42]–[46], transfer completions/failures.
and its simulation models have been shown to provide com- The WRENCH Developer API provides a rich set of meth-
pelling advantages over other simulation frameworks in terms ods to process analyze the workflow and to interact with CI
of both accuracy and scalability [33]. Finally, most SimGrid services to execute the workflow. These methods were de-
simulations can be executed in minutes on a standard laptop signed based on current and envisioned capabilities of current
computer, making it possible to perform large numbers of state-of-the-art WMSs. We refer the reader to the WRENCH
simulations quickly with minimal compute resource expenses. Web site [47] for more information on how to use this API
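To make the blueprint concrete, the sketch below shows what a WMS written against the Developer API might look like. It is a minimal illustration only: the base class, accessor, and manager method names used here are assumptions made for the sake of the example, not a statement of WRENCH's exact API; the actual interface is documented on the WRENCH Web site [47].

// Minimal sketch of a simulated WMS following Algorithm 1. All names below
// (wrench::WMS, getAvailableComputeServices(), createJobManager(), ...) are
// illustrative assumptions, not the exact WRENCH Developer API.
#include <wrench.h>

class SimpleWMS : public wrench::WMS {              // assumed base class
public:
  int main() {
    // Lines 2-3: obtain the available services and gather static information
    auto compute_service = this->getAvailableComputeServices().front();   // assumed accessor
    auto storage_service = this->getAvailableStorageServices().front();   // assumed accessor
    auto job_manager = this->createJobManager();                          // assumed manager factory

    // Line 4: iterate until the workflow has completed or failed
    while (!this->getWorkflow()->isDone()) {                              // assumed accessor
      // Lines 5-7: (optionally) refresh dynamic information, then make
      // scheduling decisions; here a naive greedy policy is used
      for (auto task : this->getWorkflow()->getReadyTasks()) {
        // Line 8: enact the decision by submitting a job for this task
        auto job = job_manager->createStandardJob(task);                   // assumed
        job_manager->submitJob(job, compute_service);                      // assumed
      }
      // Line 9: wait for and react to the next workflow execution event
      // (job completion/failure, data transfer completion/failure, ...)
      this->waitForAndProcessNextEvent();                                  // assumed
    }
    return 0;
  }
};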
The WRENCH Developer API provides a rich set of methods to process and analyze the workflow and to interact with CI services to execute the workflow. These methods were designed based on the current and envisioned capabilities of state-of-the-art WMSs. We refer the reader to the WRENCH Web site [47] for more information on how to use this API and for the full API documentation. The key objective of this API is to make it straightforward to implement a complex system, namely a full-fledged WMS that interacts with diverse CI services. We achieve this objective by providing simple solutions and abstractions to handle well-known challenges when implementing a complex distributed system (whether in the real world or in simulation), as explained hereafter.

SimGrid provides simple point-to-point communication between threads of control via a mailbox abstraction. One of the recognized strengths of SimGrid is that it employs highly accurate and yet scalable network simulation models. However, unlike some of its competitors, it does not provide any higher-level simulation abstractions, meaning that distributed systems must be implemented essentially from scratch, with many message-based interactions. All message-based communication is abstracted away by WRENCH, and although the simulated CI services exchange many messages with the WMS and among themselves, the WRENCH Developer API only exposes higher-level interactions with services (“run this job”, “move this data”) and only requires that the WMS handle a few events. The WMS developer thus completely avoids the need to send and receive (and thus orchestrate) network messages.

Another challenge when developing a system like a WMS is the need to handle asynchronous interactions. While some service interactions can be synchronous (e.g., “are you up?”, “tell me your current load”), most need to be asynchronous so that the WMS retains control. The typical solution is to maintain sets of request handles and/or to use multiple threads of control. To free the WMS developer from these responsibilities, WRENCH provides already implemented “managers” that can be used out-of-the-box to take care of asynchronicity. A WMS can instantiate such managers, which are independent threads of control. Each manager transparently interacts with CI services, maintains a database of pending requests, provides a simple API to check on the status of these requests, and automatically generates workflow execution events. For instance, a WMS can instantiate a “job manager” through which it will create and submit jobs to compute services. It can at any time check on the status of a job, and the job manager interacts directly (and asynchronously) with compute services so as to generate “job done” or “job failed” events to which the WMS can react. In our experience developing simulators from scratch using SimGrid, the implementation of asynchronous interactions with simulated processes is a non-trivial development effort, both in terms of the amount of code to write and the difficulty of writing this code correctly. We posit that this is one of the reasons why some users have preferred simulation frameworks that provide higher-level abstractions than SimGrid but offer less attractive accuracy and/or scalability features. WRENCH provides such higher-level abstractions to WMS developers, and as a result implementing a WMS with WRENCH can be straightforward.
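Continuing the sketch given earlier in this section, the excerpt below illustrates this manager-based, event-driven style: the job manager handles all message exchanges with the compute service, and the WMS only reacts to the high-level event it eventually receives. The variables are assumed to be in scope from the earlier sketch, and the event class names are, again, hypothetical.

// Excerpt (inside the control loop of the earlier sketch); all names are illustrative.
auto job = job_manager->createStandardJob(ready_task);   // assumed manager API
job_manager->submitJob(job, batch_service);              // returns immediately; the manager
                                                         // orchestrates all network messages

// ... the WMS retains control here: it can submit more jobs, query services, etc. ...

auto event = this->waitForNextEvent();                   // assumed blocking call
if (auto completed = std::dynamic_pointer_cast<wrench::StandardJobCompletedEvent>(event)) {
  // react to "job done": e.g., mark the job's tasks as completed
} else if (auto failed = std::dynamic_pointer_cast<wrench::StandardJobFailedEvent>(event)) {
  // react to "job failed": e.g., inspect the failure cause and resubmit elsewhere
}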
Finally, one of the challenges when developing a WMS is failure handling. It is expected that compute, storage, and network resources, as well as the CI services that use them, can fail throughout the execution of the WMS. SimGrid has the capability to simulate arbitrary failures via availability traces. Furthermore, failures can occur due to the WMS implementation itself, e.g., if it fails to check that the operations it attempts are actually valid, or if concurrent operations initiated by the WMS work at cross purposes. WRENCH abstracts away all these failures as C++ exceptions that can be caught by the WMS implementation, or caught by a manager and passed to the WMS as workflow execution events. Regardless, each failure exposes a failure cause, which encodes a detailed description of the failure. For instance, after initiating a file copy from a storage service to another storage service, a “file copy failed” event sent to the WMS would include a failure cause that could specify that when trying to copy file x from storage service y to storage service z, storage service z did not have sufficient storage space. Other example failure causes could be that a network error occurred when storage service y attempted to receive a message from storage service z, or that service z was down. All CI services implemented in WRENCH simulate well-defined failure behaviors, and the failure handling capabilities afforded to simulated WMSs can actually allow more sophisticated failure tolerance strategies than currently done or possible in real-world implementations. But more importantly, the amount of code that needs to be written for failure handling in a simulated WMS is minimal.
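As an illustration of this failure-handling style, the sketch below catches the exception raised by a synchronous file copy and inspects its failure cause. The exception, manager, and failure-cause class names are assumptions made for illustration; the point is that the WMS receives a structured description of what went wrong rather than having to implement any error protocol.

// Hypothetical names (WorkflowExecutionException, StorageServiceNotEnoughSpace,
// NetworkError, data_movement_manager) used for illustration only.
try {
  // Ask that file x be copied from storage service y to storage service z
  data_movement_manager->doSynchronousFileCopy(file_x, storage_y, storage_z);  // assumed call
} catch (wrench::WorkflowExecutionException &e) {
  auto cause = e.getCause();   // detailed failure cause attached to the exception
  if (std::dynamic_pointer_cast<wrench::StorageServiceNotEnoughSpace>(cause)) {
    // e.g., pick another storage service and retry the copy
  } else if (std::dynamic_pointer_cast<wrench::NetworkError>(cause)) {
    // e.g., retry later, or fail the corresponding workflow task
  }
}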
Given the above, WRENCH makes it possible to implement a simulated WMS with very little code and effort. The example WMS implementation provided with the WRENCH distribution, which is simple but functional, is under 200 lines of C++ (once comments have been removed). See more discussion of the effort needed to implement a WMS with WRENCH in the context of our Pegasus case study (Section IV).

E. WRENCH User API

With the User API one can quickly build a simulator, which typically follows these steps:
1. Instantiate a platform based on a SimGrid XML platform description file;
2. Create one or more workflows;
3. Instantiate services on the platform;
4. Instantiate one or more WMSs, telling each what services are at its disposal and what workflow it should execute starting at what time;
5. Launch the simulation; and
6. Process the simulation outcome.

The above steps can be implemented with only a few lines of C++. An example WRENCH simulator is shown in Figure 2, which uses a WMS implementation (called SomeWMS) that has already been developed using the WRENCH Developer API (see previous section).
1 #include <math.h>
2 #include <wrench.h>
3 int main(int argc, char **argv) {
4 // Declare and initialize a simulation
5 wrench::Simulation simulation;
6 simulation.init(&argc, argv);
7 // Instantiate a platform
8 simulation.instantiatePlatform("my_platform.xml");
9 // Instantiate a workflow
10 wrench::Workflow workflow;
11 workflow.loadFromDAX("my_workflow.dax", "1000Gf");
12 // Instantiate a storage service
13 auto storage_service = simulation.add(
14 new wrench::SimpleStorageService("storage_host", pow(2,50)));
15 // Instantiate a compute service (a batch-scheduled 4-node cluster that uses the
16 // EASY backfilling algorithm and is subject to a background load)
17 auto batch_service = simulation.add(
18 new wrench::BatchService("batch_login", {"node1", "node2", "node3", "node4"}, pow(2,40),
19 {{wrench::BatchServiceProperty::SIMULATED_WORKLOAD_TRACE_FILE, "load.swf"},
20 {wrench::BatchServiceProperty::BATCH_SCHEDULING_ALGORITHM, "easy_bf"}}));
21 // Instantiate a compute service (a 4-host cloud platform that does not support pilot jobs)
22 auto cloud_service = simulation.add(
23 new wrench::CloudService("cloud_gateway", {"host1", "host2", "host3", "host4"}, pow(2,42),
24 {{wrench::CloudServiceProperty::SUPPORTS_PILOT_JOBS, "false"}}));
25 // Instantiate a data registry service
26 auto data_registry_service = simulation.add(new wrench::FileRegistryService("my_desktop"));
27 // Instantiate a network monitoring service
28 auto network_monitoring_service =
29 simulation.add(new wrench::NetworkProximityService(
30 "my_desktop", {"my_desktop", "batch_login", "cloud_gateway"},
31 {{wrench::NetworkProximityServiceProperty::NETWORK_PROXIMITY_SERVICE_TYPE,
32 "vivaldi"}});
33 // Stage a workflow input file at the storage service
34 simulation.stageFile(workflow.getFileByID("input_file"), storage_service);
35 // Instantiate a WMS...
36 auto wms = simulation.add(
37 new wrench::SomeWMS({batch_service, cloud_service}, {storage_service},
{network_monitoring_service}, {data_registry_service}, "my_desktop"));
38 // ... and assign the workflow to it, to be executed one hour into the simulation
39 wms->addWorkflow(&workflow, 3600);
40 // Launch the simulation
41 simulation.launch();
42 // Retrieve task completion events
43 auto trace = simulation.getOutput().getTrace<wrench::SimulationTimestampTaskCompletion>();
44 // Determine the completion time of the last task that completed
45 double completion_time = trace[trace.size()-1]->getContent()->getDate();
46 }

Fig. 2: Example fully functional WRENCH simulator. Try-catch clauses are omitted.
After initializing the simulation (lines 5-6), the simulator instantiates a platform (line 8) and a workflow (lines 10-11). A workflow is defined as a set of computation tasks and data files, with control and data dependencies between tasks. Each task can also have a priority, which can then be taken into account by a WMS for scheduling purposes. Although the workflow can be defined purely programmatically, in this example the workflow is imported from a workflow description file in the DAX format [54]. At line 13 the simulator creates a storage service with 1 PiB capacity accessible on host storage_host. This and other hostnames are specified in the XML platform description file. At line 17 the simulator creates a compute service that corresponds to a 4-node batch-scheduled cluster. The physical characteristics of the compute nodes (node[1-4]) are specified in the platform description file. This compute service has a 1 TiB scratch storage space. Its behavior is customized by passing a couple of property-value pairs to its constructor. It will be subject to a background load as defined by a trace in the standard SWF format [55], and its batch queue will be managed using the EASY Backfilling scheduling algorithm [56]. The simulator then creates a second compute service (line 22), which is a 4-host cloud service, customized so that it does not support pilot jobs. Two helper services are instantiated: a data registry service so that the WMS can keep track of file locations (line 26) and a network monitoring service that uses the Vivaldi algorithm [57] to measure network distances between the two hosts from which the compute services are accessed (batch_login and cloud_gateway) and the my_desktop host, which is the host that runs these helper services and the WMS (line 28). At line 34, the simulator specifies that the workflow data file input_file is initially available at the storage service. It then instantiates the WMS and passes to it all available services (line 36), and assigns the workflow to it (line 39). The crucial call is at line 41, where the simulation is launched and the simulator hands off control to WRENCH. When this call returns the workflow has either completed or failed. Assuming it has completed, the simulator then retrieves the ordered set of task completion events (line 43) and performs some (in this example, trivial) mining of these events (line 45).

For brevity, the example in Figure 2 omits try/catch clauses. Also, note that although the simulator uses the new operator to instantiate WRENCH objects, the simulation object takes ownership of these objects (using unique or shared pointers), so that there is no memory deallocation onus placed on the user. This example showcases only the most fundamental features of the WRENCH User API, and we refer the reader to the WRENCH Web site [47] for more detailed information on how to use this API and for the full API documentation. In the future this API will come with Python bindings so that users can implement simulators in Python.

IV. CASE STUDY: SIMULATING A PRODUCTION WMS

In this section, we present a WRENCH-based simulator of a state-of-the-art WMS, Pegasus [2], as a case study for evaluation and validation purposes.

Pegasus is being used in production to execute workflows for dozens of high-profile applications in a wide range of scientific domains [2]. Pegasus provides the necessary abstractions for scientists to create workflows and allows for transparent execution of these workflows on a range of compute platforms including clusters, clouds, and national cyberinfrastructures. During execution, Pegasus translates an abstract resource-independent workflow into an executable workflow, determining the specific executables, data, and computational resources required for the execution. Workflow execution with Pegasus includes data management, monitoring, and failure handling, and is managed by HTCondor DAGMan [58]. Individual workflow tasks are managed by a workload management framework, HTCondor [59], which supervises task executions on local and remote resources.

Fig. 3: Overview of the WRENCH Pegasus simulation components, including components for the DAGMan and HTCondor frameworks. Red boxes denote Pegasus services developed with WRENCH's Developer API, and white boxes denote reused WRENCH components.

A. Implementing Pegasus with WRENCH

Since Pegasus relies on HTCondor, we have first implemented the HTCondor services as simulated core CI services, which together form a new Compute Service that exposes the WRENCH Developer API. This makes HTCondor available to any WMS implementation that is to be simulated using WRENCH, and it will be included in the next WRENCH release as part of the growing set of simulated core CI services provided by WRENCH.

HTCondor is composed of six main service daemons (startd, starter, schedd, shadow, negotiator, and collector). In addition, each host on which one or more of these daemons is spawned must also run a master daemon, which controls the execution of all other daemons (including initialization and completion). The bottom part of Figure 3 depicts the components of our simulated HTCondor implementation, where daemons are shown in red-bordered boxes.

In our simulator we implement the 3 fundamental HTCondor services, each implemented as a particular set of daemons, as depicted in the bottom part of Figure 3 in borderless white boxes. The Job Execution Service consists of a startd daemon, which adds the host on which it is running to the HTCondor pool, and of a starter daemon, which manages task executions on this host. The Central Manager Service consists of a collector daemon, which collects information about all other daemons, and of a negotiator daemon, which performs task/resource matchmaking. The Job Submission Service consists of a schedd daemon, which maintains a queue of tasks, and of several instances of a shadow daemon, each of which corresponds to a task submitted to the Condor pool for execution.

Given the simulated HTCondor implementation above, we then implemented the simulated Pegasus WMS, including the DAGMan workflow engine, using the WRENCH Developer API. This implementation instantiates all services and parses the workflow description file, the platform description file, and a Pegasus-specific configuration file. DAGMan orchestrates the workflow execution (e.g., a task is marked as ready for execution once all its parent tasks have successfully completed), and monitors the status of tasks submitted to the HTCondor pool using a pull model, i.e., task status is fetched from the pool at regular time intervals. The top part of Figure 3 depicts the components of our simulated Pegasus implementation (each shown in a red box).

By leveraging WRENCH's high-level simulation abstractions, implementing HTCondor as a reusable core WRENCH service using the Developer API required only 613 lines of code. Similarly, implementing a simulated version of Pegasus, including DAGMan, was done with only 666 lines of code (127 of which merely parse simulation configuration files). These numbers include both header and source files, but exclude comments. We argue that the above corresponds to minor simulation software development efforts when considering the complexity of the system being simulated.

Service implementations in WRENCH are all parameterizable. For instance, as services use message-based communications, it is possible to specify all message payloads in bytes (e.g., for control messages). Other parameters encompass various overheads, either in seconds or in computation volumes (e.g., task startup overhead on a compute service). In WRENCH, all service implementations come with default values for all these parameters, but it is possible to pick custom values upon service instantiation. The process of picking parameter values so as to match a specific real-world system is referred to as simulation calibration. We calibrated our simulator by measuring delays observed in event traces of real-world executions of workflows on hardware/software infrastructures (see Section IV-B).
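As an illustration of this parameterization, a simulator can override default values when instantiating a service, in the same style as lines 17-20 of Figure 2. In the sketch below, the scheduling-algorithm property is taken from Figure 2, while the task-startup-overhead property name is a hypothetical example of the kind of overhead parameter described above; message payloads (in bytes) can be overridden in a similar fashion.

// Sketch: picking custom parameter values at service instantiation time
// (the TASK_STARTUP_OVERHEAD property name is hypothetical).
auto batch_service = simulation.add(
    new wrench::BatchService("batch_login", {"node1", "node2", "node3", "node4"}, pow(2,40),
        {{wrench::BatchServiceProperty::BATCH_SCHEDULING_ALGORITHM, "easy_bf"},
         {wrench::BatchServiceProperty::TASK_STARTUP_OVERHEAD, "5"}}));  // 5-second startup overhead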
Workflow        Platform        Avg. Makespan    Task Submissions          Task Completions
                                Error (%)        p-value      distance     p-value      distance
1000Genome      ExoGENI         1.10 ±0.28       0.06 ±0.01   0.21 ±0.04   0.72 ±0.06   0.12 ±0.01
Montage-1.5     AWS-t2.xlarge   4.25 ±1.16       0.08 ±0.01   0.16 ±0.03   0.12 ±0.05   0.21 ±0.02
Montage-2.0     AWS-m5.xlarge   3.37 ±0.46       0.11 ±0.03   0.06 ±0.02   0.10 ±0.01   0.11 ±0.01

TABLE I: Average simulated makespan error (%), and p-values and Kolmogorov-Smirnov (KS) distances for task submission and completion dates, computed for 5 runs of each of our 3 experimental scenarios.

The simulator code, details on the simulation calibration procedure, and the experimental scenarios used in the rest of this section are all publicly available online [60].

B. Experimental Scenarios

We consider experimental scenarios defined by particular workflow instances to be executed on particular platforms. Due to the lack of publicly available detailed workflow execution traces (i.e., execution logs that include data sizes for all files, all execution delays, etc.), we have performed real workflow executions with Pegasus and collected raw, time-stamped event traces from these executions. These traces form the ground truth to which we can compare simulated executions. We consider these workflow applications:

• 1000Genome [61]: A data-intensive workflow that identifies mutational overlaps using data from the 1000 genomes project in order to provide a null distribution for rigorous statistical evaluation of potential disease-related mutations. We consider a 1000Genome instance that comprises 71 tasks.
• Montage [2]: A compute-intensive astronomy workflow for generating custom mosaics of the sky. For this experiment, we ran Montage to process 1.5 and 2.0 square degree 2MASS mosaics. We thus refer to each configuration as Montage-1.5 and Montage-2.0, respectively. Montage-1.5, resp. Montage-2.0, comprises 573, resp. 1,240, tasks.

We use these platforms, deploying on each a submit node (which runs Pegasus, DAGMan, and HTCondor's job submission and central manager services), four worker nodes (4 cores per node / shared file system), and a data node in the WAN:

• ExoGENI: A widely distributed networked infrastructure-as-a-service testbed representative of a “bare metal” platform. Each worker node is a 4-core 2.0 GHz processor with 12 GiB of RAM. The bandwidth between the data node and the submit node was ∼0.40 Gbps, and the bandwidth between the submit and worker nodes was ∼1.00 Gbps.
• AWS: Amazon's cloud platform, on which we use two types of virtual machine instances: t2.xlarge and m5.xlarge. The bandwidth between the data node and the submit node was ∼0.44 Gbps, and the bandwidths between the submit and worker nodes on these instances were ∼0.74 Gbps and ∼1.24 Gbps, respectively.

C. Simulation Accuracy

To evaluate the accuracy of our simulator, we consider 3 particular experimental scenarios: 1000Genome on ExoGENI, Montage-1.5 on AWS-t2.xlarge, and Montage-2.0 on AWS-m5.xlarge. Each execution is repeated 5 times and the overall workflow execution times, or makespans, are recorded.

The third column in Table I shows average relative differences between actual and simulated makespans. We see that simulated makespans are close to actual makespans across the board (average relative error is below 5%). One of the key advantages of building WRENCH on top of SimGrid is that WRENCH simulators benefit from the high-accuracy network models in SimGrid, e.g., these models capture many features of the TCP protocol. And indeed, when comparing real-world and simulated executions we observe average relative error below 3% for data movement operations. The many processes involved in a workflow execution with Pegasus interact by exchanging (typically small) control messages. Our simulator simulates these interactions. For instance, each time an output file is produced by a task, a data registry service is contacted so that a new entry can be added to its database of file replicas, which incurs some overhead due to a message exchange. When comparing real-world to simulated executions we observe average relative simulation error below 1% for these data registration overheads.
To draw comparisons with a state-of-the-art simulator, we repeated the above simulations using WorkflowSim [35]. WorkflowSim does not provide a detailed simulated HTCondor implementation, does not offer the same simulation calibration capabilities as WRENCH, and is built on top of the CloudSim simulation framework [25]. Nevertheless, we have painstakingly calibrated our WorkflowSim simulator so that it models the hardware and software infrastructures of our experimental scenarios as closely as possible. For each of the 3 experimental scenarios, we find that the relative average makespan percentage error is 12.09 ±2.84, 26.87 ±6.26, and 13.32 ±1.12, respectively, i.e., from 4x up to 11x larger than the error values obtained with our WRENCH-based simulator. The reasons for the discrepancies between WorkflowSim and real-world results are twofold. First, WorkflowSim uses the simplistic network models in CloudSim (see discussion in Section II) and thus suffers from simulation bias w.r.t. data transfer times. Second, WorkflowSim does not capture all the relevant details of the system and its execution. By contrast, implementing a fully detailed simulator with WRENCH can be done in a few hundred lines of code.

Fig. 4: Empirical cumulative distribution function of task submit times (left) and task completion times (right) for sample real-world (“pegasus”) and simulated (“wrench” and “workflowsim”) executions of Montage-2.0 on AWS-m5.xlarge.

In our experiments we also record the submission and completion dates of each task, thus obtaining empirical cumulative density functions (ECDFs) of these times, for both real-world executions and simulated executions. To further validate the accuracy of our simulation results we apply Kolmogorov-Smirnov goodness of fit tests (KS tests) with null hypotheses (H0) that the real-world and simulation samples are drawn from the same distributions. The two-sample KS test results in a miss if the null hypothesis (two-sided alternative hypothesis) is rejected at the 5% significance level (p-value ≤ 0.05). Each test for which the null hypothesis is not rejected (p-value > 0.05) indicates that the simulated execution statistically matches the real-world execution. Table I shows the p-value and KS test distance for both task submission times and task completion times. The null hypothesis is not rejected, and we thus conclude that simulated workflow task executions statistically match real-world executions well. These conclusions are confirmed by visually comparing ECDFs. For instance, Figure 4 shows real-world and simulated ECDFs for sample runs of Montage-2.0 on AWS-m5.xlarge, with task submission, resp. completion, date ECDFs on the left-hand, resp. right-hand, side. We observe that the simulated ECDFs (“wrench”) track the real-world ECDFs (“pegasus”) closely. We repeated these simulations using WorkflowSim, and found that the null hypothesis is rejected for all 3 simulation scenarios. This is confirmed visually in Figure 4, where the ECDFs obtained from the WorkflowSim simulation (“workflowsim”) are far from the real-world ECDFs.
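For reference, the KS distance reported in Table I is the standard two-sample statistic: if $F_{1,n}$ and $F_{2,m}$ denote the empirical CDFs of the $n$ real-world and $m$ simulated dates, the distance is the largest vertical gap between the two curves,

$$ D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right| , $$

and the null hypothesis is rejected when the p-value associated with $D_{n,m}$ falls at or below the 0.05 significance level.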
Fig. 5: Task execution Gantt charts for sample real-world (“pegasus”) and simulated (“wrench”) executions of the Montage-2.0 workflow on the AWS-m5.xlarge platform.

Although KS tests and ECDF visual inspections validate that the WRENCH-simulated ECDFs match the real-world ECDFs statistically, these results do not distinguish between individual tasks. In fact, there are some discrepancies between real-world and simulated schedules. For instance, Figure 5 shows Gantt charts corresponding to the workflow executions shown in Figure 4, with the real-world execution on the left-hand side (“pegasus”) and the simulated execution on the right-hand side (“wrench”). Task executions are shown on the vertical axis, each shown as a line segment along the horizontal time axis, spanning the time between the task's start time and the task's finish time. Different task types, i.e., different executables, are shown with different colors. In this workflow, all tasks of the same type are independent and have the same priority. We see that the shapes of the yellow regions, for example, vary between the two executions. These variations are explained by implementation-dependent behaviors of the workflow scheduler. In many instances throughout workflow execution several ready tasks can be selected for execution, e.g., sets of independent tasks in the same level of the workflow. When the number of available compute resources, n, is smaller than the number of ready tasks, the scheduler picks n ready tasks for immediate execution. In most WMSs, these tasks are picked as whatever first n tasks are returned when iterating over data structures in which task objects are stored. Building a perfectly faithful simulation of a WMS would thus entail implementing/using the exact same data structures as those in the actual implementation. This could be labor intensive or perhaps not even possible depending on which data structures, languages, and/or libraries are used in that implementation. In the context of this Pegasus case study, the production implementation of the DAGMan scheduler uses a custom priority list implementation to store ready tasks, while our simulated version of it stores workflow tasks in a std::map data structure indexed by task string IDs. Consequently, when the real-world scheduler picks the first n ready tasks it typically picks different tasks than those picked by its simulated implementation. This is the cause of the discrepancies seen in Figure 5.
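A small self-contained example of this effect: iterating over a std::map keyed by task ID visits tasks in lexicographic ID order, which in general differs from the order maintained by a priority list, so the "first n" tasks picked by the two implementations differ. The task IDs below are made up for illustration.

#include <iostream>
#include <map>
#include <string>

int main() {
  // Simulated scheduler's view: ready tasks stored in a std::map indexed by task ID.
  // Iteration order is lexicographic by ID, regardless of insertion or priority order.
  std::map<std::string, int> ready_tasks = {
      {"mProject_10", 0}, {"mProject_2", 0}, {"mDiffFit_1", 0}};   // hypothetical IDs
  int n = 2;  // number of available compute resources
  for (const auto &entry : ready_tasks) {
    if (n-- == 0) break;
    std::cout << "picked " << entry.first << "\n";   // picks mDiffFit_1, then mProject_10
  }
  // A priority-list scheduler (as in the production DAGMan) could pick a different
  // first-n set, e.g., in submission or priority order.
  return 0;
}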
D. Simulation Scalability

Table II shows average simulated makespans and simulation execution times for our 3 experimental scenarios. Simulations are executed on a single core of a MacBook Pro 3.5 GHz Intel Core i7 with 16 GiB of RAM. For these scenarios, simulation times are more than 100x and up to 2500x shorter than real-world workflow executions. This is because SimGrid simulates computation and communication operations as delays computed based on computation and communication volumes using simulation models with low computational complexity.

Workflow        Platform        Avg. Workflow Makespan (s)   Avg. Simulation Time (s)
1000Genome      ExoGENI         761.0 ±7.93                  0.3 ±0.01
Montage-1.5     AWS-t2.xlarge   1,784.0 ±137.67              8.3 ±0.09
Montage-2.0     AWS-m5.xlarge   2,911.8 ±48.80               28.1 ±0.52

TABLE II: Simulated workflow makespans and simulation times averaged over 5 runs of each of our 3 experimental scenarios.

To evaluate the scalability of our simulator, we use a workflow generator [62] to generate representative randomized configurations of the Montage workflow with from 1,000 up to 10,000 tasks. We generate 5 workflow instances for each number of tasks, and simulate the execution of these generated workflow instances on 128 cores (AWS-m5.xlarge with 32 4-core nodes). Figure 6 shows simulation time (left vertical axis) and maximum resident set size (right vertical axis) vs. the number of tasks in the workflow. Each sample point is the average over the 5 workflow instances (error bars are shown as well). As expected, both simulation time and memory footprint increase as workflows become larger. The memory footprint grows linearly with the number of tasks (simply due to the need to store more task objects). The simulation time grows faster initially, but then linearly beyond 7,000 tasks. We conclude that the simulation scales well, making it possible to simulate very large 10,000-task Montage configurations in under 40 minutes on a standard laptop computer.

Fig. 6: Average simulation time (in seconds, left vertical axis) and memory usage (maximum resident set size, in MiB, right vertical axis) vs. workflow size (number of workflow tasks).

Figure 6 also includes results obtained with WorkflowSim. We find that WorkflowSim has a larger memory footprint than our WRENCH-based simulator (by a factor ∼1.48 for 10,000-task workflows). However, WorkflowSim is faster than our WRENCH-based simulator (by a factor ∼1.81 for 10,000-task workflows), with roughly similar trends. The reason why WorkflowSim is faster is that it simply does not simulate many aspects of the execution. The downside, as seen in the previous section, is that its simulation results are inaccurate.
V. CONCLUSION

In this paper we have presented WRENCH, a simulation framework for building simulators of Workflow Management Systems. WRENCH implements high-level simulation abstractions on top of the SimGrid simulation framework, so as to make it possible to build simulators that are accurate, that can run scalably on a single computer, and that can be implemented with minimal software development effort. Via a case study for the Pegasus production WMS we have demonstrated that WRENCH achieves these objectives, and that it compares favorably to a recently proposed workflow simulator. The main finding is that with WRENCH one can implement an accurate and scalable simulator of a complex real-world system with a few hundred lines of code. WRENCH is open source and welcomes contributors. WRENCH is already being used for several research and education projects, and Version 1.1 was released in August 2018. We refer the reader to http://wrench-project.org for software, documentation, and links to related projects.

A short-term development direction is to use WRENCH to simulate the execution of current production WMSs (as was done for Pegasus in Section IV). Although we have designed WRENCH with knowledge of these WMSs and with the intent of making their implementations with WRENCH feasible, we expect that WRENCH APIs and abstractions will evolve once we set out to realize these implementations. Another development direction is the implementation of more CI service abstractions in WRENCH, e.g., a Hadoop Compute Service or specific distributed cloud Storage Services. From a research perspective, a future direction is that of automated simulation calibration. As seen in our Pegasus case study, even when using validated simulation models, the values of a number of simulation parameters must be carefully chosen in order to obtain accurate simulation results. This issue is not confined to WRENCH, but is faced by all distributed system simulators. In our case study we have calibrated these parameters manually by analyzing and comparing simulated and real-world execution event traces. While, to the best of our knowledge, this is the typical practice, what is truly needed is an automated calibration method. Ideally, this method would process a (small) number of (not too large) real-world execution traces for “training scenarios”, and compute a valid and robust set of calibration parameter values. An important research question will then be to understand to which extent these automatically computed calibrations can be composed and extrapolated to scenarios beyond the training scenarios.

Acknowledgments. This work is funded by NSF contracts #1642369 and #1642335, “SI2-SSE: WRENCH: A Simulation Workbench for Scientific Workflow Users, Developers, and Researchers”, and by CNRS under grant #PICS07239. We thank Martin Quinson, Arnaud Legrand, and Pierre-François Dutot for their valuable help.
REFERENCES
[1] I. J. Taylor, E. Deelman, D. B. Gannon, and M. Shields, Workflows for e-Science: scientific workflows for grids. Springer Publishing Company, Incorporated, 2014.
[2] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, and K. Wenger, "Pegasus: a Workflow Management System for Science Automation," Future Generation Computer Systems, vol. 46, pp. 17–35, 2015.
[3] T. Fahringer, R. Prodan, R. Duan, J. Hofer, F. Nadeem, F. Nerieri, S. Podlipnig, J. Qin, M. Siddiqui, H.-L. Truong et al., "Askalon: A development and grid computing environment for scientific workflows," in Workflows for e-Science. Springer, 2007, pp. 450–471.
[4] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster, "Swift: A language for distributed parallel scripting," Parallel Computing, vol. 37, no. 9, pp. 633–652, 2011.
[5] K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers, S. Owen, S. Soiland-Reyes, I. Dunlop, A. Nenadic, P. Fisher et al., "The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud," Nucleic Acids Research, p. gkt328, 2013.
[6] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock, "Kepler: an extensible system for design and execution of scientific workflows," in Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on. IEEE, 2004, pp. 423–424.
[7] M. Albrecht, P. Donnelly, P. Bui, and D. Thain, "Makeflow: A portable abstraction for data intensive computing on clusters, clouds, and grids," in 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies. ACM, 2012, p. 1.
[8] N. Vydyanathan, U. V. Catalyurek, T. M. Kurc, P. Sadayappan, and J. H. Saltz, "Toward optimizing latency under throughput constraints for application workflows on clusters," in Euro-Par 2007 Parallel Processing. Springer, 2007, pp. 173–183.
[9] A. Benoit, V. Rehn-Sonigo, and Y. Robert, "Optimizing latency and reliability of pipeline workflow applications," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. IEEE, 2008, pp. 1–10.
[10] Y. Gu and Q. Wu, "Maximizing workflow throughput for streaming applications in distributed environments," in Computer Communications and Networks (ICCCN), 2010 Proceedings of 19th International Conference on. IEEE, 2010, pp. 1–6.
[11] M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski, "Algorithms for cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds," Future Generation Computer Systems, vol. 48, pp. 1–18, 2015.
[12] J. Chen and Y. Yang, "Temporal dependency-based checkpoint selection for dynamic verification of temporal constraints in scientific workflow systems," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 20, no. 3, p. 9, 2011.
[13] G. Kandaswamy, A. Mandal, D. Reed et al., "Fault tolerance and recovery of scientific workflows on computational grids," in Cluster Computing and the Grid, 2008. CCGRID'08. 8th IEEE International Symposium on. IEEE, 2008, pp. 777–782.
[14] R. Ferreira da Silva, T. Glatard, and F. Desprez, "Self-healing of workflow activity incidents on distributed computing infrastructures," Future Generation Computer Systems, vol. 29, no. 8, pp. 2284–2294, 2013.
[15] W. Chen, R. Ferreira da Silva, E. Deelman, and T. Fahringer, "Dynamic and fault-tolerant clustering for scientific workflows," IEEE Transactions on Cloud Computing, vol. 4, no. 1, pp. 49–62, 2016.
[16] H. M. Fard, R. Prodan, J. J. D. Barrionuevo, and T. Fahringer, "A multi-objective approach for workflow scheduling in heterogeneous environments," in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012). IEEE Computer Society, 2012, pp. 300–309.
[17] I. Pietri, M. Malawski, G. Juve, E. Deelman, J. Nabrzyski, and R. Sakellariou, "Energy-constrained provisioning for scientific workflow ensembles," in Cloud and Green Computing (CGC), 2013 Third International Conference on. IEEE, 2013, pp. 34–41.
[18] M. Tikir, M. Laurenzano, L. Carrington, and A. Snavely, "PSINS: An Open Source Event Tracer and Execution Simulator for MPI Applications," in Proc. of the 15th Intl. Euro-Par Conf. on Parallel Processing, ser. LNCS, no. 5704. Springer, Aug. 2009, pp. 135–148.
[19] T. Hoefler, T. Schneider, and A. Lumsdaine, "LogGOPSim - Simulating Large-Scale Applications in the LogGOPS Model," in Proc. of the ACM Workshop on Large-Scale System and Application Performance, Jun. 2010, pp. 597–604.
[20] G. Zheng, G. Kakulapati, and L. Kalé, "BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines," in Proc. of the 18th Intl. Parallel and Distributed Processing Symposium (IPDPS), Apr. 2004.
[21] R. Bagrodia, E. Deelman, and T. Phan, "Parallel Simulation of Large-Scale Parallel Applications," IJHPCA, vol. 15, no. 1, pp. 3–12, 2001.
[22] W. H. Bell, D. G. Cameron, A. P. Millar, L. Capozza, K. Stockinger, and F. Zini, "OptorSim - A Grid Simulator for Studying Dynamic Data Replication Strategies," IJHPCA, vol. 17, no. 4, pp. 403–416, 2003.
[23] R. Buyya and M. Murshed, "GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing," Concurrency and Computation: Practice and Experience, vol. 14, no. 13-15, pp. 1175–1220, Dec. 2002.
[24] S. Ostermann, R. Prodan, and T. Fahringer, "Dynamic Cloud Provisioning for Scientific Grid Workflows," in Proc. of the 11th ACM/IEEE Intl. Conf. on Grid Computing (Grid), 2010, pp. 97–104.
[25] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. De Rose, and R. Buyya, "CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms," Software: Practice and Experience, vol. 41, no. 1, pp. 23–50, Jan. 2011.
[26] A. Núñez, J. Vázquez-Poletti, A. Caminero, J. Carretero, and I. M. Llorente, "Design of a New Cloud Computing Simulation Platform," in Proc. of the 11th Intl. Conf. on Computational Science and its Applications, June 2011, pp. 582–593.
[27] G. Kecskemeti, "DISSECT-CF: A simulator to foster energy-aware scheduling in infrastructure clouds," Simulation Modelling Practice and Theory, vol. 58, no. 2, pp. 188–218, 2015.
[28] A. Montresor and M. Jelasity, "PeerSim: A Scalable P2P Simulator," in Proc. of the 9th Intl. Conf. on Peer-to-Peer, Sep. 2009, pp. 99–100.
[29] I. Baumgart, B. Heep, and S. Krause, "OverSim: A Flexible Overlay Network Simulation Framework," in Proc. of the 10th IEEE Global Internet Symposium. IEEE, May 2007, pp. 79–84.
[30] M. Taufer, A. Kerstens, T. Estrada, D. Flores, and P. J. Teller, "SimBA: A Discrete Event Simulator for Performance Prediction of Volunteer Computing Projects," in Proc. of the 21st Intl. Workshop on Principles of Advanced and Distributed Simulation, 2007, pp. 189–197.
[31] T. Estrada, M. Taufer, K. Reed, and D. P. Anderson, "EmBOINC: An Emulator for Performance Analysis of BOINC Projects," in Proc. of the Workshop on Large-Scale and Volatile Desktop Grids (PCGrid), 2009.
[32] D. Kondo, "SimBOINC: A Simulator for Desktop Grids and Volunteer Computing Systems," Available at http://simboinc.gforge.inria.fr/, 2007.
[33] H. Casanova, A. Giersch, A. Legrand, M. Quinson, and F. Suter, "Versatile, Scalable, and Accurate Simulation of Distributed Applications and Platforms," Journal of Parallel and Distributed Computing, vol. 74, no. 10, pp. 2899–2917, 2014.
[34] C. D. Carothers, D. Bauer, and S. Pearce, "ROSS: A High-Performance, Low Memory, Modular Time Warp System," in Proc. of the 14th ACM/IEEE/SCS Workshop on Parallel and Distributed Simulation, 2000, pp. 53–60.
[35] W. Chen and E. Deelman, "WorkflowSim: A Toolkit for Simulating Scientific Workflows in Distributed Environments," in Proc. of the 8th IEEE Intl. Conf. on E-Science, 2012, pp. 1–8.
[36] A. Hirales-Carbajal, A. Tchernykh, T. Röblitz, and R. Yahyapour, "A Grid simulation framework to study advance scheduling strategies for complex workflow applications," in Proc. of IEEE Intl. Symp. on Parallel Distributed Processing Workshops (IPDPSW), 2010.
[37] M.-H. Tsai, K.-C. Lai, H.-Y. Chang, K. Fu Chen, and K.-C. Huang, "Pewss: A platform of extensible workflow simulation service for workflow scheduling research," Software: Practice and Experience, vol. 48, no. 4, pp. 796–819, 2017.
[38] S. Ostermann, K. Plankensteiner, D. Bodner, G. Kraler, and R. Prodan, "Integration of an Event-Based Simulation Framework into a Scientific Workflow Execution Environment for Grids and Clouds," in Proc. of the 4th ServiceWave European Conference, 2011, pp. 1–13.
[39] G. Kecskemeti, S. Ostermann, and R. Prodan, "Fostering Energy-Awareness in Simulations Behind Scientific Workflow Management Systems," in Proc. of the 7th IEEE/ACM Intl. Conf. on Utility and Cloud Computing, 2014, pp. 29–38.
[40] J. Cao, S. Jarvis, S. Saini, and G. Nudd, "GridFlow: Workflow Management for Grid Computing," in Proc. of the 3rd IEEE/ACM Intl. Symp. on Cluster Computing and the Grid (CCGrid), 2003, pp. 198–205.
[41] "The SimGrid Project," Available at http://simgrid.org/, 2018.
[42] P. Bedaride, A. Degomme, S. Genaud, A. Legrand, G. Markomanolis, M. Quinson, M. Stillwell, F. Suter, and B. Videau, "Toward Better Simulation of MPI Applications on Ethernet/TCP Networks," in Proc. of the 4th Intl. Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, 2013.
[43] P. Velho, L. Mello Schnorr, H. Casanova, and A. Legrand, "On the Validity of Flow-level TCP Network Models for Grid and Cloud Simulations," ACM Transactions on Modeling and Computer Simulation, vol. 23, no. 4, 2013.
[44] P. Velho and A. Legrand, "Accuracy Study and Improvement of Network Simulation in the SimGrid Framework," in Proc. of the 2nd Intl. Conf. on Simulation Tools and Techniques, 2009.
[45] K. Fujiwara and H. Casanova, "Speed and Accuracy of Network Simulation in the SimGrid Framework," in Proc. of the 1st Intl. Workshop on Network Simulation Tools, 2007.
[46] A. Lèbre, A. Legrand, F. Suter, and P. Veyre, "Adding Storage Simulation Capacities to the SimGrid Toolkit: Concepts, Models, and API," in Proc. of the 8th IEEE Intl. Symp. on Cluster Computing and the Grid, 2015.
[47] "The WRENCH Project," http://wrench-project.org, 2018.
[48] "The ns-3 Network Simulator," Available at http://www.nsnam.org.
[49] E. León, R. Riesen, A. Maccabe, and P. Bridges, "Instruction-Level Simulation of a Cluster at Scale," in Proc. of the Intl. Conf. for High Performance Computing and Communications (SC), Nov. 2009.
[50] R. Fujimoto, "Parallel Discrete Event Simulation," Commun. ACM, vol. 33, no. 10, pp. 30–53, 1990.
[51] R. Mathá, S. Ristov, and R. Prodan, "Simulation of a workflow execution as a real Cloud by adding noise," Simulation Modelling Practice and Theory, vol. 79, pp. 37–53, 2017.
[52] L. Bobelin, A. Legrand, D. A. G. Márquez, P. Navarro, M. Quinson, F. Suter, and C. Thiery, "Scalable Multi-Purpose Network Representation for Large Scale Distributed System Simulation," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, May 2012, pp. 220–227.
[53] M. Turilli, M. Santcroos, and S. Jha, "A Comprehensive Perspective on Pilot-Job Systems," ACM Comput. Surv., vol. 51, no. 2, pp. 43:1–43:32, 2018.
[54] "Pegasus' DAX Workflow Description Format," https://pegasus.isi.edu/documentation/creating_workflows.php, 2018.
[55] "The Standard Workload Format," http://www.cs.huji.ac.il/labs/parallel/workload/swf.html, 2018.
[56] D. Lifka, "The ANL/IBM SP Scheduling System," in Proc. of the 1st Workshop on Job Scheduling Strategies for Parallel Processing, LNCS, vol. 949, 1995, pp. 295–303.
[57] F. Dabek, R. Cox, F. Kaashoek, and R. Morris, "Vivaldi: A Decentralized Network Coordinate System," in Proc. of SIGCOMM, 2004.
[58] J. Frey, "Condor DAGMan: Handling inter-job dependencies," University of Wisconsin, Dept. of Computer Science, Tech. Rep., 2002.
[59] D. Thain, T. Tannenbaum, and M. Livny, "Distributed computing in practice: the Condor experience," Concurrency and Computation: Practice and Experience, vol. 17, no. 2-4, pp. 323–356, 2005.
[60] "The WRENCH Pegasus Simulator," https://github.com/wrench-project/pegasus, 2018.
[61] R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, and M. Atkinson, "Using simple PID controllers to prevent and mitigate faults in scientific workflows," in 11th Workflows in Support of Large-Scale Science, ser. WORKS'16, 2016, pp. 15–24.
[62] R. Ferreira da Silva, W. Chen, G. Juve, K. Vahi, and E. Deelman, "Community resources for enabling and evaluating research on scientific workflows," in 10th IEEE International Conference on e-Science, ser. eScience'14, 2014, pp. 177–184.