RRL 10
Abstract—Scientific workflows are used routinely in numerous scientific domains, and Workflow Management Systems (WMSs) have been developed to orchestrate and optimize workflow executions on distributed platforms. WMSs are complex software systems that interact with complex software infrastructures. Most WMS research and development activities rely on empirical experiments conducted with full-fledged software stacks on actual hardware platforms. Such experiments, however, are limited to hardware and software infrastructures at hand and can be labor- and/or time-intensive. As a result, relying solely on real-world experiments impedes WMS research and development. An alternative is to conduct experiments in simulation.
In this work we present WRENCH, a WMS simulation framework, whose objectives are (i) accurate and scalable simulations; and (ii) easy simulation software development. WRENCH achieves its first objective by building on the SimGrid framework. While SimGrid is recognized for the accuracy and scalability of its simulation models, it only provides low-level simulation abstractions and thus large software development efforts are required when implementing simulators of complex systems. WRENCH thus achieves its second objective by providing high-level and directly re-usable simulation abstractions on top of SimGrid. After describing and giving rationales for WRENCH’s software architecture and APIs, we present a case study in which we apply WRENCH to simulate the Pegasus production WMS. We report on ease of implementation, simulation accuracy, and simulation scalability so as to determine to what extent WRENCH achieves its two objectives. We also draw both qualitative and quantitative comparisons with a previously proposed workflow simulator.
Index Terms—Scientific Workflows, Workflow Management Systems, Simulation, Distributed Computing
I. INTRODUCTION
Scientific workflows have become mainstream in support of research and development activities in numerous scientific domains [1]. Consequently, several Workflow Management Systems (WMSs) have been developed [2]–[7] that allow scientists to execute workflows on distributed platforms that can accommodate executions at various scales. WMSs handle the logistics of workflow executions and make decisions regarding resource selection, data management, and computation scheduling, the goal being to optimize some performance metric (e.g., latency [8], [9], throughput [10], [11], jitter [12], reliability [13]–[15], power consumption [16], [17]). WMSs are complex software systems that interact with complex software infrastructures and can thus employ a wide range of designs and algorithms.
In spite of active WMS development and use in production, which has entailed solving engineering challenges, fundamental questions remain unanswered in terms of system designs and algorithms. Although there are theoretical underpinnings for most of these questions, theoretical results often make assumptions that do not hold with production hardware and software infrastructures. Further, the specifics of the design of a WMS can impose particular constraints on what solutions can be implemented effectively, and these constraints are typically not considered in available theoretical results. Consequently, current research that aims at improving and evolving the state of the art, although sometimes informed by theory, is mostly done via “real-world” experiments: designs and algorithms are implemented, evaluated, and selected based on experiments conducted for a particular WMS implementation with particular workflow configurations on particular platforms. As a corollary, from the WMS user’s perspective, quantifying accurately how a WMS would perform for a particular workflow configuration on a particular platform entails actually executing that workflow on that platform.
Unfortunately, real-world experiments have limited scope, which impedes WMS research and development. This is because they are confined to application and platform configurations available at hand, and thus cover only a small subset of the relevant scenarios that may be encountered in practice. Furthermore, exclusively relying on real-world experiments makes it difficult or even impossible to investigate hypothetical scenarios (e.g., “What if the network had a different topology?”, “What if there were 10 times more compute nodes but they had half as many cores?”). Real-world experiments, especially when large-scale, are often not fully reproducible due to shared networks and compute resources, and due to transient or idiosyncratic behaviors (maintenance schedules, software upgrades, and particular software (mis)configurations). Running real-world experiments is also time-consuming, thus possibly making it difficult to obtain statistically significant numbers of experimental results. Real-world experiments are driven by WMS implementations that often impose constraints on workflow executions. Furthermore, WMSs are typically not monolithic but instead reuse CyberInfrastructure (CI) components that impose their own overheads and constraints on workflow execution. Exploring what lies beyond these constraints via real-world executions,
e.g., for research and development purposes, typically entails unacceptable software (re-)engineering costs. Finally, running real-world experiments can also be labor-intensive. This is due to the need to install and execute many full-featured software stacks, including actual scientific workflow implementations, which is often not deemed worthwhile for “just testing out” ideas.
An alternative to conducting WMS research via real-world experiments is to use simulation, i.e., implement a software artifact that models the functional and performance behaviors of software and hardware stacks of interest. Simulation is used in many computer science domains and can address the limitations of real-world experiments outlined above. Several simulation frameworks have been developed that target the parallel and distributed computing domain [18]–[34]. Some simulation frameworks have also been developed specifically for the scientific workflow domain [11], [35]–[40].
We claim that advances in simulation capabilities in the field have made it possible to simulate WMSs that execute large workflows on large-scale platforms accessible via diverse CI services in a way that is accurate (via validated simulation models), scalable (fast execution and low memory footprint), and expressive (ability to describe arbitrary platforms, complex WMSs, and complex software infrastructure). In this work, we build on the existing open-source SimGrid simulation framework [33], [41], which has been one of the drivers of the above advances and whose simulation models have been extensively validated [42]–[46], to develop a WMS simulation framework called WRENCH [47]. More specifically, this work makes the following contributions:
1) We justify the need for WRENCH and explain how it improves on the state of the art.
2) We describe the high-level simulation abstractions provided by WRENCH that (i) make it straightforward to implement full-fledged simulated versions of complex WMSs; and (ii) make it possible to instantiate simulation scenarios with only a few lines of code.
3) Via a case study with the Pegasus [2] production WMS, we evaluate the ease-of-use, accuracy, and scalability of WRENCH, and compare it with a previously proposed simulator, WorkflowSim [35].
This paper is organized as follows. Section II discusses related work. Section III outlines the design of WRENCH and describes how its APIs are used to implement simulators. Section IV presents our case study. Finally, Section V concludes with a brief summary of results and a discussion of future research directions.
II. RELATED WORK
Many simulation frameworks have been developed for parallel and distributed computing research and development. They span domains such as HPC [18]–[21], Grid [22]–[24], Cloud [25]–[27], Peer-to-peer [28], [29], or Volunteer Computing [30]–[32]. Some frameworks have striven to be applicable across some or all of the above domains [33], [34]. Two conflicting concerns are accuracy (the ability to capture the behavior of a real-world system with as little bias as possible) and scalability (the ability to simulate large systems with as few CPU cycles and bytes of RAM as possible). The aforementioned simulation frameworks achieve different compromises between these two concerns by using various simulation models. At one extreme are discrete event models that simulate the “microscopic” behavior of hardware/software systems (e.g., by relying on packet-level network simulation for communication [48], on cycle-accurate CPU simulation [49] or emulation for computation). In this case, the scalability challenge can be handled by using Parallel Discrete Event Simulation [50], i.e., the simulation itself is a parallel application that requires a parallel platform whose scale is at least commensurate with that of the simulated platform. At the other extreme are analytical models that capture “macroscopic” behaviors (e.g., transfer times as data sizes divided by bottleneck bandwidths, compute times as numbers of operations divided by compute speeds). While these models are typically more scalable, they must be developed with care so that they are accurate. In previous work, it has been shown that several available simulation frameworks use macroscopic models that can exhibit high inaccuracy [43].
A number of simulators have been developed that target scientific workflows. Some of them are stand-alone simulators [11], [35]–[37]. Others are integrated with a particular WMS to promote more faithful simulation and code reuse [38], [39] or to execute simulations at runtime to guide on-line scheduling decisions made by the WMS [40].
The authors in [39] conduct a critical analysis of the state-of-the-art of workflow simulators. They observe that many of these simulators do not capture the details of underlying infrastructures and/or use naive simulation models. This is the case with custom simulators such as those in [36], [37], [40]. But it is also the case with workflow simulators built on top of generic simulation frameworks that provide convenient user-level abstractions but fail to model the details of the underlying infrastructure, e.g., the simulators in [11], [35], [38], which build on the CloudSim [25] or GroudSim [24] frameworks. These frameworks have been shown to be lacking in their network modeling capabilities [43]. As a result, some authors readily recognize that their simulators are likely only valid when network effects play a small role in workflow executions (i.e., when workflows are not data-intensive).
To overcome the above limitations, in [39] the authors have improved the network model in GroudSim and also use a separate simulator, DISSECT-CF [27], for simulating cloud infrastructures accurately. Both [39] and [27] acknowledge that the popular SimGrid [33], [41] simulation framework offers compelling capabilities, both in terms of scalability and simulation accuracy. But one of their reasons for not considering SimGrid is that, because it is low-level, using it to implement a simulator of a complex system, such as a WMS and the CI services it uses, would be too labor-intensive. In this work, we address this issue by developing a simulation framework that provides convenient, reusable, high-level abstractions but that builds on SimGrid so as to
benefit from its scalable and accurate simulation models.
Furthermore, unlike [38], [39], we do not focus on integration
with any specific WMS. The argument in [39] is that stand-
alone simulators, such as that in [35], are disconnected from
real-world WMSs because they abstract away much of the
complexity of these systems. In contrast, our proposed framework does capture low-level system details (and simulates them well thanks to SimGrid), but provides abstractions that are sufficiently high-level to implement faithful simulations of complex WMSs with minimal effort, which we demonstrate via a case study with the Pegasus WMS [2].
Also related to this work is previous research that has not
focused on providing simulators or simulation frameworks per
se, but instead on WMS simulation methodology. In particular,
several authors have investigated methods for injecting realistic
stochastic noise in simulated WMS executions [35], [51].
These techniques can be adopted by most of the aforemen-
tioned frameworks, including the one proposed in this work.
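To make the “macroscopic” analytical models discussed earlier in this section concrete, the following is a minimal C++ sketch of the two estimates named there: transfer time as data size divided by bottleneck bandwidth, and compute time as number of operations divided by compute speed. The `Link` type and both function names are hypothetical and introduced only for illustration; they are not part of SimGrid, WRENCH, or any simulator cited above, and the sketch deliberately ignores latency and contention, which is one reason such naive models can be inaccurate.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one network link on a route
// (illustration only; not a SimGrid or WRENCH type).
struct Link {
    double bandwidth_bytes_per_sec;
};

// Macroscopic transfer-time estimate: data size divided by the bottleneck
// (smallest) bandwidth along the route. Latency and contention are ignored.
double transfer_time(double bytes, const std::vector<Link>& route) {
    assert(!route.empty());
    double bottleneck = route.front().bandwidth_bytes_per_sec;
    for (const Link& link : route) {
        bottleneck = std::min(bottleneck, link.bandwidth_bytes_per_sec);
    }
    return bytes / bottleneck;
}

// Macroscopic compute-time estimate: number of operations divided by
// compute speed (operations per second).
double compute_time(double operations, double operations_per_sec) {
    assert(operations_per_sec > 0.0);
    return operations / operations_per_sec;
}
```

For example, under these assumptions a 500 MB input sent over a route whose slowest link sustains 125 MB/s is estimated at 4 s, and a task of 10^12 operations on a 10^11 operations/s core at 10 s, regardless of how many other transfers share the link.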
III. WRENCH
A. Objective and Intended Users
WRENCH’s objective is to make it possible to study WMSs in simulation in a way that is accurate (faithful modeling of real-world executions), scalable (low computation and memory footprints on a single computer), and expressive (ability to simulate arbitrary WMS, workflow, and platform scenarios with minimal software engineering effort). WRENCH is not a simulator but a simulation framework that is distributed as a C++ library. It provides high-level reusable abstractions for developing simulated WMS implementations and simulators for the execution of these implementations. There are two categories of WRENCH users:
1. Users who implement simulated WMSs – These users are engaged in WMS research and development activities and need an “in simulation” version of their current or intended WMS. Their goals typically include evaluating how their WMS behaves over hypothetical experimental scenarios and comparing competing algorithm and system design options. For these users, WRENCH provides the WRENCH Developer API (described in Section III-D) that eases WMS development by removing the typical difficulties involved when developing, either in real-world or in simulation mode, a system composed of distributed components that interact both synchronously and asynchronously. To this end, WRENCH makes it possible to implement a WMS as a single thread of control that interacts with simulated CI services via high-level APIs and must react to a small set of asynchronous events.
2. Users who execute simulated WMSs – These users simulate how given WMSs behave for particular workflows on particular platforms. Their goals include comparing different WMSs, determining how a given WMS would behave for various workflow configurations, comparing different platform and resource provisioning options, determining performance bottlenecks, engaging in pedagogic activities centered on distributed computing and workflow issues, etc. These users can develop simulators via the WRENCH User API (described in Section III-E), which makes it possible to build a full-fledged simulator with only a few lines of code.
Users in the first category above often also belong to the second category. That is, after implementing a simulated WMS, these users typically instantiate simulators for several experimental scenarios to evaluate their WMS.
Fig. 1: The four layers in the WRENCH architecture from bottom to top: simulation core, simulated core services, simulated WMS implementations, and simulators.
B. Software Architecture Overview
Figure 1 depicts WRENCH’s software architecture. At the bottom layer is the Simulation Core, which simulates low-level software and hardware stacks using the simulation abstractions and models provided by SimGrid (see Section III-C). The next layer implements simulated CI services that are commonly found in current distributed platforms and used by production WMSs. At the time of this writing, WRENCH provides services in four categories: compute services that provide access to compute resources to execute workflow tasks; storage services that provide access to storage resources for storing workflow data; network monitoring services that can be queried to determine network distances; and data registry services that can be used to track the location of (replicas of) workflow data. Each category includes multiple service implementations, so as to capture specifics of currently available CI services used in production. For instance, in its current version WRENCH provides a “batch-scheduled cluster” compute service, a “cloud” compute service, and a “bare-metal” compute service. The layer above in the software architecture consists of simulated
WMSs that interact with CI services using the WRENCH Developer API (see Section III-D). These WMS implementations, which can simulate production WMSs or WMS research prototypes, are not included as part of the WRENCH distribution, but implemented as stand-alone projects. One such project is the simulated Pegasus implementation for our case study in Section IV. Finally, the top layer consists of simulators that configure and instantiate particular CI services and particular WMSs on a given simulated hardware platform, that launch the simulation, and that analyze the simulation outcome. These simulators use the WRENCH User API (see Section III-E). Here again, these simulators are not part of WRENCH, but implemented as stand-alone projects.
C. Simulation Core
WRENCH’s simulation core is implemented using SimGrid’s S4U API, which provides all necessary abstractions and models to simulate computation, I/O, and communication activities on arbitrary hardware platform configurations. These platform configurations are defined by XML files that specify network topologies and endpoints, compute resources, and storage resources [52].
At its most fundamental level, SimGrid provides a Concurrent Sequential Processes (CSP) model: a simulation consists of sequential threads of control that consume hardware resources. These threads of control can implement arbitrary code, exchange messages via a simulated network, perform computation on simulated (multicore) hosts, and perform I/O on simulated storage devices. In addition, SimGrid provides a virtual machine abstraction that includes a migration feature. Therefore, SimGrid provides all the base abstractions necessary to implement the classes of distributed systems that are relevant to scientific workflow executions. However, these abstractions are low-level and a common criticism of SimGrid is that implementing a simulation of a complex system requires a large software engineering effort. A WMS executing a workflow using several CI services is a complex system, and WRENCH builds on top of SimGrid to provide high-level abstractions so that implementing this complex system is not labor-intensive.
We have selected SimGrid for WRENCH for the following reasons. SimGrid has been used successfully in many distributed computing domains (cluster, peer-to-peer, grid, cloud, volunteer computing, etc.), and thus can be used to simulate WMSs that execute over a wide range of platforms. SimGrid is open source and freely available, has been stable for many years, is actively developed, has a sizable user community, and has provided simulation results for over 350 research publications since its inception. SimGrid has also been the object of many invalidation and validation studies [42]–[46], and its simulation models have been shown to provide compelling advantages over other simulation frameworks in terms of both accuracy and scalability [33]. Finally, most SimGrid simulations can be executed in minutes on a standard laptop computer, making it possible to perform large numbers of simulations quickly with minimal compute resource expenses. To the best of our knowledge, among comparable available simulation frameworks (as reviewed in Section II), SimGrid is the only one to offer all the above desirable characteristics.
D. WRENCH Developer API
With the Developer API, a WMS is implemented as a single thread of control that executes according to the pseudo-code blueprint shown in Algorithm 1. Given a workflow to execute, a WMS first gathers information about all the CI services it can use to execute the workflow (lines 2-3). Examples of such information include the number of compute nodes provided by a compute service, the number of cores per node and the speed of these cores, the amount of storage space available in a storage service, the list of hosts monitored by a network monitoring service, etc. Then, the WMS iterates until the workflow execution is complete or has failed (line 4). At each iteration it gathers dynamic information about available services and resources if needed (line 5). Examples of such information include currently available capacities at compute or storage services, current network distances between pairs of hosts, etc. Then, if desired, the WMS can submit pilot jobs [53] to compute services that support them, if any (line 6). Based on resource information and on the current state of the workflow, the WMS can then make whatever scheduling decisions it sees fit (line 7). It then enacts these decisions by interacting with appropriate services (line 8). For instance, it could decide to submit a “job” to a compute service to execute a ready task on some number of cores at some compute service and copy all produced files to some storage service, or it could decide to just copy a file between storage services and then update a data location service to keep track of the location of this new file replica. It is the responsibility of the developer to implement all decision-making algorithms employed by the WMS. At the end of the iteration, the WMS simply waits for a workflow execution event to which it can react if need be (line 9). The most common events are job completions/failures and data transfer completions/failures.
Algorithm 1 Blueprint for a WMS execution
 1: procedure MAIN(workflow)
 2:   Obtain list of available services
 3:   Gather static information about the services
 4:   while workflow execution has not completed/failed do
 5:     Gather dynamic service/resource information
 6:     Submit pilot jobs if needed
 7:     Make data/computation scheduling decisions
 8:     Interact with services to enact decisions
 9:     Wait for and react to the next event
10:   end while
11:   return
12: end procedure
The WRENCH Developer API provides a rich set of methods to analyze the workflow and to interact with CI services to execute the workflow. These methods were designed based on current and envisioned capabilities of current state-of-the-art WMSs. We refer the reader to the WRENCH Web site [47] for more information on how to use this API
and for the full API documentation. The key objective of this API is to make it straightforward to implement a complex system, namely a full-fledged WMS that interacts with diverse CI services. We achieve this objective by providing simple solutions and abstractions to handle well-known challenges when implementing a complex distributed system (whether in the real world or in simulation), as explained hereafter.
SimGrid provides simple point-to-point communication between threads of control via a mailbox abstraction. One of the recognized strengths of SimGrid is that it employs highly accurate and yet scalable network simulation models. However, unlike some of its competitors, it does not provide any higher-level simulation abstractions, meaning that distributed systems must be implemented essentially from scratch, with many message-based interactions. All message-based communication is abstracted away by WRENCH, and although the simulated CI services exchange many messages with the WMS and among themselves, the WRENCH Developer API only exposes higher-level interaction with services (“run this job”, “move this data”) and only requires that the WMS handle a few events. The WMS developer thus completely avoids the need to send and receive (and thus orchestrate) network messages.
Another challenge when developing a system like a WMS is the need to handle asynchronous interactions. While some service interactions can be synchronous (e.g., “are you up?”, “tell me your current load”), most need to be asynchronous so that the WMS retains control. The typical solution is to maintain sets of request handles and/or to use multiple threads of control. To free the WMS developer from these responsibilities, WRENCH provides already implemented “managers” that can be used out-of-the-box to take care of asynchronicity. A WMS can instantiate such managers, which are independent threads of control. Each manager transparently interacts with CI services, maintains a database of pending requests, provides a simple API to check on the status of these requests, and automatically generates workflow execution events. For instance, a WMS can instantiate a “job manager” through which it will create and submit jobs to compute services. It can at any time check on the status of a job, and the job manager interacts directly (and asynchronously) with compute services so as to generate “job done” or “job failed” events to which the WMS can react. In our experience developing simulators from scratch using SimGrid, the implementation of asynchronous interactions with simulated processes is a non-trivial development effort, both in terms of the amount of code to write and the difficulty of writing this code correctly. We posit that this is one of the reasons why some users have preferred using simulation frameworks that provide higher-level abstractions than SimGrid but offer less attractive accuracy and/or scalability features. WRENCH provides such higher-level abstractions to WMS developers, and as a result implementing a WMS with WRENCH can be straightforward.
Finally, one of the challenges when developing a WMS is failure handling. It is expected that compute, storage, and network resources, as well as the CI services that use them, can fail throughout the execution of the WMS. SimGrid has the capability to simulate arbitrary failures via availability traces. Furthermore, failures can occur due to the WMS implementation itself, e.g., if it fails to check that the operations it attempts are actually valid, or if concurrent operations initiated by the WMS work at cross purposes. WRENCH abstracts away all these failures as C++ exceptions that can be caught by the WMS implementation, or caught by a manager and passed to the WMS as workflow execution events. Regardless, each failure exposes a failure cause, which encodes a detailed description of the failure. For instance, after initiating a file copy from a storage service to another storage service, a “file copy failed” event sent to the WMS would include a failure cause that could specify that when trying to copy file x from storage service y to storage service z, storage service z did not have sufficient storage space. Other example failure causes could be that a network error occurred when storage service y attempted to receive a message from storage service z, or that service z was down. All CI services implemented in WRENCH simulate well-defined failure behaviors, and the failure handling capabilities afforded to simulated WMSs can actually allow more sophisticated failure tolerance strategies than those currently implemented or possible in real-world implementations. But more importantly, the amount of code that needs to be written for failure handling in a simulated WMS is minimal.
Given the above, WRENCH makes it possible to implement a simulated WMS with very little code and effort. The example WMS implementation provided with the WRENCH distribution, which is simple but functional, is under 200 lines of C++ (once comments have been removed). See more discussion of the effort needed to implement a WMS with WRENCH in the context of our Pegasus case study (Section IV).
E. WRENCH User API
With the User API one can quickly build a simulator, which typically follows these steps:
1. Instantiate a platform based on a SimGrid XML platform description file;
2. Create one or more workflows;
3. Instantiate services on the platform;
4. Instantiate one or more WMSs telling each what services are at its disposal and what workflow it should execute starting at what time;
5. Launch the simulation; and
6. Process the simulation outcome.
The above steps can be implemented with only a few lines of C++. An example WRENCH simulator is shown in Figure 2, which uses a WMS implementation (called SomeWMS) that has already been developed using the WRENCH Developer API (see previous section). After initializing the simulation (lines 5-6), the simulator instantiates a platform (line 8) and a workflow (lines 10-11). A workflow is defined as a set of computation tasks and data files, with control and data dependencies between tasks. Each task can also have a priority, which can then be taken into account by a WMS for scheduling purposes. Although the workflow can be defined purely programmatically, in this example the workflow is imported from
1 #include <math.h>
2 #include <wrench.h>
3 int main(int argc, char **argv) {
4 // Declare and initialize a simulation
5 wrench::Simulation simulation;
6 simulation.init(&argc, argv);
7 // Instantiate a platform
8 simulation.instantiatePlatform("my_platform.xml");
9 // Instantiate a workflow
10 wrench::Workflow workflow;
11 workflow.loadFromDAX("my_workflow.dax", "1000Gf");
12 // Instantiate a storage service
13 auto storage_service = simulation.add(
14 new wrench::SimpleStorageService("storage_host", pow(2,50)));
15 // Instantiate a sompute service (a batch−scheduled 4−node cluster that uses the
16 // EASY backfilling algorithm and is subject to a background load)
17 auto batch_service = simulation.add(
18 new wrench::BatchService("batch_login", {"node1", "node2", "node3", "node4"}, pow(2,40),
19 {{wrench::BatchServiceProperty::SIMULATED_WORKLOAD_TRACE_FILE, "load.swf"},
20 {wrench::BatchServiceProperty::BATCH_SCHEDULING_ALGORITHM, "easy_bf"}}));
21 // Instantiate a compute service (a 4−host cloud platform that does not support pilot jobs)
22 auto cloud_service = simulation.add(
23 new wrench::CloudService("cloud_gateway", {"host1", "host2", "host3", "host4"}, pow(2,42),
24 {{wrench::CloudServiceProperty::SUPPORTS_PILOT_JOBS, "false"}}));
25 // Instantiate a data registry service
26 auto data_registry_service = simulation.add(new wrench::FileRegistryService("my_desktop"));
27 // Instantiate a network monitoring service
28 auto network_monitoring_service =
29 simulation.add(new wrench::NetworkProximityService(
30 "my_desktop", {"my_desktop", "batch_login", "cloud_gateway"},
31 {{wrench::NetworkProximityServiceProperty::NETWORK_PROXIMITY_SERVICE_TYPE,
32 "vivaldi"}});
33 // Stage a workflow input file at the storage service
34 simulation.stageFile(workflow.getFileByID("input_file"), storage_service);
35 // Instantiate a WMS...
36 auto wms = simulation.add(
37 new wrench::SomeWMS({batch_service, cloud_service}, {storage_service},
{network_monitoring_service}, {data_registry_service}, "my_desktop"));
38 // ... and assign the workflow to it, to be executed one hour in
39 wms->addWorkflow(&workflow, 3600);
40 // Launch the simulation
41 simulation.launch();
42 // Retrieve task completion events
43 auto trace = simulation.getOutput().getTrace<wrench::SimulationTimestampTaskCompletion>();
44 // Determine the completion time of the last task that completed
45 double completion_time = trace[trace.size()-1]->getContent()->getDate();
46 }
Fig. 2: Example fully functional WRENCH simulator. Try-catch clauses are omitted.
a workflow description file in the DAX format [54]. At line 13 the simulator creates a storage service with 1PiB capacity accessible on host storage_host. This and other hostnames are specified in the XML platform description file. At line 17 the simulator creates a compute service that corresponds to a 4-node batch-scheduled cluster. The physical characteristics of the compute nodes (node[1-4]) are specified in the platform description file. This compute service has a 1TiB scratch storage space. Its behavior is customized by passing a couple of property-value pairs to its constructor. It will be subject to a background load as defined by a trace in the standard SWF format [55], and its batch queue will be managed using the EASY Backfilling scheduling algorithm [56]. The simulator then creates a second compute service (line 22), which is a 4-host cloud service, customized so that it does not support pilot jobs. Two helper services are instantiated: a data registry service so that the WMS can keep track of file locations (line 26), and a network monitoring service that uses the Vivaldi algorithm [57] to measure network distances between the two hosts from which the compute services are accessed (batch_login and cloud_gateway) and the my_desktop host, which is the host that runs these helper services and the WMS (line 28). At line 34, the simulator specifies that the workflow data file input_file is initially available at the storage service. It then instantiates the WMS and passes to it all available services (line 36), and assigns the workflow to it (line 39). The crucial call is at line 41, where the simulation is launched and the simulator hands off control to WRENCH. When this call returns, the workflow has either completed or failed. Assuming it has completed, the simulator then retrieves the ordered set of task completion events (line 43) and performs some (in this example, trivial) mining of these events (line 45).
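The event mining step described above (taking the completion date of the last task) can be sketched in isolation. The following is a minimal, self-contained illustration that does not use the WRENCH API: the TaskCompletionEvent struct and the last_completion_date function are hypothetical stand-ins introduced here. Unlike line 45 of Figure 2, which relies on the trace being ordered and non-empty, this sketch scans for the maximum date and reports an empty trace explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a task completion event; this is NOT a WRENCH type.
struct TaskCompletionEvent {
    std::string task_id;  // identifier of the completed task
    double date;          // simulated completion date, in seconds
};

// Stores the latest completion date found in `trace` into `out` and returns
// true. Returns false and leaves `out` untouched when the trace is empty
// (for instance, if the workflow failed before any task completed).
bool last_completion_date(const std::vector<TaskCompletionEvent>& trace, double& out) {
    if (trace.empty()) {
        return false;
    }
    auto latest = std::max_element(
        trace.begin(), trace.end(),
        [](const TaskCompletionEvent& a, const TaskCompletionEvent& b) {
            return a.date < b.date;
        });
    out = latest->date;
    return true;
}
```

Scanning for the maximum costs a linear pass, but it removes the dependence on event ordering, which may be a reasonable trade-off when traces are post-processed outside the simulator.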
service so that the WMS can keep track of file locations For brevity, the example in Figure 2 omits try/catch
clauses. Also, note that although the simulator uses the new WRENCH Pegasus Simulator
Pegasus
operator to instantiate WRENCH objects, the simulation object pegasus-run configuration
takes ownership of these objects (using unique or shared point-
ers), so that there is no memory deallocation onus placed on
DAGMan
scheduler
monitor
the user. This example showcases only the most fundamental DAGMan
HTCondor
to the WRENCH Web site [47] for more detailed information Job Submission Service
on how to use this API and for the full API documentation. master schedd shadow
In the future this API will come with Python binding so that
Central Manager Service
users can implement simulators in Python.
master negotiator collector
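The ownership transfer mentioned above (objects created with new are handed to the simulation object, which then manages their lifetime) can be illustrated with a small stand-alone sketch. ToyService and ToySimulation are hypothetical classes written for this illustration; they are not WRENCH classes, and the choice of std::shared_ptr here is an assumption, since the text only states that unique or shared pointers are used.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy classes for illustration only; these are NOT WRENCH classes.
struct ToyService {
    explicit ToyService(std::string host) : hostname(std::move(host)) {}
    std::string hostname;
};

class ToySimulation {
public:
    // Takes ownership of a raw pointer obtained from `new` and returns a
    // shared handle to it. The object is freed automatically once the
    // simulation and all outstanding handles are destroyed, so the caller
    // never calls delete.
    std::shared_ptr<ToyService> add(ToyService* raw) {
        std::shared_ptr<ToyService> owned(raw);
        services_.push_back(owned);
        return owned;
    }

    std::size_t service_count() const { return services_.size(); }

private:
    std::vector<std::shared_ptr<ToyService>> services_;
};
```

With this pattern, a call such as sim.add(new ToyService("my_desktop")) mirrors the calling style of Figure 2 while keeping deallocation out of user code.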
TABLE I: Average simulated makespan error (%), and p-values and Kolmogorov-Smirnov (KS) distances for task submission
and completion dates, computed for 5 runs of each of our 3 experimental scenarios.
values for all these parameters, but it is possible to pick custom values upon service instantiation. The process of picking parameter values so as to match a specific real-world system is referred to as simulation calibration. We calibrated our simulator by measuring delays observed in event traces of real-world executions for workflows on hardware/software infrastructures (see Section IV-B).
The simulator code, details on the simulation calibration procedure, and experimental scenarios used in the rest of this section are all publicly available online [60].
B. Experimental Scenarios
We consider experimental scenarios defined by particular workflow instances to be executed on particular platforms. Due to the lack of publicly available detailed workflow execution traces (i.e., execution logs that include data sizes for all files, all execution delays, etc.), we have performed real workflow executions with Pegasus and collected raw, time-stamped event traces from these executions. These traces form the ground truth to which we can compare simulated executions. We consider these workflow applications:
• 1000Genome [61]: A data-intensive workflow that identifies mutational overlaps using data from the 1000 genomes project in order to provide a null distribution for rigorous statistical evaluation of potential disease-related mutations. We consider a 1000Genome instance that comprises 71 tasks.
• Montage [2]: A compute-intensive astronomy workflow for generating custom mosaics of the sky. For this experiment, we ran Montage to process 1.5 and 2.0 square-degree 2MASS mosaics. We refer to these configurations as Montage-1.5 and Montage-2.0, respectively. Montage-1.5, resp. Montage-2.0, comprises 573, resp. 1,240, tasks.
We use the following platforms, deploying on each a submit node (which runs Pegasus, DAGMan, and HTCondor’s job submission and central manager services), four worker nodes (4 cores per node / shared file system), and a data node in the WAN:
• ExoGENI: A widely distributed networked infrastructure-as-a-service testbed representative of a “bare metal” platform. Each worker node is a 4-core 2.0GHz processor with 12GiB of RAM. The bandwidth between the data node and the submit node was ∼0.40 Gbps, and the bandwidth between the submit and worker nodes was ∼1.00 Gbps.
• AWS: Amazon’s cloud platform, on which we use two types of virtual machine instances: t2.xlarge and m5.xlarge. The bandwidth between the data node and the submit node was ∼0.44 Gbps, and the bandwidths between the submit and worker nodes on these instances were ∼0.74 Gbps and ∼1.24 Gbps, respectively.
C. Simulation Accuracy
To evaluate the accuracy of our simulator, we consider 3 particular experimental scenarios: 1000Genome on ExoGENI, Montage-1.5 on AWS-t2.xlarge, and Montage-2.0 on AWS-m5.xlarge. Each execution is repeated 5 times and the overall workflow execution times, or makespans, are recorded.
The third column in Table I shows average relative differences between actual and simulated makespans. We see that simulated makespans are close to actual makespans across the board (average relative error is below 5%). One of the key advantages of building WRENCH on top of SimGrid is that WRENCH simulators benefit from the high-accuracy network models in SimGrid, e.g., these models capture many features of the TCP protocol. Indeed, when comparing real-world and simulated executions we observe average relative error below 3% for data movement operations. The many processes involved in a workflow execution with Pegasus interact by exchanging (typically small) control messages. Our simulator simulates these interactions. For instance, each time an output file is produced by a task, a data registry service is contacted so that a new entry can be added to its database of file replicas, which incurs some overhead due to a message exchange. When comparing real-world to simulated executions we observe average relative simulation error below 1% for these data registration overheads.
To draw comparisons with a state-of-the-art simulator, we repeated the above simulations using WorkflowSim [35]. WorkflowSim does not provide a detailed simulated HTCondor implementation, does not offer the same simulation calibration capabilities as WRENCH, and is built on top of the CloudSim simulation framework [25]. Nevertheless, we have painstakingly calibrated our WorkflowSim simulator so that it models the hardware and software infrastructures of our experimental scenarios as closely as possible. For the 3 experimental scenarios, we find that the relative average makespan percentage error is 12.09 ± 2.84, 26.87 ± 6.26, and 13.32 ± 1.12, respectively, i.e., from 4x up to 11x larger than the error values obtained with our WRENCH-based simulator. The reasons for the discrepancies between WorkflowSim and real-world results are twofold. First, WorkflowSim uses the simplistic network models in CloudSim (see discussion in Section II) and thus suffers from simulation bias w.r.t. data transfer times. Second, WorkflowSim does not capture all the relevant details of the system and its execution. By contrast, implementing a fully detailed simulator with WRENCH can be done in a few hundred lines of code.
In our experiments we also record the submission and completion dates of each task, thus obtaining empirical cumulative distribution functions (ECDFs) of these times, for both real-world executions and simulated executions. To further validate the accuracy of our simulation results we apply Kolmogorov-Smirnov goodness of fit tests (KS tests) with null hypotheses (H0) that the real-world and simulation samples are drawn from the same distributions. The two-sample KS test results in a miss if the null hypothesis (two-sided alternative hypothesis) is rejected at the 5% significance level (p-value ≤ 0.05). Each test for which the null hypothesis is not rejected (p-value > 0.05) indicates that the simulated execution statistically matches the real-world execution. Table I shows the p-value and KS test distance for both task submission times and task completion times. The null hypothesis is not rejected, and we thus conclude that simulated workflow task executions statistically match real-world executions well. These conclusions are confirmed by visually comparing ECDFs. For instance, Figure 4 shows real-world and simulated ECDFs for sample runs of Montage-2.0 on AWS-m5.xlarge, with task submission, resp. completion, date ECDFs on the left-hand, resp. right-hand, side.
Fig. 4: Empirical cumulative distribution function of task submit times (left) and task completion times (right) for sample real-world (“pegasus”) and simulated (“wrench” and “workflowsim”) executions of Montage-2.0 on AWS-m5.xlarge. [Panels A and B; x-axis: Workflow Makespan (s); y-axes: F(Submitted Tasks) and F(Completed Tasks).]
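The two accuracy metrics used in this subsection, the average relative makespan error and the two-sample KS distance between ECDFs, can be computed as in the following sketch. It is a minimal illustration under stated assumptions rather than the code used for Table I: the function names are hypothetical, each real run is assumed to be paired with one simulated run, and the p-value computation is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean relative makespan error, in percent, over paired runs.
// Assumes actual[i] and simulated[i] refer to the same scenario run and
// that all actual makespans are strictly positive.
double mean_relative_error_percent(const std::vector<double>& actual,
                                   const std::vector<double>& simulated) {
    assert(!actual.empty() && actual.size() == simulated.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        sum += std::fabs(simulated[i] - actual[i]) / actual[i];
    }
    return 100.0 * sum / static_cast<double>(actual.size());
}

// Two-sample KS distance: the largest absolute gap between the two ECDFs.
// Samples are taken by value because they are sorted locally.
double ks_distance(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::size_t i = 0, j = 0;
    double d = 0.0;
    while (i < a.size() && j < b.size()) {
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;  // count values of a that are <= x
        while (j < b.size() && b[j] <= x) ++j;  // count values of b that are <= x
        const double fa = static_cast<double>(i) / static_cast<double>(a.size());
        const double fb = static_cast<double>(j) / static_cast<double>(b.size());
        d = std::max(d, std::fabs(fa - fb));
    }
    // Once one sample is exhausted its ECDF equals 1, so the gap can only
    // shrink over the remaining values and no further scan is needed.
    return d;
}
```

Here ks_distance would be applied once to the task submission dates and once to the task completion dates of a real and a simulated execution. Turning the resulting distance into a p-value requires the KS distribution, for which an established statistics library is the safer choice.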
[Figure residue: panels A (pegasus) and B (wrench) with axes “Tasks” and “Time (s)”; a further plot labeled “simulation time” and “Memory [MB]” with series “workflowsim” and “wrench”.]
be implemented with minimal software development effort. Via a case study for the Pegasus production WMS we have demonstrated that WRENCH achieves these objectives, and