Serverless Workflows With Durable Functions and Netherite
Sebastian Burckhardt, Microsoft Research, [email protected]
Chris Gillum, Microsoft Azure, [email protected]
David Justo, Microsoft Azure, [email protected]
declarative approaches is that DF workflows can take advantage of all the familiar control flow abstractions and the ecosystem of libraries and tools of a mature host language. DF persists the intermediate state of a workflow using record and replay.

Serverless Computation Model. In order to keep the engine development separate from the programming model, we propose a computation model that contains two simple "serverless primitives": stateless tasks and stateful instances. This acts as an interface between the programming model and the execution engine: DF is translated into the computation model by encoding workflows as stateful instances, and Netherite implements it. This separation allows independent experimentation on the programming or the engine part; in fact, we benefited from this separation since Netherite was built as a replacement for the existing Durable Functions implementation. The model is also designed to facilitate elasticity: tasks and instances are both fine-grained and communicate via messages, which makes it possible to dynamically load-balance them over an elastic cluster.

Causally Consistent Commit. A common challenge for workflow systems is to articulate a reliability guarantee that is strong, easy to understand for programmers, and efficiently implementable. To this end, we define a guarantee called causally consistent commit (CCC) using execution graphs. It is stronger than "at-least-once" or "effectively-once", and more realistic than "exactly-once". In essence, it guarantees atomicity: a step that fails is aborted, along with all steps that causally depend on it.

Batch Commit. In order to guarantee reliability, workflow solutions need to persist workflow steps in storage. This is commonly achieved by persisting the state and steps of each workflow individually¹, creating a throughput bottleneck due to the limited number of I/O operations storage can handle per second. To avoid this problem, we designed Netherite so it can persist many steps, by different workflow instances, using a single storage update. This is achieved by grouping the fine-grained instances and tasks into partitions. Each partition can then persist a batch of steps efficiently by appending it to its commit log in cloud SSD storage.

Speculation Optimizations. A conservative workflow execution engine would wait until a step is persisted before proceeding with the next step. This introduces a significant latency overhead since storage accesses are on the critical execution path. We show that with careful local and global speculation, Netherite moves these storage accesses off the critical path, significantly reducing latency, while still providing the CCC guarantee.

Elastic Partition Balancing. Netherite uses a fixed number of partitions (32) that communicate via a reliable ordered queue service. It can move individual partitions between nodes by persisting and then recovering their state on a different node. In particular, it can re-balance the partitions as needed. For example, on a one-node cluster, all 32 partitions are loaded on a single node. On a four-node cluster, each node has eight partitions, and so on, up to 32 nodes with one partition each. Netherite can also scale to zero if the application is idle: on a zero-node cluster, all partitions reside in cloud storage.

Evaluation. Our evaluation on five workflows, two of which are taken from real applications, indicates that the DF programming model offers significant benefits regarding development effort. In particular, the availability of general loops, exception handling, and functional abstraction (provided by the host language) greatly improves the experience when dealing with complex workflows.

Yet, the benefits are not limited to the developer experience: the execution performance with Netherite is better than with common serverless alternatives, across the board. For instance, Netherite orchestrations outperform trigger-based composition by orders of magnitude, both on AWS and Azure. They also exhibit better throughput and latency than the current Durable Functions production implementation, by an order of magnitude in some situations. Finally, a workflow composing AWS lambdas completes faster in Netherite (deployed in Azure and invoking lambdas through HTTP) than in Step Functions (deployed in AWS and invoking lambdas directly).

¹ This is the case with unstructured composition, as well as the existing DF implementation.

1.1 Contributions
We make the following contributions:
• We introduce the Durable Functions Programming Model, which allows code-based structured expression of workflows in multiple languages (§2).
• We demonstrate how to break down complex workflows into just two serverless primitives, and define the causally-consistent-commit guarantee (§3).
• We provide an architecture and implementation that realize these concepts (§4) and demonstrate the power of speculation optimizations (§5).
• We evaluate the Durable Functions programming model and Netherite implementation on several benchmarks and case studies, comparing them to commonly used serverless composition techniques (§6).
Overall, our contributions bring the development of complex full-fledged serverless applications within reach: providing cloud developers with (i) Durable Functions, a mature programming environment that allows them to have their application in one place; and (ii) Netherite, an efficient execution engine that provides strong reliability guarantees.
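The batch-commit idea described above can be made concrete with a small sketch (plain Python, not Netherite code; the `CommitLog` and `Partition` names are invented for illustration): steps from many different workflow instances accumulate in a partition's buffer and are persisted with a single log append, so the number of storage I/O operations no longer grows with the number of steps.

```python
# Toy sketch of batch commit: a partition buffers steps from many
# workflow instances and persists them with a single log append,
# instead of issuing one storage write per step. (Names invented.)

class CommitLog:
    def __init__(self):
        self.appends = 0          # number of storage I/O operations
        self.entries = []         # persisted batches

    def append(self, batch):
        self.appends += 1         # one I/O, regardless of batch size
        self.entries.append(list(batch))

class Partition:
    def __init__(self, log):
        self.log = log
        self.pending = []         # steps not yet persisted

    def record_step(self, instance_id, step):
        self.pending.append((instance_id, step))

    def flush(self):
        # Persist all buffered steps with a single storage update.
        if self.pending:
            self.log.append(self.pending)
            self.pending.clear()

log = CommitLog()
p = Partition(log)
for i in range(100):              # steps from 100 different instances
    p.record_step(f"wf-{i}", "step-1")
p.flush()

assert log.appends == 1           # 100 steps, one I/O operation
assert sum(len(b) for b in log.entries) == 100
```

The point of the sketch is only the ratio: one append per batch rather than one write per step, which is what removes the per-step I/O bottleneck.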
[FunctionName("Transfer")]
public static async Task<bool> Transfer(
    [OrchestrationTrigger] IDurableOrchestrationContext ctx)
{
    (string source, string dest, int amount) =
        ctx.GetInput<(string, string, int)>();
    EntityId sourceId = new EntityId("Account", source);
    EntityId destId = new EntityId("Account", dest);

    using (await ctx.LockAsync(sourceId, destId))
    {
        int bal = await ctx.CallEntityAsync<int>(sourceId, "Get");
        if (bal < amount)
        {
            return false;
        }
        else
        {
            await Task.WhenAll(
                ctx.CallEntityAsync(sourceId, "Modify", -amount),
                ctx.CallEntityAsync(destId, "Modify", +amount));
            return true;
        }
    }
}

Figure 4. Example of an orchestration with a critical section that reliably transfers money between account entities.

2.1 Orchestration Persistence
In contrast to stateless functions, orchestrations do not have to remain in memory, accumulating billing charges, while they wait for a step to complete. Instead, their progress can be stored in durable storage and retrieved when the step has completed. This is particularly important for long-running workflows.
Rather than persisting the program location, variables, and heap, DF records a history of events. For example, the orchestration from Fig. 1 executes in three steps, with partial histories as shown in Fig. 5. It is possible to re-hydrate the intermediate state of an orchestration from storage by replaying the persisted partial history. Completed tasks are not re-executed during replay; rather, the recorded results are reused.
Replay can cause problems if the orchestration contains nondeterminism or if histories are excessively long. Developers are expected to avoid these issues by (1) encapsulating nondeterminism in activities, and (2) using sub-orchestrations, or restarting orchestrations, to limit history size. DF also includes a static analysis tool that can detect common mistakes of this kind for its C# front end.
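The record-and-replay mechanism described above can be sketched in a few lines of plain Python (a simplified simulation, not the DF runtime; `replay`, `run_task`, and the generator-based orchestrator are invented stand-ins): the orchestrator is re-executed from the top after each completed step, but tasks that already appear in the recorded history return their recorded results instead of running again.

```python
# Toy sketch of record-and-replay: the orchestrator function is
# re-executed from the top after each step, and completed tasks are
# answered from the recorded history rather than re-executed.
# (Illustrative simulation only; not the actual DF runtime.)

executions = []                       # counts real task executions

def run_task(name):
    executions.append(name)
    return f"result of {name}"

def replay(make_orchestrator, history):
    """Re-run the orchestrator from the top against the history."""
    gen = make_orchestrator()
    value, index = None, 0
    while True:
        try:
            task = gen.send(value)    # orchestrator requests a task
        except StopIteration as finished:
            return finished.value, True
        if index < len(history):
            value = history[index]    # recorded result: no re-execution
            index += 1
        else:
            history.append(run_task(task))  # first time: execute, record
            return None, False        # suspend after one new step

def orchestrator():
    a = yield "task-A"
    b = yield "task-B"
    return a + " / " + b

history, done, result = [], False, None
while not done:                       # three "steps": A, B, completion
    result, done = replay(orchestrator, history)

assert executions == ["task-A", "task-B"]   # each task ran exactly once
assert result == "result of task-A / result of task-B"
```

Note how the orchestrator body runs three times, yet each task executes only once; this is the property that makes re-hydration from a persisted history safe, and also why nondeterminism in orchestrator code is dangerous.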
Conceptually, an AWS Step Function is comparable to a DF orchestrator. But the programming model is different: DF expresses orchestrations in a mainstream language, while Step Functions use a JSON-based schema with limited options for abstraction and control flow. For example, sequential loops with dependent iterations cannot be expressed (§6.3).

3 Computation Model
In this section we describe the core of the serverless computation model that underlies Durable Functions and is implemented by Netherite. By describing this model abstractly using execution graphs, we provide a solid foundation that allows us to state and explain the "causally-consistent-commit" execution guarantee of Netherite.

3.1 Tasks and Instances
Computations in our model are built from tasks and instances that communicate via messages. We distinguish two types of messages:
• Task messages are used to start a task. Tasks are stateless and can be executed anywhere. When a task finishes executing, it sends a single result message.
• Instance messages target a specific stateful instance, identified by an instance ID. When processed, an instance message may read and update the state of that instance, and may also produce additional messages.
The fine granularity of tasks and instances, and the state encapsulation afforded by the message passing paradigm, facilitate elasticity as they allow us to balance task and instance execution across an elastic cluster. For stateless tasks, load balancing is straightforward. For stateful instances, it requires a bit more work. We describe our solution in §4.

3.2 Execution graphs
To visualize execution states and execution histories, we use execution graphs. There are three types of vertices:
• An input vertex represents an external input message.
• A task vertex represents a stateless task.
• A step vertex represents the processing of a batch of one or more messages by a stateful instance.
We call the task and step vertices work items, since both represent the processing of messages. Edges in the graph represent direct causal dependencies:

Figure 6. Execution graph for a simple sequence of two tasks as in Fig. 1. Vertices are labeled to indicate the vertex type, and message edges are labeled with the value propagated.

For an example, see Fig. 6. This execution graph corresponds to the simple sequence from Fig. 1. The input is the message that starts the orchestration; the orchestration then proceeds in three steps: (1) receive input and issue first task, (2) receive first task result and issue second task, and (3) receive second task result and finish.
We call an execution graph consistent if it is consistent with a sequential execution of atomic processing steps as described in §3.1. We call an execution graph complete if all messages produced are also consumed.

3.3 Faults and Recovery
A critical reality of service-oriented environments is the prevalence of faults: tasks may time out, nodes may crash (e.g., run out of memory) and reboot, and service connections may be temporarily interrupted. For example, attempts to persist instances to storage, to send a message, or to acknowledge the receipt of a queue message, may fail intermittently.
What does it mean for a workflow execution to be correct in the presence of faults and recovery? Ideally, faults would be invisible. This is sometimes called an "exactly-once" guarantee, since it means that each message is processed exactly once. In general it is unfortunately not possible to implement this guarantee. The reason is that, when recovering from a crash, some progress may be lost, and some code must therefore be re-executed, possibly re-performing an irrevocable effect on an external service.
Because of that, many workflow systems settle for an "at-least-once" guarantee, where a message may be processed more than once, and thus its effects may also be duplicated. To handle duplicates correctly, developers usually employ a technique called "effectively-once" [15, 16]: it combines the at-least-once guarantee with additional mechanisms that ensure that all effects of processing a message are idempotent. It may at first appear that the combination of at-least-once and idempotence is sufficient to hide faults. However, that is not true in the presence of nondeterminism. The reason is that if re-processing a message produces different effects (e.g. sends a message to a different queue, or updates a different storage location), the effects of both executions remain, instead of being deduplicated.

Causally consistent commit. To address the shortcomings of the aforementioned guarantees we propose a guarantee called "causally-consistent-commit". The intuition behind it is that if we re-execute a work item, we have to ensure that all internal effects that causally depend on the previous execution are aborted; in particular, any produced messages are discarded and updated instance states are rolled back.
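The causal-abort rule above amounts to a reachability computation over the execution graph. The following sketch (ours, with an invented example graph; not Netherite code) computes the set of work items that must be aborted when one work item's effects are lost: the item itself plus everything transitively downstream of its messages.

```python
# Toy sketch of the causal-abort rule behind causally consistent
# commit: if a work item's effects are lost, every work item that
# causally depends on it (transitively, via messages) is aborted too.

from collections import deque

# Edges point from a work item to the work items that consumed one of
# its messages (direct causal dependencies). Example graph is invented.
consumers = {
    "step1": ["taskA", "taskB"],
    "taskA": ["step2"],
    "taskB": ["step2"],
    "step2": ["taskC"],
    "taskC": [],
    "unrelated": [],
}

def abort_set(failed):
    """All work items causally reachable from `failed`, inclusive."""
    aborted, frontier = {failed}, deque([failed])
    while frontier:
        for w in consumers[frontier.popleft()]:
            if w not in aborted:
                aborted.add(w)
                frontier.append(w)
    return aborted

assert abort_set("taskA") == {"taskA", "step2", "taskC"}
assert "unrelated" not in abort_set("step1")
```

Items with no causal path from the failed work item (here `unrelated`) are untouched, which is exactly what distinguishes CCC from a blunt restart of the whole workflow set.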
Sebastian Burckhardt, Chris Gillum, David Justo, Konstantinos Kallas, Connor McMahon, and Christopher S. Meiklejohn
[Figure 9 omitted: diagram of the partition state components P, S, O, I, T and the events MessagesReceived, MessagesSent, TaskCompleted, StepCompleted.]
Figure 9. Illustration of the Netherite architecture.
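The elastic partition balancing described in the introduction (a fixed count of 32 partitions spread over however many nodes are available) can be sketched as follows. This is an illustration under simple assumptions (stable hashing of instance IDs, round-robin placement), not Netherite's actual assignment logic.

```python
# Toy sketch of elastic partition balancing with a fixed partition
# count (32, as in Netherite): instances hash to a partition, and
# partitions are spread evenly over the currently available nodes.
# The placement scheme here is a simple illustration, not Netherite's.

import hashlib

NUM_PARTITIONS = 32

def partition_of(instance_id: str) -> int:
    # Stable hash (Python's built-in hash() is randomized per process).
    digest = hashlib.sha256(instance_id.encode()).digest()
    return digest[0] % NUM_PARTITIONS

def node_of(partition: int, num_nodes: int) -> int:
    return partition % num_nodes      # even spread over the cluster

# One node: all 32 partitions on node 0. Four nodes: eight each.
assert {node_of(p, 1) for p in range(NUM_PARTITIONS)} == {0}
counts = [0] * 4
for p in range(NUM_PARTITIONS):
    counts[node_of(p, 4)] += 1
assert counts == [8, 8, 8, 8]
assert 0 <= partition_of("wf-123") < NUM_PARTITIONS
```

Because the partition count is fixed, scaling the cluster only changes `node_of`, never `partition_of`; a partition moves by persisting its state and recovering it on its new node, and at zero nodes all partitions simply rest in storage.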
Introducing buffers decouples the work for sending and receiving of messages, processing steps, and processing tasks, which in turn increases pipeline parallelism and enables batching. As required by the event sourcing paradigm, execution progress is recorded as a sequence of atomic events that update the partition state deterministically. There are four event types:
• MessagesReceived. Updates P (advances position and deduplication vector) and S (enqueues messages).
• MessagesSent. This updates O (removes messages).
• TaskCompleted. This updates S (enqueues response) and T (removes completed task).
• StepCompleted. This updates I (updates instance state), S (removes consumed messages), O (adds produced messages), and T (adds produced tasks).

Instance State Caching. Keeping the state of all instances of a partition in memory is expensive and not always possible. Also, loading that state into memory on partition recovery is slow. Thus, it is important to have a caching mechanism that keeps only the most recently used instances in memory, while the rest remains in storage. Netherite achieves that by leveraging FASTER [29], a hybrid key-value store that coexists in memory and storage. FASTER exploits temporal access patterns to keep "hot" keys in memory while evicting the rest to storage. It is implemented on top of a hybrid log, which allows it to perform fewer, batched storage accesses.

5 Optimizations
The baseline Netherite implementation is conservative: the messages produced by a work item execution are first persisted to storage before being propagated. As explained in §3.6, speculation can improve performance by moving this storage access off the critical path. This does not compromise the CCC guarantee, because we take care to properly propagate aborts along causal dependencies. We now describe two levels of speculation that are supported as optional optimizations in Netherite.

Local Speculation. With local speculation, we allow messages to be processed immediately (before the work item is persisted) as long as the message stays within the same partition. Messages headed for different partitions are held up in the outbox O until after their work item is persisted.
Thus, we never need to propagate aborts to other partitions. Locally, within a single partition, aborts "automatically" respect causality because we use a single, causally consistent commit log to persist the partition state. After a crash, the partition state reverts to the persisted prefix of the commit log, which implicitly aborts all non-persisted work items.
Local speculation provides significant benefits for independent workflows that do not communicate with other stateful instances, therefore staying within a single partition during their execution. That includes the common case of orchestration workflows that only use a single instance and compose multiple tasks, such as the examples in Fig. 1 and Fig. 2.

Global Speculation. With global speculation enabled, messages destined to remote partitions are also sent immediately. Global speculation essentially moves all commit log updates out of the critical path. It is particularly beneficial for workflows involving many hops between partitions. However, it requires a more involved protocol to ensure aborts are propagated correctly.
The sending partition keeps a record of the completed work items and the messages they have sent. When a work item is persisted, for each message sent before, it sends a confirmation message. The receiving partition knows that a message it receives is speculative until it receives that confirmation message; and the partition avoids persisting any work items that depend on such a speculative message until a confirmation is received.
But how are crashes handled? Note that when a partition crashes and recovers, it may no longer remember the work items it completed before the crash, so it cannot simply send abort messages for individual work items. Our current solution thus relies on using the commit log positions of partitions. Each speculative message is tagged with the commit log position of the work item that produces it. When a partition crashes and recovers, it broadcasts a recovery message to all partitions, which contains the recovered commit log position. When a partition receives a recovery message, it then "rewinds" its own commit log, by recovering from the closest preceding checkpoint, to a position that does not causally depend on aborted work items. It then broadcasts recovery messages of its own, to propagate aborts recursively.

6 Evaluation
The goal of our evaluation is to study several aspects of DF and Netherite. We start by describing the workflow applications (§6.1). We then formulate the research questions (§6.2), and present the results (§6.3–6.6).

6.1 Workflows
We use five representative workflows that vary in complexity and execution characteristics. The first two workflows correspond to sequences of tasks, the third is a workflow that performs a transaction between two bank accounts and thus requires atomicity guarantees, and the other two workflows are taken from real applications: an image processing application, and a database snapshot obfuscation.

Hello Sequence. A very simple "hello world" workflow that calls three functions in sequence. Each function returns a hello message, and the workflow then returns the concatenation of those messages.
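The four event types listed in §4 update the partition state deterministically, which is what makes replaying the commit log sound. The sketch below (plain Python; the event field names and the dictionary representation of P, S, O, I, T are our invention for illustration) applies one event of each type to a toy partition state.

```python
# Toy sketch of the four event types deterministically updating the
# partition state components (P: receive position, S: message queue,
# O: outbox, I: instance states, T: pending tasks). Event shapes and
# field names are invented for illustration.

state = {"P": 0, "S": [], "O": [], "I": {}, "T": []}

def apply(state, event):
    kind = event["kind"]
    if kind == "MessagesReceived":
        state["P"] = event["position"]          # advance receive position
        state["S"].extend(event["messages"])    # enqueue for processing
    elif kind == "MessagesSent":
        for m in event["messages"]:             # sent: drop from outbox
            state["O"].remove(m)
    elif kind == "TaskCompleted":
        state["T"].remove(event["task"])
        state["S"].append(event["response"])    # response enters queue
    elif kind == "StepCompleted":
        state["I"][event["instance"]] = event["new_state"]
        for m in event["consumed"]:
            state["S"].remove(m)
        state["O"].extend(event["produced_messages"])
        state["T"].extend(event["produced_tasks"])
    return state

apply(state, {"kind": "MessagesReceived", "position": 1, "messages": ["m1"]})
apply(state, {"kind": "StepCompleted", "instance": "id1", "new_state": "s1",
              "consumed": ["m1"], "produced_messages": ["m2"],
              "produced_tasks": ["t1"]})
apply(state, {"kind": "TaskCompleted", "task": "t1", "response": "r1"})
apply(state, {"kind": "MessagesSent", "messages": ["m2"]})

assert state == {"P": 1, "S": ["r1"], "O": [], "I": {"id1": "s1"}, "T": []}
```

Because `apply` is a pure function of (state, event), replaying the same event sequence after a crash reconstructs the same partition state; this is also why, under local speculation, truncating the log to its persisted prefix implicitly aborts all non-persisted work items.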
Task Sequence. A sequential workflow that initializes an object and then passes it through a sequence of processing steps. It is similar to the hello sequence, but the length of the sequence is not fixed; it is given as an input parameter.

Bank Application. The workflow from Fig. 4 that implements a reliable transfer of currency between accounts. This workflow showcases the capabilities of the Durable Functions programming model, since it cannot be implemented with existing solutions.

Image Recognition. A workflow that recognizes objects in a given picture and creates a thumbnail for it. It is part of a bigger image processing application³. The workflow performs the following steps, each of which is implemented as a separate AWS lambda. It first reads the image metadata from the S3 bucket where it is stored. If the image extension is supported, it filters out the unnecessary metadata and then runs two steps in parallel: one that performs object detection using Amazon Rekognition, and one that generates a thumbnail of the image. When the processes complete, it persists the filtered metadata in a DynamoDB table. The workflow repeatedly retries all steps until they succeed.

Database Snapshot Obfuscation. This workflow is taken from a real application used for database snapshot obfuscation⁴. The workflow state machine contains 27 states that interact with a variety of AWS services. Some of the tasks that it calls include user authorization, creation of database snapshots, validation of the snapshots, obfuscation of the snapshots, and publishing the snapshots in a production environment.

³ Source at: https://github.com/aws-samples/lambda-refarch-imagerecognition
⁴ Source at: https://github.com/FINRAOS/maskopy

6.2 Research Questions
We organize the evaluation and results according to the following questions:
Q1 Does the DF programming model facilitate application development and maintenance?
Q2 How does Netherite compare with existing solutions with respect to latency, i.e. the time to complete a workflow?
Q3 How does Netherite compare with existing solutions with respect to throughput, i.e. the number of workflows that it can execute in a period of time?
Q4 How does speculation improve latency and how does it impact throughput?
Q5 Does Netherite scale with the addition of available nodes in cases of high load?

System infrastructure. In all experiments other than the ones targeting AWS Step Functions, the system under test was run on a pool of Linux VMs on Azure Kubernetes Service, of type Standard_DS2_v2 [22]. The number of nodes was 4 (8 for the scale-out experiment). Each node had 2 vCPUs and a memory limit of 5GB. The queueing service was Azure EventHubs, which is roughly equivalent to Apache Kafka [2], with 32 partitions. The cloud storage was Azure storage GPv2, using the premium tier for the FASTER Log Devices. The load was generated by a separate deployment of 20 load generator machines.

6.3 Programmability Results (Q1)
To evaluate and compare the development experience when using DF, unstructured composition, or Step Functions, we tried to implement all the workflows from §6.1.

Task Sequence. With DF, the task sequence can be implemented using a straightforward for-loop that iteratively updates the target object by invoking the task with it. With unstructured composition, the sequence is also relatively simple, but requires that the user also manages and configures a storage or queue service. To our surprise, with Step Functions, it is not possible to express this workflow: the JSON schema for state machines does not support folds, i.e. loops with iteration dependencies. Encoding a loop by restarting the state machine does not work since the invocation API would return after the first iteration terminates.

Image Recognition. In order to be as faithful as possible to the original Step Functions implementation, we implemented this workflow in DF by invoking the original Lambdas through their HTTP interface, only porting the workflow logic. The code in DF is 70 lines of standard C#, while the state machine definition in Step Functions is 150 lines of JSON. An interesting difference is the implementation of a check whether the format of an image is supported. In Step Functions, this requires 24 lines of JSON compared to a 5-line if statement in DF (Fig. 12).

Database Snapshot Obfuscation. The workflow in this application is by far the most complex. The state machine definition in Step Functions contains 27 states and is written using 700 lines of JSON; the DF version is more concise and easier to read, with 200 lines of C# code. An important observation is that there is a lot of copied code in the Step Functions definition since it doesn't support function abstraction. Specifically, the error handling logic, written as 9 lines of JSON, is copied 12 times in the definition, while in DF we just wrap the orchestration with a single try-catch (Fig. 13).

Bank Application. The bank application simulates bank accounts and reliable money transfers between them. In DF, this is straightforward to implement using entities (Fig. 3) and critical sections (Fig. 4). We have not yet figured out a satisfactory way of implementing this workflow using unstructured composition or Step Functions, as they do not provide the synchronization primitives needed for concurrency control.
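The dependent-iteration loop discussed above for the task sequence (the "fold" that the Step Functions JSON schema cannot express) can be sketched as a plain for-loop in a code-based orchestrator. The sketch below is a simplified simulation in Python, not the actual DF API; `process`, `task_sequence`, and the toy `run` driver are invented for illustration.

```python
# Sketch of the Task Sequence as a plain for-loop in a code-based
# orchestrator (simplified simulation, not the actual DF API).
# Each iteration depends on the previous one's result -- the "fold"
# that a declarative state-machine schema without loops cannot express.

def process(obj):
    """Stand-in activity: one processing step over the object."""
    return obj + ["processed"]

def task_sequence(length):
    obj = []                      # initialize the target object
    for _ in range(length):       # sequential loop, dependent iterations
        obj = yield ("process", obj)
    return obj

def run(gen):
    """Toy driver: executes each requested activity and resumes."""
    value = None
    while True:
        try:
            name, arg = gen.send(value)
        except StopIteration as finished:
            return finished.value
        assert name == "process"
        value = process(arg)      # result feeds the next iteration

result = run(task_sequence(5))
assert result == ["processed"] * 5
```

The essential point is that the loop bound and the data threaded between iterations live in ordinary host-language code, so libraries, types, and debuggers all apply.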
Take away: DF supports a wider range of applications than Step Functions and unstructured composition, due to its support for workflows with rich control structure, entities, and critical sections. Defining workflows implicitly using a high-level language has several benefits compared to declarative definitions using JSON, since it allows for features such as function abstraction and error handling. Maintenance is also improved since there is less copied code and fewer lines of code in general. Finally, using a high-level language, the user gets to enjoy all of its benefits (libraries, type system, IDE support).

[Figure 11 omitted: eCDF plots of latency (ms); visible panel titles include Bank Application and Image Recognition; legend: Neth. (Global Spec), Neth. (Local Spec), Neth. (No Spec), Existing DF, Step Functions, Trig. Seq. Azure, Queue Seq. Azure, Trig. Seq. AWS.]

Figure 12. DF code that checks whether the input image is in a supported format. [Code listing omitted.]

try {
    ...
} catch {
    // Catch errors by calling ErrorHandling
    string ErrorHandlingAndCleanupInput =
        JsonConvert.SerializeObject(inputJson);
    await MakeHttpRequestSync(inputJson.ErrorHandlingURI,
        ErrorHandlingAndCleanupInput, context);
    return "Orchestration Failed!";
}

Figure 13. DF code that does error handling for the snapshot obfuscation application.

Methodology. For all workflows except the snapshot obfuscation, requests are issued at a fixed, low rate (4–25 requests per second) for 3–5 minutes. We then compute the empirical cumulative distribution function (eCDF) of the system-internal orchestration latency, i.e. the time it takes for an orchestration to complete, using the timestamps reported by the system. We chose to use the system-reported latency of workflows, as opposed to the client-observed latency, because not all clients provide a way to wait for the completion of a workflow.
Latency results for four of the five workflows are shown in Fig. 11. For the snapshot obfuscation workflow, there is no appreciable performance difference between the implementations; the total latency (20–25 minutes) is dominated by executing the time-consuming tasks (taking a snapshot, obfuscating it, restoring the database from a snapshot, etc).

Unstructured Composition. Unstructured composition (using triggers and queues) can only be used to implement the Task Sequence workflow. As can be seen in Fig. 11, triggers⁵ suffer significantly higher latencies (x1000–x10000) than Netherite. Using queues for constructing sequential workflows performs better than triggers, but Netherite still achieves an order of magnitude lower latencies (median x61, 95th x91).

⁵ Blob in Azure and S3 in AWS.
Step Functions. Step Functions does not support the Hello Sequence Bank Application
Neth. (Global Spec) 1584 288
bank application and the task sequence so they are not in-
cluded in that experiment. For the other two workflows Neth. (Local Spec) 1612 270
Netherite achieves better latencies (hello sequence: median Neth. (No Spec) 1467 254
x104, 95th x75). An important take-away is that Netherite Neth. (No Spec) HTTP 843 229
achieves lower latency in the image recognition experiment Existing DF HTTP 113 111
even though Netherite is deployed on Azure and invokes 0 250 500 750 1000 1250 1500 0 50 100 150 200 250 300
Throughput Throughput
AWS lambdas as its tasks using their HTTP interfaces, while
AWS Step Functions invoke the lambdas directly (avoiding
Figure 14. Throughput Measurements.
both the network back and forth and the HTTP overhead).
Durable Functions. Compared to the existing imple-
We only compare against the existing DF implementation
mentation of Durable Functions, Netherite achieves better
because it is available on Github6 and thus we could deploy
latency in all experiments, even without speculation. The
it with the exact same resources as Netherite.
optimized Netherite implementation achieves x38, x4.3, 17%,
We did not include image recognition and snapshot ob-
improvements in median and x43, x4.7, 29% improvements
fuscation since their throughput limits are bounded by the
in 95th percentile latency than the existing implementation
throughput limits of external services that they use7 . We
in the task sequence, bank, and image recognition workflows
did not include throughput measurements for task sequence
respectively.
because its results are very similar to the Hello Sequence.
Speculation Benefits (Q4). The benefits of speculation Throughput results are shown in Fig. 14.
are apparent in all plots of Fig. 11. In general, the improve-
Durable Functions. The HTTP plots correspond to ex-
ment is cumulative, with two exceptions: local speculation
ecutions where the invocations where done through HTTP,
does not improve latency for the Bank Application since
consuming some resources. Netherite without speculation
there is a lot of communication among workflows and en-
improves the throughput over the existing DF implementa-
tities, and global speculation does not improve latency for
tion by x7.5 for hello sequence and by x2 for the bank appli-
task sequence and image recognition since their workflows
cation. Throughput improvement for the bank application is
stay within partitions. In image recognition, the speculation
smaller, presumably because there is much inter-partition
benefits are small because the biggest factor of the workflow
traffic and less batching per node.
latency is the execution time of the image recognition. In to-
tal, median latency for the sequence experiment is improved Speculation (Q4). To measure speculation improvement
by x21 (95th x17) with speculation, the median latency for on throughput more accurately, we invoke the workflows
the image recognition experiment is improved by 6% (95th without HTTP. Speculation slightly improves throughput in
5%) due to speculation, and finally the median latency for the bank experiment is improved by 3x (95th percentile: 2x) using global speculation.

Take away: Netherite achieves better latencies than all other solutions in all of our experiments. Speculation significantly improves Netherite's latency. For a workflow taken from an AWS application, Netherite achieves better latency than Step Functions even though it pays communication and HTTP costs due to being deployed in Azure and calling stateless functions deployed in AWS.

6.5 Throughput Results (Q3, Q4)

In this section we conduct experiments to evaluate Netherite's throughput and how it is impacted by speculation.

Methodology. For the throughput experiments, we control the load by adjusting the number of request loops running on the load generators. We determine a suitable load level by ramping up the load until we can visually discern saturation, indicating that a further load increase will not improve throughput. We then keep that load steady for a minute and compute the average throughput.

for both experiments: for Hello Sequence (10% with local, 8% with global), for Bank Application (6% with local, 13% with global). It is not immediately clear why global speculation improves throughput of the bank application, as it performs strictly more work per orchestration. We believe the reason is that the much lower latency (almost 5x) means each workflow spends less time in the system, leading to emptier queues, less memory consumption, and less GC overhead.

Take away: Netherite achieves close to 8x the throughput of the existing DF implementation. Speculation does not negatively impact throughput, but slightly improves it.

6.6 Scale-out Results (Q5)

In this section we conduct an experiment to evaluate whether Netherite can scale out with the addition of nodes.

Methodology. For this experiment, as before in Section 6.5, the load generators emit a fixed load that can saturate the throughput of the full configuration (4 or 8 compute

6. https://github.com/Azure/azure-functions-durable-extension
7. AWS Rekognition has a limit of 50 invocations per second and the snapshot obfuscation workflow takes 20-25 minutes to complete.
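The load-control methodology for the throughput experiments can be sketched as follows. This is an illustrative simulation, not the paper's benchmark harness: the backend is modeled as a closed-loop system with assumed values for `capacity_rps` and `base_latency_s`, and the "visual" saturation check is replaced by a fixed improvement threshold.

```python
# Sketch of the throughput methodology: ramp up the number of closed
# request loops until throughput stops improving, then treat that load
# level as the saturating load. The backend here is a *simulated*
# service; capacity_rps and base_latency_s are illustrative assumptions.

def simulated_throughput(num_loops: int, capacity_rps: float = 500.0,
                         base_latency_s: float = 0.05) -> float:
    """Closed-loop model: each request loop has one request in flight,
    so offered load is num_loops / base_latency_s, capped by capacity."""
    offered = num_loops / base_latency_s
    return min(offered, capacity_rps)

def find_saturating_load(threshold: float = 0.05):
    """Double the number of request loops until the relative throughput
    gain falls below `threshold` (a stand-in for visible saturation)."""
    num_loops, prev = 1, 0.0
    while True:
        tput = simulated_throughput(num_loops)
        if prev > 0 and (tput - prev) / prev < threshold:
            return num_loops, tput
        prev = tput
        num_loops *= 2

loops, tput = find_saturating_load()
```

Once the saturating load level is found, the real experiment would hold it steady for a minute and average the measured completion rate over that window.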
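The scale-out experiment relies on Netherite's elastic partition balancing: a fixed set of 32 partitions is spread evenly over however many nodes are available, and all partitions rest in cloud storage when the node count is zero. The assignment policy below is a minimal sketch of that invariant, not Netherite's actual placement code.

```python
# Sketch of Netherite-style partition balancing: 32 fixed partitions
# are distributed round-robin across the available nodes. With zero
# nodes, no partition is loaded (they remain persisted in storage).
# Illustrative policy only; Netherite's real balancer moves individual
# partitions by persisting and recovering their state.

NUM_PARTITIONS = 32

def assign_partitions(num_nodes: int) -> dict[int, list[int]]:
    """Map each node id to the list of partitions it hosts."""
    if num_nodes == 0:
        return {}  # scale to zero: all partitions stay in cloud storage
    placement = {node: [] for node in range(num_nodes)}
    for p in range(NUM_PARTITIONS):
        placement[p % num_nodes].append(p)
    return placement

# e.g. on a four-node cluster, each node hosts eight partitions
four_nodes = assign_partitions(4)
```

With this policy a one-node cluster hosts all 32 partitions, a four-node cluster hosts eight per node, and a 32-node cluster hosts one per node, matching the configurations exercised in the scale-out experiment.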
Sebastian Burckhardt, Chris Gillum, David Justo, Konstantinos Kallas, Connor McMahon, and Christopher S. Meiklejohn
developer experience and broadens the scope of supported applications, and (2) the Netherite architecture improves the performance across the board, by orders of magnitude in some cases. Our work enables the development of full-featured, stateful, serverless applications that extend far beyond the scope of the original FaaS concept.

In future work, we would like to explore how to extend CCC to external services, and how to further improve Netherite by smarter scheduling of tasks and steps.