
Future Generation Computer Systems 105 (2020) 395–409


A model for distributed in-network and near-edge computing with heterogeneous hardware✩

Ryan A. Cooke, Suhaib A. Fahmy
School of Engineering, University of Warwick, Coventry, UK

Article info

Article history:
Received 3 May 2019
Received in revised form 25 November 2019
Accepted 30 November 2019
Available online 13 December 2019

Keywords:
Distributed computing
Edge computing
In-network computing
Hardware acceleration

Abstract

Applications that involve analysis of data from distributed networked data sources typically involve computation performed centrally in a datacenter or cloud environment, with some minor pre-processing potentially performed at the data sources. As these applications grow in scale, this centralized approach leads to potentially impractical bandwidth requirements and computational latencies. This has led to interest in edge computing, where processing is moved nearer to the data sources, and recently, in-network computing, where processing is done as data progresses through the network. This paper presents a model for reasoning about distributed computing at the edge and in the network, with support for heterogeneous hardware and alternative software and hardware accelerator implementations. Unlike previous distributed computing models, it considers the cost of computation for compute-intensive applications, supports a variety of hardware platforms, and considers a heterogeneous network. The model is flexible and easily extensible for a range of applications and scales, and considers a variety of metrics. We use the model to explore the key factors that influence where computational capability should be placed and what platforms should be considered for distributed applications.

© 2019 Elsevier B.V. All rights reserved.

✩ This work was supported in part by The Alan Turing Institute, UK under the UK EPSRC grant EP/N510129/1.
∗ Corresponding author.
E-mail addresses: [email protected] (R.A. Cooke), [email protected] (S.A. Fahmy).
https://doi.org/10.1016/j.future.2019.11.040

1. Introduction

Distributed data processing applications involve the processing and combination of data from distributed sources to extract value, and are increasing in importance. Emerging applications such as connected autonomous vehicles rely on complex machine learning models being applied to data captured at the edge, while also involving collaboration with other vehicles. Further example applications include factory automation [1], smart grid monitoring [2], and video surveillance and tracking [3]. Such applications present a challenge to existing computational approaches that consider only the cloud and the very edge of the network. Computationally intensive algorithms must now be applied to intensive streams of data, and latency must be minimized. In these applications, data sources transmit streams of data through a network to be processed remotely, with a focus on continuous processing, and potentially involvement in a feedback loop, as opposed to other applications that involve large scale storage and delayed processing. Latency, the time taken to extract relevant information from the data streams, and throughput, the rate at which these streams can be processed, are key performance metrics for such applications.

Centralized cloud computing is often utilized in these scenarios, since the data sources do not typically have adequate computing resources to perform complex computations. Applications also rely on the fusion of data from multiple sources, so centralized processing is useful. The cloud also offers benefits in scalability and cost, and has been shown to provide benefits in applications such as smart grid processing [2,4] and urban traffic management [5].

However, many emerging streaming applications have strict latency constraints, and moving data to the cloud incurs substantial delay. Furthermore, while the data generated by sources can be small, a high number of sources means that, in aggregate, the volume of data to be transmitted is high. For example, in 2011, the Los Angeles smart grid required 2 TB of streamed data from 1.4 million consumers to be processed per day [2]. Some applications, such as those dealing with video data, must also contend with high bandwidth data requirements.

These limitations have led to an increased interest in 'edge' or 'fog' computing, a loosely defined paradigm where processing is done either at or close to the data sources. This could mean at the source, such as on a sensor node with additional processing resources [6].

It can also encompass performing processing within the network infrastructure, such as in smart gateways [7], or in network switches or routers. Cisco offer a framework that allows application code to be run on spare computing resources in some network elements, and Ethernet switches from Juniper allow application compute to be closely coupled with the switching fabric.

Edge computing can also include the concept of 'cloudlets', which are dedicated computing server resources placed a few hops away from the data sources. These can vary in scale, from a single box placed on a factory floor to a small scale datacenter comprising multiple networked machines. While the data sources themselves may not have the required computing capabilities, these resources can support complex applications and are accessible at shorter latencies than a remote cloud [8].

In complex applications, it is likely that some processing, such as filtering and pre-processing, can be performed at the edge, greatly reducing the volume of transmitted data, and additional processing and fusion of data can be carried out in the cloud. The benefits of this approach are that latency sensitive parts of the application can be done locally, while more computationally intensive operations that may require more processing power or additional data can be done centrally. Stream processing applications are well suited to being partitioned and distributed across multiple machines, as is common in stream processing frameworks such as Apache Storm and IBM Infosphere Streams. Additionally, cloud service providers such as Microsoft Azure have edge analytics platforms that allow processing to be split between the cloud and the edge.

Edge and in-network computing is an emerging area. Cloudlets have been utilized for image processing applications [9,10] and augmented reality [11]. Platforms such as Google's Edge Tensor Processing Unit demonstrate that there is a trend towards moving complex computation closer to the data source. In-network computing has seen application for network functions, machine learning [12], and high data rate processing [13].

In order to explore the implications of distributing application computation across a network of heterogeneous compute platforms, a suitable model is needed. This would allow for the evaluation of different deployment strategies using metrics such as throughput and end-to-end latency. Existing models that deal with placement of processing on distributed nodes do not consider hardware resources, varied connectivity, and application features together.

To this end we have developed a generalized formulation that can represent applications and target networks with heterogeneous computing resources. It supports reasoning about the in-network and near-edge processing scenarios that are emerging, including both general processor based machines and hardware accelerator systems.

Fig. 1. An example of the type of networked system that the proposed model targets. Shaded nodes can perform computation.

Fig. 1 summarizes the application scenario of interest, giving an example of the type of networked system that the proposed model targets. Edge nodes such as sensors and microcontrollers transmit data through a network towards centralized computing resources. In a traditional cloud computing setup, only the central resources perform computation (shaded). In edge computing, the edge nodes are capable of performing some computation (shaded). In-network computing allows some tasks to be performed in the network as data traverses it, using smart switches (shaded).

The key contributions of this paper are:

• A model for evaluating different in-network computing approaches is developed, encompassing:
  – Multiple levels of network structure, unlike existing models that focus on clusters of machines.
  – Hardware heterogeneity including accelerator platforms, and the resulting differences in computing and networking.
  – Realistic representation of performance metrics, alongside energy and financial cost.
• The model is used to examine a case-study scenario and draw general lessons about in-network computing on different platforms using a set of synthetic applications.

2. Related work

Distributed Stream Processing Models: The allocation of streaming tasks to networked processing nodes has been explored in a variety of existing work. Applications are represented as a graph of tasks with edges representing dependencies, while networks are represented as a graph of compute nodes with edges representing links.

Earlier models such as Aurora/Medusa [14] focused on load balancing in task placement, primarily for the allocation of tasks to multiple servers in a datacenter environment. However, network costs are not modelled, making them unsuitable for scenarios that consider larger scale networks where communication and network costs are more significant. Work on more network-aware placement [15–19] was tailored towards networks of machines that are more widely distributed, and includes network utilisation and latency in its formulation. These models are all focused on placing operators to optimize specific objectives, for example bandwidth utilisation, meaning that they are not generalisable when wanting to model a range of different performance metrics. Since these online optimisations are run dynamically, the models are significantly simplified to minimize their impact on the application. These models consider homogeneous processor platforms and do not support alternative hardware platforms with different computational models and metrics.

Recently, more generalized placement models have emerged [20–22]. These focus on creating a general representation of the operator placement problem, developing formulations based on integer linear programming instead of focused heuristics. They are still limited as they assume a fully connected cluster of machines, and their models of computing resources and tasks are coarse grained. We are interested in a scenario where hardware acceleration may be utilized at certain computing nodes, using a different computational model to that of a processor, which is considered solely in these models.

In contrast to these works, our proposed model is not focused on finding an optimal allocation of tasks to a given set of resources at runtime. Instead, we wish to use it to investigate the implications of placing computing resources at different locations in a network and to understand the benefits and costs of doing

so. Since we are not concerned with dynamic optimisation of operator placement within a time constraint, the model can include more fine grained detail for tasks and hardware, accounting for hardware acceleration, heterogeneous resources required by tasks, the financial cost of adding additional compute capability to network nodes, and energy consumption. We also consider the networked system as a whole, from the sensor nodes to the datacenter, instead of focusing on a cluster of computational servers. The focus of this paper is not optimisation, but rather an analysis of different distributed computing paradigms in the context of streaming applications.

Edge/Fog Computing: In response to increasing demand for low latency in distributed streaming applications, efforts have been made to move computation closer to the data source, or the 'edge' of the network. Where processing occurs varies, and it is rare that the application is entirely pushed to the edge. Typically, operations such as pre-processing and filtering take place at the edge, with aggregation and decision making centralized. This approach has been applied to domains such as smart grid, radio access networks, and urban traffic processing [23–25]. The model we have developed is capable of representing this scenario.

In some cases a majority of the processing is performed at the data source. This is common in sensor networks, where communication costs are higher than computation costs. Some examples of this are TAG [26], directed diffusion [27], EADAT [28], and MERIG [29]. These models consider computing at the very edge of the network, unlike those discussed previously. Our proposed model can account for the energy costs of communication and computation, as well as representing heterogeneous network links, unlike these models.

Processing may also be offloaded to local 'cloudlets', servers dedicated to computation, a few hops away in the network from the data source. This approach can be seen in mobile edge computing, where processing data on a mobile device would consume too much power, and doing so in the cloud would lead to high latency [30]. Cloudlets have also been demonstrated in video processing and augmented reality applications [9–11] where latency is an important consideration.

In-network computing is another emerging paradigm in which traditionally centralized computation is distributed throughout the networking infrastructure. Devices such as network switches and gateways are extended to perform additional data processing as well as their network functions. This technique has been demonstrated to result in a reduction in data and execution latency in map reduce applications [12]. A key value store implemented on an FPGA based NIC and network switch outperformed a server based implementation [13]. In-network computation using programmable network switches for a consensus protocol was demonstrated in [31]. As the capability of this hardware improves, this method, in which networking elements are used both for moving data and for computing, is becoming more viable. Extending such capabilities to broader applications requires the ability to analyse applications composed of multiple dependent tasks and to determine how to allocate these to capable nodes. Our proposed model allows this to be explored in a manner not possible using existing distributed computing models.

Hardware acceleration: A primary motivation for this work is the increasing complexity of applications, growing volumes of data, and more widespread availability of alternative hardware such as GPUs and FPGAs that can boost the performance of these applications. Recent work has explored accelerators for a variety of algorithms relevant to networked systems [32–34]. Within the datacenter, heterogeneity has emerged as an important way to address stalled performance scaling and rising energy constraints. FPGAs can be integrated into datacenter servers for application acceleration [35]. FPGA partial reconfiguration [36] allows these hardware platforms to support sharing and virtualisation of multiple accelerators that can be changed at runtime [37], hence offering some of the flexibility of software platforms with the connectivity and performance of hardware. This trend is expected to continue with the deployment of FPGAs in commercial cloud computing datacenters [38,39]. Tightly coupling accelerators with the network interface has also been demonstrated to be effective in embedded networks [40] and the datacenter [41], and to have significant impact on streaming application latency [42]. To reflect the trend towards heterogeneity, our proposed model encompasses the idea of distinct hardware platforms with different computational characteristics. This further differentiates our work from others that consider only traditional processor based compute architectures.

3. Scenario and metrics

The scenario of interest comprises a set of distributed data sources producing continuous streams of data, connected through a network comprised of intermediate nodes (for example gateways, routers, or cluster heads) to a central data sink, such as a datacenter. These data sources could be cameras, streams of documents, environmental/industrial sensors, or similar. An application consisting of a set of tasks and their dependencies processes these streams to make a decision or extract value. These tasks operate on the different streams of data, and some combine information from multiple (possibly processed) streams. Individual tasks affect the data volume through a reduction factor that determines the ratio of input data to output data, which reflects the properties of many stream processing tasks. An example of such an application is a smart surveillance system that monitors video streams from many cameras to detect specific events. Video streams can come from a mix of fixed cameras and mobile platforms, with different resolutions, frame-rates, and interfaces, requiring different amounts of processing. The application uses processed information to adapt how the cameras are deployed and positioned.

In order to evaluate alternative allocations of resources and tasks, we consider the following key metrics of interest, with some explanation of how they are impacted below. We provide the comprehensive formulation of these metrics in Section 5.

3.1. Latency

Latency is important when data is time-sensitive. Fast detection of an event may have safety or security implications, or in some applications, there could be real-time constraints. In this case-study, transmitting all video streams to the cloud introduces large communication delays, and competition for resources in the cloud can add further latency. Performing computation closer to the cameras, whether at the cameras or in network switches, can reduce these communication delays, and distributing the tasks to different network nodes reduces the delays from sharing centralized resources. Even with less powerful hardware, latency can improve as a result of this stream processing parallelisation.

3.2. Bandwidth

Processing sensor data often reduces the size of data, outputting filtered or aggregated data, or simple class labels. Hence, if this processing is performed nearer to the data source, bandwidth consumption further up the network can be reduced significantly. There may also be scenarios where early processing can determine that a particular stream of data is useless, and hence further transmission can be avoided. In our example, some cameras may use low resolutions or frame rates, and hence be

less costly in terms of bandwidth, while others might require significantly higher bandwidth, which would be more efficiently processed nearer to the cameras. It is clear once again that this decision depends on the specific application and tasks.

3.3. Energy

Energy remains a key concern as cloud computing continues to grow; the power consumption of datacenter servers and the network infrastructure required to support them is significant. One approach vendors have taken to try and address this is to introduce heterogeneous computing resources, such as FPGAs, to help accelerate more complex applications while consuming less energy. However, these resources add some energy cost to the datacenter, in the hope that this will be offset by significantly increased computational capacity. There is similarly an energy cost for adding accelerators in the network infrastructure, but this is likely less than the cost of full server nodes, and leads to a reduced load on the datacenters as they then only deal with processed data. However, it is clear that energy consumption is heavily dependent on where such resources are placed. It is also possible that energy constraints at source nodes can impact what can be done there. In this example, battery-powered drones carrying cameras may have constrained power, so performing more computing there may not be viable.

3.4. Financial cost

Adding computing capabilities to all data sources is expensive, especially where the tasks to be performed are computationally expensive, possibly requiring dedicated hardware. In this example, the cameras would have to be smart cameras with dedicated processing resources attached, and this is likely to increase cost significantly. While centralising all computation is likely to be the cheapest solution in terms of hardware, placing some computation in the network can come close to that cost, while offering significant benefits in the other metrics.

4. Proposed model

The proposed model defines a network topology, task/operator graph, and hardware platforms. Tasks and hardware platforms can be allocated to network nodes, and values for the previously mentioned performance metrics can be calculated. The network communication topology is assumed to be pre-determined, though not the hardware at the nodes or the task allocation. The model is flexible enough to be used in a range of situations.

The logical topology of the network is represented as a graph, GN = (N, EN), where N is the set of network nodes, with bidirectional communication across the set of edges between them, EN. Application data travels through these nodes and edges towards a central sink. A node can represent either a single machine in the network, such as a gateway, switch, or server, or a 'tier' or 'level' of the network infrastructure. In the latter case a node represents multiple machines, but the connectivity between them is not modelled at the higher level in the graph (see Fig. 2). Using this representation allows the network topology to be represented in a tree structure, as suits the application models considered.

Fig. 2. Nodes in the network graph can represent a single device or a cluster of networked devices.

To represent the application, a directed acyclic graph (DAG) is used to define the relationships between tasks, GT = (T, ET), where T is the set of tasks and ET defines the dependencies between them. GT is a tree structure with a global task at the root, with other nodes representing sub-tasks, such as aggregations and pre-processing. This task model is based on the stream processing model, where data is processed per sample as it arrives.
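Section 6.3 later notes that the model is implemented in Python, using classes to represent nodes, platforms, tasks and implementations. As a rough, illustrative sketch of how the two graphs GN and GT could be captured in that style (the class and field names below are assumptions for illustration, not the authors' code):

# Illustrative sketch only: one possible Python representation of the network
# graph G_N and task graph G_T described above. Field names are assumed.
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class NetworkNode:
    name: str
    parent: Optional["NetworkNode"] = None      # a_n: next hop towards the central sink
    bandwidth: float = 0.0                       # b_n: outgoing link, data per unit time
    children: List["NetworkNode"] = field(default_factory=list)

@dataclass
class Task:
    name: str
    reduction: float                             # f_t, with 0 < f_t <= 1
    parent: Optional["Task"] = None              # a_t
    children: List["Task"] = field(default_factory=list)

def link(child, parent):
    child.parent = parent
    parent.children.append(child)

# Tiny example: two cameras -> gateway -> cloud, and a two-level task tree.
cloud = NetworkNode("cloud", bandwidth=1e9)
gateway = NetworkNode("gateway", bandwidth=1e8); link(gateway, cloud)
cam1 = NetworkNode("cam1", bandwidth=1e7); link(cam1, gateway)
cam2 = NetworkNode("cam2", bandwidth=1e7); link(cam2, gateway)

aggregate = Task("aggregate", reduction=0.5)
filter1 = Task("filter1", reduction=0.1); link(filter1, aggregate)
filter2 = Task("filter2", reduction=0.1); link(filter2, aggregate)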
Each task t ∈ T can be assigned to a node through an implementation. Implementations are pieces of software or hardware logic that can perform the required task. This allows for selection between software implementations and hardware architectures that may have different benefits and drawbacks. This is in contrast to previous work, which typically considers generic query operators and does not allow for the possibility of alternative implementations of a task. Implementations and tasks are treated as black boxes that take inputs and produce outputs, and have already been benchmarked to determine an estimate of processing time and energy consumption on a reference platform with no other tasks running. A set of platforms that can be assigned to nodes, P, to execute the tasks can have varying computational models and available resources.

4.1. Tasks

T = {t1, t2, t3, ..., tT} is the set of application tasks to be allocated to nodes in the network. Individual tasks represent functions to be carried out on a data stream. Together, tasks represent the operations performed on each data stream, and specify how they are combined and manipulated to extract value. In this model, data is consumed by a task and transformed, with the result passed to the parent task. Task dependency is captured in the DAG, with each task unable to begin until all of its child tasks have been completed on a given instance of data—tasks with multiple children are typically aggregation operations. Each task t ∈ T is defined by t = (ft, Mt, Ct, at).

• The set Ct ⊂ T contains the prerequisite tasks for t that must be completed before task t can begin—its child tasks;
• at ∈ T is the parent task of t, which cannot begin until t has finished;
• ft is the reduction factor, where 0 < ft ≤ 1. This parameter represents the amount by which a task reduces the volume of data it operates on;
• Mt ⊂ M is the set of implementations that can implement the functionality of t;
• The data into (operated on by) a task t, denoted δt, is the sum of the data out from all sub tasks, δt = \sum_{i=0}^{|C_t|} d_i;
• The data output from a task, dt, to be processed by the task's parent task, is given by dt = ft δt.

This representation of tasks supports different types of operations, for example a filtering task that reduces a data stream, or an aggregation task that merges multiple streams. Traditionally, aggregation tasks that process several data streams would have to be centralized, but in this model they can be placed at intermediate nodes that have access to the requisite streams.
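To make the data-volume bookkeeping concrete, the short sketch below propagates data volumes through a small invented task tree using the relations δt = Σ di and dt = ft δt; the task names, reduction factors and frame sizes are illustrative only.

# Illustrative only: data-volume propagation through a small task tree.
# Each entry maps a task to (reduction factor f_t, list of child tasks).
tree = {
    "aggregate": (0.5, ["filter1", "filter2"]),
    "filter1":   (0.1, []),
    "filter2":   (0.1, []),
}
source_data = {"filter1": 1e6, "filter2": 1e6}    # d_s per leaf task, in bytes

def data_out(task):
    f, children = tree[task]
    # delta_t: sum of the children's outputs, or the raw source data for a leaf task
    delta = sum(data_out(c) for c in children) if children else source_data[task]
    return f * delta                               # d_t = f_t * delta_t

print(data_out("aggregate"))                       # 0.5 * (0.1e6 + 0.1e6) = 1.0e5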
4.2. Implementations

M = {m1, m2, m3, ..., mM} is the set of all implementations, which are the pieces of software or hardware that implement the functionality of a task. Implementations can represent different software algorithms or hardware accelerator architectures that give the same functionality but have different computational delays or hardware requirements. Each task t ∈ T has a set of implementations Mt, and each m ∈ M is defined by m = (tm, τm, Rm, hm).

• tm ∈ T is the task that is implemented by m;
• the set Rm = {rm1, rm2, rm3, ..., rmR} contains the amount of each resource needed to be able to host the implementation, such as memory, FPGA accelerator slots, etc.;
• τm is the time taken for this implementation to complete the task it implements per unit of data, compared to a reference processor;
• hm = {0, 1} signals whether the implementation is software or hardware. A value of 0 is software, 1 is hardware.

4.3. Platforms

Platforms represent the systems in a network node that can carry out tasks. We define P = {p1, p2, p3, ..., pP} as the set of platforms that could be assigned to node n ∈ N. Each platform p ∈ P is defined by p = (ep, cp, wp, Rp, hp), where:

• ep is the execution speed of the platform relative to a reference processor—this represents different processors having different computing capabilities;
• cp is the monetary cost of the platform;
• wp is the power consumption of the platform;
• Rp = {rp1, rp2, rp3, ..., rpR} is the set of resources available on the platform, such as memory, FPGA accelerator slots, etc. Resources are required by implementations;
• hp = {0, 1} indicates whether the platform runs software or hardware versions of tasks. A value of 0 means the platform is a processor that executes software, and a value of 1 means the platform is a hardware accelerator that executes application-specific logic. This is used to ensure correct allocation of software and hardware implementations.

Unlike existing work, this model makes the distinction between platforms that execute software code and hardware acceleration platforms such as FPGAs, as they have different computational delay models, discussed in Section 5.1. Hardware acceleration platforms incur no latency penalty when multiple tasks are present on the same node, whereas software platforms do, as a result of contention for computing resources.

4.4. Network

N = {n1, n2, n3, ..., nN} is the set of network nodes, for example sensors, gateways, routers, or servers. Each n ∈ N is defined by n = (an, Cn, Pn, bn), where:

• an ∈ N is the parent node of n, linking it towards the central data sink;
• Cn ⊂ N is the set of child nodes of n, linking it towards the source(s);
• Pn ⊂ P is the set of platforms that can be assigned to node n. For example, a large datacenter class processor that must be housed in a server rack cannot be placed on a drone;
• bn is the outgoing interface between the node n and its parent node, and represents the bandwidth in terms of data per unit time.

Table 1
Summary of symbols used in formulation.

Symbol      Meaning
xnm         Allocation of implementation m to node n
ynp         Allocation of platform p to node n
znmp        Allocation of m and p to n
unm1m2p     Allocation of m1, m2, and p to n
τmax        Maximum path delay
g           Throughput
Kt ⊂ T      Set of tasks lower than t in task sub-tree with t at the root
Kn ⊂ N      Set of nodes lower than n in network sub-tree with n at the root
Ds ⊂ N      Set of nodes on path from s to root node
vnp         1 if p ∈ Pn, 0 otherwise
Ph ⊂ P      Set of all platforms that run hardware implementations
Ps ⊂ P      Set of all platforms that run software implementations
H           Set of all paths from leaves to root in task graph
Ht ⊂ H      Set of tasks on path from leaf task t to root
OHt         Set of all other tasks not on path Ht
I ⊂ M       Set of all software implementations
φmpt        Time to complete task implementation on node
q           Bandwidth of streams/tasks
L ⊂ T       Set of tasks with no child tasks
SKn         Set of all sources that lie beneath node n

4.5. Sources and data

S = {s1, s2, s3, ..., sS} is the set of data sources. We model data as continuous streams, as we are interested in applications that process and merge continuous streams of data. A data source could represent a sensor, database, video, or other source that injects a stream of data into the network. Each s ∈ S is defined by s = (ns, ts, ds, es).

• ns ∈ N is the parent node of the source, the node where the data stream enters the network;
• ts ∈ T is the task to be performed on data being produced by the source;
• ds is the amount of data in one instance from this source per period es;
• es is the period between subsequent units of data of size ds entering the network.

The model assumes a constant periodic stream of data from the source, such as a periodic sensor reading, a frame of video, or a set of captured tweets, for example. There are some systems that do not fit this model—for example, where sensors may only send out data if some change is detected. This case can still be represented in the proposed model, as the sensor is still continually capturing data as a source and the detection component can be modelled as a filtering task that reduces it.

4.6. Allocation variables

Boolean variables represent the allocations of tasks and hardware to network nodes. xnm = {0, 1} represents the allocation of an implementation m ∈ M to node n ∈ N. Similarly, ynp = {0, 1} represents the allocation of platform p ∈ P to node n ∈ N. znmp = {0, 1} represents the allocation of platform p ∈ P and implementation m ∈ M to a node n ∈ N, and is tied to xnm and ynp using a set of constraints.

A summary of the symbols used in the model is presented in Table 1.
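The paper does not list the constraints that tie znmp to xnm and ynp. One standard way to linearise such a product of Boolean variables—offered here only as an illustrative assumption, not as the authors' formulation—is:

z_{nmp} \le x_{nm}, \qquad z_{nmp} \le y_{np}, \qquad z_{nmp} \ge x_{nm} + y_{np} - 1

so that znmp = 1 exactly when implementation m is placed on node n and node n is assigned platform p.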
4.7. Constraints

Constraints are used to ensure correct allocation of tasks, platforms, and nodes.

4.7.1. Allocate tasks only once

\forall t \in T, \quad \sum_{i=0}^{|N|} \sum_{j=0}^{|M_t|} x_{ij} = 1    (1)

4.7.2. One platform per node

\forall n \in N, \quad \sum_{i=0}^{|P|} y_{ni} = 1    (2)

4.7.3. Resource availability

Allocations cannot exceed the available resources for the platform assigned to a node:

\forall n \in N, \forall e \in R, \quad \sum_{i=0}^{|T|} \sum_{j=0}^{|M_i|} x_{nj} r_{je} \le \sum_{k=0}^{|P|} y_{nk} r_{ke}    (3)

4.7.4. Additional constraints

The model allows for additional constraints to be added in order to better model a specific system or set of requirements. Constraints can be added to give certain tasks deadlines, constrain bandwidths, restrict specific nodes to certain platforms, and more.
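Section 6 later mentions that the model is driven through the Python PuLP front end and the CBC solver. Purely as an illustration of how constraints (1)–(3) might be written in that setting, the sketch below builds a toy instance; the nodes, tasks, implementations, platforms and resource figures are invented for the example and are not taken from the paper.

# Illustrative sketch of constraints (1)-(3) in PuLP, on invented toy data.
import pulp

nodes = ["cam", "gateway", "cloud"]
tasks = {"filter": ["filter_sw", "filter_hw"], "aggregate": ["agg_sw"]}
platforms = ["cpu", "fpga"]
need = {"mem":   {"filter_sw": 64, "filter_hw": 0, "agg_sw": 128},
        "slots": {"filter_sw": 0,  "filter_hw": 1, "agg_sw": 0}}
have = {"mem":   {"cpu": 1024, "fpga": 256},
        "slots": {"cpu": 0,    "fpga": 3}}

prob = pulp.LpProblem("placement", pulp.LpMinimize)
impls = [m for ms in tasks.values() for m in ms]
x = pulp.LpVariable.dicts("x", (nodes, impls), cat="Binary")       # x_nm
y = pulp.LpVariable.dicts("y", (nodes, platforms), cat="Binary")   # y_np

for t, ms in tasks.items():                                        # (1) each task placed exactly once
    prob += pulp.lpSum(x[n][m] for n in nodes for m in ms) == 1
for n in nodes:                                                    # (2) one platform per node
    prob += pulp.lpSum(y[n][p] for p in platforms) == 1
for n in nodes:                                                    # (3) resource availability
    for r in need:
        prob += (pulp.lpSum(x[n][m] * need[r][m] for m in impls)
                 <= pulp.lpSum(y[n][p] * have[r][p] for p in platforms))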
5. Performance metrics

As previously mentioned, there are five main metrics of interest in this analysis: latency, throughput, bandwidth, energy consumption, and financial cost. In this section we formulate these metrics, and discuss how the formulation allows each to be evaluated.

5.1. End-to-end latency

The end-to-end latency is the total time between an instance of data entering the network and its root task being completed. For example, this could be the time between a sensor reading or image being taken and a fault or anomaly being detected. This value is of interest in time-sensitive applications such as those concerned with safety or closed-loop control, such as for industrial equipment, or coordinated control. The model incorporates several assumptions and behaviours that are relevant for this metric:

• Sources s ∈ S produce continuous streams of data of an amount ds, every period of time es. We take a 'snapshot' of the network at any instance of time, and say that data is entering the system at this instant from all sources, of an amount ds. The equation we form gives the latency of the data instances entered at the beginning of this 'snapshot';
• Only one software implementation can run at a time on a node. Software runs on a first in first out basis;
• Hardware implementations of tasks operate independently from one another so can operate in parallel;
• A task cannot begin until all of its child tasks have been completed;
• Tasks start as soon as all of the data required is available, and once completed send the result to the next task as soon as possible;
• Communication and computation happen independently and can be parallel to each other;
• There is no communication time between tasks on the same node.

As tasks can only begin once their child tasks are complete, we can say the root task of the graph G(T, ET) can only start once all paths to it are complete. The end-to-end latency is therefore equal to the longest path delay of the task graph, including network and computation delay.

5.1.1. Computation delay

The time to complete one task on the node it is allocated to can be represented as:

\phi_{mpt} = \tau_m e_p \delta_t    (4)

for a task t, implemented with m on platform p. To find the end-to-end latency, the values of φmpt for each path in the task tree are summed, and the maximum value determined.

In the case of software implementations, nodes are assumed to carry out one task at a time. So in the case of multiple tasks being assigned to the same node, in the worst case scenario, a data instance must wait for all other tasks not in the path to finish before beginning the next task. Note that this applies even if a node supports concurrent software tasks, since we assume that multiple software tasks suffer degraded performance in proportion to the parallelism applied. Unlike some other works, which are only concerned with preventing the allocated tasks exceeding a measure of available resources on a platform, running multiple software tasks at once on the same node in our model affects computational delay. For hardware implementations we make no such assumption, as they can operate in parallel as separate streams since they are spatially partitioned, and so it is sufficient to only sum the path of interest, though we do factor in available hardware resources as discussed later. This distinction between software and hardware implementations of tasks better represents the growing trend of using alternative computing platforms to accelerate computation, compared with previous work that only accounts for software running on processors. Fig. 3 shows this difference in scheduling for software and hardware nodes. On software nodes, tasks are performed in series in the worst case, and on hardware nodes, tasks can be performed concurrently. In this example, this means that tasks C and D can be performed in parallel to tasks A and B. Task E is dependent on tasks B, C and D, so must happen once they are completed. The added concurrency of hardware accelerator nodes helps reduce task execution latency when multiple tasks are assigned to a node.

Fig. 3. The difference in how a set of tasks allocated to a single node are scheduled on software and hardware accelerator nodes.

In order to represent this behaviour, a set of new allocation variables is introduced: u. Each one of these, unm1m2p = {0, 1}, represents the allocation of two implementations m1 and m2 to node n, assigned platform p.
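As a small, invented numerical illustration of Eq. (4), the helper below computes φmpt for one task; the figures in the example are assumptions, not values from the paper.

# Illustrative only: Eq. (4), phi_mpt = tau_m * e_p * delta_t.
def computation_delay(tau_m, e_p, delta_t):
    # tau_m:   time per unit of data for implementation m on the reference processor
    # e_p:     execution speed of platform p relative to the reference (smaller = faster)
    # delta_t: volume of data entering task t
    return tau_m * e_p * delta_t

# e.g. tau_m = 2 ms per MB on the reference CPU, a platform twice as fast (e_p = 0.5),
# and 10 MB of input data gives phi = 2 * 0.5 * 10 = 10 ms.
print(computation_delay(2.0, 0.5, 10.0))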

The set of tasks on the path from a leaf node of the task graph, t, to the root of the task graph is Ht ⊂ T. Let the set H contain all of the task path sets (Ht ∈ H). The set OHt is declared, containing all other tasks not on the path Ht. The set I ⊂ M is defined as the set of all software implementations. The computation time for a path Ht, τHt c, in the task tree is given by:

\tau_{H_t c} = \sum_{i=0}^{|N|} \sum_{j=0}^{|H_t|} \sum_{k=0}^{|M_j|} \sum_{l=0}^{|P|} \left( z_{ikl}\,\phi_{klj} + \sum_{m=0}^{|O_i|} \sum_{q=0}^{|I_m|} u_{ikql}\,\phi_{qlm} - \sum_{r=0}^{|I_a|} u_{ikrl}\,\phi_{qlm} \right)    (5)

The znmp term in this equation is the sum of delays on all paths of the task tree. The unm1m2p terms represent the extra delays on the path caused by having multiple tasks not on the same path allocated to the same node in software. The computation times of any other tasks allocated to the same node as any task in the path are added. The subtraction is present to ensure that this computation time is only added once for each set of tasks in a path allocated to the same node.

5.1.2. Communication delay

A simple communication model is used where tasks send data to the parent task as soon as it is ready. There is no communication cost between tasks, only between nodes. Communication and computation can occur simultaneously and independently. If a node receives data for a task not assigned to it, it forwards this data immediately to the next node.

Data is transferred from one node to another when a task's parent task is allocated to another node. Similarly to the computational delay, we can find the total communication delay τHt m between tasks in each path in the task tree Ht ∈ H:

\tau_{H_t m} = \sum_{i=0}^{|H_t|} \sum_{j=0}^{|N|} \sum_{k=0}^{|K_j|} \left( \sum_{l=0}^{|M_i|} \frac{x_{kl}\,d_i}{b_k} - \sum_{m=0}^{|M_{a_i}|} \frac{x_{km}\,d_i}{b_k} \right)    (6)

The delay from the data source s on path Ht to the node that performs the first task on it, τHt s, is given by:

\tau_{H_t s} = \sum_{i=0}^{|D_s|} \left( \frac{d_s}{b_i} - \sum_{j=0}^{|K_i|} \sum_{k=0}^{|M_{t_l}|} \frac{x_{i t_l}\,d_s}{b_i} \right)    (7)

where tl is the leaf in the task path. The total communication delay in a path, τHt k, is thus:

\tau_{H_t k} = \tau_{H_t s} + \tau_{H_t m}    (8)

The proposed model can be extended to incorporate different communication delays for software and hardware tasks, as would be the case for network-attached hardware accelerators that can process packets with lower latency. The computation and communication latencies are likely to vary in reality. This model considers the worst case latency, where a node processes all other tasks first and transmits the results last.

5.1.3. Total delay

The total latency for a path, τHt, is equal to:

\tau_{H_t} = \tau_{H_t k} + \tau_{H_t c}    (9)

The largest of these values is the total latency, τmax.

Although we have discussed a scenario where only a single task graph is present, the model allows the possibility of multiple independent task graphs representing separate applications. Using the same method and equations, a τmax can be formulated for other task graphs.

5.2. Throughput

The throughput of the system is the rate at which results are output, and is dependent on the node with the longest processing time in the network. A continuous variable g can be introduced to represent the maximum delay processing stage. For software implementations, where only one task can run on a node at any time, this can be expressed:

\forall n \in N, \quad g \ge \sum_{i=0}^{|T|} \sum_{j=0}^{|P_s|} \sum_{k=0}^{|M_i|} z_{nkj}\,\phi_{kji}    (10)

where Ps is the set of all platforms that run software implementations. For platforms that run hardware implementations, Ph:

\forall t \in T, \quad g \ge \sum_{i=0}^{|P_h|} \sum_{j=0}^{|M_t|} z_{nji}\,\phi_{kji}    (11)

The throughput, v, can then be expressed:

v = 1/\max(g)    (12)

5.3. Bandwidth

Bandwidth utilisation can be very significant in scenarios involving information sources with dense data and for large networks and applications. Poor utilisation can also lead to additional communication delays.

The bandwidth of a data stream at a source s, qs, is given by:

q_s = \frac{d_s}{e_s}    (13)

The bandwidth of a task t, denoted qt, is given by:

q_t = f_t \sum_{i=0}^{|C_t|} q_i    (14)

For leaf tasks tl, where |Ct| = 0, it is given by:

q_{t_l} = f_t q_s    (15)

The total bandwidth consumption at the output of a network node is the sum of the bandwidths of all streams passing through it:

q_{nc} = \sum_{i=0}^{|K_n|} \sum_{j=0}^{|T|} \left( \sum_{k=0}^{|M_j|} x_{ik}\,q_j - \sum_{l=0}^{|C_j|} \sum_{m=0}^{|M_l|} x_{im}\,q_l \right)    (16)

The data not yet processed by any tasks must also be taken into account. If SKn ⊂ Kn is the set of all sources that lie beneath n in the network graph, and L ⊂ T is the set of all tasks where |Ct| = 0:

q_{nl} = \sum_{i=0}^{|K_n|} \sum_{j=0}^{|L|} \left( \sum_{k=0}^{|SK_n|} q_k - \sum_{l=0}^{|M_j|} x_{il}\,q_{s_j} \right)    (17)

where qst is the bandwidth of the source that leaf task t operates on.

The total bandwidth at a node n ∈ N is given by:

q_n = q_{nc} + q_{nl}    (18)

This gives the bandwidth at each link between nodes.
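As an invented illustration of Eqs. (13)–(15), the snippet below follows a single camera stream through the gradient → normalisation → classification chain used later in the case study, applying the reduction factors reported in Table 2; the frame size and rate are assumptions.

# Illustrative only: bandwidth along one camera's task chain, Eqs. (13)-(15).
d_s, e_s = 2e6, 1.0 / 30                  # assumed: 2 MB frames, one every 1/30 s
q = d_s / e_s                             # Eq. (13): q_s = d_s / e_s = 6e7 bytes/s
for f_t in (0.77, 0.004, 0.16):           # gradient, normalisation, classification (Table 2)
    q = f_t * q                           # Eqs. (14)/(15): each task scales its input stream
print(q)                                  # about 3e4 bytes/s remain after classification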

5.4. Energy consumption

The energy consumption of the network can be relevant for a variety of applications. In an application that deploys remote nodes with limited power sources, such as a wireless sensor network, energy usage can be a significant constraint. Most related works do not consider computational energy costs. The energy used at a node n ∈ N depends on the power consumption wp of the platform p ∈ P at that node, and the times taken τm to complete the implementations mt ∈ M of tasks t ∈ T allocated to the node. Just as when formulating an equation for the end-to-end latency, taking a 'snapshot' of the network, the energy consumed by the network per data instance is given by:

\sum_{i=0}^{|N|} \sum_{j=0}^{|T|} \sum_{k=0}^{|M_j|} \sum_{l=0}^{|P|} z_{ikl}\,\phi_{klj}\,w_l    (19)

5.5. Financial cost

The simplest metric is the financial cost of the solution. An equation can be formed that represents the total cost of the system, based on the platforms selected at all of the nodes. The total cost of the solution is given by:

c_{max} = \sum_{i=0}^{|N|} \sum_{j=0}^{|P|} y_{ij}\,c_j    (20)

Financial cost is a concern as it is ultimately one of the key drivers in the decision of where to place computing capability, and will always be one of the largest barriers to achieving the best possible placement. We consider the added cost of the computing platforms required to implement the in-network computation.

5.6. Combined evaluation metrics

We have presented formulations for the five important performance metrics relevant for evaluating heterogeneous distributed systems. We have kept these distinct, as our proposed model is designed to be flexible enough to use for different scenarios and purposes, where the relative importance of these five metrics will vary depending on the application. Users of the model are able to build more complex metrics based on the requirements of their analysis, combining whichever of these five is relevant to their evaluation, and suitably weighting the different components.

We expect this model to be used in the design and evaluation of alternative structures for deploying heterogeneous applications. In such scenarios, a constraint-driven approach is more sensible than a combined metric, and our model supports such evaluations. For example, a required financial budget or latency target can be set and other metrics evaluated for different designs. If used to compare designs, the primary metric of importance can be evaluated, with constraints placed on the other metrics, such as the best latency for a fixed financial cost and energy budget.

We demonstrate the flexibility in this model in determining general lessons around the placement of tasks and hardware resources in our evaluation in Section 7.

6. Case study

In this section we investigate the implications of different placement strategies in a distributed object detection and tracking system. While the formulation presented in Section 4 can be used to create an optimal placement of computing resources and tasks for a given application and network, it might be argued that such a bespoke design would not be highly practical, since a more uniform approach to deploying computing resources is generally required, and the variability of applications might make a static allocation less ideal. Hence, we evaluate strategies for a representative application to learn general lessons about the placement of computing resources in such networks. We consider a network of cameras, some fixed and some mobile, such as drones, tasked with surveying an area to detect human presence. The images collected by each camera are processed through a sequence of tasks including the histogram of oriented gradients (HOG) and an SVM classifier to detect objects of interest, and a tracking algorithm is applied that relies on the fusion of data from multiple cameras.

6.1. Network

We choose a network structure that is generally representative of that seen in an application such as this. The outermost layer represents the very edge of the network, comprising the cameras themselves (layer A). The next layer represents an access or gateway layer that connects the cameras to the larger network (layer B). Each gateway and the connected sensors represent different areas that are to be monitored—for example, rooms or neighbourhoods. Cameras connect to this layer through interfaces such as 100 Mb Ethernet or 802.11 wireless LAN. We model a transfer time of 10 ns per bit of data for this layer. The next layer is a routing layer that connects the local network to the wider network, with higher speed and bandwidth interfaces such as 10G Ethernet (layer C). Here we model a latency of 0.1 ns per bit of data. Finally, there is the cloud layer, which houses the remote computing resources. To reach this layer, data must travel through the internet, for which we assume a communication time of 1 ns per bit, based on round trip times to AWS EC2 instances measured in [43]. These communication times are estimates and ignore frame/packet overheads and many other delays, but are there to model variation in transfer time between different layers.

The topology we use in this case study is shown in Fig. 4. It includes a mix of nodes with high and low fanout, and nodes at all of the layers discussed above. Links appear unidirectional as we assume data must flow through these layers in order to reach the cloud/datacenter. It is important to note that the layer B/layer C nodes do not represent individual machines, but rather layers of the network hierarchy, comprising multiple machines. Communication within these nodes is neglected in this case study.

6.2. Tasks

The HOG algorithm used in this case study has been previously implemented on a variety of computing platforms [44–46]. For the sake of this case study we break the algorithm down into 3 tasks: gradient computation, normalisation, and classification. While there are more tasks that form this algorithm, these 3 take a majority of the computation time and have a significant effect on data size. From [44] we obtain estimates for the reduction factor of each task. The tracking algorithm uses these HOG features and a KLT tracker [47], relying on fusion of data from multiple cameras. Therefore this task must be placed elsewhere in the network, at a location that can access all necessary cameras.

In our case study, each camera has a set of tasks, gradient comp → histogram → classification, associated with it, and then each area of cameras has a tracking task that processes the result from multiple camera chains.
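As a sketch of how these per-camera chains and per-area tracking tasks might be written down for the model, the snippet below builds the application graph as a list of (task, reduction factor, parent) entries; the camera and area names are invented, and the reduction factors are those later given in Table 2.

# Illustrative only: the case-study application graph for one monitored area.
REDUCTION = {"gradient": 0.77, "normalisation": 0.004,
             "classification": 0.16, "tracking": 0.16}

def build_area(area, cameras):
    tasks = [(f"{area}/tracking", REDUCTION["tracking"], None)]   # root for this area
    for cam in cameras:
        tasks += [
            (f"{cam}/gradient",       REDUCTION["gradient"],       f"{cam}/normalisation"),
            (f"{cam}/normalisation",  REDUCTION["normalisation"],  f"{cam}/classification"),
            (f"{cam}/classification", REDUCTION["classification"], f"{area}/tracking"),
        ]
    return tasks

graph = build_area("area1", ["cam1", "cam2", "cam3"])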

Fig. 4. Network structure used in this case study.

Table 2 Table 3
Computation times in milliseconds for each task on different platforms. Financial cost and power estimates for each platform.
Platform Grad, Hist Normalisaton Classification Tracking Platform Cost Power Consumption
Cortex A9 2000 3200 1900 2000 Arm Cortex A9 10 1
Intel i7 40 60 35 40 Intel Core i7 300 5
Intel Xeon 2.6 4.0 2.3 2.6 Intel Xeon 2000 100
Xilinx Zynq 260 400 240 260 Xilinx Zynq 250 5
Xilinx Virtex 6 1.3 2.1 1.2 1.3 Xilinx Virtex-6 1000 10
Reduction factor 0.77 0.004 0.16 0.16
Table 4
Performance metrics for different placement strategies using
software platforms.
6.3. Platforms Placement Latency Throughput Cost Energy
(seconds) (frames/s)
From previous work, we estimate the computation times for Centralized 1.95 3.43 2000 30.03
each task on the different platforms. Though these are estimates, Layer C 1.97 0.88 3200 23.00
and different implementations may have varying optimisations, Layer B 1.93 0.94 4100 23.00
the relative computation times are the important factor for this Layer A 7.16 0.14 2300 241.2
case study. If computation is placed at a camera node, we as-
sume an embedded platform. An embedded Arm Cortex A9 is
used in [44] to implement the HOG algorithm, so we use the
pipelines of the algorithm pipeline each, so 12 tasks. We as-
computation times presented there.
sume CPU based platforms have no limit to the number of tasks
If computing is placed at the access or routing layers, we
can assume a more powerful CPU is available. The work in [46] that can be running, though, as discussed in the formulation,
implements the algorithm on an Intel Core i7 processor. Finally, there is a latency penalty for sharing resources. We focus on
the cloud layer would use server class processors, such as the latency, throughput, energy consumption, and financial cost as
Intel Xeon platform used to implement the algorithm in [45]. the metrics of interest.
We also discuss the implications of using an FPGA to accelerate We use our model to build the above scenario and evaluate
tasks. The work in [44] presents an FPGA design that gives a different computation placement strategies. We implement the
speed up of around 7× on a Xilinx Zynq platform that could model in Python, using classes to represent the nodes, platforms,
be embedded at the camera. An FPGA accelerator implemented tasks and implementations, all containing members representing
on a larger Xilinx Virtex-6 FPGA was reported in [45], and we the various parameters discussed previously. Results are pre-
assume this is the FPGA platform available at other layers. We sented in Table 4 for software platforms and Table 5 for hardware
use the relative performance on these platforms to estimate the platforms. We show the latency, throughput, financial cost, and
computation time of the tracker task. Table 2 summarizes time total energy consumption of the entire system. Bandwidth results
taken for each task on each platform per frame. are not shown in this table as they are calculated per node in our
The costs of each platform are also relevant. In this case study
model.
we consider the extra costs associated with adding computing
Centralized Software: A typical approach to such an applica-
resources to different layers of the network. Cloud/datacenter
tion would be to centralize all tasks, performing them in software,
costs are difficult to estimate, so we assume that this central node
is present regardless of how we place other computing resources. transmitting all data to the cloud or datacenter. In this case study,
Table 3 summarizes approximate costs and power consumption this gives a latency of around 1.95 s, and a throughput of 3.4
for each of the platforms in arbitrary currency and energy units frames per second for each camera. Note that this is in the worst
based on costs we have determined from OEM suppliers, and case, where all camera streams compete for CPU resources. The
manufacturer power estimation utilities. large communication latency coupled with the large amount of
The FPGA resource utilisation estimates in the previously cited data being transmitted undermines the extra computing power
works suggest that both FPGA platforms can implement 3 full provided by the cloud. Energy consumption was also joint highest
404 R.A. Cooke and S.A. Fahmy / Future Generation Computer Systems 105 (2020) 395–409

Table 5 Table 6
Performance metrics for different placement strategies using Performance metrics for MILP optimisation of model.
hardware platforms. Placement Latency Throughput Cost Energy
Placement Latency Throughput Cost Energy (seconds) (frames/s)
(seconds) (frames/s) Optimized 0.87 133 9000 1.56
Centralized 1.68 133 13000 1.56
Layer C 0.844 133 14000 1.56
Layer B 0.8 133 16000 1.56
Layer A 0.94 1.10 11600 30.6 overall latency, and worse throughput. While the total energy
consumption for the layer A approach looks high, it is spread
across a greater number of nodes. Each layer A node actually has a
with this approach, as the Centralized hardware has the highest power consumption of approximately 0.956. The same processing
power consumption. hardware is implemented on the FPGAs in layers B and C, as well
In-network software: An alternative approach is to push processing tasks down into the network. One possibility is placing the gradient computation, normalisation, and classification tasks on the camera nodes (layer A), and placing the tracking tasks at the appropriate layer B nodes as they require information from a set of cameras. This results in a latency of around 7.16 s and a poor throughput of 0.14 frames per second, unsuitable for real time applications. The energy consumption seems high, but this value is the energy consumption of the entire system; the consumption at each individual node is much lower. While there is less communication latency, and fewer tasks competing for the same resources, the computing capability of these edge nodes is so low that the latency and throughput are much worse than the centralized placements.
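For reference, one lightweight way to encode such an application and placement for evaluation is a pair of dictionaries, as sketched below. The task names follow the case study, but the dependency structure, reduction factors, and layer labels are simplified assumptions rather than the model's actual input format.

# Tasks of the case-study pipeline with illustrative reduction factors (assumed values).
tasks = {
    "gradient":  {"depends_on": [],            "reduction": 0.5},
    "normalise": {"depends_on": ["gradient"],  "reduction": 0.5},
    "classify":  {"depends_on": ["normalise"], "reduction": 0.9},
    "track":     {"depends_on": ["classify"],  "reduction": 0.5},
}

# The in-network software strategy described above: per-camera tasks at layer A,
# tracking at the layer B node that aggregates those cameras.
placement = {
    "gradient": "layer_a",
    "normalise": "layer_a",
    "classify": "layer_a",
    "track": "layer_b",
}

# Simple sanity check: every task and all of its dependencies have a placement.
for name, spec in tasks.items():
    assert name in placement and all(d in placement for d in spec["depends_on"])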
Distributing tasks within the intermediate network infrastructure offers improved latency relative to placing tasks in layer A, but has minimal impact when compared to centralized placement. In this scenario, the reduced communication latency is offset by the increased computation latency. Layer B and layer C approaches introduce additional costs of 2100 and 1200 currency units respectively. The centralized solution also has 3.65× higher throughput than these approaches. This is because of its increased computing capability relative to these other nodes, meaning that there is less computation latency. Energy consumption is less than centralized software, due to the lower power consumption of the hardware. This energy consumption is also spread across a greater number of nodes, meaning each node consumes less energy.
Centralized Hardware: Utilising FPGA acceleration at the server node reduces the latency to 1.68 s, and increases throughput to 133 frames per second, as a result of reductions in computation latency. While the FPGA should in theory provide a greater performance boost than this, the time taken for data to travel to the cloud limits the improvement that can be achieved for the application. The energy cost of running these tasks in hardware is also much lower than in software. The FPGA accelerator has a lower power consumption, as well as lower computation time.
In-network hardware: Adding FPGA accelerators to layer C reduces latency to 0.84 s, and increases throughput to 133 frames per second due to the performance of the FPGA accelerators dramatically reducing computation latency. Placing FPGAs in layer B further improves latency to 0.83 s. These placements give improvements over the centralized FPGA approach due to the reduction in communication latency. There is little difference in latency between placing tasks predominantly in layers B or C, as the fast link between these layers means that there is minimal communication delay. The disadvantage of the in-network FPGA approach is the additional cost, with the layer B and C methods costing 16,000 and 14,000 currency units respectively. Moving all tasks in hardware to the layer A camera nodes offers improvements over the software equivalent due to the increased computing capability. It also improves over centralized approaches due to the reduced communication latency. However, the higher computation latency relative to layers B and C means that there is a higher overall latency, and worse throughput. While the total energy consumption for the layer A approach looks high, it is spread across a greater number of nodes. Each layer A node actually has a power consumption of approximately 0.956. The same processing hardware is implemented on the FPGAs in layers B and C, as well as when centralized. This results in the throughput being equal in all circumstances, despite the higher communication latency.

Optimal Placement: Our model can be used with a Mixed Integer Linear Program (MILP) solver to generate a specific task and hardware placement strategy to optimize any of the performance metrics detailed in Section 5. To do this, the Python PuLP front end was used to interface to the CBC solver. In this case, we optimize for latency, as in this example, energy and throughput are directly related to latency. We first generated the optimal latency placement, then ran the optimisation again with a latency constraint 5% higher than this value, but optimising for cost. This forces the solver to generate the cheapest placement that achieves a latency within 5% of the optimal value. As a result, our model generated a placement with the metrics shown in Table 6. This is presented for completeness; it may be argued that customising a network for a specific application is unlikely to be a common requirement. Hence, we have focused primarily on the general lessons learnt in terms of placement strategies for hardware in the network.

Table 6
Performance metrics for MILP optimisation of model.

Placement   Latency (seconds)   Throughput (frames/s)   Cost    Energy
Optimized   0.87                133                      9000    1.56
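As an illustration of this two-pass optimisation, the sketch below uses PuLP with the bundled CBC solver to first minimise a latency objective and then minimise cost subject to latency staying within 5% of the optimum. The task names, per-placement latencies, and costs are invented for illustration, and the simple additive objective is a stand-in for the full model formulation, which is not reproduced here.

import pulp

tasks = ["gradient", "normalise", "classify", "track"]        # illustrative task names
nodes = ["layer_a", "layer_b", "layer_c", "central"]          # illustrative placement options
# Hypothetical per-task latency (s) for each placement and per-placement hardware cost.
lat = {t: {"layer_a": 0.40, "layer_b": 0.10, "layer_c": 0.08, "central": 0.30} for t in tasks}
cst = {"layer_a": 400, "layer_b": 2000, "layer_c": 1500, "central": 0}

x = pulp.LpVariable.dicts("x", (tasks, nodes), cat="Binary")  # x[t][n] = 1 if task t runs on n

def new_problem(objective):
    prob = pulp.LpProblem("placement", pulp.LpMinimize)
    prob += objective
    for t in tasks:                                           # each task placed exactly once
        prob += pulp.lpSum(x[t][n] for n in nodes) == 1
    return prob

total_latency = pulp.lpSum(lat[t][n] * x[t][n] for t in tasks for n in nodes)
total_cost = pulp.lpSum(cst[n] * x[t][n] for t in tasks for n in nodes)

# Pass 1: minimise latency and record the optimum.
p1 = new_problem(total_latency)
p1.solve(pulp.PULP_CBC_CMD(msg=False))
best = pulp.value(total_latency)

# Pass 2: minimise cost, constrained to stay within 5% of the optimal latency.
p2 = new_problem(total_cost)
p2 += total_latency <= 1.05 * best
p2.solve(pulp.PULP_CBC_CMD(msg=False))
print({t: next(n for n in nodes if x[t][n].value() == 1) for t in tasks})

The same pattern extends to any of the model's metrics by swapping the objective expression.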
Summary: Improvements can be made to streaming application latency by pushing tasks into the network in either software or hardware. This also offers improvement in energy consumption at each individual node, important when there may be a limited power budget. There is a balance between the communication latency to reach higher capability nodes, and the benefits to computation latency that they provide. Placing tasks at the very edge of the network minimizes communication latency but is limited by poor computational capability. The cloud offers the highest computing capability but there is a communication latency bottleneck. The downside of using in-network task placement is the additional financial cost of the extra hardware. However, with the price/performance ratio for embedded devices scaling significantly faster than for server class CPUs, we expect this to improve over time.

6.4. Event driven simulation

We further developed a discrete event simulator written in Python using the SimPy library, to test the validity of results produced by our model. Data sources emit periodic packets of data into the network with the same topology and task structure. The tasks are allocated to the relevant nodes, and are executed at the nodes in a first-in first-out fashion, with priority given to the oldest data packets.
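The following is a minimal sketch of this style of simulation using SimPy. It models a single node shared by several unsynchronised periodic sources, serving packets oldest-first with a small switching delay; the timing constants and source count are assumptions for illustration, not the values used in our experiments.

import random
import simpy

SWITCH_DELAY = 0.001   # assumed per-node switching delay (s)
COMPUTE_TIME = 0.05    # assumed per-packet computation time at this node (s)
PERIOD = 0.1           # assumed source period (s)

def packet(env, node, created, latencies):
    # Lower priority value wins the queue, so the oldest packet is served first.
    with node.request(priority=created) as req:
        yield req
        yield env.timeout(SWITCH_DELAY + COMPUTE_TIME)
    latencies.append(env.now - created)

def source(env, node, offset, latencies):
    yield env.timeout(offset)                     # sources start out of sync
    while True:
        env.process(packet(env, node, env.now, latencies))
        yield env.timeout(PERIOD)

env = simpy.Environment()
node = simpy.PriorityResource(env, capacity=1)    # one processing element at the node
latencies = []
for _ in range(5):                                # five sources share the same node
    env.process(source(env, node, random.uniform(0, PERIOD), latencies))
env.run(until=100)
print("mean latency:", sum(latencies) / len(latencies))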
We expect differences in the reported latencies from the model and simulator primarily due to the more detailed task and communication scheduling in the simulator. The simulation processes individual packets as opposed to the abstract streams considered in the model. The data sources in the simulator emit packets with fixed periods but are unsynchronized, whereas the model implicitly assumes synchronisation. The simulation also takes into account a small switching delay at nodes, representing the transfer of data from received packets to the computing platform. We have not included various network related parameters in the simulation, as these are not influenced by the allocation of tasks and platforms.

Fig. 5. Difference between values calculated through the formulated model and a discrete event simulator for the same configurations and parameter values.

Simulations of the above scenario were run for 20,000 packets entering the network from each source. The sources were fixed to the same period, but set out of sync with each other, to a degree determined from a uniformly distributed random variable. Fig. 5 shows the deviation between the metrics predicted by the model and those measured in the simulation. We do not show financial cost, as there will be no difference between the simulation and model, and we do not show bandwidth as it is calculated for each individual node, not the system as a whole.

We see that if considering only software platforms, the difference between the model and simulator is close to 6%, and in hardware 7%. These differences stem from the data sources being out of sync, and the switching delays introduced at each node, not represented in the model. The ratio between computation time and network switching delay impacts this error, and hence in the case of hardware, where computation time is reduced, the overhead is more significant. However, these deviations are still well within tolerable levels.
7. Further analysis
While determining a fixed optimal solution for a given applica-
tion and network topology is possible by using an MILP solver as
we discussed, in this section, we consider synthetic scenarios in
an attempt to draw general lessons about distributed, accelerated
in-network computing. We explore how application and network
properties influence the decision on where to place computing
resources for this range of scenarios. Since latency is the primary
metric of interest, we focus on that in this section.
For this analysis, we use a Python script to pseudo-randomly
generate application task graphs. These are in a tree structure
with a maximum depth of 4 tasks, reflecting a realistic partitioning
granularity rather than a very fine grained structure that would
skew further towards in-network computing. We use a fixed net-
work structure, with a similar layer A/B/C hierarchy and interface
specification as used in the case study in Section 6, however with
8 layer C nodes, each serving 2 layer B nodes, each of which serves
5 layer A nodes.
Several constraints are placed on the task generation. The
tree is built up from leaf tasks, with a random variable determining
whether each task is connected to a new task or joins one already
existing in the tree. Tasks can only join other tasks whose leaf
tasks originate from nodes that share the same layer B parent. We
generate 100 random task trees in this manner, and the same 100
trees are used to evaluate each placement strategy, summarized
in Table 7. For the purposes of this analysis, we assume that there are no restrictions on the number of tasks that can be allocated to a node, and all tasks are to be executed in ‘software’, meaning that in our model there is a latency penalty dependent on the number of tasks allocated to the same node.

Table 7
Different placement policies used in our simulations.

Strategy        Explanation
Centralized     All tasks allocated to the root node
Pushed          All tasks pushed down toward the leaf nodes as much as possible
Intermediate    All tasks pushed down as far as possible, but not to leaf nodes
Edge/Central    Leaf tasks placed at leaf nodes, others placed centrally
Edge/Network    Leaf tasks placed at leaf nodes, others pushed down as far as possible but not to leaf nodes
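A compact way to realise the kind of generator described above is sketched below. It builds each tree bottom-up from leaf tasks, uses a random draw to decide whether a task feeds a new parent or joins an existing one, and approximates the grouping rule by only allowing joins between tasks whose leaves share a group; the depth handling and group encoding are simplifications rather than the exact procedure used for our experiments.

import random

MAX_DEPTH = 4

def generate_task_tree(leaf_groups, p_join=0.5, rng=random):
    # leaf_groups: lists of leaf task ids whose sources share the same layer B parent.
    parent = {}
    next_id = max(t for grp in leaf_groups for t in grp) + 1
    frontier = []
    for grp in leaf_groups:
        level = list(grp)
        for _ in range(MAX_DEPTH - 2):                   # intermediate levels within the group
            new_level = []
            for t in level:
                if new_level and rng.random() < p_join:
                    parent[t] = rng.choice(new_level)    # join a task already in the tree
                else:
                    parent[t] = next_id                  # connect to a newly created task
                    new_level.append(next_id)
                    next_id += 1
            level = new_level
        frontier.extend(level)
    root = next_id                                       # a single sink collects remaining outputs
    for t in frontier:
        parent[t] = root
    parent[root] = None
    return parent

# 100 random trees over two groups of five sources, reused for every placement strategy.
random.seed(1)
trees = [generate_task_tree([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]) for _ in range(100)]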
In this section, we focus on the latency metric, as latency reduction is one of the main motivations behind in-network and edge computing. The case study in Section 6 showed that latency and throughput are closely related for these types of streaming applications.

7.1. Relative computing capability

A key factor that determines where to place tasks is the relative computing capability that can be accessed at different layers of the network. In general, the closer to the centre a node is, the greater the computing capability, since the cost is amortized across more streams. The resources at the edge of the network are more likely to be limited due to space, energy, or cost constraints, while nodes further up in the hierarchy will have access to better hardware. However, using better resources further up the network entails a communication latency penalty, which must be overcome by improved computation latency. For this comparison, we set tasks to have a reduction factor of 50% and equal latency on the same platform. Fig. 6 shows how different placement strategies impact latency, for different relative computing capabilities.
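To make the trade-off concrete, the toy calculation below estimates end-to-end latency when the whole task chain runs at a single layer, combining per-hop transfer time, a capability-scaled computation time, and a simple contention penalty. The hop delays, base task time, and task count are invented for illustration, and the formula is a deliberately crude stand-in for the full model.

HOP_COST = [("A", 0.0), ("B", 0.020), ("C", 0.002), ("central", 0.080)]  # assumed per-unit-data hop delays (s)
BASE_TASK_TIME = 0.05   # task time on a capability-1.0 node (s)
REDUCTION = 0.5         # fraction of data remaining after the task chain runs
N_TASKS = 4             # tasks contending for the chosen node

def latency(compute_layer, capability):
    t, data = 0.0, 1.0
    for layer, hop in HOP_COST:
        t += hop * data                                         # transfer time to reach this layer
        if layer == compute_layer:
            t += N_TASKS * BASE_TASK_TIME / capability[layer]   # contention and capability scaling
            data *= REDUCTION                                   # only reduced results travel onwards
    return t

for label, cap in [("1:1:1",    {"A": 1, "B": 1,  "C": 1,  "central": 1}),
                   ("1:50:50",  {"A": 1, "B": 50, "C": 50, "central": 50}),
                   ("1:50:500", {"A": 1, "B": 50, "C": 50, "central": 500})]:
    print(label, {layer: round(latency(layer, cap), 3) for layer in ("A", "B", "C", "central")})

Sweeping the capability values in this way mirrors the ratios examined below, although the absolute numbers are meaningless.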
Fig. 6. Latency comparison for different Layer A:B:C computing capability ratios.
In the unlikely case where computing capability is equal across all layers (i.e. a Centralized:B/C:A computing capability ratio of 1:1:1), pushing all tasks as close to the data sources as possible yields the lowest latency as there is minimal communication delay, and no benefit to placing tasks higher up. This may be the case if the network is a commodity cluster of homogeneous machines. Computation time is also improved since the tasks are distributed across many nodes, resulting in less contention than for a centralized placement.

If the compute capability at the data sources is significantly smaller (50× in this case), while the rest of the network offers equivalent computing capability (a ratio of 1:50:50), pushing tasks down to intermediate nodes offers the best latency. In this case, the slight reduction in communication latency gained through placing tasks at the data sources is outweighed by the computation latency penalty. Placing them any closer to the central node adds further communication latency with no additional benefit, and causes contention due to more tasks being allocated to fewer nodes.

The more likely case is that resources at the central node are more capable than intermediate nodes, which offer greater capability than the edge. In the case of the central node being 5× more capable than the intermediate nodes (a computing capability ratio of 1:50:250), pushing tasks as low as possible into the intermediate nodes still outperforms the centralized solution, as tasks are distributed to a larger number of nodes, reducing computation latency. Increasing the difference in computing power to 10× (1:50:500) causes the central solution to become dominant.

Hence, we see that a key requirement for in-network computing to be feasible is that suitably capable computing resources be employed for executing tasks in the network. The more capable the edge nodes are in comparison to the root node, the greater the benefits of placing tasks further towards the edge.

7.2. Task data reduction
Fig. 7. Latency comparison for different edge task data reduction factors.
The time taken to transmit data further up the network is tied to the amount of data being transmitted. Tasks can reduce data by varying degrees, and this impacts the balance between computation and communication latency. For this experiment, we modify the reduction factors of tasks to observe the impact on latency. We use the same network topology as in the previous experiment, and the same method of generating task trees. To mimic a realistic scenario, we use a 1:50:500 relative computing capability configuration, as discussed in Section 7.1.

Fig. 7 shows how different placement strategies impact latency, for different task reduction factors. If data is dramatically reduced by tasks close to the edge of the task tree, placing tasks as close as possible to the data source is more likely to provide a latency improvement as the communication cost for every other transfer between nodes is reduced. We see that intermediate placement reduces latency by 5× compared to a centralized allocation in such a scenario. Placing all tasks at the edge results in 30% worse latency, despite the reduced communication latency, due to the low computing capability of these nodes. Placing only the leaf task at the edge and the rest either in the network or at the central node also provides a significant reduction in latency in this scenario.

If data is not significantly reduced in the task tree, or only at tasks higher up in the tree, placing tasks towards the central sink is preferred, especially if those resources are more capable. A centralized placement provides the best latency in a majority of cases, although only by a slight margin. For some task trees, the in-network approach is superior. This result is impacted by the relative computing capability of layers. For scenarios where the central node is much more capable than the rest of the network, the instinct is to place tasks there. However, if data is reduced significantly at the leaf tasks then placing tasks in the network can reduce communication latency significantly.

It can be seen that, generally, the closer to the edge of the task tree that data is reduced, the greater the benefits of placing tasks closer to the edge of the network.

7.3. Network structure

The structure of the network determines to what extent tasks can be distributed and parallelized and how much they must compete for resources. Related to this is the structure of the application task graph; having tasks that require data from multiple sources closer to the root of the tree means that tasks cannot be pushed down into the network to a layer with more computing nodes. To investigate this factor, we consider different network structures and their impact on latency, as shown in Fig. 8. The tasks were generated with the same method as before, and network nodes had the same computing capability as in Section 7.2. All tasks were set to a fixed reduction factor of 0.5.

Firstly, we examine a network with low fanout, where layer B nodes each have 2 layer A nodes attached. While this means that there were more available resources towards the edge of the network, in many cases pushing tasks into the network results in almost 2× the latency of a centralized solution. Tasks that require data from more than one source must be pushed further up the network, adding additional communication latency. Additionally, as there are few layer C nodes, these nodes are over-utilized. Increasing the number of layer C nodes, or the computing capability of these nodes, would offer performance benefits in this scenario.

Raising the fanout of the layer B nodes to 5 instead of 2 increases the benefits of pushing tasks into the network. As more sources share the same paths towards the central node, there is a higher chance that a task that works with data from multiple sources can be placed closer to the edge. Increasing the number of nodes at layer C in this case again slightly decreases latency, as tasks that do have to be placed there have access to more resources.

Further increasing the fanout of the layer B nodes to 20 starts to increase latency again, up to around 0.45× the centralized placement. Increasing it to 40 increases the latency to around 0.7× the central placement.

A larger fanout at layer A (the edge layer), up to a point, means that there is a greater benefit of pushing tasks down towards the network edge, as there are more opportunities to place tasks that require data from multiple child tasks closer to the edge. However, if the fanout is too great, resource competition starts to reduce the benefits of this approach.

It can be seen that there exists a trade-off between having multiple sources connected to the same path of nodes, and creating too much resource contention by having too many tasks assigned to the same intermediate nodes.
Fig. 8. Latency comparison for different network fanout factors.
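The different fanout configurations can be generated programmatically; the helper below builds the layered topology as parent pointers so that placement strategies can be swept over it. The representation and defaults are illustrative rather than the exact data structures used in our experiments, although the defaults match the fixed structure described above.

def build_network(n_layer_c=8, b_per_c=2, a_per_b=5):
    """Return a dict mapping each node name to its parent, for a
    central -> layer C -> layer B -> layer A hierarchy."""
    parent = {"central": None}
    for c in range(n_layer_c):
        c_name = f"c{c}"
        parent[c_name] = "central"
        for b in range(b_per_c):
            b_name = f"c{c}_b{b}"
            parent[b_name] = c_name
            for a in range(a_per_b):
                parent[f"{b_name}_a{a}"] = b_name
    return parent

# Sweep the layer B fanout values examined in Fig. 8.
for fanout in (2, 5, 20, 40):
    net = build_network(a_per_b=fanout)
    print(fanout, "layer A nodes:", sum("_a" in n for n in net))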
7.4. Hardware acceleration

There are several key points we can take away from this investigation. In-network computing is more effective in situations where the edge and intermediate nodes are comparable in capability to the central node. While this is unlikely with traditional software processing platforms, it makes the case for trying to integrate hardware accelerators such as FPGAs into the network, as they can provide processing times closer to the more powerful processors found in a datacenter environment. We can also see that in-network computing provides more benefits in applications where data is more greatly reduced in tasks closer to the edge of the task tree. These tasks can often be large filtering or pre-processing tasks, and in order to place them close to the edge of the network, more capable hardware is required. This again makes the case for hardware acceleration. Finally, high fanout network topologies benefit more from in-network computing as there are more opportunities for data fusion between tasks. The ability of hardware acceleration architectures to process streams of data in parallel is well suited to these scenarios, suffering less of a latency penalty due to resource contention.

8. Conclusion

The placement of computing resources and allocation of tasks in distributed streaming applications has a significant impact on application metrics. We have presented a model that can be used to reason about such applications. It models data sources that inject data into the network, applications composed of dependent tasks, and hardware platforms that can be allocated to nodes in the network. The model can be used to evaluate alternative strategies for allocating computing resources and task execution, offering information on latency, throughput, bandwidth, energy, and cost. We have used this model to demonstrate that computing in the network offers significant advantages over fully centralized and fully decentralized approaches, using an example case-study of an object detection and tracking system. We also used synthetically generated applications to explore the key application factors that impact the effectiveness of the in-network computing approach.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ryan A. Cooke is a PhD student in the School of Engineering at the University of Warwick, UK, where he also received his M.Eng. degree in Electronic Engineering in 2015. His research interests include reconfigurable computing, and in-network analytics acceleration.

Suhaib A. Fahmy is Reader in Computer Engineering at the University of Warwick, where his research encompasses reconfigurable computing, high-level system design, and computational acceleration of complex algorithms. He received the M.Eng. degree in information systems engineering and the Ph.D. degree in electrical and electronic engineering from Imperial College London, UK, in 2003 and 2007, respectively. From 2007 to 2009, he was a Research Fellow with Trinity College Dublin and a Visiting Research Engineer with Xilinx Research Labs, Dublin. From 2009 to 2015, he was an Assistant Professor with the School of Computer Engineering, Nanyang Technological University, Singapore. Dr. Fahmy was a recipient of the Best Paper Award at the IEEE Conference on Field Programmable Technology in 2012, the IBM Faculty Award in 2013 and 2017, the Community Award at the International Conference on Field Programmable Logic and Applications, the ACM TODAES Best Paper Award in 2019, and is a senior member of the IEEE and ACM.