Article history: Received 3 May 2019; Received in revised form 25 November 2019; Accepted 30 November 2019; Available online 13 December 2019.

Keywords: Distributed computing; Edge computing; In-network computing; Hardware acceleration

Abstract

Applications that involve analysis of data from distributed networked data sources typically involve computation performed centrally in a datacenter or cloud environment, with some minor pre-processing potentially performed at the data sources. As these applications grow in scale, this centralized approach leads to potentially impractical bandwidth requirements and computational latencies. This has led to interest in edge computing, where processing is moved nearer to the data sources, and recently, in-network computing, where processing is done as data progresses through the network. This paper presents a model for reasoning about distributed computing at the edge and in the network, with support for heterogeneous hardware and alternative software and hardware accelerator implementations. Unlike previous distributed computing models, it considers the cost of computation for compute-intensive applications, supports a variety of hardware platforms, and considers a heterogeneous network. The model is flexible and easily extensible for a range of applications and scales, and considers a variety of metrics. We use the model to explore the key factors that influence where computational capability should be placed and what platforms should be considered for distributed applications.

© 2019 Elsevier B.V. All rights reserved.
✩ This work was supported in part by The Alan Turing Institute, UK under the UK EPSRC grant EP/N510129/1.
∗ Corresponding author. E-mail addresses: [email protected] (R.A. Cooke), [email protected] (S.A. Fahmy).

1. Introduction

Distributed data processing applications involve the processing and combination of data from distributed sources to extract value, and are increasing in importance. Emerging applications such as connected autonomous vehicles rely on complex machine learning models being applied to data captured at the edge, while also involving collaboration with other vehicles. Further example applications include factory automation [1], smart grid monitoring [2], and video surveillance and tracking [3]. Such applications present a challenge to existing computational approaches that consider only the cloud and the very edge of the network. Computationally intensive algorithms must now be applied to intensive streams of data, and latency must be minimized. In these applications, data sources transmit streams of data through a network to be processed remotely, with a focus on continuous processing, and potentially involvement in a feedback loop, as opposed to other applications that involve large scale storage and delayed processing. Latency, the time taken to extract relevant information from the data streams, and throughput, the rate at which these streams can be processed, are key performance metrics for such applications.

Centralized cloud computing is often utilized in these scenarios, since the data sources do not typically have adequate computing resources to perform complex computations. Applications also rely on the fusion of data from multiple sources, so centralized processing is useful. The cloud also offers benefits in scalability and cost, and has been shown to provide benefits in applications such as smart grid processing [2,4] and urban traffic management [5].

However, many emerging streaming applications have strict latency constraints, and moving data to the cloud incurs substantial delay. Furthermore, while the data generated by sources can be small, a high number of sources means that, in aggregate, the volume of data to be transmitted is high. For example, in 2011, the Los Angeles smart grid required 2TB of streamed data from 1.4 million consumers to be processed per day [2]. Some applications, such as those dealing with video data, must also contend with high bandwidth data requirements.

These limitations have led to an increased interest in 'edge' or 'fog' computing, a loosely defined paradigm where processing is done either at or close to the data sources. This could mean at the source, such as on a sensor node with additional processing resources [6]. It can also encompass performing processing within the network infrastructure, such as in smart gateways [7], or in network switches or routers. Cisco offer a framework that allows
so. Since we are not concerned with dynamic optimisation of operator placement within a time constraint, the model can include more fine grained detail for tasks and hardware, accounting for hardware acceleration, heterogeneous resources required by tasks, the financial cost of adding additional compute capability to network nodes, and energy consumption. We also consider the networked system as a whole, from the sensor nodes to the datacenter, instead of focusing on a cluster of computational servers. The focus of this paper is not optimisation, but rather an analysis of different distributed computing paradigms in the context of streaming applications.

Edge/Fog Computing: In response to increasing demand for low latency in distributed streaming applications, efforts have been made to move computation closer to the data source, or the 'edge' of the network. Where processing occurs varies, and it is rare that the application is entirely pushed to the edge. Typically, operations such as pre-processing and filtering take place at the edge, with aggregation and decision making centralized. This approach has been applied to domains such as smart grid, radio access networks, and urban traffic processing [23–25]. The model we have developed is capable of representing this scenario.

In some cases a majority of the processing is performed at the data source. This is common in sensor networks, where communication costs are higher than computation costs. Some examples of this are TAG [26], directed diffusion [27], EADAT [28], and MERIG [29]. These models consider computing at the very edge of the network, unlike those discussed previously. Our proposed model can account for the energy costs of communication and computation, as well as representing heterogeneous network links, unlike these models.

Processing may also be offloaded to local 'cloudlets', servers dedicated to computation, a few hops away in the network from the data source. This approach can be seen in mobile edge computing, where processing data on a mobile device would consume too much power, and doing so in the cloud would lead to high latency [30]. Cloudlets have also been demonstrated in video processing and augmented reality applications [9–11] where latency is an important consideration.

In-network computing is another emerging paradigm in which traditionally centralized computation is distributed throughout the networking infrastructure. Devices such as network switches and gateways are extended to perform additional data processing as well as their network functions. This technique has been demonstrated to result in a reduction in data and execution latency in map reduce applications [12]. A key value store implemented on an FPGA based NIC and network switch outperformed a server based implementation [13]. In-network computation using programmable network switches for a consensus protocol was demonstrated in [31]. As the capability of this hardware improves, this method, in which networking elements are used both for moving data as well as computing, is becoming more viable. Extending such capabilities to broader applications requires the ability to analyse applications composed of multiple dependent tasks and to determine how to allocate these to capable nodes. Our proposed model allows this to be explored in a manner not possible using existing distributed computing models.

Hardware acceleration: A primary motivation for this work is the increasing complexity of applications, growing volumes of data, and more widespread availability of alternative hardware such as GPUs and FPGAs that can boost the performance of these applications. Recent work has explored accelerators for a variety of algorithms relevant to networked systems [32–34]. Within the datacenter, heterogeneity has emerged as an important way to address stalled performance scaling and rising energy constraints. FPGAs can be integrated into datacenter servers for application acceleration [35]. FPGA partial reconfiguration [36] allows these hardware platforms to support sharing and virtualisation of multiple accelerators that can be changed at runtime [37], hence offering some of the flexibility of software platforms with the connectivity and performance of hardware. This trend is expected to continue with the deployment of FPGAs in commercial cloud computing datacenters [38,39]. Tightly coupling accelerators with the network interface has also been demonstrated to be effective in embedded networks [40] and the datacenter [41], and to have significant impact on streaming application latency [42]. To reflect the trend towards heterogeneity, our proposed model encompasses the idea of distinct hardware platforms with different computational characteristics. This further differentiates our work from others that consider only traditional processor based compute architectures.

3. Scenario and metrics

The scenario of interest comprises a set of distributed data sources producing continuous streams of data, connected through a network comprised of intermediate nodes (for example gateways, routers, or cluster heads) to a central data sink, such as a datacenter. These data sources could be cameras, streams of documents, environmental/industrial sensors, or similar. An application consisting of a set of tasks and their dependencies processes these streams to make a decision or extract value. These tasks operate on the different streams of data, and some combine information from multiple (possibly processed) streams. Individual tasks affect the data volume through a reduction factor that determines the ratio of input data to output data, which reflects the properties of many stream processing tasks. An example of such an application is a smart surveillance system that monitors video streams from many cameras to detect specific events. Video streams can come from a mix of fixed cameras and mobile platforms, with different resolutions, frame-rates, and interfaces, requiring different amounts of processing. The application uses processed information to adapt how the cameras are deployed and positioned.

In order to evaluate alternative allocations of resources and tasks, we consider the following key metrics of interest, with some explanation of how they are impacted below. We provide the comprehensive formulation of these metrics in Section 5.

3.1. Latency

Latency is important when data is time-sensitive. Fast detection of an event may have safety or security implications, or in some applications, there could be real-time constraints. In this case study, transmitting all video streams to the cloud introduces large communication delays, and competition for resources in the cloud can add further latency. Performing computation closer to the cameras, whether at the cameras or in network switches, can reduce these communication delays, and distributing the tasks to different network nodes reduces the delays from sharing centralized resources. Even with less powerful hardware, latency can improve as a result of this stream processing parallelisation.

3.2. Bandwidth

Processing sensor data often reduces the size of data, outputting filtered or aggregated data, or simple class labels. Hence, if this processing is performed nearer to the data source, bandwidth consumption further up the network can be reduced significantly. There may also be scenarios where early processing can determine that a particular stream of data is useless, and hence further transmission can be avoided. In our example, some cameras may use low resolutions or frame rates, and hence be
3.3. Energy

Table 1
Summary of the symbols used in the model.
xnm        Allocation of implementation m to node n
ynp        Allocation of platform p to node n
znmp       Allocation of m and p to n
unm1m2p    Allocation of m1, m2, and p to n
τmax       Maximum path delay
g          Throughput
Kt ⊂ T     Set of tasks lower than t in task sub-tree with t at the root
Kn ⊂ N     Set of nodes lower than n in network sub-tree with n at the root
Ds ⊂ N     Set of nodes on path from s to root node
vnp        1 if p ∈ Pn, 0 otherwise
Ph ⊂ P     Set of all platforms that run hardware implementations
Ps ⊂ P     Set of all platforms that run software implementations
H          Set of all paths from leaves to root in task graph
Ht ⊂ H     Set of tasks on path from leaf task t to root
OHt        Set of all other tasks not on path Ht
I ⊂ M      Set of all software implementations
φmpt       Time to complete task implementation on node
q          Bandwidth of streams/tasks
L ⊂ T      Set of tasks with no child tasks
SKn        Set of all sources that lie beneath node n

4.2. Implementations

M = {m1, m2, m3, ..., mM} is the set of all implementations, which are the pieces of software or hardware that implement the functionality of a task. Implementations can represent different software algorithms or hardware accelerator architectures that give the same functionality but have different computational delays or hardware requirements. Each task t ∈ T has a set of implementations Mt, and each m ∈ M is defined by m = (tm, τm, Rm, hm):

• tm ∈ T is the task that is implemented by m;
• the set Rm = {rm1, rm2, rm3, ..., rmR} contains the amount of each resource needed to be able to host the implementation, such as memory, FPGA accelerator slots, etc;
• τm is the time taken for this implementation to complete the task it implements per unit of data, compared to a reference processor;
• hm = {0, 1} signals whether the implementation is software or hardware. A value of 0 is software, 1 is hardware.

4.3. Platforms

Platforms represent the systems in a network node that can carry out tasks. We define P = {p1, p2, p3, ..., pP} as the set of platforms that could be assigned to node n ∈ N. Each platform p ∈ P is defined by p = (ep, cp, wp, Rp, hp), where:

• ep is the execution speed of the platform relative to a reference processor; this represents different processors having different computing capabilities;
• cp is the monetary cost of the platform;
• wp is the power consumption of the platform;
• Rp = {rp1, rp2, rp3, ..., rpR} is the set of resources available on the platform, such as memory, FPGA accelerator slots, etc. Resources are required by implementations;
• hp = {0, 1} indicates whether the platform runs software or hardware versions of tasks. A value of 0 means the platform is a processor that executes software, and a value of 1 means the platform is a hardware accelerator that executes application-specific logic. This is used to ensure correct allocation of software and hardware implementations.

Unlike existing work, this model makes the distinction between platforms that execute software code and hardware acceleration platforms such as FPGAs, as they have different computational delay models, discussed in Section 5.1. Hardware acceleration platforms incur no latency penalty when multiple tasks are present on the same node, whereas software platforms do, as a result of contention for computing resources.

4.4. Network

N = {n1, n2, n3, ..., nN} is a set of the network nodes, for example sensors, gateways, routers, or servers. Each n ∈ N is defined by n = (an, Cn, Pn, bn), where:

• an ∈ N is the parent node of n, linking it towards the central data sink;
• Cn ⊂ N is the set of child nodes of n, linking it towards the source(s);
• Pn ⊂ P is the set of platforms that can be assigned to node n. For example, a large datacenter class processor that must be housed in a server rack cannot be placed on a drone;
• bn is the outgoing interface between the node n and its parent node, and represents the bandwidth in terms of data per unit time.

4.5. Sources and data

S = {s1, s2, s3, ..., sS} is the set of data sources. We model data as continuous streams, as we are interested in applications that process and merge continuous streams of data. A data source could represent a sensor, database, video, or other source that injects a stream of data into the network. Each s ∈ S is defined by s = (ns, ts, ds, es):

• ns ∈ N is the parent node of the source, the node where the data stream enters the network;
• ts ∈ T is the task to be performed on data being produced by the source;
• ds is the amount of data in one instance from this source per period es;
• es is the period between subsequent units of data of size ds entering the network.

The model assumes a constant periodic stream of data from the source, such as a periodic sensor reading, a frame of video, or a set of captured tweets, for example. There are some systems that do not fit this model, for example where sensors may only send out data if there is some change detected. This case can still be represented in the proposed model, as the sensor is still continually capturing data as a source, and the detection component can be modelled as a filtering task that reduces it.

4.6. Allocation variables

Boolean variables represent the allocations of tasks and hardware to network nodes. xnm = {0, 1} represents the allocation of an implementation m ∈ M to node n ∈ N. Similarly, ynp = {0, 1} represents the allocation of platform p ∈ P to node n ∈ N. znmp = {0, 1} represents the allocation of platform p ∈ P and implementation m ∈ M to a node n ∈ N, using a set of constraints.

A summary of the symbols used in the model is presented in Table 1.
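Section 6.3 later notes that the model is implemented in Python, using classes to represent nodes, platforms, tasks, and implementations. As a point of reference, a minimal sketch of how the tuples defined above might be represented is given below; the class and field names are our own illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Illustrative sketch only: fields mirror the tuples in Sections 4.2-4.5.

@dataclass
class Implementation:
    task: str                     # t_m: the task this implementation realises
    time: float                   # tau_m: time per unit of data on a reference processor
    resources: Dict[str, float]   # R_m: resources needed (memory, FPGA slots, ...)
    hardware: bool                # h_m: False = software, True = hardware

@dataclass
class Platform:
    speed: float                  # e_p: execution speed relative to the reference processor
    cost: float                   # c_p: monetary cost
    power: float                  # w_p: power consumption
    resources: Dict[str, float]   # R_p: resources available on the platform
    hardware: bool                # h_p: False = processor, True = hardware accelerator

@dataclass
class Node:
    name: str
    parent: Optional["Node"]                                  # a_n: towards the central data sink
    children: List["Node"] = field(default_factory=list)      # C_n: towards the sources
    platforms: List[Platform] = field(default_factory=list)   # P_n: platforms allowed on this node
    bandwidth: float = 0.0                                     # b_n: outgoing link bandwidth

@dataclass
class Source:
    node: Node                    # n_s: node where the stream enters the network
    task: str                     # t_s: first task applied to the stream
    data: float                   # d_s: data per instance
    period: float                 # e_s: period between instances
```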
4.7. Constraints

4.7.1. One implementation per task

\forall t \in T, \quad \sum_{i=0}^{|N|} \sum_{j=0}^{|M_t|} x_{ij} = 1 \qquad (1)

4.7.2. One platform per node

\forall n \in N, \quad \sum_{i=0}^{|P|} y_{ni} = 1 \qquad (2)

4.7.3. Resource availability
Allocations cannot exceed the available resources for the platform assigned to a node:

\forall n \in N, \forall e \in R, \quad \sum_{i=0}^{|T|} \sum_{j=0}^{|M_i|} x_{nj} r_{je} \le \sum_{k=0}^{|P|} y_{nk} r_{ke} \qquad (3)

4.7.4. Additional constraints
The model allows for additional constraints to be added in order to better model a specific system or set of requirements. Constraints can be added to give certain tasks deadlines, constrain bandwidths, restrict specific nodes to certain platforms, and more.

5. Performance metrics

As previously mentioned, there are five main metrics of interest in this analysis: latency, throughput, bandwidth, energy consumption, and financial cost. In this section we formulate these metrics, and discuss how the formulation allows each to be evaluated.

5.1. End-to-end latency

The end-to-end latency is the total time between an instance of data entering the network and its root task being completed. For example, this could be the time between a sensor reading or image being taken and a fault or anomaly being detected. This value is of interest in time-sensitive applications such as those concerned with safety or closed-loop control, such as for industrial equipment, or coordinated control. The model incorporates several assumptions and behaviours that are relevant for this metric:

• Sources s ∈ S produce continuous streams of data of an amount ds, every period of time es. We take a 'snapshot' of the network at any instance of time, and say that data is entering the system at this instant from all sources, of an amount ds. The equation we form gives the latency of the data instances entered at the beginning of this 'snapshot';
• Only one software implementation can run at a time on a node. Software runs on a first in first out basis;
• Hardware implementations of tasks operate independently from one another so can operate in parallel;
• A task cannot begin until all of its child tasks have been completed;
• Tasks start as soon as all of the data required is available, and once completed send the result to the next task as soon as possible;
• Communication and computation happen independently and can be parallel to each other;
• There is no communication time between tasks on the same node.

As tasks can only begin once their child tasks are complete, we can say the root task of the graph G(T, ET) can only start once all paths to it are complete. The end-to-end latency is therefore equal to the longest path delay of the task graph, including network and computation delay.

5.1.1. Computation delay
The time to complete one task on the node it is allocated to can be represented as:

\varphi_{mpt} = \tau_m e_p \delta_t \qquad (4)

for a task t, implemented with m on platform p. To find the end-to-end latency, the values of φmpt for each path in the task tree are summed, and the maximum value determined.

In the case of software implementations, nodes are assumed to carry out one task at a time. So in the case of multiple tasks being assigned to the same node, in the worst case scenario, a data instance must wait for all other tasks not in the path to finish before beginning the next task. Note that this applies even if a node supports concurrent software tasks, since we assume that multiple software tasks suffer degraded performance in proportion to the parallelism applied. Unlike some other works, which are only concerned with preventing the allocated tasks exceeding a measure of available resources on a platform, running multiple software tasks at once on the same node in our model affects computational delay. For hardware implementations we make no such assumption, as they can operate in parallel as separate streams since they are spatially partitioned, and so it is sufficient to only sum the path of interest, though we do factor in available hardware resources as discussed later. This distinction between software and hardware implementations of tasks better represents the growing trend of using alternative computing platforms to accelerate computation, compared with previous work that only accounts for software running on processors. Fig. 3 shows this difference in scheduling for software and hardware nodes. On software nodes, tasks are performed in series in the worst case, and on hardware nodes, tasks can be performed concurrently. In this example, this means that tasks C and D can be performed in parallel to tasks A and B. Task E is dependent on tasks B, C and D, so must happen once they are completed. The added concurrency of hardware accelerator nodes helps reduce task execution latency when multiple tasks are assigned to a node.

Fig. 3. The difference in how a set of tasks allocated to a single node are scheduled on software and hardware accelerator nodes.

In order to represent this behaviour, a set of new allocation variables is introduced: u. Each one of these unm1m2p = {0, 1} represents the allocation of two implementations m1 and m2 to node n, assigned platform p.
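The constraints in Section 4.7 map directly onto a MILP formulation. Since Section 6.3 reports using the PuLP front end with the CBC solver, a minimal sketch of constraints (1) to (3) in that style is shown below; the index sets, resource tables, and the example cost objective are illustrative assumptions, not the authors' implementation.

```python
import pulp

nodes = ["camera", "gateway", "cloud"]
tasks = {"grad": ["grad_sw", "grad_hw"], "track": ["track_sw"]}   # task -> its implementations M_t
platforms = ["cortex_a9", "virtex6"]
resources = ["mem", "fpga_slots"]
need = {("grad_hw", "fpga_slots"): 1, ("grad_sw", "mem"): 1,      # r_me, defaulting to 0
        ("track_sw", "mem"): 1}
avail = {("cortex_a9", "mem"): 2, ("virtex6", "fpga_slots"): 3}   # r_pe, defaulting to 0
cost = {"cortex_a9": 10, "virtex6": 1000}                          # illustrative, cf. Table 3

impls = [m for ms in tasks.values() for m in ms]
x = pulp.LpVariable.dicts("x", (nodes, impls), cat="Binary")      # x_nm
y = pulp.LpVariable.dicts("y", (nodes, platforms), cat="Binary")  # y_np

prob = pulp.LpProblem("placement", pulp.LpMinimize)
# example objective: total platform cost (a latency expression from Section 5 could be used instead)
prob += pulp.lpSum(y[n][p] * cost[p] for n in nodes for p in platforms)

# (1) each task is realised by exactly one implementation, on exactly one node
for t, ms in tasks.items():
    prob += pulp.lpSum(x[n][m] for n in nodes for m in ms) == 1
# (2) exactly one platform is assigned to each node
for n in nodes:
    prob += pulp.lpSum(y[n][p] for p in platforms) == 1
# (3) allocated implementations must fit within the chosen platform's resources
for n in nodes:
    for e in resources:
        prob += (pulp.lpSum(x[n][m] * need.get((m, e), 0) for m in impls)
                 <= pulp.lpSum(y[n][p] * avail.get((p, e), 0) for p in platforms))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
```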
The set of tasks on the path from a leaf node on the task graph t to the root of the task graph is Ht ⊂ T. Let the set H contain all of the task path sets (Ht ∈ H). The set OHt is declared, containing all other tasks not on the path Ht. The set I ⊂ M is defined as the set of all software implementations. The computation time for a path Ht, τHtc, in the task tree is given by:

\tau_{H_t c} = \sum_{i=0}^{|N|} \sum_{j=0}^{|H_t|} \sum_{k=0}^{|M_j|} \sum_{l=0}^{|P|} \left( z_{ikl}\,\varphi_{klj} + \sum_{m=0}^{|O_i|} \sum_{q=0}^{|I_m|} u_{ikql}\,\varphi_{qlm} - \sum_{r=0}^{|I_a|} u_{ikrl}\,\varphi_{qlm} \right) \qquad (5)

The znmp term in this equation is the sum of delays on all paths of the task tree. The unm1m2p terms represent the extra delays on the path caused by having multiple tasks not on the same path allocated to the same node in software. The computation times of any other tasks allocated to the same node as any task in the path are added. The subtraction is present to ensure that this computation time is only added once for each set of tasks in a path allocated to the same node.

5.1.2. Communication delay
A simple communication model is used where tasks send data to the parent task as soon as it is ready. There is no communication cost between tasks, only between nodes. Communication and computation can occur simultaneously and independently. If a node receives data for a task not assigned to it, it forwards this data immediately to the next node.

Data is transferred from one node to another when a task's parent task is allocated to another node. Similarly to the computational delay, we can find the total communication delay τHtm between tasks in each path in the task tree Ht ∈ H:

\tau_{H_t m} = \sum_{i=0}^{|H_t|} \sum_{j=0}^{|N|} \sum_{k=0}^{|K_j|} \left( \sum_{l=0}^{|M_i|} \frac{x_{kl}\, d_i}{b_k} - \sum_{m=0}^{|M_{a_i}|} \frac{x_{km}\, d_i}{b_k} \right) \qquad (6)

The delay from the data source s on path Ht to the node that performs the first task on it, τHts, is given by:

\tau_{H_t s} = \sum_{i=0}^{|D_s|} \left( \frac{d_s}{b_i} - \sum_{j=0}^{|K_i|} \sum_{k=0}^{|M_{t_l}|} \frac{x_{jk}\, d_s}{b_i} \right) \qquad (7)

where tl is the leaf task in the task path. The total communication delay in a path, τHtk, is thus:

\tau_{H_t k} = \tau_{H_t s} + \tau_{H_t m} \qquad (8)

The proposed model can be extended to incorporate different communication delays for software and hardware tasks, as would be the case for network-attached hardware accelerators that can process packets with lower latency. The computation and communication latencies are likely to vary in reality. This model considers the worst case latency, where a node processes all other tasks first and transmits the results last.

5.1.3. Total delay
The total latency for a path, τHt, is equal to:

\tau_{H_t} = \tau_{H_t k} + \tau_{H_t c} \qquad (9)

The largest of these values is the total latency, τmax.

Although we have discussed a scenario where only a single task graph is present, the model allows the possibility of multiple independent task graphs representing separate applications. Using the same method and equations, a τmax can be formulated for other task graphs.

5.2. Throughput

The throughput of the system is the rate at which results are output, and is dependent on the node with the longest processing time in the network. A continuous variable g can be introduced to represent the maximum delay processing stage. For software implementations, where only one task can run on a node at any time, this can be expressed as:

\forall n \in N, \quad g \ge \sum_{i=0}^{|T|} \sum_{j=0}^{|P_s|} \sum_{k=0}^{|M_i|} z_{nkj}\,\varphi_{kji} \qquad (10)

where Ps is the set of all platforms that run software implementations. For platforms that run hardware implementations, Ph:

\forall t \in T, \quad g \ge \sum_{i=0}^{|P_h|} \sum_{j=0}^{|M_t|} z_{nji}\,\varphi_{jit} \qquad (11)

The throughput, v, can then be expressed as:

v = 1 / \max(g) \qquad (12)

5.3. Bandwidth

Bandwidth utilisation can be very significant in scenarios involving information sources with dense data and for large networks and applications. Poor utilisation can also lead to additional communication delays.

The bandwidth of a data stream at a source s, qs, is given by:

q_s = \frac{d_s}{e_s} \qquad (13)

The bandwidth of a task t, denoted qt, is given by:

q_t = f_t \sum_{i=0}^{|C_t|} q_i \qquad (14)

For leaf tasks tl, where |Ct| = 0, it is given by:

q_{t_l} = f_t\, q_s \qquad (15)

The total bandwidth consumption at the output of a network node is the sum of the bandwidths of all streams passing through it:

q_{nc} = \sum_{i=0}^{|K_n|} \sum_{j=0}^{|T|} \left( \sum_{k=0}^{|M_j|} x_{ik}\, q_j - \sum_{l=0}^{|C_j|} \sum_{m=0}^{|M_l|} x_{im}\, q_l \right) \qquad (16)

The data not yet processed by any tasks must also be taken into account. If SKn ⊂ Kn is the set of all sources that lie beneath n in the network graph, and L ⊂ T is the set of all tasks where |Ct| = 0:

q_{nl} = \sum_{i=0}^{|K_n|} \sum_{j=0}^{|L|} \left( \sum_{k=0}^{|SK_n|} q_k - \sum_{l=0}^{|M_j|} x_{il}\, q_{s_j} \right) \qquad (17)

where qst is the bandwidth of the source that leaf task t operates on.

The total bandwidth at a node n ∈ N is given by:

q_n = q_{nc} + q_{nl} \qquad (18)

This gives the bandwidth at each link between nodes.
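To make the scheduling distinction behind Eqs. (10) to (12) concrete, the short sketch below computes the per-node stage delay g and the resulting throughput for a software node, where tasks serialise, and a hardware node, where tasks run in parallel. The numbers are made up for illustration and are not taken from the paper.

```python
def stage_delay(task_times, hardware):
    """Worst-case processing time of one node for one unit of data."""
    if not task_times:
        return 0.0
    # software: one task at a time (sum); hardware: spatially parallel (max)
    return max(task_times) if hardware else sum(task_times)

def throughput(allocation):
    """allocation: list of (task_times, hardware) per node. Returns v = 1/max(g)."""
    g = max(stage_delay(times, hw) for times, hw in allocation)
    return 1.0 / g if g > 0 else float("inf")

# example: two tasks of 2 ms and 3 ms allocated to one node
print(throughput([([0.002, 0.003], False)]))  # software node: 1/0.005 = 200 results/s
print(throughput([([0.002, 0.003], True)]))   # hardware node: 1/0.003 ~ 333 results/s
```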
Table 2
Computation times in milliseconds for each task on different platforms.
Platform           Grad, Hist   Normalisation   Classification   Tracking
Cortex A9          2000         3200            1900             2000
Intel i7           40           60              35               40
Intel Xeon         2.6          4.0             2.3              2.6
Xilinx Zynq        260          400             240              260
Xilinx Virtex-6    1.3          2.1             1.2              1.3
Reduction factor   0.77         0.004           0.16             0.16

Table 3
Financial cost and power estimates for each platform.
Platform          Cost   Power consumption
Arm Cortex A9     10     1
Intel Core i7     300    5
Intel Xeon        2000   100
Xilinx Zynq       250    5
Xilinx Virtex-6   1000   10
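The figures in Tables 2 and 3 can be combined directly with the scheduling assumptions of Section 5.1: on a CPU the four tasks serialise in the worst case, while on an FPGA they run concurrently. The sketch below simply transcribes the tables and prints the resulting worst-case per-frame delay for a single node hosting one full pipeline; the comparison is our own illustrative arithmetic, not a result reported in the paper.

```python
TASK_TIME_MS = {  # Table 2: platform -> per-frame time for each task (ms)
    "cortex_a9":  {"grad_hist": 2000, "normalisation": 3200, "classification": 1900, "tracking": 2000},
    "intel_i7":   {"grad_hist": 40,   "normalisation": 60,   "classification": 35,   "tracking": 40},
    "intel_xeon": {"grad_hist": 2.6,  "normalisation": 4.0,  "classification": 2.3,  "tracking": 2.6},
    "zynq":       {"grad_hist": 260,  "normalisation": 400,  "classification": 240,  "tracking": 260},
    "virtex6":    {"grad_hist": 1.3,  "normalisation": 2.1,  "classification": 1.2,  "tracking": 1.3},
}
COST_POWER = {  # Table 3: platform -> (cost, power) in arbitrary units
    "cortex_a9": (10, 1), "intel_i7": (300, 5), "intel_xeon": (2000, 100),
    "zynq": (250, 5), "virtex6": (1000, 10),
}
HARDWARE = {"zynq", "virtex6"}  # FPGA platforms run their tasks concurrently

for platform, times in TASK_TIME_MS.items():
    # software platforms serialise the tasks (sum); hardware platforms overlap them (max)
    per_frame = max(times.values()) if platform in HARDWARE else sum(times.values())
    cost, power = COST_POWER[platform]
    print(f"{platform:10s} worst-case per-frame delay {per_frame:8.1f} ms "
          f"(cost {cost}, power {power})")
```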
6.3. Platforms

From previous work, we estimate the computation times for each task on the different platforms. Though these are estimates, and different implementations may have varying optimisations, the relative computation times are the important factor for this case study. If computation is placed at a camera node, we assume an embedded platform. An embedded Arm Cortex A9 is used in [44] to implement the HOG algorithm, so we use the computation times presented there.

If computing is placed at the access or routing layers, we can assume a more powerful CPU is available. The work in [46] implements the algorithm on an Intel Core i7 processor. Finally, the cloud layer would use server class processors, such as the Intel Xeon platform used to implement the algorithm in [45]. We also discuss the implications of using an FPGA to accelerate tasks. The work in [44] presents an FPGA design that gives a speed up of around 7× on a Xilinx Zynq platform that could be embedded at the camera. An FPGA accelerator implemented on a larger Xilinx Virtex-6 FPGA was reported in [45], and we assume this is the FPGA platform available at other layers. We use the relative performance on these platforms to estimate the computation time of the tracker task. Table 2 summarizes the time taken for each task on each platform per frame.

The costs of each platform are also relevant. In this case study we consider the extra costs associated with adding computing resources to different layers of the network. Cloud/datacenter costs are difficult to estimate, so we assume that this central node is present regardless of how we place other computing resources. Table 3 summarizes approximate costs and power consumption for each of the platforms in arbitrary currency and energy units, based on costs we have determined from OEM suppliers, and manufacturer power estimation utilities.

The FPGA resource utilisation estimates in the previously cited works suggest that both FPGA platforms can implement 3 full pipelines of the algorithm each, so 12 tasks. We assume CPU based platforms have no limit to the number of tasks that can be running, though, as discussed in the formulation, there is a latency penalty for sharing resources. We focus on latency, throughput, energy consumption, and financial cost as the metrics of interest.

We use our model to build the above scenario and evaluate different computation placement strategies. We implement the model in Python, using classes to represent the nodes, platforms, tasks and implementations, all containing members representing the various parameters discussed previously. Results are presented in Table 4 for software platforms and Table 5 for hardware platforms. We show the latency, throughput, financial cost, and total energy consumption of the entire system. Bandwidth results are not shown in this table as they are calculated per node in our model.

Table 4
Performance metrics for different placement strategies using software platforms.
Placement     Latency (seconds)   Throughput (frames/s)   Cost   Energy
Centralized   1.95                3.43                    2000   30.03
Layer C       1.97                0.88                    3200   23.00
Layer B       1.93                0.94                    4100   23.00
Layer A       7.16                0.14                    2300   241.2

Centralized Software: A typical approach to such an application would be to centralize all tasks, performing them in software, transmitting all data to the cloud or datacenter. In this case study, this gives a latency of around 1.95 s, and a throughput of 3.4 frames per second for each camera. Note that this is in the worst case, where all camera streams compete for CPU resources. The large communication latency coupled with the large amount of data being transmitted undermines the extra computing power provided by the cloud. Energy consumption was also joint highest
with this approach, as the Centralized hardware has the highest power consumption.

In-network software: An alternative approach is to push processing tasks down into the network. One possibility is placing the gradient computation, normalisation, and classification tasks on the camera nodes (layer A), and placing the tracking tasks at the appropriate layer B nodes as they require information from a set of cameras. This results in a latency of around 7.16 s and a poor throughput of 0.14 frames per second, unsuitable for real time applications. The energy consumption seems high, but this value is the energy consumption of the entire system; the consumption at each individual node is much lower. While there is communication latency, and fewer tasks competing for the same resources, the computing capability of these edge nodes is so low that the latency and throughput are much worse than the centralized placements.

Distributing tasks within the intermediate network infrastructure offers improved latency relative to placing tasks in layer A, but has minimal impact when compared to centralized placement. In this scenario, the reduced communication latency is offset by the increased computation latency. Layer B and layer C approaches introduce additional costs of 2100 and 1200 currency units respectively. The centralized solution also has 3.65× higher throughput than these approaches. This is because of its increased computing capability relative to these other nodes, meaning that there is less computation latency. Energy consumption is less than centralized software, due to the lower power consumption of the hardware. This energy consumption is also spread across a greater number of nodes, meaning each node consumes less energy.

Table 5
Performance metrics for different placement strategies using hardware platforms.
Placement     Latency (seconds)   Throughput (frames/s)   Cost    Energy
Centralized   1.68                133                     13000   1.56
Layer C       0.844               133                     14000   1.56
Layer B       0.8                 133                     16000   1.56
Layer A       0.94                1.10                    11600   30.6

Centralized Hardware: Utilising FPGA acceleration at the server node reduces the latency to 1.68 s, and increases throughput to 133 frames per second, as a result of reductions in computation latency. While the FPGA should in theory provide a greater performance boost than this, the time taken for data to travel to the cloud limits the improvement that can be achieved for the application. The energy cost of running these tasks in hardware is also much lower than in software. The FPGA accelerator has a lower power consumption, as well as lower computation time.

In-network hardware: Adding FPGA accelerators to layer C reduces latency to 0.84 s, and increases throughput to 133 frames per second due to the performance of the FPGA accelerators dramatically reducing computation latency. Placing FPGAs in layer B further improves latency to 0.83 s. These placements give improvements over the centralized FPGA approach due to the reduction in communication latency. There is little difference in latency between placing tasks predominately in layers B or C, as the fast link between these layers means that there is minimal communication delay. The disadvantage of the in-network FPGA approach is the additional cost, with the layer B and C methods costing 16,000 and 14,000 currency units respectively. Moving all tasks in hardware to the layer A camera nodes offers improvements over the software equivalent due to the increased computing capability. It also improves over centralized approaches due to the reduced communication latency. However, the higher computation latency relative to layers B and C means that there is a higher overall latency, and worse throughput. While the total energy consumption for the layer A approach looks high, it is spread across a greater number of nodes. Each layer A node actually has a power consumption of approximately 0.956. The same processing hardware is implemented on the FPGAs in layers B and C, as well as when centralized. This results in the throughput being equal in all circumstances, despite the higher communication latency.

Table 6
Performance metrics for MILP optimisation of model.
Placement   Latency (seconds)   Throughput (frames/s)   Cost   Energy
Optimized   0.87                133                     9000   1.56

Optimal Placement: Our model can be used with a Mixed Integer Linear Program (MILP) solver to generate a specific task and hardware placement strategy to optimize any of the performance metrics detailed in Section 5. To do this, the Python PULP front end was used to interface to the CBC solver. In this case, we optimize for latency, as in this example, energy and throughput are directly related to latency. We first generated the optimal latency placement, then ran the optimisation again with a latency constraint 5% higher than this value, but optimising for cost. This forces the solver to generate the cheapest placement that achieves a latency within 5% of the optimal value. As a result, our model generated a placement with the metrics shown in Table 6. This is presented for completeness; it may be argued that customising a network for a specific application is unlikely to be a common requirement. Hence, we have focused primarily on the general lessons learnt in terms of placement strategies for hardware in the network.

Summary: Improvements can be made to streaming application latency by pushing tasks into the network in either software or hardware. This also offers improvement in energy consumption at each individual node, important when there may be a limited power budget. There is a balance between the communication latency to reach higher capability nodes, and the benefits to computation latency that they provide. Placing tasks at the very edge of the network minimizes communication latency but is limited by poor computational capability. The cloud offers the highest computing capability but there is a communication latency bottleneck. The downside of using in-network task placement is the additional financial cost of the extra hardware. However, with the price/performance ratio for embedded devices scaling significantly faster than for server class CPUs, we expect this to improve over time.

6.4. Event driven simulation

We further developed a discrete event simulator written in Python using the SimPy library, to test the validity of results produced by our model. Data sources emit periodic packets of data into the network with the same topology and task structure. The tasks are allocated to the relevant nodes, and are executed at the nodes in a first-in first-out fashion, with priority given to the oldest data packets.

We expect differences in the reported latencies from the model and simulator primarily due to the more detailed task and communication scheduling in the simulator. The simulation processes individual packets as opposed to considering abstract streams in the model. The data sources in the simulator emit packets with fixed periods, sources are unsynchronized, whereas the model implicitly assumes synchronisation. The simulation also takes into account a small switching delay at nodes, representing the transfer of data from received packets to the
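A minimal sketch of the simulator just described, using SimPy, is shown below: a periodic source feeds timestamped packets into a store, and a node services them one at a time in arrival order. The topology, names, and timing values are illustrative assumptions, not the authors' simulator code.

```python
import simpy

def source(env, store, period, name):
    # emit one timestamped packet per period; earlier timestamps are served first
    seq = 0
    while True:
        yield env.timeout(period)
        store.put((env.now, f"{name}-{seq}"))
        seq += 1

def node(env, store, service_time, results):
    # process packets one at a time, oldest first, recording end-to-end latency
    while True:
        created, pkt = yield store.get()
        yield env.timeout(service_time)
        results.append((pkt, env.now - created))

env = simpy.Environment()
queue = simpy.Store(env)
latencies = []
env.process(source(env, queue, period=0.04, name="camera0"))   # 25 frames/s source
env.process(node(env, queue, service_time=0.01, results=latencies))
env.run(until=1.0)
print(latencies[:3])
```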
Table 7
Different placement policies used in our simulations.
Strategy       Explanation
Centralized    All tasks allocated to root node
Pushed         All tasks pushed down toward the leaf nodes as much as possible
Intermediate   All tasks pushed down as far as possible, but not to leaf nodes
Edge/Central   Leaf tasks placed at leaf nodes, others placed centrally
Edge/Network   Leaf tasks placed at leaf nodes, others pushed down as far as possible but not to leaf nodes
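Besides the fixed policies in Table 7, Section 6.3 described deriving a placement with a MILP solver by first minimising latency and then minimising cost subject to staying within 5% of that optimum. A rough sketch of that two-stage procedure with PuLP is given below; build_model is an assumed helper that would assemble the problem of Sections 4 and 5, not code released by the authors.

```python
import pulp

def optimise(build_model):
    """build_model() -> (prob, latency, cost): a fresh PuLP problem plus
    affine expressions for end-to-end latency and financial cost."""
    # stage 1: minimise end-to-end latency (tau_max)
    prob, latency, _ = build_model()
    prob += latency
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    best_latency = pulp.value(prob.objective)

    # stage 2: minimise cost while keeping latency within 5% of the optimum
    prob, latency, cost = build_model()
    prob += cost
    prob += latency <= 1.05 * best_latency
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return prob
```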
7. Further analysis
Fig. 6. Latency comparison for different Layer A:B:C computing capability ratios.
In the unlikely case where computing capability is equal across all layers (i.e. a Centralized:B/C:A computing capability ratio of 1:1:1), pushing all tasks as close to the data sources as possible yields the lowest latency, as there is minimal communication delay and no benefit to placing tasks higher up. This may be the case if the network is a commodity cluster of homogeneous machines. Computation time is also improved since the tasks are distributed across many nodes, resulting in less contention than for a centralized placement.

If the compute capability at the data sources is significantly smaller (50× in this case), while the rest of the network offers equivalent computing capability (a ratio of 1:50:50), pushing tasks down to intermediate nodes offers the best latency. In this case, the slight reduction in communication latency gained through placing tasks at the data sources is outweighed by the computation latency penalty. Placing them any closer to the central node adds further communication latency with no additional benefit, and causes contention due to more tasks being allocated to fewer nodes.

The more likely case is that resources at the central node are more capable than intermediate nodes, which offer greater capability than the edge. In the case of the central node being 5× more capable than the intermediate nodes (a computing capability ratio of 1:50:250), pushing tasks as low as possible into the intermediate nodes still outperforms the centralized solution, as tasks are distributed to a larger number of nodes, reducing computation latency. Increasing the difference in computing power to 10× (1:50:500) causes the central solution to become dominant. Hence, we see that a key requirement for in-network computing to be feasible is that suitably capable computing resources be employed for executing tasks in the network. The more capable the edge nodes are in comparison to the root node, the greater the benefits of placing tasks further towards the edge.

7.2. Task data reduction

The time taken to transmit data further up the network is tied to the amount of data being transmitted. Tasks can reduce data by varying degrees, and this impacts the balance between
computation and communication latency. For this experiment, we modify the reduction factors of tasks to observe the impact on latency. We use the same network topology as in the previous experiment, and the same method of generating task trees. To mimic a realistic scenario, we use a 1:50:500 relative computing capability configuration, as discussed in Section 7.1.

Fig. 7. Latency comparison for different edge task data reduction factors.

Fig. 7 shows how different placement strategies impact latency, for different task reduction factors. If data is dramatically reduced by tasks close to the edge of the task tree, placing tasks as close as possible to the data source is more likely to provide a latency improvement, as the communication cost for every other transfer between nodes is reduced. We see that intermediate placement reduces latency by 5× compared to a centralized allocation in such a scenario. Placing all tasks at the edge results in 30% worse latency, despite the reduced communication latency, due to the low computing capability of these nodes. Placing only the leaf task at the edge and the rest either in the network or at the central node also provides a significant reduction in latency in this scenario.

If data is not significantly reduced in the task tree, or only at tasks higher up in the tree, placing tasks towards the central sink is preferred, especially if those resources are more capable. A centralized placement provides the best latency in a majority of cases, although only by a slight margin. For some task trees, the in-network approach is superior. This result is impacted by the relative computing capability of layers. For scenarios where the central node is much more capable than the rest of the network, the instinct is to place tasks there. However, if data is reduced significantly at the leaf tasks then placing tasks in the network can reduce communication latency significantly.

It can be seen that, generally, the closer to the edge of the task tree that data is reduced, the greater the benefits of placing tasks closer to the edge of the network.

7.3. Network structure

The structure of the network determines to what extent tasks can be distributed and parallelized and how much they must compete for resources. Related to this is the structure of the application task graph; having tasks that require data from multiple sources closer to the root of the tree means that tasks cannot be pushed down into the network to a layer with more computing nodes. To investigate this factor, we consider different network structures and their impact on latency, as shown in Fig. 8. The tasks were generated with the same method as before, and network nodes had the same computing capability as in Section 7.2. All tasks were set to a fixed reduction factor of 0.5.

Firstly, we examine a network with low fanout, where layer B nodes each have 2 layer A nodes attached. While this means that there are more available resources towards the edge of the network, in many cases pushing tasks into the network results in almost 2× the latency of a centralized solution. Tasks that require data from more than one source must be pushed further up the network, adding additional communication latency. Additionally, as there are few layer C nodes, these nodes are over-utilized. Increasing the number of layer C nodes, or the computing capability of these nodes, would offer performance benefits in this scenario.

Raising the fanout of the layer B nodes to 5 instead of 2 increases the benefits of pushing tasks into the network. As more sources share the same paths towards the central node, there is a higher chance that a task that works with data from multiple sources can be placed closer to the edge. Increasing the number of nodes at layer C in this case again slightly decreases latency, as tasks that do have to be placed there have access to more resources.

Further increasing the fanout of the layer B nodes to 20 starts to increase latency again, up to around 0.45× the centralized placement. Increasing it to 40 increases the latency to around 0.7× the central placement.

A larger fanout at layer A (the edge layer), up to a point, means that there is a greater benefit of pushing tasks down towards the network edge, as there are more opportunities to place tasks that require data from multiple child tasks closer to the edge. However, if the fanout is too great, resource competition starts to reduce the benefits of this approach.

It can be seen that there exists a trade-off between having multiple sources connected to the same path of nodes, and creating too
much resource contention by having too many tasks assigned to the same intermediate nodes.

7.4. Hardware acceleration

There are several key points we can take away from this investigation. In-network computing is more effective in situations where the edge and intermediate nodes are comparable in capability to the central node. While this is unlikely with traditional software processing platforms, it makes the case for trying to integrate hardware accelerators such as FPGAs into the network, as they can provide processing times closer to the more powerful processors found in a datacenter environment. We can also see that in-network computing provides more benefits in applications where data is more greatly reduced in tasks closer to the edge of the task tree. These tasks can often be large filtering or pre-processing tasks, and in order to place them close to the edge of the network, more capable hardware is required. This again makes the case for hardware acceleration. Finally, high fanout network topologies benefit more from in-network computing as there are more opportunities for data fusion between tasks. The ability of hardware acceleration architectures to process streams of data in parallel is well suited to these scenarios, suffering less of a latency penalty due to resource contention.

8. Conclusion

The placement of computing resources and allocation of tasks in distributed streaming applications has a significant impact on application metrics. We have presented a model that can be used to reason about such applications. It models data sources that inject data into this network, applications composed of dependent tasks, and hardware platforms that can be allocated to nodes in the network. The model can be used to evaluate alternative strategies for allocating computing resources and task execution, offering information on latency, throughput, bandwidth, energy, and cost. We have used this model to demonstrate that computing in the network offers significant advantages over fully centralized and fully decentralized approaches, using an example case study of an object detection and tracking system. We also used synthetically generated applications to explore the key application factors that impact the effectiveness of the in-network computing approach.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] N. Tapoglou, J. Mehnen, A. Vlachou, M. Doukas, N. Milas, D. Mourtzis, Cloud-based platform for optimal machining parameter selection based on function blocks and real-time monitoring, J. Manuf. Sci. Eng. 137 (1) (2015).
[2] B. Lohrmann, O. Kao, Processing smart meter data streams in the cloud, in: PES Innovative Smart Grid Technologies Conference Europe, 2011.
[3] W. Zhang, P. Duan, Q. Lu, X. Liu, A realtime framework for video object detection with storm, in: International Conference on Ubiquitous Intelligence and Computing, 2014, pp. 732–737.
[4] Y. Simmhan, B. Cao, M. Giakkoupis, Adaptive rate stream processing for smart grid applications on clouds, in: International Workshop on Scientific Cloud Computing, 2011, pp. 33–37.
[5] Z. Li, C. Chen, K. Wang, Cloud computing for agent-based urban transportation systems, IEEE Intell. Syst. 26 (1) (2011) 73–79.
[6] E. Jean, R.T. Collins, A.R. Hurson, S. Sedigh, Y. Jiao, Pushing sensor network computation to the edge, in: International Conference on Wireless Communications, Networking and Mobile Computing, 2009.
[7] L. Hong, C. Cheng, S. Yan, Advanced sensor gateway based on FPGA for wireless multimedia sensor networks, in: International Conference on Electric Information and Control Engineering, 2011, pp. 1141–1146.
[8] M. Satyanarayanan, P. Bahl, R. Cáceres, N. Davies, The case for VM-based cloudlets in mobile computing, Perv. Comput. 8 (4) (2009) 14–23.
[9] T. Soyata, R. Muraleedharan, C. Funai, M. Kwon, W. Heinzelman, CloudVision: Real-time face recognition using a mobile-cloudlet-cloud acceleration architecture, in: IEEE Symposium on Computers and Communications, 2012, pp. 59–66.
[10] S. Yi, Z. Hao, Z. Qin, Q. Li, Fog computing: Platform and applications, in: Workshop on Hot Topics in Web Systems and Technologies, 2016, pp. 73–78.
[11] K. Ha, Z. Chen, W. Hu, W. Richter, P. Pillai, M. Satyanarayanan, Towards wearable cognitive assistance, in: International Conference on Mobile Systems, Applications, and Services, 2014, pp. 68–81.
[12] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, P. Kalnis, In-network computing is a dumb idea whose time has come, in: Proceedings of HotNets, 2017.
[13] Y. Tokusashi, H. Matsutani, N. Zilberman, LaKe: the power of in-network computing, in: Proceedings of the International Conference on Reconfigurable Computing and FPGAs, 2018.
[14] D.J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik, Aurora: a new model and architecture for data stream management, Int. J. Very Large Data Bases 12 (2) (2003) 120–139.
[15] Y. Ahmad, U. Cetintemel, Network-aware query processing for stream-based applications, in: Proceedings of the International Conference on Very Large Data Bases, 2004, pp. 456–467.
[16] D.J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, S.B. Zdonik, The design of the Borealis stream processing engine, in: Proceedings of Conference on Innovative Data Systems Research, 2005, pp. 277–289.
[17] P. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh, M. Seltzer, Network-aware operator placement for stream-processing systems, in: Proceedings of the International Conference on Data Engineering, 2006.
[18] L. Ying, Z. Liu, D. Towsley, C.H. Xia, Distributed operator placement and data caching in large-scale sensor networks, in: Proceedings of IEEE INFOCOM, 2008, pp. 1651–1659.
[19] S. Rizou, F. Dürr, K. Rothermel, Solving the multi-operator placement problem in large-scale operator networks, in: Proceedings of the International Conference on Computer Communications and Networks, 2010.
[20] A. Benoit, H. Casanova, V. Rehn-Sonigo, Y. Robert, Resource allocation for multiple concurrent in-network stream-processing applications, Parallel Comput. 37 (8) (2011) 331–348.
[21] A. Benoit, H. Casanova, V. Rehn-Sonigo, Y. Robert, Resource allocation strategies for constructive in-network stream processing, Internat. J. Found. Comput. Sci. 22 (03) (2011) 621–638.
[22] V. Cardellini, V. Grassi, F. Lo Presti, M. Nardelli, Optimal operator placement for distributed stream processing applications, in: Proceedings of the International Conference on Distributed and Event-Based Systems, 2016, pp. 69–80.
[23] F. Bonomi, R. Milito, J. Zhu, S. Addepalli, Fog computing and its role in the internet of things, in: Proceedings of the MCC Workshop on Mobile Cloud Computing, 2012.
[24] S.H. Park, O. Simeone, S.S. Shitz, Joint optimization of cloud and edge processing for fog radio access networks, IEEE Trans. Wirel. Commun. 15 (11) (2016) 7621–7632.
[25] A. Botta, W. De Donato, V. Persico, A. Pescapé, Integration of Cloud computing and Internet of Things: A survey, Future Gener. Comput. Syst. 56 (2016) 684–700.
[26] S. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, TAG: A tiny aggregation service for Ad-hoc sensor networks, in: Proceedings of the Symposium on Operating Systems Design and Implementation, Vol. 36, 2002, pp. 131–146.
[27] C. Intanagonwiwat, R. Govindan, D. Estrin, Directed diffusion, in: Proceedings of the International Conference on Mobile Computing and Networking, 2000, pp. 56–67.
[28] M. Ding, X. Cheng, G. Xue, Aggregation tree construction in sensor networks, in: Proceedings of the Vehicular Technology Conference, Vol. 4, 2003, pp. 2168–2172.
[29] H. Luo, H. Tao, H. Ma, S.K. Das, Data fusion with desired reliability in wireless sensor networks, IEEE Trans. Parallel Distrib. Syst. 23 (3) (2012) 501–513.
[30] M. Satyanarayanan, Z. Chen, K. Ha, W. Hu, W. Richter, P. Pillai, Cloudlets: at the leading edge of mobile-cloud convergence, in: International Conference on Mobile Computing, Applications and Services, 2014, pp. 1–9.
[31] H.T. Dang, M. Canini, F. Pedone, R. Soule, Paxos made switch-y, in: Proceedings of SIGCOMM, 2016, pp. 18–24.
[32] M. Papadonikolakis, C.-S. Bouganis, A novel FPGA-based SVM classifier, in: Proceedings of the International Conference on Field-Programmable Technology, 2010, pp. 2–5.
[33] H.M. Hussain, K. Benkrid, A.T. Erdogan, H. Seker, Highly parameterized K-means clustering on FPGAs: Comparative results with GPPs and GPUs, in: Proceedings of the International Conference on Reconfigurable Computing and FPGAs, Vol. 1, 2011, pp. 475–480.
[34] N. Pittman, A. Forin, A. Criminisi, J. Shotton, A. Mahram, Image segmentation using hardware forest classifiers, in: Proceedings of the International Symposium on Field-Programmable Custom Computing Machines, 2013, pp. 73–80.
[35] S.A. Fahmy, K. Vipin, S. Shreejith, Virtualized FPGA accelerators for efficient cloud computing, in: International Conference on Cloud Computing Technology and Science, 2015, pp. 430–435.
[36] K. Vipin, S.A. Fahmy, FPGA dynamic and partial reconfiguration: A survey of architectures, methods, and applications, ACM Comput. Surv. 51 (4) (2018) 72:1–72:39.
[37] M. Asiatici, N. George, K. Vipin, S.A. Fahmy, P. Ienne, Virtualized execution runtime for FPGA accelerators in the cloud, IEEE Access 5 (2017) 1900–1910.
[38] A.M. Caulfield, E.S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, et al., A cloud-scale acceleration architecture, in: Proceedings of the IEEE/ACM International Symposium on Microarchitecture, 2016.
[39] D. Firestone, A. Putnam, S. Mundkur, D. Chiou, A. Dabagh, M. Andrewartha, H. Angepat, V. Bhanu, A. Caulfield, E. Chung, et al., Azure accelerated networking: SmartNICs in the public cloud, in: USENIX Symposium on Networked Systems Design and Implementation, 2018, pp. 51–66.
[40] S. Shreejith, S.A. Fahmy, Smart network interfaces for advanced automotive applications, IEEE Micro 38 (2) (2018) 72–80.
[41] J. Weerasinghe, R. Polig, F. Abel, C. Hagleitner, Network-attached FPGAs for data center applications, in: Proceedings of the International Conference on Field-Programmable Technology, 2016, pp. 36–43.
[42] S. Shreejith, R.A. Cooke, S.A. Fahmy, A smart network interface approach for distributed applications on Xilinx Zynq SoCs, in: International Conference on Field Programmable Logic and Applications, 2018, pp. 189–190.
[43] O. Tomanek, P. Mulinka, L. Kencl, Multidimensional cloud latency monitoring and evaluation, Comput. Netw. 107 (Part 1) (2016) 104–120.
[44] M. Hatto, T. Miyajima, H. Amano, Data reduction and parallelization for human detection system, in: Proceedings of the Workshop on Synthesis and System Integration of Mixed Information Technologies, 2015, pp. 134–139.
[45] C. Blair, N.M. Robertson, D. Hume, Characterising a heterogeneous system for person detection in video using histograms of oriented gradients: Power vs. speed vs. accuracy, J. Emerg. Sel. Top. Circuits Syst. 3 (2) (2013) 236–247.
[46] M. Kachouane, S. Sahki, M. Lakrouf, N. Ouadah, HOG based fast human detection, in: International Conference on Microelectronics, No. 24, ICM, 2012.
[47] B. Benfold, I. Reid, Stable multi-target tracking in real-time surveillance video, in: Proceedings of CVPR, 2011, pp. 3457–3464.

Ryan A. Cooke is a PhD student in the School of Engineering at the University of Warwick, UK, where he also received his M.Eng. degree in Electronic Engineering in 2015. His research interests include reconfigurable computing, and in-network analytics acceleration.

Suhaib A. Fahmy is Reader in Computer Engineering at the University of Warwick, where his research encompasses reconfigurable computing, high-level system design, and computational acceleration of complex algorithms. He received the M.Eng. degree in information systems engineering and the Ph.D. degree in electrical and electronic engineering from Imperial College London, UK, in 2003 and 2007, respectively. From 2007 to 2009, he was a Research Fellow with Trinity College Dublin and a Visiting Research Engineer with Xilinx Research Labs, Dublin. From 2009 to 2015, he was an Assistant Professor with the School of Computer Engineering, Nanyang Technological University, Singapore. Dr. Fahmy was a recipient of the Best Paper Award at the IEEE Conference on Field Programmable Technology in 2012, the IBM Faculty Award in 2013 and 2017, the Community Award at the International Conference on Field Programmable Logic and Applications, the ACM TODAES Best Paper Award in 2019, and is a senior member of the IEEE and ACM.