
IEEE INTERNET OF THINGS JOURNAL, VOL. 4, NO. 1, FEBRUARY 2017

Adaptive Clustering for Dynamic IoT Data Streams


Daniel Puschmann, Payam Barnaghi, Senior Member, IEEE, and Rahim Tafazolli, Senior Member, IEEE

Abstract—The emergence of the Internet of Things (IoT) has led to the production of huge volumes of real-world streaming data. We need effective techniques to process IoT data streams and to gain insights and actionable information from real-world observations and measurements. Most existing approaches are application or domain dependent. We propose a method which determines how many different clusters can be found in a stream based on the data distribution. After selecting the number of clusters, we use an online clustering mechanism to cluster the incoming data from the streams. Our approach remains adaptive to drifts by adjusting itself as the data changes. We benchmark our approach against state-of-the-art stream clustering algorithms on data streams with data drift. We show how our method can be applied in a use case scenario involving near real-time traffic data. Our results allow us to cluster, label, and interpret IoT data streams dynamically according to the data distribution. This enables adaptive online processing of large volumes of dynamic data based on the current situation. We show how our method adapts itself to the changes. We demonstrate how the number of clusters in a real-world data stream can be determined by analyzing the data distributions.

Index Terms—Adaptive clustering, Internet of Things (IoT), stream processing.

I. INTRODUCTION

THE SHIFT from the desktop computing era toward ubiquitous computing and the Internet of Things (IoT) has given rise to huge amounts of continuous data collected from the physical world. The data produced in the IoT context has several characteristics which make it different from other data used in common database systems and machine learning or data analytics. IoT data can come from multiple heterogeneous sources and domains, for example numerical observations and measurements from different sensors or textual input from social media streams. Common data streams usually follow a Gaussian distribution over a long-term period. However, in IoT applications we need to consider short-term snapshots of the data, in which we can have a wider range of more sporadic distributions. Furthermore, the nature of IoT data streams is dynamic and their underlying data distribution can change over time. Another point is that the data comes in large quantities and is produced in real time or close to real time. This necessitates the development of IoT-specific data analytics solutions which can handle the heterogeneity, dynamicity, and velocity of the data streams.

To group the data coming from the streams, we can use clustering or classification methods. Classification methods require supervised learning and need labeled training data. Huge amounts of data are usually produced in IoT applications; however, these data lack labels, which makes such methods infeasible here. While clustering methods avoid this pitfall since they do not need supervised learning, they work best in offline scenarios where all data is present from the start and the data distribution remains fixed. In this paper, we propose a clustering method with the ability to cope with changes in the data stream, which makes it more suitable for IoT data streams.

Data is usually clustered according to different criteria, e.g., similarity and homogeneity. The clustering results in a data analysis scenario can be interpreted as categories in a dataset and can be used to assign data to various groups (i.e., clusters). In this paper, we discuss an adaptable clustering method that analyzes the distribution of data and updates the cluster centroids according to the online changes in the data stream. This allows creating dynamic clusters and assigning data to these clusters not only by their features (e.g., geometric distances) but also by investigating how the data is distributed at a given time. We evaluate this clustering method against several state-of-the-art methods on evolving data streams.

To showcase the applicability of this paper, we use a case study from an intelligent traffic analysis scenario. In this scenario, we cluster the traffic sensor measurements according to features such as the average speed of vehicles and the number of cars. These clusters can then be analyzed to assign them a label; for example, a cluster that always includes the highest number of cars, according to the overall density of the cars at a given time and/or the capacity of a street, will be given the "busy" tag. By further abstracting, we can identify events such as traffic jams, which can be used as input for automated decision-making systems such as automatic rerouting via GPS navigators.

This paper is organized as follows. In Section II, we present the state of the art and discuss the benefits and drawbacks of different stream clustering algorithms. We present related work on analyzing stream data with concept and data drifts. The silhouette coefficient is chosen as a metric for measuring cluster quality, and its mathematical background is described in Section II. In Section III, we introduce the concepts of our adaptive online clustering method, which automatically computes the best number of clusters based on the data distribution. Section IV describes the proposed adaptive clustering method in more technical detail. We present evaluations of this paper in Section V. In Section V-A, we compare our method against state-of-the-art methods on a synthesized data set. We have conducted a case study using traffic data and present the results in Section V-B.

Manuscript received July 1, 2016; revised October 6, 2016; accepted October 6, 2016. Date of publication October 19, 2016; date of current version February 8, 2017. This work was supported by the European Commission Seventh Framework Programme for the CityPulse Project under Grant 609035.
The authors are with the Institute for Communication Systems, University of Surrey, Guildford, GU2 7XH, U.K. (e-mail: [email protected]).
Digital Object Identifier 10.1109/JIOT.2016.2618909
2327-4662 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In Section VI, we discuss the significance of this paper and outline the future work.

II. RELATED WORK

Several approaches to the clustering problem exist; however, we will only take a closer look at a particular one: Lloyd's algorithm, better known under the name k-means [1]. It should be noted that this particular approach has been selected to be improved for the purpose of streaming data clustering because of its simplicity. The concept of utilizing the data distribution can also be applied to determining the parameter k for k-median [2] or the number of classes in unsupervised multiclass support vector machines [3] and other clustering algorithms.

k-means splits a given data set into k different clusters. It does so by first choosing k random points within the data set as initial cluster centroids and then assigning each data point to the most suitable of these clusters while adjusting the centers. This process is repeated with the output as the new input arguments until the centroids converge toward stable points. Since the final result of the clustering is heavily dependent on the initial centroids, the whole process is carried out several times with different initial parameters. For a data set of fixed size this might not be a problem; however, in the context of streaming data this characteristic of the algorithm leads to heavy computational overhead.

Convergence of k-means to a clustering using random restarts not only means that the procedure takes additional time; depending on the data set, k-means can also produce lower-quality clusters. k-means++ [4] is a modification of k-means that intelligently selects the initial centroids using randomized seeding. While the first center is chosen randomly from a uniform distribution, the following centroids are selected with probability weighted by their proportion of the overall potential.

STREAM [5] is a one-pass clustering algorithm which treats data sets that are too large to be processed in memory as a data stream; however, the approach has shown limitations in cases where the data stream evolves over time, leading to misclustering. Aggarwal et al. [6] introduced their approach, called CluStream, which is able to deal with these cases. CluStream is also able to give information about past clusters for a user-defined time horizon. Their solution works by dividing the stream clustering problem into an online microclustering component and an offline macroclustering component. One drawback of Aggarwal et al.'s [6] approach is that the number of clusters has to be either known in advance and fixed, or chosen by a user in each step, which means that human supervision has to be involved in the process.

Another well-known approach in stream clustering is StreamKM++ [7], which is based on k-means++ [4]. However, again the number of clusters needs to be known beforehand. StreamKM++ constructs and maintains a core-set representing the data stream. After the data stream is processed, the core-set is clustered with k-means++. Consequently, StreamKM++ is not designed for evolving data streams.

There are several approaches to deal with the problem of identifying how many different clusters can be found in a data set. Chiang and Mirkin [8] have conducted an experimental study in which they compared a new method, intelligent k-means (ik-means), which chooses the right k, against seven other approaches. In their experiment, the method which performs best in terms of choosing the number of clusters and cluster recovery works as follows: clusters are chosen based on anomalies in the data, and a threshold based on Hartigan's rule [9] is used to eliminate small superfluous clusters.

DenStream was introduced by Cao et al. [10] to cluster streaming data under conditions of changing data distributions and noise in the streams. DenStream creates and maintains dense microclusters in an online process. Whenever a clustering request is issued, a macroclustering method (e.g., DBSCAN [11]) is used to compute the final clustering result on top of the microcluster centroids.

It should be noted that Chiang and Mirkin's [8] experimental setting only uses data generated from clusters with a Gaussian distribution. We argue that real-world data does not necessarily follow a Gaussian distribution. There is a large range of distributions which might fit the data better in different environments and applications, such as Cauchy, exponential, or triangular distributions. In order to reflect this, our selection criterion for the number of clusters is the shape of the data distribution of the different data features.

Transferring the clustering problem from a fixed environment to streaming data brings another dimension into play for interpreting the data. This dimension is the situation in which the data is produced. For our purpose, we define situation as the way the data is distributed in the data stream combined with the statistical properties of the data stream in a time frame. This situation depends on both the location and the time. For example, categorizing outdoor temperature readings into three different categories (i.e., cold, average, warm) is heavily dependent on the location, e.g., on the proximity to the equator: what is considered hot weather in the U.K. is perceived differently somewhere in the Caribbean.

Similarly, our interpretation of data can change when we fix the location but look at measurements taken at different points in time. For example, consider a temperature reading of 10 °C in the U.K. If this measurement was taken in winter, we would certainly consider it warm; if it was taken in summer, it would be considered cold. This phenomenon, where the same input data leads to a different outcome in the output, is known as concept drift [12], [13].

There are several existing methods and solutions focusing on concept drift; some of the recent works in this domain are reviewed in [14]. Over the last decade, a lot of research was dedicated to handling concept drift in supervised learning scenarios, mainly utilizing decision trees [15] or ensemble classifiers [16]; however, adaptation mechanisms in unsupervised methods have only recently started to be investigated [14].

There are different types of concept drift. If only the data distribution changes, without any effect on the output, it is called virtual drift. Real concept drift denotes cases where the output for the same input changes. This usually has one of the following reasons: either the perception of the categories or objectives has changed, or changes in the outcome are triggered by changes in the data distribution.
We argue that in the IoT domain, and especially in smart city applications, the latter type of concept drift is more important. In order to avoid confusion between the different types of concept drift, we introduce the term "data drift" to describe real concept drift that is caused by changes in the data stream.

Smith et al. [17] have developed a tool for creating data streams with data drifts through human interactions. In their experiments, they have found that current unsupervised adaptation techniques such as the near centroid classifier (NCC) [18] can fall victim to cyclic mislabeling, rendering the clustering results useless. While Smith et al. [17] found that semisupervised (semisupervised NCC) and hybrid adaptations of the technique (semi- and nearest-centroid classifier) lead to more robust results, adaptive methods are also needed in scenarios for which labels are not available and therefore only unsupervised learning can be applied.

Cabanes et al. [19] introduced a method which constructs a synthetic representation of the data stream from which the data distribution can be estimated. Using a dissimilarity measure for comparing the data distributions, they are able to identify data drifts in the input streams. Their work is limited by the fact that they only present preliminary results and are still working on an adaptive version of their approach.

Estimating the data distribution is an essential step for identifying and handling data drifts. The data distribution can be calculated using kernel density estimation (KDE) [20], [21]. The most important parameter for KDE is the bandwidth selection. There are different methods to choose this parameter automatically from the provided data. They include computationally light rules of thumb, such as Scott's rule [22] and Silverman's rule [23], and computationally heavy methods such as cross-validation [24]. A detailed survey on bandwidth selection for KDE is provided in [25]. However, the easily computed rules are sufficient for most practical purposes.

Bifet et al. [26] introduced massive online analysis (MOA), a framework for analyzing evolving data streams with a broad range of techniques implemented for stream learning. Initially MOA only supported methods for stream classification; extensions of the framework have added additional functionalities for data stream analytics. Particularly interesting for this paper is an extension of the framework which provides an easy way to compare different stream clustering algorithms. In addition to providing implementations of state-of-the-art stream clustering algorithms and evaluation measures, Kranen et al. [27] introduced new data generators for evolving streams based on randomized radial base functions (randomRBFGenerator). We compare our method against their implementations of CluStream [6] and DenStream [10], which are both designed to handle evolving data streams.

Stream clustering algorithms such as CluStream [6] and DenStream [10] stay adaptive to evolving data streams by splitting the clustering into offline and online parts. The online part continuously retrieves a representation of the data stream. This is done through the computation of microclusters. The microclusters allow for efficient and accurate computation of clusters by applying common clustering methods such as k-means++, DBSCAN [11], or similar methods as a macroclustering step whenever a clustering request is issued by the end-user or an application which uses the stream clustering mechanism. This means that the actual clusters and the labels for the data items are only computed when a clustering request on the data stream is made.

This approach works in scenarios where the clustering result is not needed continuously. However, if the clustering result is needed on a continuous basis and the offline calculation of the data stream representation has to be issued at high frequency, the efficiency gain of these methods is lost, and the response time in online applications with large volumes of dynamic data is limited by applying the macroclusters. Therefore, a new method with low computational complexity that can produce clustering results directly while processing the stream is required.

Our proposed solution to this problem is to create a clustering mechanism in which the centroids change and adapt to data drifts. We propose an adaptive method to recalibrate and adjust the centroids.

A. Silhouette Coefficient

The common metrics to evaluate the performance of clustering, such as homogeneity, completeness, and v-measure, are mainly suitable for offline and static clustering methods where a ground truth in the form of class labels is available. However, in our method, as the centroids are adapted with the data drifts, the latter metrics will not provide an accurate view of the performance. In order to measure the effectiveness of our method, we use the silhouette metric. The use of the silhouette coefficient as a criterion to choose the right value for the number of clusters has been proposed by Pravilovic et al. [28]. This metric is used in various works to measure the performance of clustering methods, including the MOA framework [26], [27] that is used in this paper for the evaluation and comparisons.

The silhouette metric as a quality measure for clustering algorithms was initially proposed by Rousseeuw [29]. Intuitively, it computes how well each data point fits into its assigned cluster compared to how well it would fit into the next best cluster (i.e., the cluster with the second smallest distance)

s(i) = (b(i) − a(i)) / max(a(i), b(i)).   (1)

The silhouette for one data point is defined in (1), whereby i represents the data point, b(i) is the average distance to each of the points in the next best cluster, and a(i) is the average distance to each of the points of the assigned cluster. The total silhouette score is obtained by taking the average of all s(i). From this definition, we can see that the silhouette width s(i) is always between −1 and 1. The interpretation of the values is as follows: values closer to 1 represent better categorization of the data point to the assigned cluster, while a value close to −1 denotes less efficiency in the categorization, i.e., the data point would have fit better into the next-nearest cluster. Following that, a silhouette width of 0 is neutral, that is to say, a data point with this value would fit equally well in both clusters. We average over all silhouette values s(i) to obtain the score of the overall clustering. This average value is the overall silhouette score and can be used to compare the quality of different clustering results.
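Equation (1) translates directly into code. The sketch below is a plain NumPy implementation of the per-point silhouette s(i) and the averaged overall score (libraries such as scikit-learn ship an equivalent `silhouette_score`); it assumes at least two clusters and, in line with Rousseeuw's caveat discussed below, no singleton clusters, since a(i) is undefined for a cluster of one point:

```python
import numpy as np

def silhouette_score(X, labels):
    """Average silhouette: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where
    a(i) is the mean distance from point i to its own cluster (excluding i)
    and b(i) the mean distance to the best alternative cluster."""
    scores = []
    clusters = np.unique(labels)
    for i, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        own = labels == labels[i]
        a = d[own & (np.arange(len(X)) != i)].mean()   # excludes the point itself
        b = min(d[labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

For two tight, well-separated clusters this returns a value near 1, while a labeling that mixes the two groups drops below 0, matching the interpretation given above.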
Rousseeuw [29] points out that single, strong outliers can lead to misleading results; therefore, it has to be made sure that there are no singleton clusters in the results. In order to use the silhouette coefficient in a streaming setting, we have to define the time frame from which the data points are taken into account for measuring the cluster quality. A natural contender for that time frame is the period since the centroids were last recalculated, since that is the point in time when we discovered a data drift and the new clustering has to adapt to the data stream from then on.

III. ADAPTIVE STREAMING CLUSTERING

Most stream clustering methods need to know in advance how many clusters can be found within a data stream, or at the very least are dependent on different parametrizations for their outcome. However, we are dealing with dynamic environments where the distribution of data streams can change over time. There is a need for adaptive clustering algorithms that adapt their parameters and the way they cluster the data based on changes in the data stream.

With the abundance of data produced, one of the main questions is not only what to do with the data but also what possibilities have not yet been considered. If new insights are obtained from the data, these can in turn inspire new applications and services. One can even go further and ignore any prior knowledge and assumptions (e.g., the type of data categories of results) in order to retrieve such insights. However, previously known knowledge can influence the expectations and therefore can enhance and/or alter the results.

A. Finding the Right Number of Clusters

One of the key problems in working with unknown data is how to determine the number of clusters that can be found in different segments of the data. We propose that the distribution of the data can give good indications of the categories. Since the data usually has several features, we look at the distribution of each feature. The shape of the probability distribution curve gives a good approximation of how many clusters we need to group the data.

Fig. 1. Distribution of the different features and their resulting PDFs.

Fig. 1(a) shows the distribution of two different features (average speed and number of cars) from the use case described in Section V-B. Fig. 1(b) shows the probability density functions (PDFs) that were computed using KDE [20], [21] from the data shown in Fig. 1(a). We follow the intuition that a directional change in the probability distribution, referred to as a turning point (tp), can signify the beginning of a new category. The tps which ended up producing the best clustering result are visualized as arrows in Fig. 1(b). We split the PDF into areas of equiprobable distributions, as visualized by the blue vertical lines. This idea is inspired by the symbolic aggregate approximation (SAX) algorithm [30], where a Gaussian distribution is split into equiprobable areas that are used to map continuous data from streams to discrete symbolized representations. Following this approach, we obtain smaller and denser clusters in areas with many data points, whereas in areas with fewer data points we get wider and sparser clusters.

Following that, the number of areas in the PDF can be considered as a possible k for the k-means algorithm, and the centers of these areas are then considered as possible initial centroids. In contrast to the random initialization of the original k-means [1], we propose a way to intelligently select the initial centroids. This makes the clustering part of the algorithm deterministic, and random restarts become unnecessary.

Since, in general, different features of a data stream do not follow the same distribution, the PDF curves obtained from different features contain more than one possible number for k and also provide different candidates for the centroids even if k happens to have the same value. Furthermore, the combination of the different feature distributions could also allow for combinations of optimal clusters which lie between the minimum and maximum number of turning points of the distribution functions. We test for

k ∈ [tp_min, tp_min + tp_max].   (2)

How can we then decide which of these values of k and which centroid candidates lead to a better clustering result? In order to answer this question, we need a metric with which we can compare the resulting clusters for different values of k when we apply the clustering mechanism to an initial sample set of size n. The metric must satisfy the following properties.
1) Independence of k.
2) Independence of knowledge of any possible ground truth.
Property one comes as no surprise. Since we have to compare clustering results with different k values, the metric must not be biased by the number of k's. For instance, this would be the case if we chose variance as a comparison criterion: there would be a strong bias toward higher values of k, since the variance within the clusters converges to zero as the number of clusters converges to the sample set size.
The second property is derived from the fact that our approach does not take prior knowledge into consideration. On one hand, the approach is domain independent. On the other hand, one of the main objectives is to extract information which is inherent in the data itself and, therefore, should not be obstructed by assumptions of any kind. This property instantly excludes the majority of commonly used metrics for cluster evaluation. Evaluation criteria such as purity, homogeneity, and completeness all evaluate the quality of the clusters by comparing the assigned labels of the data to the labels of the ground truth in one way or another.
68 IEEE INTERNET OF THINGS JOURNAL, VOL. 4, NO. 1, FEBRUARY 2017

and completeness all evaluate the quality of the clusters by Algorithm 1 DETERMINE C ENTROIDS(A, k, n)
comparing the assigned labels of the data to the labels of the Require: Data matrix A = {a0 , a1 , . . . , an } with each ai being
ground truth in one way or another. an array of length m containing all values of feature n
The silhouette metric described in Section II-A satisfies both 1: %Therefore sample j is the data point: [a0 [j], . . . , an [j]]
properties. In order to estimate the quality of the clusters, Ensure: List C = {c0 , c1 , . . . , cmax(tps) } of clusterings, with
they are examined by computing how well the data points fit each ci being a list of centroids with length k = tpmin + i
into their respective cluster in comparison to the next-nearest 2: pdf [] = ∅
neighbor cluster. 3: tps[] = ∅
4: for i ← 1 to n do
5: %Array containing the PDFs of each feature
B. Dealing With the Data Drift 6: pdf [i] = gaussianKDE(a[i])
In scenarios where the input data is fixed, once the k-means 7: %Array containing the number of turning points of the
algorithm with random restarts converges, the clusters become PDF
fixed. The resulting clusters can be then reused to categorize 8: tps[i] = countTurningPoints(pdf [i])
similar data sets. 9: end for
However, in the case of streaming data two observations 10: C[] = ∅
which are taken on two different (and widespread) time points 11: for i in range(min(tps), min(tps) + max(tps) do
do not necessary have the same meaning and consequently 12: betas[] = ∅
will belong to different clusters. This in turn leads to a dif- 13: %Each f represents the PDF of a feature
ferent clustering of the two observations. Identical data can 14: for f in pdf do
have different meaning when produced in a different situation. 15: betas[] = findBetas( f , tps[i])
For example, imagine observing 50 people at 3 P. M . during a 16: C[i] = list of means between two adjacent betas
weekday walking over a university campus. This would not be 17: end for
considered as busy given the usual number of students in the 18: end for
area. Observing the same number of people at 3 A . M . would 19: return C
however be considered as exceptional, giving the indication
that a special event is happening.
We incorporate this into our clustering mechanism by
adapting the centroids of the clusters based on the current
IV. A DAPTIVE S TREAMING k-M EANS :
distribution in the data stream. The data drift detection is
A C LOSER L OOK
triggered by changes in the statistical properties of the PDF.
The justification for our method is based on properties of Algorithm 1 shows how the centroids in our adaptive
stochastic convergence. Convergence in mean square [see (3)] method are computed. This takes place after a configurable
implies convergence in probability, which in turn implies convergence in distribution [31]. The formula for convergence in the mean square is given as

lim_{n→∞} E|X_n − X|^2 = 0.    (3)

During training, we store the standard deviation and expected value of the data with the current distribution. When processing new incoming data, we track how the expected value and standard deviation change given the new values. Equation 3 states that as a sequence approaches infinite length, it converges in mean square to the random variable X (in our case defined by the previously computed PDF). However, if, as more and more values arrive, the expected value converges to ε such that E|X_n − X|^2 = ε for n ≫ 0 and ε > 0, the current time series data is no longer converging to the distribution that we predicted with the PDF estimation. Therefore, we have a change in the underlying distribution of the data stream and trigger a recalibration of our methods. If we can observe a higher quality in the new clusters, the old centroids will be adjusted.

A detailed description of the algorithm follows in the next section.

initialization period and is repeated at the beginning of each adjustment step. More information about the data collection and the adjustment can be found in Section V-B. Initially, the PDFs of each of the features of the data are computed using KDE [20], [21]. The continuous PDFs are represented by discrete arrays.

These PDF representations are then fed into Algorithm 2. Turning points can be determined by analyzing the first derivative. They have the property that dy/dx = 0, where dx is the difference between two infinitely close x values of the PDF, dy is the difference between two infinitely close y values of the PDF, and dy/dx is the slope of the PDF. This is a necessary but not sufficient criterion for having a turning point. Only if the sign of dy/dx changes from negative to positive or vice versa do we actually have a turning point in our function. These are just the definitions of local maximum and minimum points, respectively. Finding these points, we obtain the turning points of a feature PDF, and this number can be used to determine the right number of clusters.

We use the heuristic that the right number of clusters lies between the smallest number of turning points of a feature PDF and this number added to the maximum number of turning points found in any of the PDFs.

Once the number of turning points—and therefore the possible values for the number of clusters—for each feature are
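The change-detection criterion above can be sketched as follows. This is an illustrative reading of the rule, not the authors' implementation: the threshold `eps`, the window handling, and all names are our assumptions.

```python
import numpy as np

def detect_drift(train, window, eps=0.5):
    """Illustrative drift check: if the mean squared deviation of a new
    window from the trained expected value stays eps above the variance
    predicted by the estimated PDF, flag a distribution change."""
    mu = np.mean(train)       # expected value under the trained distribution
    sigma2 = np.var(train)    # E|X_n - X|^2 expected when no drift occurs
    msd = np.mean((np.asarray(window) - mu) ** 2)
    return bool(msd - sigma2 > eps)
```

When such a check fires, the method would recompute the PDFs and centroid candidates, and keep the new centroids only if the cluster quality improves.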
Algorithm 2 COUNTTURNINGPOINTS(f)
Require: Array f representing a probability density distribution
Ensure: Number of turning points tps
1: Δy = εy · max(f)
2: % Little gradient should be recognised as no gradient
3: Δx = εx · max(f)
4: % Areas of no ascent/descent should only be counted if they last long enough
5: for x in f do
6:   if changedDirection(x, f, Δy, Δx) then
7:     tps++
8:   end if
9: end for
10: return tps

Algorithm 3 STREAMINGKMEANS(D)
Require: Data stream D, length of data sequence used for initialisation l
Ensure: Continuous clustering of the data input stream
1: % initialisation phase
2: for cCs in determineCentroids(D.next(l)) do
3:   currentk = length(cCs)
4:   nCs = kmeans(cCs, D.next(l), currentk)
5:   if silhouette(nCs) > lastSil then
6:     % We found a new best clustering
7:     centroids = nCs
8:     lastSil = silhouette(nCs)
9:     k = currentk
10:  end if
11: end for
12: % Continuous clustering phase
13: loop
14:   if changeDetected(D) then
15:     centroids = determineCentroids()
16:     centroids = kmeans(centroids, D.getData(), k)
17:   else
18:     centroids = kmeans(centroids, D.getData(), k)
19:   end if
20: end loop

computed, Algorithm 1 determines candidates for the initial centroids.

The computation then splits the PDF curve into equiprobable areas (similar to the SAX algorithm [30]). The boundaries of these areas are called beta points. Since we are interested in the centers of these regions, the middle points between two adjacent betas are computed and saved as initial centroids.
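The turning-point count and the beta-point initialisation can be sketched on a discretised PDF as follows. This is a simplified stand-in for Algorithms 1 and 2, not the paper's code: a single gradient threshold replaces the εx/εy pair, quantiles stand in for the equiprobable split, and all names are illustrative.

```python
import numpy as np

def count_turning_points(pdf, eps_y=0.01):
    """Count sign changes of the discretised PDF's slope, i.e. local
    maxima and minima; near-zero slopes are treated as flat."""
    dy = np.diff(pdf)
    # keep only slopes clearly distinguishable from zero
    signs = np.sign(dy[np.abs(dy) > eps_y * pdf.max()])
    return int(np.sum(signs[1:] != signs[:-1]))

def initial_centroids(values, k):
    """Split the feature's range into k equiprobable regions whose
    boundaries (beta points) are quantiles; return region midpoints."""
    betas = np.quantile(values, np.linspace(0.0, 1.0, k + 1))
    return (betas[:-1] + betas[1:]) / 2
```

For a bimodal PDF, `count_turning_points` finds two maxima and the minimum between them (three turning points), and `initial_centroids` yields one seed per equiprobable region.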
Once the candidates for the initial clusters are identified—one for each feature—a normal k-means is run on the dataset with the initial centroids as starting points for the clustering. The results are then compared by computing the silhouettes. For an in-depth description, we refer the reader to the paper by Rousseeuw [29]; however, a short elaboration on the mathematical background is also given in Section II-A.

Incoming data points are fed into the k-means with the current cluster centroids and assigned to their nearest cluster. The cluster centroid of the assigned cluster is adjusted to reflect the new center of the cluster including the inserted value.

We give a brief complexity analysis of Algorithms 1 through 3. We start with Algorithm 2, since it is used by Algorithm 1. Since the algorithm goes along the array f, which represents the PDF, the complexity lies in O(length(f)). This array has exactly the same length as the array which has been fed into the function gaussianKDE, which computes the PDF representation. Algorithm 1 is called with a matrix of dimensions n × l. Here, n is the number of features and l is the length of the initial data sequence. Therefore, length(f) = l and we have a complexity of O(l) for Algorithm 2. At the same time, the gaussianKDE function scales linearly with the size of the input array, resulting in a complexity of O(l) as well.

For Algorithm 1, we first look at lines 3–6. Here, for each of the n feature vectors both gaussianKDE and Algorithm 2 are called. This results in a complexity of O(n · l). We then examine lines 8–14. We can see that the outer for loop runs exactly max(tps) times, which in practice is a small constant. The inner for loop runs in the length of the number of PDFs; since we compute one PDF for each feature, this is equal to n times. The function findBetas goes along the input array to find the beta values and therefore scales in the length of the input array. Putting this information together results again in a complexity of O(n · l). Since both parts of the algorithm have the same complexity, the total complexity equals O(n · l).

Algorithm 3 uses Algorithm 1 for finding the initial centroids. Running k-means with determined initial centroids takes O(nkd) since no iterations of the algorithm are needed. Then for each clustering we compute the silhouette score. Calculating the silhouette score is computationally intensive since each distance pair has to be computed. Here, we can apply the following steps to increase the performance. Instead of calculating the distance pairs for each value of k, we can initially compute the distance-pair matrix and pass it to the silhouette calculation. For cases where we have a huge number of data values, we perform random sampling to decrease the number of distance pairs that have to be computed. In practice, sampling has been shown to provide close approximations to the actual silhouette score at a fraction of the computational cost. During the online clustering phase, assigning the nearest cluster to new incoming values takes only O(1) time. Recalibrating the cluster centroids requires one call of Algorithm 1 (O(n · l)) and another run of k-means (O(nkd)).

V. EVALUATION
We test the proposed method both on synthesized data and on a real-world data set. We evaluate the method against state-of-the-art methods using data sets which are generated in different ways and discuss in which cases the method has an advantage over existing approaches and in which cases it is outperformed by them.
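The two silhouette shortcuts described in the complexity analysis of Section IV (computing the distance-pair matrix once and subsampling when the data is large) can be sketched as follows. This is a simplified illustration, not the paper's implementation; the uniform subsampling and all names are our assumptions.

```python
import numpy as np

def silhouette_sampled(X, labels, sample=None, seed=0):
    """Mean silhouette over (optionally subsampled) points, reusing a
    single precomputed pairwise distance matrix for all a/b terms."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(X))
    if sample is not None and sample < len(X):
        idx = rng.choice(len(X), sample, replace=False)
    Xs, ls = X[idx], labels[idx]
    # distance-pair matrix, computed once and reused below
    D = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=-1)
    scores = []
    for i in range(len(Xs)):
        same = ls == ls[i]
        same[i] = False
        if not same.any():
            continue                      # singleton cluster: undefined
        a = D[i][same].mean()             # mean intra-cluster distance
        b = min(D[i][ls == c].mean() for c in np.unique(ls) if c != ls[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Passing the full index set reproduces the exact score, while a modest `sample` trades a small approximation error for a quadratic reduction in distance computations.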
We introduce a novel way of generating data streams with data drift. The drift is introduced both by shifting the centroids in randomized intervals and by changing the data distribution function used to randomly draw the data from the centroids. Here, we can scale up the dimensionality of the generated data, with each feature having its own distribution to draw the data from.

Finally, we show how our method can be used in a real-world case study by applying it to the output of traffic sensors which measure the average speed and the number of cars in street segments.
A. Synthesized Data
To evaluate our method, we test it against two established stream cluster algorithms: 1) CluStream [6] (horizon = 1000, maxNumKernels = 100, kernelRadiFactor = 2) and 2) DenStream [10] (horizon = 1000, epsilon = 0.02, beta = 0.2, mu = 1, initPoints = 1000, offline = 2, lambda = 0.25). For this we use two different ways of generating data streams with data drift. The first one, randomRBFGenerator, was introduced by Kranen et al. [27]. Given an initial fixed number of centroids and dimensions, the centroids are randomly generated and assigned a standard deviation and a weight. New samples are generated as follows. Using the weight of the centroids as a selection criterion, one of the centroids is picked. By choosing a random direction, the new sample is offset from the centroid by a vector whose length is drawn from a Gaussian distribution with the standard deviation of the centroid. This creates clusters of varying densities. Each time a sample is drawn, the centroids are moved with a constant speed, initialized by an additional parameter, creating the data drift.

This, however, has the drawback that the data drift is not natural, as the centroids are constantly shifting. We argue that during a short time frame, the data stream roughly follows a certain distribution. The underlying distribution can then change between time-frames, triggered by situational changes. These changes can be recurring in time (for example in the case of traffic during rush hours and off-peak times) or more sudden changes (for example traffic congestions caused by an accident).

For that reason, we introduce a novel way of generating data with data drift. The centroids are selected through Latin hypercube sampling [32]. The number of clusters and dimensions are fixed beforehand. Similar to the method before, each centroid is assigned a standard deviation and weight. Furthermore, each dimension is given a distribution function, which is later used to generate the data samples. Considering that each dimension represents a feature of a data stream, this models the fact that in IoT applications we are dealing with largely heterogeneous data streams in which the features do not follow the same data distribution. Our current implementation supports triangular, Gaussian, exponential, and Cauchy distributions. The implementation is easily expandable and can support other common or custom distributions. The data generation code is available via our website at: http://iot.ee.surrey.ac.uk/

Data drift is added sporadically and is independent for each dimension through two different ways. The first is a directional change of random length. The second is that over the course of generating the data, the data distribution used for the dimension is changed for one or more of the dimensions. Both changes appear in random intervals.

We compare our method against CluStream and DenStream. Fig. 2(a) shows the performance on data generated by the randomRBFGenerator with data drift. The results for the data generated by the introduced novel way with different numbers of features are shown in Fig. 2(b)–(d). One hundred centroids have been used for the data generation. For the visualization, the silhouette score has been normalized to a range between 0 and 1 as done within the MOA framework.¹

On the data produced by the randomRBFGenerator, our novel method consistently outperforms CluStream by around 13%. DenStream performs better at times; however, the silhouette score of DenStream drops below the levels of CluStream at times, suggesting that the method does not adapt consistently to the drift within the data. As seen in Figs. 4–7, for the synthesized data with different numbers of features, our novel method consistently performs around 40% better than CluStream and more than 280% better than DenStream.

Fig. 2. Silhouette coefficient comparison on synthetic data sets.

¹[Online]. Available: http://www.cs.waikato.ac.nz/~abifet/MOA/API/_silhouette_coefficient_8java_source.html
²[Online]. Available: http://www.odaa.dk/dataset/realtids-trafikdata

B. Case Study: Real-Time Traffic Data
To showcase how our approach can be applied to real-world scenarios, we use (near-)real-time traffic data from the city of Aarhus.² 449 traffic sensors are deployed in the city which produce new values every 5 min. The data is pulled and fed into an implementation of our clustering algorithm that is
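A minimal sketch of such a generator (Latin hypercube centroids, per-dimension noise distributions, sporadic drift) could look as follows. This is an illustration of the idea, not the released code from the website above; all names, constants, and the choice of noise distributions are ours.

```python
import numpy as np

def latin_hypercube(n_points, n_dims, rng):
    """One sample per stratum along every dimension, strata shuffled."""
    grid = (np.arange(n_points) + rng.random((n_dims, n_points))) / n_points
    for row in grid:
        rng.shuffle(row)
    return grid.T                          # shape (n_points, n_dims)

def drifting_stream(n_samples, n_centroids=5, n_dims=2, drift_prob=0.01, seed=0):
    """Yield samples around Latin-hypercube centroids; each dimension
    uses its own noise distribution, and centroids shift sporadically."""
    rng = np.random.default_rng(seed)
    centroids = latin_hypercube(n_centroids, n_dims, rng)
    # per-dimension noise: Gaussian and exponential as two examples
    noise = [lambda: rng.normal(0.0, 0.05), lambda: rng.exponential(0.05)]
    for _ in range(n_samples):
        c = centroids[rng.integers(n_centroids)]
        yield np.array([c[d] + noise[d % len(noise)]() for d in range(n_dims)])
        if rng.random() < drift_prob:      # sporadic directional drift
            centroids += rng.normal(0.0, 0.1, centroids.shape)
```

Swapping in other noise distributions per dimension, or changing a dimension's distribution mid-stream, reproduces the second kind of drift described above.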
Fig. 3. Traffic densities at different times in Aarhus.

Fig. 4. Silhouette coefficients.

described in Section IV. Before the value of k is computed and the initial centroids are determined, the data is collected for one hour, which equates to 5035 data points. The main data is collected for a period of 5 months. Over the course of one day, 122 787 samples are collected. For the clustering we use the number of cars and average speed measured by the sensors. For the purpose of evaluation and visualization, a timestamp and location information are also added to each data point.

The intuition is that the clustering of traffic data is dependent on the time of day and the location. This perspective allows us to cluster incoming data in a way that better reflects the current situation. Fig. 3(a)–(c) visualize the traffic density as a heat map in the morning, at noon, and in the evening in central Aarhus, respectively. Light green areas in the map show a low traffic density, while red areas indicate a high density. No coloring means that there are no sensors nearby. In the morning [Fig. 3(a)], there is only moderate traffic density spread around the city. At noon [Fig. 3(b)], we can see that two big centers of very high density (strong red hue) have emerged. Fig. 3(c) shows that in the evening there is only very low to moderate traffic density in the city.

Several reasons for data shift on varying degrees of temporal granularity are conceivable. During the day, the density of traffic changes according to the time. In rush hours, where people are simultaneously trying to get to or back from work, the data distribution of traffic data differs greatly from less busy times during working hours or at night. For the same reasons, the data distribution is also quite different on weekends than on weekdays. During holidays, e.g., around Christmas or Easter, the dynamics of traffic can change a great deal. All these changes in the traffic data distribution lead to a need to reconsider what can be categorized as a busy or "quiet" street; in other words, we are dealing with data drift in these cases, as the same input leads to different output at different times.

Our approach deals with the data drift by recalculating the centroids based on the current distribution. This means that defining a street as busy or quiet is relative and depends on the time of the day and the location. For example, 15 cars at a given time in one street could mean busy while in another situation it could mean "very quiet." Similarly, 15 cars in the city center can have a different meaning during the day than at night.

We use a ring buffer as a data cache that captures the data produced in the last hour. Whenever a recalculation process is triggered based on a detected data drift because the data no longer converges in the mean square (see Section III-B), we use the silhouette coefficient score to check if the new centroids lead to a better cluster quality. If that is the case, the current centroids are then adjusted. The definition of the silhouette coefficient and its computation can be found in Section II-A.

Fig. 4 shows the computed silhouette coefficients for clustering the initial data sequence with different numbers of clusters. The number of clusters is chosen according to the highest value of the coefficient and is emphasized in bold. For performance reasons, a sampling size of 1000 has been chosen to compute these values. It can be decided based on the data input and the task at hand if the number of clusters should stay constant through the remainder of clustering the data. In our use case scenario, we do not change the number of clusters.

Fig. 5 shows how the centroids of the clusters change at different times on a working day and on a Saturday for the different recalculation times. The way the data is clustered differs significantly between the two days. While the average speed remains roughly the same, the number of vehicles varies a lot. Most prominently, this difference can be seen in the centroids of the cluster representing a high number of cars. For example, in Fig. 5(a) the number of cars is considerably higher at noon than in the evening. Resulting from the changed centroids, data points may be assigned to different clusters depending on the time compared to using nonadaptive clustering. For example, using the centroids of a working day at noon on the data published during the same time on Saturday, 180 out of 3142 values would be assigned to a different cluster.

In order to interpret how busy an area is, it is necessary to also take into consideration all adjacent time points in that area. Therefore, the output of our approach can be used to further abstract from the data by matching patterns within a time frame in an area. The results can be fed into an event detection and interpretation method. To clarify the reasoning behind this, two examples are given. Let us consider a measurement of a fast moving car. Only if the measurements in a close time range are similar can we interpret this traffic as a good traffic flow. If this is not the case, a different event
has taken place, e.g., maybe an ambulance rushed through the street.

Another example would be the measurement of slow moving cars. If this is a singular measurement, it could mean, for example, that somebody is driving slowly because she/he is searching for a parking space. However, if the same measurement accumulates, it could mean that a traffic jam is taking place.

In order to ensure that the adaptive method leads to meaningful results, we have conducted another experiment. We compare our adaptive stream clustering method with a nonadaptive streaming version of the k-means algorithm, i.e., the centroids are never recalculated. The silhouette coefficients of the clusters in both settings are computed in equal time intervals. Fig. 6(a) shows how the silhouette coefficients compare over the course of one day. For example, at February 22, 2014, 05:30:00 the nonadaptive approach scores a silhouette coefficient of 0.410 while the adaptive approach scores 0.538, an improvement of 31.2%. This means items clustered by the adaptive method have a better convergence considering the distribution of the data at that time.

Fig. 6(b) shows how the silhouette coefficients compare over the course of one week. The adaptive clustering performs mainly better than the nonadaptive one. The cluster quality of the adaptive solution follows a daily pattern. During the day, the quality drops to levels of the nonadaptive solution. This can be explained through the fact that during the night the traffic flow is more clear cut; for example, there are more cases where there is no traffic at all. The adaptive solution is able to exploit this fact and adapt itself to produce better clusters that are closer to the actual categories that can be found in the data streams. During the day, there are many more data samples on the edge of clusters that could be clustered into either one of the adjacent clusters, leading to a worse silhouette coefficient score. Here, the quality of the adaptive clustering at times drops to values near the quality of the nonadaptive one. However, the next iteration of the adaptive clustering improves the cluster quality again automatically based on changes in the data distribution. At some points in time it drops below the quality of the nonadaptive one; in these cases the quality quickly recovers to better values again. Overall, the mean of the silhouette coefficient is 0.41 in the nonadaptive setting and 0.463 in the adaptive setting, which translates to an average improvement of 12.2% in cluster quality.

Fig. 5. Centroids adapting to changes in the data stream. Low, medium, and high refer to the level of traffic density the cluster is representing.

Fig. 6. Silhouette coefficient on traffic data set.

VI. CONCLUSION
In this paper, we have introduced an adaptive clustering method that is designed for dynamic IoT data streams. The method adapts to data drifts of the underlying data streams. The proposed method is also able to determine the number of categories found inherently in the data stream based on the data distribution and a cluster quality measure. The adaptive method works without prior knowledge and is able to discover inherent categories from the data streams.

We have conducted a set of experiments using synthesized data and data taken from a traffic use-case scenario where we analyze traffic measurements from the city of Aarhus. We run the adaptive stream clustering method and compare it against a nonadaptive stream cluster algorithm. Overall, the clusters produced using an adaptive setting have an average improvement of 12.2% in the cluster quality metric (i.e., silhouette coefficient) over the clusters produced using a nonadaptive setting.

Compared to state-of-the-art stream cluster methods, our novel approach shows significant improvements on synthesized data sets. Against CluStream, there are performance improvements between 13% and 40%. On data generated by randomRBFGenerator, DenStream has better cluster quality at a few points of the experiment, but is generally outperformed by our method. On the other synthesized data streams, our novel approach shows an improvement of more than 280% compared to DenStream.

The results of our clustering method can be used as an input for pattern and event recognition methods and for analyzing real-world streaming data. To clarify our approach, we have used k-means as the underlying clustering mechanism; however, the concepts of our approach can also be applied to other clustering methods. For the latter, the distribution analysis and cluster update mechanisms can be directly adapted from this paper, and only the cluster and centroid adaptation mechanisms need to be implemented for the other clustering solution.
For the future work, we plan to apply the proposed solution to different types of multimodal data in the IoT domain. We will also investigate the concept drift and clustering updates based on user requirement changes and target changes. In this paper, we proposed a clustering method designed to deal with drifts in the data. For this we have not considered the spatial dimension of the data. Spatial clustering and auto-correlation are important topics in data mining and we aim to extend this paper with solutions to this problem.

APPENDIX
We have applied our approach on an additional multivariate, real-world data set well-known in stream clustering and classification tasks, the forest cover types data set. The data set was originally introduced by Blackard and Dean [33] and is available online in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Covertype). The forest cover types for 30 × 30 meter cells were obtained from U.S. Forest Service Region 2 Resource Information System data; the data set contains ten continuous variables and has more than 580 000 data samples.

Fig. 7 shows that while DenStream performs better on average, its cluster quality drops down to misclustering (values below 0.5) at times during the stream clustering. Our approach shows a consistent cluster quality while processing the data stream.

Fig. 7. Forest cover type data set.

REFERENCES
[1] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.
[2] P. S. Bradley, O. L. Mangasarian, and W. N. Street, "Clustering via concave minimization," in Proc. Adv. Neural Inf. Process. Syst., Denver, CO, USA, 1997, pp. 368–374.
[3] L. Xu and D. Schuurmans, "Unsupervised and semi-supervised multi-class support vector machines," in Proc. 20th Nat. Conf. Artif. Intell., Pittsburgh, PA, USA, 2005, pp. 904–910.
[4] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proc. 18th Annu. ACM-SIAM Symp. Discrete Algorithms, New Orleans, LA, USA, 2007, pp. 1027–1035.
[5] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams," in Proc. 41st Annu. Symp. Found. Comput. Sci., Redondo Beach, CA, USA, 2000, pp. 359–366.
[6] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in Proc. 29th Int. Conf. Very Large Data Bases, Berlin, Germany, 2003, pp. 81–92.
[7] M. R. Ackermann et al., "StreamKM++: A clustering algorithm for data streams," J. Exp. Algorithmics, vol. 17, no. 1, pp. 173–187, 2012.
[8] M. M.-T. Chiang and B. Mirkin, "Intelligent choice of the number of clusters in k-means clustering: An experimental study with different cluster spreads," J. Classif., vol. 27, no. 1, pp. 3–40, 2010.
[9] J. A. Hartigan, Clustering Algorithms. New York, NY, USA: Wiley, 1975.
[10] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," in Proc. Conf. Data Min., Bethesda, MD, USA, 2006, pp. 328–339.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. Conf. Knowl. Disc. Data Min., vol. 2. Portland, OR, USA, 1996, pp. 226–231.
[12] J. C. Schlimmer and R. H. Granger, "Incremental learning from noisy data," Mach. Learn., vol. 1, no. 3, pp. 317–354, 1986.
[13] G. Widmer and M. Kubat, "Learning in the presence of concept drift and hidden contexts," Mach. Learn., vol. 23, no. 1, pp. 69–101, 1996.
[14] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Comput. Surveys, vol. 46, no. 4, 2014, Art. no. 44.
[15] H. Yang and S. Fong, "Countering the concept-drift problem in big data using iOVFDT," in Proc. IEEE Int. Congr. Big Data, Santa Clara, CA, USA, 2013, pp. 126–132.
[16] D. M. Farid et al., "An adaptive ensemble classifier for mining concept drifting data streams," Expert Syst. Appl., vol. 40, no. 15, pp. 5895–5906, 2013.
[17] J. Smith, N. Dulay, M. A. Tóth, O. Amft, and Y. Zhang, "Exploring concept drift using interactive simulations," in Proc. IEEE Int. Conf. Pervasive Comput. Commun. Workshops (PERCOM Workshops), San Diego, CA, USA, 2013, pp. 49–54.
[18] K. Forster, D. Roggen, and G. Troster, "Unsupervised classifier self-calibration through repeated context occurences: Is there robustness against sensor displacement to gain?" in Proc. IEEE Int. Symp. Wearable Comput., Linz, Austria, 2009, pp. 77–84.
[19] G. Cabanes, Y. Bennani, and N. Grozavu, "Unsupervised learning for analyzing the dynamic behavior of online banking fraud," in Proc. IEEE 13th Int. Conf. Data Min. Workshops, Dallas, TX, USA, 2013, pp. 513–520.
[20] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Stat., vol. 33, no. 3, pp. 1065–1076, 1962.
[21] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," Ann. Math. Stat., vol. 27, no. 3, pp. 832–837, 1956.
[22] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. New York, NY, USA: Wiley, 1992.
[23] B. W. Silverman, Density Estimation for Statistics and Data Analysis. New York, NY, USA: Chapman & Hall, 1986.
[24] D. W. Scott and G. R. Terrell, "Biased and unbiased cross-validation in density estimation," J. Amer. Stat. Assoc., vol. 82, no. 400, pp. 1131–1146, 1987.
[25] M. C. Jones, J. S. Marron, and S. J. Sheather, "A brief survey of bandwidth selection for density estimation," J. Amer. Stat. Assoc., vol. 91, no. 433, pp. 401–407, 1996.
[26] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: Massive online analysis," J. Mach. Learn. Res., vol. 11, pp. 1601–1604, Jan. 2010.
[27] P. Kranen et al., "Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within MOA," in Proc. IEEE Int. Conf. Data Min. Workshops, Sydney, NSW, Australia, 2010, pp. 1400–1403.
[28] S. Pravilovic, A. Appice, and D. Malerba, "Integrating cluster analysis to the ARIMA model for forecasting geosensor data," in Foundations of Intelligent Systems, Cham, Switzerland, 2014, pp. 234–243.
[29] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, 1987.
[30] J. Lin, E. Keogh, S. Lonardi, and B. Chiu, "A symbolic representation of time series, with implications for streaming algorithms," in Proc. 8th ACM SIGMOD Workshop Res. Issues Data Min. Knowl. Disc., San Diego, CA, USA, 2003, pp. 2–11.
[31] R. J. Serfling, Approximation Theorems of Mathematical Statistics. New York, NY, USA: Wiley, 2009.
[32] M. D. McKay, R. J. Beckman, and W. J. Conover, "A comparison of three methods for selecting values of input variables in the analysis of output from a computer code," Technometrics, vol. 21, no. 2, pp. 239–245, 1979.
[33] J. A. Blackard and D. J. Dean, "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables," Comput. Electron. Agriculture, vol. 24, no. 3, pp. 131–151, 1999.

Daniel Puschmann is currently pursuing the Ph.D. degree at the Institute for Communication Systems, University of Surrey, Guildford, U.K. His current research interests include information abstraction and extracting actionable information from streaming data produced in the Internet of Things using stream processing and machine learning techniques.

Payam Barnaghi (S'04–A'05–M'06–SM'12) is a Reader with the Institute for Communication Systems Research, University of Surrey, Guildford, U.K. He is also the Coordinator of the EU FP7 CityPulse Project. His current research interests include machine learning, the Internet of Things, the semantic Web, adaptive algorithms, and information search and retrieval. Dr. Barnaghi is a Fellow of the Higher Education Academy.

Rahim Tafazolli (M'93–SM'08) is a Professor and the Director of the Institute for Communication Systems, University of Surrey, Guildford, U.K. He has been active in research for over 20 years. He has authored or co-authored over 360 papers in refereed international journals and conferences. Prof. Tafazolli is a Fellow of the IET and WWRF. He is the founder and the Past Chairman of the IET International Conference on 3rd Generation Mobile Communications. He is the Chairman of the EU Expert Group on Mobile Platform (e-mobility SRA), the Chairman of the Post-IP Working Group in e-mobility, and the Past Chairman of WG3.