Adaptive Clustering For Dynamic IoT Data Streams
Abstract—The emergence of the Internet of Things (IoT) has led to the production of huge volumes of real-world streaming data. We need effective techniques to process IoT data streams and to gain insights and actionable information from real-world observations and measurements. Most existing approaches are application or domain dependent. We propose a method which determines how many different clusters can be found in a stream based on the data distribution. After selecting the number of clusters, we use an online clustering mechanism to cluster the incoming data from the streams. Our approach remains adaptive to drifts by adjusting itself as the data changes. We benchmark our approach against state-of-the-art stream clustering algorithms on data streams with data drift. We show how our method can be applied in a use case scenario involving near real-time traffic data. Our results allow us to cluster, label, and interpret IoT data streams dynamically according to the data distribution. This enables adaptive online processing of large volumes of dynamic data based on the current situation. We show how our method adapts itself to the changes. We demonstrate how the number of clusters in a real-world data stream can be determined by analyzing the data distributions.

Index Terms—Adaptive clustering, Internet of Things (IoT), stream processing.

Manuscript received July 1, 2016; revised October 6, 2016; accepted October 6, 2016. Date of publication October 19, 2016; date of current version February 8, 2017. This work was supported by the European Commission Seventh Framework Programme for the CityPulse Project under Grant 609035. The authors are with the Institute for Communication Systems, University of Surrey, Guildford, GU2 7XH, U.K. (e-mail: [email protected]). Digital Object Identifier 10.1109/JIOT.2016.2618909

I. INTRODUCTION

THE SHIFT from the desktop computing era toward ubiquitous computing and the Internet of Things (IoT) has given rise to huge amounts of continuous data collected from the physical world. The data produced in the IoT context has several characteristics which make it different from other data used in common database systems and machine learning or data analytics. IoT data can come from multiple different heterogeneous sources and domains, for example numerical observations and measurements from different sensors or textual input from social media streams. Common data streams usually follow a Gaussian distribution over a long-term period. However, in IoT applications we need to consider short-term snapshots of the data, in which we can have a wider range of more sporadic distributions. Furthermore, the nature of IoT data streams is dynamic and their underlying data distribution can change over time. Another point is that the data comes in large quantities and is produced in real-time or close to real-time. This necessitates the development of IoT-specific data analytics solutions which can handle the heterogeneity, dynamicity, and velocity of the data streams.

To group the data coming from the streams, we can use clustering or classification methods. Classification methods require supervised learning and need labeled training data. Huge amounts of data are usually produced in IoT applications; however, these data usually lack labels, which makes such methods infeasible. While clustering methods avoid this pitfall since they do not need supervised learning, they work best in offline scenarios where all data is present from the start and the data distribution remains fixed. In this paper, we propose a clustering method with the ability to cope with changes in the data stream, which makes it more suitable for IoT data streams.

Data is usually clustered according to different criteria, e.g., similarity and homogeneity. The clustering results in a data analysis scenario can be interpreted as categories in a dataset and can be used to assign data to various groups (i.e., clusters). In this paper, we discuss an adaptable clustering method that analyzes the distribution of data and updates the cluster centroids according to the online changes in the data stream. This allows creating dynamic clusters and assigning data to these clusters not only by their features (e.g., geometric distances) but also by investigating how the data is distributed at a given time. We evaluate this clustering method against several state-of-the-art methods on evolving data streams.

To showcase the applicability of this paper, we use a case study from an intelligent traffic analysis scenario. In this scenario, we cluster the traffic sensor measurements according to features such as average speed of vehicles and number of cars. These clusters can then be analyzed to assign them a label; for example, a cluster that always includes the highest number of cars, according to the overall density of the cars at a given time and/or the capacity of a street, will be given the "busy" tag. By further abstracting we can identify events such as traffic jams, which can be used as an input for automated decision making systems such as automatic rerouting via GPS navigators.

This paper is organized as follows. In Section II, we present the state of the art and discuss the benefits and drawbacks of different stream cluster algorithms. We present related work on analyzing stream data with concept and data drifts. The silhouette coefficient is chosen as a metric for measuring the cluster quality, and its mathematical background is described in Section II. In Section III, we introduce the concepts of our adaptive online clustering method, which automatically computes the best number of clusters based on the data distribution. Section IV describes the proposed adaptive clustering method in more technical detail. We present evaluations of this paper in Section V. In Section V-A, we compare our method against state-of-the-art methods on a synthesized data set. We have conducted a case study using traffic data and present the results in Section V-B. In Section VI, we discuss the significance of this paper and outline the future work.
II. RELATED WORK

Several approaches for the clustering problem exist; however, we will only take a closer look at a particular approach: Lloyd's algorithm, better known under the name k-means [1]. It should be noted that this particular approach has been selected to be improved for the purpose of streaming data clustering because of its simplicity. The concept of utilizing the data distribution can also be applied to determining the parameter k for k-median [2] or the number of classes in unsupervised multiclass support vector machines [3] and other clustering algorithms.

k-means splits a given data set into k different clusters. It does so by first choosing k random points within the data set as initial cluster centroids and then assigning each data point to the most suitable of these clusters while adjusting the centers. This process is repeated with the output as the new input arguments until the centroids converge toward stable points. Since the final result of the clustering is heavily dependent on the initial centroids, the whole process is carried out several times with different initial parameters. For a data set of fixed size this might not be a problem; however, in the context of streaming data this characteristic of the algorithm leads to heavy computational overhead.

Convergence of k-means to a clustering using the random restarts not only means that this procedure takes additional time, but depending on the data set, k-means can also produce lower quality clusters. k-means++ [4] is a modification of k-means that intelligently selects the initial centroids based on randomized seeding. While the first center is chosen randomly from a uniform distribution, the following centroids are selected with a probability weighted based on their proportion to the overall potential.
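To make the seeding step concrete, the following minimal sketch (our own illustration, not code from [4]; the function name, array shapes, and random seed are arbitrary) draws every additional centroid with probability proportional to its squared distance to the closest centroid picked so far, i.e., to its share of the overall potential:

import numpy as np

def kmeanspp_init(X, k, rng=None):
    # First centroid uniformly at random, the rest with probability
    # D(x)^2 / sum(D(x)^2), where D(x) is the distance to the nearest
    # centroid chosen so far.
    rng = rng or np.random.default_rng(0)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        diff = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min((diff ** 2).sum(axis=-1), axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)

# usage on toy data: centers = kmeanspp_init(np.random.rand(500, 2), k=4)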
STREAM [5] is a one-pass clustering algorithm which treats data sets that are too large to be processed in-memory as a data stream; however, the approach has shown limitations in cases where the data stream evolves over time, leading to misclustering. Aggarwal et al. [6] introduced their approach called CluStream that is able to deal with these cases. CluStream is also able to give information about past clusters for a user-defined time horizon. Their solution works by dividing the stream cluster problem into an online microcluster component and an offline macroclustering component. One drawback of Aggarwal et al.'s [6] approach is that the number of clusters has to be either known in advance and fixed, or chosen by a user in each step, which means that human supervision has to be involved in the process.

Another well-known approach in stream clustering is StreamKM++ [7]. This approach is based on k-means++ [4]. However, again the number of clusters needs to be known beforehand. StreamKM++ constructs and maintains a core-set representing the data stream. After the data stream is processed, the core-set is clustered with k-means++. As of that, StreamKM++ is not designed for evolving data streams.

There are several approaches to deal with the problem of identifying how many different clusters can be found in a data set. Chiang and Mirkin [8] have conducted an experimental study in which they compared a new method based on k-means that chooses the right k with seven other approaches. In their experiment, the method which performs best in terms of choosing the number of clusters and cluster recovery works as follows. Clusters are chosen based on new anomalies in the data, and a threshold based on Hartigan's rule [9] is used to eliminate small superfluous clusters.

DenStream was introduced by Cao et al. [10] to cluster streaming data under the conditions of changing data distributions and noise in data streams. DenStream creates and maintains dense microclusters in an online process. Whenever a clustering request is issued, a macrocluster method (e.g., DBSCAN [11]) is used to compute the final cluster result on top of the microcluster centroids.

It should be noted that the experimental setting in Chiang and Mirkin [8] only uses data generated from clusters with a Gaussian distribution. We argue that data from the real world does not necessarily follow a Gaussian distribution. There is a large range of distributions which might fit the data better in different environments and applications, such as Cauchy, exponential, or triangular distributions. In order to reflect this, our selection criterion for the number of clusters is the shape of the data distribution of the different data features.

Transferring the cluster problem from a fixed environment to streaming data brings another dimension into play for interpreting the data. This dimension is the situation in which the data is produced. For our purpose, we define situation as the way the data is distributed in the data stream combined with statistical properties of the data stream in a time frame. This situation depends both on the location and the time. For example, categorizing outdoor temperature readings into three different categories (i.e., cold, average, warm) is heavily dependent on the location, e.g., on the proximity to the equator. For example, what is considered hot weather in the U.K. is perceived differently somewhere in the Caribbean.

Similarly, our interpretation of data can change when we fix the location but look at measurements taken at different points in time. For example, consider a temperature reading of 10° in the U.K. If this measurement was taken in winter, we certainly consider this as warm. If it was taken in summer though, it would be considered as cold temperature. This phenomenon where the same input data leads to a different outcome in the output is known as concept drift [12], [13].

There are several existing methods and solutions focusing on concept drift; some of the recent works in this domain are reviewed in [14]. Over the last decade, a lot of research was dedicated to handling concept drift in supervised learning scenarios mainly utilizing decision trees [15] or ensemble classifiers [16]; however, adaptation mechanisms in unsupervised methods have only recently started to be investigated [14].

There are different types of concept drift. If only the data distribution changes without any effect on the output, it is called virtual drift. Real concept drift denotes cases where the output for the same input changes. This usually has one of the following reasons. Either the perception of the categories or
objectives has changed, or changes in the outcome are triggered by changes in the data distribution. We argue that in the IoT domain and especially in smart city applications, the latter type of concept drift is more important. In order to avoid confusion between the different types of concept drift, we introduce the term "data drift" to describe real concept drift that is caused by changes in the data stream.

Smith et al. [17] have developed a tool for creating data streams with data drifts through human interactions. In their experiments, they have found that current unsupervised adaptation techniques such as the near centroid classifier (NCC) [18] can fall victim to cyclic mislabeling, rendering the clustering results useless. While Smith et al. [17] found that semisupervised (semisupervised NCC) and hybrid adaptations of the technique (semi- and nearest-centroid classifier) lead to more robust results, adaptive methods are also needed in scenarios for which labels are not available and therefore only unsupervised learning can be applied.

Cabanes et al. [19] introduced a method which constructs a synthetic representation of the data stream from which the data distribution can be estimated. Using a dissimilarity measure for comparing the data distributions, they are able to identify data drifts in the input streams. Their work is limited by the fact that they only present preliminary results and are still working on an adaptive version of their approach.

Estimating the data distribution is an essential step for identifying and handling data drifts. The data distribution can be calculated using kernel density estimation (KDE) [20], [21]. The most important parameter for KDE is the bandwidth selection. There are different methods to choose this parameter automatically from the provided data. They include computationally light rules of thumb such as Scott's rule [22] and Silverman's rule [23] and computationally heavy methods such as cross-validation [24]. A detailed survey on bandwidth selection for KDE is provided in [25]. However, the easily computed rules are sufficient for most practical purposes.
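As a brief illustration of this step, the sketch below uses SciPy's gaussian_kde, which applies Scott's rule by default and also implements Silverman's rule; the synthetic sample and the grid resolution are placeholders of ours, not values used in the paper:

import numpy as np
from scipy.stats import gaussian_kde

values = np.random.normal(loc=50.0, scale=10.0, size=1000)   # stand-in for one stream feature

pdf_scott = gaussian_kde(values)                              # bandwidth via Scott's rule (default)
pdf_silverman = gaussian_kde(values, bw_method="silverman")   # Silverman's rule instead

grid = np.linspace(values.min(), values.max(), 200)           # discrete representation of the PDF
density = pdf_scott(grid)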
Bifet et al. [26] introduced massive online analysis (MOA), a framework for analyzing evolving data streams with a broad range of techniques implemented for stream learning. Initially MOA only supported methods for stream classification; extensions of the framework have added additional functionalities for data stream analytics. Particularly interesting for this paper is an extension of the framework which provides an easy way to compare different stream clustering algorithms. In addition to providing implementations of state-of-the-art stream clustering algorithms and evaluation measures, Kranen et al. [27] introduced new data generators for evolving streams based on randomized radial basis functions (randomRBFGenerator). We compare our method against their implementations of CluStream [6] and DenStream [10], which are both designed to handle evolving data streams.

Stream cluster algorithms such as CluStream [6] and DenStream [10] stay adaptive to evolving data streams by splitting the clustering into offline and online parts. The online part continuously retrieves a representation of the data stream. This is done through the computation of microclusters. The microclusters allow for efficient and accurate computations of clusters by applying common clustering methods such as k-means++, DBSCAN [11], or similar methods as a macro cluster whenever a cluster request is issued by the end-user or an application which uses the stream clustering mechanism. This means that the actual clusters and the labels for the data items are only computed when a clustering request on the data stream is made.

This approach works in scenarios where the clustering result is not needed continuously. However, if the clustering result is needed on a continuous basis and the offline calculation of the data stream representation has to be issued in high frequency, the efficiency gain of the methods is lost and the response time in online applications with large volumes of dynamic data is limited by applying the macro clusters. Therefore, a new method with low computational complexity that can produce cluster results directly during processing the stream is required.

Our proposed solution to this problem is to create a clustering mechanism in which the centroids change and adapt to data drifts. We propose an adaptive method to recalibrate and adjust the centroids.

A. Silhouette Coefficient

The common metrics to evaluate the performance of clustering such as homogeneity, completeness, and v-measure are mainly suitable for offline and static clustering methods where a ground truth in the form of class labels is available. However, in our method, as the centroids are adapted with the data drifts, these metrics will not provide an accurate view of the performance. In order to measure the effectiveness of our method we use the silhouette metric. The use of the silhouette coefficient as a criterion to choose the right value for the number of clusters has been proposed by Pravilovic et al. [28]. This metric is used in various works to measure the performance of clustering methods, including the MOA framework [26], [27] that is used in this paper for the evaluation and comparisons.

The silhouette metric as a quality measure for clustering algorithms was initially proposed by Rousseeuw [29]. Intuitively, it computes how well each data point fits into its assigned cluster compared to how well it would fit into the next best cluster (i.e., the cluster with the second smallest distance)

s(i) = (b(i) - a(i)) / max(a(i), b(i)).    (1)

The silhouette for one data point is defined in (1), whereby i represents the data point, b(i) is the average distance to each of the points in the next best cluster, and a(i) is the average distance to each of the points of the assigned cluster. The total silhouette score is obtained by taking the average of all s(i). From this definition, we can see that the silhouette width s(i) is always between -1 and 1. The interpretation of the values is as follows: values closer to 1 represent better categorization of the data point to the assigned cluster, while a value close to -1 denotes less efficiency in the categorization, i.e., the data point would have better fit into the next-nearest cluster. Following that, a silhouette width of 0 is neutral, that is to say a data point with this value would fit equally well in both clusters. We average over all silhouette values s(i) to obtain the score of the overall clustering. This average value is the silhouette coefficient of the final clustering.
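For concreteness, a direct (unoptimized) implementation of (1) could look as follows. This is our own sketch; a library routine such as scikit-learn's silhouette_score computes the same quantity and would normally be preferred:

import numpy as np

def mean_silhouette(X, labels):
    # a(i): mean distance to the other members of i's own cluster.
    # b(i): mean distance to the members of the closest other cluster.
    # s(i) = (b(i) - a(i)) / max(a(i), b(i)), averaged over all points.
    X, labels = np.asarray(X), np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        other = [dists[i, labels == c].mean() for c in np.unique(labels) if c != labels[i]]
        if own.sum() <= 1 or not other:
            scores.append(0.0)                       # convention for singleton clusters
            continue
        a = dists[i, own].sum() / (own.sum() - 1)    # exclude the zero distance to i itself
        b = min(other)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))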
Metrics such as homogeneity and completeness all evaluate the quality of the clusters by comparing the assigned labels of the data to the labels of the ground truth in one way or another. The silhouette metric described in Section II-A satisfies both properties. In order to estimate the quality of the clusters, they are examined by computing how well the data points fit into their respective cluster in comparison to the next-nearest neighbor cluster.

B. Dealing With the Data Drift

In scenarios where the input data is fixed, once the k-means algorithm with random restarts converges, the clusters become fixed. The resulting clusters can then be reused to categorize similar data sets.

However, in the case of streaming data, two observations which are taken at two different (and widespread) time points do not necessarily have the same meaning and consequently will belong to different clusters. This in turn leads to a different clustering of the two observations. Identical data can have different meaning when produced in a different situation. For example, imagine observing 50 people at 3 P.M. during a weekday walking over a university campus. This would not be considered as busy given the usual number of students in the area. Observing the same number of people at 3 A.M. would however be considered as exceptional, giving the indication that a special event is happening.

We incorporate this into our clustering mechanism by adapting the centroids of the clusters based on the current distribution in the data stream. The data drift detection is triggered by changes in the statistical properties of the PDF. The justification for our method is based on properties of stochastic convergence. Convergence in mean square [see (3)] implies convergence in probability, which in turn implies convergence in distribution [31]. The formula for convergence in the mean square is given by

lim_{n→∞} E|X_n - X|^2 = 0.    (3)

During training, we store the standard deviation and expected value of the data with the current distribution. When processing new incoming data, we track how the expected value and standard deviation change given the new values. Equation (3) states that as a sequence approaches infinite length, its mean squared deviation from the random variable X (in our case defined by the previously computed PDF) approaches zero. However, as we get more and more values, if this expectation instead converges to some ε such that E|X_n - X|^2 = ε for n ≫ 0 and ε > 0, the current time series data is no longer converging to the distribution that we predicted with the PDF estimation. Therefore, we have a change in the underlying distribution of the data stream and trigger a recalibration of our method. If we can observe a higher quality in the new clusters, the old centroids will be adjusted.
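A minimal sketch of this trigger is given below, assuming a fixed evaluation window and a hand-picked threshold epsilon; both are assumptions of ours, as the paper does not prescribe concrete values here:

import numpy as np

class DriftTrigger:
    # Stores the expected value and standard deviation observed during training
    # and flags a recalibration once a window of new values stops converging to them.
    def __init__(self, training_values, epsilon=0.5, window=500):
        self.mu = float(np.mean(training_values))
        self.sigma = float(np.std(training_values))
        self.epsilon = epsilon            # assumed threshold for the deviation in (3)
        self.window = window              # assumed evaluation window length
        self.recent = []

    def update(self, x):
        self.recent.append(x)
        if len(self.recent) < self.window:
            return False
        self.recent = self.recent[-self.window:]
        w = np.asarray(self.recent)
        # squared deviation of the window's statistics from the stored ones
        deviation = (w.mean() - self.mu) ** 2 + (w.std() - self.sigma) ** 2
        return bool(deviation > self.epsilon)    # True -> trigger centroid recalibration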
A detailed description of the algorithm follows in the next section.

IV. ADAPTIVE STREAMING k-MEANS: A CLOSER LOOK

Algorithm 1 DetermineCentroids(A, k, n)
Require: Data matrix A = {a0, a1, ..., an} with each ai being an array of length m containing all values of feature i
 1: % Therefore sample j is the data point: [a0[j], ..., an[j]]
Ensure: List C = {c0, c1, ..., c_max(tps)} of clusterings, with each ci being a list of centroids with length k = tp_min + i
 2: pdf[] = ∅
 3: tps[] = ∅
 4: for i ← 1 to n do
 5:   % Array containing the PDFs of each feature
 6:   pdf[i] = gaussianKDE(a[i])
 7:   % Array containing the number of turning points of the PDF
 8:   tps[i] = countTurningPoints(pdf[i])
 9: end for
10: C[] = ∅
11: for i in range(min(tps), min(tps) + max(tps)) do
12:   betas[] = ∅
13:   % Each f represents the PDF of a feature
14:   for f in pdf do
15:     betas[] = findBetas(f, tps[i])
16:     C[i] = list of means between two adjacent betas
17:   end for
18: end for
19: return C

Algorithm 1 shows how the centroids in our adaptive method are computed. This takes place after a configurable initialization period and is repeated at the beginning of each adjustment step. More information about the data collection and the adjustment can be found in Section V-B. Initially, the PDFs of each of the features of the data are computed using KDE [20], [21]. The continuous PDFs are represented by discrete arrays.

These PDF representations are then fed into Algorithm 2. Turning points can be determined by analyzing the first derivative. They have the property that dy/dx = 0, where dx is the difference between two infinitely close x values of the PDF, dy is the difference between two infinitely close y values of the PDF, and dy/dx is the slope of the PDF. This is a necessary but not sufficient criterion for having a turning point. Only if the sign of dy/dx changes from negative to positive or vice versa do we actually have a turning point in our function. These are just the definitions of local minimum and maximum points, respectively. By finding these points, we obtain the turning points of a feature PDF, and this number can be used to determine the right number of clusters.
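As an illustration of this step, the sketch below discretizes a feature's KDE, counts the sign changes of the first derivative to locate the turning points, and derives candidate centroids as the means between adjacent turning-point positions; treating those positions as the "betas" of Algorithm 1 is our reading of findBetas, and the grid resolution is an arbitrary choice:

import numpy as np
from scipy.stats import gaussian_kde

def turning_points(feature_values, grid_size=200):
    # Discretize the feature's PDF and locate the points where the sign of the
    # first derivative flips, i.e., the local maxima and minima.
    kde = gaussian_kde(feature_values)
    xs = np.linspace(np.min(feature_values), np.max(feature_values), grid_size)
    ys = kde(xs)
    slope_sign = np.sign(np.diff(ys))                  # sign of dy/dx on the grid
    flips = np.where(np.diff(slope_sign) != 0)[0] + 1  # indices where the sign changes
    return xs[flips]

def candidate_centroids(feature_values):
    # Take the turning-point positions as boundaries ("betas") and use the mean
    # of each adjacent pair as a candidate centroid for this feature.
    betas = turning_points(feature_values)
    return [(betas[j] + betas[j + 1]) / 2.0 for j in range(len(betas) - 1)]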
We use the heuristic that the right amount of clusters lies between the smallest number of turning points of a feature PDF and this number added to the maximum number of turning points found in any of the PDFs. Once the number of turning points (and therefore the possible values for the number of clusters) for each feature are
A. Synthesized Data

To evaluate our method we test it against two established stream cluster algorithms: 1) CluStream [6] (horizon = 1000, maxNumKernels = 100, kernelRadiFactor = 2) and 2) DenStream [10] (horizon = 1000, epsilon = 0.02, beta = 0.2, mu = 1, initPoints = 1000, offline = 2, lambda = 0.25). For this we use two different ways of generating data streams with data drift. The first one, randomRBFGenerator, was introduced by Kranen et al. [27]. Given an initial fixed number of centroids and dimensions, the centroids are randomly generated and assigned a standard deviation and a weight. New samples are generated as follows. Using the weight of the centroids as a selection criterion, one of the centroids is picked. By choosing a random direction, the new sample is offset by a vector of length drawn from a Gaussian distribution with the standard deviation of the centroid. This creates clusters of varying densities. Each time a sample is drawn, the centroids are moved with a constant speed, initialized by an additional parameter, creating the data drift.

This, however, has the drawback that the data drift is not natural, as the centroids are constantly shifting. We argue that during a short time frame, the data stream roughly follows a certain distribution. The underlying distribution can then change between time frames, triggered by situational changes. These changes can be recurring in time (for example in the case of traffic during rush hours and off-peak times) or more sudden changes (for example traffic congestions caused by an accident).

For that reason, we introduce a novel way of generating data with data drift. The centroids are selected through Latin hypercube sampling [32]. The number of clusters and dimensions are fixed beforehand. Similar to the method before, each centroid is assigned a standard deviation and weight. Furthermore, each dimension is given a distribution function, which later is used to generate the data samples. Considering that each dimension represents a feature of a data stream, this models the fact that in IoT applications we are dealing with largely heterogeneous data streams in which the features do not follow the same data distribution. Our current implementation supports triangular, Gaussian, exponential, and Cauchy distributions. The implementation is easily expandable and can support other common or custom distributions. The data generation code is available via our website at: http://iot.ee.surrey.ac.uk/
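To give a sense of how such a generator could be assembled, the sketch below samples centroids with a Latin hypercube, attaches a standard deviation and weight to each, gives every dimension its own noise distribution, and occasionally injects a drift event; all concrete values (probabilities, ranges, seeds) are illustrative assumptions of ours, and the authors' actual generator is the one published at the URL above:

import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(7)
n_clusters, n_dims = 5, 3

# centroids via Latin hypercube sampling, each with a standard deviation and a weight
centroids = qmc.LatinHypercube(d=n_dims, seed=7).random(n_clusters)
stds = rng.uniform(0.02, 0.08, size=n_clusters)
weights = rng.dirichlet(np.ones(n_clusters))

# one sampling distribution per dimension (triangular, Gaussian, exponential, ...)
noise = [lambda s: rng.triangular(-s, 0.0, s),
         lambda s: rng.normal(0.0, s),
         lambda s: rng.exponential(s)]

def sample():
    c = rng.choice(n_clusters, p=weights)
    return centroids[c] + np.array([noise[d](stds[c]) for d in range(n_dims)])

stream = []
for t in range(10000):
    if rng.random() < 0.0005:                     # sporadic drift event
        if rng.random() < 0.5:                    # directional change of random length
            centroids += rng.normal(0.0, 0.1, size=centroids.shape)
        else:                                     # switch the distribution of one dimension
            d = int(rng.integers(n_dims))
            noise[d] = (lambda s: s * rng.standard_cauchy())
    stream.append(sample())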
Data drift is added sporadically and is independent for each dimension through two different ways. The first is a directional change of random length. The second is that over the course of generating the data, the data distribution used for the dimension is changed for one or more of the dimensions. Both changes appear in random intervals.

We compare our method against CluStream and DenStream. Fig. 2(a) shows the performance on data generated by the randomRBFGenerator with data drift. The results for the data generated by the introduced novel way with different numbers of features are shown in Fig. 2(b)–(d). One hundred centroids have been used for the data generation. For the visualization, the silhouette score has been normalized to a range between 0 and 1 as done within the MOA framework.1

Fig. 2. Silhouette coefficient comparison on synthetic data sets.

On the data produced by the randomRBFGenerator, our novel method consistently outperforms CluStream by around 13%. DenStream performs better at times; however, the silhouette score of DenStream drops below the levels of CluStream at times, suggesting that the method does not adapt consistently to the drift within the data. As seen in Figs. 4–7, for the synthesized data with different numbers of features, our novel method consistently performs around 40% better than CluStream and more than 280% better than DenStream.

1 [Online]. Available: http://www.cs.waikato.ac.nz/~abifet/MOA/API/_silhouette_coefficient_8java_source.html

B. Case Study: Real-Time Traffic Data

To showcase how our approach can be applied to real-world scenarios, we use (near-)real-time traffic data from the city of Aarhus.2 449 traffic sensors are deployed in the city, which produce new values every 5 min. The data is pulled and fed into an implementation of our clustering algorithm that is described in Section IV. Before the value of k is computed and the initial centroids are determined, the data is collected for one hour, which equates to 5035 data points. The main data is collected for a period of 5 months. Over the course of one day, 122 787 samples are collected. For the clustering we use the number of cars and average speed measured by the sensors. For the purpose of evaluation and visualization, a timestamp and location information are also added to each data point.

2 [Online]. Available: http://www.odaa.dk/dataset/realtids-trafikdata

The intuition is that the clustering of traffic data is dependent on the time of day and the location. This perspective allows us to cluster incoming data in a way that better reflects the current situation. Fig. 3(a)–(c) visualizes the traffic density as a heat map in the morning, at noon and in the evening in central Aarhus, respectively. Light green areas in the map show a low traffic density, while red areas indicate a high density. No coloring means that there are no sensors nearby. In the morning [Fig. 3(a)], there is only moderate traffic density spread around the city. At noon [Fig. 3(b)], we can see that two big centers of very high density (strong red hue) have emerged. Fig. 3(c) shows that in the evening there is now only very low to moderate traffic density in the city.

Fig. 3. Traffic densities at different times in Aarhus.

Several reasons for data shift on varying degrees of temporal granularity are conceivable. During the day the density of traffic changes according to the time. In rush hours, when people are simultaneously trying to get to or back from work, the data distribution of traffic data differs greatly from less busy times during working hours or at night. For the same reasons, the data distribution is also quite different on weekends than on weekdays. During holidays, e.g., around Christmas or Easter, the dynamics of traffic can change a great deal. All these changes in the traffic data distribution lead to a need to reconsider what can be categorized as a "busy" or "quiet" street; in other words, we are dealing with data drift in these cases, as the same input leads to different output at different times.

Our approach deals with the data drift by recalculating the centroids based on the current distribution. This means that defining a street as busy or quiet is relative and depends on the time of the day and the location. For example, 15 cars at a given time in one street could mean "busy" while in another situation it could mean "very quiet." Similarly, 15 cars in the city center can have a different meaning during the day than at night.

We use a ring buffer as a data cache that captures the data produced in the last hour. Whenever a recalculation process is triggered based on a detected data drift, because the data no longer converges in the mean square (see Section III-B), we use the silhouette coefficient score to check if the new centroids lead to a better cluster quality. If that is the case, the current centroids are then adjusted. The definition of the silhouette coefficient and its computation can be found in Section II-A.
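A compact sketch of this recalibration loop is shown below. The buffer size of 5035 mirrors the one-hour cache mentioned above; determine_centroids and silhouette are placeholders standing in for the procedure of Algorithm 1 and the metric of Section II-A (the mean_silhouette sketch given earlier would fit), and the rest of the structure is an assumption of ours rather than the authors' implementation:

from collections import deque
import numpy as np

class AdaptiveClusterer:
    def __init__(self, centroids, buffer_size=5035):         # roughly one hour of Aarhus data
        self.centroids = np.asarray(centroids, dtype=float)
        self.cache = deque(maxlen=buffer_size)                # ring buffer: oldest samples drop out

    def assign(self, x, centroids=None):
        c = self.centroids if centroids is None else centroids
        return int(np.argmin(np.linalg.norm(c - np.asarray(x, dtype=float), axis=1)))

    def process(self, x, drift_detected, determine_centroids, silhouette):
        self.cache.append(np.asarray(x, dtype=float))
        if drift_detected and len(self.cache) == self.cache.maxlen:
            data = np.array(self.cache)
            candidate = np.asarray(determine_centroids(data, k=len(self.centroids)))
            old_labels = np.array([self.assign(p) for p in data])
            new_labels = np.array([self.assign(p, candidate) for p in data])
            # only adopt the recalculated centroids if they improve the cluster quality
            if silhouette(data, new_labels) > silhouette(data, old_labels):
                self.centroids = candidate
        return self.assign(x)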
Fig. 4 shows the computed silhouette coefficients for clustering the initial data sequence with different numbers of clusters. The number of clusters is chosen according to the highest value of the coefficient and is emphasized in bold. For performance reasons, a sampling size of 1000 has been chosen to compute these values. It can be decided based on the data input and the task at hand whether the number of clusters should stay constant through the remainder of clustering the data. In our use case scenario, we do not change the number of clusters.

Fig. 5 shows how the centroids of the clusters change at different times on a working day and on a Saturday for the different recalculation times. The way the data is clustered differs significantly between the two days. While the average speed remains roughly the same, the number of vehicles varies a lot. Most prominently, this difference can be seen in the centroids of the cluster representing a high number of cars. For example, in Fig. 5(a) the number of cars is considerably higher at noon than in the evening. Resulting from the changed centroids, data points may be assigned to different clusters depending on the time compared to using nonadaptive clustering. For example, using the centroids of a working day at noon on the data published during the same time on Saturday, 180 out of 3142 values would be assigned to a different cluster.

In order to interpret how busy an area is, it is necessary to also take into consideration all adjacent time points in that area. Therefore, the output of our approach can be used to further abstract from the data by matching patterns within a time frame in an area. The results can be fed into an event detection and interpretation method. To clarify the reasoning behind this, two examples are given. Let us consider a measurement of a fast moving car. Only if the measurements in a close time range are similar can we interpret this traffic as a good traffic flow. If this is not the case, a different event
has taken place, e.g., maybe an ambulance rushed through the street.

Another example would be the measurement of slow moving cars. If this is a singular measurement, it could mean for example that somebody is driving slowly because she/he is searching for a parking space. However, if the same measurement accumulates, it could mean that a traffic jam is taking place.

In order to ensure that the adaptive method leads to meaningful results, we have conducted another experiment. We compare our adaptive stream clustering method with a nonadaptive streaming version of the k-means algorithm, i.e., the centroids are never recalculated. The silhouette coefficients of the clusters in both settings are computed in equal time intervals. Fig. 6(a) shows how the silhouette coefficients compare over the course of one day. For example, at February 22, 2014, 05:30:00 the nonadaptive approach scores a silhouette coefficient of 0.410 while the adaptive approach scores 0.538, an improvement of 31.2%. This means items clustered by the adaptive method have a better convergence considering the distribution of the data at that time.

Fig. 6. Silhouette coefficient on traffic data set.

Fig. 6(b) shows how the silhouette coefficients compare over the course of one week. The adaptive clustering performs mainly better than the nonadaptive one. The cluster quality of the

VI. CONCLUSION

In this paper, we have introduced an adaptive clustering method that is designed for dynamic IoT data streams. The method adapts to data drifts of the underlying data streams. The proposed method is also able to determine the number of categories found inherently in the data stream based on the data distribution and a cluster quality measure. The adaptive method works without prior knowledge and is able to discover inherent categories from the data streams.

We have conducted a set of experiments using synthesized data and data taken from a traffic use-case scenario where we analyze traffic measurements from the city of Aarhus. We run our adaptive stream clustering method and compare it against a nonadaptive stream cluster algorithm. Overall, the clusters produced using an adaptive setting have an average improvement of 12.2% in the cluster quality metric (i.e., silhouette coefficient) over the clusters produced using a nonadaptive setting.

Compared to state-of-the-art stream cluster methods, our novel approach shows significant improvements on synthesized data sets. Against CluStream, there are performance improvements between 13% and 40%. On data generated by the randomRBFGenerator, DenStream has better cluster quality at a few points of the experiment, but is generally outperformed by our method. On the other synthesized data streams, our novel approach shows an improvement of more than 280% compared to DenStream.

The results of our clustering method can be used as an input for pattern and event recognition methods and for analyzing real-world streaming data. To clarify our approach we have used k-means as the underlying clustering mechanism; however, the concepts of our approach can also be applied to other clustering methods. For the latter, the distribution analysis and cluster update mechanisms can be directly adapted from this paper, and only the cluster and centroid adaptation mechanisms need to be implemented for other clustering solutions.