
Received September 10, 2016, accepted October 17, 2016, date of publication November 22, 2016, date of current version December 8, 2016.

Digital Object Identifier 10.1109/ACCESS.2016.2619898

Segmentation of Factories on Electricity Consumption Behaviors Using Load Profile Data

IMRAN KHAN (1), JOSHUA ZHEXUE HUANG (2), MD ABDUL MASUD (2), AND QINGSHAN JIANG (1)

(1) Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
(2) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China

Corresponding authors: I. Khan ([email protected]) and Q. Jiang ([email protected])

This research work was supported by the National Natural Science Foundation of China under grant No. 61473194, the Science and
Technology Planning Project of Guangdong Province of China under grant No. 2013B091300019, and Shenzhen Fundamental Research
Foundation under Grant No. CXZZ 20150813155917544.

ABSTRACT In recent years, advances in technology and data science have made it possible to gather detailed and well-structured information about the electricity consumption behaviors of industrial enterprises. Such information has numerous applications in the power distribution industry. Utilities often use contract data to assign each industrial customer a class label according to its type in a predetermined industry segmentation. Such fixed-chart segmentation cannot satisfy the needs of modern enterprises for flexible and dynamic determination of production modes. In this paper, we address this problem by proposing a new method for segmenting various types of factories based on their electricity consumption patterns represented in load profile data. It exploits the evolution-based characteristics of smart meter data from multiple types of factories to remove irrelevant features. We use data visualization to estimate the number of clusters and apply the well-known k-means algorithm to the filtered data to generate the segmentation. Experimental results on real load profile data collected with smart meters from manufacturing industries in Guangdong province of China show that the new clustering approach produces a meaningful segmentation of factories that reflects production operations.

INDEX TERMS Segmentation, power consumption, smart grid, load profiles, feature selection.
2169-3536 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

VOLUME 4, 2016

I. Khan et al.: Segmentation of Factories on Electricity Consumption Behaviors Using Load Profile Data

I. INTRODUCTION

The extensive application of smart meters as a part of smart grids provides enormous opportunities, but it also creates challenges for power distribution operators. Significant investments in the Advanced Metering Infrastructure (AMI) enable smart grids to be well monitored, controlled, managed and optimized, and customers to be well serviced. On the other hand, power providers face more challenges in handling big data because they must satisfy a list of business imperatives. The list includes reliability and efficiency, safety and security, profitability, and implementation of an evolving intelligent grid that can serve a heterogeneous customer base. These imperatives can appear overwhelming, particularly in the context of efficient integration of big data content and solutions.

Smart metering data can often show substantial changes in trends over time. Therefore, it is useful to understand, visualize and diagnose the evolution of these patterns. Such data often poses challenges such as huge size, irrelevant dimensionality, skewed distribution, sparsity and seasonal variations. The presence of irrelevant dimensions can degrade the performance and increase the computation time of most learning algorithms. Irrelevant dimensions of AMI data are a source of inconsistency and inefficiency that makes it difficult to discover the production modes of an industry sector on the basis of power consumption behavior. Consequently, such dimensions may lead to poor decisions with an adverse impact on reliable and economic grid operation and planning. To the best of our knowledge, no research or industrial community has considered the evolution-based characteristics of smart grid data to obtain a strongly correlated data subset for defining business process operations.

In this paper, we suggest a solution to this problem by presenting a new method for segmentation of different types of factories based on their electricity consumption patterns represented in load profile data. It uses density estimation to discover the irrelevant dimensions of AMI data efficiently. Our method detects the local densities in different spatial regions (individual dimensions) of the data. When computing the local densities, we also include those of temporal regions, which are combinations of subsequent dimensions. We then classify these local densities.

Each local density is assigned to one of two classes: the high-density class, represented by 1, and the low-density class, represented by 0. We use a binary matrix to represent the density classes of factories at different time slots. From the binary matrix, we compute the similarity of density vectors between every two subsequent time slots and identify the irrelevant dimensions of density vectors to be deleted from the time series data. Finally, we use a visualization approach to determine the number of clusters and use the k-means algorithm to cluster the filtered data to generate the factory segmentation results.

Experimental results were obtained using smart meter data sampled at 15-minute intervals, collected from manufacturing industries in Guangdong province of China. According to the results, the new feature selection algorithm outperformed well-known state-of-the-art algorithms. The new clustering approach produced a meaningful segmentation of factories that reflects production operations. Such segmentation can be used in utility applications such as the design of variable rates.

The rest of this paper is organized as follows. Section II reviews related work. Section III presents smart grid data. Section IV gives a detailed description of the proposed method. Section V evaluates the clustering results on a real-world dataset. Conclusions are given in the final section.

II. RELATED WORK

The increasing availability of energy consumption data offers unique opportunities to design segmentation strategies for industrial energy use that support smart grid data applications. The introduction of smart meters has driven studies on high-resolution time series modeling and customer clustering.

The large size of smart meter data suggests that new approaches are needed to maintain demand response, design programs for improving energy efficiency and ensure efficient customer targeting [1], [2]. A large number of clustering algorithms are available in the literature, e.g., automated variable weighting in k-means (W-k-means) [3], clustering with fastmap projection [4], swarm intelligence based clustering [2], and incremental density-based and rule-based algorithms [5], [6]. Self-organizing maps [7], k-means [8], and hierarchical clustering [9] are often applied for load pattern mining. However, existing algorithms do not focus on identifying the characteristics of customer clusters. They extract load profiles from electricity data by considering the global properties of power consumption patterns rather than the local ones. Moreover, they always operate over the full feature space of an input dataset to learn as much as possible, which degrades performance because hidden patterns are hard to discover among noisy and irrelevant dimensions. Scalability is another significant issue of existing algorithms for load profiling.

Feature selection plays an important role in improving the quality of clustering in machine learning and data mining. Feature selection approaches can be classified into wrapper and filter techniques. Wrapper techniques [10], [11] wrap feature selection around the learning process and search for features that improve the performance of the learning task. Filter methods [12]-[14], on the other hand, investigate the intrinsic characteristics of the data and select highly ranked features according to some criterion before starting the learning task. Wrapper methods are computationally more expensive than filter methods because they depend on running the learning models several times until a subset of relevant features is found.

Only a few of the current filter methods are unsupervised. The Laplacian score [13] of a feature is measured to reflect its locality preserving power. This approach is based on the observation that two data points are probably related to the same subject if they are close to each other. In fact, in various learning problems such as classification, the local structure of the data space is more important than the global structure. The Sparse K-Means score [14] uses a lasso-type penalty to select the features, providing a framework for simple sparse K-means methods for feature selection. Data variance [12] might be the simplest unsupervised evaluation of the features. The variance along a dimension reflects its representative power. Although the data variance criterion finds features that are helpful for describing data, there is no reason to expect that these features must be helpful for discriminating between data in different classes.

We address these problems by proposing a new feature selection technique for smart meter data to enhance the performance of clustering algorithms. The obtained results help utilities adopt strategies that increase business gain.

III. SMART GRID DATA

AMI deployment is a significant trend in the electricity distribution industry. The enormous volume of generated AMI data is associated with two fundamental challenges: retaining the data and extracting business value from it. Such challenges make AMI a prime candidate for the application of big data processing and analytics. Fig. 1 shows the typical AMI architecture with multiple smart meters.

Electricity meters have been equipped with microprocessors and storage units that allow for intelligent functions and turn them into smart meters. They also ensure bi-directional communication and remote operating capabilities. A large number of smart meters have been deployed in residential and commercial buildings. In the industry sector, they are usually installed at factory sites to record data about the power consumption of ongoing production activities.

A. DATA COLLECTION

Typically, smart meters generate readings at small intervals of 15, 30, or 60 minutes. Smart meter data is collected and forwarded via a local area network (LAN) to the data collection center. In terms of data processing, some tasks could be carried out at the regional collection centers. Often, the data are transferred to central collection centers via a wide area network (WAN). Such an infrastructure involves a substantial number of smart meters.


Deploying the smart meters and connecting them to collection centers is an expensive and time-consuming process that often takes many years.

FIGURE 1. Typical smart metering architecture [15].

For the present research, we obtained the electricity consumption data of a manufacturing center located in the Pearl River Delta (PRD) Region, Guangdong Province of China. This province is an important industrial center, where the volume of smart meter data collected in one month from the factories of one city amounts to approximately 80 GB. There are different types of factories in the PRD region, and each one has many installed smart meters. Each smart meter records power consumption at 15-minute intervals and sends the measured information back to the collection center. The collection center maintains a text file for each smart meter that contains the following attributes: date, timestamp, a unique identifier for the meter that produced the reading, and consumption value (kW).

The data collection task usually involves a costly and time-consuming process. We obtained data from 21330 smart meters sampled at 15-minute intervals during the year 2012 in the form of text files. We imported each file into a raw dataset with n rows and d dimensions. Each dimension of the raw dataset denotes a time slot, and each row represents a particular factory with its power consumption at multiple sequential time slots.

B. DATA EXPLORATION

A load profile provides information about electricity consumption for a given factory over a given period, e.g., a day or a month, at a particular frequency, typically every 15 minutes. Our target was to extract the production modes of multiple types of factories based on their daily power consumption behavior. Therefore, we need to analyze one-day data. However, visual analysis of every individual load profile is a difficult and time-consuming process. Thus, we randomly chose and analyzed a few of them to discover the generic types of load profiles that show abnormal behavior.

Data transmission errors can affect data streams, leading to evaluation and simulation problems. The connection between smart meters and data collection centers can be either wired or wireless. Due to the nature of wireless transmission, signal attenuation that affects data transmission can occur. Wired channels, on the other hand, are susceptible to equipment and power supply failures, sudden interruption of lines, etc. Thus, missing values can occur in the data streams. Fig. 2 illustrates some data streams with missing values. For example, streams 1 and 8 in Fig. 2(a) and 2(b), respectively, show periodicity with power consumption suddenly falling to zero because of missing values. One significant indication of this issue is stated in [16]. As a rule of thumb, a typical, well-run, large-scale smart meter system misses up to 4% of the interval usage data that it is supposed to record and retrieve each month. For a million-meter system, this amounts to over 28 million missing data intervals per month. The smart grid requires a high level of confidence in the data for its applications.

A recent study by an independent testing group found that 99.91 percent of smart meters are accurate within 0.5 percent [17]. Besides, smart meters are continuously checked by the responsible authorities to ensure that they are working correctly. The industrial utilities, on their side, continuously monitor the data transmitted from smart meters to verify that power usage is within the expected limits. If readings show a big deviation from the normal levels, specialists examine the meter. For example, in Fig. 2(a) and 2(b), load profiles 4 and 7, counted from the upper left corner to the lower right corner, show a significant difference from the other data streams. Moreover, load profiles 7 and 6 exhibit power consumption below zero. Such load profiles may indicate that the corresponding smart meters have a technical fault.

C. SEGMENTATION OF LOAD PROFILES

The electricity demand of customers varies daily and seasonally. A production plant assembly line begins and ends operation during the whole day and week. During peak times, a tremendous amount of electricity is required (the so-called peak load), but a base load is needed year-round. Since electricity for industrial consumers cannot be stored, the electricity distribution network operator must predict electrical power demands for even the most extreme conditions (such as high ambient temperature due to hot weather). Consumption depends predominantly on the time of day and the season. Well-defined production modes (such as two-shift mode, three-shift mode or one-day off) derived from load profile segments could facilitate the handling of demand and supply.

FIGURE 2. The one-day load profiles (power consumption behaviours) of some factories. (a) Active power data streams (weekend). (b) Active power data streams (workday).

Data engineering is deemed to be a fundamental problem in the development of smart grid applications. To build models of data, the success of most clustering techniques hinges on the reliable selection of a small set of highly correlated features. The presence of irrelevant, redundant, and noisy features at the stage of model development could result in poor clustering performance.

As shown in Fig. 2, the load profiles of some factories show clear daily electricity consumption patterns. These patterns reflect daily production operations of factories, and the daily patterns repeat on workdays and weekend days. The variation of electricity consumption values at multiple time scales makes smart grid data streams different from other data streams such as stock time series. Furthermore, the variation is caused by many factors, such as production orders, weather conditions, working hours, price incentives, etc. Therefore, segmenting load profiles to investigate the production modes of factories is a challenging task for clustering methods.

IV. METHODOLOGY

In this section, we present a new method for segmentation of different types of factories based on their electricity consumption patterns represented in load profile data. These electricity consumption patterns represent daily production operations of factories. The proposed method consists of two steps: feature selection and clustering. According to the characteristics of AMI streaming data, we propose a new feature selection method that utilizes the evolving characteristics of AMI streaming data.

A. NOTATIONS

The electricity consumption data of a factory $i$ is represented as a time series $X_i = x_1, x_2, \ldots, x_d, \ldots$, where each $x_j$ is a measurement of electricity consumption in a given time interval, i.e., 15 minutes. Let $X$ be a set of $N$ time series from $N$ factories. For a given time window with $d$ time slots (intervals), $X$ is an $N \times d$ matrix $\{x_{i,j}\}$, where $x_{i,j}$ is the measurement of time series $i$ at the $j$th time slot. Let $Y_j$ be a vertical vector of $N$ elements representing the measurements of the $N$ factories at the $j$th time slot. $X = \{Y_1, Y_2, \ldots, Y_d\}$ represents a sequence of $d$ vectors. Let $W$ be a time window of $d$ time slots. $X$ is a matrix representing $N$ time series $\{X_1, X_2, \ldots, X_N\}$ with $d$ dimensional attributes $\{Y_1, Y_2, Y_3, \ldots, Y_d\}$.

Based on the above notation, we have a simple data representation model as shown in Fig. 3. The left figure is a data matrix of $N$ time series in a time window of $d$ time intervals. Each column of the matrix represents the distribution of the total electricity consumption at a time interval over the $N$ factories, as shown in the middle figure. We call this distribution the spatial distribution. The electricity consumption along neighboring time intervals is called the temporal distribution, as shown in the right figure. In this work, we use one day as the time window for electricity consumption pattern analysis. The window length is 24 hours, starting and ending at midnight. There are 96 time intervals in the time window, i.e., $d = 96$. This data model and the two distribution concepts form the basis of the feature selection method.


FIGURE 3. The time series data of $N$ factories in a time window of $d$ time intervals is represented as an $N \times d$ matrix in the left figure. Each column $Y_j$ is considered as the spatial distribution of the total electricity consumption at the $j$th time interval among the $N$ factories, as shown in the middle figure. The distribution along neighboring time intervals is called the temporal distribution, as shown in the right figure.

Building on the data model and the two distribution concepts, we developed the feature selection method described below.

B. FEATURE SELECTION METHOD

1) LOCAL DENSITY ESTIMATION

At a given time slot $j$, $Y_j$ is a vector representing the electricity consumption distribution of the $N$ factories. We use the k-means clustering algorithm to cluster the $N$ factories into $\sqrt{N}$ clusters according to the $N$ measurement values of $Y_j$. We estimate the distribution density of the factories in cluster $k$ as

$$f_k(x) = \frac{1}{m} \sum_{i=1}^{m} K_h(x - x_i) \qquad (1)$$

where $k$ ($1 \le k \le \sqrt{N}$) is a cluster number, $m$ is the number of factories in cluster $k$, and $K_h(\cdot)$ is a Gaussian kernel defined as

$$K_h(x - x_i) = \frac{1}{\sqrt{2\pi}\,h}\, e^{-\frac{(x - x_i)^2}{2h^2}} \qquad (2)$$

where $h$ is a smoothing parameter.

Using (1) and the spatial distribution concept, we calculate the spatial density for each factory at each time interval $Y_j$ and produce a spatial density vector $D^{s}_j$. The density estimate is cluster-based, so it is a local spatial density.

Using the temporal distribution concept, we calculate the density change in a small time window $h_w$ that contains neighboring time slots $j$ and $j + 1$ with the spatiotemporal kernel function [18]

$$K'_{(h_s,h_w)}(Y, t) = \left(1 - \frac{t}{h_w}\right) K_{h_s}(Y) \qquad (3)$$

where $K_{h_s}(Y)$ is the Gaussian kernel in (2), $h_w$ is a temporal kernel width, $h_s$ is a spatial kernel width, and $t$ is the arrival time of the vertical vector $Y_j$ (time slot $j$), normalized as $t = j/d$ so that $0 < t \le 1$.

Using (3), we calculate the velocity density as

$$V_{(h_s,h_w)}(Y, t_j) = \frac{K'_{(h_s,h_w)}(Y_j, t_j) - K'_{(h_s,h_w)}(Y_{j+1}, t_{j+1})}{h_w} \qquad (4)$$

where $t_j$ and $t_{j+1}$ indicate the time slots of $Y_j$ and $Y_{j+1}$, respectively.

For each time slot $j$, we can use $Y_j$ and $Y_{j+1}$ to calculate its velocity density vector $D^{v}_j$ with (4).

Given the spatial density vector $D^{s}_j$ and the velocity density vector $D^{v}_j$, we calculate the spatiotemporal density vector $D^{st}_j$ as

$$D^{st}_j = \{d^{st}_j(i)\} = \{d^{s}_j(i)\, d^{v}_j(i)\} \qquad (5)$$

where $1 \le i \le N$.

The spatiotemporal density is a modification of the spatial density by the velocity density. Given a sequence of vectors $Y_1, Y_2, \ldots, Y_m$, we can calculate a sequence of spatiotemporal density vectors $D^{st}_1, D^{st}_2, \ldots, D^{st}_m$. The last spatiotemporal density vector $D^{st}_m$ is computed by using $D^{v}_{m-1}$ to modify $D^{s}_m$ because $Y_{m+1}$ is not available. The algorithm to compute the spatiotemporal density is given in Algorithm 1.

2) DENSITY THRESHOLD ESTIMATION

Let $D = \{D^{st}_1, D^{st}_2, \ldots, D^{st}_d\}$ be a sequence of $d$ spatiotemporal density vectors. Each vector contains $\sqrt{N}$ clusters. We compute the average spatiotemporal density value $d_x$ for each cluster and rank the clusters by these values. We plot the average spatiotemporal density values against the order of clusters, from the highest average spatiotemporal density value to the lowest. Fig. 4 shows examples of four time slots. We can see that the average spatiotemporal density distributions differ across time slots. Some time slots have more high-average-density clusters than others.

We rank the clusters of all time slots on average spatiotemporal density values. Fig. 5 shows the distribution of the average spatiotemporal densities on all time slots. From the aggregated distribution of densities of all clusters, we set a threshold to divide clusters into two classes, i.e., high-density clusters and low-density clusters.

To determine the threshold, we use the Minimal Description Length (MDL) principle to divide all clusters into two subsets, as in [19]. Let $A$ be the set of high-density clusters.
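The density computations in (1)-(4), which Algorithm 1 organizes per time slot, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the cluster labels are supplied by hand instead of coming from k-means, a single bandwidth `h` stands in for the spatial kernel width, the velocity density in (4) is read as a per-factory difference of temporally weighted densities, and the combination step (5) is left out. All function names and toy values are hypothetical.

```python
import math


def gaussian_kernel(u, h):
    """Gaussian kernel K_h(u) of equation (2) with smoothing parameter h."""
    return math.exp(-(u * u) / (2.0 * h * h)) / (math.sqrt(2.0 * math.pi) * h)


def local_spatial_density(values, labels, h):
    """Equation (1): density of each factory estimated only from the
    factories that share its cluster label (a cluster-local estimate)."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    densities = []
    for value, label in zip(values, labels):
        members = clusters[label]
        total = sum(gaussian_kernel(value - other, h) for other in members)
        densities.append(total / len(members))
    return densities


def temporal_weight(t, h_w):
    """Factor (1 - t / h_w) of the spatiotemporal kernel in equation (3)."""
    return 1.0 - t / h_w


def velocity_density(ds_j, ds_next, t_j, t_next, h_w):
    """Equation (4), applied per factory to the weighted densities of two
    neighboring time slots (an assumed per-factory reading of the formula)."""
    return [
        (temporal_weight(t_j, h_w) * a - temporal_weight(t_next, h_w) * b) / h_w
        for a, b in zip(ds_j, ds_next)
    ]


# Toy data: 4 factories, 2 hand-assigned clusters, two neighboring time slots.
y_j = [1.0, 1.2, 5.0, 5.1]
y_next = [1.1, 1.3, 5.2, 5.0]
labels = [0, 0, 1, 1]  # assumed labels; the text obtains them with k-means
ds_j = local_spatial_density(y_j, labels, h=0.5)
ds_next = local_spatial_density(y_next, labels, h=0.5)
dv_j = velocity_density(ds_j, ds_next, t_j=1 / 96, t_next=2 / 96, h_w=1.0)
print(ds_j, dv_j)
```

Because each factory is compared only with members of its own cluster, two factories with nearly identical readings in a small cluster get the same local density regardless of how many factories exist elsewhere, which is the sense in which the estimate is local.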

FIGURE 4. The average spatiotemporal density distributions of four time slots.

Algorithm 1 Local Density Estimation
Input: $X_{N \times d}$
Output: Spatiotemporal density vectors $D^{st}_1, D^{st}_2, \ldots, D^{st}_d$
for j := 1 to d do
    Select attributes $Y_j$ and $Y_{j+1}$ from $X_{N \times d}$;
    Apply k-means on $Y_j$ using the number of clusters $\sqrt{N}$;
    if j != d then
        Compute the cluster-wise spatial density vector $D^{s}_j$ of $Y_j$ using (1);
        Compute the velocity density vector $D^{v}_j$ for $Y_j$ from $Y_j$ and $Y_{j+1}$ using (4);
        Compute $D^{st}_j$ for $Y_j$ from $D^{s}_j$ and $D^{v}_j$ using (5);
    else
        Compute the cluster-wise spatial density vector $D^{s}_d$ of $Y_d$ using (1);
        Compute the velocity density vector $D^{v}_d$ for $Y_d$ from $Y_d$ and $Y_{d-1}$ using (4);
        Compute $D^{st}_d$ for $Y_d$ from $D^{s}_d$ and $D^{v}_d$ using (5);
Output the spatiotemporal density vectors $D^{st}_1, D^{st}_2, \ldots, D^{st}_d$;

FIGURE 5. Aggregated distribution of four time slots in Fig. 4.

Let $B$ be the set of low-density clusters, and let $l$ be the cluster in $A$ whose average density is smaller than or equal to the average density of any cluster in $A$ but greater than the average density of any cluster in $B$. Let $\mu_A$ and $\mu_B$ be the averages of the average cluster densities in $A$ and $B$, respectively. Cluster $l$ is found by minimizing the code length (CL) of the MDL principle,

$$\begin{aligned} CL_l ={}& \log(1 + \mu_A) + \sum_{c_i \in A} \log(1 + |c_i - \mu_A|) \\ &- \log(1 + \mu_B) + \sum_{c_i \in B} \log(1 + |c_i - \mu_B|) \end{aligned} \qquad (6)$$

where $c_i$ is the average density of cluster $i$ in set $A$ or $B$. The average density $c_l$ of cluster $l$ is used as the threshold to separate high-density clusters from low-density ones. We use Algorithm 2 to minimize the code length $CL$ and thereby find cluster $l$ and threshold $c_l$.

Algorithm 2 MDL-Based Threshold Selection
Input: The sorted sequence $S$ of average densities of all clusters
Output: $l_{min}$ and $c_{min}$
Initialize $CL_{min} := \infty$;
for l := 1 to $s_{total} - 1$ do
    Assign the first $l$ cluster average densities to $A$;
    Assign the next $s_{total} - l$ cluster average densities to $B$;
    $c_l$ = the average density of cluster $l$ in $A$;
    Compute $\mu_A$, the average density of clusters in $A$;
    Compute $\mu_B$, the average density of clusters in $B$;
    Calculate the code length $CL_l = \log(1 + \mu_A) + \sum_{c_i \in A} \log(1 + |c_i - \mu_A|) - \log(1 + \mu_B) + \sum_{c_i \in B} \log(1 + |c_i - \mu_B|)$;
    if ($CL_{min} > CL_l$) then
        $CL_{min} = CL_l$; $c_{min} = c_l$; $l_{min} = l$;
Output $l_{min}$ and $c_{min}$;

3) DETECTION OF IRRELEVANT FEATURES

This step takes as input the density matrix $D = (D^{st}_1, D^{st}_2, \ldots, D^{st}_d)$ and the density threshold $c_l$ found using Algorithm 2.
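Since the threshold $c_l$ comes from Algorithm 2, that selection can be sketched in Python as a scan over all split points of the sorted densities. This is a hedged illustration, not the authors' code: the operators joining the terms of the code length are read left to right from the printed listing, the split with the smallest code length is kept, and the function names and toy densities are assumptions.

```python
import math


def mean(values):
    return sum(values) / len(values)


def spread(values, center):
    """Sum of log(1 + |c - center|) over one subset of cluster densities."""
    return sum(math.log(1.0 + abs(c - center)) for c in values)


def code_length(a, b):
    """Code length of one split, with the operators taken left to right
    from the printed Algorithm 2 (this reading is an assumption)."""
    mu_a, mu_b = mean(a), mean(b)
    return (math.log(1.0 + mu_a) + spread(a, mu_a)
            - math.log(1.0 + mu_b) + spread(b, mu_b))


def mdl_threshold(sorted_densities):
    """Try every split of a descending-sorted list into A (first l values)
    and B (the rest); return (l_min, c_min) for the smallest code length,
    where l_min counts from 1 and c_min is the last density placed in A."""
    best_l, best_c, best_cl = None, None, math.inf
    for l in range(1, len(sorted_densities)):  # l = 1 .. s_total - 1
        a, b = sorted_densities[:l], sorted_densities[l:]
        cl = code_length(a, b)
        if cl < best_cl:
            best_l, best_c, best_cl = l, a[-1], cl
    return best_l, best_c


# Toy average cluster densities, sorted from highest to lowest.
densities = [0.9, 0.85, 0.8, 0.1, 0.05]
l_min, c_min = mdl_threshold(densities)
print(l_min, c_min)  # 3 0.8 for this toy input
```

For the toy input, the split falls after the third value, so 0.8 would serve as the threshold $c_l$ that separates the three dense clusters from the two sparse ones.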


From the density matrix and the threshold, we compute a binary matrix $B$ as

$$b(r_i, d_j) = \begin{cases} 1, & \text{if } d(r_i, d_j) > c_l \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

where $r_i$ and $d_j$ are the $i$th row and $j$th dimension, respectively. Matrix $B$ classifies densities into two classes, with 1 representing high density and 0 representing low density, as shown in Fig. 6.

FIGURE 6. The binary density matrix $B_{N \times d}$.

Given $B_{N \times d}$, we use the Jaccard similarity coefficient to compute the similarity between two time slots $Y_i$ and $Y_j$ as

$$JC(Y_i, Y_j) = \frac{n_{11}}{n_{01} + n_{11} + n_{10}} \qquad (8)$$

where
$n_{11}$ is the total number of elements where $Y_i$ and $Y_j$ both have a value of 1,
$n_{01}$ is the total number of elements where $Y_i$ is 0 and $Y_j$ is 1,
$n_{10}$ is the total number of elements where $Y_i$ is 1 and $Y_j$ is 0.

Using (8), we compute the similarity matrix $S_{d \times d}$ from $B_{N \times d}$. $S_{d \times d}$ has values between 0 and 1. A large value between two time slots indicates that they are highly similar. For each row of $S_{d \times d}$, we compute the average similarity value over the $d$ dimensions. If the average similarity value of the row is smaller than a given threshold, the dimension represented by the row is considered irrelevant and is deleted from the data matrix $X$. In our work, this threshold is determined by the user. The filtered matrix is aggregated into one-day data (96 dimensions) by averaging the power consumption measurements at the corresponding time slots across days.

C. CLUSTERING

With the feature selection method discussed above, we delete some $Y$ vectors from the streaming sequence $Y_1, Y_2, \ldots, Y_d$ in time window $W$. The remaining $Y$ vectors form a reduced data matrix $X_r$.

We use a visual method to determine the number of clusters in $X_r$. As stated in [20], visualizations are powerful tools to help users explore and make sense of data, intuitively revealing trends, outliers, and clusters in large and complex datasets. We use the k-means algorithm to cluster $X_r$ into a large number of clusters and visually investigate the potential clusters and outliers. Fig. 7 shows some examples of clusters produced by k-means.

From the array of clusters in the figure, we can see that the top row contains two clusters with clear patterns, i.e., clusters 2 and 4. The bottom row also contains two clusters with clear patterns, i.e., clusters 7 and 8. The patterns of the two clusters in the leftmost column are not clear. Cluster 3 in the top row and cluster 6 in the bottom row show some patterns, but these cannot be explained as work patterns. To determine the number of clusters, we do not count the clusters in the first column and only consider the remaining six clusters. Therefore, the true number of clusters is between 4 and 6 in this case. We say [4, 6] is a possible range.

To find the optimal number within the obtained range, we run the k-means algorithm multiple times on $X_r$ using k chosen randomly from the estimated range. For each clustering, we again visualize the clusters in the dimensions of two principal components to explore the highest variances of the data. The plots are also used to compare the separation and compactness of the clustering results. These two methods are collectively used to find the optimal number of clusters, i.e., the one that yields plots with compact and well-separated clusters, where each cluster shows clear electricity consumption patterns. The procedure for segmentation of factories based on their power consumption behaviours is summarized in Algorithm 3.

Algorithm 3 Segmentation of Factories Using Load Profile Data
Input: One-month data D
Output: Daily-basis segmentation of factories
Remove anomalous data records from D;
Apply the feature selection technique on D;
Aggregate the filtered data into one-day data (96 dimensions) by averaging power consumption measurements at the corresponding time slots in days;
Apply data visualization on the aggregated data to estimate the number of clusters;
Apply the k-means algorithm on the aggregated data;
Output the daily-basis segmentation of factories;

V. EXPERIMENTAL RESULTS

In this section, we present experimental results of the new method on a real-world AMI dataset. We compare the clustering performance of the new method with that of three state-of-the-art feature selection methods for time series data. The comparisons show that the new feature selection method produces better clustering results than the other three methods. We also discuss applications of factory segmentation on electricity consumption behaviors.


The applications of factory segmentation discussed here include tariff setting, demand response management and the quality of service.

FIGURE 7. Exploration of cluster patterns to determine the number of clusters.

A. DATA

The real-world dataset used in the experiments was collected from Guangdong province of China. The AMI streaming data were obtained with smart meters installed at 21330 manufacturing factories. One month of data, November 2012, was selected. Each time series contains 2880 measurements collected at 15-minute intervals. The 21330 manufacturing factories came from 33 industrial categories.

We use one day as the pattern analysis time window and divide the days in the month into workdays, from Monday to Friday, and weekend days, Saturday and Sunday, because work patterns on a workday and on a weekend day are usually different. Each day has 96 electricity consumption measurements. There were 22 workdays and eight weekend days in November 2012. We represent workday and weekend day data in two matrices. Fig. 2 plots some workday and weekend day electricity consumption time series. We can see that there are anomalous time series that need to be deleted from the dataset. We can observe two types of anomalous time series in Fig. 2: one with constant electricity consumption values, which often result from faulty readings of smart meters, and one with very low average electricity consumption, which indicates irregular production operations such as a lack of production orders.

B. THREE FEATURE SELECTION METHODS FOR COMPARISON

$f = 1, \ldots, d$, the variance score is defined as follows:

$$VS(f) = \frac{1}{n} \sum_{x=1}^{n} \left( v(x, f) - \mu_f \right)^2, \qquad \mu_f = \frac{1}{n} \sum_{x=1}^{n} v(x, f) \qquad (9)$$

The Laplacian score is based on locality preserving projection and Laplacian eigenmaps. It favors features with high locality preserving power. The Laplacian score is computed as

$$LS(f) = \frac{\sum_{x,y} \left( v(x, f) - v(y, f) \right)^2 S_{xy}}{\sum_{x} \left( v(x, f) - \mu_f \right)^2 D_{xx}}, \qquad S_{xy} = \begin{cases} e^{-\frac{\|d_x - d_y\|^2}{t}}, & \text{if } d_x, d_y \text{ are neighbors} \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

where $D_{xx} = \sum_y S_{xy}$, $\mu_f$ is the mean of the values of feature $f$, $t$ is a constant parameter, and $d_x$ and $d_y$ are neighbors if either $d_x$ belongs to the k-nearest neighbors of $d_y$, or vice versa.

In the comparison experiments, we first used a feature selection method to produce a reduced time series dataset. Then, we used the k-means algorithm to generate the clustering results. Finally, we used clustering evaluation measures to evaluate the clustering results produced by the different feature selection methods. To make the results stable, for each feature selection method, we ran the clustering five times and used the average of the evaluation measures to compare the clustering results of the different feature selection methods.
C. EVALUATION MEASURES
We chose three feature selection for comparisons with the
proposed method. They are Variance score [12], Laplacian Three evaluation measures were used to evaluate the clus-
score [13], and Sparse K-Means score (SK-Means) [14]. tering results in the experiments. The first measure is Mean
These methods represent state-of-the-art individual variable Index Adequacy (MIA) [21], defined as the average of the
weighting methods. The SK-Means method uses the well- distances between the objects and the centers of the clusters to
known lasso-type penalty to select the features. The Variance which the objects are assigned. MIA is calculated as follows:
v
score method uses the variance of instances for each of the u k
u1 X
d r (i) , L (i)

attributes as a measure to estimate the separability. For a MIA = t (11)
given feature f and instance values v(x, f ), x = 1, ..., n, k
i=1


FIGURE 8. Performance comparison on one-day aggregated datasets. The vertical axis shows the clustering performance measure and the
horizontal axis is the number of features removed in the reduced data set. (a) Evaluation on workday data. (b) Evaluation on weekend day
data.
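
As a point of reference for the baselines compared in Fig. 8, the variance score of Eq. (9) can be sketched in a few lines. The function names are hypothetical, and the rule of removing the lowest-scoring features first is an assumption made for illustration; the source gives only the number of removed features, not the selection rule used by each baseline.

```python
def variance_score(values):
    """Population variance of one feature's values v(x, f), x = 1..n (Eq. 9)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def lowest_variance_features(matrix, num_remove):
    """Indices of the `num_remove` columns with the smallest variance score.

    matrix: list of rows (instances), each a list of d feature values.
    Assumption: low-variance features are the ones treated as insignificant.
    """
    num_features = len(matrix[0])
    scores = [
        variance_score([row[f] for row in matrix]) for f in range(num_features)
    ]
    ranked = sorted(range(num_features), key=lambda f: scores[f])
    return sorted(ranked[:num_remove])


# Toy data (made up): the first column is constant, so it scores zero.
data = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 2.0],
    [1.0, 9.0, 4.0],
]
removed = lowest_variance_features(data, 1)
```

A constant column, such as the faulty constant meter readings mentioned in the data description, receives a score of zero and is therefore the first candidate for removal under this rule.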

where $k$ is the total number of clusters, $L^{(i)}$ is the set of objects in cluster $i$, $r^{(i)}$ is the center of cluster $i$, and $d$ is the sum of the distances between the objects in the cluster and the cluster center. MIA measures the separation of clusters. The smaller the MIA, the better separated the clusters.

The second measure is the Davies-Bouldin Index (DBI) [22], which measures the ratio of the within-cluster scatter to the between-cluster separation. DBI is calculated as

$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} \left\{ \frac{d'(L^{(i)}) + d'(L^{(j)})}{d(r^{(i)}, r^{(j)})} \right\} \tag{12}$$

where $L^{(i)}$ is the set of objects in cluster $i$, $d'(L^{(i)})$ is the geometric mean of the inter-distances between objects in $L^{(i)}$, and $d(r^{(i)}, r^{(j)})$ is the distance between the centers of clusters $i$ and $j$. The smaller the DBI, the better the clustering result.

The third measure is the CD index [23], defined as the total distance between the centers of all clusters. CD is calculated as follows:

$$CD = \frac{D_{max}}{D_{min}} \sum_{i=1}^{k} \sum_{j=1}^{k-1} d\bigl(r^{(i)}, r^{(j)}\bigr) \tag{13}$$

where $D_{max}$ and $D_{min}$ represent the maximum and minimum distances between the cluster centers, respectively. The larger the CD, the better the clustering result.

D. PERFORMANCE COMPARISON ON FEATURE SELECTION METHODS
Using the one-month AMI dataset, we conducted experiments to compare the clustering performance of the new feature selection method and the other three methods. In preprocessing, we divided the dataset into a workday dataset and a weekend day dataset and removed 1004 anomalous time series from the workday dataset and 1363 anomalous time series from the weekend day dataset. Then, we ran the four feature selection methods on the two datasets to remove insignificant features. After that, we aggregated the multiple-day time series in each dataset into one-day time series by averaging the electricity consumption values at each time slot in the one-day window. Finally, we used the k-means clustering algorithm to cluster the aggregated one-day workday and weekend datasets. The clustering results were evaluated with the three evaluation measures discussed above. The number of clusters $k$ was visually determined. For the workday dataset, $k$ was chosen as 25, and for the weekend data, $k$ was set as 30.

Fig. 8 shows the clustering performance comparisons of the four feature selection methods on the workday and weekend datasets, evaluated with the three measures. The vertical axis indicates the evaluation measure and the horizontal axis is the number of features removed by the four feature selection methods. We can observe that both the MIA and DBI measures decrease as the number of removed features increases, and the CD measure increases as the number of removed features increases. These results indicate that feature selection is necessary for improving the clustering performance on both workday and weekend datasets.

The comparison of the four feature selection methods shows that the proposed method performs the best in all three measures, because its performance line in the MIA and DBI figures is located below the Laplacian, variance, and SK-Means performance lines, whereas it is located above the other three performance lines in the CD figure. By analyzing the comparison results in the figures, we observe that MIA and


FIGURE 9. Visualization of electricity consumption behaviors based segmentation on workday: y-axis: power consumption (kW); x-axis:
half hour index (time).

FIGURE 10. Visualization of electricity consumption behaviors based segmentation on weekend: y-axis: power consumption (kW); x-axis:
half hour index (time).
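
The clusters visualized in Fig. 9 and Fig. 10 come from running k-means on the aggregated one-day profiles. A minimal sketch of that clustering step (Lloyd's algorithm) is given below. It is an illustration only: the source does not say how k-means was initialized or implemented, so the caller-supplied `initial_centers`, the function names, and the toy profiles are all assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's k-means; returns (centers, labels)."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy profiles (made up) with 4 slots: two flat loads and two daytime peaks.
profiles = [[5, 5, 5, 5], [6, 6, 6, 6], [1, 9, 9, 1], [2, 10, 10, 2]]
centers, labels = kmeans(profiles, [profiles[0], profiles[2]])
```

Each resulting center is itself a daily load curve, which is what makes the cluster patterns directly plottable against the time-of-day axis.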

DBI measurements are the lowest and the CD measurement is the highest when 250 and 300 features are removed from the weekend and workday datasets, respectively. On this basis, we remove 300 features from the workday dataset and 250 features from the weekend day dataset.

E. CLUSTER PATTERNS ANALYSIS
After the comparison of feature selection methods, we used the new feature selection method to remove 300 features from the workday dataset and 250 features from the weekend day dataset. The average numbers of removed features per day from the workday and weekend datasets are 13.6 and 31.3, respectively (300 features over 22 workdays and 250 features over eight weekend days). Then, we used the k-means clustering algorithm to cluster the daily aggregated datasets generated from the reduced one-month datasets for workdays and weekends. The numbers of generated clusters for the workday and weekend datasets were 30 and 25, respectively, which were determined visually.

Fig. 9 and Fig. 10 visualize the 30 and 25 clusters from the workday and weekend day datasets, respectively. These clusters show the daily electricity consumption patterns of different groups of factories in November 2012. We can see the difference in consumption patterns between workdays and weekends. These patterns reflect the production operation patterns of factories in different industrial sectors.

Two obvious patterns are the two-shift and three-shift production modes. These are the common production modes in the discrete manufacturing process in the PRD region of Guangdong Province in China. Some clusters reflect the production patterns of the continuous manufacturing process, which show a constant electricity consumption pattern over the 24 hours of a day.

In the array of Fig. 9, the 9 clusters in the first row and the first four clusters from the left in the second row are three-shift patterns. The next 8 clusters are two-shift patterns. The next two clusters are constant patterns that represent continuous manufacturing processes. The following 13 clusters present different patterns, some showing discrete manufacturing patterns and some showing


TABLE 1. Dominant factory types in power consumption based work patterns.
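
The per-cluster summary reported in Table 1, the three most frequent industry sectors in a cluster and the percentage of the cluster's factories in each, can be computed with a simple count. The sketch below is illustrative only: the function name `dominant_sectors` and the toy labels are assumptions, and the sector names are reused from the text purely as example values.

```python
from collections import Counter, defaultdict


def dominant_sectors(labels, sectors, top_n=3):
    """For each cluster, return its top_n sectors with percentage shares.

    labels: cluster label of each factory.
    sectors: industry sector of each factory, in the same order.
    Returns {cluster: [(sector, percent), ...]} in descending order of share.
    """
    by_cluster = defaultdict(Counter)
    for label, sector in zip(labels, sectors):
        by_cluster[label][sector] += 1
    summary = {}
    for label, counts in by_cluster.items():
        total = sum(counts.values())
        summary[label] = [
            (sector, 100.0 * count / total)
            for sector, count in counts.most_common(top_n)
        ]
    return summary


# Toy assignment (made up): six factories in two clusters.
toy_labels = [0, 0, 0, 0, 1, 1]
toy_sectors = [
    "Metal Products", "Metal Products", "Plastic Products",
    "Communication Equipment", "Plastic Products", "Plastic Products",
]
table = dominant_sectors(toy_labels, toy_sectors)
```

Sectors with equal counts are listed in the order in which they were first encountered, which is how `Counter.most_common` breaks ties.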

continuous manufacturing patterns. The cluster patterns imply irregular manufacturing processes that may be caused by production disturbances such as insufficient production orders, frequent changes of production processes, or partial operation of production lines due to maintenance. For example, the first cluster in the bottom row shows a two-shift production pattern, but the electricity consumption in the morning and afternoon shifts was small. These patterns reflect either factories of small capacity or factories whose production capacity is not fully used due to insufficient production orders.

The magnitude of electricity consumption differs from one cluster to another. The differences result from differences in electricity consumption across industry sectors and in the production capacity of factories within the same industry sector. For instance, clusters 1 to 9 represent the three-shift mode but have different peak electricity consumption on workdays, from 500 kW to 10000 kW. The highest peak consumption of cluster 23 ranges from 6000 to 12000 kW. These clusters are small, with a few factories.

From the array of Fig. 10, we can see that the weekend cluster patterns are more diverse than the workday cluster patterns. There are fewer three-shift patterns because weekends are not regular work days in many factories. Some factories work only on Saturdays, with only a morning shift and an afternoon shift. Few factories work three shifts on weekends. Many factories work irregularly on weekends because they cannot complete their production orders on workdays.

From the cluster patterns, we can further analyze the characteristics of the factories represented in each cluster pattern. Table 1 lists examples of cluster patterns. Each cluster contains time series of factories in different industry sectors. For each cluster, we list the three most frequent industry sectors in the cluster and the percentage of the factories in each industry sector. We can see that factories in different industry sectors use the same production mode. For example, the three-shift cluster pattern contains factories mostly from the Metal Products, Plastic Products, and Communication Equipment industry sectors. Since the Metal Products and Communication Equipment industry sectors are the major industries in the PRD region of Guangdong province in China and the product categories in these industries are diverse, the factories in these industry sectors have different production modes.

Cluster patterns of weekend data are not clear because the production processes of different factories in different industry sectors are different on weekends. Some factories do not work on weekends. Some work only on Saturdays and some work on both Saturdays and Sundays. Table 2 shows the percentages of factories in different industry sectors that do not work (No-Day-Off) or work on Saturdays (1-Day-Off) or work on both Saturdays and Sundays (2-Day-Off). We can see that most factories work only one weekend day, Saturday. Very few factories work two days on weekends. These different work policies on weekends make the cluster patterns on weekends different from the workday cluster patterns.

F. APPLICATIONS OF CLUSTER PATTERNS
One potential application of the segmentation of factories on electricity consumption patterns is to design variable electricity price rates to reduce the peak loads of the smart grid. The economic benefits of such time-variable electricity rates


TABLE 2. Percentage of consumption patterns with respect to the factory types.

are justifiable [24]. However, the design of time-variable rates requires segmenting the electricity users according to their load profiles [25]. Segment-specific rate design determines a time-variable rate for each factory segment. As stated in [26], segment-specific rate design is a complex process that requires determining the number of time zones, the start times of all time zones, the total number of price zones, and the profitability of suppliers. In this process, segmentation of users on load profiles is the first necessary step.

VI. CONCLUSIONS
The extensive roll-out of smart meters on smart grids generates enormous opportunities and also creates challenges for electricity utilities. Significant investments in the AMI allow for a high level of monitoring, control, and optimization of smart grids, which, subsequently, leads to improved customer services. Utilization of the AMI data from smart meters enables utilities to achieve significant business gains. However, effective and efficient processing and analysis of big AMI data is still a big challenge for smart grid companies.

In this paper, we presented an implementation and evaluation of a cluster analysis approach for application to smart meter data. We proposed a new feature selection method to reduce the dimensions of a selected time window by removing insignificant features, thus improving the clustering performance. We demonstrated that the discovered cluster patterns allow for a better segmentation of factories, using specific patterns in electricity consumption behaviors to identify different production modes. We discussed the application of segmentation in segment-specific time-variable rate design.

In our future work, we will study the change of cluster patterns over time and develop a predictive technology for prediction of the pattern change.

REFERENCES
[1] S. J. Moss, M. Cubed, and K. Fleisher, "Market segmentation and energy efficiency program design," California Inst. Energy Environ., Berkeley, CA, USA, Tech. Rep., 2008.
[2] C. Gu, P. Shi, S. Shi, H. Huang, and X. Jia, "A tree regression-based approach for VM power metering," IEEE Access, vol. 3, pp. 610–621, May 2015.
[3] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated variable weighting in K-means type clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 5, pp. 657–668, May 2005.
[4] I. Khan, J. Z. Huang, N. T. Tung, and G. Williams, "Ensemble clustering of high dimensional data with fastmap projection," in Proc. Trends Appl. Knowl. Discovery Data Mining, Nov. 2014, pp. 483–493.
[5] I. Khan, J. Huang, and K. Ivanov, "Incremental density-based ensemble clustering over evolving data streams," Neurocomputing, vol. 191, pp. 34–43, 2016.
[6] I. Khan, J. Huang, and N. Tung, "Learning time-based rules for prediction of alarms from telecom alarm data using ant colony optimization," Int. J. Comput. Inf. Technol., vol. 13, no. 1, pp. 139–147, 2013.
[7] S. V. Verdu, M. O. Garcia, C. Senabre, A. G. Marin, and F. J. G. Franco, "Classification, filtering, and identification of electrical customer load patterns through the use of self-organizing maps," IEEE Trans. Power Syst., vol. 21, no. 4, pp. 1672–1682, Nov. 2006.
[8] J. Kwac, J. Flora, and R. Rajagopal, "Household energy consumption segmentation using hourly data," IEEE Trans. Smart Grid, vol. 5, no. 1, pp. 420–430, Jan. 2014.
[9] A. Albert and R. Rajagopal, "Smart meter driven segmentation: What your consumption says about you," IEEE Trans. Power Syst., vol. 28, no. 4, pp. 4019–4030, Nov. 2013.
[10] A. Sankar and C.-H. Lee, "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 4, no. 3, pp. 190–202, May 1996.
[11] K. Wagstaff and C. Cardie, "Clustering with instance-level constraints," in Proc. 17th Int. Conf. Mach. Learn., 2000, pp. 1103–1110.
[12] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford Univ. Press, 1995.
[13] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 507–514.
[14] D. M. Witten and R. Tibshirani, "A framework for feature selection in clustering," J. Amer. Statist. Assoc., vol. 105, no. 490, pp. 713–726, 2010.
[15] R. Berthier, W. Sanders, and H. Khurana, "Intrusion detection for advanced metering infrastructures: Requirements and architectural directions," in Proc. 1st IEEE Int. Conf. Smart Grid Commun. (SmartGridComm), Oct. 2010, pp. 350–355.
[16] F. P. Sioshansi, Smart Grid: Integrating Renewable, Distributed and Efficient Energy. San Francisco, CA, USA: Academic, 2011.
[17] S. Meters, "Smart meter systems: A metering industry perspective," Edison Electr. Inst., Washington, DC, USA, EEI-AEIC-UTC White Paper, 2011.
[18] C. C. Aggarwal, Data Streams: Models and Algorithms, vol. 31. New York, NY, USA: Springer, 2007.
[19] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data," Data Mining Knowl. Discovery, vol. 11, no. 1, pp. 5–33, 2005.
[20] B. Shneiderman, "Inventing discovery tools: Combining information visualization with data mining," Inf. Vis., vol. 1, no. 1, pp. 5–12, 2002.
[21] G. Chicco, R. Napoli, F. Piglione, and C. Toader, "A review of concepts and techniques for emergent customer categorisation," in Proc. TELMARK Discussion Forum Eur. Electricity Markets, London, U.K., 2002, pp. 51–58.
[22] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. 1, no. 2, pp. 224–227, Apr. 1979.
[23] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On clustering validation techniques," J. Intell. Inf. Syst., vol. 17, no. 2, pp. 107–145, Dec. 2001.


[24] H. Parmesano, "Rate design is the no. 1 energy efficiency tool," Electr. J., vol. 20, no. 6, pp. 18–25, 2007.
[25] G. Chicco, R. Napoli, P. Postolache, M. Scutariu, and C. Toader, "Customer characterization options for improving the tariff offer," IEEE Trans. Power Syst., vol. 18, no. 1, pp. 381–387, Feb. 2003.
[26] C. Flath, D. Nicolay, and F. Lilia, "Cluster analysis of smart metering data," Bus. Inf. Syst. Eng., vol. 4, no. 1, pp. 31–39, 2012.

IMRAN KHAN received the M.S. degree from the National University of Computer and Emerging Sciences, Islamabad, Pakistan, in 2011. He is currently pursuing the Ph.D. degree with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. His research interests include data mining, analysis of complex data, data warehouse, and the business intelligent system.

JOSHUA ZHEXUE HUANG received the Ph.D. degree from the Royal Institute of Technology in 1994. From 1994 to 1998, he served as a Researcher with the Australian Academy of Science. From 1998 to 2000, he served as a Business Intelligence Senior Consultant with MIP Australia Company. From 2000 to 2008, he served as an Advanced Researcher with The University of Hong Kong. He joined SIAT, Chinese Academy of Sciences, in 2008. He is currently a Professor with Shenzhen University, China. His current research interests include data mining, analysis of complex data, database, data warehouse, and the business intelligent system.

Md Abdul Masud is currently pursuing the Ph.D. degree with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. His research interests include data mining and analysis of complex data.

QINGSHAN JIANG received the Ph.D. degree in mathematics from the Chiba Institute of Technology, Japan, in 1996, and the Ph.D. degree in computer science from the University of Sherbrooke, Canada, in 2002. In 1999, he was a Post-Doctoral Fellow with The Fields Institute for Research in Mathematical Sciences, University of Toronto, Canada. He is currently a Professor with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. His research interests include data mining, information security, pattern recognition, massive data analysis, and database technology.