Customer Segmentation and Profiling Thesis
S.M.H. Jansen
This Master thesis was written to complete the study Operations Research at
the University of Maastricht (UM). The research took place at the Department
of Mathematics of UM and at the Department of Information Management of
Vodafone Maastricht. During this research, I had the privilege to work together
with several people. I would like to express my gratitude to all those people for
giving me the support to complete this thesis. I want to thank the Department
of Information Management for giving me permission to commence this thesis
in the first instance, to do the necessary research work and to use departmental
data.
I am deeply indebted to my supervisor Dr. Ronald Westra, whose help, stimulating suggestions and encouragement supported me throughout the research for and the writing of this thesis. Furthermore, I would like to give my special thanks to my second supervisor Dr. Ralf Peeters, whose patience and enthusiasm enabled me to complete this work. I also want to thank my thesis instructor, Drs. Annette Schade, for her stimulating support and for encouraging me to go ahead with my thesis.
My former colleagues from the Department of Information Management supported me in my research work. I want to thank them for all their help, support, interest and valuable hints. I am especially obliged to Drs. Philippe Theunen and Laurens Alberts, MSc.
Finally, I would like to thank the people who looked closely at the final version of the thesis for English style and grammar, correcting both and offering suggestions for improvement.
Contents
1 Introduction 8
1.1 Customer segmentation and customer profiling . . . . . . . . . . 9
1.1.1 Customer segmentation . . . . . . . . . . . . . . . . . . . 9
1.1.2 Customer profiling . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Structure of the report . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Clustering 22
3.1 Cluster analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 The clusters . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.3 Cluster partition . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Cluster algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 K-medoid . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Fuzzy C-means . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.4 The Gustafson-Kessel algorithm . . . . . . . . . . . . . . 29
3.2.5 The Gath Geva algorithm . . . . . . . . . . . . . . . . . . 30
3.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Principal Component Analysis . . . . . . . . . . . . . . . 33
3.4.2 Sammon mapping . . . . . . . . . . . . . . . . . . . . . . 34
3.4.3 Fuzzy Sammon mapping . . . . . . . . . . . . . . . . . . . 35
4.3 Designing the segments . . . . . . . . . . . . . . . . . . . . . . . 45
Bibliography 68
List of Figures
4.1 Values of Partition Index, Separation Index and the Xie Beni Index 38
4.2 Values of Dunn’s Index and the Alternative Dunn Index . . . . . 39
4.3 Values of Partition coefficient and Classification Entropy with
Gustafson-Kessel clustering . . . . . . . . . . . . . . . . . . . . . 40
4.4 Values of Partition Index, Separation Index and the Xie Beni
Index with Gustafson-Kessel clustering . . . . . . . . . . . . . . . 41
4.5 Values of Dunn’s Index and Alternative Dunn Index with Gustafson-
Kessel clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6 Result of K-means algorithm . . . . . . . . . . . . . . . . . . . . 43
4.7 Result of K-medoid algorithm . . . . . . . . . . . . . . . . . . . . 44
4.8 Result of Fuzzy C-means algorithm . . . . . . . . . . . . . . . . . 44
4.9 Result of Gustafson-Kessel algorithm . . . . . . . . . . . . . . . . 44
4.10 Result of Gath-Geva algorithm . . . . . . . . . . . . . . . . . . . 45
4.11 Distribution of distances from cluster centers within clusters for
the Gath-Geva algorithm with c = 4 . . . . . . . . . . . . . . . . 46
4.12 Distribution of distances from cluster centers within clusters for
the Gustafson-Kessel algorithm with c = 6 . . . . . . . . . . . . . 46
4.13 Cluster profiles for c = 4 . . . . . . . . . . . . . . . . . . . . . . . 47
4.14 Cluster profiles for c = 6 . . . . . . . . . . . . . . . . . . . . . . . 48
4.15 Cluster profiles of centers for c = 4 . . . . . . . . . . . . . . . . . 49
4.16 Cluster profiles of centers for c = 6 . . . . . . . . . . . . . . . . . 50
5.2 Separating hyperplanes in different dimensions . . . . . . . . . . 54
5.3 Demonstration of the maximum-margin hyperplane . . . . . . . . 55
5.4 Demonstration of the soft margin . . . . . . . . . . . . . . . . . . 56
5.5 Demonstration of kernels . . . . . . . . . . . . . . . . . . . . . . . 57
5.6 Examples of separation with kernels . . . . . . . . . . . . . . . . 58
5.7 A separation of classes with complex boundaries . . . . . . . . . 59
List of Tables
4.1 The values of all the validation measures with K-means clustering 39
4.2 The values of all the validation measures with Gustafson-Kessel
clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 The numerical values of validation measures for c = 4 . . . . . . 42
4.4 The numerical values of validation measures for c = 6 . . . . . . 43
4.5 Segmentation results . . . . . . . . . . . . . . . . . . . . . . . . . 51
Abstract
Chapter 1
Introduction
advance with knowledge of an expert, and dividing the customers over these segments according to their best fit. This research deals with the problem of making customer segmentations without the knowledge of an expert and without defining the segments in advance. The segments will be determined based on (call) usage behavior. To realize this, different data mining techniques, called clustering techniques, will be developed, tested, validated and compared to each other. In this report, the principles of the clustering techniques will be described and the process of determining the best technique will be discussed.
Once the segments are obtained, a profile will be determined for each customer from the customer data. To find a relation between the profile and the segments, a data mining technique called Support Vector Machines (SVM) will be used. A Support Vector Machine is able to estimate the segment of a customer from personal information, such as age, gender and lifestyle. Based on the combination of this personal information (the customer profile), the segment can be estimated and the usage behavior of the customer profile can be determined. In this research, different settings of the Support Vector Machines will be examined and the best-performing estimation model will be used.
segmentations is not an easy task. Difficulties in making a good segmentation are [18]:
• Relevance and quality of data are essential to develop meaningful segments. If the company has insufficient customer data, the meaning of a customer segmentation is unreliable and almost worthless. Alternatively, too much data can lead to complex and time-consuming analysis. Poorly organized data (different formats, different source systems) also makes it difficult to extract interesting information. Furthermore, the resulting segmentation can be too complicated for the organization to implement effectively. In particular, the use of too many segmentation variables can be confusing and result in segments which are unfit for management decision making. On the other hand, apparently effective variables may not be identifiable. Many of these problems are due to an inadequate customer database.
• Intuition: Although data can be highly informative, data analysts need
to be continuously developing segmentation hypotheses in order to identify
the ’right’ data for analysis.
• Continuous process: Segmentation demands continuous development
and updating as new customer data is acquired. In addition, effective seg-
mentation strategies will influence the behavior of the customers affected
by them; thereby necessitating revision and reclassification of customers.
Moreover, in an e-commerce environment where feedback is almost immediate, segmentation would require an almost daily update.
• Over-segmentation: A segment can become too small and/or insufficiently distinct to justify treatment as a separate segment.
One solution to construct segments can be provided by data mining methods
that belong to the category of clustering algorithms. In this report, several
clustering algorithms will be discussed and compared to each other.
gender. If one needs profiles for specific products, the file would contain product information and/or the amount of money spent. Customer features one can use for profiling are described in [2, 10, 19]:
• Geographic. Are they grouped regionally, nationally or globally?
• Cultural and ethnic. What languages do they speak? Does ethnicity affect
their tastes or buying behaviors?
• Economic conditions, income and/or purchasing power. What is the average household income or purchasing power of the customers? Do they have any payment difficulties? How much or how often does a customer spend on each product?
• Age and gender. What is the predominant age group of your target buyers? How many children are in the family and what are their ages? Are more females or males using a certain service or product?
• Values, attitudes and beliefs. What is the customers’ attitude toward your
kind of product or service?
• Life cycle. How long has the customer been regularly purchasing products?
• Knowledge and awareness. How much knowledge do customers have about a product, service, or industry? How much education is needed? How much brand building advertising is needed to make a pool of customers aware of the offer?
• Lifestyle. How many lifestyle characteristics about purchasers are useful?
• Recruitment method. How was the customer recruited?
The choice of the features also depends on the availability of the data. With these features, an estimation model can be made. This can be realized by a data mining method called Support Vector Machines (SVM). This report gives a description of SVMs and investigates under which circumstances and with which parameters an SVM works best in this case.
networks, have made data mining more attractive and practical. The typical data mining process consists of the following steps [4]:
• problem formulation
• data preparation
• model building
• interpretation and evaluation of the results
Pattern extraction is an important component of any data mining activity and it deals with relationships between subsets of data. A formal definition of a pattern is given in [4].
Data mining tasks are used to extract patterns from large data sets. The various data mining tasks can be broadly divided into six categories as summarized in Figure 1.1. The taxonomy reflects the emerging role of data visualization as a separate data mining task, even as it is used to support other data mining tasks. Validation of the results is also a data mining task. Because validation supports the other data mining tasks and is always necessary within a research project, it is not mentioned as a separate task. Different data mining tasks are grouped into categories depending on the type of knowledge extracted by the tasks. The identification of patterns in a large data set is the first step to gaining useful marketing insights and making critical marketing decisions. The data mining tasks generate an assortment of customer and market knowledge, which forms the core of the knowledge management process. The specific tasks to be used in this research are Clustering (for the customer segmentation), Classification (for estimating the segment) and Data visualization.
Clustering algorithms produce classes that maximize similarity within clusters
but minimize similarity between classes. A drawback of this method is that the
number of clusters has to be given in advance. The advantage of clustering is
that expert knowledge is not required. For example, based on user behavior
data, clustering algorithms can classify the Vodafone customers into ”call only”
users, ”international callers”, ”SMS only” users etc.
Classification algorithms group customers into predefined classes. For example,
Vodafone can classify its customers based on their age, gender and type of subscription and then target them according to their usage behavior.
Data visualization allows data miners to view complex patterns in their customer data as visual objects in two or three dimensions and colors. In some cases it is necessary to reduce high-dimensional data to two or three dimensions. To realize this, algorithms such as Principal Component Analysis and Sammon's Mapping (discussed in Section 3.4) can be used. To provide varying levels of detail of observed patterns, data miners use applications that provide advanced manipulation capabilities to slice, rotate or zoom the objects.
Chapter 2
The first step (after the problem formulation) in the data mining process is
to understand the data. Without such an understanding, useful applications
cannot be developed. All data of Vodafone is stored in a data warehouse. In this chapter, the process of collecting the right data from this data warehouse will be described. Furthermore, the process of preparing the data for customer
segmentation and customer profiling will be explained.
Figure 2.1: Structure of customers by Vodafone
he works. These customers are called business users. In some cases, customers with a consumer account can have a subscription that is, under normal circumstances, only available to business users. These customers also count as business users. The total number of (postpaid) business users at Vodafone is more than
800,000. The next sections describe which data of these customers is needed for
customer segmentation and profiling.
• How? : How can a customer cause a call detail record? By making a voice
call, or sending an SMS (there are more possibilities, but their appearances
are so rare that they were not used during this research). The customer
can also receive an SMS or voice call.
• Who? : Who is the customer calling? Does he call to fixed lines? Does
he call to Vodafone mobiles?
• What? : What is the location of the customer and the recipient? They
can make international phone calls.
• When? : When does a customer call? A business customer can call during office hours, or in private time in the evening, at night or during the weekend.
• Where? : Where is the customer calling? Is he calling abroad?
• How long? : How long is the customer calling?
• How often? : How often does a customer call or receive a call?
Based on these keywords and on features proposed in the literature [1, 15, 19, 20], a list of features that can be used as a summary description of a customer, based on the calls they originate and receive over some time period P, is obtained:
1. average call duration
2. average # calls received per day
3. average # calls originated per day
4. % daytime calls (9am - 6pm)
5. % of weekday calls (Monday - Friday)
6. % of calls to mobile phones
7. average # sms received per day
These twelve features can be used to build customer segments. Such a segment describes a certain behavior of a group of customers. For example, customers who use their telephone only at their office could be in a different segment than users who also use their telephone for private purposes. In that case, the segmentation would be based on the percentage of weekday and daytime calls. Most of the twelve features listed above can be generated in a straightforward manner from the underlying data of the data warehouse, but some features require a little more creativity and operations on the data.
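As a rough illustration of how such summary features could be derived from call detail records, the sketch below uses pandas. The column names (customer_id, start_time, duration_s, direction) and the table layout are assumptions made for this example only; the actual warehouse schema and the complete feature set differ.

import pandas as pd

def summarize_customers(cdr: pd.DataFrame) -> pd.DataFrame:
    """Derive a few summary features from a (hypothetical) call detail record table."""
    cdr = cdr.copy()
    cdr["daytime"] = cdr["start_time"].dt.hour.between(9, 17)   # 9am - 6pm
    cdr["weekday"] = cdr["start_time"].dt.dayofweek < 5         # Monday - Friday
    cdr["originated"] = (cdr["direction"] == "out").astype(int)
    cdr["received"] = (cdr["direction"] == "in").astype(int)

    # length of the observation period P in days (at least 1)
    days = max((cdr["start_time"].max() - cdr["start_time"].min()).days, 1)

    g = cdr.groupby("customer_id")
    return pd.DataFrame({
        "avg_call_duration": g["duration_s"].mean(),
        "avg_calls_received_per_day": g["received"].sum() / days,
        "avg_calls_originated_per_day": g["originated"].sum() / days,
        "pct_daytime_calls": g["daytime"].mean() * 100,
        "pct_weekday_calls": g["weekday"].mean() * 100,
    })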
It may be clear that generating useful features, including summary features, is a critical step within the data mining process. Should poor features be generated, data mining will not be successful. Although the construction of these features may be guided by common sense, it should include exploratory data analysis. For example, the use of the time period 9am-6pm in the fourth feature is not based on the commonsense knowledge that the typical workday at an office is from 9am to 5pm. More detailed exploratory data analysis, shown in Figure 2.2, indicates that the period from 9am to 6pm is actually more appropriate for this purpose. Furthermore, for each summary feature there should be sufficient variance within the data, otherwise distinguishing between customers is not possible and the feature is not useful. On the other hand, too much variance hampers the process of segmentation. For some feature values, the variance is visible in the following histograms. Figure 2.3 shows that the average call duration, the number of weekday and daytime calls and the originated calls have sufficient variance. Note that the histograms resemble well-known distributions. This also indicates that the chosen features are suited for the customer segmentation. It is interesting to see the relation between the number of calls originated and received. First of all, in general, customers originate more calls than they receive. Figure 2.4 demonstrates this: values above the blue line represent customers who originate more calls than they receive. Figure 2.4 also shows that customers who originate more calls also receive proportionally more calls. Another aspect that is simple to observe is the fact that customers
that make more weekday calls also call more at daytime (in proportion). This is
plotted in Figure 2.5. It is clear that the chosen features contain sufficient variance and that certain relations and differences in customer behavior are already visible. The features appear to be well chosen and useful for customer segmentation.
Because a relatively small difference in age between customers should indicate a close relationship, the ages of the customers have to be grouped. Otherwise, the result of the classification algorithm is too specific to the training data [14]. In general, the goal of grouping variables is to reduce the number of variables to a more manageable size and to remove correlations between the variables. The composition of the groups should be chosen with care. It is of high importance that the sizes of the groups are almost equal (if this is possible) [22]. If there is one group with a sufficiently higher number of customers than the other groups, this feature will not increase the performance of the classification. This is caused by the fact that a relatively high number of customers from each segment is represented in this group. Based on this feature alone, the segment of a customer cannot be determined. Table 2.1 shows the percentages of customers within the chosen groups. It is clear that the sizes of the groups were chosen with care and that the values can be used for defining the customer profile. With this profile, a Support Vector Machine will be used to estimate the segment of the customer. Chapter 5 and Chapter 6 contain information and results of this method.
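As a small illustration of the grouping described above, the following sketch bins a hypothetical age column into groups of roughly equal size. The number of groups, the labels and the sample values are assumptions made for the example, not the grouping actually used for Table 2.1.

import pandas as pd

# Hypothetical customer ages; in practice this column comes from the warehouse.
profiles = pd.DataFrame({"age": [23, 31, 38, 44, 52, 61, 47, 35, 29, 58]})

# qcut places the bin edges so that each group receives roughly the same
# number of customers, which avoids one dominant group.
profiles["age_group"] = pd.qcut(profiles["age"], q=4,
                                labels=["group 1", "group 2", "group 3", "group 4"])
print(profiles["age_group"].value_counts())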
Data may contain cryptic codes. These codes have to be augmented and replaced by recognizable and equivalent text.
• Combining data, for instance the customer data, from multiple tables into
one common variable.
Chapter 3
Clustering
In this chapter, the techniques used for the cluster segmentation will be explained.
clusters into which the data can be divided were easily identified. The similarity
criterion that was used in this case is distance: two or more objects belong to
the same cluster if they are ”close” according to a given distance (in this case
geometrical distance). This is called distance-based clustering. Another way of clustering is conceptual clustering. With this method, two or more objects belong to the same cluster if together they define a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures. In this research, only distance-based clustering algorithms were used.
The cluster centers may be vectors of the same dimensions as the data objects,
but can also be defined as ”higher-level” geometrical objects, such as linear or
nonlinear subspaces or functions.
Data can reveal clusters of different geometrical shapes, sizes and densities, as demonstrated in Figure 3.2. Clusters can be spherical, elongated and also hollow. Clusters can be found in any n-dimensional space. Clusters a, c and d can be characterized as linear and nonlinear subspaces of the data space (R2 in this case). Clustering algorithms are able to detect such subspaces of the data space, and can therefore be used to identify these clusters. The performance of most clustering
algorithms is influenced not only by the geometrical shapes and densities of the
individual clusters, but also by the spatial relations and distances among the
clusters. Clusters can be well-separated, continuously connected to each other,
or overlapping each other.
either be fuzzy or crisp (hard). Hard clustering methods are based on classical set theory, which requires that an object either does or does not belong to a cluster. Hard clustering of a data set X means partitioning the data into a specified number of mutually exclusive subsets of X. The number of subsets (clusters) is denoted by c. Fuzzy clustering methods allow objects to belong to several clusters simultaneously, with different degrees of membership. The data set X is thus partitioned into c fuzzy subsets. In many real situations, fuzzy clustering is more natural than hard clustering, as objects on the boundaries between several classes are not forced to fully belong to one of the classes, but rather are assigned membership degrees between 0 and 1 indicating their partial memberships (illustrated by Figure 3.3). The discrete nature of hard partitioning also
Hard partition
The objective of clustering is to partition the data set X into c clusters. Assume that c is known, e.g. based on prior knowledge, or that it is a trial value for which the resulting partition must be validated. Using classical sets, a hard partition can be seen as a family of subsets \{A_i \mid 1 \le i \le c\} \subset P(X), whose properties can be defined as follows:

\bigcup_{i=1}^{c} A_i = X,   (3.3)

A_i \cap A_j = \emptyset, \quad 1 \le i \ne j \le c,   (3.4)

\emptyset \subset A_i \subset X, \quad 1 \le i \le c.   (3.5)
These conditions imply that the subsets Ai contain all the data in X, they must be disjoint and none of them is empty nor contains all the data in X. Expressed in terms of membership functions:

\bigvee_{i=1}^{c} \mu_{A_i} = 1,   (3.6)
Fuzzy partition
A fuzzy partition can be defined as a generalization of hard partitioning; in this case µik is allowed to take any real value between zero and one. Consider the matrix U = [µik] containing the fuzzy partitions; its conditions are given by:

\mu_{ik} \in [0, 1], \quad 1 \le i \le c, \; 1 \le k \le N,   (3.13)

\sum_{i=1}^{c} \mu_{ik} = 1, \quad 1 \le k \le N,   (3.14)

0 < \sum_{k=1}^{N} \mu_{ik} < N, \quad 1 \le i \le c.   (3.15)

Note that there is only one difference with the conditions of the hard partitioning. Also, the definition of the fuzzy partitioning space does not differ much from
the definition of the hard partitioning space. It can be defined as follows: Let X be a finite data set and let the number of clusters satisfy 2 \le c < N, c \in \mathbb{N}. Then the fuzzy partitioning space for X can be seen as the set

M_{fc} = \{ U \in \mathbb{R}^{N \times c} \mid \mu_{ik} \in [0,1], \; \forall i,k; \;\; \sum_{i=1}^{c} \mu_{ik} = 1, \; \forall k; \;\; 0 < \sum_{k=1}^{N} \mu_{ik} < N, \; \forall i \}.   (3.16)

The i-th column of U contains the values of the membership function of the i-th fuzzy subset of X. Equation (3.14) implies that the total membership of each xk in X equals one. There are no constraints on the distribution of memberships among the fuzzy clusters.
This research will focus on hard partitioning. However, fuzzy cluster algorithms will be applied as well. To deal with the problem of fuzzy memberships, the cluster with the highest degree of membership will be chosen as the cluster to which the object belongs. This method results in hard-partitioned clusters. The possibilistic partition will not be used in this research and will not be discussed here.
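The hardening step described above simply takes, for every object, the cluster with the largest membership degree. A minimal illustration with a made-up membership matrix:

import numpy as np

# rows = objects, columns = clusters; values are fuzzy membership degrees
U = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3]])
hard_labels = U.argmax(axis=1)   # object 1 -> cluster 0, object 2 -> cluster 1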
3.2.1 K-means
K-means is one of the simplest unsupervised learning algorithms that solve the clustering problem. However, the results of this hard partitioning method are not always reliable, and the algorithm has numerical problems as well. The procedure follows an easy way to classify a given N × n data set into a certain number of clusters c defined in advance. The K-means algorithm allocates each data point to one of the c clusters so as to minimize the within-cluster sum of squares:

\sum_{i=1}^{c} \sum_{k \in A_i} \|x_k - v_i\|^2.   (3.17)

Ai represents the set of data points in the i-th cluster and vi is the average of the data points in cluster i. Note that \|x_k - v_i\|^2 is actually a chosen distance norm. Within the cluster algorithms, vi is the cluster center (also called prototype) of cluster i:

v_i = \frac{1}{N_i} \sum_{k=1}^{N_i} x_k, \quad x_k \in A_i,   (3.18)

where Ni is the number of data points in Ai.
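A minimal NumPy sketch of the procedure just described may clarify it: points are assigned to their nearest center and each center is then recomputed as the mean of its points (equation 3.18), until the assignment stops changing. This is an illustration of the algorithm, not the implementation used in the thesis.

import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # squared Euclidean distance of every point to every center, shape (N, c)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # assignment unchanged: converged
        labels = new_labels
        for i in range(c):
            if np.any(labels == i):    # recompute center as mean of its points
                centers[i] = X[labels == i].mean(axis=0)
    return centers, labels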
3.2.2 K-medoid
K-medoid clustering, also a hard partitioning algorithm, uses the same equations as the K-means algorithm. The only difference is that in K-medoid the cluster centers are the data points nearest to the mean of the data in one cluster, V = \{v_i \in X \mid 1 \le i \le c\}. This can be useful when, for example, there is no continuity in the data space, which implies that the mean of the points in one cluster does not actually exist.
with

V = [v_1, v_2, \ldots, v_c], \quad v_i \in \mathbb{R}^n.   (3.20)

V denotes the vector of cluster centers that has to be determined. The distance norm \|x_k - v_i\|_A^2 is called a squared inner-product distance norm and is defined by:

D_{ikA}^2 = \|x_k - v_i\|_A^2 = (x_k - v_i)^T A (x_k - v_i).   (3.21)
From a statistical point of view, equation (3.19) measures the total variance of x_k from v_i. The minimization of the c-means functional is a nonlinear optimization problem that can be solved by a variety of methods. Examples of methods that can solve nonlinear optimization problems are grouped coordinate minimization and genetic algorithms. The simplest method to solve this problem is a Picard iteration through the first-order conditions for the stationary points of equation (3.19). This method is called the fuzzy c-means (FCM) algorithm. To find the stationary points of the c-means functional, one can adjoin the constraint in (3.14) to J by means of Lagrange multipliers:

\bar{J}(X; U, V, \lambda) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m D_{ikA}^2 + \sum_{k=1}^{N} \lambda_k \Big( \sum_{i=1}^{c} \mu_{ik} - 1 \Big),   (3.22)

Setting the gradients of \bar{J} with respect to U, V and \lambda to zero yields the stationary points:

\mu_{ik} = \frac{1}{\sum_{j=1}^{c} (D_{ikA}/D_{jkA})^{2/(m-1)}}, \quad 1 \le i \le c, \; 1 \le k \le N,   (3.23)
and

v_i = \frac{\sum_{k=1}^{N} (\mu_{ik})^m x_k}{\sum_{k=1}^{N} (\mu_{ik})^m}, \quad 1 \le i \le c.   (3.24)
The solutions of these equations satisfy the constraints given in equations (3.13) and (3.15). Note that the v_i of equation (3.24) is the weighted average of the data points that belong to a cluster, where the weights are the membership degrees. This explains why the algorithm is called c-means. The Fuzzy C-means algorithm is in fact an iteration between equations (3.23) and (3.24). The FCM algorithm uses the standard Euclidean distance for its computations. Therefore, it is able to detect hyperspherical clusters. Note that it can only detect clusters with the same shape, caused by the common choice of the norm-inducing matrix A = I. The norm-inducing matrix can also be chosen as an n × n diagonal matrix of the form:
A_D = \begin{pmatrix} (1/\sigma_1)^2 & 0 & \cdots & 0 \\ 0 & (1/\sigma_2)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (1/\sigma_n)^2 \end{pmatrix}.   (3.25)
This matrix accounts for different variances in the directions of the coordinate axes of X. Another possibility is to choose A as the inverse of the n × n covariance matrix, A = F^{-1}, where

F = \frac{1}{N} \sum_{k=1}^{N} (x_k - \hat{x})(x_k - \hat{x})^T   (3.26)

and \hat{x} denotes the mean of the data. Note that, in this case, matrix A induces the Mahalanobis distance norm.
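A compact sketch of the iteration between equations (3.23) and (3.24), with the Euclidean norm (A = I), may make the update explicit. It is a plain illustration under these assumptions, not the toolbox implementation used later in the thesis.

import numpy as np

def fuzzy_cmeans(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    U = rng.random((c, N))
    U /= U.sum(axis=0)                                # memberships sum to 1 per point
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # cluster centers, eq. (3.24)
        # squared Euclidean distances D_{ik}^2, shape (c, N); small term avoids /0
        D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        W = D2 ** (-1.0 / (m - 1))
        U_new = W / W.sum(axis=0, keepdims=True)      # membership update, eq. (3.23)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return V, U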
If A is fixed, the conditions in (3.13), (3.14) and (3.15) can be applied without any problems. Unfortunately, equation (3.28) cannot be minimized in a straightforward manner, since it is linear in A_i. This implies that J can be made as small as desired by making A_i less positive definite. To avoid this, A_i has to be constrained to obtain a feasible solution. A general way to do this is by constraining the determinant of the matrix. A varying A_i with a fixed determinant corresponds to optimizing the cluster shape with a fixed volume:

\|A_i\| = \rho_i, \quad \rho_i > 0.   (3.29)

Here \rho_i is a constant for each cluster. Using the Lagrange multiplier method, A_i can be expressed in the following way:

A_i = [\rho_i \det(F_i)]^{1/n} F_i^{-1},   (3.30)

with

F_i = \frac{\sum_{k=1}^{N} (\mu_{ik})^m (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} (\mu_{ik})^m}.   (3.31)

F_i is called the fuzzy covariance matrix. Note that this equation, in combination with equation (3.30), can be substituted into equation (3.27). The resulting inner-product norm of (3.27) is a generalized squared Mahalanobis norm between the data points and the cluster centers, where the covariance is weighted by the membership degrees of U.
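The role of the fuzzy covariance matrix can be made concrete with a short sketch: for each cluster, equation (3.31) gives F_i, equation (3.30) gives the norm-inducing matrix A_i (here with the assumption rho_i = 1), and from it the cluster-specific squared distance. The membership and center updates themselves are the same as in fuzzy c-means, so only the distance computation is shown; this is illustrative, not the implementation used in the thesis.

import numpy as np

def gk_squared_distances(X, V, U, m=2.0, rho=1.0):
    """Squared Gustafson-Kessel distances of every point to every cluster."""
    c, N = U.shape
    n = X.shape[1]
    D2 = np.zeros((c, N))
    for i in range(c):
        w = U[i] ** m
        diff = X - V[i]                               # (N, n)
        F_i = (w[:, None] * diff).T @ diff / w.sum()  # fuzzy covariance, eq. (3.31)
        F_i = F_i + 1e-9 * np.eye(n)                  # small regularization keeps F_i invertible
        A_i = (rho * np.linalg.det(F_i)) ** (1.0 / n) * np.linalg.inv(F_i)   # eq. (3.30)
        D2[i] = np.einsum("kj,jl,kl->k", diff, A_i, diff)   # (x_k - v_i)^T A_i (x_k - v_i)
    return D2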
Compared with the Gustafson-Kessel algorithm, the distance norm contains an exponential term. This implies that this distance norm decreases faster than the inner-product norm. In this case, the fuzzy covariance matrix F_i is defined by:

F_{wi} = \frac{\sum_{k=1}^{N} (\mu_{ik})^w (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} (\mu_{ik})^w}, \quad 1 \le i \le c.   (3.33)

The reason for using the variable w is to generalize this expression. In the original FMLE algorithm, w = 1. In this research, w is set to 2 to compensate for the exponential term and to obtain clusters that are more fuzzy. Because of this generalization, two weighted covariance matrices arise. The variable \alpha_i in equation (3.32) is the prior probability of selecting cluster i. \alpha_i can be defined as follows:

\alpha_i = \frac{1}{N} \sum_{k=1}^{N} \mu_{ik}.   (3.34)
Gath and Geva [9] showed that the FMLE algorithm is able to detect clusters of different shapes, sizes and densities and that the clusters are not constrained in volume. The main drawback of this algorithm is its limited robustness, since the exponential distance norm can make it converge to a local optimum. Furthermore, it is not known how reliable the results of this algorithm are.
3.3 Validation
Cluster validation refers to the problem of whether a found partition is correct and how to measure the correctness of a partition. A clustering algorithm is designed to parameterize clusters in such a way that it gives the best fit. However, this does not imply that the best fit is meaningful at all. The number of clusters might not be correct, or the cluster shapes may not correspond to the actual groups in the data. In the worst case, the data cannot be grouped in a meaningful way at all. One can distinguish two main approaches to determine the correct number of clusters in the data:
• Start with a sufficiently large number of clusters, and successively reduce this number by combining clusters that have similar properties.
• Cluster the data for different values of c and validate the correctness of the obtained clusters with validation measures.
To be able to perform the second approach, validation measures have to be designed. Different validation methods have been proposed in the literature; however, none of them is perfect by itself. Therefore, several indices are used in this research, which are described below:
• Partition Coefficient (PC): measures the amount of ”overlapping” between clusters. It is defined by Bezdek [5] as follows:

P C(c) = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^2,   (3.35)

where u_{ij} is the membership of data point j in cluster i. The main drawback of this validity measure is the lack of a direct connection to the data itself. The optimal number of clusters is indicated by the maximum value.
• Classification Entropy (CE): measures only the fuzziness of the clusters, and is a slight variation on the Partition Coefficient.

CE(c) = -\frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij} \log(u_{ij})   (3.36)
• Partition Index (PI): expresses the ratio of the sum of compactness and separation of the clusters. Each individual cluster is measured with the cluster validation method; this value is normalized by dividing it by the fuzzy cardinality of the cluster. The Partition Index is the sum of these values over all clusters.

P I(c) = \sum_{i=1}^{c} \frac{\sum_{j=1}^{N} (u_{ij})^m \|x_j - v_i\|^2}{N_i \sum_{k=1}^{c} \|v_k - v_i\|^2}   (3.37)

PI is mainly used for comparing different partitions with the same number of clusters. A lower value of PI means a better partitioning.
• Separation Index (SI): in contrast with the Partition Index (PI), the Separation Index uses a minimum-distance separation to validate the partitioning.

SI(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^2 \|x_j - v_i\|^2}{N \min_{i,k} \|v_k - v_i\|^2}   (3.38)
• Xie and Beni’s Index (XB): quantifies the ratio of the total variation within the clusters to the separation of the clusters [3].

XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^m \|x_j - v_i\|^2}{N \min_{i,j} \|x_j - v_i\|^2}   (3.39)

The lowest value of the XB index should indicate the optimal number of clusters.
• Dunn’s Index (DI): this index was originally designed for the identification of hard partitions. Therefore, for fuzzy clustering the partition first has to be converted to a hard one.

DI(c) = \min_{i \in c} \Big\{ \min_{j \in c, j \ne i} \Big\{ \frac{\min_{x \in C_i, y \in C_j} d(x, y)}{\max_{k \in c} \{ \max_{x, y \in C_k} d(x, y) \}} \Big\} \Big\}   (3.40)

The main disadvantage of Dunn’s Index is its very expensive computational complexity as c and N increase.
• Alternative Dunn Index (ADI): to simplify the calculation of Dunn’s Index, the Alternative Dunn Index was designed. The dissimilarity between two clusters, measured with \min_{x \in C_i, y \in C_j} d(x, y), is bounded from below using the triangle inequality:
Note that the Partition Coefficient and the Classification Entropy are only useful for fuzzy partitioning. In the case of fuzzy clusters, the values of Dunn’s Index and the Alternative Dunn Index are not fully reliable, because the results first have to be repartitioned with the hard partitioning method.
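As a small illustration of how such measures follow directly from the membership matrix, the sketch below computes the Partition Coefficient (3.35) and the Xie-Beni index (3.39) for a fuzzy partition U of shape c x N. The function names and the default m = 2 are assumptions of this example.

import numpy as np

def partition_coefficient(U):
    """Partition Coefficient, eq. (3.35); higher means less overlap."""
    return (U ** 2).sum() / U.shape[1]

def xie_beni(X, V, U, m=2.0):
    """Xie-Beni index, eq. (3.39), in the form used in this chapter."""
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # (c, N)
    return ((U ** m) * D2).sum() / (X.shape[0] * D2.min())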
3.4 Visualization
To understand the data and the results of the clustering methods, it is useful to visualize them. However, the data set used is high-dimensional and cannot be plotted and visualized directly. This section describes three methods that can map the data points into a lower-dimensional space.
In this research, the three mapping methods will be used for the visualization of the clustering results. The first method is Principal Component Analysis (PCA), a standard and widely used method for mapping high-dimensional data into a lower-dimensional space. Then, this report focuses on the Sammon mapping method. The advantage of the Sammon mapping is its ability to preserve inter-pattern distances. This kind of distance-preserving mapping is much more closely related to the purpose of clustering than preserving the variances (which is what PCA does). However, the Sammon mapping has two main drawbacks:
• Sammon mapping is a projection method based on the preservation of the Euclidean inter-point distance norm. This implies that the Sammon mapping can only be applied to clustering algorithms that use the Euclidean distance norm during the calculation of the clusters.
• The Sammon mapping method aims to place the N points of a high n-dimensional space in a lower q-dimensional subspace in such a way that the inter-point distances correspond to the distances measured in the n-dimensional space. To achieve this, a computationally expensive algorithm is needed, because every iteration step requires the computation of N(N − 1)/2 distances.
To avoid these problems of the Sammon mapping method, a modified algorithm, called the Fuzzy Sammon mapping, is used in this research. A drawback of the Fuzzy Sammon mapping is a loss of precision in the distances, since only the distances between the data points and the cluster centers are considered important.
The three visualization methods will be explained in more detail in the following subsections.
as much of the variability in the data as possible. The succeeding components describe the remaining variability. The main goals of the PCA method are:
• Identifying new meaningful underlying variables.
• Discovering and/or reducing the dimensionality of a data set.
Mathematically, the principal components are obtained by analyzing the eigenvectors and eigenvalues of the covariance matrix. The direction of the first principal component is given by the eigenvector with the largest eigenvalue. The eigenvector associated with the second largest eigenvalue corresponds to the second principal component, etc. In this research, the second objective is pursued. In this case, the covariance matrix of the data set can be described by:

F = \frac{1}{N} \sum_{k=1}^{N} (x_k - v)(x_k - v)^T,   (3.43)

where v = \bar{x}_k is the mean of the data. Principal Component Analysis is based on the projection of correlated high-dimensional data onto a hyperplane [3]. This method uses only the first q nonzero eigenvalues and the corresponding eigenvectors of the covariance matrix:

F_i = U_i \Lambda_i U_i^T,   (3.44)

where \Lambda_i is a matrix that contains the eigenvalues \lambda_{i,j} of F_i on its diagonal in decreasing order, and U_i is a matrix containing the corresponding eigenvectors in its columns. Furthermore, there is a q-dimensional reduced vector that represents the vector x_k of X, which can be defined as follows:

y_{i,k} = W_i^{-1}(x_k) = W_i^T(x_k).   (3.45)
The weight matrix W_i contains the q principal orthonormal axes in its columns:

W_i = U_{i,q} \Lambda_{i,q}^{1/2}.   (3.46)
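A short sketch of this projection: eigendecompose the covariance matrix and project the centered data onto the q leading eigenvectors. The \Lambda^{1/2} scaling of (3.46) is omitted here for simplicity; a library routine such as sklearn.decomposition.PCA yields an equivalent projection.

import numpy as np

def pca_project(X, q=2):
    Xc = X - X.mean(axis=0)                 # center the data
    F = np.cov(Xc, rowvar=False)            # covariance matrix, cf. eq. (3.43)
    eigval, eigvec = np.linalg.eigh(F)      # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:q]    # indices of the q largest eigenvalues
    W = eigvec[:, order]                    # q principal orthonormal axes
    return Xc @ W                           # q-dimensional representation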
where \lambda is a constant:

\lambda = \sum_{i<j} d_{ij} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} d_{ij}.   (3.48)
Note that there is no need to maintain \lambda, since a constant does not change the result of the optimization process. The minimization of the error E is an optimization problem in the N \cdot q variables y_{il}, with i \in \{1, 2, \ldots, N\} and l \in \{1, 2, \ldots, q\}, which implies that y_i = [y_{i1}, \ldots, y_{iq}]^T. The update of y_{il} at the t-th iteration is defined by:

y_{il}(t+1) = y_{il}(t) - \alpha \, \frac{\partial E(t) / \partial y_{il}(t)}{\partial^2 E(t) / \partial y_{il}^2(t)},   (3.49)

\frac{\partial^2 E(t)}{\partial y_{il}^2(t)} = -\frac{2}{\lambda} \sum_{k=1, k \ne i}^{N} \frac{1}{d_{ki} d^*_{ki}} \Big[ (d_{ki} - d^*_{ki}) - \frac{(y_{il} - y_{kl})^2}{d^*_{ki}} \Big( 1 + \frac{d_{ki} - d^*_{ki}}{d_{ki}} \Big) \Big]   (3.51)
With this gradient-descent method, it is possible to get stuck in a local minimum of the error surface while searching for the minimum of E. This is a disadvantage, because multiple experiments with different random initializations may be necessary to find the minimum. However, it is possible to estimate a good initialization based on information obtained from the data.
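The update rule (3.49) with the derivatives above can be condensed into a few lines. The sketch below takes the absolute value of the second derivative and uses a fixed step factor alpha, both common implementation choices assumed here for illustration rather than details fixed by the thesis.

import numpy as np

def pairwise_distances(M, eps=1e-9):
    return np.sqrt(((M[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)) + eps

def sammon(X, q=2, iters=100, alpha=0.3, seed=0):
    rng = np.random.default_rng(seed)
    D = pairwise_distances(X)                       # original distances d_ij
    lam = D[np.triu_indices_from(D, 1)].sum()       # constant lambda, eq. (3.48)
    Y = rng.normal(size=(len(X), q))                # random initial projection
    for _ in range(iters):
        Dy = pairwise_distances(Y)                  # projected distances d*_ij
        delta = D - Dy
        for l in range(q):
            diff = Y[:, l, None] - Y[None, :, l]    # y_il - y_kl
            g1 = (-2.0 / lam) * (delta / (D * Dy) * diff).sum(axis=1)
            g2 = (-2.0 / lam) * ((1.0 / (D * Dy)) *
                  (delta - diff ** 2 / Dy * (1.0 + delta / D))).sum(axis=1)
            Y[:, l] -= alpha * g1 / (np.abs(g2) + 1e-12)   # update, eq. (3.49)
    return Y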
with d(x_k, v_i) representing the distance between data point x_k and the cluster center v_i in the original n-dimensional space. The Euclidean distance between the cluster center z_i and the data point y_k in the projected q-dimensional space is represented by d^*(y_k, z_i). With this formulation, in the projected two-dimensional space every cluster is represented by a single point, independently of the shape of the original cluster. The Fuzzy Sammon mapping algorithm is similar to the original Sammon mapping, but in this case the projected cluster
centers will be recalculated in every iteration after the adaptation of the projected data points. The recalculation is based on the weighted mean formula of the fuzzy clustering algorithms described in Section 3.2.3 (equation 3.19).
The membership values of the projected data can be plotted based on the standard equation for the calculation of the membership values:

\mu^*_{ki} = \frac{1}{\sum_{j=1}^{c} \left( d^*(x_k, v_i) / d^*(x_k, v_j) \right)^{2/(m-1)}}.   (3.53)
In the next chapter, the cluster algorithms will be tested and evaluated. The
PCA and the (Fuzzy) Sammon mapping methods will be used to visualize the
data and the clusters.
Chapter 4
In this chapter, the cluster algorithms will be tested and their performance will be measured with the validation methods proposed in the previous chapter. The best-performing cluster method will be used to determine the segments. The chapter ends with an evaluation of the segments.
clusters 1, and the classification entropy is always ’NaN’. This is caused by the fact that these two measures were designed for fuzzy partitioning methods, while in this case the hard partitioning algorithm K-means is used. In Figure 4.1, the values of the Partition Index, Separation Index and Xie and Beni’s Index are shown. Note again that no validation index is reliable by itself.
Figure 4.1: Values of Partition Index, Separation Index and the Xie Beni Index
Therefore, all the validation indexes are shown. The optimum could differ between validation methods. This means that the optimum can only be detected by comparing all the results. To find the optimal number of clusters, partitions with fewer clusters are considered better when the differences between the values of the validation measure are small. Figure 4.1 shows that for the PI and SI, the number of clusters can easily be set to 4. For the Xie and Beni index, this is much harder. The elbow can be found at c = 3, c = 6, c = 9 or c = 13, depending on the definition and parameters of an elbow. Figure 4.2 shows more informative plots. Dunn’s index and the Alternative Dunn index confirm that the optimal number of clusters for the K-means algorithm should be chosen as 4. The values of all the validation measures for the K-means algorithm are given in Table 4.1.
Figure 4.2: Values of Dunn’s Index and the Alternative Dunn Index
c 2 3 4 5 6 7 8
PC 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
CE NaN NaN NaN NaN NaN NaN NaN
PI 3.8318 1.9109 1.1571 1.0443 1.2907 0.9386 0.8828
SI 0.0005 0.0003 0.0002 0.0002 0.0002 0.0002 0.0002
XBI 5.4626 4.9519 5.0034 4.3353 3.9253 4.2214 3.9079
DI 0.0082 0.0041 0.0034 0.0065 0.0063 0.0072 0.0071
ADI 0.0018 0.0013 0.0002 0.0001 0.0001 0.0001 0.0000
c 9 10 11 12 13 14 15
PC 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
CE NaN NaN NaN NaN NaN NaN NaN
PI 0.8362 0.8261 0.8384 0.7783 0.7696 0.7557 0.7489
SI 0.0002 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001
XBI 3.7225 3.8620 3.8080 3.8758 3.4379 3.3998 3.5737
DI 0.0071 0.0052 0.0061 0.0070 0.0061 0.0061 0.0061
ADI 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Table 4.1: The values of all the validation measures with K-means clustering
It is also possible to determine the optimal number of clusters for fuzzy clustering algorithms with this method. To illustrate this, the results of the Gustafson-Kessel algorithm will be shown. In Figure 4.3 the results of the Partition Coefficient and the Classification Entropy are plotted. In contrast to the hard clustering methods, these validation measures can now be used for the fuzzy clustering. However, the main drawback of PC is its monotonic decrease with c, which makes it hard to detect the optimal number of clusters. The same problem holds for CE: it increases monotonically, caused by the lack of a direct connection to the data. The optimal number of clusters cannot be determined based on these two validation methods. Figure 4.4 gives more information about the optimal number of clusters. For the PI and the SI, a local minimum is reached at c = 6. Again, for the XB index it is difficult to find the optimal number of clusters. The points at c = 3, c = 6 and c = 11 can each be seen as an elbow. In Figure 4.5, the Dunn index also indicates that the optimal number of clusters should be c = 6. On the other hand, the Alternative Dunn index has an elbow at c = 3. However, since it is not known how reliable the results of the Alternative Dunn Index are, the optimal number of clusters for the Gustafson-Kessel algorithm is set to six. The results of the validation measures for the Gustafson-Kessel algorithm are given in Table 4.2. This process can be repeated for all other cluster algorithms. The results can be found in Appendix B. For the K-means, K-medoid and the Gath-Geva algorithms, the optimal number of clusters is chosen as c = 4. For the other algorithms, the optimal number of clusters is located at c = 6.
Figure 4.4: Values of Partition Index, Separation Index and the Xie Beni Index with Gustafson-Kessel clustering
Figure 4.5: Values of Dunn’s Index and Alternative Dunn Index with Gustafson-Kessel clustering
c 2 3 4 5 6 7 8
PC 0.6462 0.5085 0.3983 0.3209 0.3044 0.2741 0.2024
CE 0.5303 0.8218 1.0009 1.2489 1.4293 1.5512 1.7575
PI 0.9305 1.2057 1.5930 1.9205 0.8903 0.7797 0.8536
SI 0.0002 0.0003 0.0007 0.0004 0.0001 0.0001 0.0002
XBI 2.3550 1.6882 1.4183 1.1573 0.9203 0.9019 0.7233
DI 0.0092 0.0082 0.0083 0.0062 0.0029 0.0041 0.0046
ADI 0.0263 0.0063 0.0039 0.0018 0.0007 0.0001 0.0009
c 9 10 11 12 13 14 15
PC 0.2066 0.1611 0.1479 0.1702 0.1410 0.1149 0.1469
CE 1.8128 2.0012 2.0852 2.0853 2.2189 2.3500 2.3046
PI 0.9364 0.7293 0.7447 0.7813 0.7149 0.6620 0.7688
SI 0.0002 0.0001 0.0002 0.0002 0.0001 0.0001 0.0001
XBI 0.5978 0.5131 0.4684 0.5819 0.5603 0.5675 0.5547
DI 0.0039 0.0030 0.0028 0.0027 0.0017 0.0015 0.0006
ADI 0.0003 0.0002 0.0004 0.0002 0.0000 0.0001 0.0000
Table 4.2: The values of all the validation measures with Gustafson-Kessel
clustering
PC CE PI SI XBI DI ADI
K-means 1 NaN 1.1571 0.0002 5.0034 0.0034 0.0002
K-medoid 1 NaN 0.2366 0.0001 Inf 0.0084 0.0002
FCM 0.2800 1.3863 0.0002 42.2737 1.0867 0.0102 0.0063
GK 0.3983 1.0009 1.5930 0.0007 1.4183 0.0083 0.0039
GG 0.4982 1.5034 0.0001 0.0001 1.0644 0.0029 0.0030
Table 4.3: The numerical values of validation measures for c = 4
Tables 4.3 and 4.4 show that the PC and CE are useless for the hard clustering methods
K-means and K-medoid. Based on the values of the three most used indexes, the Separation index, Xie and Beni’s index and Dunn’s index, one can conclude that for c = 4 the Gath-Geva algorithm gives the best results and for c = 6 the Gustafson-Kessel algorithm. To visualize the clustering results, the visualization methods described in Section 3.4 can be used. With these visualization methods, the dataset can be reduced to a 2-dimensional space. To avoid visibility problems (plotting too many values would cause one
PC CE PI SI XBI DI ADI
K-means 1 NaN 1.2907 0.0002 3.9253 0.0063 0.0001
K-medoid 1 NaN 0.1238 0.0001 Inf 0.0070 0.0008
FCM 0.1667 1.7918 0.0001 19.4613 0.9245 0.0102 0.0008
GK 0.3044 1.4293 0.8903 0.0001 0.9203 0.0029 0.0007
GG 0.3773 1.6490 0.1043 0.0008 1.0457 0.0099 0.0009
Table 4.4: The numerical values of validation measures for c = 6
big cloud of data points), only 500 values (representing 500 customers) from this 2-dimensional dataset will be randomly picked. For the K-means and the K-medoid algorithms, Sammon’s mapping gives the best visualization of the results. For the other cluster algorithms, the Fuzzy Sammon mapping gives the best projection with respect to the partitions of the data set. These visualization methods are used for the following plots. Figures 4.x-4.x show the different clustering results for c = 4 and c = 6 on the data set.
Figures 4.6 and 4.7 show that the hard clustering methods can find a solution for the clustering problem. None of the clusters contains substantially more or fewer customers than the other clusters. The plot of the Fuzzy C-means algorithm, in Figure 4.8, shows unexpected results. For the situation with 4 clusters, only 2 clusters are clearly visible. A closer look at the plot shows that there are actually 4 cluster centers, but they are located almost at the same position. In the situation with 6 clusters, one can see three big clusters, with one small cluster inside one of the big clusters. The other two cluster centers are nearly invisible. This implies that the Fuzzy C-means algorithm is not able to find good clusters for this data set. In Figure 4.9, the results of the Gustafson-Kessel algorithm are plotted. For both situations, the clusters are well separated. Note that the cluster in the bottom left corner and the cluster in the
Figure 4.7: Result of K-medoid algorithm
top right corner in Figure 4.9 are also maintained in the situation with 6 clusters. This may indicate that the data points in these clusters represent customers that differ in multiple respects from the other customers of Vodafone. The results
Figure 4.11: Distribution of distances from cluster centers within clusters for
the Gath-Geva algorithm with c = 4
Figure 4.12: Distribution of distances from cluster centers within clusters for
the Gustafson-Kessel algorithm with c = 6
The profiles of the different clusters do not differ much in shape. However, in each cluster at least one value differs sufficiently from the values of the other clusters. This confirms the assumption that customers of different clusters indeed have different usage behavior. Most of the lines in one profile are drawn closely together. This means that the customers in one profile have similar feature values.
Figure 4.14: Cluster profiles for c = 6
More relevant plots are shown in Figures 4.15 and 4.16. The mean of all the lines (equivalent to the cluster center) was calculated and a line through all the (normalized) feature values was drawn. The differences between the clusters are visible in some feature values. For instance, in the situation with four clusters, Cluster 1 contains customers that, compared with the other clusters, have a high value for feature 8. Cluster 2 has high values at positions 6 and 9, while Cluster 3 contains peaks at features 2 and 12. The 4th and final cluster has high values at features 8 and 9.
Figure 4.16: Cluster profiles of centers for c = 6
With the previous clustering results, validation measures and plots, it is not possible to decide which of the two clustering methods gives the better result. Therefore, both results will be used as a solution for the customer segmentation. For the Gath-Geva algorithm with c = 4 and the Gustafson-Kessel algorithm with c = 6, Table 4.5 shows the result of the customer segmentation. The feature
Feature 1 2 3 4 5 6
Average 119.5 1.7 3.9 65.8 87.0 75.7
Segment 1 (27.2%) 91.3 0.9 2.9 54.8 86.6 58.2
c=4 Segment 2 (28.7%) 120.1 1.8 3.6 73.6 87.1 93.7
Segment 3 (23.9%) 132.8 2.4 4.4 60.1 86.7 72.1
Segment 4 (20.2%) 133.8 1.7 4.7 74.7 87.6 78.8
Segment 1 (18.1%) 94.7 1.2 2.8 66.3 88.0 72.6
Segment 2 (14.4%) 121.8 1.7 4.1 65.9 86.4 73.0
c=6 Segment 3 (18.3%) 121.6 2.5 4.9 66.0 84.3 71.5
Segment 4 (17.6%) 126.8 1.6 4.0 65.7 87.3 71.2
Segment 5 (14.8%) 96.8 1.1 3.5 65.2 88.6 92.9
Segment 6 (16.8%) 155.3 2.1 4.1 65.7 87.4 73.0
Feature 7 8 9 10 11 12
Average 1.6 3.7 2.2 14.4 6.9 25.1
Segment 1 (27.2%) 1.7 4.0 1.6 12.3 6.2 12.2
c=4 Segment 2 (28.7%) 1.2 3.1 2.1 12.8 6.6 30.6
Segment 3 (23.9%) 1.4 3.4 2.1 22.4 9.4 39.7
Segment 4 (20.2%) 2.1 4.3 3.0 10.1 5.4 17.9
Segment 1 (18.1%) 2.3 4.5 1.8 11.3 6.1 13.5
Segment 2 (14.4%) 1.6 3.7 1.9 17.8 9.5 40.4
c=6 Segment 3 (18.3%) 1.0 2.9 2.9 15.1 6.6 26.9
Segment 4 (17.6%) 1.5 3.6 1.9 15.0 6.2 24.0
Segment 5 (14.8%) 0.8 2.9 1.8 12.4 6.1 23.1
Segment 6 (16.8%) 2.4 4.6 2.9 14.8 6.9 22.7
Table 4.5: Segmentation results
usage. They often call mobile phones during daytime. They do not send and receive many sms messages.
• Segment 3: The customers in this segment make relatively many voice calls. These customers call many different numbers and have a lot of contacts who are Vodafone customers.
• Segment 4: These customers originate many voice calls. They also send and receive many sms messages. They often call during daytime and call more than average to international numbers. Their call duration is high. Remarkably, they do not have as many contacts as the number of calls would suggest; they have a relatively small number of contacts.
For the situation with 6 segments, the customers in these segments can be described as follows:
• Segment 1: This segment contains customers with a relatively low number of voice calls. Their average call duration is also lower than average. However, their sms usage is relatively high. These customers do not call many different numbers.
• Segment 2: This segment contains customers with a relatively high number of contacts. They also call to many different areas. They also have more contacts with a Vodafone mobile.
• Segment 3: The customers in this segment make relatively many voice calls. Their sms usage is low. In proportion, they make more international phone calls than other customers.
• Segment 4: These customers are the average customers. None of the feature values is particularly high or low.
• Segment 5: These customers do not receive many voice calls. The average call duration is low. They also receive and originate a low number of sms messages.
• Segment 6: These customers originate and receive many voice calls. They also send and receive many sms messages. The duration of their voice calls is longer than average. The percentage of international calls is high.
In the next chapter, the classification method Support Vector Machine will be explained. This technique will be used to classify/estimate the segment of a customer from personal information such as age, gender and lifestyle (the customer data of Section 2.1.3).
Chapter 5
Figure 5.1: (a) Two-dimensional representation of the customers; (b) A separating hyperplane
1 or the segment 2 side of the separating line. Now, to define the notion of a separating hyperplane, consider the situation where there are not just two feature values to describe the customer. For example, if there were just one feature value to describe the customer, then the space in which the corresponding one-dimensional feature resides is a one-dimensional line. This line can be divided in half by a single point (see Figure 5.2a). In two dimensions, a straight line divides the space in half (remember Figure 5.1b). In a three-dimensional space, a plane is needed to divide the space, as illustrated in Figure 5.2b. This procedure can be extrapolated mathematically to higher dimensions. The general term for a straight line in a high-dimensional space is a hyperplane. So the term separating hyperplane is, essentially, the hyperplane that separates the segments.
5.2 The maximum-margin hyperplane
The concept of treating objects as points in a high-dimensional space and finding a line that separates them is a common approach to classification, and therefore not unique to the SVM. However, the SVM differs from all other classification methods by virtue of how the hyperplane is selected. Consider again the classification problem of Figure 5.1a. The goal of the SVM is to find a line that separates the segment 1 customers from the segment 2 customers. However, there are an infinite number of possible lines, as portrayed in Figure 5.2. The question is which line should be chosen as the optimal classifier and how the optimal line should be defined. A logical way of selecting the optimal line is selecting the line that is, roughly speaking, ’in the middle’: in other words, the line that separates the two segments and keeps the maximal distance from any of the given customers (see Figure 5.2). It is not surprising that a theorem of statistical learning theory supports this choice [6]. Defining the distance from the hyperplane to the nearest customer (in general, an expression vector) as the margin of the hyperplane, the SVM selects the hyperplane with the maximum margin. By selecting this hyperplane, the SVM is able to predict the unknown segment of the customer in Figure 5.1a. The vectors (points) that constrain the width of the margin are the support vectors. This theorem is, in many ways, the key to the success of Support Vector Machines. However, there are some remarks and caveats to deal with. First of all, the theorem is based on the assumption that the data on which the SVM is trained are drawn from the same distribution as the data it has to classify. This is of course logical, since it is not reasonable to expect that a Support Vector Machine trained on customer data is able to classify different car types. More relevantly, it is not reasonable to expect that the SVM can classify well if the training data set is prepared with a different protocol than the test data set. On the other hand, the theorem only requires that the two data sets are drawn from the same distribution. For example, an SVM
does not assume that the data is drawn from a normal distribution.
(a) Data set containing one error (b) Separating with soft margin
is a one-dimensional data set, as seen before in Figure 5.1. In that case, the
separating hyperplane was a single point. Now, consider the situation of Figure
5.4, which illustrates an non separable data set. No single point can separate
the two segments and introducing a soft margin would not help. A kernel func-
tion provides a solution to this problem. The kernel function adds an extra
dimension to the data, in this case by squaring the one dimensional data set.
The result is plotted in Figure 5.4. Within the new, higher-dimensional space,
as shown in the figure, the SVM can separate the data into two segments with one
straight line. In general, the kernel function can be seen as a mathematical
trick that lets the SVM project data from a low-dimensional space to a space of
higher dimension. If one chooses a good kernel function, the data will become
separable in the corresponding higher-dimensional space.
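The squaring trick described above can be made concrete with a small sketch. The one-dimensional data and the explicit feature map x → (x, x²) below are illustrative assumptions, not the actual Vodafone data; they only show how adding a squared dimension makes a non-separable set separable by a straight line.

    import numpy as np
    from sklearn.svm import SVC

    # One-dimensional, non-separable toy data: segment 1 lies between
    # the two groups of segment 2, so no single point can separate them.
    x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
    y = np.array([2, 2, 1, 1, 1, 2, 2])

    # Explicit "kernel" mapping: add an extra dimension by squaring the data.
    X_mapped = np.column_stack([x, x ** 2])

    # In the two-dimensional space (x, x^2) a straight line separates the segments.
    clf = SVC(kernel="linear", C=1e6)
    clf.fit(X_mapped, y)
    print("training accuracy:", clf.score(X_mapped, y))  # 1.0: perfectly separable

    # The same effect is obtained implicitly with a polynomial kernel of degree 2
    # applied directly to the one-dimensional data.
    clf_poly = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6)
    clf_poly.fit(x.reshape(-1, 1), y)
    print("training accuracy (poly kernel):", clf_poly.score(x.reshape(-1, 1), y))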
However, projecting the data into a space of very high dimension results in
boundaries that are too specific to the examples in the data set. This phenomenon
is called overfitting: the SVM will not function well on new, unseen, unlabeled data.
[Figure: (a) linearly separable in four dimensions; (b) an SVM that has overfit the data]
There exists another large practical difficulty when applying the SVM to new,
unseen data. This problem concerns the question of how to choose a kernel function
that separates the data, but without introducing too many irrelevant dimensions.
Unfortunately, the answer to this question is, in most cases, trial and error. In this
research, the SVM will be tested with a variety of 'standard' kernel functions. By
using the cross-validation method, the optimal kernel will be selected in a
statistically sound way.
However, this is a time-consuming process, and it is not guaranteed that the
best kernel function found during cross-validation is actually the best
kernel function that exists. It is even likely that there exists a kernel function
that was not tested and performs better than the selected one. In practice,
however, the method described above generally gives sufficient results. In general,
the kernel function is defined by

    K(x_i, x_j) = Φ(x_i)^T Φ(x_j),

where the x_i are the training vectors. The vectors are mapped into a higher-
dimensional space by the function Φ. Many kernel mapping functions can be used,
probably an infinite number, but a few kernel functions have been found to work
well for a wide variety of applications [16]. The default and recommended
kernel functions were used during this research and will be discussed now.
• Linear: this kernel function is defined by K(x_i, x_j) = x_i^T x_j.
• Polynomial: of the form K(x_i, x_j) = (γ x_i^T x_j + r)^d, where d denotes the degree and γ the width.
• Radial basis function: also known as the Gaussian kernel, it is of the form K(x_i, x_j) = exp(−γ ||x_i − x_j||²), with γ > 0.
• Sigmoid: of the form K(x_i, x_j) = tanh(γ x_i^T x_j + r).
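The kernel functions listed above correspond, for example, to the standard kernel options of an off-the-shelf SVM implementation. The sketch below assumes scikit-learn (which wraps LIBSVM) and illustrative placeholder data; it is not the software actually used in this research, but it shows how the same four kernels and their parameters (C, d, γ) appear in practice.

    from sklearn.svm import SVC

    # Placeholder training data: feature vectors x_i and segment labels y_i.
    # In the thesis these would be the customer profiles and their segments.
    X_train = [[25, 1, 0], [34, 0, 1], [52, 1, 1], [19, 0, 0]]
    y_train = [1, 2, 1, 2]

    # The four 'standard' kernel functions and the parameters they expose.
    models = {
        "linear":  SVC(kernel="linear", C=10),
        "poly":    SVC(kernel="poly", degree=3, gamma=0.4, coef0=1.0, C=10),
        "rbf":     SVC(kernel="rbf", gamma=1.2, C=10),
        "sigmoid": SVC(kernel="sigmoid", gamma=0.6, coef0=0.0, C=10),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, "training accuracy:", model.score(X_train, y_train))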
Chapter 6
The data set is divided into three parts: the training set, the test set and the validation set.
The training set will be used to train the SVM. The test set will be used to
estimate the error during the training of the SVM. With the validation set,
the actual performance of the SVM will be measured after the SVM is trained.
The training of the SVM will be stopped when the test error reaches a local
minimum, see Figure 6.2. With K-fold cross-validation, a K-fold partition of the
data set is created. For each of K experiments, K-1 folds will be used for training
and the remaining one for testing. Figure 6.3 illustrates this process. In this
research, K is set to 10. The advantage of K-fold cross validation is that all the
examples in the dataset are eventually used for both training and testing. The
error is calculated by taking the average of all K experiments.
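A minimal sketch of the 10-fold procedure described above, again assuming Python/scikit-learn and placeholder data rather than the actual Vodafone data set: each customer is used nine times for training and once for testing, and the reported error is the average over the ten folds.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    # Placeholder data: 100 customers with 6 profile features and a segment label.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))
    y = rng.integers(1, 5, size=100)  # segments 1..4

    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X):
        clf = SVC(kernel="rbf", gamma=1.0, C=10)
        clf.fit(X[train_idx], y[train_idx])                 # train on K-1 folds
        scores.append(clf.score(X[test_idx], y[test_idx]))  # test on the remaining fold

    # The cross-validation error is the average over all K experiments.
    print("mean accuracy over 10 folds:", np.mean(scores))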
Percentage of correct classifications for different values of the soft margin C (6 segments):
C        1       2       5       10      20      50      100     200     500
         28.9%   29.4%   30.9%   31.3%   31.4%   32.0%   27.6%   27.6%   21.8%
With 4 segments, the optimal value for the soft margin is C = 10, and with 6 segments it is C = 50.
The corresponding percentages of correct classifications are 43.2% and 32.0%, respectively.
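For reference, the role of the parameter C can be made explicit with the standard soft-margin formulation (standard textbook notation, not quoted from this thesis): slack variables $\xi_i$ allow some customers to violate the margin, and C determines how heavily such violations are penalized:

    \min_{w,b,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.

A small C tolerates more misclassified training customers in exchange for a wider margin, while a very large C approaches the hard-margin classifier.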
For the polynomial kernel function, there are two parameters: the degree, denoted by d,
and the width γ. Therefore, the optimal value for the soft margin is determined first.
This is done by multiple test runs with random values for d and γ; the average result
for each soft margin C can be found in Tables 6.3 and 6.4. These C-values are then
used to find out which d and γ give the best results. The results are shown in
Tables 6.5 and 6.6.
Table 6.5: Correct classifications (%) of the polynomial kernel with 4 segments.
d          1       2       3       4       5       6       7
γ = 0.4    76.1%   76.3%   78.1%   73.2%   74.8%   76.0%   75.0%
γ = 0.6    76.0%   76.3%   77.6%   74.1%   74.5%   75.4%   75.8%
γ = 0.8    75.8%   76.3%   77.2%   74.0%   74.4%   77.1%   75.2%
γ = 1.0    76.2%   76.4%   78.0%   75.0%   75.2%   75.6%   75.8%
γ = 1.2    76.0%   76.2%   78.1%   74.6%   75.1%   76.0%   75.8%
γ = 1.4    75.2%   76.2%   78.1%   74.9%   75.5%   76.3%   74.9%
Table 6.6: Correct classifications (%) of the polynomial kernel with 6 segments.
d          1       2       3       4       5       6       7
γ = 0.4    75.0%   74.6%   75.9%   76.0%   75.8%   74.3%   73.9%
γ = 0.6    74.2%   75.1%   74.9%   76.2%   75.0%   74.5%   74.0%
γ = 0.8    73.8%   74.7%   74.3%   76.2%   75.9%   74.8%   73.1%
γ = 1.0    74.1%   75.0%   73.6%   76.1%   75.3%   74.2%   72.8%
γ = 1.2    72.1%   74.1%   75.5%   75.4%   75.4%   74.1%   73.0%
γ = 1.4    73.6%   74.3%   72.2%   76.0%   74.4%   74.3%   72.9%
For the situation with 4 segments, the optimal score is 78.1%, and for 6 segments it is 76.2%.
The next kernel function, the radial basis function, has only one parameter, namely γ. The
results of the radial basis function are given in Table 6.7 and Table 6.8. The
best result with 4 segments is 80.3%; with 6 segments, the best score is 78.5%.
The sigmoid function also has only one parameter. The results are given in Tables
6.9 and 6.10, which show that at best 66.1% (4 segments) and 44.6% (6 segments) of the data are classified correctly.
Correct classifications (%) of the sigmoid kernel for different values of γ and the soft margin C (6 segments):
C          1      2      5      10     20     50     100    200    500
γ = 0.4    33.8   34.0   34.7   33.1   34.6   30.0   32.6   28.8   28.8
γ = 0.6    29.6   27.4   28.5   29.7   21.4   20.8   20.0   18.8   18.1
γ = 0.8    39.1   36.4   33.6   35.7   38.9   32.0   26.4   24.6   22.9
γ = 1.0    40.0   42.5   39.8   40.7   39.9   39.8   30.4   31.1   28.0
γ = 1.2    41.9   40.6   43.6   43.2   44.1   43.2   44.6   40.6   41.7
γ = 1.4    38.6   34.5   32.1   30.6   30.2   27.5   24.3   26.3   27.9
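The parameter search described in this chapter (first fixing the soft margin C, then scanning d and γ) can be approximated with a straightforward cross-validated grid search. The sketch below is again an illustration with scikit-learn and placeholder data, not the procedure or software actually used; it only shows how results such as the tables above could be produced programmatically.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Placeholder data standing in for the customer profiles and their segments.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 6))
    y = rng.integers(1, 5, size=200)

    # Grid over the soft margin C, the degree d and the width gamma,
    # mirroring the parameter values reported in this chapter.
    param_grid = {
        "C": [1, 2, 5, 10, 20, 50, 100, 200, 500],
        "degree": [1, 2, 3, 4, 5, 6, 7],
        "gamma": [0.4, 0.6, 0.8, 1.0, 1.2, 1.4],
    }

    search = GridSearchCV(SVC(kernel="poly", coef0=1.0), param_grid, cv=10)
    search.fit(X, y)
    print("best parameters:", search.best_params_)
    print("best cross-validated accuracy:", search.best_score_)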
6.3 Feature Validation
In this section, the features are validated by measuring the importance of each feature.
This is done by leaving one feature out of the feature vector and training the SVM
without that feature. The results for both situations are shown in Figures 6.4 and 6.5.
Figure 6.4: Results while leaving out one of the features with 4 segments
Figure 6.5: Results while leaving out one of the features with 6 segments
The results show that age is an important feature for classifying the right segment.
This is in contrast with the type of telephone, which increases the result by only a
few tenths of a percent. Nevertheless, each feature increases the result, and therefore
each feature is useful for the classification.
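The leave-one-feature-out validation can be expressed compactly as in the sketch below, which again assumes scikit-learn, a 10-fold evaluation and placeholder data rather than the real customer records: the drop in accuracy when a feature is removed is taken as an indication of that feature's importance.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    feature_names = ["age", "gender", "telephone type",
                     "subscription type", "company size", "residential area"]

    # Placeholder data standing in for the customer profiles and their segments.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, len(feature_names)))
    y = rng.integers(1, 5, size=200)

    clf = SVC(kernel="rbf", gamma=1.0, C=10)
    baseline = cross_val_score(clf, X, y, cv=10).mean()
    print("all features:", baseline)

    for i, name in enumerate(feature_names):
        X_reduced = np.delete(X, i, axis=1)  # leave feature i out
        score = cross_val_score(clf, X_reduced, y, cv=10).mean()
        print(f"without {name}: {score:.3f} (change {score - baseline:+.3f})")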
Chapter 7
This chapter concludes the research and the corresponding results, and gives
some recommendations for future work.
7.1 Conclusions
The first objective of our research was to perform automatic customer segmen-
tation based on usage behavior, without the direct intervention of a human
specialist. The second part of the research was focused on profiling customers
and finding a relation between the profile and the segments. The customer
segments were constructed by applying several clustering algorithms. The clus-
tering algorithms used selected and preprocessed data from the Vodafone data
warehouse. This led to solutions for the customer segmentation with respec-
tively four segments and six segments. The customer’s profile was based on
personal information of the customers. A novel data mining technique, called
Support Vector Machines, was used to estimate the segment of a customer based
on his profile.
There are various ways of selecting suitable feature values for the clustering
algorithms. This selection is vital for the resulting quality of the clustering:
a different choice of feature values will result in different segments. The result of the
clustering can therefore not be regarded as universally valid, but merely as one
possible outcome. In this research, the feature values were selected in such a
way that they describe the customer's behavior as completely as possible.
However, it is not possible to include all possible combinations of usage behav-
ior characteristics within the scope of this research. To find the optimal number
of clusters, the so-called elbow criterion was applied. Unfortunately, the elbow
could not always be unambiguously identified. Another problem was that
the location of the elbow could differ between the validation measures for the
same algorithm. For some algorithms, the elbow was located at c = 4 and for
other algorithms, the location was c = 6. To identify the best algorithm, several
widely established validation measures were used. Not every validation measure
marked the same algorithm as the best one, and it was not possible to determine
a single algorithm that was optimal for both c = 4 and c = 6. For the situation
with four clusters, the Gath-Geva algorithm appears to be the best algorithm,
while the Gustafson-Kessel algorithm gives the best results for six clusters.
To determine which customer segmentation algorithm is best
suited for a particular data set and a specific parameter setting, the clustering
results were interpreted in a profiling format. The results show that in both
situations the clusters were well separated and clearly distinguished from each
other. It is hard to compare the two clustering results, because of the different
number of clusters. Therefore, both clustering results were used as a starting
point for the segmentation algorithm. The corresponding segments differ on
features such as the number of voice calls, SMS usage, call duration, international calls,
the number of different numbers called, and the percentage of weekday and daytime calls. A short
characterization of each cluster was made.
A Support Vector Machine algorithm was used to classify the segment of a
customer, based on the customer’s profile. The profile consists of the age, gen-
der, telephone type, subscription type, company size, and residential area of
the customer. As a comparison, four different kernel functions with different
parameters were tested on their performance. It was found that the radial basis
function gives the best result with a classification of 80.3% for the situation
with four segments and 78.5% for the situation with six segments. It appeared
that the resulting percentage of correctly classified segments was not as high
as expected. A possible explanation could be that the features of the customer
are not adequate for making a customer’s profile. This is caused by the fre-
quently missing data in the Vodafone data warehouse about lifestyle, habits
and income of the customers. A second reason for the low number of correct
classifications is the fact that the usage behavior in the database corresponds
to a telephone number and this telephone number corresponds to a person. In
real life, however, this telephone may not be used exclusively by the person
(and the corresponding customer’s profile) as stored in the database. Customers
may lend their telephone to relatives, and companies may exchange telephones
among their employees. In such cases, the usage behavior does not correspond
to a single customer’s profile and this impairs the classification process.
The last part of the research involves the relative importance of each individ-
ual feature of the customer’s profile. By leaving out one feature value during
classification, the effect of each feature value became visible. It was found that
without the concept of ’customer age’, the resulting quality of the classifica-
tion was significantly decreased. On the other hand, leaving out a feature such
as the ’telephone type’ barely decreased the classification result. Nevertheless, this
and some other features did still increase the performance of the classification. This
implies that these features bear some importance for customer profiling and for
the classification of the customer’s segment.
7.2 Recommendations for future work
Based on our research and experiments, it is possible to formulate some recom-
mendations for obtaining more suitable customer profiling and segmentation.
The first recommendation is to use different feature values for the customer
segmentation. This can lead to different clusters and thus different segments.
To know the influence of the feature values on the outcome of the clustering, a
complete data analysis study is required. Also, a detailed analysis of the
meaning of the clusters is recommended. In this research, the results are given
by a short description of each segment. Extrapolating this approach, a more
detailed view of the clusters and their boundaries can be obtained. Another
way to validate the resulting clusters is to offer them to a human expert, and
use his feedback for improving the clustering criteria.
To improve the determination of the actual number of clusters present in the data
set, more specialized methods than the elbow criterion could
be applied. An interesting alternative is, for instance, the application of evolu-
tionary algorithms, as proposed by Wei Lu [21]. Another way of improving this
research is to extend the set of clustering algorithms with, for example, mean shift
clustering, hierarchical clustering or a mixture of Gaussians. To estimate the segment
of the customer, other classification methods can also be used, for instance
neural networks, genetic algorithms or Bayesian algorithms. Of specific interest,
within the framework of Support Vector Machines, is the application of
miscellaneous (non-linear) kernel functions.
Furthermore, it should be noted that the most obvious and best way to improve
the classification is to come to a more accurate and precise definition of the
customer profiles. The customer profile used in this research is not sufficiently
detailed to describe the wide spectrum of customers. One reason for this
is the missing data in the Vodafone data warehouse. Consequently, an enhanced
and more precise analysis of the data warehouse will lead to improved features
and, thus, to an improved classification.
Finally, we note that the study would improve noticeably by involving multiple
criteria to evaluate the user behavior, rather than mere phone usage as em-
ployed here. Similarly, it is challenging to classify the profile of the customer
based on the corresponding segment alone. However, this is a complex undertaking
and it essentially requires the availability of high-quality features.
Bibliography
[1] Ahola, J. and Rinta-Runsala E., Data mining case studies in customer profiling.
Research report TTE1-2001-29, VTT Information Technology (2001).
[2] Amat, J.L., Using reporting and data mining techniques to improve knowledge of
subscribers; applications to customer profiling and fraud management. J. Telecom-
mun. Inform. Technol., no. 3 (2002), pp. 11-16.
[3] Balasko, B., Abonyi, J. and Balazs, F., Fuzzy Clustering and Data Analysis Tool-
box For Use with Matlab. (2006).
[4] Bounsaythip, C. and Rinta-Runsala, E., Overview of Data Mining for Customer
Behavior Modeling. Research report TTE1-2001-18, VTT Information Technol-
ogy (2001).
[5] Bezdek, J.C. and Dunn, J.C., Optimal fuzzy partition: A heuristic for estimating
the parameters in a mixture of normal distributions. IEEE Trans. Comput., vol.
C-24 (1975), pp. 835-838.
[6] Dibike, Y.B., Velickov, S., Solomatine D. and Abbott, M.B., Model Induction
with Support Vector Machines: Introduction and Applications. J. Comp. in Civ.
Engrg., vol. 15 iss. 3 (2001), pp. 208-216.
[7] Feldman, R. and Dagan, I., Knowledge discovery in textual databases (KDT). In
Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, (1995), pp. 112-117.
[9] Gath, I. and Geva, A.B., Unsupervised optimal fuzzy clustering. IEEE Trans
Pattern and Machine Intell, vol. 11 no. 7 (1989), pp. 773-781.
[10] Giha, F.E., Singh, Y.P. and Ewe, H.T., Customer Profiling and Segmentation
based on Association Rule Mining Technique. Proc. Softw. Engin. and Appl., no.
397 (2003).
[11] Gustafson, D.E. and Kessel, W.E., Fuzzy clustering with a fuzzy covariance ma-
trix. In Proc. IEEE CDC, (1979), pp. 761-766.
[12] Janusz, G., Data mining and complex telecommunications problems modeling. J.
Telecommun. Inform. Technol., no. 3 (2003), pp. 115-120.
[13] Mali, K., Clustering and its validation in a symbolic framework. Patt. Recogn.
Lett., vol. 24 (2003), pp. 2367-2376.
[14] Mattison, R., Data Warehousing and Data Mining for Telecommunications.
Boston, London: Artech House, (1997).
[15] McDonald, M. and Dunbar, I., Market segmentation. How to do it, how to profit
from it. Palgrave Publ., (1998).
[16] Noble, W.S., What is a support vector machine? Nature Biotechnology, vol. 24
no. 12 (2006), pp. 1565-1567.
[17] Shaw, M.J., Subramaniam, C., Tan, G.W. and Welge, M.E., Knowledge manage-
ment and data mining for marketing. Decision Support Systems, vol. 31 (2001),
pp. 127-137.
[18] Verhoef, P., Spring, P., Hoekstra, J. and Lee, P., The commercial use of segmenta-
tion and predictive modeling techniques for database marketing in the Netherlands.
Decis. Supp. Syst., vol. 34 (2002), pp. 471-481.
[19] Virvou, M., Savvopoulos, A. Tsihrintzis, G.A. and Sotiropoulos, D.N., Construct-
ing Stereotypes for an Adaptive e-Shop Using AIN-Based Clustering. ICANNGA
(2007), pp. 837-845.
[20] Wei, C.P. and Chiu, I.T., Turning telecommunications call detail to churn pre-
diction: a data mining approach. Expert Syst. Appl., vol. 23 (2002), pp. 103-112.
[21] Wei Lu, I.T., A New Evolutionary Algorithm for Determining the Optimal Num-
ber of Clusters. CIMCA/IAWTIC (2005), pp. 648-653.
[22] Weiss, G.M., Data Mining in Telecommunications. The Data Mining and Knowl-
edge Discovery Handbook (2005), pp. 1189-1201.
Appendix A
In this Appendix, a simplified model of the data warehouse can be found. The
white rectangles correspond to the tables that were used for this research. The
most important data fields of these tables are listed inside each table. The colored
boxes group the tables into categories. To connect the tables with each other,
relation tables (the red tables in the middle) are needed.
Figure A.1: Model of the Vodafone data warehouse
Appendix B
In this Appendix, the plots of the validation measures for the algorithms that
were not discussed in Section 4.1 are given.
Figure B.2: Dunn’s index and Alternative Dunn’s index of K-medoid
Figure B.4: Partition index, Separation index and Xie Beni index of Fuzzy
C-means
Figure B.5: Dunn’s index and Alternative Dunn’s index of Fuzzy C-means