MapReduce-based Fuzzy C-Means Clustering Algorithm
Simone A. Ludwig
Department of Computer Science, North Dakota State University, Fargo, ND, USA
E-mail: [email protected]
Abstract The management and analysis of big data has been identified as
one of the most important emerging needs in recent years. This is because
of the sheer volume and increasing complexity of data being created or
collected. Current clustering algorithms cannot handle big data, and therefore,
scalable solutions are necessary. Since fuzzy clustering algorithms have been
shown to outperform hard clustering approaches in terms of accuracy, this
paper investigates the parallelization and scalability of a common and effective
fuzzy clustering algorithm, the Fuzzy C-Means (FCM) algorithm. The algorithm
is parallelized using the MapReduce paradigm, outlining how the Map and
Reduce primitives are implemented. A validity analysis is conducted in order
to show that the implementation works correctly, achieving competitive purity
results compared to state-of-the-art clustering algorithms. Furthermore, a
scalability analysis is conducted to demonstrate the performance of the parallel
FCM implementation with an increasing number of computing nodes.
Keywords MapReduce · Hadoop · Scalability
1 Introduction
Managing scientific data has been identified as one of the most important
emerging needs of the scientific community in recent years. This is because of
the sheer volume and increasing complexity of data being created or collected,
in particular, in the growing field of computational science where increases in
computer performance allow ever more realistic simulations and the potential
to automatically explore large parameter spaces. As noted by Bell et al. [1]:
“As simulations and experiments yield ever more data, a fourth paradigm
is emerging, consisting of the techniques and technologies needed to perform
data intensive science”. The question to address is how to effectively generate,
manage and analyze the data and the resulting information. The solution
requires a comprehensive, end-to-end approach that encompasses all stages
from the initial data acquisition to its final analysis.
Data mining is one of the most developed fields in the area of artificial
intelligence and encompasses a relatively broad field that deals with the
automatic discovery of knowledge from databases. Given the rapid growth of
data collected in various fields and its potential usefulness, efficient tools are
required to extract and utilize the potentially gathered knowledge [2].
One of the important data mining tasks is classification, which is an
effective method used in many different areas. The main idea behind
classification is the construction of a model (classifier) that assigns items in a
collection to target classes with the goal of accurately predicting the target
class for each item in the data [3]. There are many techniques that can be used
for classification, such as decision trees, Bayes networks, genetic algorithms,
genetic programming, particle swarm optimization, and many others [4].
Another important data mining technique that is used when analyzing data is
clustering [5]. The aim of clustering algorithms is to divide a set of unlabeled
data objects into different groups called clusters. The cluster membership mea-
sure is based on a similarity measure. In order to obtain high quality clusters,
the similarity measure between the data objects in the same cluster should be
maximized, and the similarity measure between the data objects from different
groups should be minimized.
Clustering is the classification of objects into different groups, i.e., the
partitioning of data into subsets (clusters), such that data in each subset shares
some common features, often proximity according to some defined distance
measure. Unlike conventional statistical methods, most clustering algorithms
do not rely on the statistical distribution of data, and thus can be usefully
applied in situations where little prior knowledge exists [6].
Most sequential classification/clustering algorithms suffer from the problem
that they do not scale with larger data set sizes, and most of them are
computationally expensive, both in terms of time and space. For these
reasons, the parallelization of data classification/clustering algorithms is
paramount in order to deal with large-scale data. To develop a good parallel
classification/clustering algorithm that takes big data into consideration, the
algorithm should be efficient, scalable, and able to obtain high-accuracy solutions.
In order to enable big data to be processed, the parallelization of data
mining algorithms is paramount. Parallelization is a process where the com-
putation is broken up into parallel tasks. The work done by each task, often
called its grain size, can be as small as a single iteration in a parallel loop or
as large as an entire procedure. When an application can be broken up into
large parallel tasks, the application is called a coarse grain parallel application.
Two common ways to partition computation are task partitioning, in which
each task executes a certain function, and data partitioning, in which all tasks
execute the same function but on different data.
This paper proposes the parallelization of the Fuzzy C-Means (FCM) clustering
algorithm. The parallelization methodology used is the divide-and-conquer
methodology referred to as MapReduce. The implementation details are
explained, outlining how the FCM algorithm can be parallelized. Furthermore,
a validity analysis is conducted in order to demonstrate the correct
functioning of the implementation by measuring the purity and comparing it
to state-of-the-art clustering algorithms. Moreover, a scalability analysis is
conducted to investigate the performance of the parallel FCM implementation
by measuring the speedup for an increasing number of computing nodes.
The remainder of this paper is organized as follows. Section 2 introduces clustering
and fuzzy clustering in particular. The following section (Section 3) discusses
related work in the area of big data processing. In Section 4, the implementation
is described in detail. The experimental setup and results are given in
Section 5, and Section 6 concludes this work, outlining the findings obtained.
2 Background to Clustering
Equation 2a denotes that the union of the subsets Ai contains all the data. The
subsets have to be disjoint as stated by Equation 2b, and none of them can be
empty nor contain all the data in Z as given by Equation 2c. In terms of the
membership (characteristic) function, a partition can be conveniently represented
by the partition matrix U = [µik ]c×N , whose ith row contains the values of the
characteristic function of the ith subset Ai of Z. It follows from
Equations 2 that the elements of U must satisfy the following conditions [6]:
$$\mu_{ik} \in \{0, 1\}, \quad 1 \le i \le c, \ 1 \le k \le N, \qquad (2d)$$
$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad 1 \le k \le N, \qquad (2e)$$
$$0 < \sum_{k=1}^{N} \mu_{ik} < N, \quad 1 \le i \le c. \qquad (2f)$$
Thus, the space of all possible hard partition matrices for Z, referred to as the
hard partitioning space, is defined by:
$$M_{HC} = \left\{ U \in \mathbb{R}^{c \times N} \;\middle|\; \mu_{ik} \in \{0, 1\},\ \forall i, k;\ \sum_{i=1}^{c} \mu_{ik} = 1,\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N,\ \forall i \right\} \qquad (2g)$$
The ith row of the fuzzy partition matrix U contains values of the ith mem-
bership function of the fuzzy subset Ai of Z. Equation 3b constrains the sum
of each column to 1, and thus the total membership of each zk in Z equals
one. The fuzzy partitioning space for Z is the set [6]:
$$M_{FC} = \left\{ U \in \mathbb{R}^{c \times N} \;\middle|\; \mu_{ik} \in [0, 1],\ \forall i, k;\ \sum_{i=1}^{c} \mu_{ik} = 1,\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N,\ \forall i \right\} \qquad (3d)$$
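As a small illustrative example (not taken from the text), consider N = 4 data points and c = 2 clusters; a valid hard partition and a valid fuzzy partition could, for instance, be

$$U_{hard} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \qquad U_{fuzzy} = \begin{bmatrix} 0.9 & 0.7 & 0.2 & 0.1 \\ 0.1 & 0.3 & 0.8 & 0.9 \end{bmatrix}.$$

In both matrices every column sums to one and every row sum lies strictly between 0 and N, so they satisfy the conditions defining $M_{HC}$ and $M_{FC}$, respectively.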
3 Related Work
In this section, first related work in the area of clustering is introduced, and
afterwards clustering techniques that make use of parallelization mechanisms
applied to big data are described. Given that this paper is concerned with the
implementation and evaluation of a parallelized FCM algorithm, the related
work outlines other research conducted in this area.
Fuzzy clustering can be categorized into three categories, namely hierarchical
fuzzy clustering methods, graph-theoretic fuzzy clustering methods, and
objective function-based fuzzy clustering methods [10].
Hierarchical clustering techniques generate a hierarchy of partitions by
means of agglomerative and divisive methods [10]; whereby the agglomerative
algorithms produce a sequence of clusters of decreasing number by merging
two clusters from the previous level. The divisive algorithms on the other hand
perform the clustering the other way around. In [11] a hierarchical clustering
algorithm in the area of business systems planning is proposed. The best clus-
ter count is determined by a matching approach. In another technique, called
fuzzy equivalent relation-based hierarchical clustering, the clustering is man-
aged without needing a predefined count of clusters [12].
Another category, referred to as graph-theoretic fuzzy clustering methods,
is based on the notion of connectivity of nodes of a graph representing the
data set. In particular, the graph representing the data structure is a fuzzy
graph.
The parallelization paradigm used in this paper is MapReduce, which was
proposed by Google [22]. In the following we will review research that has used
and applied the MapReduce paradigm to the clustering of big data sets.
A MapReduce design and implementation of an efficient DBSCAN algo-
rithm is introduced in [23]. The proposed algorithm addresses the drawbacks
of existing parallel DBSCAN algorithms such as the data balancing and scala-
bility issues. The algorithm tackles these issues using a parallelized implemen-
tation that removes the sequential processing bottleneck and thereby improves
the algorithm’s scalability. Furthermore, the algorithm showed the largest im-
provement when the data was imbalanced. The evaluation conducted using
large-scale data sets demonstrated the efficiency and scalability of their
algorithm.
A parallel K-means clustering algorithm based on MapReduce was proposed
in [24]. The algorithm locates the centroids by calculating the weighted
average of the points in each individual cluster via the Map function. The
Reduce function then assigns a new centroid to each data point based on the
distance calculations. At the end, a MapReduce iterative refinement technique is applied
to locate the final centroids. The authors evaluated the implementation us-
ing the measures of speedup, scaleup, and sizeup. The results showed that the
proposed algorithm is able to process large data sets of 8 GB using commodity
hardware effectively.
In [25], an algorithm for solving the problem of document clustering using
the MapReduce-based K-means algorithm is proposed. The algorithm uses
the MapReduce paradigm to iterate over the document collection in order
to calculate the term frequency and inverse document frequency. The algo-
rithm represents the documents as <key,value> pairs, where the key is the
document type and the value is the document text. The authors compare a
non-parallelized version of K-means with the parallelized version to show the
speedup gain. Furthermore, the experiments of the parallelized k-means al-
gorithm on 50,000 documents showed that the algorithm performs very well
in terms of accuracy for this text clustering task while having a reasonable
execution time.
Another K-means clustering algorithm using MapReduce by merging the
K-means algorithm with the ensemble learning bagging method is introduced
in [26]. The proposed algorithm addresses K-means' instability and sensitivity to
outliers. Ensemble learning is a technique that uses a collection of models to
achieve better results than any single model in the collection, and bagging is one of
the most popular types of ensemble techniques. The evaluation was performed
on relatively small data sets (instances around 5,000) and only 4 nodes were
used for the parallelization. The authors show that a speedup is obtained on
data sets consisting of outliers.
A Self-Organizing Map (SOM) was modified to work with large scale data
sets by implementing the algorithm using the MapReduce concept to improve
the performance of clustering as shown in [27]. A self-organizing map is an
unsupervised neural network that projects high-dimensional data onto a low-
dimensional grid and visually represents the topological order of the original
data. Unfortunately, the details of the MapReduce implementation are not
given, but the experiments that were conducted with a small data set demon-
strated the efficiency of the MapReduce-based SOM.
In [28], the authors applied the MapReduce framework to solve co-clustering
problems by introducing a framework called DISCO (DIStributed CO-clustering
with MapReduce). Unlike clustering which groups similar rows or columns in-
dependently, co-clustering searches for interrelated submatrices of rows and
columns. The authors proved that using MapReduce is a good solution for
co-clustering mining tasks by applying the algorithm to data sets such as
collaborative filtering, text mining, etc. The experiments demonstrated that
co-clustering with MapReduce can scale well with large data sets using up to
40 nodes.
In [29], a fast clustering algorithm with a constant factor approximation
guarantee was proposed. The authors use a sampling technique to reduce the
data size first, and then apply Lloyd's algorithm on the remaining data
set. A comparison of this algorithm with several sequential and parallel al-
gorithms for the k-median problem was conducted using randomly generated
data sets to evaluate the performance of the algorithm. The randomly gen-
erated data sets contained up to 10 million points. The results showed that
the algorithm achieves better or similar solutions compared to the existing
algorithms especially on very large data sets.
A big data clustering method based on the MapReduce framework was
proposed in [30]. The authors used an ant colony approach to decompose the
big data into several data partitions to be used in parallel clustering. Applying
MapReduce to the ant colony clustering algorithm led to the automation
of semantic clustering to improve the data analysis task. The proposed
algorithm was developed and tested on data sets with a large number of records
(up to 800K) and showed acceptable accuracy with good speedup.
In [31], the authors introduced a new approach, called the Best of Both
Worlds (BOW) method, to minimize the I/O cost of cluster analysis with the
MapReduce model by minimizing the network overhead among the processing
nodes. They proposed a subspace clustering method to handle very large data
sets in a reasonable amount of time. Experiments on terabyte data sets were
conducted using 700 mappers and up to 140 reducers. The results showed very
good speedup results.
As can be seen from the related work, different parallel clustering methods
have been proposed in the past. The aim of this paper is to parallelize the
FCM algorithm using the MapReduce concept in order to conduct a thorough
experimentation and scalability analysis.
$$J = \sum_{i=1}^{n} \sum_{j=1}^{c} (u_{ij})^{m} \, d^{2}(y_i, c_j) \qquad (4)$$

$$u_{ij}^{k+1} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{\frac{2}{(m-1)}}} \qquad (6)$$

where

$$d_{ij} = \| y_i - c_j \|^{2} \qquad (7)$$

until $\max_{ij} \| u_{ij}^{k} - u_{ij}^{k+1} \| < \epsilon$
Return c
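For illustration, the following minimal sequential sketch implements the iteration described by Equations 4, 6, and 7, with i indexing the data points and j the clusters. The centroid update (referred to as Equation 5 in the text but not reproduced here) is assumed to be the standard FCM formula; this is a non-parallel reference sketch, not the proposed MapReduce implementation.

```python
import numpy as np

def fcm(data, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal sequential FCM sketch following Equations 4, 6, and 7.
    data: (n, d) array, c: number of clusters, m: fuzzifier."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Random membership matrix; each row (data point) sums to 1.
    u = rng.random((n, c))
    u /= u.sum(axis=1, keepdims=True)

    for _ in range(max_iter):
        um = u ** m
        # Standard FCM centroid update (assumed for Equation 5, which is not
        # reproduced above): c_j = sum_i u_ij^m y_i / sum_i u_ij^m.
        centroids = (um.T @ data) / um.sum(axis=0)[:, None]

        # Squared Euclidean distances d_ij = ||y_i - c_j||^2 (Equation 7).
        d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        d = np.fmax(d, 1e-12)  # guard against division by zero

        # Membership update following Equation 6: u_ij = 1 / sum over the c
        # clusters k of (d_ij / d_ik)^(2/(m-1)), with the exponent as written.
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        u_new = 1.0 / ratio.sum(axis=2)

        # Stopping criterion: max_ij |u_ij^k - u_ij^(k+1)| < eps.
        if np.max(np.abs(u_new - u)) < eps:
            u = u_new
            break
        u = u_new

    return centroids, u

# Illustrative usage on random data:
# X = np.random.rand(1000, 5)
# centroids, U = fcm(X, c=7)
```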
The growth of the internet has challenged researchers to develop new ideas
to deal with the ever increasing amount of data. Parallelization of algorithms
is needed in order to enable big data processing. Classic parallel applications
that were developed in the past either used message passing runtimes such
as MPI (Message Passing Interface) [33] or PVM (Parallel Virtual Machines)
[34].
A parallel implementation of the fuzzy c-means algorithm with MPI was
proposed in [35]. The implementation consists of three Master/Slave processes,
whereby the first computes the centroids, the second computes the distances
and updates the partition matrix as well as updates the new centroids, and
the third calculates the validity index. Moderately sized data sets were used
to evaluate the approach and good speedup results were achieved.
However, MPI utilizes a rich set of communication and synchronization
constructs, which need to be explicitly programmed. In order to make the
development of parallel applications easier, Google introduced a program-
ming paradigm called MapReduce that uses the Map and Reduce primitives
that are present in functional programming languages. The MapReduce imple-
mentation enables large computations to be divided into several independent
Map functions. MapReduce provides fault tolerance since it has a mechanism
that automatically re-executes Map or Reduce tasks that have failed.
The MapReduce model works as follows. The input of the computation is
a set of key-value pairs, and the output is a set of output key-value pairs. The
algorithm to be parallelized needs to be expressed by Map and Reduce func-
tions. The Map function takes an input pair and returns a set of intermediate
key-value pairs. The framework then groups all intermediate values associated
with the same intermediate key and passes them to the Reduce function. The
Reduce function uses the intermediate key and set of values for that key. These
values are merged together to form a smaller set of values. The intermediate
values are forwarded to the Reduce function via an iterator. More formally,
the Map and Reduce functions have the following types:
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v3)
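As a small illustration of these type signatures (a generic example, unrelated to the FCM implementation itself), a word-count computation can be written with one Map and one Reduce function, with the grouping of intermediate values by key, normally performed by the framework, made explicit:

```python
from collections import defaultdict

# Map: (k1, v1) -> list of (k2, v2). Here: (filename, text) -> [(word, 1), ...]
def map_fn(key, value):
    return [(word, 1) for word in value.split()]

# Reduce: (k2, list(v2)) -> list(v3). Here: (word, [1, 1, ...]) -> [count]
def reduce_fn(key, values):
    return [sum(values)]

# The grouping step that the MapReduce framework performs between the phases:
def run_mapreduce(records):
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    return {k2: reduce_fn(k2, vs) for k2, vs in intermediate.items()}

print(run_mapreduce([("doc1", "big data big clusters")]))
# {'big': [2], 'data': [1], 'clusters': [1]}
```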
In order to consider the mapping of the FCM algorithm to the Map and
Reduce primitives, it is necessary for FCM to be partitioned into two MapRe-
duce jobs since only one would not be sufficient. The first MapReduce job
calculates the centroid matrix by iterating over the data records, and a second
MapReduce job is necessary since the following calculations need the complete
centroid matrix as the input. The second MapReduce job also iterates over the
data records and calculates the distances to be used to update the membership
matrix as well as to calculate the fitness.
The details of the proposed implementation in the form of block diagrams
are given in Figures 1 and 2. As can be seen, in the first MapReduce job the
mappers have a portion of the data set and a portion of the membership matrix
and produce centroid sub-matrices. The reducer of the first MapReduce job
then merges the sub-matrices into the centroid matrix.
The second MapReduce job involves more computations than the first. During
the Map phase, portions of the data set are received, and the distance sub-matrices,
the membership sub-matrices, and the sub-objective values are computed. Again,
in the Reduce phase the membership sub-matrices are merged, and the sub-objective
values are summed up. Please note that even though the figure shows three Map
arrows, these steps are done in one Map phase.
An algorithmic description detailing the Main procedure and the two MapReduce
jobs is given in Algorithms 2-6. Algorithm 2 proceeds as follows. First,
the membership matrix is randomly initialized. Then, the data set needs to
be prepared such that the data set itself and the membership matrix are
merged together vertically in order for the first MapReduce job to have the
data (both the data set as well as the membership matrix) available. This data
set is stored in HDFS to be ready for the first MapReduce job. After this, the
Hadoop framework calls the first and then the second MapReduce job. The
resulting updated membership matrix is copied from the HDFS to the local
file system for the second iteration to begin. The algorithm iterates, calling the
two MapReduce jobs several times until the stopping condition is met. Once
this is completed, the purity value is calculated using Equation 8.
Algorithm 3 shows the pseudo code for the Map function of the first
MapReduce job. The inputs are the data record values and the membership
matrix (which are vertically merged), and the output is the intermediate cen-
troid matrix. During this Map function the intermediate centroid matrix values
are calculated using Equation 5, which are then emitted.
The intermediate centroid matrices are then merged by the Reduce function
of the first MapReduce job, as shown in Algorithm 4.
Algorithm 5 shows the Map function of the second MapReduce job. The
function takes as the inputs the data record values and the centroid matrix.
The output is an intermediate membership matrix. This Map function first cal-
culates the distances between the data points and the centroids using Equation
7, and then updates the intermediate membership matrix using Equation 6,
which is then emitted.
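To make the division of labor between the two jobs more concrete, the following is an illustrative sketch (not the paper's actual Hadoop code) of how the Map and Reduce functions of both jobs could be expressed as plain Python functions; the chunking scheme, function names, and data layout are assumptions made for this example.

```python
import numpy as np

# --- Job 1: compute the centroid matrix from (data chunk, membership chunk) ---

def map_job1(data_chunk, u_chunk, m=2.0):
    """Emit per-cluster partial sums; data_chunk: (n_i, d), u_chunk: (n_i, c)."""
    um = u_chunk ** m
    num = um.T @ data_chunk            # (c, d) partial numerators
    den = um.sum(axis=0)               # (c,)  partial denominators
    return [(j, (num[j], den[j])) for j in range(u_chunk.shape[1])]

def reduce_job1(cluster_id, partials):
    """Merge the partial sums for one cluster into its centroid row."""
    num = sum(p[0] for p in partials)
    den = sum(p[1] for p in partials)
    return num / den                   # centroid of cluster `cluster_id`

# --- Job 2: compute distances, update memberships, accumulate the objective ---

def map_job2(chunk_id, data_chunk, centroids, m=2.0):
    """Emit the updated membership sub-matrix and the partial objective value."""
    d = ((data_chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    d = np.fmax(d, 1e-12)
    u_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
    partial_j = float(((u_new ** m) * d).sum())
    return [("U", (chunk_id, u_new)), ("J", partial_j)]

def reduce_job2(key, values):
    """Merge membership sub-matrices (key "U") or sum partial objectives (key "J")."""
    if key == "J":
        return sum(values)
    return np.vstack([u for _, u in sorted(values, key=lambda t: t[0])])
```

In the actual implementation these functions run as Hadoop Map and Reduce tasks, with the merged data set, membership matrix, and centroid matrix exchanged through HDFS as described above.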
The cover type data set [36] was used as the base for the experimentation.
The data set contains data regarding 30 x 30 meter patches of forest such
as elevation, distance to roads and water, hillshade, etc. The data set is used
to identify which type of cover a particular patch of forest belongs to, and
there are seven different types defined as output. The data set provides useful
information to natural resource managers in order to allow them to make
appropriate decisions about ecosystem management strategies. The data set
characteristics are the following: number of instances is 581,012, number of
attributes (categorical, numerical) is 54, and number of classes/labels is 7.
In order to guarantee that the parallelized version of the FCM algorithm works
appropriately, a performance measure needs to be used to quantify the result-
ing clustering quality. Given that the cover type data set is a “classification
data set”, which includes the correct labels, the clustering quality
can be measured in terms of purity [38]. Purity is defined as:
$$\mathrm{Purity} = \frac{1}{n} \sum_{j=1}^{k} \max_{i} \left( | L_i \cap C_j | \right) \qquad (8)$$
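For illustration, purity can be computed with the short helper below, under the usual assumption (not spelled out above) that L_i denotes the set of items carrying true class label i and C_j the set of items assigned to cluster j:

```python
from collections import Counter

def purity(true_labels, cluster_assignments):
    """Purity per Equation 8: for each cluster C_j take the size of its largest
    overlap with any true class L_i, sum over clusters, and divide by n."""
    n = len(true_labels)
    clusters = {}
    for label, cluster in zip(true_labels, cluster_assignments):
        clusters.setdefault(cluster, Counter())[label] += 1
    return sum(counts.most_common(1)[0][1] for counts in clusters.values()) / n

# Example: seven points grouped into two clusters
print(purity([0, 0, 1, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1, 1]))  # -> 4/7 ≈ 0.571
```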
Since different data set sizes are necessary for the performance evaluation, the
cover type data set has been split into different portions as shown in Table 2.
Listed are the name of the data set, which corresponds to the number of rows
contained, and the file size in bytes. Since the number of rows of
the complete cover type data set is 581,012, for the experiments with the larger
data set sizes, rows of the data sets were duplicated.
Three different experiments were conducted. The first experiment com-
pares the ratio of the number of mappers and reducers used, and the second
experiment uses the findings from the first experiment to perform a scalabil-
ity analysis of the number of mappers/reducers used. The third experiment
compares the Mahout FKM implementation with the proposed MR-FCM
algorithm.
The difference between the two sets of measurements is that for the first
experiment the same number of mappers and reducers are used for the second
MapReduce job.
For the first MapReduce job, preliminary experiments showed that the best
performance is achieved when the number of reducers and the number of
mappers is 7, since there are 7 outcomes (clusters) in the data set. Therefore, in
all experiments conducted the mappers and reducers for the first MapReduce
job were set to 7; however, the timings of the first MapReduce (MR1) job are
also reported.
Figure 3 shows the execution time and the speedup of the second MapRe-
duce (MR2) job. The left hand side (Figures 3(a) and 3(c)) shows the results
of using equal number of mappers and reducers, and the right hand side (Fig-
ures 3(b) and 3(d)) shows the results of using half the number of mappers
for the reducers. The speedup results are calculated relative to the times with 10
mappers and 10 reducers, and with 10 mappers and 5 reducers, as shown in Figures
3(c) and 3(d), respectively. What can be seen from this comparison is that
the utilization of the Hadoop framework is much better when only half the
number of mappers is used for the reducers. In particular, using 10 mappers
and 10 reducers, the MR2 time is 861 seconds, whereas the MR2 time is 199
seconds when 10 mappers and only 5 reducers are used. Therefore, an
improvement by a factor of 4 is achieved with the setup where the number of reducers
equals half the number of mappers. This finding is used throughout the next set
of experiments.
Fig. 3 Execution time and speedup of MR2: (a) execution time with no. of mappers equal
to no. of reducers, (b) execution time with no. of reducers half the no. of mappers, (c)
speedup with no. of mappers equal to no. of reducers, (d) speedup with no. of reducers
half the no. of mappers
The second experiment conducts a scalability analysis whereby the data
set sizes are increased by 100,000 rows starting with a data set of 100,000 rows
all the way up to 900,000 rows. Again, the number of mappers and reducers
for the MR1 job was set to 7. Table 3 shows the execution time of MR1 for
varying data set sizes. We observe a normal linear trend of the execution time
for increasing data set sizes given that the number of mappers and reducers is
fixed to 7.
Table 3 Time in seconds for MR1 for varying data set sizes (no. of mappers and reducers
equals 7)
Figure 4 shows the execution time of MR2 for increasing numbers of mappers
and reducers used. The experiments use increments of 50, starting with 50
mappers (25 reducers) and ending with 500 mappers (250 reducers). Looking
at all the figures we can see that the first figures with the lower numbers of
mappers show an “exponential” increase, whereas the later figures with
the higher numbers of mappers “flatten” and show an almost linear
increase. The execution times for the 900K data set are 669, 164, 88, 60, 46, and
41 seconds for 50, 100, 150, 200, 250, and 300 mappers, respectively.
The third experiment compares the two Mahout algorithms KMeans and
FKM with MR-FCM for data set sizes varying from 1M to 5M. Since the
infrastructure used at TACC does not allow the number of mappers and reducers
to be set explicitly, both algorithms are run without specifying the number
of mappers and reducers.
Fig. 4 Time results of MR2 with varying data set sizes using different number of mappers
and reducers: (a) 50 mappers and 25 reducers, (b) 100 mappers and 50 reducers, (c) 150
mappers and 75 reducers, (d) 200 mappers and 100 reducers, (e) 250 mappers and 125
reducers, (f) 300 mappers and 150 reducers
Table 4 Time in seconds comparing KMeans, FKM, and MR-FCM for varying data set
sizes
A possible reason for this might be that the Mahout library uses the vector format of the data
set whereas the MR-FCM implementation does not.
6 Conclusions
Since current clustering algorithms cannot handle big data, there is a need for
scalable solutions. Fuzzy clustering algorithms have been shown to outperform hard
clustering algorithms. Fuzzy clustering assigns membership degrees between 0
and 1 to the objects to indicate partial membership. This paper investigated
the parallelization of the FCM algorithm and outlined how the algorithm can
be parallelized using the MapReduce paradigm that was introduced by Google.
Two MapReduce jobs are necessary for the parallelization since the calculation
of the centroids needs to be performed before the membership matrix can be
calculated.
The accuracy of the MR-FCM algorithm was measured in terms of purity
and, compared to different clustering algorithms (both hard clustering and
fuzzy clustering techniques), was shown to produce comparable results.
The experimentation and scalability analysis revealed that the optimal
utilization is achieved for the first MapReduce job using 7 mappers and 7
reducers, which is equal to the number of clusters in the data set. Furthermore,
it was shown that for the second MapReduce job the best utilization is achieved
when using half the number of mappers for the reducers. A factor of 4 in
terms of speedup was achieved. The scalability analysis showed that for the
data sets investigated (100K up to 900K), a nearly linear increase can be
observed when 250 mappers and 125 reducers and more are used. Another
evaluation compared the two Mahout algorithms KMeans and FKM with MR-
FCM for data set sizes varying from 1M to 5M. This comparison showed that
KMeans, being the least computationally expensive algorithm, performed best.
FKM and MR-FCM are computationally very similar; however, the Mahout
FKM algorithm was shown to scale better than the MR-FCM algorithm.
Overall, with the implementation we have shown how an FCM algorithm
can be parallelized using the MapReduce framework, and the experimental
evaluation demonstrated that comparable purity results can be achieved. Furthermore,
the MR-FCM algorithm scales well with increasing data set sizes, as
shown by the scalability analysis conducted.
References
1. G. Bell, A. J. G. Hey, and A. Szalay, Beyond the Data Deluge, Science 323
(AAAS, 6/3/2009), 1297-8. DOI 10.1126/science.1170411.
2. A. Ghosh, and L. C. Jain, Evolutionary Computation in Data Mining Series:
Studies in Fuzziness and Soft Computing, vol. 163, Springer, 2005.
3. P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining,
Addison-Wesley, May 2005. (ISBN:0-321-32136-7)
4. H. Jabeen, and A. R. Baig, Review of Classification Using Genetic Program-
ming, International Journal of Engineering Science and Technology, vol. 2.
no. 2. pp.94-103, 2010.
5. J. Han, Data Mining: Concepts and Techniques, Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA, 2005. ISBN:1558609016.
6. R. Babuska, Fuzzy Clustering Lecture, last retrieved from:
http://homes.di.unimi.it/~valenti/SlideCorsi/Bioinformatica05/
Fuzzy-Clustering-lecture-Babuska.pdf on October 2014.
7. J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algo-
rithms, Kluwer Academic Publishers Norwell, MA, USA, 1981.
8. A. K. Jain, and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall,
Inc., Upper Saddle River, NJ, 1988.
9. E. H. Ruspini, Numerical methods for fuzzy clustering, Information Sci-
ences, 2, pp. 319-350, 1970.
10. M. S. Yang, A survey of fuzzy clustering, Math. Comput. Modelling, 18,
pp. 1-16, 1993.
11. H. S. Lee, Automatic clustering of business process in business systems
planning, European Journal of Operational Research, 114, pp. 354-362, 1999.
12. G. J. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic Theory and Application,
Prentice Hall PTR, Upper Saddle River, NJ, 1995.
13. A. Rosenfeld, Fuzzy graphs, in: L. A. Zadeh, K. S. Fu, M. Shimura (Eds.),
Fuzzy Sets and their Applications to Cognitive and Decision Processes, Aca-
demic Press, New York, 1975.
14. D. W. Matula, Cluster analysis via graph theoretic techniques Proceedings
of the Louisiana Conference on Combinatorics, Graph Theory and Comput-
ing, Winnipeg, 1970.
15. J. C. Dunn, A Fuzzy Relative of the ISODATA Process and Its Use in
Detecting Compact Well-Separated Clusters, Journal of Cybernetics 3: 32-
57, 1973.
16. V. P. Guerrero-Bote, C. Lopez-Pujalte, F. de Moya-Anegon, V. Herrero-
Solana, Comparison of neural models for document clustering, Int. Journal
of Approximate Reasoning, vol. 34, pp. 287-305, 2003.
17. I. Gath, and A. B. Geva, Unsupervised optimal fuzzy clustering, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 11(7), pp. 773-
781, 1989.
18. J. C. Bezdek, C. Coray, R. Gunderson and J. Watson, Detection and Char-
acterization of Cluster Substructure - Linear Structure, Fuzzy c-Varieties
and Convex Combinations Thereof, SIAM J. Appl. Math., vol. 40, no. 2, pp.
358-372, 1981.
19. Y. Yang, and S. Huang, Image Segmentation by Fuzzy C-Means Clustering
Algorithm with a Novel Penalty Term, in Computing and Informatics, vol.
26, pp. 17-31, 2007.
20. W. Cai, S. Chen, and D. Zhang, Fast and robust fuzzy c-means clustering
algorithms incorporating local information for image segmentation, Pattern
Recognition, vol. 40, no. 3, pp. 825-838, 2007.
21. T. H. Sarma, P. Viswanath, B. E. Reddy, A hybrid approach to speed-up
the k-means clustering method, Int. J. Mach. Learn. & Cyber. 4:107-117,
2013.
22. J. Dean, and S. Ghemawat, Mapreduce: simplified data processing on large
clusters, in Proceedings of the 6th conference on Symposium on Operating
Systems Design & Implementation - Volume 6, OSDI’04, pp. 10–10, 2004.
23. Y. He, H. Tan, W. Luo, S. Feng, and J. Fan, Mr-dbscan: a scalable
mapreduce-based dbscan algorithm for heavily skewed data, Frontiers of
Computer Science, vol. 8, no. 1, pp. 83–99, 2014.
24. W. Zhao, H. Ma, Q. He, Parallel k-means clustering based on mapreduce,
in Proceedings of the CloudCom’09, Berlin, Heidelberg: Springer-Verlag, pp.
674–679, 2009.
25. P. Zhou, J. Lei, and W. Ye, Large-scale data sets clustering based on
mapreduce and hadoop, Computational Information Systems, vol. 7, no. 16,
pp. 5956–5963, 2011.
26. H.-G. Li, G.-Q. Wu, X.-G. Hu, J. Zhang, L. Li, X. Wu, K-means cluster-
ing with bagging and mapreduce, in Proceedings of the 2011 44th Hawaii
International Conference on System Sciences, Washington, DC, USA: IEEE
Computer Society, pp. 1–8, 2011.
27. S. Nair, and J. Mehta, Clustering with apache hadoop, in Proceedings of
the International Conference, Workshop on Emerging Trends in Technology,
ICWET’11, New York, NY, USA: ACM, pp. 505–509, 2011.
28. S. Papadimitriou, and J. Sun, Disco: Distributed co-clustering with map-
reduce: A case study towards petabyte-scale end-to-end mining, in Proc. of
the IEEE ICDM’08, Washington, DC, USA, pp. 512–521, 2008.
29. A. Ene, S. Im, B. Moseley, Fast clustering using mapreduce, in Proceedings
of KDD’11, NY, USA: ACM, pp. 681–689, 2011.
30. J. Yang, and X. Li, Mapreduce based method for big data semantic cluster-
ing, in Proceedings of the 2013 IEEE International Conference on Systems,
Man, and Cybernetics, SMC’13, Washington, DC, USA: IEEE Computer
Society, pp. 2814–2819, 2013.
31. R. L. F. Cordeiro, C. Traina Jr., A. J. M. Traina, J. Lopez, U. Kang, C. Falout-
sos, Clustering very large multi-dimensional datasets with mapreduce, in
Proceedings of KDD’11, NY, USA: ACM, pp. 690–698, 2011.
32. Simone A. Ludwig, Clonal selection based fuzzy C-means algorithm for
clustering, GECCO ’14 Proceedings of the 2014 conference on Genetic and
evolutionary computation, Pages 105-112.
33. MPI (Message Passing Interface). http://www-unix.mcs.anl.gov/mpi/
34. PVM (Parallel Virtual Machine). http://www.csm.ornl.gov/pvm/
35. M. V. Modenesi, M. C. A. Costa, A. G. Evsukoff, N. F. Ebecken, Parallel
Fuzzy c-Means Cluster Analysis, in Lecture Notes in Computer Science on
High Performance Computing for Computational Science - VECPAR 2006,
Springer, 2007.
36. J. A. Blackard, Comparison of Neural Networks and Discriminant Analysis
in Predicting Forest Cover Types, Ph.D. dissertation, Department of Forest
Sciences, Colorado State University, Fort Collins, Colorado, 1998.
37. Apache Hadoop. http://hadoop.apache.org/.
38. J. Han, Data Mining: Concepts and Techniques, Morgan Kaufmann, San
Francisco, CA, USA, 2005.
39. G. Karypis, CLUTO: A clustering toolkit, University of Minnesota, Com-
puter Science, Tech. Rep. 02-017, 2003.
40. T. C. Havens, R. Chitta, A. K. Jain, J. Rong, Speedup of fuzzy and pos-
sibilistic kernel c-means for large-scale clustering, 2011 IEEE International
Conference on Fuzzy Systems (FUZZ), pp. 463-470, June 2011.
41. R. Hathaway and J. Bezdek, Optimization of clustering criteria by refor-
mulation, IEEE Trans. Fuzzy Systems, vol. 3, pp. 241-245, 1995.
42. Mahout library, http://mahout.apache.org/.