Clustering Notes
Clustering is a data mining (DM) or text mining (TM) technique used to group data items with similar content into an unknown number of groups (or clusters). In other words, the useful information extracted from the large data set here is the grouping of the data. Normally, documents within a cluster are more similar to each other than to documents lying in other clusters. Thus, unlike classification, clustering does not use predefined topics, but instead groups documents based on their similarity to each other; it therefore works in an unsupervised manner. A good clustering method should ensure that the intra-class similarity is high while the inter-class similarity is low.
Clustering has traditionally been applied in DM, where the data is well organized (structured, typically records). Research has also been done on how to apply it to text. For example, in an opinion poll we can retrieve only the positive (agreeing) comments. Also, some web search engines cluster text so that the retrieved web documents are grouped into (previously unknown) clusters.
Applications of clustering
Some of the many possible applications of text clustering include;
‘Improving precision and recall in information retrieval’ (Wanner 2004, p. 4).
Organizing web search engine results into meaningful groups.
Web filtering: Removing unwanted web materials.
In marketing: E.g. grouping Customer Relationship Management (CRM) correspondence.
In opinion poll mining: Grouping opinions into the possible groups.
Bioinformatics: Text mining techniques have been applied to identify and classify molecular
biology terms corresponding to instances of concepts under study by biologists.
'Land use: identification of areas of similar land use in an earth observation database' (Stefanowski 2009, p. 18).
Image processing.
Pattern recognition.
Etc.
1. Preprocessing: Typical structured-data preprocessing activities include aggregation (combining two or more attributes into a single attribute to solve the duplicate-values problem), sampling (using the characteristics of data or models based on a subset of the original data), and dimension reduction (determining the dimensions, or combinations of dimensions, that are important for clustering). In text clustering, dimension reduction involves removing high-dimension-causing words from the text documents, i.e. unnecessary tokens that create unimportant dimensions (e.g. headings and other introductory material, punctuation marks such as commas, and frequent words such as "the", "of", "and"), as well as resolving complexities (e.g. words with multiple meanings). This makes the text simpler and easier to structure. For example, removing very frequent but unimportant words such as "and" makes the resulting text simpler and easier to structure; besides, such words do not usually form the basis of clustering (they do not tell us what topic a document concerns). A small preprocessing sketch appears after this list. 'Data preprocessing, or data cleansing, is the algorithm that detects and removes errors or inconsistencies from data and consolidates similar data in order to improve the quality of subsequent analyses. This cleaned data will then be fed to the analysis process' (Kongthorn, p. 12).
2. Data Representation: The preprocessed data must then be represented as structured data sets. The most widely used structure is the vector space model. The preprocessed documents are converted into this structure, ready for clustering.
3. Clustering: The clustering technique is then applied on the resulting transformed structure.
4. Evaluation: Here, the user examines, evaluates and interprets the clustering results.
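The following is a minimal Python sketch of the stop-word and punctuation removal described in step 1 above. The tiny stop-word list, the regular-expression tokenizer, and the sample sentence are illustrative assumptions only, not a prescribed implementation.

```python
import re

# A tiny illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "of", "and", "a", "an", "to", "in", "is", "for"}

def preprocess(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    # Keep only letters and spaces (removes commas, full stops, etc.).
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = cleaned.split()
    return [t for t in tokens if t not in STOP_WORDS]

if __name__ == "__main__":
    doc = "The clustering of documents, and the grouping of similar texts."
    print(preprocess(doc))
    # ['clustering', 'documents', 'grouping', 'similar', 'texts']
```

The surviving tokens are the candidate terms that later form the dimensions of the vector space model.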
We summarize the following objectives of data clustering (DC) or text clustering (TC).
1. Appropriate document representation model: Obtaining a document representation
model that best preserves the semantic relationships between words in the document. Text is
context sensitive and so the exact meaning of a sentence may vary depending on the
sequential representation of words in the text document. Also, algorithms that convert the
unstructured text documents into a structured form should retain the original data.
2. Effectiveness (or accuracy) of clustering: The techniques used by an algorithm should
ensure that “similar” documents are clustered together. In this regard, the measure of
“similarity” between two documents has no exact definition and so is a challenge. It’s
always possible to have two paragraphs discussing the same thing but expressed in different
words because of text ambiguities.
3. Efficiency: An algorithm should utilize computing resources well, i.e. the processing should be fast enough and should use minimal memory, especially given the typically huge number of documents involved in many TC applications. First, the formulae and logic used should be as simple as possible. Secondly, dimension reduction of the usually high-dimensional textual data is crucial. It is important to reduce documents' sizes (e.g. by removing words that are irrelevant to the topic) to improve the efficiency of operations. This is however not easy to achieve: obtaining appropriate algorithms to do this without losing important data is hard.
4. Scalability: An algorithm should be scalable so as to cater for huge data sizes, high
dimensions, and large number of clusters.
5. Robustness (or flexibility): Chandra, E. & Anuradha, V. (2011) say that 'the algorithm should be effective in processing data with noise and outliers'. In TC (especially when using algorithms based on the VSM), it is possible to have:
clusters with arbitrary (or irregular) shapes
clusters with different densities, and
Outliers/noise
Algorithms should not suffer from these three aspects, but should be able to deal with any
type of data with respect to the aspects. An algorithm should be able to identify the clusters
which have irregular shapes as well as those with different densities. It should also identify
any point which doesn’t belong to a cluster, or even data which doesn’t contain clusters.
6. Interpretability and cluster labels: The produced results should be easy to interpret as clusters. Also, obtaining an appropriate label/topic name for each cluster is desirable but also difficult. It is important to remember that clustering is unsupervised, so cluster labeling is not the user's activity.
7. Usability: An algorithm should preferably have minimum requirements for domain knowledge: it is desirable for an algorithm to operate automatically without the need for user inputs such as the number of clusters. It is difficult for the user to determine the best or optimum values of such parameters for a given input, and it is not practical for the user to always specify them. However, automating the choice of parameter values is hard.
8. Applicability: An algorithm should be able to solve as wide a range of applications as possible. Also, some applications are such that a document may concern several topics, and so should be allowed to belong to all the appropriate clusters (overlapping of document clusters).
Quality (or performance): All the above factors contribute to the final quality of an algorithm.
There are various text representation methods. Assume a set of n documents containing a total of m attributes/terms that form the basis of clustering (the number of occurrences of the attributes/terms varies from document to document). There are various interpretations of this data, and hence various ways of representing it.
Comment
The most widely used text representation methods are the Vector Space Model (VSM) and the matrix model discussed in sub-chapter 2.4 below. The two models are used interchangeably, i.e. we can form one of them and also refer to the equivalent representation in the other model. Thus, it is usual to form a matrix representation of documents and refer to it as the VSM. The approaches that use the VSM text representation method are the distance-based approach, feature extraction approach, density-based approach, grid-based approach, ontology-based approach, and neural networks-based approach.
For example, the distance-based algorithms measure the Euclidean distance between two documents
in the VSM to determine how close two documents are so as to decide if the two documents
should be clustered together or not. The approaches are discussed in sub-chapter 2.5 below.
A hierarchical clustering can be produced using one of two approaches, i.e. the agglomerative or the divisive approach.
Agglomerative approach
We cluster documents by moving from the leaves up to the root of the tree. The leaves are the
individual documents/clusters. We subsequently merge pairs of most similar clusters until we
obtain one final cluster (i.e. the root of the tree).
Divisive approach
We cluster documents by moving from the root down to the leaves of the tree. The whole document set is initially considered as a single cluster. We then repeatedly divide a cluster into sub-clusters of related documents until each cluster contains exactly one document.
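The following Python sketch illustrates the agglomerative approach described above on a few toy two-dimensional document vectors. The single-link merge criterion, the Euclidean distance, and the stopping condition (a target number of clusters) are illustrative assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(vectors, target_clusters=1):
    """Single-link agglomerative clustering: start with one cluster per
    document (the leaves) and repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > target_clusters:
        best = None
        # Find the pair of clusters with the smallest single-link distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

if __name__ == "__main__":
    docs = [(1, 2), (0, 2), (4, 1), (6, 0)]        # toy document vectors
    print(agglomerative(docs, target_clusters=2))  # [[0, 1], [2, 3]]
```

Running the merges all the way to target_clusters=1 yields the root of the tree; recording each merge along the way gives the full hierarchy.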
(ii) Flat category (or partitioning category)
According to Steinbach (2010, p. 4), flat algorithms produce a one-level (i.e. non-hierarchical) partition of the documents. They usually receive the expected number of clusters as a parameter.
The most widely used flat algorithm is the K-means algorithm.
On the other hand, a soft (or fuzzy) algorithm is one in which a document can be assigned to more than one cluster (i.e. clusters may overlap). In many applications a document may concern several topics (based on some of its keywords/attributes), and so should be clustered into several clusters.
Example
Assume three textual documents with identifiable (underlined) key terms. The underlined words qualify as the key terms that form the basis of identifying what each document talks about, and thus of doing the clustering.
We construct a term-dictionary as T1: fruit, T2: health, T3: infant, T4: exercise. We can then
form a term-document matrix as
A =
[ 1 1 0 ]
[ 1 1 1 ]
[ 0 1 0 ]
[ 0 0 1 ]
Here, the first row is the vector (1, 1, 0) representing the first term (fruit), showing that the term
occurs in the first and the second document, but not in the third. Similarly, the vector (1, 1, 0, 0)
represents the first document (that contains the first and the second terms, but not the third and
the fourth terms). And entry A42 is 0, showing that the fourth term (exercise) is not present in the
second document.
Example
Using the example in section 2.4.1 above, the term-document matrix using frequencies is
A =
[ 1 1 0 ]
[ 1 1 1 ]
[ 0 2 0 ]
[ 0 0 1 ]
The difference is that the third term (infant) occurs twice in the second document.
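Since the three example documents themselves are not reproduced here, the short stand-in documents in the following Python sketch are hypothetical, chosen only so that the resulting matrix agrees with the frequency matrix above; the term dictionary (T1: fruit, T2: health, T3: infant, T4: exercise) is taken from the example.

```python
# Hypothetical stand-in documents (the originals are not reproduced in these notes),
# chosen so that the resulting matrix matches the frequency matrix A above.
documents = [
    "fruit is good for health",
    "give the infant fruit because the infant needs it for health",
    "exercise improves health",
]
terms = ["fruit", "health", "infant", "exercise"]   # term dictionary T1..T4

def term_document_matrix(docs, terms):
    """Rows correspond to terms, columns to documents, entries to term frequencies."""
    matrix = []
    for term in terms:
        row = [doc.lower().split().count(term) for doc in docs]
        matrix.append(row)
    return matrix

for term, row in zip(terms, term_document_matrix(documents, terms)):
    print(term, row)
# fruit [1, 1, 0]
# health [1, 1, 1]
# infant [0, 2, 0]
# exercise [0, 0, 1]
```

Replacing each non-zero count with 1 gives the binary matrix of the first example.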
The Euclidean distance between two document vectors is the length of the straight line between them. The limitation of the Euclidean measure, according to Weiss (2006, p. 34), is that it requires normalization of the length of each vector so as to avoid distortion of the results. Thus, the cosine measure is often used instead.
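The following small Python sketch illustrates the point: two document vectors with the same mix of terms but very different lengths are far apart under the (unnormalized) Euclidean measure, yet have cosine similarity 1. The toy vectors are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two documents with the same term proportions; the second is simply three times longer.
short_doc = (1, 2, 0)
long_doc = (3, 6, 0)
print(euclidean(short_doc, long_doc))  # about 4.47: large, despite the identical content mix
print(cosine(short_doc, long_doc))     # 1.0: the direction (content mix) is identical
```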
Thus, in each loop of step 3 above, the algorithm aims at minimizing the following function for k
clusters and n data points.
J = ∑ (j = 1 to k) ∑ (i = 1 to n) ||xi − cj||²

where ||xi − cj|| is a chosen distance measure between data point xi and the centroid cj of cluster j.
Example
Assume four documents containing two terms, whereby the first term occurs with frequencies 1,
0, 4, and 6 respectively in the documents, while the second term occurs with frequencies 2, 2, 1,
0 respectively. Thus, the four document vectors are (1, 2), (0, 2), (4, 1), and (6, 0), i.e. with term-
document matrix
[ 1 0 4 6 ]
[ 2 2 1 0 ]
We choose k=2, and the first two points (i.e. (1, 2), (0, 2)) as the initial first and second
centroids.
First loop
We compute the distance matrix (containing distance of each point from each centroid) to be
D1 =
[ 0 1 3.16 5.39 ]
[ 1 0 4.12 6.33 ]
The first row of D shows the distance of each point from the first centroid, and the second row
shows the distance of each point from the second centroid.
Here, the point (1, 2) has distance √((1−1)² + (2−2)²) = 0 from centroid (1, 2), and distance √((1−0)² + (2−2)²) = 1 from centroid (0, 2).
The point (0, 2) has distance √((0−1)² + (2−2)²) = 1 from centroid (1, 2), and distance √((0−0)² + (2−2)²) = 0 from centroid (0, 2).
The point (4, 1) has distance √((4−1)² + (1−2)²) = 3.16 from centroid (1, 2), and distance √((4−0)² + (1−2)²) = 4.12 from centroid (0, 2).
The point (6, 0) has distance √((6−1)² + (0−2)²) = 5.39 from centroid (1, 2), and distance √((6−0)² + (0−2)²) = 6.33 from centroid (0, 2).
We then form the clusters by assigning each point to its nearest centroid. We form the group
matrix G by assigning each point value 1 (if it should belong to that cluster), and value 0 if not.
Note that first row represents the first cluster, and second row the second cluster. E.g., the third
point (4, 1) has distance 3.16 from the first centroid, and distance 4.12 from the second centroid,
meaning it’s nearer to the first centroid. So we set the third column of G below to (1, 0). Thus,
G1 =
[ 1 0 1 1 ]
[ 0 1 0 0 ]
This shows that the first, third and fourth points belong to the first cluster, and the second point
to the second cluster.
We then recompute the centroid of each cluster as the average of the points in that cluster. Thus,
first centroid is ((1+4+6)/3, (2+1+0)/3) which is (3.67, 1), while the second centroid is (0, 2).
Second loop
We then start the second loop of the algorithm and compute D to be
D2 =
[ 2.85 3.80 0.33 2.54 ]
[ 1 0 4.12 6.33 ]
Here, the point (1, 2) has distance √((1−3.67)² + (2−1)²) = 2.85 from centroid (3.67, 1), and distance √((1−0)² + (2−2)²) = 1 from centroid (0, 2).
The point (0, 2) has distance √((0−3.67)² + (2−1)²) = 3.80 from centroid (3.67, 1), and distance √((0−0)² + (2−2)²) = 0 from centroid (0, 2).
The point (4, 1) has distance √((4−3.67)² + (1−1)²) = 0.33 from centroid (3.67, 1), and distance √((4−0)² + (1−2)²) = 4.12 from centroid (0, 2).
The point (6, 0) has distance √((6−3.67)² + (0−1)²) = 2.54 from centroid (3.67, 1), and distance √((6−0)² + (0−2)²) = 6.33 from centroid (0, 2).
Thus, the first point moves to the second cluster, since it is now at distance 2.85 from the first centroid (3.67, 1) compared to distance 1 from the second centroid (0, 2). We therefore compute the new group matrix to be
G2 =
[ 0 0 1 1 ]
[ 1 1 0 0 ]
We then recompute the centroid of each cluster as the average of the points in that cluster. Thus,
first centroid is ((4+6)/2, (1+0)/2) which is (5, 0.5), while the second centroid is ((1+0)/2,
(2+2)/2) which is (0.5, 2).
Third loop
We then start the third loop of the algorithm and compute D to be
D3 =
[ 4.27 5.22 1.11 1.11 ]
[ 0.5 0.5 3.64 5.85 ]
Here, the point (1, 2) has distance √((1−5)² + (2−0.5)²) = 4.27 from centroid (5, 0.5), and distance √((1−0.5)² + (2−2)²) = 0.5 from centroid (0.5, 2).
The point (0, 2) has distance √((0−5)² + (2−0.5)²) = 5.22 from centroid (5, 0.5), and distance √((0−0.5)² + (2−2)²) = 0.5 from centroid (0.5, 2).
The point (4, 1) has distance √((4−5)² + (1−0.5)²) = 1.11 from centroid (5, 0.5), and distance √((4−0.5)² + (1−2)²) = 3.64 from centroid (0.5, 2).
The point (6, 0) has distance √((6−5)² + (0−0.5)²) = 1.11 from centroid (5, 0.5), and distance √((6−0.5)² + (0−2)²) = 5.85 from centroid (0.5, 2).
Thus,

G3 =
[ 0 0 1 1 ]
[ 1 1 0 0 ]

Since G3 is identical to G2 (no point has changed cluster), the algorithm stops.
Conclusion
The first two points (thus documents) i.e. (1, 2), (0, 2) are in the second cluster while the last two
documents i.e. (4, 1), and (6, 0) are in the first cluster.
Since there are two terms, we have a two dimensional space whereby the x axis represents the
first term while the y axis represents the second term. Each document is a point on the xy space.
(In Figure 2.3 below, document points, centroids, and the boundary enclosing each cluster are each drawn with a distinct marker; a centroid that is also a data point is marked differently.)
[Figure: three panels, one per loop, each plotting the four document points on axes from 1 to 6. Loop 1: centroids (1, 2) and (0, 2). Loop 2: centroids (3.67, 1) and (0, 2). Loop 3: centroids (5, 0.5) and (0.5, 2).]
Figure 2.3: Illustration of KMeans clustering
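The following Python sketch reproduces the worked example above (k = 2, with the first two points as the initial centroids). The stopping rule used here, iterating until no point changes cluster, is an assumption about how the loop terminates; the rest follows the assignment and centroid-update steps shown in the three loops.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iters=100):
    # Initial centroids: the first k points (as in the example above).
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:      # converged: no point changed cluster
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

if __name__ == "__main__":
    docs = [(1, 2), (0, 2), (4, 1), (6, 0)]
    assignment, centroids = kmeans(docs, k=2)
    print(assignment)  # [1, 1, 0, 0]: first two documents together, last two together
    print(centroids)   # [[5.0, 0.5], [0.5, 2.0]], as in the third loop above
```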
The first limitation of K Means is that the user must specify the number of clusters before clustering. It is not trivial for the user to determine a reasonable number of clusters for a particular set of documents; different data sets obviously have different numbers of documents with varying topics, and thus different groups. 'A major problem with partitioning algorithms is selecting an appropriate number of output clusters' (Lasek 2011, p. 17). Further, the algorithm forces the documents into that particular number of clusters (i.e. K) no matter how many topics are contained in the documents, and this is a major limitation. Secondly, and equally important regarding the accuracy of K Means, we can observe from the algorithm's logic above that it lacks robustness, i.e. it handles irregularly shaped data poorly and does not detect outliers. Stefanowski (2009, p. 48) agrees, saying that the K Means algorithm is too sensitive to outliers. Let us have an illustration of this.
[Figure: two panels, (a) and (b), each showing 2 groups of points; an outlier point is marked in panel (b).]
Figure 2.4: Illustration of groups with irregular shapes and different densities, and an outlier point (hard to detect in KMeans)
The two obvious groups of points in each pair are of different shapes. Thus, some points in a
group could be nearer (in distance) to the center (or centroid) of the other group (or cluster)
rather than theirs, based on the K Means algorithm. In other words, some points (in irregular-
shaped regions) may be close to one another, yet in different groups. Thus, it’s very hard for the
K Means algorithm to detect the clusters since it’s purely based on distance measurements. And
so, wrong clustering could happen as illustrated below.
[Figure: two panels, each showing 2 clusters labelled cluster A and cluster B.]
Figure 2.5: Illustration of wrong clustering by KMeans
Also, from the logic of K Means, the outlier point shown at the bottom right corner of the second pair of clusters will not be interpreted as noise (not belonging to any cluster) but will instead be assigned to a cluster, because K Means assigns each point to its nearest centroid no matter how far away that centroid is. Thus, K Means is not able to identify outliers/noise in data, and such points further distort the computed centroids (the means of the points in each cluster). Thirdly, the algorithm does not include dimension reduction. Fourth, according to Li (2007, p. 22), there is no description of a cluster's contents (i.e. no cluster labels), so the contents cannot be utilized efficiently. Finally, especially when the data points are few, different choices of initial centroids are likely to produce different clusterings.
In an attempt to improve the K Means algorithm, other similar algorithms have been developed.
All these can be considered as the K Means family. They include the following.
K Means variations
The K Medians Algorithm: Works just like the K Means algorithm, except that we
compute the median instead of the mean of each cluster as the centroid. As a result, there is
less effect of extreme values (i.e. outliers). But, the algorithm suffers from the other
limitations of the K Means algorithm.
The K Medoids Algorithm: As opposed to the K Means algorithm whose centers (or
centroids) may not be data points (centroids are obtained as the mean value of all data points
in a cluster, thus centroids may not be data points), the K Medoids algorithm chooses data
points as centers (medoids). I.e., each cluster is represented by one of the objects in the
cluster (i.e. the representative object). A medoid is the data point that is most centrally located in a cluster, meaning that its average dissimilarity to all the other objects in the cluster is minimal. This makes the K Medoids algorithm less sensitive to outliers than the K Means algorithm. However, it is less efficient than K Means (computing medoids, specifically their dissimilarity to all other objects, is more expensive than computing means), and it retains the other limitations of K Means (apart from the outlier problem).
The Bisecting K Means Algorithm: Whereas the K Means algorithm splits the data into k clusters at once, the Bisecting K Means algorithm repeatedly splits a cluster into two sub-clusters in a divisive hierarchical manner, using K Means-type clustering for each split. Some researchers have found this algorithm to be more accurate than K Means, but it still suffers from the other limitations of the K Means algorithm.
The Partitioning Around Medoids (PAM) algorithm
CLARA (Clustering LARge Applications)
CLARANS (Clustering Large Applications based on RANdomized Search)
The Kernel K Means Algorithm
For example, in the frequent-word-set type of these algorithms, a word set consists of some words, e.g. "purchase, car". The word set "purchase, car" appears in each of the following three sentences.
“Please purchase car for me”
“I went to purchase the requested car”
“Frequent car purchase is unnecessary”
A word set is frequent if the number of documents containing the words in this set is at least the
specified threshold.
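A minimal Python sketch of this frequency test follows, using the three sentences above as the document set; the support thresholds passed in the calls are illustrative.

```python
def is_frequent(word_set, documents, min_support):
    """A word set is frequent if at least `min_support` documents
    contain every word in the set."""
    count = sum(1 for doc in documents
                if set(word_set) <= set(doc.lower().split()))
    return count >= min_support

documents = [
    "please purchase car for me",
    "i went to purchase the requested car",
    "frequent car purchase is unnecessary",
]
print(is_frequent({"purchase", "car"}, documents, min_support=3))        # True: all three sentences
print(is_frequent({"purchase", "requested"}, documents, min_support=2))  # False: only one sentence
```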
These algorithms have some advantages. First, clustering is based on the frequent word sets rather than the entire documents. Secondly, there is a description of a cluster's contents by the use
of a cluster’s label. The label is the set of frequent words shared by documents in that cluster.
However, they have their limitations. First, feature extraction methods involve expensive computations, which reduces their efficiency. Secondly, feature extraction methods suffer from a limitation not shared by the other approaches: the newly generated features may be difficult to interpret.
In LSA, we decompose a weighted (e.g. with term frequencies) term-document matrix A into three matrices U, S, and VT. Since the rows of A represent terms and its columns represent documents, U is the term matrix while VT is the document matrix: U describes the original row entities (terms) as vectors of derived orthogonal factor values, while VT describes the original column entities (documents) in the same way. Another characteristic of the vectors of U and VT is that they have mixed signs (some values are positive and others negative). The rows (representing terms) usually reveal semantic relations among the terms (e.g. synonyms), and the columns (representing documents) reveal clusters of documents. LSA is thus able to reveal deep/hidden (i.e. latent) relations among data items, hence the name Latent Semantic Analysis. Topics are represented by the factors of the decomposition, and consequently the relationships between documents are captured by the orthogonal factors: words loading on one factor have little relation to words loading on other factors, but a high relation to the other words loading on the same factor.
We then reconstruct the original matrix A by multiplying U, S and VT, but using only the vectors of U and VT that are associated with the larger singular values in S (the rest are ignored). We can specify a threshold on these singular values to greatly reduce A's dimension. In this way, A's rank is reduced, and the dimension reduction needed in text clustering is accomplished. In other words, we project an original vector space (a term-document matrix in this case) into a small factor space. SVD thus gives a reduced representation of the original text data.
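The following numpy sketch illustrates the rank-reduction step on the frequency term-document matrix from the earlier example; keeping k = 2 factors is an arbitrary illustrative choice.

```python
import numpy as np

# Frequency term-document matrix from the earlier example
# (rows = terms fruit, health, infant, exercise; columns = documents).
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 2, 0],
              [0, 0, 1]], dtype=float)

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (and the matching columns of U and rows of Vt).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(s, 3))    # singular values, largest first
print(np.round(A_k, 2))  # rank-2 approximation of A
```

The columns of the truncated Vt (one per document, each of length k) give the reduced document coordinates that can then be clustered in the factor space.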
A limitation of density-based algorithms is that initial parameters need to be set, and it is hard to specify the most appropriate setting. Secondly, there is no cluster labeling. Thirdly, as illustrated above, the algorithms do not allow cluster overlapping.
The DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm
This algorithm was presented by Ester, Kriegel, Sander, and Xu in 1996. According to Nagpal & Mann (2011), the DBSCAN algorithm takes two inputs: the neighbourhood radius (Eps) and the minimum number of points required inside a cluster (Minpts). Consequently, there are three types of points: core points (points whose neighbourhood is dense enough with respect to Eps and Minpts, i.e. contains at least Minpts points within radius Eps; a core point is thus inside a cluster), border points (points on the border of a cluster, within the neighbourhood of a core point but not core points themselves), and noise points (points which are neither core nor border points). A point q is directly density-reachable from a point p (i.e. is in the neighbourhood of p) if it is no farther from p than distance Eps and p is a core point. And q is density-reachable from p if there exists a sequence of points p1, p2, ..., pn with p1 = p and pn = q where each pi+1 is directly density-reachable from pi.
According to Mumtaz & Duraiswamy, DBSCAN clusters by arbitrarily selecting a starting point p that has not been visited and finding all neighbouring points within distance Eps of p. Then (a minimal sketch in code follows the list below):
If the number of the neighbors is at least Minpts, then a cluster is formed. The point p and its
neighbors are added to this cluster and then p is marked as visited. The algorithm then
repeats the evaluation process for all neighbors recursively.
If the number of neighbors is less than Minpts, then p is marked as noise.
If a cluster is fully expanded (i.e. all points within reach are visited) then the algorithm
proceeds to iterate through the remaining unvisited points in the dataset.
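The following Python sketch implements the expansion procedure described in the steps above. The choice of Eps = 2, Minpts = 3 and the toy two-dimensional points are illustrative assumptions.

```python
import math

def neighbours(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    NOISE = -1
    labels = [None] * len(points)            # None = not yet visited
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(points, i, eps)
        if len(nbrs) < min_pts:              # not a core point: mark as noise for now
            labels[i] = NOISE
            continue
        labels[i] = cluster_id               # start a new cluster from this core point
        seeds = [j for j in nbrs if j != i]
        while seeds:                         # expand through density-reachable points
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id       # border point previously marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbours(points, j, eps)
            if len(j_nbrs) >= min_pts:       # j is also a core point: keep expanding
                seeds.extend(j_nbrs)
        cluster_id += 1
    return labels

if __name__ == "__main__":
    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (25, 25)]  # toy 2-D points
    print(dbscan(pts, eps=2.0, min_pts=3))   # [0, 0, 0, 1, 1, 1, -1]: two clusters, one noise point
```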
Experiments have shown that DBSCAN is faster and more precise than many other algorithms. It
has a time complexity of O(n log n), which is quite low. ‘It holds good for large spatial
databases’ (Parimala et al. 2011). Mumtaz & Duraiswamy say the following about DBSCAN: ‘It
can even find clusters completely surrounded by (but not connected to) a different cluster’. This
is obviously because of the Minpts and Eps requirement.
DBSCAN however has its own limitations. First, since DBSCAN uses a single (global) setting of the parameters Minpts and Eps, it has problems clustering regions of varying density. Sparse (less dense) regions that deserve to be clusters may be interpreted as noise (because they do not have enough points to satisfy the threshold Minpts within distance Eps), while denser regions that should form different clusters may be merged (because the different regions may fall within a single radius Eps of each other). Clustering multi-density data is therefore inaccurate. Secondly, the speed of the algorithm depends on the setting of the density parameters (Eps and Minpts), and it is hard to determine the most appropriate setting.
(A lower value of Eps will result in denser clusters, and a higher value of Eps will result in less dense clusters.) Thus, the main advantage of OPTICS over DBSCAN is that OPTICS handles data points of different densities.
Note that whereas density-function clustering algorithms compute probabilities of data points in the VSM, probability-based clustering algorithms compute probabilities of words in documents (i.e. without applying the VSM).
Examples
Probabilistic Latent Semantic Analysis (PLSA)
Latent Dirichlet Allocation (LDA)
Correlated Topic Model (CTM)
In short, grid clustering concerns forming clusters from contiguous dense cells. Thus, the grid-based approach is related to (or applies) the density-based approach. Grid-based clustering algorithms are usually based on hierarchical agglomeration. Examples of such algorithms are CLIQUE, STING, MAFIA, and WAVECLUSTER.
The basic steps of a grid-based algorithm are as follows (a small sketch in code follows the list):
(i) Divide the input data space into a particular number of cells.
(ii) Calculate the density of each cell.
(iii) Eliminate cells whose density is below the threshold.
(iv) Form clusters from the contiguous remaining cells.
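The following Python sketch follows the four steps above on toy two-dimensional points. The cell size, the density threshold, and the use of 4-neighbour (horizontal/vertical) adjacency in step (iv) are illustrative assumptions.

```python
from collections import defaultdict

def grid_cluster(points, cell_size, density_threshold):
    # Steps (i) and (ii): assign each point to a cell and record the cell's contents.
    cells = defaultdict(list)
    for p in points:
        cell = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells[cell].append(p)

    # Step (iii): keep only the cells whose density reaches the threshold.
    dense = {c for c, members in cells.items() if len(members) >= density_threshold}

    # Step (iv): merge contiguous dense cells (4-neighbour adjacency) into clusters.
    clusters, visited = [], set()
    for start in dense:
        if start in visited:
            continue
        stack, cluster = [start], []
        while stack:
            c = stack.pop()
            if c in visited or c not in dense:
                continue
            visited.add(c)
            cluster.extend(cells[c])
            x, y = c
            stack += [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        clusters.append(cluster)
    return clusters

if __name__ == "__main__":
    pts = [(0.2, 0.1), (0.4, 0.3), (1.1, 0.2), (1.3, 0.4),
           (5.1, 5.2), (5.3, 5.4), (9.9, 9.9)]              # toy 2-D points
    for cluster in grid_cluster(pts, cell_size=1.0, density_threshold=2):
        print(cluster)  # two clusters; the isolated point (9.9, 9.9) falls in a sparse cell and is dropped
```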
The unique property of grid-based clustering approach is that its computational complexity is
independent of the number of data objects, but dependent only on the number of cells in each
dimension in the quantized space.
An obvious limitation of many grid algorithms is that they require some input parameters, e.g.
number of cells (intervals) and density threshold (for CLIQUE described below), and number of
cells and number of levels (for STING described below). Secondly, and as a consequence, the efficiency of the algorithms is strongly determined by the size of the cells: a small cell size will clearly lead to unnecessary computation in regions with sparse points, while a large cell size will lead to inaccuracy in regions with dense points.
According to Rama et al. (2010), a limitation of STING is that its performance relies on the
granularity of the lowest level of the grid structure.
Illustration: Some clusters that are hard to detect using grid algorithms.
If we define the volume of a cluster to be the sum of the volumes of all the non-empty cells of that cluster, then it is clear that the volume of any cluster decreases as we continue partitioning the grid (and the number of surrounding empty cells increases), meaning that different cell sizes can produce different clustering results. Thus, the optimal size of cells is a major problem in grid-based algorithms. Thirdly, and similarly to the second limitation, the efficiency
of an algorithm is determined by the density threshold. It’s hard to determine an optimal
threshold, which also depends on the cell’s size and dimensionality of data. Fourth, determining
which kind of cell adjacency to use (e.g. for 2-D clustering, 4 or 8 neighbours) is hard. Most grid-based algorithms merge cells horizontally and vertically, but never diagonally. This is unlike approaches such as the distance-based and density-based approaches, where finding nearby points is not limited to the horizontal and vertical directions but is done in any direction. And even if some algorithms also merge diagonally, determining the cell adjacency remains a problem, unlike in density-based approaches where this is not an issue. This can greatly affect the performance (i.e. accuracy). Fifth, grid clustering is poor at clustering irregular shapes, unlike density clustering, and so is poor at noise handling. Sixth, just like density-based clustering, grid-based clustering does not ordinarily include overlapping of clusters (i.e. fuzzy clustering) or cluster labeling.