
2: DATA CLUSTERING

2.1: Meaning of Clustering


Data Mining (DM) is a computer science area that can be defined as the extraction of useful information from large (structured) data sets. Text Mining (TM), on the other hand, is a variation of DM that works on unstructured data, typically text documents (e.g. word processing documents and web documents). Thus, text mining can be defined as a computer science area that deals with the extraction of useful information from text documents (i.e. unstructured data sets).

Clustering is a DM or TM technique used to group data items with similar content into an unknown number of groups (or clusters). That is, the useful information extracted from the huge data here is the grouping of the data. Normally, documents within a cluster are more similar to each other than to documents lying in other clusters. Thus, unlike classification, clustering does not use predefined topics, but instead groups documents based on their similarity to each other. It therefore works in an unsupervised manner. A good clustering method should ensure that the intra-class similarity is high while the inter-class similarity is low.

Clustering has traditionally been applied in DM, where the data is well organized (structured, specifically records). But research has also been done on how to apply it to text. For example, in an opinion poll we can retrieve only the positive (agreeing) comments. Also, some web search engines include clustering of text such that the retrieved web search text documents are grouped (into unknown clusters).

Applications of clustering
Some of the many possible applications of text clustering include:
• 'Improving precision and recall in information retrieval' (Wanner 2004, p. 4).
• Organizing web search engine results into meaningful groups.
• Web filtering: removing unwanted web materials.
• In marketing: e.g. grouping Customer Relationship Management (CRM) correspondence.
• In opinion poll mining: grouping opinions into the possible groups.
• Bioinformatics: text mining techniques have been applied to identify and classify molecular biology terms corresponding to instances of concepts under study by biologists.
• 'Land use: Identification of areas of similar land use in an earth observation database' (Stefanowski 2009, p. 18).
• Image processing.
• Pattern recognition.
• Etc.

2.2: Objectives in Clustering


The Clustering Steps
We can generalize the clustering steps as removing the data complexities explained in section 2.2.1 and then applying a traditional and simpler/known data clustering technique on the resulting structured database, i.e.

1. Preprocessing: Typical structured data preprocessing activities include aggregation (combining two or more attributes into a single attribute to solve the duplicate values problem), sampling (using the characteristics of data or models based on a subset of the original data), and dimension reduction (determining dimensions (or combinations of dimensions) that are important for clustering). In clustering text, dimension reduction is done by removing high-dimension-causing words from the text document, i.e. unnecessary words that cause unimportant high dimensions (e.g. introductory words like headings, punctuation marks like commas, frequent words like "the", "of", "and", etc.), as well as by resolving complexities (e.g. words with multiple meanings). This makes the text simpler/easier to structure. For example, removal of too frequent and unimportant words like "and" makes the resulting text simpler and easier to structure. Besides, such words do not usually form the basis of clustering (they don't tell us about the topic that a document concerns). 'Data preprocessing, or data cleansing, is the algorithm that detects and removes errors or inconsistencies from data and consolidates similar data in order to improve the quality of subsequent analyses. This cleaned data will then be fed to the analysis process' (Kongthorn, p. 12).
2. Data Representation: The preprocessed data must then be represented as structured data sets. The most commonly used structure is the vector space model. The preprocessed documents are converted into the structure ready for clustering.
3. Clustering: The clustering technique is then applied on the resulting transformed structure.
4. Evaluation: Here, the user examines and interprets the clustering results. Evaluation is done
on the results.
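The following minimal sketch (in Python, with an illustrative tokenizer and stop-word list rather than a complete preprocessing pipeline) walks through steps 1 and 2 on a toy corpus: frequent/unimportant words and punctuation are stripped, and the remaining terms are arranged into a term-document frequency structure ready for clustering.

```python
# Steps 1-2 on a toy corpus: preprocess, then build a term-document structure.
# The stop-word list is an illustrative placeholder, not a complete one.
import re
from collections import Counter

docs = [
    "Eating fruit improves health.",
    "Give your infant fruit regularly, for the infant to have good health.",
    "Regular exercise improves your health.",
]
stop_words = {"the", "of", "and", "your", "for", "to", "have", "good"}

def preprocess(text):
    # Step 1: lowercase, strip punctuation, drop frequent/unimportant words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Step 2: represent each document as a bag of term frequencies.
bags = [Counter(preprocess(d)) for d in docs]
vocabulary = sorted({t for bag in bags for t in bag})

# Term-document matrix: one row per term, one column per document.
matrix = [[bag[term] for bag in bags] for term in vocabulary]
for term, row in zip(vocabulary, matrix):
    print(term, row)
```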

We can summarize the objectives of data clustering (DC) or text clustering (TC) as follows.
1. Appropriate document representation model: Obtaining a document representation
model that best preserves the semantic relationships between words in the document. Text is
context sensitive and so the exact meaning of a sentence may vary depending on the
sequential representation of words in the text document. Also, algorithms that convert the
unstructured text documents into a structured form should retain the original data.
2. Effectiveness (or accuracy) of clustering: The techniques used by an algorithm should
ensure that “similar” documents are clustered together. In this regard, the measure of
“similarity” between two documents has no exact definition and so is a challenge. It’s
always possible to have two paragraphs discussing the same thing but expressed in different
words because of text ambiguities.
3. Efficiency: An algorithm should operate in a manner that utilizes computing resources well, i.e. the processing should be fast enough and should use minimum memory space, especially given the typically huge number of documents involved in many TC applications.
First, the formulae and logic used should be as brief as possible. Secondly, dimension
reduction of the usual high dimensional textual data is crucial. It’s important to reduce
documents’ sizes (e.g. by removing words that are irrelevant to the topic) to improve
efficiency of operations. This is however not easy to achieve. Obtaining appropriate
algorithms to do this without losing important data is hard.
4. Scalability: An algorithm should be scalable so as to cater for huge data sizes, high
dimensions, and large number of clusters.

5. Robustness (or flexibility): Chandra, E & Anuradha, V (2011) say that 'the algorithm should be effective in processing data with noise and outliers'. In TC (especially using the algorithms that are based on the VSM), it is possible to have:
• clusters with arbitrary (or irregular) shapes,
• clusters with different densities, and
• outliers/noise.
Algorithms should not suffer from these three aspects, but should be able to deal with any type of data with respect to them. An algorithm should be able to identify clusters which have irregular shapes as well as those with different densities. It should also identify any point which doesn't belong to a cluster, or even data which doesn't contain clusters.
6. Interpretability and cluster labels: The produced results should be easy to interpret as clusters. Also, obtaining an appropriate label/topic name for each cluster is desirable but also difficult. It's important to remember that clustering is unsupervised and so cluster labeling is not the user's activity.
7. Usability: An algorithm should preferably have minimum requirements for domain
knowledge: It’s desirable for an algorithm to operate automatically without need for user’s
inputs, e.g. the number of clusters. It’s difficult for the user to determine the best or the
optimum value of these parameters depending on the inputs, and also it’s not practical for
the user to always specify the parameters. However, automating the parameters’ values is
hard to do.
8. Applicability: An algorithm should be able to solve as wide a range of applications as possible. Also, some applications are such that a document may concern several topics, and so should be allowed to belong to all the appropriate clusters (overlapping of document clusters).

Quality (or performance): All the above factors contribute to the final quality of an algorithm.

2.3: Classifications of Clustering Algorithms


Many researchers have attempted to classify clustering algorithms in different ways. We can however summarize the classification of clustering algorithms as based on the following criteria: the text representation method used, the approach used, the hierarchy of the resulting clusters (i.e. either hierarchical or partitioning), the overlapping nature of the produced clusters (i.e. hard or soft algorithms), and the redundancy reduction method (the technique used to make an algorithm deal with higher dimensional data).

2.3.1: Data Representation Method


One criterion for classifying clustering algorithms is the nature of the data being subjected to clustering. Some clustering algorithms use the entire data in the documents, as it is, as the clustering feature. Others, however, map the data into a different representation, and then apply the clustering formulae/logic on this resulting representation rather than on the original data.

There are various text representation methods. Assume a set of n documents containing a total of m attributes/terms that form the basis of clustering (the number of occurrences of the attributes/terms varies across the documents). This data can be interpreted (and hence represented) in the following ways.

(i) Raw documents


Here the entire data in the documents is used, as it is, as the clustering feature, without any mapping.

(ii) Graphs model


A graph consists of nodes connected by edges. Consequently, based on graph theory, the above data can be viewed as a graph of n nodes (one per document) connected by weighted edges. The edge weight between any two nodes in the graph represents the similarity between the two nodes, and hence between the corresponding two documents.

(iii) Probability distribution model


The data can also be viewed as a sample drawn from a probability distribution. This is also known as model-based clustering. The specific rules of the particular distribution will dictate the measure of similarity between two text documents.

(iv) Vector space model


This data is a collection of n vectors/points in a space of dimension m. The Euclidean distance between any two points shows how near the points are to one another, and this represents the similarity between the corresponding two documents. 'The document set comprises an m×n term-document [or attribute-instance] matrix, in which each column represents a document [or instance], and the (i, j)th entry represents a weighted frequency of term i in document j [or the value of attribute i in instance j]' (Berry & Malu 2007).

(v) Matrix model


This data can also be taken to be a matrix with n columns and m rows (or vice versa). Some mathematical calculation on any two columns (e.g. their product) will represent how similar the columns are, and hence how similar the documents are.

Comment
The most commonly used text representation methods are the Vector Space Model (VSM) and the Matrix Model discussed in sub-chapter 2.4 below. The two models are used interchangeably, i.e. we can form a model of either but also refer to the equivalent representation using the other model. Thus, it's usual to form a matrix representation of documents and refer to it as the VSM. The approaches that use the VSM text representation method are the distance-based approach, feature extraction approach, density-based approach, grid-based approach, ontology-based approach, and neural networks-based approach.

2.3.2: Clustering Approach


A clustering approach is the method of determining that an instance or document belongs to a particular cluster or, simply, the similarity measure. In other words, an approach determines how similar an algorithm considers or assumes documents to be. The various approaches used in clustering of text documents can be summarized as follows: the center-based approach, frequent sequence approach, feature selection approach, feature extraction approach, density-based approach, grid-based approach, ontology-based approach, and neural networks approach. For example, the distance-based algorithms measure the Euclidean distance between two documents in the VSM to determine how close the two documents are, so as to decide whether the two documents should be clustered together or not. The approaches are discussed in sub-chapter 2.5 below.

2.3.3: The Hierarchy of Produced Clusters


A clustering algorithm can either be hierarchical or flat in terms of the hierarchy of the
produced clusters (i.e. how the produced clusters are related to one another).

(i) Hierarchical category


Hierarchical-type algorithms produce tree-like clusters, with the leaves of the tree being the individual documents (the lowest level clusters) and the root of the tree being the single highest level cluster that contains all documents. Examples of hierarchical algorithms are BIRCH, CURE and CHAMELEON.

A hierarchical cluster can be produced using two approaches, i.e. agglomerative or divisive
approach.

Agglomerative approach
We cluster documents by moving from the leaves up to the root of the tree. The leaves are the
individual documents/clusters. We subsequently merge pairs of most similar clusters until we
obtain one final cluster (i.e. the root of the tree).

According to Sree (2012, p. 2), a basic approach of an agglomerative clustering is;


1. Assign each item to a cluster, so that if you have N items, you now have N clusters, each
containing just one item. Let the distances (or similarities) between the clusters be the
distances between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that
now you have one less cluster.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
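As a rough illustration of this procedure, the sketch below uses SciPy's hierarchical clustering utilities on four toy 2-D points (the same vectors reused in the KMeans example later in this chapter); single linkage is just one possible choice for the inter-cluster distance in step 3.

```python
# A sketch of agglomerative clustering with SciPy on toy 2-D data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 2], [0, 2], [4, 1], [6, 0]])

# Steps 1-4: start from singleton clusters and repeatedly merge the closest
# pair; 'single' linkage defines cluster distance as the closest-pair distance.
merge_tree = linkage(points, method="single", metric="euclidean")

# Cutting the tree at a chosen number of clusters gives a flat partition.
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)   # cluster id of each point/document
```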

Divisive approach
We cluster documents by moving from the root down to the leaves of the tree. The whole document set is initially considered as a single cluster. We then subsequently divide each cluster into sub-clusters until each cluster contains exactly one document.

According to Rai (2010, p. 3), an approach of divisive clustering is as follows.


1. Put all objects into one cluster.
2. Repeat the following until all clusters are singletons.
(a) Choose a cluster to split.
(b) Replace the chosen cluster with its sub-clusters.

(ii) Flat category (or partitioning category)
According to Steinbach (2010, p. 4), flat-type algorithms produce one-level (i.e. un-hierarchical) partitions of documents. They usually receive the expected number of clusters as a parameter. The most widely used flat algorithm is the K-means algorithm.

2.3.4: The Overlapping Nature


A clustering algorithm can either be hard or soft in terms of the overlapping nature (of the
produced clusters). A hard clustering algorithm is one in which every document can be assigned
to only one cluster (i.e. there is no overlapping of clusters).

On the other hand, a soft (or fuzzy) algorithm is one in which a document can be assigned to more than one cluster (i.e. there is overlapping of clusters). Many applications are such that a document may concern many topics (based on some keywords/attributes), and so should be clustered into many clusters.

2.3.5: Redundancy Reduction Method


We can classify algorithms based on whether they do dimension reduction or not.

2.4: The VSM Representation Method


2.4.1: The Boolean Model
The simplest implementation of the VSM is the Boolean model, whereby a document is regarded simply as a "bag of words" (i.e. a set of words in textual data), or a sequence of numerical values (in structured data). 'In mathematics, a bag, also called a multiset, is a set with duplicates allowed' (Rehurek 2011, p. 7). Here, a collection of n documents is represented using a matrix of m rows and n columns, whereby the rows represent the attributes/terms and the columns represent the instances/documents. In other words, the rows are attribute/term vectors while the columns are instance/document vectors. The ijth entry in the matrix is either a 1 (if the ith attribute/term is present in the jth instance/document) or a 0 (if not). Thus, this 'term-document' matrix is said to be a "bag of words" since it contains repetitions of the values 1 and 0 (in unstructured data). In the space-based view, a document is a data point in a high dimensional space, whereby each attribute/term is an axis of the space.

Example
Assume the following three textual documents with identifiable key terms. These key terms form the basis of identifying what a document talks about, and thus of clustering.

D1: Eating fruit improves health.


D2: Give your infant fruit regularly, for the infant to have good health.
D3: Regular exercise improves your health.

We construct a term-dictionary as T1: fruit, T2: health, T3: infant, T4: exercise. We can then
form a term-document matrix as

A = [ 1  1  0 ]
    [ 1  1  1 ]
    [ 0  1  0 ]
    [ 0  0  1 ]

Here, the first row is the vector (1, 1, 0) representing the first term (fruit), showing that the term
occurs in the first and the second document, but not in the third. Similarly, the vector (1, 1, 0, 0)
represents the first document (that contains the first and the second terms, but not the third and
the fourth terms). And entry A42 is 0, showing that the fourth term (exercise) is not present in the
second document.

2.4.2: Modification of the VSM to Use Frequencies


A limitation of the Boolean model is that the relevance of a term is a binary decision. Consequently, the VSM is modified so that we use word frequencies. Here, the ijth entry in the attribute/term-document matrix represents the frequency of the ith term in the jth document (in TC) or the magnitude of the value of the ith attribute in the jth instance (in DC). This provides more information about terms, and allows us to apply partial matching, i.e. to compute a continuous degree of similarity between queries and documents, in query-based systems.

Example
Using the example in section 2.4.1 above, the term-document matrix using frequencies is

A = [ 1  1  0 ]
    [ 1  1  1 ]
    [ 0  2  0 ]
    [ 0  0  1 ]

The difference is that the third term (infant) occurs twice in the second document.
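The sketch below (plain Python, assuming simple whitespace tokenization) builds both the frequency and the Boolean term-document matrices for the three example documents and the four-term dictionary above.

```python
# Build the frequency and Boolean term-document matrices for the example.
docs = [
    "Eating fruit improves health",
    "Give your infant fruit regularly, for the infant to have good health",
    "Regular exercise improves your health",
]
terms = ["fruit", "health", "infant", "exercise"]

def frequency_matrix(terms, docs):
    # Entry (i, j) = number of times term i occurs in document j.
    return [[doc.lower().split().count(t) for doc in docs] for t in terms]

F = frequency_matrix(terms, docs)
B = [[1 if f > 0 else 0 for f in row] for row in F]   # Boolean model

print("frequency:", F)   # [[1, 1, 0], [1, 1, 1], [0, 2, 0], [0, 0, 1]]
print("boolean:  ", B)   # [[1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1]]
```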

2.5: Clustering Algorithms


This section consequently reviews the existing clustering algorithms and places each under the
appropriate category. The various approaches are center-based methods, frequent sequence
methods, feature selection and feature extraction methods, density-based methods, grid-based
methods, ontology-based methods, and neural networks-based methods.

2.5.1: Center-based (or distance-based) Methods


Center-based methods find clusters by measuring the distance between each data point from a
center. ‘Any distance metric applicable to a multidimensional vector space is applicable, but two
methods are widely used: Euclidean distance and cosine measure’ (Weiss 2006, p. 34).

The Euclidean measure


The Euclidean measure applies the VSM, whereby a document is a data point (or a vector) in a high dimensional space and each attribute/term is an axis of the space. Thus, the similarity of two instances/documents is the distance between the two points in the space, i.e. the length of the straight line between them. The limitation of the Euclidean measure, according to Weiss (2006, p. 34), is that it requires normalization of the length of each vector so as to avoid distortion of the results. Thus, the cosine measure is often used instead.

The cosine measure


In the cosine measure, the similarity between documents x and y can be expressed as the cosine of the angle between the two document vectors. To understand this, two parallel vectors have an angle of 0 between them, and so their similarity measure is cos(0) = 1, meaning they are fully similar. If you rotate one of them so that the angle goes towards 90 degrees, the cosine decreases towards cos(90) = 0, the point at which the two vectors are perpendicular and so fully dissimilar (the same happens under counter rotation, i.e. from 360 down to 270 degrees). And an angle between 90 and 270 degrees (both values exclusive) gives a negative cosine measure, meaning the documents point in opposite directions.
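A minimal sketch of the two measures in plain Python, applied to two of the document vectors used in the examples in this chapter:

```python
# Euclidean distance and cosine similarity between two document vectors.
import math

def euclidean_distance(x, y):
    # Length of the straight line between the two points in the VSM.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # cos(angle) between the vectors: 1 = same direction, 0 = perpendicular.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

d1, d2 = (1, 2), (0, 2)            # two document vectors
print(euclidean_distance(d1, d2))  # 1.0
print(cosine_similarity(d1, d2))   # about 0.894
```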

(i) The KMeans Algorithm


The KMeans algorithm is among the most popular clustering algorithms, and has several variations. It was developed by J. MacQueen in 1967. It's a partitioning, similarity-based flat algorithm whose objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean or centroid of the documents in a cluster. According to Punitha (2012, p. 2), the KMeans algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The centroid of a cluster is the average of all the points in the cluster, and nearness is determined by a distance measure, e.g. the Euclidean distance measure. The steps of the algorithm are:
1. Choose the number of clusters, k.
2. Randomly generate k clusters and determine the cluster centers (centroids).
3. Repeat the following until no object moves (i.e. no object changes clusters)
(i) Determine the distance of each object to all centroids.
(ii) Assign each point to the nearest centroid.
(iii) Re-compute the new cluster centers.

Thus, in each loop of step 3 above, the algorithm aims at minimizing the following objective function for k clusters and n data points:

J = Σ_{j=1}^{k} Σ_{i=1}^{n} ||x_i - c_j||²

where ||x_i - c_j|| is a chosen distance measure between data point x_i and the center c_j of cluster j.

The complexity of the K Means algorithm is O(n*i*c*t), where n = number of points, i = number of iterations, c = number of clusters, and t = number of terms. Thus, the algorithm has a linear time complexity.

Example
Assume four documents containing two terms, whereby the first term occurs with frequencies 1, 0, 4, and 6 respectively in the documents, while the second term occurs with frequencies 2, 2, 1, 0 respectively. Thus, the four document vectors are (1, 2), (0, 2), (4, 1), and (6, 0), i.e. with term-document matrix

A = [ 1  0  4  6 ]
    [ 2  2  1  0 ]

We choose k=2, and the first two points (i.e. (1, 2), (0, 2)) as the initial first and second
centroids.

First loop
We compute the distance matrix (containing the distance of each point from each centroid) to be

D1 = [ 0     1     3.16  5.39 ]
     [ 1     0     4.12  6.32 ]

The first row of D shows the distance of each point from the first centroid, and the second row
shows the distance of each point from the second centroid.

Here, the point (1, 2) has distance ((1-1)² + (2-2)²)^(1/2) = 0 from centroid (1, 2), and distance ((1-0)² + (2-2)²)^(1/2) = 1 from centroid (0, 2).
The point (0, 2) has distance ((0-1)² + (2-2)²)^(1/2) = 1 from centroid (1, 2), and distance ((0-0)² + (2-2)²)^(1/2) = 0 from centroid (0, 2).
The point (4, 1) has distance ((4-1)² + (1-2)²)^(1/2) = 3.16 from centroid (1, 2), and distance ((4-0)² + (1-2)²)^(1/2) = 4.12 from centroid (0, 2).
The point (6, 0) has distance ((6-1)² + (0-2)²)^(1/2) = 5.39 from centroid (1, 2), and distance ((6-0)² + (0-2)²)^(1/2) = 6.32 from centroid (0, 2).

We then form the clusters by assigning each point to its nearest centroid. We form the group matrix G by giving each point the value 1 in the row of the cluster it belongs to, and the value 0 otherwise. Note that the first row represents the first cluster, and the second row the second cluster. E.g., the third point (4, 1) has distance 3.16 from the first centroid and distance 4.12 from the second centroid, meaning it's nearer to the first centroid. So we set the third column of G below to (1, 0). Thus,
G1 = [ 1  0  1  1 ]
     [ 0  1  0  0 ]

This shows that the first, third and fourth points belong to the first cluster, and the second point
to the second cluster.

We then recompute the centroid of each cluster as the average of the points in that cluster. Thus,
first centroid is ((1+4+6)/3, (2+1+0)/3) which is (3.67, 1), while the second centroid is (0, 2).

Second loop
We then start the second loop of the algorithm and compute the distance matrix to be

D2 = [ 2.85  3.80  0.33  2.54 ]
     [ 1     0     4.12  6.32 ]

Here, the point (1, 2) has distance ((1-3.67)² + (2-1)²)^(1/2) = 2.85 from centroid (3.67, 1), and distance ((1-0)² + (2-2)²)^(1/2) = 1 from centroid (0, 2).
The point (0, 2) has distance ((0-3.67)² + (2-1)²)^(1/2) = 3.80 from centroid (3.67, 1), and distance ((0-0)² + (2-2)²)^(1/2) = 0 from centroid (0, 2).
The point (4, 1) has distance ((4-3.67)² + (1-1)²)^(1/2) = 0.33 from centroid (3.67, 1), and distance ((4-0)² + (1-2)²)^(1/2) = 4.12 from centroid (0, 2).
The point (6, 0) has distance ((6-3.67)² + (0-1)²)^(1/2) = 2.54 from centroid (3.67, 1), and distance ((6-0)² + (0-2)²)^(1/2) = 6.32 from centroid (0, 2).

Thus, the first point changes into the second centroid/cluster since it’s now distance 2.85 from
the first centroid (3.67, 1) compared to distance 1 from the second centroid (0, 2). We therefore
compute the new group matrix to be
G2 = [ 0  0  1  1 ]
     [ 1  1  0  0 ]

We then recompute the centroid of each cluster as the average of the points in that cluster. Thus,
first centroid is ((4+6)/2, (1+0)/2) which is (5, 0.5), while the second centroid is ((1+0)/2,
(2+2)/2) which is (0.5, 2).

Third loop
We then start the third loop of the algorithm and compute the distance matrix to be

D3 = [ 4.27  5.22  1.12  1.12 ]
     [ 0.5   0.5   3.64  5.85 ]
Here, the point (1, 2) has distance ((1-5)² + (2-0.5)²)^(1/2) = 4.27 from centroid (5, 0.5), and distance ((1-0.5)² + (2-2)²)^(1/2) = 0.5 from centroid (0.5, 2).
The point (0, 2) has distance ((0-5)² + (2-0.5)²)^(1/2) = 5.22 from centroid (5, 0.5), and distance ((0-0.5)² + (2-2)²)^(1/2) = 0.5 from centroid (0.5, 2).
The point (4, 1) has distance ((4-5)² + (1-0.5)²)^(1/2) = 1.12 from centroid (5, 0.5), and distance ((4-0.5)² + (1-2)²)^(1/2) = 3.64 from centroid (0.5, 2).
The point (6, 0) has distance ((6-5)² + (0-0.5)²)^(1/2) = 1.12 from centroid (5, 0.5), and distance ((6-0.5)² + (0-2)²)^(1/2) = 5.85 from centroid (0.5, 2).

Thus,
G3 = [ 0  0  1  1 ]
     [ 1  1  0  0 ]

And so there is no change of the clusters’ grouping, and so we stop.

Conclusion
The first two points (thus documents) i.e. (1, 2), (0, 2) are in the second cluster while the last two
documents i.e. (4, 1), and (6, 0) are in the first cluster.
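For comparison, the sketch below reruns the same example with scikit-learn's KMeans implementation, seeded with the same two initial centroids. The numeric labels assigned to the clusters are arbitrary, but the grouping matches the manual computation above: (1, 2) and (0, 2) end up together, as do (4, 1) and (6, 0).

```python
# Rerun the worked example with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [0, 2], [4, 1], [6, 0]])        # document vectors
km = KMeans(n_clusters=2, init=points[:2], n_init=1, max_iter=100)
labels = km.fit_predict(points)

print(labels)               # e.g. [1 1 0 0]: first two points in one cluster
print(km.cluster_centers_)  # final centroids, (5, 0.5) and (0.5, 2)
```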

Illustration of the clustering


Remember our original data was (1, 2), (0, 2), (4, 1), and (6, 0), i.e. with term-document matrix

A = [ 1  0  4  6 ]
    [ 2  2  1  0 ]

We can illustrate the immediate above clustering example using the space-based view as follows.

Since there are two terms, we have a two dimensional space whereby the x axis represents the
first term while the y axis represents the second term. Each document is a point on the xy space.

Note that in the figure, document points, centroids, and cluster boundaries are each drawn with their own marker. The three panels correspond to the three loops: Loop 1 with centroids (1, 2) and (0, 2); Loop 2 with centroids (3.67, 1) and (0, 2); and Loop 3 with centroids (5, 0.5) and (0.5, 2).
Figure 2.3: Illustration of KMeans clustering

Strengths and limitations of KMeans algorithm


The advantage of the K Means algorithm is that it’s simple to understand and implement.
Secondly, the K-means algorithm is efficient in memory requirements since the system only
needs to store the data points with their features, the centroids of K clusters and the membership
of data points to the clusters. According to Geraci (2008, p. 18), K Means is fast for small
document sizes. According to Rosell (2009, p. 31), the time complexity of the K Means
algorithm is O(knI), where k is the number of clusters, n the number of objects and I the number
of iterations (which is dependent on the stopping criterion). Thus, the algorithm is very efficient.
It’s also scalable. From the research done by Osama (2008), the quality of K Means algorithm
becomes very good with huge data sets, meaning that K Means is scalable.

Its limitation is that the user must specify the initial number of clusters before clustering. It is not
trivial for the user to determine a reasonable number of clusters depending on the particular set
of documents. It’s obvious that different data sets have different number of documents with
varying topics, and thus different groups. ‘A major problem with partitioning algorithms is
selecting an appropriate number of output clusters’ (Lasek 2011, p. 17). Further, the algorithm
forces the documents to have that particular number of clusters (i.e. K) no matter how many
topics are contained in the documents. And this is a major limitation. Secondly and equally very
important regarding the accuracy of KMeans, we can observe from the algorithm's logic above that the algorithm lacks robustness, i.e. it is poor with irregularly shaped data and does not detect outliers. Stefanowski (2009, p. 48) agrees with this by saying that the K Means algorithm is too sensitive to outliers. Let's have an illustration of this.

Figure 2.4: Illustration of groups with irregular shapes and different densities, plus an outlier point (hard to detect in KMeans). Panels (a) and (b) each show 2 groups; the outlier point lies outside the groups in (b).

The two obvious groups of points in each pair are of different shapes. Thus, some points in a group could be nearer (in distance) to the center (or centroid) of the other group (or cluster) than to their own, based on the K Means algorithm. In other words, some points (in irregularly shaped regions) may be close to one another, yet in different groups. Thus, it's very hard for the K Means algorithm to detect the clusters since it's purely based on distance measurements. And so, wrong clustering could happen, as illustrated below.

Figure 2.5: Illustration of wrong clustering by KMeans (each panel shows 2 clusters, labelled cluster A and cluster B).

Also from the logic of KMeans, the outlier point shown at the bottom right corner of the second
pair of clusters will not be interpreted as noise (not belonging to any cluster), but will rather be
assigned to a cluster. This is because KMeans assigns each point to its nearest centroid, no matter
how far the centroid is. Thus, KMeans is not able to identify outliers/noise in data, and this further distorts the calculation of the mean distances of points from their centroids.

Thirdly, it doesn't include dimension reduction. Fourth, according to Li (2007, p. 22), there is no description of a cluster's contents (i.e. no cluster labels), so the contents can't be utilized more efficiently. Still, especially when the data points are few, different initial centroids are likely to give different clusterings.

In an attempt to improve the K Means algorithm, other similar algorithms have been developed.
All these can be considered as the K Means family. They include the following.

K Means variations

• The K Medians Algorithm: Works just like the K Means algorithm, except that we compute the median instead of the mean of each cluster as the centroid. As a result, there is less effect of extreme values (i.e. outliers). But the algorithm suffers from the other limitations of the K Means algorithm.
• The K Medoids Algorithm: As opposed to the K Means algorithm, whose centers (or centroids) may not be data points (centroids are obtained as the mean value of all data points in a cluster), the K Medoids algorithm chooses data points as centers (medoids). I.e., each cluster is represented by one of the objects in the cluster (the representative object). A medoid is the data point that is most centrally located in a cluster, meaning that its average dissimilarity to all other objects in the cluster is the minimum (compared to that of the other points). This makes the K Medoids algorithm less sensitive to outliers than the K Means algorithm. However, this algorithm is less efficient than the K Means algorithm (the step of computing medoids, specifically their dissimilarity to all other objects, is harder than computing means), and it also retains the other limitations of K Means (except the problem of outliers).
• The Bisecting K Means Algorithm: Whereas the K Means algorithm splits a cluster into k sub-clusters, the Bisecting K Means algorithm splits a cluster into two sub-clusters in a divisive hierarchical manner, but using the K Means type of clustering. Some researchers have found this algorithm to be more accurate than KMeans, but it still suffers the limitations of the K Means algorithm.
• The Partitioning Around Medoids (PAM) algorithm
• CLARA (Clustering LARge Applications)
• CLARANS (Clustering Large Applications based on RANdomized Search)
• The Kernel K Means Algorithm

2.5.2: The Frequent Sequence Algorithms


These algorithms measure the similarity of (usually textual) documents by using the frequent sequences of elements in the documents (which are low dimensional), instead of using the high dimensional original data. Thus, they achieve redundancy reduction using the frequent sequences. The idea is that a frequent sequence is a description of a cluster, and the cluster corresponds to all the documents containing that frequent sequence. Thus, the documents in a cluster are those that share its frequent sequence more than the documents in other clusters do.

For example, in a frequent word set type of these algorithms, a word set consists of some words, e.g. "purchase, car". The word set "purchase, car" occurs in the following three sentences:
"Please purchase car for me"
"I went to purchase the requested car"
"Frequent car purchase is unnecessary"
A word set is frequent if the number of documents containing the words in this set is at least the specified threshold.
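A minimal sketch of this idea in Python (the candidate word set, the threshold value, and the toy sentences are illustrative only): a candidate word set is checked against each document, and if enough documents contain it, those documents form a candidate cluster labelled by the word set.

```python
# Frequent word set check: which documents contain every word in the set?
docs = [
    "Please purchase car for me",
    "I went to purchase the requested car",
    "Frequent car purchase is unnecessary",
    "Regular exercise improves your health",
]

def supporting_docs(word_set, docs):
    # Indices of documents that contain every word in the set (any order).
    return [i for i, d in enumerate(docs)
            if word_set <= set(d.lower().split())]

threshold = 3
candidate = {"purchase", "car"}
support = supporting_docs(candidate, docs)
if len(support) >= threshold:
    print("frequent word set", candidate, "-> cluster of documents", support)
```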

Strengths and limitations


The advantage of these clustering approaches is that there is dimension reduction of the document sizes. Only the frequent word sequences or phrases are used to cluster the documents, rather than the entire documents. Secondly, there is a description of a cluster's contents by the use of a cluster's label. The label is the set of frequent words shared by documents in that cluster.

2.5.3: Feature Extraction Algorithms


These are algorithms used to reduce the dimensions of the instances/documents being clustered by obtaining (or extracting) only some important features from the original documents and clustering on those, thus achieving redundancy reduction. They do this through feature transformation (using a linear transformation), i.e. defining new features to represent the original features (or data set). Here, the correlations among the words in the data set are leveraged in order to create features, which correspond to the concepts in the data.

Strengths and limitations of feature extraction


One advantage of these methods is that the dimension of the documents is reduced drastically, thus improving efficiency (although the transformation computations themselves are expensive). Secondly, such algorithms are able to discover deep relations between attributes/terms and instances/documents.

However, they have their limitations. First, feature extraction methods involve expensive computations, which waters down their efficiency. Secondly, compared with other approaches, feature extraction methods suffer from a common limitation in that the generated new features may be difficult to interpret.

The Latent Semantic Analysis (LSA)


LSA projects an original vector space or term-document matrix into a small factor space (Lee 2010, p. 2). It is a feature extraction clustering technique that applies Singular Value Decomposition (SVD) to reduce the dimension of the term-document matrix. SVD applies a mathematical rule that specifies that an m by n rectangular matrix, say A, can be broken down into the product of three matrices, say U, S and V^T, i.e. A(m×n) = U(m×m) S(m×n) V^T(n×n), whereby
• U is an m by m orthogonal matrix whose columns are orthonormal eigenvectors of AA^T,
• S is an m by n diagonal matrix containing the singular values (the square roots of the eigenvalues of AA^T and A^TA) in descending order, and
• V is an n by n orthogonal matrix whose columns are orthonormal eigenvectors of A^TA.

In LSA, we decompose a weighted (e.g. with term frequencies) term-document matrix A into the three matrices U, S, and V^T. Here, U is the term matrix while V^T is the document matrix. In other words, matrix U describes the original row entities (the terms) as vectors of derived orthogonal factor values, while matrix V^T describes the original column entities (the documents) as vectors of derived orthogonal factor values. Another characteristic of the vectors of U and V^T is that they have mixed signs (i.e. some values are positive and others negative). Also, the set of rows (representing terms) usually reveals some semantic relations among the terms (e.g. synonyms). LSA is able to reveal deep/hidden (i.e. latent) relations among data items, hence the name Latent Semantic Analysis. And the set of columns (representing documents) reveals some clusters. Also, topics are represented by the factors of the matrix, and consequently, the relationships between documents are represented by the orthogonal characteristics of the factors. Thus, words in a factor have little relation with words in other factors, but words in a factor have high relations with other words in that factor.

We then reconstruct the original matrix A by multiplying U, S and V^T, but using only the vectors of U and V^T that are associated with the larger singular values of S (we ignore the others). We can specify a particular threshold on these singular values to massively reduce A's dimension. In this way, A's rank is reduced and so the dimension reduction need of text mining clustering is accomplished. In other words, we project an original vector space (a term-document matrix in this case) into a small factor space. SVD thus gives a reduced representation of the original text data.
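The sketch below illustrates this with NumPy's SVD on the frequency term-document matrix from section 2.4.2; the choice of k = 2 factors is arbitrary.

```python
# LSA-style dimension reduction of a term-document matrix via truncated SVD.
import numpy as np

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 2, 0],
              [0, 0, 1]], dtype=float)    # terms x documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of latent factors kept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction of A

# Documents can now be compared in the k-dimensional factor space instead of
# the original m-dimensional term space.
doc_factors = np.diag(s[:k]) @ Vt[:k, :]       # one k-dim column per document
print(A_k.round(2))
print(doc_factors.round(2))
```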

2.5.4: Density-based Methods


Rather than measuring distances between data points (as in distance-based algorithms), density-based algorithms find clusters by differentiating regions in terms of the relative density (compactness, concentration, or number of objects per unit area) of the VSM points in them. Thus, regions adjacent to a cluster contain data points of a different (typically lower) concentration. Some clusters that would be easily detected by a density-based clustering algorithm are illustrated below.

Figure 2.6: Some clusters that are easy to find in density clustering (5 clusters, 2 clusters, and 3 clusters with overlapping; the overlapping case is hard in density clustering).

Density-based algorithms are partitioning-type algorithms.

Strengths and limitations of density-based methods


A common advantage that we can identify from the logic of density-based algorithms as well as
from the immediate above illustration is that the algorithms can discover clusters of irregular
shapes because a cluster is represented as a connected dense region that can grow in any
direction that density leads. Consequently and secondly, density-based algorithms are less sensitive to outliers/noise (because of this ability to discover arbitrary shapes). For example, the clusters in the illustration above (some with arbitrary shapes) will be a big problem for other algorithms like the K Means family, but not for density-based algorithms. Growing a cluster only through a connected dense region of points naturally excludes any outlier from being included in a cluster. Also, the number of clusters does not have to be specified, unlike in the K Means family of algorithms. Another key property of these algorithms is that they require only one scan of the input data set. This increases processing speed.

A limitation of density-based algorithms is that there are initial parameters that need to be set, and it's hard to specify the most appropriate setting. Secondly, there is no cluster labeling. Third, and as illustrated immediately above, the algorithms do not allow cluster overlapping.

The DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm
This algorithm was presented by Ester, Kriegel, Sander, and Xu in 1996. According to Nagpal & Mann (2011), the DBSCAN algorithm takes in two inputs, i.e. the radius of the cluster (Eps) and the minimum number of points required inside the cluster (Minpts). Consequently, there are three types of points: a core point (a point of a cluster whose neighborhood is dense enough with respect to Eps and Minpts, i.e. it has a minimum of Minpts points within radius Eps; a core point is thus inside the cluster), a border point (a point on the border of the cluster, but within the neighborhood of a core point; in other words, a neighbor of a core point which is not a core point itself), and noise (a point which is neither a core point nor a border point). A point q is directly density-reachable from a point p (or is in the neighborhood of p) if it is not farther away from p than distance Eps and p is a core point. And q is called density-reachable from p if there exists a sequence p1, p2, …, pn of points with p1 = p and pn = q where each pi+1 is directly density-reachable from pi.

According to Mumtaz & Duraiswamy, DBSCAN clusters by arbitrarily selecting a starting point p that has not been visited, and then finding all neighbor points within distance Eps of p. Then:
• If the number of neighbors is at least Minpts, a cluster is formed. The point p and its neighbors are added to this cluster and p is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
• If the number of neighbors is less than Minpts, then p is marked as noise.
• If a cluster is fully expanded (i.e. all points within reach are visited), the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
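A short usage sketch with scikit-learn's DBSCAN implementation (the point coordinates and the Eps/Minpts values are illustrative only); points labelled -1 are reported as noise:

```python
# DBSCAN on toy points: eps plays the role of Eps, min_samples of Minpts.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1, 2], [1.2, 2.1], [0.9, 1.8],     # a dense region
                   [8, 8], [8.1, 7.9], [7.9, 8.2],     # another dense region
                   [4, 5]])                            # an isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1]: two clusters and one noise point
```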

Experiments have shown that DBSCAN is faster and more precise than many other algorithms. It
has a time complexity of O(n log n), which is quite low. ‘It holds good for large spatial
databases’ (Parimala et al. 2011). Mumtaz & Duraiswamy say the following about DBSCAN: ‘It
can even find clusters completely surrounded by (but not connected to) a different cluster’. This
is obviously because of the Minpts and Eps requirement.

DBSCAN however has its own limitations. First, since DBSCAN uses a single global value of the parameters Minpts and Eps, it has problems clustering multi-density regions. That means that sparse (i.e. less dense) regions that deserve to be clusters may be interpreted as noise (because they don't have enough points to satisfy the threshold Minpts within distance Eps), while denser regions that should be in different clusters may be clustered together (because the different regions may fall within a single radius Eps). This means that clustering multi-density data is inaccurate. Secondly, the speed of the algorithm depends on the setting of the density parameters (Eps and Minpts), and it's hard to determine the most appropriate setting.

OPTICS (Ordering Points To Identify the Clustering Structure)


The OPTICS algorithm was presented by Michael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999. It uses the idea of DBSCAN, but includes finding clusters of different densities. It orders the data points being clustered linearly so that the closest points become neighbors in the linear order. Also, the closest (i.e. denser) points are given priority in the clustering. This idea of having an order in which data objects are clustered, i.e. the denser points are clustered first, is known as density-based cluster ordering. The density-based cluster ordering is achieved by using multiple values of the distance parameter Eps (a lower value of Eps will result in denser clusters, and a higher value of Eps will result in less dense clusters). Thus, the main advantage of OPTICS over DBSCAN is that OPTICS handles data points of different densities.
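A corresponding usage sketch with scikit-learn's OPTICS implementation (again on illustrative toy data); note that no single global Eps has to be supplied:

```python
# OPTICS on toy points of different densities.
import numpy as np
from sklearn.cluster import OPTICS

points = np.array([[1, 2], [1.2, 2.1], [0.9, 1.8],     # a dense region
                   [8, 8], [8.5, 7.5], [7.5, 8.5],     # a sparser region
                   [4, 5]])                            # an isolated point
clusterer = OPTICS(min_samples=3).fit(points)
print(clusterer.labels_)        # cluster id per point, -1 for noise
print(clusterer.reachability_)  # the density-based ordering information
```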

2.5.5: Probability-based Methods


These methods use probabilities to cluster documents. The methods are also known as 'model-based' methods. The goal is to find the most likely cluster for a data element, i.e. they find the probability with which a data point belongs to a cluster. In this case, the documents' data is regarded as belonging to a certain probability distribution, and the area around the mean of a distribution constitutes a natural cluster. Therefore, we associate the cluster with the corresponding distribution's statistics, e.g. mean, variance, standard deviation, etc.

Note that whereas density-based clustering algorithms work with densities of data points in the VSM, probability-based clustering algorithms compute probabilities of words in documents (i.e. without applying the VSM).

Strengths and limitations of probability methods


‘One clear advantage of probabilistic methods is the interpretability of the constructed clusters.
Probability clustering results to easily interpretable cluster system’ (Rai 2010, p. 4). However,
probability clustering algorithms have their limitation in that they require expensive
computations thus are less efficient. According to Lee (2010, p. 5), a limitation of CTM is that it
requires complex computations.

Examples
Probabilistic Latent Semantic Analysis (PLSA)
Latent Dirichlet Allocation (LDA)
Correlated Topic Model (CTM)
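As a rough illustration of the model-based idea, the sketch below fits LDA (one of the examples above) with scikit-learn on a tiny toy corpus; the number of topics and the documents themselves are illustrative assumptions. Each document receives a probability distribution over topics, and assigning it to its most probable topic gives a (soft) clustering.

```python
# Probability/model-based text clustering with LDA on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "eating fruit improves health",
    "give your infant fruit regularly for good health",
    "regular exercise improves your health",
    "daily exercise and sport keep you fit",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

topic_probs = lda.transform(counts)      # per-document topic probabilities
print(topic_probs.round(2))
print(topic_probs.argmax(axis=1))        # most probable topic per document
```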

2.5.6: Grid-based Methods


Grid-based clustering methods quantize the VSM space of the data points into a finite number of
cells to form a grid structure. The cells containing at least the minimum number of points are
considered dense. Then the dense cells are connected to form clusters. Thus, there are no
distance computations on the data points, and shapes are restricted to the union of the grid cells.
The clustering is based not on the data points, but on the value space surrounding the data points.

Figure 2.7: Some 4*4 grids

In short, grid clustering concerns forming clusters from contiguous dense cells. Thus, the grid-based approach is related to (or applies) the density-based approach. Grid-based clustering algorithms are usually hierarchical agglomeration based. Examples of such algorithms are CLIQUE, STING, MAFIA, and WAVECLUSTER.

The basic steps of a grid algorithm are:
(i) Divide the input data space into a particular number of cells.
(ii) Calculate the density of each cell.
(iii) Eliminate the cells with densities less than the threshold.
(iv) Form clusters from the contiguous (dense) cells.
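A minimal sketch of these four steps on 2-D points (the cell size, the density threshold, and the points themselves are illustrative; cell adjacency is restricted to horizontal/vertical neighbors, a point discussed further below):

```python
# Grid clustering: quantize, compute cell densities, drop sparse cells,
# then merge contiguous dense cells into clusters.
from collections import defaultdict

points = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.8, 2.2),
          (8.0, 8.0), (8.1, 7.9), (8.3, 8.6), (8.2, 8.8),
          (4.0, 5.0)]
cell_size, threshold = 1.0, 2

# (i) + (ii): assign points to cells and count the density of each cell.
density = defaultdict(int)
for x, y in points:
    density[(int(x // cell_size), int(y // cell_size))] += 1

# (iii): keep only the dense cells.
dense = {c for c, n in density.items() if n >= threshold}

# (iv): merge contiguous dense cells into clusters (flood fill over the
# horizontal and vertical neighbors of each dense cell).
clusters, seen = [], set()
for cell in dense:
    if cell in seen:
        continue
    stack, cluster = [cell], set()
    while stack:
        cx, cy = stack.pop()
        if (cx, cy) in seen or (cx, cy) not in dense:
            continue
        seen.add((cx, cy))
        cluster.add((cx, cy))
        stack.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])
    clusters.append(cluster)

print(clusters)   # e.g. [{(1, 2)}, {(8, 8)}]: sparse cells are dropped
```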

Strengths and limitations of grid-based algorithms


The advantage of the grid-based methods is that they are very efficient. The processing time is much less even with large data sets. According to Lasek (2011, p. 20), in high dimensional space this approach is more efficient than the density-based approach. The processing time is fast and dependent only on the number of cells, i.e. they have a complexity of O(number of cells) rather than O(number of documents, i.e. n). Fung (1999, p. 20) says that

The unique property of grid-based clustering approach is that its computational complexity is independent of the number of data objects, but dependent only on the number of cells in each dimension in the quantized space.

An obvious limitation of many grid algorithms is that they require some input parameters, e.g. the number of cells (intervals) and the density threshold (for CLIQUE, described below), or the number of cells and the number of levels (for STING, described below). Consequently and secondly, the efficiency of the algorithms is strongly determined by the size of the cells: a small cell size will clearly lead to unnecessary computation in regions with sparse points, while a large cell size will lead to inaccuracy in regions with dense points. According to Rama et al. (2010), a limitation of STING is that its performance relies on the granularity of the lowest level of the grid structure.

Illustration: Some clusters that are hard to detect using grid algorithms.

Figure 2.8: Illustration of the cell adjacency problem in grid clustering (panels (a) and (b): 5 clusters, 2 clusters, and 3 clusters with overlapping).

If we define the volume of a cluster to be the sum of the volumes of all the non-empty grid cells of that cluster, then it is clear that the volume of any cluster will decrease as we continue partitioning the grid (and the number of surrounding empty cells increases), meaning that different clustering results can be produced by different cell sizes. Thus, the optimal size of cells is a major problem in grid-based algorithms. Thirdly, and similarly to the second limitation, the efficiency of an algorithm is determined by the density threshold. It's hard to determine an optimal threshold, which also depends on the cell size and the dimensionality of the data. Fourth, determining the kind of cell adjacency to use (e.g. for 2D clustering, using 4 or 8 neighbors, etc.) is hard. Most grid-based algorithms cluster cells horizontally and vertically, but never diagonally. This is unlike other approaches, such as the distance and density approaches, whereby finding nearby points is not limited to horizontal and vertical directions, but is done in any direction. But even for algorithms that also cluster diagonally, determining the cell adjacency is a problem, unlike in density-based approaches where this is not an issue. This obviously might greatly affect the performance (i.e. accuracy). Fifth, grid clustering is poor in clustering irregular shapes, unlike density clustering, and so is poor in noise handling. Sixth, and just like density-based clustering, grid-based clustering does not ordinarily include overlapping of clusters (or fuzzy clustering) or cluster labeling.

