
2: DATA CLUSTERING

2.1: Meaning of Clustering


Data Mining (DM) is a computer science area that can be defined as the extraction of useful information from large (structured) data sets. Text Mining (TM), on the other hand, is a variation of DM that works on unstructured data, typically text documents (e.g. word processing documents and web documents). Thus, text mining can be defined as a computer science area that deals with the extraction of useful information from text documents (i.e. unstructured data sets).

Clustering is a DM or TM technique used to group data items with similar content into an unknown number of groups (or clusters). That is, the useful information extracted from the huge data here is the grouping of the data. Normally, documents within a cluster are more similar to each other than to documents lying in other clusters. Thus, unlike classification, clustering does not use predefined topics, but instead groups documents based on their similarity to each other. It therefore works in an unsupervised manner. A good clustering method should ensure that the intra-class similarity is high while the inter-class similarity is low.

Clustering has traditionally been applied in DM, where the data is well organized (structured, specifically records). But research has also been done on how to apply it to text. For example, in an opinion poll we can retrieve only the positive (agreeing) comments. Also, some web search engines include clustering of text such that the retrieved web search text documents are grouped (into unknown clusters).

Applications of clustering
Some of the many possible applications of text clustering include:
• 'Improving precision and recall in information retrieval' (Wanner 2004, p. 4).
• Organizing web search engine results into meaningful groups.
• Web filtering: removing unwanted web materials.
• In marketing: e.g. grouping Customer Relationship Management (CRM) correspondence.
• In opinion poll mining: grouping opinions into the possible groups.
• Bioinformatics: text mining techniques have been applied to identify and classify molecular biology terms corresponding to instances of concepts under study by biologists.
• 'Land use: Identification of areas of similar land use in an earth observation database' (Stefanowski 2009, p. 18).
• Image processing.
• Pattern recognition.
• Etc.

2.2: Objectives in Clustering


The Clustering Steps
We can generalize the clustering steps as removing the data complexities explained in section 2.2.1 and then applying a traditional and simpler/known data clustering technique on the resulting structured database, i.e.

1. Preprocessing: Typical structured data preprocessing activities include aggregation (combining two or more attributes into a single attribute to solve the duplicate values problem), sampling (using the characteristics of data or models based on a subset of the original data), and dimension reduction (determining dimensions (or combinations of dimensions) that are important for clustering). In clustering text, dimension reduction is done by removing high-dimension-causing words from the text document, i.e. unnecessary words that cause unimportant high dimensions (e.g. introductory words like headings, punctuation marks like commas, frequent words like "the", "of", "and", etc.), as well as by resolving complexities (e.g. words with multiple meanings). This makes the text simpler/easier to structure. For example, removal of too frequent and unimportant words like "and" makes the resulting text simpler and easier to structure. Besides, such words do not usually form the basis of clustering (they don't tell us about the topic that a document concerns). 'Data preprocessing, or data cleansing, is the algorithm that detects and removes errors or inconsistencies from data and consolidates similar data in order to improve the quality of subsequent analyses. This cleaned data will then be fed to the analysis process' (Kongthorn, p. 12).
2. Data Representation: The preprocessed data must then be represented as structured data sets. The most commonly used structure is the vector space model. The preprocessed documents are converted into the structure ready for clustering.
3. Clustering: The clustering technique is then applied on the resulting transformed structure.
4. Evaluation: Here, the user examines and interprets the clustering results. Evaluation is done
on the results.
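The following minimal sketch (in Python, with an illustrative tokenizer and stop-word list rather than a complete preprocessing pipeline) walks through steps 1 and 2 on a toy corpus: frequent/unimportant words and punctuation are stripped, and the remaining terms are arranged into a term-document frequency structure ready for clustering.

```python
# Steps 1-2 on a toy corpus: preprocess, then build a term-document structure.
# The stop-word list is an illustrative placeholder, not a complete one.
import re
from collections import Counter

docs = [
    "Eating fruit improves health.",
    "Give your infant fruit regularly, for the infant to have good health.",
    "Regular exercise improves your health.",
]
stop_words = {"the", "of", "and", "your", "for", "to", "have", "good"}

def preprocess(text):
    # Step 1: lowercase, strip punctuation, drop frequent/unimportant words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Step 2: represent each document as a bag of term frequencies.
bags = [Counter(preprocess(d)) for d in docs]
vocabulary = sorted({t for bag in bags for t in bag})

# Term-document matrix: one row per term, one column per document.
matrix = [[bag[term] for bag in bags] for term in vocabulary]
for term, row in zip(vocabulary, matrix):
    print(term, row)
```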

We can summarize the objectives of data clustering (DC) or text clustering (TC) as follows.
1. Appropriate document representation model: Obtaining a document representation
model that best preserves the semantic relationships between words in the document. Text is
context sensitive and so the exact meaning of a sentence may vary depending on the
sequential representation of words in the text document. Also, algorithms that convert the
unstructured text documents into a structured form should retain the original data.
2. Effectiveness (or accuracy) of clustering: The techniques used by an algorithm should
ensure that “similar” documents are clustered together. In this regard, the measure of
“similarity” between two documents has no exact definition and so is a challenge. It’s
always possible to have two paragraphs discussing the same thing but expressed in different
words because of text ambiguities.
3. Efficiency: An algorithm should operate in a manner that utilizes computing resources well, i.e. the processing should be fast enough and should use minimum memory space, especially given the typically huge number of documents involved in many TC applications.
First, the formulae and logic used should be as brief as possible. Secondly, dimension
reduction of the usual high dimensional textual data is crucial. It’s important to reduce
documents’ sizes (e.g. by removing words that are irrelevant to the topic) to improve
efficiency of operations. This is however not easy to achieve. Obtaining appropriate
algorithms to do this without losing important data is hard.
4. Scalability: An algorithm should be scalable so as to cater for huge data sizes, high
dimensions, and large number of clusters.

5. Robustness (or flexibility): Chandra, E & Anuradha, V (2011) say that 'the algorithm should be effective in processing data with noise and outliers'. In TC (especially using the algorithms that are based on the VSM), it is possible to have:
• clusters with arbitrary (or irregular) shapes,
• clusters with different densities, and
• outliers/noise.
Algorithms should not suffer from these three aspects, but should be able to deal with any type of data with respect to them. An algorithm should be able to identify clusters which have irregular shapes as well as those with different densities. It should also identify any point which doesn't belong to a cluster, or even data which doesn't contain clusters.
6. Interpretability and cluster labels: The produced results should be easy to interpret as clusters. Also, obtaining an appropriate label/topic name for each cluster is desirable but also difficult. It's important to remember that clustering is unsupervised and so cluster labeling is not the user's activity.
7. Usability: An algorithm should preferably have minimum requirements for domain
knowledge: It’s desirable for an algorithm to operate automatically without need for user’s
inputs, e.g. the number of clusters. It’s difficult for the user to determine the best or the
optimum value of these parameters depending on the inputs, and also it’s not practical for
the user to always specify the parameters. However, automating the parameters’ values is
hard to do.
8. Applicability: An algorithm should be able to solve as wide a range of applications as possible. Also, some applications are such that a document may concern several topics, and so should be allowed to belong to all the appropriate clusters (overlapping of document clusters).

Quality (or performance): All the above factors contribute to the final quality of an algorithm.

2.3: Classifications of Clustering Algorithms


Many researchers have attempted to classify clustering algorithms in different ways. We can however summarize the classification of clustering algorithms as based on the following criteria: the text representation method used, the approach used, the hierarchy of the resulting clusters (i.e. either hierarchical or partitioning), the overlapping nature of the produced clusters (i.e. hard or soft algorithms), and the redundancy reduction method (the technique used to make an algorithm deal with higher dimensional data).

2.3.1: Data Representation Method


One criterion for classifying clustering algorithms is the nature of the data being subjected to clustering. Some clustering algorithms use the entire data in the documents, as it is, as the clustering feature. Others, however, map the data into a different representation, and then apply the clustering formulae/logic on this resulting representation rather than on the original data.

There are various text representation methods. Assume a set of n documents containing a total of m attributes/terms that form the basis of clustering (the number of occurrences of the attributes/terms varies across the documents). This data can be interpreted (and hence represented) in the following ways.

(i) Raw documents


Here the entire data in the documents is used, as it is, as the clustering feature, without any mapping.

(ii) Graphs model


A graph consists of nodes connected by edges. Consequently, based on graph theory, the above data can be viewed as a graph of n nodes (one per document) connected by weighted edges. The edge weight between any two nodes in the graph represents the similarity between the two nodes, and hence between the corresponding two documents.

(iii) Probability distribution model


The data can also be viewed as a sample drawn from a probability distribution. This is also known as model-based clustering. The specific rules of the particular distribution will dictate the measure of similarity between two text documents.

(iv) Vector space model


This data is a collection of n vectors/points in a space of dimension m. The Euclidean distance between any two points shows how near the points are to one another, and this represents the similarity between the corresponding two documents. 'The document set comprises an m×n term-document [or attribute-instance] matrix, in which each column represents a document [or instance], and the (i, j)th entry represents a weighted frequency of term i in document j [or the value of attribute i in instance j]' (Berry & Malu 2007).

(v) Matrix model


This data can also be taken to be a matrix with n columns and m rows (or vice versa). Some mathematical calculation on any two columns (e.g. their product) will represent how similar the columns are, and hence how similar the documents are.

Comment
The most commonly used text representation methods are the Vector Space Model (VSM) and the Matrix Model discussed in sub-chapter 2.4 below. The two models are used interchangeably, i.e. we can form a model of either but also refer to the equivalent representation using the other model. Thus, it's usual to form a matrix representation of documents and refer to it as the VSM. The approaches that use the VSM text representation method are the distance-based approach, feature extraction approach, density-based approach, grid-based approach, ontology-based approach, and neural networks-based approach.

2.3.2: Clustering Approach


A clustering approach is the method of determining that an instance or document belongs to a particular cluster or, simply, the similarity measure. In other words, an approach determines how similar an algorithm considers or assumes documents to be. The various approaches used in clustering of text documents can be summarized as follows: the center-based approach, frequent sequence approach, feature selection approach, feature extraction approach, density-based approach, grid-based approach, ontology-based approach, and neural networks approach. For example, the distance-based algorithms measure the Euclidean distance between two documents in the VSM to determine how close the two documents are, so as to decide whether the two documents should be clustered together or not. The approaches are discussed in sub-chapter 2.5 below.

2.3.3: The Hierarchy of Produced Clusters


A clustering algorithm can either be hierarchical or flat in terms of the hierarchy of the
produced clusters (i.e. how the produced clusters are related to one another).

(i) Hierarchical category


Hierarchical-type algorithms produce tree-like clusters, with the leaves of the tree being the individual documents (the lowest level clusters) and the root of the tree being the single highest level cluster that contains all documents. Examples of hierarchical algorithms are BIRCH, CURE and CHAMELEON.

A hierarchical cluster can be produced using two approaches, i.e. agglomerative or divisive
approach.

Agglomerative approach
We cluster documents by moving from the leaves up to the root of the tree. The leaves are the
individual documents/clusters. We subsequently merge pairs of most similar clusters until we
obtain one final cluster (i.e. the root of the tree).

According to Sree (2012, p. 2), a basic approach of an agglomerative clustering is;


1. Assign each item to a cluster, so that if you have N items, you now have N clusters, each
containing just one item. Let the distances (or similarities) between the clusters be the
distances between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that
now you have one less cluster.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
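As a rough illustration of this procedure, the sketch below uses SciPy's hierarchical clustering utilities on four toy 2-D points (the same vectors reused in the KMeans example later in this chapter); single linkage is just one possible choice for the inter-cluster distance in step 3.

```python
# A sketch of agglomerative clustering with SciPy on toy 2-D data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 2], [0, 2], [4, 1], [6, 0]])

# Steps 1-4: start from singleton clusters and repeatedly merge the closest
# pair; 'single' linkage defines cluster distance as the closest-pair distance.
merge_tree = linkage(points, method="single", metric="euclidean")

# Cutting the tree at a chosen number of clusters gives a flat partition.
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)   # cluster id of each point/document
```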

Divisive approach
We cluster documents by moving from the root down to the leaves of the tree. The whole document set is initially considered as a single cluster. We then subsequently divide each cluster into sub-clusters until each cluster contains exactly one document.

According to Rai (2010, p. 3), an approach of divisive clustering is as follows.


1. Put all objects into one cluster.
2. Repeat the following until all clusters are singletons.
(a) Choose a cluster to split.
(b) Replace the chosen cluster with its sub-clusters.

(ii) Flat category (or partitioning category)
According to Steinbach (2010, p. 4), flat-type algorithms produce one-level (i.e. un-hierarchical) partitions of documents. They usually receive the expected number of clusters as a parameter. The most widely used flat algorithm is the K-means algorithm.

2.3.4: The Overlapping Nature


A clustering algorithm can either be hard or soft in terms of the overlapping nature (of the
produced clusters). A hard clustering algorithm is one in which every document can be assigned
to only one cluster (i.e. there is no overlapping of clusters).

On the other hand, a soft (or fuzzy) algorithm is one in which a document can be assigned to more than one cluster (i.e. there is overlapping of clusters). Many applications are such that a document may concern many topics (based on some keywords/attributes), and so should be clustered into many clusters.

2.3.5: Redundancy Reduction Method


We can classify algorithms based on whether they do dimension reduction or not.

2.4: The VSM Representation Method


2.4.1: The Boolean Model
The simplest implementation of the VSM is the Boolean model, whereby a document is regarded simply as a "bag of words" (i.e. a set of words in textual data), or a sequence of numerical values (in structured data). 'In mathematics, a bag, also called a multiset, is a set with duplicates allowed' (Rehurek 2011, p. 7). Here, a collection of n documents is represented using a matrix of m rows and n columns, whereby the rows represent the attributes/terms and the columns represent the instances/documents. In other words, the rows are attribute/term vectors while the columns are instance/document vectors. The ijth entry in the matrix is either a 1 (if the ith attribute/term is present in the jth instance/document) or a 0 (if not). Thus, this 'term-document' matrix is said to be a "bag of words" since it contains repetitions of the values 1 and 0 (in unstructured data). In the space-based view, a document is a data point in a high dimensional space, whereby each attribute/term is an axis of the space.

Example
Assume the following three textual documents with identifiable key terms. These key terms form the basis of identifying what a document talks about, and thus of clustering.

D1: Eating fruit improves health.


D2: Give your infant fruit regularly, for the infant to have good health.
D3: Regular exercise improves your health.

We construct a term-dictionary as T1: fruit, T2: health, T3: infant, T4: exercise. We can then
form a term-document matrix as

A = [ 1  1  0 ]
    [ 1  1  1 ]
    [ 0  1  0 ]
    [ 0  0  1 ]

Here, the first row is the vector (1, 1, 0) representing the first term (fruit), showing that the term
occurs in the first and the second document, but not in the third. Similarly, the vector (1, 1, 0, 0)
represents the first document (that contains the first and the second terms, but not the third and
the fourth terms). And entry A42 is 0, showing that the fourth term (exercise) is not present in the
second document.

2.4.2: Modification of the VSM to Use Frequencies


A limitation of the Boolean model is that the relevance of a term is a binary decision. Consequently, the VSM is modified so that we use word frequencies. Here, the ijth entry in the attribute/term-document matrix represents the frequency of the ith term in the jth document (in TC) or the magnitude of the value of the ith attribute in the jth instance (in DC). This provides more information about terms, and allows us to apply partial matching, i.e. to compute a continuous degree of similarity between queries and documents, in query-based systems.

Example
Using the example in section 2.4.1 above, the term-document matrix using frequencies is

A = [ 1  1  0 ]
    [ 1  1  1 ]
    [ 0  2  0 ]
    [ 0  0  1 ]

The difference is that the third term (infant) occurs twice in the second document.
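The sketch below (plain Python, assuming simple whitespace tokenization) builds both the frequency and the Boolean term-document matrices for the three example documents and the four-term dictionary above.

```python
# Build the frequency and Boolean term-document matrices for the example.
docs = [
    "Eating fruit improves health",
    "Give your infant fruit regularly, for the infant to have good health",
    "Regular exercise improves your health",
]
terms = ["fruit", "health", "infant", "exercise"]

def frequency_matrix(terms, docs):
    # Entry (i, j) = number of times term i occurs in document j.
    return [[doc.lower().split().count(t) for doc in docs] for t in terms]

F = frequency_matrix(terms, docs)
B = [[1 if f > 0 else 0 for f in row] for row in F]   # Boolean model

print("frequency:", F)   # [[1, 1, 0], [1, 1, 1], [0, 2, 0], [0, 0, 1]]
print("boolean:  ", B)   # [[1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1]]
```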

2.5: Clustering Algorithms


This section consequently reviews the existing clustering algorithms and places each under the
appropriate category. The various approaches are center-based methods, frequent sequence
methods, feature selection and feature extraction methods, density-based methods, grid-based
methods, ontology-based methods, and neural networks-based methods.

2.5.1: Center-based (or distance-based) Methods


Center-based methods find clusters by measuring the distance between each data point from a
center. ‘Any distance metric applicable to a multidimensional vector space is applicable, but two
methods are widely used: Euclidean distance and cosine measure’ (Weiss 2006, p. 34).

The Euclidean measure


The Euclidean measure applies the VSM, whereby a document is a data point (or a vector) in a high dimensional space and each attribute/term is an axis of the space. Thus, the similarity of two instances/documents is the distance between the two points in the space, i.e. the length of the straight line between them. The limitation of the Euclidean measure, according to Weiss (2006, p. 34), is that it requires normalization of the length of each vector so as to avoid distortion of the results. Thus, the cosine measure is often used instead.

The cosine measure


In the cosine measure, the similarity between documents x and y can be expressed as the cosine of the angle between the two document vectors. To understand this, two parallel vectors have an angle of 0 between them, and so their similarity measure is cos(0) = 1, meaning they are fully similar. If you rotate one of them so that the angle goes towards 90 degrees, the cosine decreases towards cos(90) = 0, the point at which the two vectors are perpendicular and so fully dissimilar (the same happens under counter rotation, i.e. from 360 down to 270 degrees). And an angle between 90 and 270 degrees (both values exclusive) gives a negative cosine measure, meaning the documents point in opposite directions.
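A minimal sketch of the two measures in plain Python, applied to two of the document vectors used in the examples in this chapter:

```python
# Euclidean distance and cosine similarity between two document vectors.
import math

def euclidean_distance(x, y):
    # Length of the straight line between the two points in the VSM.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # cos(angle) between the vectors: 1 = same direction, 0 = perpendicular.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

d1, d2 = (1, 2), (0, 2)            # two document vectors
print(euclidean_distance(d1, d2))  # 1.0
print(cosine_similarity(d1, d2))   # about 0.894
```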

(i) The KMeans Algorithm


The KMeans algorithm is among the most popular clustering algorithms, and has several variations. It was developed by J. MacQueen in 1967. It's a partitioning, similarity-based flat algorithm whose objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean or centroid of the documents in a cluster. According to Punitha (2012, p. 2), the KMeans algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The centroid of a cluster is the average of all the points in the cluster, and nearness is determined by a distance measure, e.g. the Euclidean distance measure. The steps of the algorithm are:
1. Choose the number of clusters, k.
2. Randomly generate k clusters and determine the cluster centers (centroids).
3. Repeat the following until no object moves (i.e. no object changes clusters)
(i) Determine the distance of each object to all centroids.
(ii) Assign each point to the nearest centroid.
(iii) Re-compute the new cluster centers.

Thus, in each loop of step 3 above, the algorithm aims at minimizing the following objective function for k clusters and n data points:

J = Σ_{j=1}^{k} Σ_{i=1}^{n} ||x_i - c_j||²

where ||x_i - c_j|| is a chosen distance measure between data point x_i and the center c_j of cluster j.

The complexity of the K Means algorithm is O(n*i*c*t), where n = number of points, i = number of iterations, c = number of clusters, and t = number of terms. Thus, the algorithm has a linear time complexity.

Example
Assume four documents containing two terms, whereby the first term occurs with frequencies 1, 0, 4, and 6 respectively in the documents, while the second term occurs with frequencies 2, 2, 1, 0 respectively. Thus, the four document vectors are (1, 2), (0, 2), (4, 1), and (6, 0), i.e. with term-document matrix

A = [ 1  0  4  6 ]
    [ 2  2  1  0 ]

We choose k=2, and the first two points (i.e. (1, 2), (0, 2)) as the initial first and second
centroids.

First loop
We compute the distance matrix (containing the distance of each point from each centroid) to be

D1 = [ 0     1     3.16  5.39 ]
     [ 1     0     4.12  6.32 ]

The first row of D shows the distance of each point from the first centroid, and the second row
shows the distance of each point from the second centroid.

Here, the point (1, 2) has distance ((1-1)² + (2-2)²)^(1/2) = 0 from centroid (1, 2), and distance ((1-0)² + (2-2)²)^(1/2) = 1 from centroid (0, 2).
The point (0, 2) has distance ((0-1)² + (2-2)²)^(1/2) = 1 from centroid (1, 2), and distance ((0-0)² + (2-2)²)^(1/2) = 0 from centroid (0, 2).
The point (4, 1) has distance ((4-1)² + (1-2)²)^(1/2) = 3.16 from centroid (1, 2), and distance ((4-0)² + (1-2)²)^(1/2) = 4.12 from centroid (0, 2).
The point (6, 0) has distance ((6-1)² + (0-2)²)^(1/2) = 5.39 from centroid (1, 2), and distance ((6-0)² + (0-2)²)^(1/2) = 6.32 from centroid (0, 2).

We then form the clusters by assigning each point to its nearest centroid. We form the group matrix G by giving each point the value 1 in the row of the cluster it belongs to, and the value 0 otherwise. Note that the first row represents the first cluster, and the second row the second cluster. E.g., the third point (4, 1) has distance 3.16 from the first centroid and distance 4.12 from the second centroid, meaning it's nearer to the first centroid. So we set the third column of G below to (1, 0). Thus,
G1 = [ 1  0  1  1 ]
     [ 0  1  0  0 ]

This shows that the first, third and fourth points belong to the first cluster, and the second point
to the second cluster.

We then recompute the centroid of each cluster as the average of the points in that cluster. Thus,
first centroid is ((1+4+6)/3, (2+1+0)/3) which is (3.67, 1), while the second centroid is (0, 2).

Second loop
We then start the second loop of the algorithm and compute the distance matrix to be

D2 = [ 2.85  3.80  0.33  2.54 ]
     [ 1     0     4.12  6.32 ]

Here, the point (1, 2) has distance ((1-3.67)² + (2-1)²)^(1/2) = 2.85 from centroid (3.67, 1), and distance ((1-0)² + (2-2)²)^(1/2) = 1 from centroid (0, 2).
The point (0, 2) has distance ((0-3.67)² + (2-1)²)^(1/2) = 3.80 from centroid (3.67, 1), and distance ((0-0)² + (2-2)²)^(1/2) = 0 from centroid (0, 2).
The point (4, 1) has distance ((4-3.67)² + (1-1)²)^(1/2) = 0.33 from centroid (3.67, 1), and distance ((4-0)² + (1-2)²)^(1/2) = 4.12 from centroid (0, 2).
The point (6, 0) has distance ((6-3.67)² + (0-1)²)^(1/2) = 2.54 from centroid (3.67, 1), and distance ((6-0)² + (0-2)²)^(1/2) = 6.32 from centroid (0, 2).

Thus, the first point changes into the second centroid/cluster since it’s now distance 2.85 from
the first centroid (3.67, 1) compared to distance 1 from the second centroid (0, 2). We therefore
compute the new group matrix to be
G2 = [ 0  0  1  1 ]
     [ 1  1  0  0 ]

We then recompute the centroid of each cluster as the average of the points in that cluster. Thus,
first centroid is ((4+6)/2, (1+0)/2) which is (5, 0.5), while the second centroid is ((1+0)/2,
(2+2)/2) which is (0.5, 2).

Third loop
We then start the third loop of the algorithm and compute the distance matrix to be

D3 = [ 4.27  5.22  1.12  1.12 ]
     [ 0.5   0.5   3.64  5.85 ]
Here, the point (1, 2) has distance ((1-5)² + (2-0.5)²)^(1/2) = 4.27 from centroid (5, 0.5), and distance ((1-0.5)² + (2-2)²)^(1/2) = 0.5 from centroid (0.5, 2).
The point (0, 2) has distance ((0-5)² + (2-0.5)²)^(1/2) = 5.22 from centroid (5, 0.5), and distance ((0-0.5)² + (2-2)²)^(1/2) = 0.5 from centroid (0.5, 2).
The point (4, 1) has distance ((4-5)² + (1-0.5)²)^(1/2) = 1.12 from centroid (5, 0.5), and distance ((4-0.5)² + (1-2)²)^(1/2) = 3.64 from centroid (0.5, 2).
The point (6, 0) has distance ((6-5)² + (0-0.5)²)^(1/2) = 1.12 from centroid (5, 0.5), and distance ((6-0.5)² + (0-2)²)^(1/2) = 5.85 from centroid (0.5, 2).

Thus,
G3 = [ 0  0  1  1 ]
     [ 1  1  0  0 ]

And so there is no change of the clusters’ grouping, and so we stop.

Conclusion
The first two points (thus documents) i.e. (1, 2), (0, 2) are in the second cluster while the last two
documents i.e. (4, 1), and (6, 0) are in the first cluster.
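For comparison, the sketch below reruns the same example with scikit-learn's KMeans implementation, seeded with the same two initial centroids. The numeric labels assigned to the clusters are arbitrary, but the grouping matches the manual computation above: (1, 2) and (0, 2) end up together, as do (4, 1) and (6, 0).

```python
# Rerun the worked example with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [0, 2], [4, 1], [6, 0]])        # document vectors
km = KMeans(n_clusters=2, init=points[:2], n_init=1, max_iter=100)
labels = km.fit_predict(points)

print(labels)               # e.g. [1 1 0 0]: first two points in one cluster
print(km.cluster_centers_)  # final centroids, (5, 0.5) and (0.5, 2)
```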

Illustration of the clustering


Remember our original data was (1, 2), (0, 2), (4, 1), and (6, 0), i.e. with term-document matrix

A = [ 1  0  4  6 ]
    [ 2  2  1  0 ]

We can illustrate the immediate above clustering example using the space-based view as follows.

Since there are two terms, we have a two dimensional space whereby the x axis represents the
first term while the y axis represents the second term. Each document is a point on the xy space.

Note that in the figure, document points, centroids, and cluster boundaries are each drawn with their own marker. The three panels correspond to the three loops: Loop 1 with centroids (1, 2) and (0, 2); Loop 2 with centroids (3.67, 1) and (0, 2); and Loop 3 with centroids (5, 0.5) and (0.5, 2).
Figure 2.3: Illustration of KMeans clustering

Strengths and limitations of KMeans algorithm


The advantage of the K Means algorithm is that it’s simple to understand and implement.
Secondly, the K-means algorithm is efficient in memory requirements since the system only
needs to store the data points with their features, the centroids of K clusters and the membership
of data points to the clusters. According to Geraci (2008, p. 18), K Means is fast for small
document sizes. According to Rosell (2009, p. 31), the time complexity of the K Means
algorithm is O(knI), where k is the number of clusters, n the number of objects and I the number
of iterations (which is dependent on the stopping criterion). Thus, the algorithm is very efficient.
It’s also scalable. From the research done by Osama (2008), the quality of K Means algorithm
becomes very good with huge data sets, meaning that K Means is scalable.

Its limitation is that the user must specify the initial number of clusters before clustering. It is not
trivial for the user to determine a reasonable number of clusters depending on the particular set
of documents. It’s obvious that different data sets have different number of documents with
varying topics, and thus different groups. ‘A major problem with partitioning algorithms is
selecting an appropriate number of output clusters’ (Lasek 2011, p. 17). Further, the algorithm
forces the documents to have that particular number of clusters (i.e. K) no matter how many
topics are contained in the documents. And this is a major limitation. Secondly and equally very
important regarding the accuracy of KMeans, we can observe from the algorithm's logic above that the algorithm lacks robustness, i.e. it is poor with irregularly shaped data and does not detect outliers. Stefanowski (2009, p. 48) agrees with this by saying that the K Means algorithm is too sensitive to outliers. Let's have an illustration of this.

Figure 2.4: Illustration of groups with irregular shapes and different densities, plus an outlier point (hard to detect in KMeans). Panels (a) and (b) each show 2 groups; the outlier point lies outside the groups in (b).

The two obvious groups of points in each pair are of different shapes. Thus, some points in a group could be nearer (in distance) to the center (or centroid) of the other group (or cluster) than to their own, based on the K Means algorithm. In other words, some points (in irregularly shaped regions) may be close to one another, yet in different groups. Thus, it's very hard for the K Means algorithm to detect the clusters since it's purely based on distance measurements. And so, wrong clustering could happen, as illustrated below.

Figure 2.5: Illustration of wrong clustering by KMeans (each panel shows 2 clusters, labelled cluster A and cluster B).

Also from the logic of KMeans, the outlier point shown at the bottom right corner of the second
pair of clusters will not be interpreted as noise (not belonging to any cluster), but will rather be
assigned to a cluster. This is because KMeans assigns each point to its nearest centroid, no matter
how far the centroid is. Thus, KMeans is not able to identify outliers/noise in data, and this further distorts the calculation of the mean distances of points from their centroids.

Thirdly, it doesn't include dimension reduction. Fourth, according to Li (2007, p. 22), there is no description of a cluster's contents (i.e. no cluster labels), so the contents can't be utilized more efficiently. Still, especially when the data points are few, different initial centroids are likely to give different clusterings.

In an attempt to improve the K Means algorithm, other similar algorithms have been developed.
All these can be considered as the K Means family. They include the following.

K Means variations

• The K Medians Algorithm: Works just like the K Means algorithm, except that we compute the median instead of the mean of each cluster as the centroid. As a result, there is less effect of extreme values (i.e. outliers). But the algorithm suffers from the other limitations of the K Means algorithm.
• The K Medoids Algorithm: As opposed to the K Means algorithm, whose centers (or centroids) may not be data points (centroids are obtained as the mean value of all data points in a cluster), the K Medoids algorithm chooses data points as centers (medoids). I.e., each cluster is represented by one of the objects in the cluster (the representative object). A medoid is the data point that is most centrally located in a cluster, meaning that its average dissimilarity to all other objects in the cluster is the minimum (compared to that of the other points). This makes the K Medoids algorithm less sensitive to outliers than the K Means algorithm. However, this algorithm is less efficient than the K Means algorithm (the step of computing medoids, specifically their dissimilarity to all other objects, is harder than computing means), and it also retains the other limitations of K Means (except the problem of outliers).
• The Bisecting K Means Algorithm: Whereas the K Means algorithm splits a cluster into k sub-clusters, the Bisecting K Means algorithm splits a cluster into two sub-clusters in a divisive hierarchical manner, but using the K Means type of clustering. Some researchers have found this algorithm to be more accurate than KMeans, but it still suffers the limitations of the K Means algorithm.
• The Partitioning Around Medoids (PAM) algorithm
• CLARA (Clustering LARge Applications)
• CLARANS (Clustering Large Applications based on RANdomized Search)
• The Kernel K Means Algorithm

2.5.2: The Frequent Sequence Algorithms


These algorithms measure the similarity of (usually textual) documents by using the frequent sequences of elements in the documents (which are low dimensional), instead of using the high dimensional original data. Thus, they achieve redundancy reduction using the frequent sequences. The idea is that a frequent sequence is a description of a cluster, and the cluster corresponds to all the documents containing that frequent sequence. Thus, the documents in a cluster are those that share its frequent sequence more than the documents in other clusters do.

For example, in a frequent word set type of these algorithms, a word set consists of some words, e.g. "purchase, car". The word set "purchase, car" occurs in the following three sentences:
"Please purchase car for me"
"I went to purchase the requested car"
"Frequent car purchase is unnecessary"
A word set is frequent if the number of documents containing the words in this set is at least the specified threshold.
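A minimal sketch of this idea in Python (the candidate word set, the threshold value, and the toy sentences are illustrative only): a candidate word set is checked against each document, and if enough documents contain it, those documents form a candidate cluster labelled by the word set.

```python
# Frequent word set check: which documents contain every word in the set?
docs = [
    "Please purchase car for me",
    "I went to purchase the requested car",
    "Frequent car purchase is unnecessary",
    "Regular exercise improves your health",
]

def supporting_docs(word_set, docs):
    # Indices of documents that contain every word in the set (any order).
    return [i for i, d in enumerate(docs)
            if word_set <= set(d.lower().split())]

threshold = 3
candidate = {"purchase", "car"}
support = supporting_docs(candidate, docs)
if len(support) >= threshold:
    print("frequent word set", candidate, "-> cluster of documents", support)
```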

Strengths and limitations


The advantage of these clustering approaches is that there is dimension reduction of the document sizes. Only the frequent word sequences or phrases are used to cluster the documents, rather than the entire documents. Secondly, there is a description of a cluster's contents by the use of a cluster's label. The label is the set of frequent words shared by documents in that cluster.

2.5.3: Feature Extraction Algorithms


These are algorithms used to reduce the dimensions of the instances/documents being clustered by obtaining (or extracting) only some important features from the original documents and clustering on those, thus achieving redundancy reduction. They do this through feature transformation (using a linear transformation), i.e. defining new features to represent the original features (or data set). Here, the correlations among the words in the data set are leveraged in order to create features, which correspond to the concepts in the data.

Strengths and limitations of feature extraction


One advantage of these methods is that the dimension of the documents is reduced drastically, thus improving efficiency (although the transformation computations themselves are expensive). Secondly, such algorithms are able to discover deep relations between attributes/terms and instances/documents.

However, they have their limitations. First, feature extraction methods involve expensive computations, which waters down their efficiency. Secondly, compared with other approaches, feature extraction methods suffer from a common limitation in that the generated new features may be difficult to interpret.

The Latent Semantic Analysis (LSA)


LSA projects an original vector space or term-document matrix into a small factor space (Lee 2010, p. 2). It is a feature extraction clustering technique that applies Singular Value Decomposition (SVD) to reduce the dimension of the term-document matrix. SVD applies a mathematical rule that specifies that an m by n rectangular matrix, say A, can be broken down into the product of three matrices, say U, S and V^T, i.e. A(m×n) = U(m×m) S(m×n) V^T(n×n), whereby
• U is an m by m orthogonal matrix whose columns are orthonormal eigenvectors of AA^T,
• S is an m by n diagonal matrix containing the singular values (the square roots of the eigenvalues of AA^T and A^TA) in descending order, and
• V is an n by n orthogonal matrix whose columns are orthonormal eigenvectors of A^TA.

In LSA, we decompose a weighted (e.g. with term frequencies) term-document matrix A into the three matrices U, S, and V^T. Here, U is the term matrix while V^T is the document matrix. In other words, matrix U describes the original row entities (the terms) as vectors of derived orthogonal factor values, while matrix V^T describes the original column entities (the documents) as vectors of derived orthogonal factor values. Another characteristic of the vectors of U and V^T is that they have mixed signs (i.e. some values are positive and others negative). Also, the set of rows (representing terms) usually reveals some semantic relations among the terms (e.g. synonyms). LSA is able to reveal deep/hidden (i.e. latent) relations among data items, hence the name Latent Semantic Analysis. And the set of columns (representing documents) reveals some clusters. Also, topics are represented by the factors of the matrix, and consequently, the relationships between documents are represented by the orthogonal characteristics of the factors. Thus, words in a factor have little relation with words in other factors, but words in a factor have high relations with other words in that factor.

We then reconstruct the original matrix A by multiplying U, S and V^T, but using only the vectors of U and V^T that are associated with the larger singular values of S (we ignore the others). We can specify a particular threshold on these singular values to massively reduce A's dimension. In this way, A's rank is reduced and so the dimension reduction need of text mining clustering is accomplished. In other words, we project an original vector space (a term-document matrix in this case) into a small factor space. SVD thus gives a reduced representation of the original text data.
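The sketch below illustrates this with NumPy's SVD on the frequency term-document matrix from section 2.4.2; the choice of k = 2 factors is arbitrary.

```python
# LSA-style dimension reduction of a term-document matrix via truncated SVD.
import numpy as np

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 2, 0],
              [0, 0, 1]], dtype=float)    # terms x documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of latent factors kept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction of A

# Documents can now be compared in the k-dimensional factor space instead of
# the original m-dimensional term space.
doc_factors = np.diag(s[:k]) @ Vt[:k, :]       # one k-dim column per document
print(A_k.round(2))
print(doc_factors.round(2))
```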

2.5.4: Density-based Methods


Rather than measuring distances between data points (as in distance-based algorithms), density-based algorithms find clusters by differentiating regions in terms of the relative density (compactness, concentration, or number of objects per unit area) of the VSM points in them. Thus, regions adjacent to a cluster contain data points of a different (typically lower) concentration. Some clusters that would be easily detected by a density-based clustering algorithm are illustrated below.

Figure 2.6: Some clusters that are easy to find in density clustering (5 clusters, 2 clusters, and 3 clusters with overlapping; the overlapping case is hard in density clustering).

Density-based algorithms are partitioning-type algorithms.

Strengths and limitations of density-based methods


A common advantage that we can identify from the logic of density-based algorithms as well as
from the immediate above illustration is that the algorithms can discover clusters of irregular
shapes because a cluster is represented as a connected dense region that can grow in any
direction that density leads. Consequently and secondly, density-based algorithms are less sensitive to outliers/noise (because of this ability to discover arbitrary shapes). For example, the clusters in the illustration above (some with arbitrary shapes) will be a big problem for other algorithms like the K Means family, but not for density-based algorithms. Growing a cluster only through a connected dense region of points naturally excludes any outlier from being included in a cluster. Also, the number of clusters does not have to be specified, unlike in the K Means family of algorithms. Another key property of these algorithms is that they require only one scan of the input data set. This increases processing speed.

A limitation of density-based algorithms is that there are initial parameters that need to be set, and it's hard to specify the most appropriate setting. Secondly, there is no cluster labeling. Third, and as illustrated immediately above, the algorithms do not allow cluster overlapping.

The DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm
This algorithm was presented by Ester, Kriegel, Sander, and Xu in 1996. According to Nagpal & Mann (2011), the DBSCAN algorithm takes in two inputs, i.e. the radius of the cluster (Eps) and the minimum number of points required inside the cluster (Minpts). Consequently, there are three types of points: a core point (a point of a cluster whose neighborhood is dense enough with respect to Eps and Minpts, i.e. it has a minimum of Minpts points within radius Eps; a core point is thus inside the cluster), a border point (a point on the border of the cluster, but within the neighborhood of a core point; in other words, a neighbor of a core point which is not a core point itself), and noise (a point which is neither a core point nor a border point). A point q is directly density-reachable from a point p (or is in the neighborhood of p) if it is not farther away from p than distance Eps and p is a core point. And q is called density-reachable from p if there exists a sequence p1, p2, …, pn of points with p1 = p and pn = q where each pi+1 is directly density-reachable from pi.

According to Mumtaz & Duraiswamy, DBSCAN clusters by arbitrarily selecting a starting point p that has not been visited, and then finding all neighbor points within distance Eps of p. Then:
• If the number of neighbors is at least Minpts, a cluster is formed. The point p and its neighbors are added to this cluster and p is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
• If the number of neighbors is less than Minpts, then p is marked as noise.
• If a cluster is fully expanded (i.e. all points within reach are visited), the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
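A short usage sketch with scikit-learn's DBSCAN implementation (the point coordinates and the Eps/Minpts values are illustrative only); points labelled -1 are reported as noise:

```python
# DBSCAN on toy points: eps plays the role of Eps, min_samples of Minpts.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1, 2], [1.2, 2.1], [0.9, 1.8],     # a dense region
                   [8, 8], [8.1, 7.9], [7.9, 8.2],     # another dense region
                   [4, 5]])                            # an isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1]: two clusters and one noise point
```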

Experiments have shown that DBSCAN is faster and more precise than many other algorithms. It
has a time complexity of O(n log n), which is quite low. ‘It holds good for large spatial
databases’ (Parimala et al. 2011). Mumtaz & Duraiswamy say the following about DBSCAN: ‘It
can even find clusters completely surrounded by (but not connected to) a different cluster’. This
is obviously because of the Minpts and Eps requirement.

DBSCAN however has its own limitations. First, since DBSCAN uses a single global value of the parameters Minpts and Eps, it has problems clustering multi-density regions. That means that sparse (i.e. less dense) regions that deserve to be clusters may be interpreted as noise (because they don't have enough points to satisfy the threshold Minpts within distance Eps), while denser regions that should be in different clusters may be clustered together (because the different regions may fall within a single radius Eps). This means that clustering multi-density data is inaccurate. Secondly, the speed of the algorithm depends on the setting of the density parameters (Eps and Minpts), and it's hard to determine the most appropriate setting.

OPTICS (Ordering Points To Identify the Clustering Structure)


The OPTICS algorithm was presented by Michael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999. It uses the idea of DBSCAN, but includes finding clusters of different densities. It orders the data points being clustered linearly so that the closest points become neighbors in the linear order. Also, the closest (i.e. denser) points are given priority in the clustering. This idea of having an order in which data objects are clustered, i.e. the denser points are clustered first, is known as density-based cluster ordering. The density-based cluster ordering is achieved by using multiple values of the distance parameter Eps (a lower value of Eps will result in denser clusters, and a higher value of Eps will result in less dense clusters). Thus, the main advantage of OPTICS over DBSCAN is that OPTICS handles data points of different densities.
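A corresponding usage sketch with scikit-learn's OPTICS implementation (again on illustrative toy data); note that no single global Eps has to be supplied:

```python
# OPTICS on toy points of different densities.
import numpy as np
from sklearn.cluster import OPTICS

points = np.array([[1, 2], [1.2, 2.1], [0.9, 1.8],     # a dense region
                   [8, 8], [8.5, 7.5], [7.5, 8.5],     # a sparser region
                   [4, 5]])                            # an isolated point
clusterer = OPTICS(min_samples=3).fit(points)
print(clusterer.labels_)        # cluster id per point, -1 for noise
print(clusterer.reachability_)  # the density-based ordering information
```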

2.5.5: Probability-based Methods


These methods use probabilities to cluster documents. The methods are also known as 'model-based' methods. The goal is to find the most likely cluster for a data element, i.e. they find the probability with which a data point belongs to a cluster. In this case, the documents' data is regarded as belonging to a certain probability distribution, and the area around the mean of a distribution constitutes a natural cluster. Therefore, we associate the cluster with the corresponding distribution's statistics, e.g. mean, variance, standard deviation, etc.

Note that whereas density-based clustering algorithms work with densities of data points in the VSM, probability-based clustering algorithms compute probabilities of words in documents (i.e. without applying the VSM).

Strengths and limitations of probability methods


‘One clear advantage of probabilistic methods is the interpretability of the constructed clusters.
Probability clustering results to easily interpretable cluster system’ (Rai 2010, p. 4). However,
probability clustering algorithms have their limitation in that they require expensive
computations thus are less efficient. According to Lee (2010, p. 5), a limitation of CTM is that it
requires complex computations.

Examples
Probabilistic Latent Semantic Analysis (PLSA)
Latent Dirichlet Allocation (LDA)
Correlated Topic Model (CTM)
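As a rough illustration of the model-based idea, the sketch below fits LDA (one of the examples above) with scikit-learn on a tiny toy corpus; the number of topics and the documents themselves are illustrative assumptions. Each document receives a probability distribution over topics, and assigning it to its most probable topic gives a (soft) clustering.

```python
# Probability/model-based text clustering with LDA on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "eating fruit improves health",
    "give your infant fruit regularly for good health",
    "regular exercise improves your health",
    "daily exercise and sport keep you fit",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

topic_probs = lda.transform(counts)      # per-document topic probabilities
print(topic_probs.round(2))
print(topic_probs.argmax(axis=1))        # most probable topic per document
```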

2.5.6: Grid-based Methods


Grid-based clustering methods quantize the VSM space of the data points into a finite number of
cells to form a grid structure. The cells containing at least the minimum number of points are
considered dense. Then the dense cells are connected to form clusters. Thus, there are no
distance computations on the data points, and shapes are restricted to the union of the grid cells.
The clustering is based not on the data points, but on the value space surrounding the data points.

Figure 2.7: Some 4*4 grids

In short, grid clustering concerns forming clusters from contiguous dense cells. Thus, the grid-based approach is related to (or applies) the density-based approach. Grid-based clustering algorithms are usually hierarchical agglomeration based. Examples of such algorithms are CLIQUE, STING, MAFIA, and WAVECLUSTER.

The basic steps of a grid algorithm are:
(i) Divide the input data space into a particular number of cells.
(ii) Calculate the density of each cell.
(iii) Eliminate the cells with densities less than the threshold.
(iv) Form clusters from the contiguous (dense) cells.
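A minimal sketch of these four steps on 2-D points (the cell size, the density threshold, and the points themselves are illustrative; cell adjacency is restricted to horizontal/vertical neighbors, a point discussed further below):

```python
# Grid clustering: quantize, compute cell densities, drop sparse cells,
# then merge contiguous dense cells into clusters.
from collections import defaultdict

points = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.8, 2.2),
          (8.0, 8.0), (8.1, 7.9), (8.3, 8.6), (8.2, 8.8),
          (4.0, 5.0)]
cell_size, threshold = 1.0, 2

# (i) + (ii): assign points to cells and count the density of each cell.
density = defaultdict(int)
for x, y in points:
    density[(int(x // cell_size), int(y // cell_size))] += 1

# (iii): keep only the dense cells.
dense = {c for c, n in density.items() if n >= threshold}

# (iv): merge contiguous dense cells into clusters (flood fill over the
# horizontal and vertical neighbors of each dense cell).
clusters, seen = [], set()
for cell in dense:
    if cell in seen:
        continue
    stack, cluster = [cell], set()
    while stack:
        cx, cy = stack.pop()
        if (cx, cy) in seen or (cx, cy) not in dense:
            continue
        seen.add((cx, cy))
        cluster.add((cx, cy))
        stack.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])
    clusters.append(cluster)

print(clusters)   # e.g. [{(1, 2)}, {(8, 8)}]: sparse cells are dropped
```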

Strengths and limitations of grid-based algorithms


The advantage of the grid-based methods is that they are very efficient. The processing time is much less even with large data sets. According to Lasek (2011, p. 20), in high dimensional space this approach is more efficient than the density-based approach. The processing time is fast and dependent only on the number of cells, i.e. they have a complexity of O(number of cells) rather than O(number of documents, i.e. n). Fung (1999, p. 20) says that

The unique property of grid-based clustering approach is that its computational complexity is independent of the number of data objects, but dependent only on the number of cells in each dimension in the quantized space.

An obvious limitation of many grid algorithms is that they require some input parameters, e.g. the number of cells (intervals) and the density threshold (for CLIQUE, described below), or the number of cells and the number of levels (for STING, described below). Consequently and secondly, the efficiency of the algorithms is strongly determined by the size of the cells: a small cell size will clearly lead to unnecessary computation in regions with sparse points, while a large cell size will lead to inaccuracy in regions with dense points. According to Rama et al. (2010), a limitation of STING is that its performance relies on the granularity of the lowest level of the grid structure.

Illustration: Some clusters that are hard to detect using grid algorithms.

Figure 2.8: Illustration of the cell adjacency problem in grid clustering (panels (a) and (b): 5 clusters, 2 clusters, and 3 clusters with overlapping).

If we define the volume of a cluster to be the sum of the volumes of all the non-empty grid cells of that cluster, then it is clear that the volume of any cluster will decrease as we continue partitioning the grid (and the number of surrounding empty cells increases), meaning that different clustering results can be produced by different cell sizes. Thus, the optimal size of cells is a major problem in grid-based algorithms. Thirdly, and similarly to the second limitation, the efficiency of an algorithm is determined by the density threshold. It's hard to determine an optimal threshold, which also depends on the cell size and the dimensionality of the data. Fourth, determining the kind of cell adjacency to use (e.g. for 2D clustering, using 4 or 8 neighbors, etc.) is hard. Most grid-based algorithms cluster cells horizontally and vertically, but never diagonally. This is unlike other approaches, such as the distance and density approaches, whereby finding nearby points is not limited to horizontal and vertical directions, but is done in any direction. But even for algorithms that also cluster diagonally, determining the cell adjacency is a problem, unlike in density-based approaches where this is not an issue. This obviously might greatly affect the performance (i.e. accuracy). Fifth, grid clustering is poor in clustering irregular shapes, unlike density clustering, and so is poor in noise handling. Sixth, and just like density-based clustering, grid-based clustering does not ordinarily include overlapping of clusters (or fuzzy clustering) or cluster labeling.

