Cluster Analysis
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
Jnana Sangama, Santhibastawad Road, Machhe
Belgaum 590 018, Karnataka, India
PAVITHRA S 1JS13IS046
JSS ACADEMY OF TECHNICAL EDUCATION
JSS Campus, Uttarahalli Main Road, Kengeri
Bangalore - 560060
CERTIFICATE
This is to certify that the case study report titled Cluster Analysis, submitted as part of the assignment, has been completed by Pavithra S, University Seat Number 1JS13IS046, for the VII Semester subject Data Warehousing and Data Mining (Code: 10IS72) in Information Science & Engineering under Visvesvaraya Technological University, Belgaum, during the year 2016.
Learner Declaration
I certify that the work submitted for this assignment is my own and that all research sources are fully acknowledged.
PSO-CO EVALUATION

PO JUSTIFICATIONS: PO1, PO2, PO3, PO4, PO5, PO6, PO7, PO8, PO9, PO10, PO11, PO12
PSO JUSTIFICATIONS: PSO1, PSO2, PSO3
Cluster analysis
Cluster analysis or clustering is the task of dividing a set of objects into groups (called clusters) so that objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
Cluster analysis itself is not one specific algorithm, but the general task to be
solved. It can be achieved by various algorithms that differ significantly in
their notion of what constitutes a cluster and how to efficiently find them.
Popular notions of clusters include groups with low distances among the
cluster members, dense areas of the data space, intervals or particular
statistical distributions. The appropriate clustering algorithm and parameter settings (including choices such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery that involves trial and failure. It will often be necessary to modify preprocessing and parameters until the result achieves the desired properties.
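To make this trial-and-failure loop concrete, here is a minimal sketch (assuming Python with NumPy and scikit-learn, which are not part of the original report) that tries several cluster counts for k-means and compares them with an internal quality measure; the synthetic data and the range of k are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three synthetic Gaussian blobs standing in for a real data set
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

# Try several cluster counts and compare an internal quality measure
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```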
Besides the term clustering, there are a number of terms with similar
meanings, including automatic classification, numerical
taxonomy, botryology (from Greek "grape") and typological analysis.
The subtle differences are often in the usage of the results: while in data mining the resulting groups are the matter of interest, in automatic classification it is primarily their discriminative power that is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.
[Figure: the same data set before clustering and after clustering]
The notion of a cluster varies between algorithms and is one of the many
decisions to take when choosing the appropriate algorithm for a particular
problem. At first the terminology of a cluster seems obvious: a group of data
objects. However, the clusters found by different algorithms vary
significantly in their properties, and understanding these cluster models is
key to understanding the differences between the various algorithms. Typical requirements of a clustering algorithm in data mining include:
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Insensitivity to the order of input records
Ability to handle high-dimensional data
Interpretability and usability
Cluster Algorithms
Connectivity-based clustering (hierarchical clustering)
Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea that objects are more related to nearby objects than to those farther away. These algorithms connect objects into clusters based on their distance, and the resulting hierarchy can be represented as a dendrogram.
While these methods are fairly easy to understand, the results are not always
easy to use, as they will not produce a unique partitioning of the data set,
but a hierarchy the user still needs to choose appropriate clusters from. The
methods are not very robust towards outliers, which will either show up as
additional clusters or even cause other clusters to merge (known as
"chaining phenomenon", in particular with single-linkage clustering). In the
general case, the complexity is $\mathcal{O}(n^3)$, which makes them too slow for large data sets. For some special cases, optimal efficient methods (of complexity $\mathcal{O}(n^2)$) are known: SLINK [1] for single-linkage and CLINK [2] for complete-linkage clustering. In the data mining community these methods are
recognized as a theoretical foundation of cluster analysis, but often
considered obsolete. They did however provide inspiration for many later
methods such as density based clustering.
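As an illustration of the hierarchy these methods produce, the following sketch (assuming Python with SciPy) runs single-linkage clustering and then cuts the dendrogram at an arbitrary distance threshold; the data and the cut value are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2]])

Z = linkage(X, method='single')                    # build the merge hierarchy
labels = fcluster(Z, t=1.0, criterion='distance')  # the user chooses the cut
print(labels)                                      # two clusters at this cut
```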
Centroid-based clustering
In centroid-based clustering, clusters are represented by a central vector, which need not itself be a member of the data set. The best-known example is k-means, where the number of clusters is fixed to k and objects are assigned to the nearest cluster center so that the squared distances from the centers are minimized.
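A minimal sketch of Lloyd's algorithm, the standard heuristic for k-means, is given below (plain NumPy; the function name kmeans, the data handling, and the fixed iteration count are illustrative assumptions, not the report's own code):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assignment step: each object goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```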
Distribution-based clustering
The clustering model most closely related to statistics is based
on distribution models. Clusters can then easily be defined as objects
belonging most likely to the same distribution. A nice property of this approach is that it closely resembles the way artificial data sets are generated: by sampling random objects from a distribution.
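For illustration, the following sketch (assuming scikit-learn) fits a two-component Gaussian mixture with EM and reads off both hard and soft assignments; the synthetic data mirrors the sampling process just described.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Sample from two Gaussians, exactly the generation process described above
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM fitting
labels = gm.predict(X)       # hard assignment: most likely component
probs = gm.predict_proba(X)  # soft assignment: per-component probabilities
```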
Density-based clustering
In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set; objects in the sparse areas that separate clusters are usually considered noise or border points. The most popular density-based method is DBSCAN, with OPTICS as a generalization.
The key drawback of DBSCAN and OPTICS is that they expect some kind of
density drop to detect cluster borders. On data sets with e.g. overlapping
Gaussian distributions - a common use case in artificial data - the cluster
borders produced by these algorithms will often look arbitrary, because the
cluster density decreases continuously. On data sets consisting of mixtures of Gaussians, they will almost always be outperformed by methods such as EM clustering that are able to precisely model this kind of data.
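A brief DBSCAN sketch (assuming scikit-learn) is shown below; eps and min_samples are the density parameters such methods depend on, and the values used here are purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, size=(60, 2)),    # dense blob 1
               rng.normal(4.0, 0.3, size=(60, 2)),    # dense blob 2
               rng.uniform(-2.0, 6.0, size=(10, 2))])  # sparse noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # points labelled -1 are treated as noise
```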
Newer Developments
In recent years, considerable effort has been put into improving the performance of existing algorithms. Among the most popular are CLARANS (Ng and Han, 1994) and BIRCH (Zhang et al., 1996).
With the recent need to process larger and larger data sets (also known
as big data), the willingness to trade the semantic meaning of the generated clusters for performance has been increasing. This led to the development of
pre-clustering methods such as canopy clustering, which can process huge
data sets efficiently, but the resulting "clusters" are merely a rough pre-
partitioning of the data set to then analyze the partitions with existing slower
methods such as k-means clustering.
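A rough sketch of the canopy idea is shown below (plain NumPy; the thresholds t1 > t2 and the helper name canopy are illustrative assumptions): a cheap distance and two thresholds produce overlapping coarse groups that a slower method such as k-means can then refine.

```python
import numpy as np

def canopy(X, t1=3.0, t2=1.0):
    # t1 (loose) defines canopy membership; t2 (tight) removes future seeds
    X = np.asarray(X, dtype=float)
    remaining = set(range(len(X)))
    canopies = []
    while remaining:
        i = next(iter(remaining))               # pick an arbitrary seed
        d = np.linalg.norm(X - X[i], axis=1)    # cheap distance to all points
        members = [j for j in range(len(X)) if d[j] < t1]
        canopies.append(members)                # canopies may overlap
        remaining -= {j for j in members if d[j] < t2}
    return canopies
```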
Internal evaluation
When a clustering result is evaluated based on the data that was clustered
itself, this is called internal evaluation. These methods usually assign the
best score to the algorithm that produces clusters with high similarity within
a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications [21]. Additionally, this evaluation is biased towards algorithms that use the same cluster model. For example, k-means clustering naturally optimizes object distances, and a distance-based internal criterion will likely overrate the resulting clustering.
Davies-Bouldin index
The Davies-Bouldin index is defined as
$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$
where $n$ is the number of clusters, $c_i$ is the centroid of cluster $i$, $\sigma_i$ is the average distance of all members of cluster $i$ to $c_i$, and $d(c_i, c_j)$ is the distance between centroids $c_i$ and $c_j$. Since clusterings with low intra-cluster distances and high inter-cluster distances yield a low value, the algorithm that produces the smallest Davies-Bouldin index is considered best by this criterion.
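For illustration, scikit-learn (assumed) ships implementations of such internal measures; the sketch below scores a k-means result with the Davies-Bouldin index and the silhouette coefficient on synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(4.0, 0.5, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Davies-Bouldin:", davies_bouldin_score(X, labels))  # lower is better
print("Silhouette:    ", silhouette_score(X, labels))      # higher is better
```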
External evaluation
In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels created by human experts.
Rand index
The Rand index computes how similar the clusters (returned by the
clustering algorithm) are to the benchmark classifications. One can also view
the Rand index as a measure of the percentage of correct decisions made by
the algorithm. It can be computed using the following formula:
$RI = \frac{TP + TN}{TP + FP + FN + TN}$
where $TP$ is the number of true positives (pairs correctly placed in the same cluster), $TN$ the number of true negatives (pairs correctly placed in different clusters), $FP$ the number of false positives, and $FN$ the number of false negatives.
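The following self-contained sketch computes the Rand index directly from pair counts, following the formula above; the label vectors are illustrative.

```python
from itertools import combinations

def rand_index(truth, pred):
    tp = fp = fn = tn = 0
    for (t1, p1), (t2, p2) in combinations(list(zip(truth, pred)), 2):
        same_truth = (t1 == t2)
        same_pred = (p1 == p2)
        if same_truth and same_pred:
            tp += 1  # pair correctly placed in the same cluster
        elif not same_truth and not same_pred:
            tn += 1  # pair correctly placed in different clusters
        elif same_pred:
            fp += 1  # grouped together, but should be apart
        else:
            fn += 1  # kept apart, but should be together
    return (tp + tn) / (tp + fp + fn + tn)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 5 of 6 pair decisions correct
```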
F-measure
The F-measure can be used to balance the contribution of false negatives by weighting recall through a parameter $\beta \geq 0$. Let $P$ be the precision rate and $R$ the recall rate. We can calculate the F-measure using the following formula [21]:
$F_\beta = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$
Notice that when $\beta = 0$, $F_0 = P$; increasing $\beta$ allocates an increasing amount of weight to recall.
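A direct transcription of this formula into code might look as follows (the precision and recall values are illustrative):

```python
def f_measure(precision, recall, beta=1.0):
    # F-measure with recall weighted by beta, per the formula above
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.8, 0.6))          # F1 ~ 0.686
print(f_measure(0.8, 0.6, beta=2))  # larger beta weights recall more heavily
```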
Jaccard index
The Jaccard index is used to quantify the similarity between two datasets.
The Jaccard index takes on a value between 0 and 1: an index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
This is simply the number of unique elements common to both sets divided
by the total number of unique elements in both sets.
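Using Python sets, the definition translates directly (the example sets are illustrative; returning 1 for two empty sets is a common convention, not something the text specifies):

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 common / 4 total = 0.5
```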
Confusion matrix
A confusion matrix can be used to quickly visualize the results of a clustering (or classification) algorithm against benchmark classes: each cell counts how many objects of a given true class were assigned to a given cluster.
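As a sketch (assuming scikit-learn), tabulating predicted cluster labels against benchmark classes looks like this; the label vectors are illustrative.

```python
from sklearn.metrics import confusion_matrix

truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(confusion_matrix(truth, pred))
# [[2 1]
#  [0 3]]
```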
APPLICATIONS:
Transcriptomics
Sequence analysis
Medicine
1. Medical imaging
2. IMRT segmentation
Clustering can be used to divide a fluence map into distinct regions for
conversion into deliverable fields in MLC-based Radiation Therapy.
World Wide Web
1. Grouping of shopping items: Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products (eBay does not have the concept of a SKU).
2. Map overlays: Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This makes the map both faster to render and less visually cluttered.
Computer science
1. Software evolution
2. Image segmentation
3. Evolutionary algorithms
4. Recommender systems
Crime Analysis
Cluster analysis can be used to identify areas where there are greater
incidences of particular types of crime. By identifying these distinct areas or
"hot spots" where a similar crime has happened over a period of time, it is
possible to manage law enforcement resources more effectively.
Earth science
1. Climatology
2. Petroleum Geology
3. Physical Geography
Conclusion
Cluster analysis methods will always produce a grouping, but the groupings produced may or may not prove useful for classifying objects. If the groupings discriminate between variables not used to do the grouping, and those discriminations are useful, then cluster analysis is useful. For example, if grouping zip code areas into fifteen categories
based on age, gender, education, and income discriminates between wine
drinking behaviors, it would be very useful information if one was interested
in expanding a wine store into new areas.
Cluster analysis methods are not clearly established. There are many options
one may select when doing a cluster analysis using a statistical package.
Cluster analysis is thus open to the criticism that a statistician may mine the
data trying different methods of computing the proximities matrix and linking
groups until he or she "discovers" the structure that he or she originally
believed was contained in the data. One wonders why anyone would bother
to do a cluster analysis for such a purpose.