HIERARCHICAL CLUSTERING CASE STUDY

Done By: Likitha T Reddy

INTRODUCTION
Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is an algorithm
that groups similar objects into clusters. The result is a set of clusters in which each cluster
is distinct from the others and the objects within each cluster are broadly similar to one
another. Either a distance matrix or raw data is required to perform hierarchical clustering;
when raw data is given, the software automatically computes a distance matrix in the
background. The agglomerative hierarchical clustering method begins by treating each
observation as a separate cluster and then repeatedly executes the following two steps:
1. The two closest clusters are identified.
2. The two identified clusters are merged into one.
This process continues until all clusters have been merged together. Similarity between
clusters is measured by the Euclidean distance by default: the distance between two clusters
is the length of the straight line drawn from one cluster to the other. Other available distance
metrics include Minkowski, cosine, correlation, Chebyshev, and Spearman.
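As an illustration, the sketch below shows how these metric choices might look with SciPy's hierarchical clustering routines (SciPy and NumPy are assumed to be installed; the data and variable names are purely illustrative). Note that Spearman distance is not a built-in SciPy metric and would require a precomputed distance matrix.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 illustrative 2-D observations

# Euclidean is the default metric; others are passed explicitly.
Z_euclidean = linkage(X, method='average')
Z_cosine = linkage(X, method='average', metric='cosine')
Z_chebyshev = linkage(X, method='average', metric='chebyshev')

# Cut the merge tree into, say, 3 flat clusters.
labels = fcluster(Z_euclidean, t=3, criterion='maxclust')
print(labels)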
The choice of distance metric should be guided by theoretical considerations from the field
of study: the metric must characterize similarity in a way that is reasonable for the domain.
Where there is no theoretical justification for an alternative, Euclidean distance is generally
preferred, as it is the usual measure of distance in the physical world. After selecting a
distance metric, it is necessary to decide where distance is measured from. It can be measured
between the two most similar members of a pair of clusters (single linkage), the two least
similar members (complete linkage), the centers of the clusters (mean or average linkage), or
some other measure. This linkage method defines how the distance between two clusters is
calculated, and the choice is again a theoretical question. Average linkage (often the default)
and Ward's method are two common linkage criteria. Where there is no clear theoretical
reason for a particular choice, Ward's method is a sensible default, since its notion of
distance matches the standard statistical assumptions about how to measure differences
between groups.
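The linkage choices described above map onto the method argument of SciPy's linkage function; a minimal sketch, again with illustrative data:

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 3))

Z_single = linkage(X, method='single')      # two most similar members
Z_complete = linkage(X, method='complete')  # two least similar members
Z_average = linkage(X, method='average')    # mean pairwise distance
Z_ward = linkage(X, method='ward')          # minimizes within-cluster variance

# Note: Ward's method assumes Euclidean distances between observations.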
In data mining and statistics, hierarchical clustering is a method of cluster analysis that seeks
to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two types:
1. Agglomerative: a bottom-up approach. Each observation begins in its own cluster,
and pairs of clusters are merged as one moves up the hierarchy.
2. Divisive: a top-down approach. All observations start in one cluster, and splits are
performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a dendrogram, the main output of the
algorithm, which depicts the relationships between the clusters. The standard algorithm for
hierarchical agglomerative clustering has a time complexity of O(n^3) and requires
O(n^2) memory, which makes it too slow for even medium-sized data sets.
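The dendrogram itself can be drawn with SciPy and Matplotlib; the following is a minimal sketch under the assumption that both libraries are installed:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 2))

# Z is an (n - 1) x 4 matrix with one row per greedy merge.
Z = linkage(X, method='ward')
dendrogram(Z)
plt.xlabel('Observation index')
plt.ylabel('Merge distance')
plt.show()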
A few applications of hierarchical clustering:
 US Senator Clustering through Twitter.
 Charting Evolution through Phylogenetic Trees.
 Tracking Viruses through Phylogenetic Trees.
 Recognition Using Biometrics of the Hand.

RELATED WORKS
Photovoltaic Power Data Analysis (2018) by Sungsik Park and Young B. Park - Conventional
photovoltaic power data analysis considers only the meaning of each individual cluster, which makes
it difficult to grasp the similarity between clusters, so the authors use a hierarchical clustering
algorithm (combined with k-means) to calculate the relationships between clusters.

Network Traffic Data Reduction for Improving Suspicious Flow Detection (2018) by Liya Su,
Yepeng Yao, Ning Li, Junrong Liu, Zhigang Lu, and Baoxu Liu - Based on event analysis with
pre-set patterns, namely misuse detection; combines hierarchical clustering with a multinomial
Naïve Bayes supervised learning model.

Ship Trajectory Data (2017) by Liangbin Zhao, Guoyou Shi, and Jiaxuan Yang - A hierarchical
clustering method that can support decision-making in fields such as waterway construction and
maritime control.

The Grouping of Facial Images to Improve the CBIR-Based Face Recognition System (2017) by
Muhammad Fachrurrozi, Saparudin, Erwin, and Clara Fin Badillah - Facial images with dimensions
of 150x150 pixels are used as training data with the AHC and LBP methods.

Optimization Method for the Cotton Production Process (2017) by Ms. Shraddha K. Popat, Mr.
Pramod B. Deshmukh, and Ms. Vishakha A. Metre - Presents a novel technique that uses more than
one point for measuring similarity, and focuses on selecting an appropriate similarity measure for
analysing the similarity between documents.

Metabolomics Data Analysis in the Presence of Cell-wise and Case-wise Outliers (2016) by
Kanchanamala, Vineela, and Neelima - Metabolomics data are collected using high-throughput
technology that yields a high-dimensional data matrix, which may be contaminated by cell-wise
and case-wise outliers; the authors apply the TSGS method.

ALGORITHMS AND FLOWCHARTS


 CURE (Clustering Using Representatives)
CURE is an agglomerative hierarchical clustering algorithm that strikes a balance between
the centroid-based and all-points approaches and uses partitioning of the dataset. Large
databases can be handled through a combination of random sampling and partitioning: a
random sample drawn from the data set is first partitioned, and each partition is partially
clustered. The partial clusters are then clustered again in a second pass to yield the desired
clusters. Experiments confirm that the quality of clusters produced by CURE is much better
than that of clusters found by other existing algorithms.
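To make the representative-point idea concrete, here is a heavily simplified sketch of CURE's merging step. It omits the random sampling, partitioning, and kd-tree speedups of the real algorithm, and the helper names and parameters are our own (n_rep and shrink stand in for the paper's c and alpha):

import numpy as np

def representatives(points, n_rep=4, shrink=0.5):
    # Pick up to n_rep well-scattered points, then shrink them toward the centroid.
    centroid = points.mean(axis=0)
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(points)):
        # Next representative: the point farthest from the current set.
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    return reps + shrink * (centroid - reps)

def cure_sketch(X, n_clusters=3, n_rep=4, shrink=0.5):
    clusters = [[i] for i in range(len(X))]   # start: one cluster per point
    while len(clusters) > n_clusters:
        best = None
        # Merge the pair of clusters whose representative points are closest.
        for a in range(len(clusters)):
            ra = representatives(X[clusters[a]], n_rep, shrink)
            for b in range(a + 1, len(clusters)):
                rb = representatives(X[clusters[b]], n_rep, shrink)
                d = min(np.linalg.norm(p - q) for p in ra for q in rb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
print([len(c) for c in cure_sketch(X)])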

 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

BIRCH is an agglomerative hierarchical clustering algorithm that is especially suitable for
very large databases. The method is designed to minimize the number of I/O operations.
BIRCH begins by hierarchically partitioning objects using a tree structure and then applies
other clustering algorithms to refine the clusters. It incrementally and dynamically clusters
incoming data points and tries to produce the best quality clustering within the available
resources, such as memory and time constraints.
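scikit-learn ships a BIRCH implementation; assuming it is installed, a minimal usage sketch looks like this (threshold and branching_factor control the in-memory CF-tree that BIRCH builds in a single pass over the data):

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)   # partial_fit supports incremental batches
print(labels[:10])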

 ROCK (Robust Clustering using Links)

ROCK is a robust agglomerative hierarchical clustering algorithm based on the notion of
links and appropriate for handling large data sets. ROCK merges data points using the
links between them rather than the distance between them. The algorithm is most relevant
for clustering data that have Boolean and categorical attributes. Cluster similarity is based
on the number of points from different clusters that have neighbors in common. ROCK not
only exhibits good scalability but also generates better quality clusters than traditional
algorithms.
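The core notion of links can be sketched in a few lines: two points are neighbors when their Jaccard similarity reaches a threshold theta, and link(p, q) counts their common neighbors. This toy version, with illustrative data and our own function names, omits ROCK's goodness measure and merging loop:

import numpy as np

def jaccard(a, b):
    # Jaccard similarity of two Boolean attribute vectors.
    union = np.sum(a | b)
    return np.sum(a & b) / union if union else 0.0

def link_matrix(X, theta=0.5):
    n = len(X)
    # A[i, j] = 1 when points i and j are neighbors.
    A = np.array([[1 if jaccard(X[i], X[j]) >= theta else 0
                   for j in range(n)] for i in range(n)])
    np.fill_diagonal(A, 0)
    return A @ A        # (A @ A)[i, j] = number of common neighbors

X = np.array([[1, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=bool)
print(link_matrix(X, theta=0.4))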

 CHAMELEON Algorithm

CHAMELEON uses dynamic modeling to obtain clusters of arbitrary densities and
arbitrary shapes. It measures the similarity of two clusters based on a dynamic model,
which facilitates the discovery of natural and homogeneous clusters through the merging
process. The methodology of dynamic modeling of clusters used in this algorithm is
applicable to all types of data as long as a similarity matrix can be constructed.
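CHAMELEON works in two phases: it first builds a sparse k-nearest-neighbor graph and partitions it into many small sub-clusters, then merges sub-clusters whose relative interconnectivity and relative closeness are both high. The sketch below shows only the first ingredient, the k-NN graph, using scikit-learn (assumed installed); the graph-partitioning and dynamic merging phases are omitted:

import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))

# Sparse k-NN graph weighted by distance; CHAMELEON reasons over this
# graph rather than over the raw feature space.
G = kneighbors_graph(X, n_neighbors=5, mode='distance', include_self=False)
print(G.shape, G.nnz)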
 Bisecting k-Means

BKMS is a divisive hierarchical clustering algorithm proposed by Steinbach et al. in the
context of document clustering. Bisecting k-means always chooses the split that yields the
partition with the highest overall similarity, calculated from the pairwise similarity of all
points in a cluster, and this procedure continues until the desired number of clusters is
obtained. Bisecting k-means is reported to frequently outperform the standard k-means and
agglomerative clustering approaches. Its advantage is low computational cost: for clustering
large document collections, BKMS is found to perform better than k-means (KMS) and
agglomerative hierarchical algorithms.
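A minimal bisecting k-means sketch, assuming scikit-learn is installed and using cluster size as the splitting criterion (Steinbach et al. also discuss choosing by overall similarity; recent scikit-learn versions additionally ship a ready-made BisectingKMeans class):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(len(X))]          # start with one big cluster
    while len(clusters) < k:
        # Pick the largest cluster to bisect.
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
print([len(c) for c in bisecting_kmeans(X, k=4)])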

ACTUAL WORK
CONCLUSION
Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. A
weakness of pure hierarchical clustering methods is their inability to revisit a merge or split
decision once it has been executed: a poorly chosen merge or split at some step may lead to
somewhat low-quality clusters. One promising direction for improving the clustering quality
of hierarchical methods is to combine hierarchical clustering with other techniques in a
multi-phase clustering approach. We plan to enhance this algorithm to work with both
numerical and categorical data, which should increase its efficiency and usefulness many
times over. Future work would also involve testing the algorithm on several more real
datasets with thousands of tuples and comparing the results with existing algorithms, and,
if needed, modifying the formula for calculating similarity for categorical data for
performance analysis. Real datasets come with noise and outliers, which we shall try to
handle in the algorithm we have proposed.

REFERENCES
1. Chen, D., Cui, D. W., Wang, C. X., & Wang, Z. R. (2006). A rough set-based
hierarchical clustering algorithm for categorical data. International Journal of
Information Technology, 12(3), 149-159.
2. Milligan, G. W., & Cooper, M. C. (1986). A study of the comparability of external
criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21(4),
441-458.
3. Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall.
4. Steinbach, M., Karypis, G., & Kumar, V. (2000, August). A comparison of document
clustering techniques. In KDD Workshop on Text Mining (Vol. 400, No. 1, pp. 525-
526).
5. Murtagh, F. (1983). A survey of recent advances in hierarchical clustering
algorithms. The Computer Journal, 26(4), 354-359.
6. Day, W. H., & Edelsbrunner, H. (1984). Efficient algorithms for agglomerative
hierarchical clustering methods. Journal of Classification, 1(1), 7-24.
