Cluster Analysis
- Cluster analysis is a statistical technique used in data analysis to group similar data points together
based on certain characteristics or attributes.
- It is a fundamental step in data mining and a common technique for statistical data analysis. It is
used in many fields, such as data compression, machine learning, pattern recognition, and
information retrieval.
The goal is to find patterns or structure within a dataset by identifying clusters of data points that are
more similar to each other than to those in other clusters.
Clusters should exhibit high internal homogeneity and high external heterogeneity.
When plotted geometrically, objects within a cluster should lie close together, while different clusters
should lie far apart.
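The idea of "close together inside, far apart outside" can be checked numerically. The sketch below is only an illustration: the points and their grouping are made up, and the two helper functions are hypothetical names, not part of any library. It compares the average distance between points of the same cluster with the average distance between points of different clusters.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points, already grouped into two clusters for illustration.
clusters = {
    "a": [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    "b": [(8.0, 8.0), (8.3, 7.9), (7.8, 8.2)],
}

def mean_intra_distance(points):
    """Average pairwise distance between the points of one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(points_a, points_b):
    """Average distance between points taken from two different clusters."""
    total = sum(dist(p, q) for p in points_a for q in points_b)
    return total / (len(points_a) * len(points_b))

intra = {name: mean_intra_distance(pts) for name, pts in clusters.items()}
inter = mean_inter_distance(clusters["a"], clusters["b"])

# A good clustering has small intra-cluster and large inter-cluster distances.
print("intra:", intra)
print("inter:", inter)
```

For a well-separated grouping like this one, every intra-cluster value should be much smaller than the inter-cluster value.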
* Marketing: In marketing, cluster analysis can be used to segregate customers into different buckets
based on their buying patterns or interests. These are known as customer personas. Organizations then
use different marketing strategies for different clusters of customers.
* Risk Analysis in Finance: Financial organizations use various cluster analysis algorithms for segregating
their customers into various risk categories based on their bank balance and debt. While approving
loans, insurance, or credit cards, these clusters are used to aid in decision-making.
* Real Estate: Infrastructure specialists use clustering to group houses according to their size, location,
and market value. This information is used to assess the real estate potential of different parts of a city.
Hierarchical Clustering
In the agglomerative method, each object starts as its own cluster. The two most similar (closest)
clusters are then merged into a single cluster, and this process is repeated until all objects are in
one cluster. In other words, agglomerative clustering starts with single objects and groups them
step by step into larger clusters.
The divisive method is another kind of hierarchical method. Here, clustering starts with the complete
data set as one cluster and then repeatedly divides it into smaller partitions.
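The agglomerative procedure can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), neither of which is specified in the text above. The function names and the sample points are hypothetical.

```python
from math import dist

def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters = distance of their closest pair (assumed linkage)."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerative(points, target_clusters=1):
    """Merge the two closest clusters until only `target_clusters` remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
three_groups = agglomerative(points, target_clusters=3)
one_group = agglomerative(points, target_clusters=1)
print(three_groups)
```

Stopping at `target_clusters=1` reproduces the "repeat until everything is in one cluster" behaviour; stopping earlier corresponds to cutting the hierarchy at a chosen level.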
Centroid-based Clustering
In this type of clustering, each cluster is represented by a central entity (a centroid), which may or
may not be a part of the given data set. K-Means is the best-known method of this type: k is the
number of cluster centers, and each object is assigned to its nearest cluster center.
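The two ideas in this paragraph, a center that need not be an actual data point and assignment to the nearest center, can be illustrated as follows. This is a small sketch under the assumption of Euclidean distance; the helper names and the sample coordinates are hypothetical.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a group of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest_center(point, centers):
    """Index of the center closest to `point` (Euclidean distance, assumed)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

group = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
center = centroid(group)

# The centroid is the mean of the group and is not one of the original points here.
print(center, center in group)

# Assign a new object to the nearest of two hypothetical centers.
centers = [center, (6.0, 5.0)]
print(nearest_center((4.0, 4.0), centers))
```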
Distribution-based Clustering
This type of clustering is closely related to statistics and is based on distribution models. Objects
that most likely belong to the same distribution are put into a single cluster. This type of clustering
can capture some complex properties of objects, such as correlation and dependence between attributes.
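As an illustration of the distribution-based idea, the sketch below fits a mixture of two one-dimensional Gaussian distributions with the expectation-maximization (EM) procedure and assigns each value to the distribution that most likely produced it. The choice of a Gaussian mixture, the starting values, the fixed number of iterations, and the sample data are all assumptions made for this example; the text above does not name a specific model.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Probability that x came from each of the mixture components."""
    likes = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(likes)
    return [like / total for like in likes]

def fit_two_gaussians(data, iterations=50):
    """EM for a 1-D mixture of two Gaussians (assumed model)."""
    means = [min(data), max(data)]  # simple deterministic starting point
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: how strongly does each component explain each value?
        resp = [responsibilities(x, weights, means, variances) for x in data]
        # M-step: re-estimate each component from its weighted values.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsing variance
    return weights, means, variances

# Hypothetical data drawn around two different values.
data = [-0.5, 0.0, 0.3, 0.8, 1.1, 9.2, 9.8, 10.0, 10.5, 11.0]
weights, means, variances = fit_two_gaussians(data)

# Each value goes to the component with the highest responsibility.
labels = []
for x in data:
    r = responsibilities(x, weights, means, variances)
    labels.append(r.index(max(r)))
print(means, labels)
```

Unlike a hard nearest-center assignment, the responsibilities are probabilities, so a value lying between two distributions can be reported as uncertain rather than forced into one group.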
Density-based Clustering
In this type of clustering, clusters are defined as areas of higher density than the remainder of the
data set. Sparse areas are what separate the clusters from each other. The objects in these sparse
areas are usually treated as noise or border points. The most popular method in this type of
clustering is DBSCAN.
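The following is a compact sketch of the DBSCAN idea, written from the general description of the algorithm rather than taken from any library. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum number of neighbors for a dense point) follow common usage but are assumptions here, as are the sample points.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # sparse area; may later become a border point
            continue
        cluster_id += 1  # point i is dense: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is dense too, so keep expanding
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (8, 8), (8, 9), (9, 8), (9, 9),
    (4, 15),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Note that the number of clusters is not given in advance; it follows from the density parameters.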
The following example shows you how to use the centroid-based clustering algorithm to cluster 30
different points into five groups. You can plot points on a two-dimensional graph, as shown in the
graphs below.
On the left, we have a random distribution of the 30 points. The first iteration of the K-means clustering
divides the points into five groups, with each cluster represented by a different color, as shown in the
center graph.
The algorithm then iteratively recomputes the cluster centers and reassigns points to the nearest
center until the assignments no longer change. The end result will be five distinct clusters, as shown
in the graph on the right.
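The example can be reproduced approximately with a short K-means sketch. The 30 points below are generated at random with a fixed seed, so they are not the points from the graphs, and the initialization (picking five of the points as starting centers) is an assumption of this sketch. K-means converges to a stable grouping, which is not guaranteed to be the best possible one.

```python
import random
from math import dist

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k random points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # no point changed cluster, so the grouping is stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

# 30 hypothetical points scattered over a 10 x 10 area.
rng = random.Random(42)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(30)]

labels, centers = kmeans(points, k=5)
for cluster in range(5):
    print(cluster, [i for i, label in enumerate(labels) if label == cluster])
```

Running the function with a different `seed` can produce a different final grouping, which is why K-means is often run several times in practice.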
When Is Cluster Analysis Useful?
Cluster analysis helps us understand data and detect patterns. In certain cases, it provides a great
starting point for further analysis. In other cases, it can give you the greatest insights from the data.
Here are some cases when cluster analysis is useful:
* If you have large and unstructured data sets, it can be expensive and time-consuming to
label groups manually. In this case, cluster analysis offers a practical way to divide your data into
groups.
* When you don’t know the number of clusters in advance, cluster analysis can provide
the first insight into groups that are available in your data set.
* When you need to detect outliers in your data, cluster analysis can be an effective
alternative to traditional outlier detection methods, such as standard-deviation thresholds.
* Cluster analysis can help you detect anomalies. Outliers are observations distant
from the mean, but they don’t necessarily represent abnormalities. Anomalies, on the other hand, are
rare events or observations that deviate greatly from the expected pattern of the data.
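To see why a cluster-based check can behave differently from a standard-deviation check, consider the small sketch below. The values, the two cluster centers, and both thresholds (2 standard deviations, distance 1.0) are hypothetical and chosen only for illustration. The value 5.0 lies between two groups: it is close to the overall mean, so the standard-deviation rule does not flag it, but it is far from both cluster centers.

```python
from statistics import mean, pstdev

# Hypothetical 1-D data: one group near 0, one group near 10, and a stray value.
values = [0.1, -0.2, 0.0, 0.3, -0.1, 10.2, 9.9, 10.0, 9.8, 10.1, 5.0]
centers = [0.0, 10.0]  # assumed cluster centers, e.g. found by a clustering run

# Standard-deviation rule: flag values more than 2 sigma from the mean.
mu, sigma = mean(values), pstdev(values)
z_flagged = [v for v in values if abs(v - mu) / sigma > 2]

# Cluster-based rule: flag values far from every cluster center.
cluster_flagged = [v for v in values if min(abs(v - c) for c in centers) > 1.0]

print("standard deviation rule:", z_flagged)
print("cluster-based rule:", cluster_flagged)
```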
Cluster analysis has applications in many disparate industries and fields. Here’s a list of some disciplines
that make use of this methodology.
* Business Operations: Businesses can optimize their processes and reduce costs by
analyzing clusters and identifying similarities and differences between data points. For example, you can
identify patterns in customer data and improve customer support processes for a particular group that
may require special attention.
* Earth Observation: Using a clustering algorithm, you can create a pixel mask for objects
in an image. For example, you can use image segmentation to classify vegetation or built-up areas in a
satellite image.
* Data Science: We can use cluster analysis for predictive analytics. By applying machine
learning techniques to clusters, we can create predictive models to make inferences about a particular
data set.