Unit 3
Measuring similarity and dissimilarity in data mining is an important task that helps identify
patterns and relationships in large datasets. To quantify the degree of similarity or dissimilarity
between two data points or objects, mathematical functions called similarity and dissimilarity
measures are used. Similarity measures produce a score that indicates the degree of similarity
between two data points, while dissimilarity measures produce a score that indicates the degree
of dissimilarity between two data points. These measures are crucial for many data mining tasks,
such as identifying duplicate records, clustering, classification, and anomaly detection.
Dissimilarity Measure
For nominal variables, these measures are binary, indicating whether two values are equal or not.
For ordinal variables, it is the difference between two values, normalized by the maximum distance. For other variable types, it is simply a distance function.
Distinction Between Distance And Similarity
Distance is a typical measure of dissimilarity between two data points or objects, whereas
similarity is a measure of how similar or alike two data points or objects are. Distance measures
typically produce a non-negative value that increases as the data points become more dissimilar.
Distance measures are fundamental to various algorithms, such as KNN, K-Means,
etc. On the other hand, similarity measures typically produce a non-negative value that increases
as the data points become more similar.
Similarity Measures
o Similarity measures are mathematical functions used to determine the degree of similarity
between two data points or objects. These measures produce a score that indicates how
similar or alike the two data points are.
o A similarity measure takes two data points as input and produces a similarity score as output, typically ranging from 0 (completely dissimilar) to 1 (identical or perfectly similar).
Cosine Similarity
Cosine similarity is a widely used similarity measure in data mining and information retrieval. It
measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In
the context of data mining, these vectors represent the feature vectors of two data points. In general, the cosine similarity score ranges from -1 to 1; for the non-negative feature vectors common in data mining (such as term frequencies), it ranges from 0 to 1, with 0 indicating no similarity and 1 indicating perfect similarity.
The cosine similarity between two vectors is calculated as the dot product of the vectors divided by the product of their magnitudes. This calculation can be represented mathematically as follows:

cosine_similarity(A, B) = (A · B) / (||A|| ||B||)

where A and B are the feature vectors of two data points, "·" denotes the dot product, and "|| ||" denotes the magnitude (length) of a vector.
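As a minimal sketch, the formula above can be computed directly in pure Python (the function name is ours):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0: vectors point the same way
print(cosine_similarity([1, 0], [0, 1]))        # 0.0: orthogonal vectors
```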
Jaccard Similarity
The Jaccard similarity is another widely used similarity measure in data mining, particularly in
text analysis and clustering. It measures the similarity between two sets of data by calculating the
ratio of the intersection of the sets to their union. The Jaccard similarity score ranges from 0 to 1,
with 0 indicating no similarity and 1 indicating perfect similarity.
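The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. A minimal pure-Python sketch (the function name and example documents are ours):

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

doc1 = {"data", "mining", "is", "fun"}
doc2 = {"data", "mining", "is", "hard"}
print(jaccard_similarity(doc1, doc2))  # 3 shared words / 5 total words = 0.6
```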
Pearson Correlation Coefficient
The Pearson correlation coefficient is a widely used similarity measure in data mining and
statistical analysis. It measures the linear correlation between two continuous variables, X and Y.
The Pearson correlation coefficient ranges from -1 to +1, with -1 indicating a perfect negative
correlation, 0 indicating no correlation, and +1 indicating a perfect positive correlation. The
Pearson correlation coefficient is commonly used in data mining applications such as feature
selection and regression analysis. It can help identify variables that are highly correlated with
each other, which can be useful for reducing the dimensionality of a dataset. In regression
analysis, it can also be used to predict the value of one variable based on the value of another
variable.
The Pearson correlation coefficient between two variables, X and Y, is calculated as follows:

r = Σ (X_i − X̄)(Y_i − Ȳ) / √[ Σ (X_i − X̄)² · Σ (Y_i − Ȳ)² ]

where X̄ and Ȳ are the means of X and Y, and the sums run over all n observations.
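A minimal sketch of this formula in pure Python (the function name is ours):

```python
import math

def pearson_correlation(x, y):
    """Pearson's r between two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)  # sum of squared deviations of X
    ss_y = sum((yi - mean_y) ** 2 for yi in y)  # sum of squared deviations of Y
    return cov / math.sqrt(ss_x * ss_y)

print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # +1.0: perfect positive
print(pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0: perfect negative
```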
Sørensen-Dice Coefficient
The Sørensen-Dice coefficient, also known as the Dice similarity index or Dice coefficient, is a
similarity measure used to compare the similarity between two sets of data, typically used in the
context of text or image analysis. The coefficient ranges from 0 to 1, with 0 indicating no
similarity and 1 indicating perfect similarity. The Sørensen-Dice coefficient is commonly used in
text analysis to compare the similarity between two documents based on the set of words or
terms they contain. It is also used in image analysis to compare the similarity between two
images based on the set of pixels they contain.
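The coefficient is defined as DSC(A, B) = 2·|A ∩ B| / (|A| + |B|). A minimal pure-Python sketch (the function name and example documents are ours):

```python
def dice_coefficient(a, b):
    """2·|A ∩ B| / (|A| + |B|) for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = {"data", "mining", "is", "fun"}
doc2 = {"data", "mining", "is", "hard"}
print(dice_coefficient(doc1, doc2))  # 2*3 / (4+4) = 0.75
```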
Points to remember when choosing a similarity measure −
o Different similarity measures are suitable for different data types, such as continuous or categorical data, text or image data, etc. For example, the Pearson correlation coefficient is only suitable for continuous variables.
o Some similarity measures are sensitive to the scale of measurement of the data.
o The choice of similarity measure also depends on the specific task at hand. For example, cosine similarity is often used in information retrieval and text mining, while Jaccard similarity is commonly used in clustering and recommendation systems.
o Some similarity measures are more robust to noise and outliers in the data than others. For example, the Sørensen-Dice coefficient is less sensitive to noise.
Dissimilarity Measures
Dissimilarity measures are used to quantify the degree of difference or distance
between two objects or data points.
Euclidean Distance
Euclidean distance is a commonly used dissimilarity measure that quantifies the
distance between two points in a multidimensional space. It is named after the ancient
Greek mathematician Euclid, who first studied its properties. The Euclidean distance
between two points X and Y in an n-dimensional space is defined as the square
root of the sum of the squared differences between their corresponding coordinates, as
shown below:

d(X, Y) = √( Σ_{i=1..n} (x_i − y_i)² )
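A minimal sketch of this formula in pure Python (the function name is ours):

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0: the classic 3-4-5 triangle
```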
Manhattan Distance
Manhattan distance, also known as city block distance, is a dissimilarity measure that
quantifies the distance between two points in a multidimensional space. It is named
after the grid-like street layout of Manhattan, where the distance between
two points is measured by the number of blocks one has to walk horizontally and
vertically to reach the other point. The Manhattan distance between two
points X and Y in an n-dimensional space is defined as the sum of the absolute
differences between their corresponding coordinates, as shown below:

d(X, Y) = Σ_{i=1..n} |x_i − y_i|
In data mining and machine learning, the Manhattan distance is commonly used in
clustering, classification, and anomaly detection applications. It is particularly useful
when dealing with high-dimensional data, sparse data, or data with outliers, as it is
less sensitive to extreme values than the Euclidean distance. However, it may not be
suitable for data that exhibit complex geometric structures or nonlinear relationships
between features.
Minkowski Distance
Minkowski distance is a generalization of Euclidean distance and Manhattan distance,
both of which are special cases of it. The Minkowski distance of order p between two
points X and Y in an n-dimensional space is defined as:

d(X, Y) = ( Σ_{i=1..n} |x_i − y_i|^p )^(1/p)

Setting p = 1 yields the Manhattan distance, and p = 2 yields the Euclidean distance.
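A minimal pure-Python sketch (the function name is ours) showing how the two special cases fall out of the general formula:

```python
def minkowski_distance(x, y, p):
    """(Σ |x_i − y_i|^p)^(1/p); p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, 1))  # 7.0 (Manhattan: 3 + 4)
print(minkowski_distance(a, b, 2))  # 5.0 (Euclidean: √(9 + 16))
```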
Hamming Distance
Hamming distance is a distance metric used to measure the dissimilarity between two
strings of equal length. It is defined as the number of positions at which the
corresponding symbols in the two strings are different.
For example, consider the strings "101010" and "111000". The Hamming distance
between these two strings is two, since there are two positions at which the
corresponding symbols differ: the second and fifth positions.
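A minimal pure-Python sketch (the function name is ours) that confirms the count for the example above:

```python
def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("strings must be of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("101010", "111000"))  # 2 (positions 2 and 5 differ)
```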
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of n objects, and the partitioning method constructs k partitions
of the data, where each partition represents a cluster and k ≤ n. This means it classifies the data into
k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
Then it uses an iterative relocation technique to improve the partitioning by moving objects
from one group to another, as in the sketch below.
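As a minimal sketch of iterative relocation, here is a toy k-means-style partitioner in pure Python (the function and variable names are ours; a real application would use a tested library such as scikit-learn):

```python
import random

def kmeans(points, k, iterations=100):
    """Toy k-means: assign each point to its nearest centroid, then relocate centroids."""
    random.seed(0)  # reproducible initial centroids for this sketch
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Relocation step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return clusters, centroids

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
clusters, centroids = kmeans(points, k=2)
print(centroids)  # one centroid near each of the two visible groups
```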
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition
is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps on merging the objects or groups that are
close to one another. It keeps doing so until all of the groups are merged into one or
until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. With each iteration, a cluster is split up into
smaller clusters. This continues until each object is in its own cluster or the termination
condition holds. This method is rigid, i.e., once a merging or splitting is done, it can
never be undone.
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration with iterative relocation: first use a hierarchical agglomerative algorithm to group objects into micro-clusters, and then perform macro-clustering on the micro-clusters.
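As an illustrative sketch, SciPy's hierarchical-clustering routines implement the agglomerative (bottom-up) approach described above; the data and the choice of average linkage here are only examples:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two visually separate groups.
X = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9], [8.5, 9.5]])

# Bottom-up (agglomerative) merging using average linkage.
Z = linkage(X, method="average")

# Cut the resulting tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```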
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some threshold,
i.e., for each data point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points.
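A brief illustration using scikit-learn's DBSCAN, a well-known density-based algorithm; the eps (neighborhood radius) and min_samples values here are illustrative and must be tuned for real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point.
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0],
              [8, 8], [8.1, 8.2], [7.9, 8.0],
              [50, 50]])

# Grow clusters wherever a radius-eps neighborhood holds >= min_samples points.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks the isolated outlier
```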
Grid-based Method
In this method, the objects together form a grid: the object space is quantized into a finite
number of cells that form a grid structure.
Advantages
The major advantage of this method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.
Model-Based Method
In this method, a model is hypothesized for each cluster to find the best fit of the data to
a given model. This method locates the clusters by clustering the density function, which
reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outliers and noise into account. It therefore yields
robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by incorporating user- or application-oriented
constraints. A constraint refers to the user's expectations or the properties of the desired
clustering results.
Applications of Clustering
To cluster data points well, we need to choose an appropriate clustering algorithm, set its
parameters, and select the metrics or techniques to be used. Clustering then supports a
variety of data mining tasks −
1. Pattern Discovery
With the help of clustering, we can discover patterns and connections in data.
Grouping similar data points makes the data simpler to understand and analyze,
and can reveal structure in otherwise unstructured data.
2. Data Summarization
With the help of clustering, we can also summarize large data sets into a
smaller number of clusters that are much easier to manage. The data analysis process can
be made simpler by working with clusters rather than individual data points.
3. Anomaly Detection
Clustering helps us identify anomalies and outliers in data mining.
Data points that are not part of any cluster or that form small, unusual
clusters could indicate errors or unusual events that need to be addressed.
4. Customer Segmentation
Clustering is widely used to segment customers into groups with similar behavior
or preferences, so that products, services, and marketing can be tailored to each group.
5. Recommendation Systems
By grouping users or items with similar characteristics, clustering can help
recommend products or content that similar users have liked.
6. Scientific Research
Clustering is applied across scientific fields, for example to group genes with
similar expression patterns or documents on similar topics.
7. Data Preprocessing
Clustering can serve as a preprocessing step, for example to reduce the size of a
data set by replacing groups of similar points with representative prototypes.
8. Risk Assessment
Using clustering, we can assess risks and spot fraud in the finance sector. It
also helps group unusual patterns in financial transactions for further
investigation.