Unit 3

The document discusses the importance of similarity and dissimilarity measures in data mining, which are mathematical functions used to quantify relationships between data points. It outlines various similarity measures such as Cosine similarity, Jaccard similarity, and Pearson correlation, as well as dissimilarity measures like Euclidean and Manhattan distances. Additionally, it covers clustering methods and their applications in pattern discovery, data summarization, and customer segmentation.

Introduction

Measuring similarity and dissimilarity in data mining is an important task that helps identify
patterns and relationships in large datasets. To quantify the degree of similarity or dissimilarity
between two data points or objects, mathematical functions called similarity and dissimilarity
measures are used. Similarity measures produce a score that indicates the degree of similarity
between two data points, while dissimilarity measures produce a score that indicates the degree
of dissimilarity between two data points. These measures are crucial for many data mining tasks,
such as identifying duplicate records, clustering, classification, and anomaly detection.

Basics of Similarity and Dissimilarity Measures


Similarity Measure

 A similarity measure is a mathematical function that quantifies the degree of similarity between two objects or data points. It is a numerical score measuring how alike two data points are.
 It takes two data points as input and produces a similarity score as output, typically ranging from 0 (completely dissimilar) to 1 (identical or perfectly similar).
 A similarity measure can be based on various mathematical techniques such as Cosine similarity, Jaccard similarity, and the Pearson correlation coefficient.
 Similarity measures are generally used to identify duplicate records, equivalent instances, or clusters.

Dissimilarity Measure

 A dissimilarity measure is a mathematical function that quantifies the degree of dissimilarity between two objects or data points. It is a numerical score measuring how different two data points are.
 It takes two data points as input and produces a dissimilarity score as output, ranging from 0 (identical or perfectly similar) to 1 (completely dissimilar). A few dissimilarity measures also have infinity as their upper limit.
 A dissimilarity measure can be obtained by using different techniques such as Euclidean distance, Manhattan distance, and Hamming distance.
 Dissimilarity measures are often used in identifying outliers, anomalies, or clusters.

Similarity and Dissimilarity Measures for Different Data Types

 For nominal variables, these measures are binary, indicating whether two values are equal or not.
 For ordinal variables, the measure is the difference between two values normalized by the maximum possible distance. For other (numeric) variables, it is simply a distance function.
Distinction Between Distance And Similarity

Distance is a typical measure of dissimilarity between two data points or objects, whereas similarity is a measure of how similar or alike two data points or objects are. Distance measures typically produce a non-negative value that increases as the data points become more dissimilar, and they are fundamental to many algorithms such as KNN and K-Means. Similarity measures, on the other hand, typically produce a non-negative value that increases as the data points become more similar.

Similarity Measures
o Similarity measures are mathematical functions used to determine the degree of similarity
between two data points or objects. These measures produce a score that indicates how
similar or alike the two data points are.
o They take two data points as input and produce a similarity score as output, typically ranging from 0 (completely dissimilar) to 1 (identical or perfectly similar).

Cosine Similarity

Cosine similarity is a widely used similarity measure in data mining and information retrieval. It measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In the context of data mining, these vectors represent the feature vectors of two data points. In general the cosine similarity score ranges from -1 to 1; for the non-negative feature vectors common in data mining it ranges from 0 (no similarity) to 1 (perfect similarity).

The cosine similarity between two vectors is calculated as the dot product of the vectors divided by the product of their magnitudes. This calculation can be represented mathematically as follows:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

where A and B are the feature vectors of two data points, "·" denotes the dot product, and "|| ||" denotes the magnitude of a vector.
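As an illustrative sketch, the calculation can be written in a few lines of Python (the function name and the toy vectors below are chosen only for demonstration):

import math

def cosine_similarity(a, b):
    # Dot product of the two feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Magnitudes (Euclidean norms) of the vectors
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy term-frequency vectors for two short documents
print(cosine_similarity([1, 2, 0, 3], [2, 1, 0, 1]))  # ≈ 0.76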

Jaccard Similarity

The Jaccard similarity is another widely used similarity measure in data mining, particularly in
text analysis and clustering. It measures the similarity between two sets of data by calculating the
ratio of the intersection of the sets to their union. The Jaccard similarity score ranges from 0 to 1,
with 0 indicating no similarity and 1 indicating perfect similarity.

The Jaccard similarity between two sets A and B is calculated as follows:

J(A, B) = |A ∩ B| / |A ∪ B|

where |A ∩ B| is the size of the intersection of sets A and B, and |A ∪ B| is the size of the union of sets A and B.
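A minimal Python sketch of this set-based calculation (the function name and the sample word sets are illustrative only):

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    # Shared elements divided by all distinct elements
    return len(a & b) / len(a | b)

# Toy word sets for two short documents
print(jaccard_similarity({"data", "mining", "cluster"}, {"data", "mining", "text"}))  # 0.5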

Pearson Correlation Coefficient

The Pearson correlation coefficient is a widely used similarity measure in data mining and
statistical analysis. It measures the linear correlation between two continuous variables, X and Y.
The Pearson correlation coefficient ranges from -1 to +1, with -1 indicating a perfect negative
correlation, 0 indicating no correlation, and +1 indicating a perfect positive correlation. The
Pearson correlation coefficient is commonly used in data mining applications such as feature
selection and regression analysis. It can help identify variables that are highly correlated with
each other, which can be useful for reducing the dimensionality of a dataset. In regression
analysis, it can also be used to predict the value of one variable based on the value of another
variable.

The Pearson correlation coefficient between two variables, X and Y, is calculated as follows:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

where x̄ and ȳ are the means of X and Y, and the sums run over the n paired observations.
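A small pure-Python sketch of this calculation (the function name and the sample data are illustrative, not drawn from any real dataset):

import math

def pearson_correlation(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance numerator and the two variance terms
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data with a strong positive linear relationship
print(pearson_correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]))  # ≈ 0.85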

Sørensen-Dice Coefficient

The Sørensen-Dice coefficient, also known as the Dice similarity index or Dice coefficient, is a
similarity measure used to compare the similarity between two sets of data, typically used in the
context of text or image analysis. The coefficient ranges from 0 to 1, with 0 indicating no
similarity and 1 indicating perfect similarity. The Sørensen-Dice coefficient is commonly used in
text analysis to compare the similarity between two documents based on the set of words or
terms they contain. It is also used in image analysis to compare the similarity between two
images based on the set of pixels they contain.

The Sørensen-Dice coefficient between two sets, A and B, is calculated as follows:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
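A brief illustrative Python sketch (the function name and sample sets are chosen only for demonstration):

def dice_coefficient(a, b):
    a, b = set(a), set(b)
    # Twice the shared elements over the combined set sizes
    return 2 * len(a & b) / (len(a) + len(b))

# Same toy word sets as the Jaccard example
print(dice_coefficient({"data", "mining", "cluster"}, {"data", "mining", "text"}))  # ≈ 0.67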


Choosing The Appropriate Similarity Measure
Choosing an appropriate similarity measure depends on the nature of the data and the
specific task at hand. Here are some factors to consider when choosing a similarity
measure -

 Different similarity measures are suitable for different data types, such as
continuous or categorical data, text or image data, etc. For example, the
Pearson correlation coefficient is only suitable for continuous variables.
 Some similarity measures are sensitive to the scale of measurement of the data.
 The choice of similarity measure also depends on the specific task at hand. For
example, cosine similarity is often used in information retrieval and text
mining, while Jaccard similarity is commonly used in clustering and
recommendation systems.
 Some similarity measures are more robust to noise and outliers in the data than
others. For example, the Sørensen-Dice coefficient is less sensitive to noise.

Dissimilarity Measures
 Dissimilarity measures are used to quantify the degree of difference or distance
between two objects or data points.

Euclidean Distance
Euclidean distance is a commonly used dissimilarity measure that quantifies the
distance between two points in a multidimensional space. It is named after the ancient
Greek mathematician Euclid, who first studied its properties. The Euclidean distance
between two points X and Y in an n-dimensional space is defined as the square root of the sum of the squared differences between their corresponding coordinates, as shown below:

d(X, Y) = √( (x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)² )

Euclidean distance is commonly used in clustering, classification, and anomaly detection applications in data mining and machine learning. It has the advantage of being easy to interpret and visualize. However, it can be sensitive to the scale of the data and may not perform well when dealing with high-dimensional data or data with outliers.
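As an illustration, the distance can be computed in a couple of lines of Python (the function name and sample points are for demonstration only):

import math

def euclidean_distance(x, y):
    # Square root of the summed squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0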

Manhattan Distance
Manhattan distance, also known as city block distance, is a dissimilarity measure that
quantifies the distance between two points in a multidimensional space. It is named
after the geometric structure of the streets in Manhattan, where the distance between
two points is measured by the number of blocks one has to walk horizontally and
vertically to reach the other point. The Manhattan distance between two
points x and y in an n-dimensional space is defined as the sum of the absolute differences between their corresponding coordinates, as shown below:

d(x, y) = |x₁ − y₁| + |x₂ − y₂| + … + |xₙ − yₙ|

In data mining and machine learning, the Manhattan distance is commonly used in
clustering, classification, and anomaly detection applications. It is particularly useful
when dealing with high-dimensional data, sparse data, or data with outliers, as it is
less sensitive to extreme values than the Euclidean distance. However, it may not be
suitable for data that exhibit complex geometric structures or nonlinear relationships
between features.
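A quick illustrative Python sketch (function name and sample points are for demonstration only):

def manhattan_distance(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(manhattan_distance([1, 2, 3], [4, 6, 3]))  # 7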

Minkowski Distance
Minkowski distance is a generalization of Euclidean distance and Manhattan distance,
which are special cases of Minkowski distance. The Minkowski distance between two
points x and y in an n-dimensional space can be defined as

d(x, y) = ( |x₁ − y₁|^p + |x₂ − y₂|^p + … + |xₙ − yₙ|^p )^(1/p)

where p is a parameter that determines the degree of the Minkowski distance.

When p = 1, the Minkowski distance reduces to the Manhattan distance, and when p = 2, it reduces to the Euclidean distance. When p > 2, it is sometimes referred to as a "higher-order" distance metric.
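An illustrative Python sketch showing how the parameter p recovers the two special cases (names and data are chosen only for demonstration):

def minkowski_distance(x, y, p):
    # p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

a, b = [1, 2, 3], [4, 6, 3]
print(minkowski_distance(a, b, 1))  # 7.0 (Manhattan)
print(minkowski_distance(a, b, 2))  # 5.0 (Euclidean)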

Hamming Distance
Hamming distance is a distance metric used to measure the dissimilarity between two
strings of equal length. It is defined as the number of positions at which the
corresponding symbols in the two strings are different.

For example, consider the strings "101010" and "111000". The Hamming distance
between these two strings is two, since there are two positions at which the
corresponding symbols differ: the second and fifth positions.

Hamming distance is often used in error-correcting codes and cryptography, where it is important to detect and correct errors in data transmission. It is also used in data mining and machine learning applications to compare categorical or binary data, such as DNA sequences or binary feature vectors.
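A short illustrative Python sketch of the count (the function name is an arbitrary choice):

def hamming_distance(s1, s2):
    # Count positions where two equal-length strings differ
    assert len(s1) == len(s2), "strings must have equal length"
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("101010", "111000"))  # 2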
Clustering
The process of making a group of abstract objects into classes of similar
objects is known as clustering.
Points to Remember:
 One group of data objects is treated as a cluster.
 In the process of cluster analysis, the first step is to partition the set of
data into groups based on data similarity, and then the groups are
assigned their respective labels.
 The biggest advantage of clustering over classification is that it can adapt to
changes and helps single out useful features that differentiate
different groups.
Applications of cluster analysis:
 It is widely used in many applications such as image processing, data
analysis, and pattern recognition.
 It helps marketers to find the distinct groups in their customer base and
they can characterize their customer groups by using purchasing patterns.
 It can be used in the field of biology, by deriving animal and plant
taxonomies and identifying genes with the same capabilities.
 It also helps in information discovery by classifying documents on the
web.
Clustering Methods

Clustering methods can be classified into the following categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method

Partitioning Method

Suppose we are given a database of n objects, and the partitioning method constructs k partitions of the data. Each partition will represent a cluster, with k ≤ n. This means that the data is classified into k groups, which satisfy the following requirements −

Each group contains at least one object.

Each object must belong to exactly one group.

Points to remember −

For a given number of partitions (say k), the partitioning method will create an initial
partitioning.

Then it uses the iterative relocation technique to improve the partitioning by moving objects
from one group to another.
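For illustration, the widely used K-Means algorithm is a partitioning method. A minimal sketch with scikit-learn, assuming that library is available (the toy data is purely for demonstration):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D dataset with two obvious groups
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1]])

# Partition the n = 6 objects into k = 2 clusters by iterative relocation
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # e.g. [0 0 0 1 1 1] (label numbering may vary)
print(kmeans.cluster_centers_)  # the two cluster centroids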

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition
is formed. There are two approaches here −

Agglomerative Approach

Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps merging the objects or groups that are
close to one another, and it keeps doing so until all of the groups are merged into one or
until the termination condition holds.
Divisive Approach

This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. In each successive iteration, a cluster is split up into
smaller clusters. This is done until each object is in its own cluster or the termination
condition holds. Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can
never be undone.
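A minimal illustrative sketch of the bottom-up (agglomerative) approach, assuming scikit-learn is available (the toy data and parameter choices are for demonstration only):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2-D dataset
X = np.array([[1, 1], [1.1, 0.9], [5, 5], [5.2, 4.8], [9, 9], [9.1, 8.9]])

# Bottom-up: each point starts as its own cluster, then closest groups are merged
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
print(agg.fit_predict(X))  # e.g. [0 0 1 1 2 2] (label numbering may vary)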

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical
clustering −

Perform careful analysis of object linkages at each hierarchical partitioning.

Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some threshold,
i.e., for each data point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points.
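DBSCAN is a well-known density-based algorithm. A minimal illustrative sketch, assuming scikit-learn is available (toy data and parameters are for demonstration only):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1.0], [0.9, 1.1],
              [6, 6], [6.1, 5.9], [5.9, 6.1],
              [20, 20]])  # the last point is an isolated outlier

# eps is the neighborhood radius, min_samples the density threshold
db = DBSCAN(eps=0.5, min_samples=2)
print(db.fit_predict(X))  # e.g. [0 0 0 1 1 1 -1]; -1 marks noise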

Grid-based Method

In this method, the objects together form a grid. The object space is quantized into a finite
number of cells that form a grid structure.

Advantages

The major advantage of this method is fast processing time.

It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data for a given model. This method locates the clusters by clustering the density function, reflecting the spatial distribution of the data points.

This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
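As an illustration, a Gaussian mixture model is a common model-based approach. A short sketch, assuming scikit-learn is available (toy data for demonstration only):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [7, 7], [7.1, 6.8], [6.9, 7.2]])

# Hypothesize two Gaussian components and fit them to the data
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))  # e.g. [0 0 0 1 1 1]
print(gmm.means_)      # estimated cluster centers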

Constraint-based Method

In this method, the clustering is performed by the incorporation of user- or application-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide us with an interactive way of communicating with the clustering process. Constraints can be specified by the user or by the application requirement.

What is the Evaluation of Clustering?


Evaluation of Clustering is a process that determines the quality and value of
clustering outcomes in data mining and machine learning.

In data mining, to assess how well the data points have been clustered, we
need to choose an appropriate clustering algorithm, set its parameters,
and apply suitable evaluation metrics or techniques.

The main objective of clustering evaluation is to analyze the data with specific objectives to improve performance and provide a better understanding of clustering solutions.
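One widely used internal evaluation metric is the silhouette score. The sketch below is an illustration, assuming scikit-learn is available (toy data for demonstration only):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Scores near +1 indicate compact, well-separated clusters
print(silhouette_score(X, labels))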

Importance of Clustering in Data Mining


The following are some major reasons why Clustering is so important in data
mining:

1. Pattern Discovery

In data mining, with the help of Clustering, we can discover patterns and
connections in data. By grouping similar data points together, the data becomes
simpler to understand and analyze, helping to reveal structure in otherwise
unstructured data.

2. Data Summarization
With the help of Clustering, we can also summarize large data sets into
smaller clusters that are much easier to manage. The data analysis process can
be made simpler by working with clusters rather than individual data points.

3. Anomaly Detection

Clustering helps us identify anomalies and outliers in the data.
Data points that are not part of any cluster, or that form small, unusual
clusters, could indicate errors or unusual events that need to be addressed.

4. Customer Segmentation

Clustering is a technique used in business and marketing to divide customers into different groups according to their behaviour, preferences, or demographics. This segmentation enables the customization of marketing plans and product offerings for particular customer groups.

5. Image and Document Categorization

Clustering is useful for categorizing images and documents. It assists in classifying and organizing texts, images, or documents based on similarities, making it simpler to manage and retrieve information.

6. Recommendation Systems

In data mining, we can use Clustering in e-commerce and content recommendation systems to group similar users and products. With the help of this, recommendation systems can better suggest content that a user is likely to find interesting, based on the preferences of their cluster.

7. Scientific Research

Clustering categorizes scientific data, such as classifying stars in astronomy or identifying genes in bioinformatics. It helps interpret challenging scientific datasets.

8. Data preprocessing

Clustering can be used to reduce the dimensionality and noise in data as a preprocessing step in data mining. The data is streamlined and made ready for additional analysis.

9. Risk Assessment
Using Clustering, we can assess risks and spot fraud in the finance sector. It
also helps in grouping unusual patterns in financial transactions for
further investigation.

In conclusion, Clustering is a flexible and essential data mining technique for organizing, comprehending, and making sense of complex datasets. With the help of this useful tool, we can easily extract important information from data, and its broad applications span fields such as business and marketing, scientific research, and beyond.
