How to Calculate Jaccard Similarity in Python
Last Updated: 24 Apr, 2025
In data science, measuring the similarity between two sets is a common task. Jaccard similarity is one of the most widely used similarity measures in machine learning, natural language processing, and recommendation systems. This article explains what Jaccard similarity is, why it is important, and how to compute it with Python.
What is Jaccard Similarity?
Jaccard similarity, also known as the Jaccard index, is a statistic that measures the similarity between two sets. It is the size of the intersection of the two sets divided by the size of their union.
Given two sets A and B, their Jaccard similarity is:
\text{Jaccard Similarity } J(A, B) = \frac{|A \cap B|}{|A \cup B|}
Where:
- |A \cap B| is the cardinality (size) of the intersection of sets A and B.
- |A \cup B| is the cardinality (size) of the union of sets A and B.
The Jaccard index is also called the Jaccard coefficient. Its value lies between 0 and 1: 0 means the two sets share no elements, 1 means the sets are identical, and values closer to 1 indicate greater similarity.
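The two ends of this range can be checked directly. The sketch below is only an illustration: `jaccard` is a hypothetical helper (not a library function), and treating two empty sets as identical is an assumption, since the formula itself gives 0/0 in that case.

```python
def jaccard(a, b):
    """Hypothetical helper: |a ∩ b| / |a ∪ b| for two Python sets."""
    union = a.union(b)
    if not union:
        # Assumption: two empty sets count as identical (the formula is 0/0 here).
        return 1.0
    return len(a.intersection(b)) / len(union)


identical = jaccard({1, 2, 3}, {1, 2, 3})  # every element shared
disjoint = jaccard({1, 2, 3}, {4, 5, 6})   # no element shared
partial = jaccard({1, 2, 3}, {2, 3, 4})    # 2 shared out of 4 distinct elements

print(identical, disjoint, partial)  # 1.0 0.0 0.5
```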
Computing Jaccard Similarity
Example 1:
Python
A = {1, 2, 3, 4, 6}
B = {1, 2, 5, 8, 9}
# Intersection and union of two sets can also be computed with the & and | operators.
C = A.intersection(B)
D = A.union(B)
print('AnB =', C)
print('AUB =', D)
print('J(A,B) =', len(C) / len(D))
Output:
AnB = {1, 2}
AUB = {1, 2, 3, 4, 5, 6, 8, 9}
J(A,B) = 0.25
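The comment in the code mentions the `&` and `|` operators. As a brief sketch of that alternative, using the same two sets (the variable names here are chosen only for illustration):

```python
A = {1, 2, 3, 4, 6}
B = {1, 2, 5, 8, 9}

intersection = A & B  # same result as A.intersection(B)
union = A | B         # same result as A.union(B)

j_ab = len(intersection) / len(union)
print(j_ab)  # 0.25
```

Note that the operators require both operands to be sets, whereas the methods also accept any iterable, such as a list.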
Example 2:
Jaccard similarity can also be used to compare two texts, each represented as a set of unique terms.
Python3
def jaccard_similarity(set1, set2):
    # Size of the intersection of the two sets
    intersection = len(set1.intersection(set2))
    # Size of the union of the two sets
    union = len(set1.union(set2))
    return intersection / union


# The repeated "Geeks" is stored only once because sets hold unique elements.
set_a = {"Geeks", "for", "Geeks", "NLP", "DSc"}
set_b = {"Geek", "for", "Geeks", "DSc.", "ML", "DSA"}
similarity = jaccard_similarity(set_a, set_b)
print("Jaccard Similarity:", similarity)
Output:
Jaccard Similarity: 0.25
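In practice the word sets usually come from raw text. The following is a minimal sketch of that step, assuming a deliberately naive tokenizer (lowercasing and splitting on whitespace); the `tokens` helper and the sample sentences are hypothetical, and real pipelines typically also strip punctuation and stop words.

```python
def tokens(text):
    """Naive tokenizer (assumption): lowercase, split on whitespace, keep unique words."""
    return set(text.lower().split())


def jaccard_similarity(set1, set2):
    union = set1 | set2
    # Assumption: two empty sets are treated as identical to avoid dividing by zero.
    return len(set1 & set2) / len(union) if union else 1.0


doc_a = "the cat sat on the mat"
doc_b = "the cat lay on the rug"

# Shared words: the, cat, on (3); distinct words overall: 7
score = jaccard_similarity(tokens(doc_a), tokens(doc_b))
print(score)
```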
Significance of Jaccard Similarity
The Jaccard similarity is especially effective when the order of items is irrelevant and only the presence or absence of elements is examined. It is extensively used in:
- Text Analysis: Jaccard similarity can be used in natural language processing to compare texts, text samples, or even individual words.
- Recommendation Systems: Jaccard similarity can help in finding similar items or products based on user behavior.
- Data Deduplication: Jaccard similarity can be used to find duplicate or near-duplicate records in a dataset.
- Social Network Analysis: Jaccard similarity can be used in social networks to detect similarities between user profiles or groups.
- Genomics: Jaccard similarity is employed to compare gene sets in biological studies.
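As one illustration of the deduplication use case above, the sketch below flags pairs of records whose Jaccard similarity reaches a cut-off. The records, the `jaccard` helper, and the `THRESHOLD` value are all hypothetical and chosen only to make the example small.

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


# Hypothetical records, already tokenized into sets of words.
records = {
    "r1": {"acme", "corp", "new", "york"},
    "r2": {"acme", "corp", "new", "york", "ny"},
    "r3": {"globex", "inc", "springfield"},
}

THRESHOLD = 0.7  # illustrative cut-off, not a recommended value

# Compare every pair of records once and keep the pairs at or above the cut-off.
near_duplicates = [
    (x, y)
    for x, y in combinations(sorted(records), 2)
    if jaccard(records[x], records[y]) >= THRESHOLD
]
print(near_duplicates)  # [('r1', 'r2')]
```

Comparing every pair is quadratic in the number of records, so this approach is only practical for small collections.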
Jaccard Distance
The Jaccard distance measures how different two sets are, whereas the Jaccard coefficient measures how similar they are. It is computed by subtracting the Jaccard coefficient from one or, equivalently, by dividing the difference between the sizes of the union and the intersection by the size of the union.
\begin{aligned}
\text{Jaccard Distance } J_D(A, B) &= 1 - \text{Jaccard Similarity } J(A, B)
\\&= 1 - \frac{|A \cap B|}{|A \cup B|}
\\&= \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
\\&= \frac{|A \triangle B|}{|A \cup B|}
\end{aligned}
Where:
- |A \cap B| is the cardinality (size) of the intersection of sets A and B.
- |A \cup B| is the cardinality (size) of the union of sets A and B.
- |A \triangle B| is the cardinality (size) of the symmetric difference of sets A and B, which contains the elements that are in either set but not in their intersection.
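The equivalence between the first and last forms of the distance can be checked numerically. This is a small sketch using the sets from the first similarity example; `Fraction` is used only so the comparison is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction

A = {1, 2, 3, 4, 6}
B = {1, 2, 5, 8, 9}

union_size = len(A | B)                          # 8
similarity = Fraction(len(A & B), union_size)    # 2/8 = 1/4
via_complement = 1 - similarity                  # 1 - J(A, B)
via_sym_diff = Fraction(len(A ^ B), union_size)  # |A △ B| / |A ∪ B| = 6/8

print(via_complement, via_sym_diff)  # 3/4 3/4
```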
The Jaccard distance is often used to build an n × n distance matrix for clustering and multidimensional scaling of n sample sets. It is a metric on the collection of all finite sets.
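A pairwise distance matrix of this kind can be sketched in plain Python as follows. The sample sets are hypothetical, and treating two empty sets as being at distance 0 is an assumption.

```python
def jaccard_distance(a, b):
    union = a | b
    # Assumption: two empty sets are at distance 0.
    return len(a ^ b) / len(union) if union else 0.0


# Hypothetical sample sets.
samples = [
    {"a", "b", "c"},
    {"b", "c", "d"},
    {"x", "y"},
]

n = len(samples)
# matrix[i][j] holds the Jaccard distance between samples[i] and samples[j].
matrix = [
    [jaccard_distance(samples[i], samples[j]) for j in range(n)]
    for i in range(n)
]

for row in matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, as expected for a metric, so only the upper triangle needs to be computed when n is large.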
Example 1:
Python3
def jaccard_distance(set1, set2):
    # Symmetric difference of the two sets
    symmetric_difference = set1.symmetric_difference(set2)
    # Union of the two sets
    union = set1.union(set2)
    return len(symmetric_difference) / len(union)


set_a = {"Geeks", "for", "Geeks", "NLP", "DSc"}
set_b = {"Geek", "for", "Geeks", "DSc.", "ML", "DSA"}
distance = jaccard_distance(set_a, set_b)
print("Jaccard distance:", distance)
Output:
Jaccard distance: 0.75
Example 2:
Suppose two people, A and B, shop in a department store that stocks five items. Let the binary vectors A = [1, 1, 1, 0, 1] and B = [1, 1, 0, 0, 1] record whether each item was picked (1) or not (0). The Jaccard score then measures the overlap between the items they bought, and the Jaccard distance, calculated as 1 minus the Jaccard score, measures their dissimilarity:
Python
import numpy as np
from sklearn.metrics import jaccard_score

# Items picked by person A, passed to jaccard_score as the predicted labels
y_pred = np.array([1, 1, 1, 0, 1])
# Items picked by person B, passed as the true labels
y_true = np.array([1, 1, 0, 0, 1])
# Jaccard index of the positive class (label 1)
jaccard_index = jaccard_score(y_true, y_pred)
# Jaccard distance is 1 minus the Jaccard index
jaccard_distance = 1 - jaccard_index
print("Jaccard Index:", jaccard_index)
print("Jaccard Distance:", jaccard_distance)
Output:
Jaccard Index: 0.75
Jaccard Distance: 0.25
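The same numbers can be reproduced without scikit-learn by converting each binary vector into the set of positions where it equals 1. This is a minimal cross-check; the variable names are chosen only for illustration.

```python
a = [1, 1, 1, 0, 1]  # items picked by person A
b = [1, 1, 0, 0, 1]  # items picked by person B

# Positions (item numbers) where each person picked the item.
picked_a = {i for i, flag in enumerate(a) if flag}
picked_b = {i for i, flag in enumerate(b) if flag}

index = len(picked_a & picked_b) / len(picked_a | picked_b)  # 3 shared / 4 picked by either
distance = 1 - index
print(index, distance)  # 0.75 0.25
```

Items that neither person picked (the shared 0 in the fourth position) do not affect the result, which is what distinguishes the Jaccard index from simple matching accuracy.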
Conclusion
The Jaccard similarity coefficient is a useful tool for measuring the similarity of sets, with applications ranging from text analysis to recommendation systems and data deduplication. With the formula in hand and Python's built-in set operations, it can be computed in only a few lines of code.