Measuring the Document Similarity in Python
Last Updated: 27 Feb, 2020
Document similarity, as the name suggests, measures how similar two given documents are. By "document" we mean a collection of strings, for example an essay or a .txt file. Many organizations use document similarity to check for plagiarism, and many examination bodies use it to check whether one student copied from another. It is therefore both useful and interesting to know how this works.

Document similarity is computed through document distance. The idea is to treat each document as a vector whose components are the frequencies of the words it contains; the document distance is then the angle between the two document vectors. Let's see an example:
Say that we are given two documents, D1 and D2:
D1: "This is a geek"
D2: "This was a geek thing"
The words common to both documents are:
"This", "a", "geek"
If we represent D1 and D2 as vectors in a space whose axes are the distinct words, the angle between the two vectors tells us how close the documents are.
Now, taking the dot product of D1 and D2 (a word contributes only if it appears in both documents):
D1.D2 = (this x this) + (is x 0) + (a x a) + (geek x geek) + (0 x was) + (0 x thing)
D1.D2 = 1 + 0 + 1 + 1 + 0 + 0
D1.D2 = 3
Now that we know how to calculate the dot product of these documents, we can calculate the angle between the document vectors:
cos d = D1.D2 / (|D1| |D2|)
Here d is the document distance. Its value ranges from 0 to 90 degrees, where 0 degrees means the two documents have exactly the same word distribution and 90 degrees means they have no words in common.
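Plugging in the numbers for our two example documents:
|D1| = sqrt(1 + 1 + 1 + 1) = 2
|D2| = sqrt(1 + 1 + 1 + 1 + 1) = sqrt(5) ≈ 2.236
cos d = 3 / (2 x 2.236) ≈ 0.671
d = arccos(0.671) ≈ 0.835 radians (about 48 degrees)
This is the same distance the program below reports when the two sentences are stored as text files.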
Now that we know about document similarity and document distance, let's look at a Python program to calculate the same:
Document similarity program:
Our algorithm to compute document similarity consists of three fundamental steps:
- Split the documents into words.
- Compute the word frequencies.
- Calculate the dot product of the document vectors.
For the first step, we open each file and read its contents with the
.read()
method, then split the text into a list of lowercase words. Next, we compute the word frequencies for the file we read in, counting how many times each word occurs.
Python3
import math
import string
import sys

# Reads the text file and returns its
# full contents as a single string.
def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
            return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

# The translation table is a global variable
# mapping upper case to lower case and
# punctuation to spaces.
translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                  " " * len(string.punctuation) + string.ascii_lowercase)

# Splits the text into words and returns
# a list of the (lowercased) words in the file.
def get_words_from_line_list(text):
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list
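As a quick sanity check (the sample string here is just an illustration, not read from a file), the function lowercases the text, turns punctuation into spaces and splits on whitespace:
Python3
print(get_words_from_line_list("This is a Geek!"))
# ['this', 'is', 'a', 'geek']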
Now that we have the word list, we calculate the frequency of occurrence of each word.
Python3
# Counts the frequency of each word and
# returns a dictionary mapping each
# word to its frequency.
def count_frequency(word_list):
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D

# Returns the dictionary of (word, frequency)
# pairs for the given file.
def word_frequencies_for_file(filename):
    text = read_file(filename)
    word_list = get_words_from_line_list(text)
    freq_mapping = count_frequency(word_list)
    print("File", filename, ":")
    print(len(text), "characters,")
    print(len(word_list), "words,")
    print(len(freq_mapping), "distinct words")
    return freq_mapping
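As an aside, the standard library's collections.Counter performs the same counting; a minimal sketch that could replace count_frequency above:
Python3
from collections import Counter

def count_frequency(word_list):
    # Counter maps each word to the number of times it occurs
    return dict(Counter(word_list))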
Lastly, we will calculate the dot product to give the document distance.
Python3
# Returns the dot product of two
# document (frequency) vectors.
def dotProduct(D1, D2):
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum

# Returns the angle in radians
# between the document vectors.
def vector_angle(D1, D2):
    numerator = dotProduct(D1, D2)
    denominator = math.sqrt(dotProduct(D1, D1) * dotProduct(D2, D2))
    return math.acos(numerator / denominator)
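As a quick check, building the frequency dictionaries of the two example sentences by hand (assuming the functions above are already defined) reproduces the angle we computed earlier:
Python3
D1 = {'this': 1, 'is': 1, 'a': 1, 'geek': 1}
D2 = {'this': 1, 'was': 1, 'a': 1, 'geek': 1, 'thing': 1}
print("%0.6f" % vector_angle(D1, D2))   # 0.835482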
That's all! Time to see the document similarity function:
Python3
def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    word_freqs_1 = word_frequencies_for_file(filename_1)
    word_freqs_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(word_freqs_1, word_freqs_2)
    print("The distance between the documents is: %0.6f (radians)" % distance)
Here is the full source code.
Python3
import math
import string
import sys

# Reads the text file and returns its
# full contents as a single string.
def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
            return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

# The translation table is a global variable
# mapping upper case to lower case and
# punctuation to spaces.
translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                  " " * len(string.punctuation) + string.ascii_lowercase)

# Splits the text into words and returns
# a list of the (lowercased) words in the file.
def get_words_from_line_list(text):
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list

# Counts the frequency of each word and
# returns a dictionary mapping each
# word to its frequency.
def count_frequency(word_list):
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D

# Returns the dictionary of (word, frequency)
# pairs for the given file.
def word_frequencies_for_file(filename):
    text = read_file(filename)
    word_list = get_words_from_line_list(text)
    freq_mapping = count_frequency(word_list)
    print("File", filename, ":")
    print(len(text), "characters,")
    print(len(word_list), "words,")
    print(len(freq_mapping), "distinct words")
    return freq_mapping

# Returns the dot product of two
# document (frequency) vectors.
def dotProduct(D1, D2):
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum

# Returns the angle in radians
# between the document vectors.
def vector_angle(D1, D2):
    numerator = dotProduct(D1, D2)
    denominator = math.sqrt(dotProduct(D1, D1) * dotProduct(D2, D2))
    return math.acos(numerator / denominator)

def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    word_freqs_1 = word_frequencies_for_file(filename_1)
    word_freqs_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(word_freqs_1, word_freqs_2)
    print("The distance between the documents is: %0.6f (radians)" % distance)

# Driver code
documentSimilarity('GFG.txt', 'file.txt')
Output:
File GFG.txt :
15 characters,
4 words,
4 distinct words
File file.txt :
22 characters,
5 words,
5 distinct words
The distance between the documents is: 0.835482 (radians)
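The sample output above is consistent with the two files containing the example sentences from the beginning; the file names GFG.txt and file.txt simply match the driver code, and the files could be created like this (the exact file contents are an assumption inferred from the reported counts):
Python3
# 14 characters plus a newline -> 15 characters, 4 words
with open('GFG.txt', 'w') as f:
    f.write("This is a geek\n")

# 21 characters plus a newline -> 22 characters, 5 words
with open('file.txt', 'w') as f:
    f.write("This was a geek thing\n")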