Text Semantic Similarity
I. INTRODUCTION
• Let S1 be "I was given a card by her in the garden" and S2 be "In the garden, she gave
me a card." For semantic analysis, two phrases or sentences are taken as input; the goal is
to classify the pair as similar, somewhat similar, or not similar.
• Next, a set of stop words is defined for the English language.
• After eliminating special characters and punctuation, removing all the stop words, and
lemmatizing, we get S1 = {I, given, card, garden} and S2 = {In, garden, gave, card}.
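The preprocessing step above can be sketched as follows. This is a minimal, self-contained sketch: the stop-word list and regex tokenizer here are hand-picked stand-ins (a real pipeline would typically use a full stop-word corpus and a lemmatizer, e.g. from NLTK), so the exact surviving tokens may differ slightly from the sets shown in the text.

```python
import re

# Hand-picked illustrative stop-word list (an assumption for this sketch;
# a standard English stop-word corpus would normally be used instead).
STOPWORDS = {"i", "was", "a", "by", "her", "me", "the", "she", "in"}

def preprocess(sentence):
    # Strip punctuation/special characters by keeping only alphabetic runs,
    # then drop stop words (case-insensitively). Lemmatization is omitted
    # here to keep the sketch dependency-free.
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [t for t in tokens if t.lower() not in STOPWORDS]

s1 = preprocess("I was given a card by her in the garden")
s2 = preprocess("In the garden, she gave me a card.")
```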
• Only two tokens, {garden, card}, in S1 exactly match tokens in S2, so we remove those
two words (garden and card) from both S1 and S2.
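The exact-match removal step is a straightforward set operation; a sketch using the token sets from the example:

```python
# Tokens that appear in both sentences are dropped before synonym comparison.
s1 = {"I", "given", "card", "garden"}
s2 = {"In", "garden", "gave", "card"}

common = s1 & s2        # tokens shared by both sentences
s1_rest = s1 - common   # tokens of S1 still to be compared
s2_rest = s2 - common   # tokens of S2 still to be compared
```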
• For the remaining lemmatized words, we look up their synonym sets (synsets). We then
compare the first word of S1 with every word of S2, continue this iteratively for each word
of S1, and record the similarity index of each word against the words of S2.
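The iterative comparison above can be sketched as a best-match-then-average loop. In practice the word-level score would come from WordNet synsets (e.g. NLTK's `wordnet.synsets` with Wu-Palmer similarity); here a tiny hand-coded score table stands in so the sketch is self-contained, and the scores in it are illustrative assumptions, not real WordNet values.

```python
# Assumed toy word-similarity scores, for illustration only.
TOY_SIM = {
    ("given", "gave"): 0.9,
    ("I", "In"): 0.1,
}

def word_sim(a, b):
    # Symmetric lookup; identical words score 1.0, unknown pairs 0.0.
    return TOY_SIM.get((a, b)) or TOY_SIM.get((b, a)) or (1.0 if a == b else 0.0)

def sentence_similarity(s1_tokens, s2_tokens):
    # For each word of S1, take its best match among the words of S2,
    # then average those best scores (the mean described in the text).
    best = [max(word_sim(w1, w2) for w2 in s2_tokens) for w1 in s1_tokens]
    return sum(best) / len(best)

score = sentence_similarity(["I", "given"], ["In", "gave"])
```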
• We take the mean of the computed similarity indexes; this mean serves as the overall
semantic similarity score for the sentence pair.
• If the similarity index is less than 0.65, the sentences are labelled 'Not Similar'; if it is
between 0.65 and 0.8, they are labelled 'Somewhat Similar'; and if it is more than 0.8,
they are labelled 'Similar'.
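The thresholding rule maps directly to a small labelling function, using the cut-offs stated above:

```python
def label(similarity):
    # Thresholds from the text: < 0.65 -> Not Similar,
    # 0.65 to 0.8 -> Somewhat Similar, > 0.8 -> Similar.
    if similarity < 0.65:
        return "Not Similar"
    if similarity <= 0.8:
        return "Somewhat Similar"
    return "Similar"
```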
V. FURTHER ENHANCEMENT
• Semantic similarity has been computed for sentences and phrases. Handling paragraphs
and longer texts, however, will need more complex algorithms for separating the sentences
and finding their semantic similarity.
• This is done on strings entered directly in the IDE. For text contained in images, image
processing techniques would be needed along with natural language processing.
• We find similarity word by word, so we may get false positives and false negatives.
• Our implementation does not account for spelling variations. To handle them, the Longest
Common Subsequence (LCS) algorithm can be used.
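The LCS idea mentioned above can be sketched with the classic dynamic-programming table; the ratio-based `spell_sim` helper is an assumed way of turning the LCS length into a spelling-tolerance score, not part of the original implementation:

```python
def lcs_length(a, b):
    # Classic dynamic programming: dp[i][j] is the LCS length of the
    # first i characters of a and the first j characters of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def spell_sim(w1, w2):
    # Assumed scoring: LCS length relative to the longer word, so
    # misspellings like "gardin" still score close to "garden".
    return lcs_length(w1, w2) / max(len(w1), len(w2))
```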
PROJECT BY:
Saathvik R Khatavkar(1RN16EC122)
Shiv Swarup(1RN16EC137)
Shreyas M S(1RN16EC140)
Varun H H(1RN16EC167)
THANK YOU