
TEXT SEMANTIC SIMILARITY

 
I. INTRODUCTION

• A fundamental challenge in natural language processing, and in applications such as plagiarism-checking software, is determining the meaning of text. Semantic similarity matters for many purposes: plagiarism checking, information retrieval, and enabling machines to answer questions. For machines, however, judging semantic similarity is difficult.
• With advances in Machine Learning, machines are getting better at text semantics, and various algorithms have been proposed. Semantic analysis is not about teaching the machine; it is about getting it to learn. Existing methods have focused on either large documents or individual words. Deep learning models such as Recursive Neural Networks and Recurrent Neural Networks have also emerged as approaches to text semantics.
II. NATURAL LANGUAGE PROCESSING
Natural Language Processing is a wide domain covering concepts from Computer Science, Artificial Intelligence, and Machine Learning. It is used to analyse written text and spoken language. One application of NLP is semantic analysis: understanding the meaning of text.
III. CORPUS-BASED APPROACH

• This approach uses semantically annotated corpora to train machine learning algorithms to decide which word sense applies in which context. Corpus-based methods are supervised learning approaches, in which the algorithms are trained on the annotated data. The corpus and lexical resource used here is WordNet.
TOKENIZING
• Tokenizing splits sentences and words out of a body of text. Words are assumed to be separated by spaces, and punctuation marks (, . ! ? etc.) are counted as separate tokens, as in the sketch below.
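A minimal sketch using nltk's word_tokenize; the sentence is taken from the Results section, and the one-time nltk.download call fetches the tokenizer models:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)  # tokenizer models (one-time download)

tokens = word_tokenize("In the garden, she gave me a card.")
print(tokens)
# ['In', 'the', 'garden', ',', 'she', 'gave', 'me', 'a', 'card', '.']
```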
STOP WORDS
• Stop words are extremely common words that appear to be of little value in selecting documents matching a user's need, so they are excluded from the vocabulary entirely. Stop words can be filtered out of the text before processing; the nltk module provides a list of them, used in the sketch below.
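A sketch of stop-word removal with nltk's English list. The filter is kept case-sensitive here so that the output matches the token sets shown in the Results section; that choice is an assumption about the project's code, not something the slides state:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords', quiet=True)  # English stop-word list (one-time)

stop_words = set(stopwords.words('english'))  # all lowercase entries
tokens = word_tokenize("I was given a card by her in the garden")
filtered = [t for t in tokens if t not in stop_words]  # case-sensitive: 'I' survives
print(filtered)  # ['I', 'given', 'card', 'garden']
```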
LEMMATIZING
• Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Confronted with the token 'saw', lemmatization would return either see or saw depending on whether the token was used as a verb or a noun. The lemmatizer takes a part-of-speech parameter, pos; if it is not supplied, the default is noun, so an attempt is made to find the closest noun form of the word (see the sketch below).
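A minimal sketch with nltk's WordNetLemmatizer, reproducing the 'saw' example above:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # WordNet data (one-time download)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("saw", pos="v"))  # 'see'  (verb reading)
print(lemmatizer.lemmatize("saw"))           # 'saw'  (pos defaults to noun)
print(lemmatizer.lemmatize("cards"))         # 'card' (plural noun collapsed)
```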
SYNSETS
• WordNet is a lexical database for the English language and is part of the NLTK corpus. We can use WordNet alongside the NLTK module to find the meanings of words, synonyms, antonyms, and more. WordNet can also be used to compare the similarity of two words and their tenses, using the Wu and Palmer method for semantic relatedness, as sketched below.
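A sketch of looking up synsets and scoring Wu-Palmer similarity; the ship/boat pair is a standard NLTK illustration, not an example from the project:

```python
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)

# A synset bundles one sense of a word with its definition and synonym lemmas.
syn = wordnet.synsets("card")[0]
print(syn.name(), "-", syn.definition())
print([lemma.name() for lemma in syn.lemmas()])  # synonyms of this sense

# Wu-Palmer similarity between two synsets, a score in (0, 1].
ship = wordnet.synset('ship.n.01')
boat = wordnet.synset('boat.n.01')
print(ship.wup_similarity(boat))  # ~0.91
```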
IV. RESULTS

• Let S1 be “I was given a card by her in the garden” and S2 be “In the garden, she gave me a card.” For semantic analysis, two phrases/sentences are taken and classified as similar, somewhat similar, or not similar.
• Next, the set of stop words for the English language is defined.
• After eliminating the special characters and punctuation, removing all the stop words, and lemmatizing, we get S1 = {I, given, card, garden} and S2 = {In, garden, gave, card}.
• Only two tokens, {garden, card}, in S1 exactly match tokens in S2, so we remove those two words from both S1 and S2.
• For the remaining words, we look up the synonym sets (synsets) of the lemmatized words. We then compare the first word of S1 with all the words of S2, continue iteratively, and record the similarity index of each S1 word against the words of S2.
• We take the mean of the computed similarity indexes, and this mean serves as the semantic similarity of the two sentences.
• If the similarity index is less than 0.65, the sentences are labelled 'Not Similar'; if it is between 0.65 and 0.8, 'Somewhat Similar'; and above 0.8, 'Similar'. A minimal sketch of the whole pipeline follows.
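An end-to-end sketch of the pipeline described above. The helper names, the case-sensitive stop-word filter, and the exact scoring scheme (exact matches score 1.0, remaining S1 words take their best Wu-Palmer score against S2) are assumptions for illustration, not the project's actual code:

```python
import string
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess(sentence):
    """Tokenize, drop punctuation and stop words, lemmatize the rest."""
    tokens = word_tokenize(sentence)
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

def similarity_index(s1, s2):
    t1, t2 = preprocess(s1), preprocess(s2)
    common = set(t1) & set(t2)               # exact matches score 1.0 ...
    t1 = [t for t in t1 if t not in common]
    t2 = [t for t in t2 if t not in common]  # ... and leave both token lists
    scores = [1.0] * len(common)
    for w1 in t1:  # best Wu-Palmer score of w1 against any remaining word of s2
        best = 0.0
        for w2 in t2:
            for syn1 in wordnet.synsets(w1):
                for syn2 in wordnet.synsets(w2):
                    best = max(best, syn1.wup_similarity(syn2) or 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

def label(score):
    if score < 0.65:
        return 'Not Similar'
    return 'Somewhat Similar' if score <= 0.8 else 'Similar'

s1 = "I was given a card by her in the garden"
s2 = "In the garden, she gave me a card."
score = similarity_index(s1, s2)
print(round(score, 3), '->', label(score))
```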
V. FURTHER ENHANCEMENT

• Semantic similarity has been computed here for sentences and phrases. However, paragraphs and short texts will need more complex algorithms for separating the sentences and combining their semantic similarities.
• This is done on strings entered in the IDE. For text in images, we would need image processing techniques along with natural language processing.
• We find similarity word by word, so we may get false positives and false negatives.
• Our implementation does not consider spelling variations. To handle those, the Longest Common Subsequence (LCS) algorithm can be used, as sketched below.
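A sketch of the classic dynamic-programming LCS, plus a hypothetical spelling_score helper that turns the LCS length into a crude misspelling-tolerance score; how the project would apply such a score is an assumption:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b))."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def spelling_score(w1: str, w2: str) -> float:
    """LCS length relative to the longer word: 1.0 means identical."""
    return lcs_length(w1, w2) / max(len(w1), len(w2))

print(spelling_score("garden", "gardn"))  # ~0.83: likely the same word, misspelled
```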
PROJECT BY:
Saathvik R Khatavkar(1RN16EC122)
Shiv Swarup(1RN16EC137)
Shreyas M S(1RN16EC140)
Varun H H(1RN16EC167)
THANK YOU
