Sample Paper Questions - NLP (Part 2)
Q2 What are the types of data used for Natural Language Processing applications?
Natural Language Processing takes in natural-language data in the form of the written and spoken words
which humans use in their daily lives, and operates on this data.
Q3 While working with NLP what is the meaning of Syntax and Semantics?
Syntax: Syntax refers to the grammatical structure of a sentence.
Semantics: It refers to the meaning of the sentence.
Different semantics, same syntax: 2/3 (Python 2.7) ≠ 2/3 (Python 3). Here the statements written have the
same syntax but their meanings are different. In Python 2.7, 2/3 is integer division and would result in 0,
while in Python 3 it is true division and would give an output of about 0.67.
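As a quick illustration, a two-line Python 3 snippet shows both behaviours side by side (// reproduces the
old integer-division result):

    # Python 3: / is true division, // is floor (integer) division.
    print(2 / 3)   # 0.6666666666666666 (Python 3 semantics)
    print(2 // 3)  # 0 (what 2/3 evaluated to for ints in Python 2.7)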
Multiple meanings of a word – In natural language, a word can have multiple meanings, and the meaning
that fits the statement depends on its context.
Example - His face turns red after consuming the medicine.
Meaning - Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?
Perfect syntax, no meaning – Sometimes, a statement can have perfectly correct syntax but it does not
mean anything.
Example - Chickens feed extravagantly while the moon drinks tea.
This statement is grammatically correct but does not make any sense. In human language, a perfect
balance of syntax and semantics is important for better understanding.
Q5 What is a Chatbot?
A chatbot is a computer program designed to simulate human conversation through voice commands,
text chats, or both, e.g. Mitsuku Bot, Jabberwacky, etc.
A chatbot is a computer program that can learn over time how to best interact with humans. It can answer
questions and troubleshoot customer problems, evaluate and qualify prospects, generate sales leads and
increase sales on an ecommerce site.
A chatbot is also known as an artificial conversational entity (ACE), chat robot, talk bot, chatterbot or
chatterbox.
Q6 Mention Examples of Chatbots.
Mitsuku Bot, CleverBot, Jabberwacky, Haptik, Rose, Ochatbot
As we know, the language of computers is numerical, so the very first step that comes to mind is to
convert our language into numbers.
This conversion takes a few steps. The first of them is Text Normalization. Since human languages are
complex, we need to simplify them first in order to make understanding possible. Text Normalization helps
clean up the textual data in such a way that its complexity comes down to a level lower than that of the
actual data.
Q14 What are the steps of text Normalization? Explain them in brief.
In Text Normalization, we undergo several steps to normalize the text to a lower level (a short code sketch follows these steps).
1. Sentence Segmentation - Under sentence segmentation, the whole corpus is divided into sentences. Each
sentence is taken as a different data so now the whole corpus gets reduced to sentences.
2. Tokenisation – After segmenting the sentences, each sentence is then further divided into tokens. Token
is the term used for any word, number or special character occurring in a sentence. Under tokenisation,
every word, number and special character is considered separately, and each of them is now a separate
token.
3. Removing Stop words, Special Characters and Numbers – In this step, the tokens which are not
necessary are removed from the token list.
4. Converting text to a common case – After the stop words removal, we convert the whole text into a
similar case, preferably lower case. This ensures that the machine's case-sensitivity does not make it treat
the same words as different just because of different cases.
5. Stemming: In this step, the remaining words are reduced to their root words. In other words, stemming is
the process in which the affixes of words are removed and the words are converted to their base form. In
stemming, the stemmed words (the words we get after removing the affixes) may or may not be
meaningful.
6. Lemmatization: In lemmatization, the word we get after affix removal (known as the lemma) is a
meaningful one. Lemmatization makes sure that the lemma is a word with meaning, and hence it takes
longer to execute than stemming.
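A rough sketch of these steps in Python with NLTK (assuming the library and its 'punkt' and 'wordnet'
data are installed; step 3, stop word removal, is shown under Q18 below):

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # First run may need: nltk.download('punkt'); nltk.download('wordnet')
    corpus = "Raj and Vijay are best friends. They play together with other friends."

    sentences = sent_tokenize(corpus)               # 1. sentence segmentation
    tokens = [word_tokenize(s) for s in sentences]  # 2. tokenisation
    tokens = [[t.lower() for t in sent] for sent in tokens]  # 4. common (lower) case

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    stems = [[stemmer.stem(t) for t in sent] for sent in tokens]           # 5. stemming
    lemmas = [[lemmatizer.lemmatize(t) for t in sent] for sent in tokens]  # 6. lemmatization
    print(sentences, stems, lemmas, sep="\n")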
Q15 What is the importance of converting the text into a common case?
In Text Normalization, we undergo several steps to normalize the text to a lower level. After the removal
of stop words, we convert the whole text into a similar case, preferably lower case. This ensures that the
machine's case-sensitivity does not make it treat the same words as different just because of different cases.
Sentence Segmentation – the corpus "Raj and Vijay are best friends. They play together with other friends." is divided into sentences:
1. Raj and Vijay are best friends.
2. They play together with other friends.
Q18 What is meant by Removing Stop words, Special Characters and Numbers?
Stop words are the words which occur very frequently in the corpus but do not add any value to it. Humans use
grammar to make their sentences meaningful for the other person to understand, but grammatical words do not
add any essence to the information which is to be transmitted through the statement; hence they come under stop
words.
In this step, all the stop words, special characters like #, $, %, @, ! and numbers which are not necessary are
removed from the list of tokens to make it easier for the NLP system to focus on the words that are important for
data processing.
It is possible to remove stop words using Natural Language Toolkit (NLTK), a suite of libraries and programs for
symbolic and statistical natural language processing.
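A minimal sketch of this step with NLTK (assuming the 'stopwords' and 'punkt' data have been downloaded):

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # First run may need: nltk.download('stopwords'); nltk.download('punkt')
    tokens = word_tokenize("Raj and Vijay are best friends.".lower())
    stop_words = set(stopwords.words("english"))

    # Keep only alphabetic tokens (drops '.') that are not stop words
    filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
    print(filtered)  # ['raj', 'vijay', 'best', 'friends']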
Q22 Create the stemming and lemmatization of the words "running", "runner" and "runs".
After stemming, they would all be reduced to the common stem "run".
Stemming: run, run, run
Lemmatization: run, runner, run (each lemma is a meaningful word; "runner" is already its own base form)
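This can be checked with NLTK (a sketch; note that real stemmers differ slightly from the textbook
answer, e.g. the Porter stemmer leaves "runner" unchanged):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # First run may need: nltk.download('wordnet')
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    for word in ["running", "runner", "runs"]:
        # pos="v" asks the lemmatizer to treat the word as a verb
        print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
    # running -> run / run; runner -> runner / runner; runs -> run / run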
Tokenization
Sentence: This is my first experience of Text Mining.
Tokens: This | is | my | first | experience | of | Text | Mining | .
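For reference, NLTK's word tokenizer produces exactly these tokens (assuming the 'punkt' data is available):

    from nltk.tokenize import word_tokenize

    # First run may need: nltk.download('punkt')
    print(word_tokenize("This is my first experience of Text Mining."))
    # ['This', 'is', 'my', 'first', 'experience', 'of', 'Text', 'Mining', '.']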
Q24 Identify any two stop words which should not be removed from the given sentence and why?
Sentence: Get help and support whether you're shopping now or need help with a past purchase.
Q29 Which package is used for Natural Language Processing in Python programming?
Natural Language Toolkit (NLTK). NLTK is one of the leading platforms for building Python programs that can
work with human language data.
Q31 Through a step-by-step process, calculate TFIDF for the given corpus and mention the word(s) having
highest value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be found from the
document vector table, as that table records the frequency of each word of the vocabulary in each document.
        We   are  going  to   Mumbai  is   a    famous  place  I    am   in
Doc 1   1    1    1      1    1       0    0    0       0      0    0    0
Doc 2   0    0    0      0    1       1    1    1       1      0    0    0
Doc 3   1    1    1      1    0       0    1    1       1      0    0    0
Doc 4   0    0    0      0    1       0    0    1       0      1    1    1
Document Frequency is the number of documents in which the word occurs irrespective of how many times it
has occurred in those documents. The document frequency for the exemplar vocabulary would be:
        We   are  going  to   Mumbai  is   a    famous  place  I    am   in
DF      2    2    2      2    3       1    2    3       2      1    1    1
For inverse document frequency, we put the document frequency in the denominator while the total
number of documents is the numerator. Here, the total number of documents is 4, hence the inverse
document frequency becomes:
        We   are  going  to   Mumbai  is   a    famous  place  I    am   in
IDF     4/2  4/2  4/2    4/2  4/3     4/1  4/2  4/3     4/2    4/1  4/1  4/1
Finally, TFIDF(W) = TF(W) × log(N/DF(W)), i.e. each word's term frequency in a document is multiplied
by the log of its inverse document frequency. Every term frequency here is 1, so the words that occur in
only one document ("is", "I", "am" and "in") get the highest value: 1 × log(4/1) = log 4 ≈ 0.602. Hence
"is", "I", "am" and "in" are the words with the highest TFIDF value.
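The whole calculation can be reproduced with a short Python sketch, using the textbook formula
TFIDF(W) = TF(W) × log10(N / DF(W)) (log base 10 assumed; punctuation is taken as already removed
during normalization):

    import math

    docs = [
        "We are going to Mumbai",
        "Mumbai is a famous place",
        "We are going to a famous place",
        "I am famous in Mumbai",
    ]
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenised for w in doc})

    # Document frequency: in how many documents does each word occur?
    df = {w: sum(w in doc for doc in tokenised) for w in vocab}

    n = len(docs)
    for i, doc in enumerate(tokenised, start=1):
        for w in sorted(set(doc)):
            tfidf = doc.count(w) * math.log10(n / df[w])  # TF x log(N/DF)
            print(f"Doc {i}  {w}: {tfidf:.3f}")
    # "is", "i", "am" and "in" score highest: 1 x log10(4) = 0.602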