Unit-6 Natural Language Processing
ChatBots
• A chatbot is one of the most common and popular applications of NLP. It is a computer program that simulates and processes
human conversation (either written or spoken), allowing humans to interact with digital devices as if they were
communicating with a real person.
• Some of the popular chatbots are as follows:
o Mitsuku Bot
o Clever Bot
o Jabberwacky
o Haptik
o Rose
o Ochatbot
Types of ChatBots
There are 2 types of chatbots around us: Script-bot and Smart-bot.
Script Bot
1. Script bots are easy to make
2. Script bots work around a script that is programmed in them
3. Mostly they are free and are easy to integrate into a messaging platform
4. No or little language processing skills
5. Limited functionality
6. Example: the bots deployed in the customer-care sections of various companies
Smart Bot
1. Smart bots are flexible and powerful
2. Smart bots work on bigger databases and other resources directly
3. Smart bots learn with more data
4. Coding is required to build and integrate a smart bot
5. Wide functionality
6. Example: Google Assistant, Alexa, Cortana, Siri, etc.
The computer understands only the language of numbers: everything that is sent to the machine has to be
converted to numbers. And while typing, if a single mistake is made, the computer throws an error and
does not process that part. The communication handled by machines is very basic and simple.
Data Processing
Q: How does NLP make it possible for machines to understand and speak just like humans?
Ans: We all know that the language of computers is numerical, so the very first step that comes to our mind is to convert
our language into numbers. This conversion, i.e. Data Processing, happens in several steps, which are given below:
Text Normalisation: In Text Normalisation, we go through several steps to normalise the text to a lower level. Text
Normalisation helps in cleaning up the textual data in such a way that its complexity becomes lower than that of the
raw data.
Corpus: A corpus is a large and structured set of machine-readable texts that have been produced in a natural
communicative setting.
OR
A corpus can be defined as a collection of text documents. It can be thought of as just a bunch of text files in a directory,
often alongside many other directories of text files.
Step-1: Sentence Segmentation: The whole corpus is divided into sentences.
Step-2: Tokenisation: Each sentence is further divided into tokens. A token is any word, number or special character
occurring in a sentence.
Step-3: Removing Stopwords, Special Characters and Numbers: In this step, the tokens which are not necessary are
removed from the token list, so that the computer can focus on the meaningful terms. Along with stopwords, a corpus
might also contain special characters and/or numbers.
Whether special characters and/or numbers should be removed depends on the type of corpus we are working on.
For example, if you are working on a document containing email IDs, you might not want to remove the special
characters and numbers, whereas in other textual data where these characters do not add meaning, you can remove
them along with the stopwords.
Stopwords: Stopwords are the words that occur very frequently in the corpus but do not add any value to it.
Examples: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to.
Example
1. Sentence: You want to see the dreams with close eyes and achieve them?
o Removed stopwords: to, the, and, ?
2. Outcome: You want see dreams with close eyes achieve them
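A minimal Python sketch of this step, assuming a hand-written stopword list taken from the examples above (real projects often use a ready-made list such as NLTK's):

```python
import re

# Stopword list from the examples above; real projects often use a larger
# ready-made list such as nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}

def remove_stopwords(sentence):
    # Keep only alphabetic tokens, dropping special characters and numbers
    # (keep those instead if your corpus needs them, e.g. email IDs).
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("You want to see the dreams with close eyes and achieve them?"))
# ['You', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```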
Step-4: Converting text to a common case: After stopword removal, we convert the whole text into the same case,
preferably lower case, so that the machine does not treat the same word written in different cases as different words.
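Continuing the sketch above, converting the token list to a common case is a one-line operation in Python:

```python
tokens = ['You', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
# Lower-case every token so that 'You' and 'you' are later counted as the same word.
tokens = [t.lower() for t in tokens]
print(tokens)
# ['you', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```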
Step-5: Stemming:
• In this step, the words are reduced to their root words.
• Stemming is the process in which the affixes of words are removed and the words are converted to their base
form.
• The stemmed words (words which we get after removing the affixes) may or may not be meaningful.
• It just removes the affixes hence it is faster.
• A stem may or may not be equal to a dictionary word.
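A small sketch of stemming using NLTK's PorterStemmer (this assumes NLTK is installed, e.g. via pip install nltk). Notice that a stem such as 'studi' is not a dictionary word:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Each word is reduced to its stem by stripping affixes.
for word in ["stressed", "studies", "healed", "dreams"]:
    print(word, "->", stemmer.stem(word))
# stressed -> stress
# studies -> studi
# healed -> heal
# dreams -> dream
```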
Q: Explain the most commonly used techniques in NLP for extracting information.
Ans: The most commonly used techniques in NLP for extracting information are:
1. Bag of Words (BOW)
2. Term Frequency and Inverse Document Frequency (TFIDF)
3. Natural Language Toolkit (NLTK)
Bag of Words (BOW)
Bag of Words is an NLP model that extracts two things from a corpus: the vocabulary (dictionary) of the corpus and the
frequency of each vocabulary word in each document. Consider this example corpus of three documents (already
normalised):
• Document 1: aman and anil are stressed
• Document 2: aman went to a therapist
• Document 3: anil went to download a health chatbot
Note that no tokens have been removed in the stopwords removal step. This is because we have very little data, and since
the frequency of all the words is almost the same, no word can be said to have less value than another.
Dictionary (the vocabulary of the corpus):
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Document Vector Table:
Word:        aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1:   1     1    1     1      1        0    0  0      0          0        0        0
Document 2:   1     0    0     0      0        1    1  1      1          0        0        0
Document 3:   0     0    1     0      0        1    1  1      0          1        1        1
In this table, the header row contains the vocabulary of the corpus and three rows correspond to three different
documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
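The following minimal Python sketch (variable names are my own) rebuilds the dictionary and the document vector table from the three example documents:

```python
documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

# Dictionary: every unique word of the corpus, in order of first appearance.
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Document vector: the frequency of each vocabulary word in the document
# (0 when the word does not occur in it).
vectors = [[doc.split().count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for vec in vectors:
    print(vec)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
```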
Finally, this gives us the document vector table for our corpus. But these frequencies alone still do not tell us how valuable
each word is. This leads us to the final step of our algorithm: TFIDF.
TFIDF
TFIDF stands for Term Frequency & Inverse Document Frequency.
Term Frequency
Term Frequency: Term frequency is the frequency of a word in one document.
Term frequency can be read directly from the document vector table, since that table mentions the frequency of each
vocabulary word in each document:
Word:        aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1:   1     1    1     1      1        0    0  0      0          0        0        0
Document 2:   1     0    0     0      0        1    1  1      1          0        0        0
Document 3:   0     0    1     0      0        1    1  1      0          1        1        1
Here, as we can see, the frequency of each word in each document has been recorded. These numbers are nothing but
the Term Frequencies!
Inverse Document Frequency
To understand IDF (Inverse Document Frequency) we should understand DF (Document Frequency) first.
DF (Document Frequency)
Definition of Document Frequency (DF): Document Frequency is the number of documents in which the word occurs
irrespective of how many times it has occurred in those documents.
Word:  aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
DF:     2     1    2     1      1        2    2  2      1          1        1        1
We can observe from this table that:
• The document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2, as each of them occurs in two documents.
• The rest of the words occur in just one document, hence their document frequency is 1.
IDF (Inverse Document Frequency): For IDF, we put the document frequency in the denominator and the total number of
documents in the numerator. Our corpus has 3 documents, so the IDF values become:
Word:  aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
IDF:    3/2  3/1   3/2  3/1    3/1      3/2  3/2  3/2     3/1        3/1      3/1      3/1
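As a small sketch, the document frequencies (and from them the IDF fractions) can be computed directly from the document vectors built earlier:

```python
vectors = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]

# A word's document frequency is the number of documents whose vector has
# a non-zero entry for that word.
doc_freq = [sum(1 for vec in vectors if vec[i] > 0) for i in range(len(vectors[0]))]
print(doc_freq)  # [2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1]
```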
Formula of TFIDF
The formula of TFIDF for any word W is:
TFIDF(W) = TF(W) * log( IDF(W) )
where TF(W) is the term frequency of W in a document and IDF(W) is the total number of documents divided by the
document frequency of W.
We do not need to calculate the log values ourselves; we can simply use the log function on a calculator.
Example of TFIDF:
Word:        aman        and       anil        are       stressed  went        to          a           therapist  download  health    chatbot
Document 1:  1*log(3/2)  1*log(3)  1*log(3/2)  1*log(3)  1*log(3)  0*log(3/2)  0*log(3/2)  0*log(3/2)  0*log(3)   0*log(3)  0*log(3)  0*log(3)
Document 2:  1*log(3/2)  0*log(3)  0*log(3/2)  0*log(3)  0*log(3)  1*log(3/2)  1*log(3/2)  1*log(3/2)  1*log(3)   0*log(3)  0*log(3)  0*log(3)
Document 3:  0*log(3/2)  0*log(3)  1*log(3/2)  0*log(3)  0*log(3)  1*log(3/2)  1*log(3/2)  1*log(3/2)  0*log(3)   1*log(3)  1*log(3)  1*log(3)
Here we can see that the IDF value for ‘aman’ is the same in each row, and a similar pattern is followed for every word
in the vocabulary. After calculating all the values, we get:
Word:        aman   and    anil   are    stressed  went   to     a      therapist  download  health  chatbot
Document 1:  0.176  0.477  0.176  0.477  0.477     0      0      0      0          0         0       0
Document 2:  0.176  0      0      0      0         0.176  0.176  0.176  0.477      0         0       0
Document 3:  0      0      0.176  0      0         0.176  0.176  0.176  0          0.477     0.477   0.477
(taking log to the base 10: log(3) ≈ 0.477 and log(3/2) ≈ 0.176)
Words that occur in more documents, like ‘went’ and ‘to’, get lower values than words that occur in only one document,
like ‘therapist’.
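A short self-contained sketch that reproduces this final table, using log to the base 10 as on a calculator:

```python
import math

# Document vectors (term frequencies) from the bag-of-words table above.
vectors = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]
total_docs = len(vectors)
doc_freq = [sum(1 for vec in vectors if vec[i] > 0) for i in range(len(vectors[0]))]

# TFIDF(W) = TF(W) * log(IDF(W)), where IDF(W) = total documents / DF(W).
tfidf = [[tf * math.log10(total_docs / doc_freq[i]) for i, tf in enumerate(vec)]
         for vec in vectors]

for row in tfidf:
    print([round(value, 3) for value in row])
# [0.176, 0.477, 0.176, 0.477, 0.477, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# [0.176, 0.0, 0.0, 0.0, 0.0, 0.176, 0.176, 0.176, 0.477, 0.0, 0.0, 0.0]
# [0.0, 0.0, 0.176, 0.0, 0.0, 0.176, 0.176, 0.176, 0.0, 0.477, 0.477, 0.477]
```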