AI-Natural Language Processing
Artificial Intelligence
Code-417
Grade 10
• Sentiment Analysis
• Identifies opinions and sentiment expressed online, helping companies
understand what customers think about their products and services and giving
an overall indicator of their reputation
• Example of a challenge: “I love the new iPhone” followed, a few lines later, by
“But sometimes it doesn’t work well”, where the person is still talking about
the iPhone (a small sketch of basic sentiment scoring follows this list)
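To make this concrete, here is a minimal Python sketch of sentiment scoring. It uses NLTK's VADER analyser, which is an assumed tool choice (not part of these notes); the two example sentences are the ones from the bullets above.

# Minimal sentiment-analysis sketch (assumed tool: NLTK's VADER analyser).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')               # one-time download of the sentiment lexicon
analyser = SentimentIntensityAnalyzer()

reviews = [
    "I love the new iPhone",
    "But sometimes it doesn't work well",
]
for review in reviews:
    scores = analyser.polarity_scores(review)    # returns neg/neu/pos/compound scores
    print(review, "->", scores["compound"])      # compound > 0 leans positive, < 0 leans negative

Notice that the two sentences pull in opposite directions even though they describe the same product, which is exactly the difficulty described above.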
Case Study 1-Practice
The world is competitive nowadays. People face competition in even the
tiniest tasks and are expected to give their best at every point in time. When
people are unable to meet these expectations, they get stressed and could
even go into depression. We hear of many cases where people become depressed
due to reasons like peer pressure, studies, family issues, relationships, etc.,
and eventually turn to things that are harmful for themselves as well as for
others. To address this, cognitive behavioral therapy (CBT) is considered one
of the best methods for handling stress, as it is simple to carry out and gives
good results. This therapy involves
understanding the behavior and mindset of a person in their normal life. With
the help of CBT, therapists help people overcome their stress and live a happy
life.
Revisiting the AI Project Cycle
Stage 1: Problem Scoping
Case Study 2- Practice
Humans are social animals. We tend to organise and/or participate in various
kinds of social gatherings all the time. We love eating out with friends and family,
which is why restaurants can be found almost everywhere; many of them arrange
buffets to offer a variety of food items to their customers. Be it small shops or big
outlets, every restaurant prepares food in bulk
as they expect a good crowd to come and enjoy their food. But in most cases,
after the day ends, a lot of food is left which becomes unusable for the restaurant
as they do not wish to serve stale food to their customers the next day. So, every
day, they prepare food in large quantities keeping in mind the probable number of
customers walking into their outlet. But if the expectations are not met, a good
amount of food gets wasted which eventually becomes a loss for the restaurant as
they either have to dump it or give it to hungry people for free. And if this daily
loss is taken into account for a year, it becomes quite a big amount.
Stage 2: Data Acquisition
Stage 5: Evaluation
Accuracy is judged on the basis of how relevant the answers given by the machine
are to the user's responses. To understand the efficiency of the model, the answers
suggested by the chatbot are compared with the actual answers (a small sketch
follows below).
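As a minimal sketch of this comparison (the answer lists below are made up for illustration):

# Compare the chatbot's suggested answers with the actual answers and report accuracy.
suggested = ["Take deep breaths", "Talk to a friend", "Sleep early"]
actual    = ["Take deep breaths", "Go for a walk",    "Sleep early"]

matches = sum(1 for s, a in zip(suggested, actual) if s == a)
accuracy = matches / len(actual)
print(f"Accuracy: {accuracy:.2f}")           # 2 of the 3 answers match, so this prints 0.67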
Chatbots
Mitsuku Bot - https://www.pandorabots.com/mitsuku/
CleverBot - https://www.cleverbot.com/
Jabberwacky - http://www.jabberwacky.com/
Haptik - https://haptik.ai/contact-us
Rose - http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php
Ochatbot - https://www.ometrics.com/blog/list-of-fun-chatbots/
Let's Introspect
• Chatbot name
• What is the purpose of this chatbot?
This statement is grammatically correct, but does it make any sense?
In human language, a perfect balance of syntax and semantics is
important for better understanding.
Natural Language-Data Processing
Text Normalisation
• Process of downsizing and simplifying the text to make it suitable for
machine processing.
• Involves
• removing unnecessary pieces from the text
• breaking the text down into smaller, simpler tokens
• converting the tokens into numeric form
• During text normalisation, processing is done on the text collected
from multiple documents and sources, which together are called a corpus.
Step 1: Sentence Segmentation
• Corpus is broken into simple sentences.
• Each sentence is treated as a separate unit of data to be processed
• Also known as sentence tokenization.
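A minimal Python sketch of this step, assuming NLTK's sent_tokenize (the two-sentence corpus below is made up):

# Sentence segmentation: break the corpus into individual sentences.
import nltk
nltk.download('punkt')                       # tokeniser models used by sent_tokenize

corpus = "Health chatbots can help with stress. They cannot replace human counsellors."
sentences = nltk.sent_tokenize(corpus)       # each sentence becomes a separate piece of data
print(sentences)
# ['Health chatbots can help with stress.', 'They cannot replace human counsellors.']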
Step 2: Tokenization
• After segmentation, each sentence is further broken down into
individual text pieces called tokens.
• Also known as word tokenization.
• Every word, number and special character is considered separately;
each of them is now a separate token.
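Continuing the sketch with NLTK's word_tokenize (again an assumed tool choice):

# Tokenization: break a sentence into individual tokens.
import nltk
nltk.download('punkt')

sentence = "We can use health chatbots for treating stress."
tokens = nltk.word_tokenize(sentence)        # every word and the full stop become separate tokens
print(tokens)
# ['We', 'can', 'use', 'health', 'chatbots', 'for', 'treating', 'stress', '.']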
Step 3: Removing Stopwords, Special Characters and Numbers
• Stopwords are the words which occur very frequently in the corpus
but do not add any value to it.
Along with these words, a lot of times our corpus might have special
characters and/or numbers. Now it depends on the type of corpus that
we are working on whether we should keep them in it or not. For
example, if you are working on a document containing email IDs, then
you might not want to remove the special characters and numbers
whereas in some other textual data if these characters do not make
sense, then you can remove them along with the stopwords.
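A small sketch of this step, using NLTK's English stopword list (an assumed choice; whether numbers and symbols are dropped depends on your corpus, as noted above):

# Remove stopwords, special characters and numbers from a token list.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

tokens = ['We', 'can', 'use', 'health', 'chatbots', 'for', 'treating', 'stress', '!', '2024']
stop_words = set(stopwords.words('english'))

cleaned = [t for t in tokens
           if t.lower() not in stop_words    # drop frequent, low-value words like 'we', 'can', 'for'
           and t.isalpha()]                  # drop numbers and special characters
print(cleaned)                               # ['use', 'health', 'chatbots', 'treating', 'stress']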
Step 4: Converting text to a common case
Step 5: Stemming
• Stemming is the process in which the affixes of words are removed
and the words are converted to their base form or their root words.
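A small sketch covering Steps 4 and 5 together, with NLTK's PorterStemmer as an assumed choice of stemmer:

# Convert tokens to a common (lower) case, then strip affixes to get root forms.
from nltk.stem import PorterStemmer

tokens = ['Health', 'Chatbots', 'Treating', 'Studies', 'Caring']
lowered = [t.lower() for t in tokens]        # Step 4: common case
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in lowered]   # Step 5: stemming
print(stems)                                 # ['health', 'chatbot', 'treat', 'studi', 'care']

Note that a stem such as 'studi' need not be a meaningful word; that is expected behaviour for stemming.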
The Bag of Words algorithm then extracts, from the normalised corpus, the unique
words and their frequencies. Calling this algorithm a “bag” of words symbolises
that the sequence of sentences or tokens does not matter; all we need are the
unique words and their frequency in the corpus.
• Note that no tokens have been removed in the stopwords removal step. This is
because we have very little data, and since the frequency of all the words is almost
the same, no word can be said to have lesser value than the others.
Step 2: Create Dictionary
• Go through all the steps and create a dictionary, i.e., list down all the
words which occur across the three documents:
Note that even though some words are repeated in different documents,
they are all written just once as while creating the dictionary, we create
the list of unique words.
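A minimal sketch of Step 2, using the three documents from the practice corpus later in these notes (already lowercased and with punctuation removed; stemming is skipped for simplicity):

# Build the dictionary: the list of unique words across all documents.
documents = [
    "we can use health chatbots for treating stress",
    "we can use nlp to create chatbots and we will be making health chatbots now",
    "health chatbots cannot replace human counsellors now",
]

dictionary = []
for doc in documents:
    for word in doc.split():
        if word not in dictionary:           # repeated words are written only once
            dictionary.append(word)
print(dictionary)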
Step 3: Create document vector
In the document vector, we write, against each word of the dictionary, the number
of times it occurs in that document, i.e., its term frequency (TF). For the inverse
document frequency (IDF), we put the document frequency in the denominator
while the total number of documents is the numerator.
Formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log( IDF(W) )
TF values are for each document while the IDF values are for the whole
corpus.
The IDF value for Aman is the same in each row, and a similar pattern is followed
for all the words of the vocabulary. After calculating all the values, we get:
Since we have a small amount of data, words like ‘are’ and ‘and’ also have a
high value. But as a word occurs in more and more documents, its IDF ratio falls
towards 1 and the value of that word decreases, as the example below shows.
For example: Total Number of documents: 10
Number of documents in which ‘and’ occurs: 10
IDF(and) = 10/10 = 1
Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.
On the other hand, number of documents in which ‘pollution’ occurs: 3
IDF(pollution) = 10/3 = 3.3333…
Which means: log(3.3333) = 0.522; which shows that the word
‘pollution’ has considerable value in the corpus.
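The same arithmetic can be sketched in a few lines of Python, using the formula above with log base 10 (the three toy documents are the ones used in the dictionary sketch, so the numbers differ from the 10-document example):

# TFIDF(W) = TF(W) * log10(N / DF(W)), where N is the total number of documents
# and DF(W) is the number of documents in which W occurs.
import math

documents = [
    "we can use health chatbots for treating stress",
    "we can use nlp to create chatbots and we will be making health chatbots now",
    "health chatbots cannot replace human counsellors now",
]
tokenised = [doc.split() for doc in documents]
N = len(tokenised)
vocabulary = sorted({word for doc in tokenised for word in doc})

for i, doc in enumerate(tokenised, start=1):
    for word in vocabulary:
        tf = doc.count(word)                          # term frequency in this document
        df = sum(1 for d in tokenised if word in d)   # document frequency across the corpus
        tfidf = tf * math.log10(N / df)
        if tf:                                        # show only words present in the document
            print(f"document {i}  {word}: {tfidf:.3f}")

Words such as 'health' and 'chatbots', which occur in all three documents, end up with a value of 0 here, matching point 1 of the summary below.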
Summarising the concept, we can say that:
1. Words that occur in all the documents with high term frequencies
have the least values and are considered to be the stopwords.
2. For a word to have a high TFIDF value, the word needs to have a high
term frequency but a low document frequency, which shows that the
word is important for one document but is not a common word across
all documents.
3. These values help the computer understand which words are to be
considered while processing the natural language. The higher the
value, the more important the word is for a given corpus.
Applications of TFIDF
• Document Classification: Helps in classifying the type and genre of a
document.
• Topic Modelling: Helps in predicting the topic for a corpus.
• Information Retrieval System: Helps in extracting the important information
out of a corpus.
• Stop word filtering: Helps in removing the unnecessary words out of a
text body.
Let's Practice
Here is a corpus for you to challenge yourself with the given tasks. Use
the knowledge you have gained in the above sections and try
completing the whole exercise by yourself.
The Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making
health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now.
Accomplish the following challenges on the basis of the corpus given above. You
can use the tools available online for these challenges. Link for each tool is given
below:
1. Sentence Segmentation: https://tinyurl.com/y36hd92n
2. Tokenisation: https://textanalysisonline.com/nltk-word-tokenize
3. Stopwords removal: https://demos.datasciencedojo.com/demo/stopwords/
4. Lowercase conversion: https://caseconverter.com/
5. Stemming: http://textanalysisonline.com/nltk-porter-stemmer
6. Lemmatisation: http://textanalysisonline.com/spacy-word-lemmatize
7. Bag of Words: Create a document vector table for all documents.
8. Generate TFIDF values for all the words.
9. Find the words having highest value.
10. Find the words having the least value.
Questions
1. What is a Chatbot?
2. What is the difference between stemming and lemmatization?
3. Which package is used for Natural Language Processing in Python programming?
4. What do you mean by corpus?
5. Differentiate between a script-bot and a smart-bot.
6. What are the types of data used for Natural Language Processing applications?
7. Give an example of the following:
a. Multiple meanings of a word
b. Perfect syntax, no meaning
8. While working with NLP, what is the meaning of:
a. Syntax
b. Semantics
Reference for notes:
https://drive.google.com/file/d/1jhl6eribMOHmOGyXa7vCyPcZ-790wYS2/view?usp=sharing