Natural Language Processing
Natural Language Processing (NLP) is one of the branches of AI that helps machines understand, interpret, and manipulate human languages such as English or Hindi to analyze and derive their meaning.
NLP takes as input the spoken words, verbal commands or speech-recognition data that humans use in their daily lives, and operates on them. Before getting deeper into the concept, follow the links to play the games based on NLP.
Just open the browser, play a game, and then answer a few questions like:
We have seen some of the common uses of Natural Language Processing in our daily lives, like virtual assistants and Google Translate. Here are some more applications of NLP:
Automatic Summarization
Today the internet is a huge source of information, so it is very difficult to access specific information from the web.
Summarization helps us condense that information, and it can even capture the emotional meaning behind it, for example in social media posts.
It can also provide an overview of a blog post or news story, avoiding redundancy from multiple sources and maximizing the diversity of content obtained.
For example: newsletters, social media marketing, video scripting etc.
Sentiment Analysis
Companies often need to identify opinions and sentiment online to understand what customers think about their products and services.
Sentiment analysis can also help to gauge a company's overall reputation.
It helps customers purchase products or services based on the opinions expressed by others, and helps businesses understand sentiment in context.
For example, a customer might write, "I like the new smartphone, but it has weak battery backup." Use cases include brand monitoring, customer support analysis, customer feedback analysis, market research etc.
Text Classification
It helps to assign predefined categories to a document and organize it in such a way that customers can find the information they want.
It also helps to simplify related activities.
For example, spam filtering in email, auto tagging in social media, categorization of news articles etc.
Virtual Assistants
Virtual assistants help humans keep notes of important tasks, make calls, send messages and much more; Siri, Cortana, Google Assistant and Alexa are common examples.
Revisiting the AI Project Cycle
The Cognitive Behavioural Therapy (CBT) technique is used to treat depression and stress. But what about people who do not want to consult a psychiatrist? Let us look at this problem with the 4Ws problem canvas.
Who Canvas
What do we know about them? People who are going through stress are reluctant to consult a psychiatrist.
What Canvas
How do you know it is a problem? Studies around mental stress and depression are available from various authentic sources.
Where Canvas
Why Canvas
What would be of key value to the stakeholders? People get a platform where they can talk and vent out their feelings anonymously. People get a medium that can interact with them, apply primitive CBT on them, and suggest help whenever needed.
After the 4Ws canvas, the problem statement template is created, which reads: people who are going through stress (Who) have a problem of not being able to share their feelings (What). The problem statement can be summarised as:
“To create a chatbot which can interact with people, help them to vent out their feelings and take them through primitive CBT.”
Modelling
Once the text has been normalised, it is fed to an NLP-based AI model. Note that in NLP, modelling requires data pre-processing first; only after that is the data fed to the machine.
Depending upon the type of chatbot we are trying to make, there are many AI models available which help us build the foundation of our project.
Evaluation
The trained model is then evaluated, and its accuracy is calculated on the basis of the relevance of the answers the machine gives to the user's responses.
To understand the efficiency of the model, the answers suggested by the chatbot are compared to the actual answers.
In the above diagram, the blue line shows the model's output while the green one is the actual function, along with the data samples.
1. Figure 1: The model's output does not match the true function at all. Hence the model is said to be underfitting and its accuracy is low.
2. Figure 2: The model's output matches the true function well, which means the model has optimum accuracy; such a model is called a perfect fit.
3. Figure 3: The model tries to cover every data sample, even those that are out of alignment with the true function. This model is said to be overfitting, and it too has low accuracy (see the sketch below).
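To see all three cases in action, here is a minimal sketch (not from the source material) that fits polynomials of three different degrees to noisy samples of a known function; the data, noise level and degrees 1, 4 and 15 are illustrative assumptions:

```python
# A sketch of under-, perfect and overfitting using polynomial regression.
# The data, noise level and polynomial degrees are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)             # the "true function"
y = y_true + rng.normal(0, 0.2, x.size)    # noisy data samples

for degree in (1, 4, 15):                  # underfit, good fit, overfit
    coeffs = np.polyfit(x, y, degree)      # numpy may warn for degree 15
    y_pred = np.polyval(coeffs, x)
    error = np.mean((y_pred - y_true) ** 2)
    print(f"degree {degree:2d} -> error vs true function: {error:.3f}")
```

The degree-1 line is too simple (underfitting), degree 4 tracks the true function closely (a good fit), and degree 15 chases the noise (overfitting), which typically shows up as a larger error against the true function.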
Once the model is evaluated thoroughly, it is then deployed in the form of an app that people can use easily.
Chatbot
A chatbot is one of the most common and popular applications of NLP.
There are lots of chatbots available, and many of them use the same approach.
Some of the popular chatbots are as follows:
Mitsuku Bot
Cleverbot
Jabberwacky
Haptik
Rose
OChatbot
Chatbots can be divided into two types:
1. Script-bot:
Script-bots are very easy to make.
They work around the script that is programmed into them.
Mostly free to use.
Easy to integrate into a messaging platform.
Little or no language-processing skill is required.
They offer limited functionality.
Story Speaker is an example of a script-bot.
2. Smart-bot:
Smart-bots are flexible and powerful.
They work on bigger databases and other resources directly.
They learn with more data.
Coding is required to take them on board.
They offer wide functionality.
Google Assistant, Alexa, Siri, Cortana etc. are examples of smart-bots.
In the next section of Unit 6 Natural Language Processing AI Class 10, we will discuss human language vs computer language.
“His face turned red after he found out that he took the wrong bag.”
This statement is correct in syntax, but does it make complete sense?
In human language, a perfect balance of syntax and semantics is important for better understanding.
Data Processing
Humans communicate with each other very easily.
Natural language can be used very conveniently and efficiently when humans speak and understand it.
At the same time, it is very difficult and complex for a computer to process it.
Now the question is: how can a machine understand and speak natural language just like humans?
A computer understands only numbers, so the basic step is to convert each word or letter into numbers.
This conversion requires text normalization.
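As a toy illustration (not from the source material), a machine could assign every distinct word an index number; the word list below is an arbitrary example:

```python
# Toy example: a machine works with numbers, so each distinct word
# can be mapped to an index. The word list here is arbitrary.
words = ["health", "chatbot", "stress"]
word_to_number = {word: index for index, word in enumerate(words)}
print(word_to_number)  # {'health': 0, 'chatbot': 1, 'stress': 2}
```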
Text Normalization
As human languages are too complex, they need to be simplified before a machine can understand them.
Text normalization helps clean up the textual data in such a way that its complexity comes down to a level lower than that of the actual data.
In text normalization, we follow several steps to normalise the text to a lower level.
First, we need to collect the text that has to be normalized.
What does text normalization include?
It is the process of transforming a text into a canonical (standard) form. For example, the words ‘Welll’ and ‘Wel’ can both be transformed to “Well”.
During text normalization we reduce randomness and bring the text closer to a predefined standard. This reduces the amount of different information that the computer has to deal with, and therefore improves efficiency.
Corpus
The entire textual data collected from all the documents together is known as a corpus.
To work on a corpus, these steps are required:
Sentence Segmentation
Tokenisation
Removing stopwords, special characters and numbers
Converting text to a common case
Stemming
Lemmatization
Stop words are words that are used very frequently in the corpus but do not add any value to it.
In human language, certain words are needed for grammar but do not add any essence to the corpus.
Some examples of stop words are: a, an, and, are, at, for, is, of, the, to.
Such words have little or no meaning in the corpus, hence they are removed so that the focus stays on the meaningful terms.
Along with these stopwords, the corpus may contain special characters and numbers. Sometimes these are meaningful, sometimes not. For example, in email IDs the symbol @ and some numbers are very important. Special characters and numbers that are not meaningful can be removed just like stopwords.
The next step after removing stopwords is to convert the whole text into the same case.
The most preferable case is lower case.
This ensures that the machine's case-sensitivity does not treat the same words as different just because of different cases.
In the above example, the word “hello” is written in six different forms; all of them are converted to lower case and hence treated as the same word by the machine.
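Here is a minimal sketch of these normalization steps in plain Python; the stopword list and the sample sentence are illustrative assumptions, not a standard resource:

```python
# A minimal text-normalization sketch in plain Python. The stopword
# list and the sample sentence are illustrative assumptions.
import string

STOPWORDS = {"a", "an", "and", "are", "the", "to", "is", "of"}

def normalize(document):
    text = document.lower()                       # convert to lower case
    text = "".join(ch for ch in text              # drop special characters
                   if ch not in string.punctuation)
    tokens = text.split()                         # tokenise into words
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords

print(normalize("Divya and Rani are stressed!"))
# -> ['divya', 'rani', 'stressed']
```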
Stemming
Stemming removes affixes from words to reduce them to a root form; the stem produced may not be a meaningful word. For example:
healed → remove ‘ed’ → heal
healer → remove ‘er’ → heal
studies → remove ‘es’ → studi
Lemmatization
Lemmatization also removes affixes, but it ensures that the resulting word, called the lemma, is a meaningful word. For example:
healed → remove ‘ed’ → heal
healer → remove ‘er’ → heal
studies → remove ‘es’ → study
studying → remove ‘ing’ → study
Compare the stemming and lemmatization tables: you will find that the word ‘studies’ is converted into ‘studi’ by stemming, whereas its lemma is ‘study’.
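For a hands-on check, here is a small sketch using the NLTK library (an assumption; the source names no library). It requires nltk and a one-time nltk.download("wordnet"), and real outputs can differ slightly from the illustrative tables above:

```python
# Compare stemming and lemmatization with NLTK (assumed library).
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ("healed", "studies", "studying"):
    print(word,
          "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))  # treat as verbs
```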
After normalisation of the corpus, let’s convert the tokens into numbers. To do so the bag of words algorithm will be used.
Bag of words
Bag of words is an NLP model that extracts features of the text which can be helpful in machine learning algorithms.
We get the occurrences of each word and develop the vocabulary for the corpus.
The above image shows how the bag of words algorithm works.
The text given on the left is the normalised corpus, obtained after going through all the steps of text processing.
The image in the middle shows the bag of words algorithm, where we put all the words we got from text processing.
The image on the right shows the unique words returned by the bag of words algorithm along with their occurrences in the text corpus.
Eventually, the bag of words returns us two things:
A vocabulary of words
The frequency of words
The name “bag” of words symbolises that the sequence of sentences or tokens does not matter in this case; all we need are the unique words and their frequencies.
The step-by-step process to implement the bag of words algorithm:
1. Text Normalization
2. Create dictionary
3. Create document vectors
4. Create document vectors for all documents
Text Normalization:
This step collects the data and pre-processes it. For example, consider a corpus of three documents having one sentence each. After text normalization, the text would be:
Document 1: [Divya, and, Rani, are, stressed]
Document 2: [Rani, went, to, a, therapist]
Document 3: [Divya, went, to, download, a, health, chatbot]
To create a dictionary, write down all the words which occur in the three documents.
Dictionary:
In this step, the repeated words are written just once and we create a list of unique words:
Divya, and, Rani, went, are, stressed, to, a, therapist, download, health, chatbot
Then, for each document, we create a document vector by counting how many times every dictionary word occurs:

Word: Divya and Rani went are stressed to a therapist download health chatbot
Document 1: 1 1 1 0 1 1 0 0 0 0 0 0
Document 2: 0 0 1 1 0 0 1 1 1 0 0 0
Document 3: 1 0 0 1 0 0 1 1 0 1 1 1
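As a minimal sketch, the whole bag-of-words procedure can be written in a few lines of Python; the three document texts are the reconstructed examples used in this section, and the dictionary order may differ from the table above:

```python
# A minimal bag-of-words sketch. Document texts are the reconstructed
# examples from this section; dictionary order may differ from the table.
docs = [
    "divya and rani are stressed",
    "rani went to a therapist",
    "divya went to download a health chatbot",
]

# Step 2: create the dictionary of unique words
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)
print(vocab)

# Steps 3-4: create a document vector (word counts) for every document
for doc in docs:
    tokens = doc.split()
    print([tokens.count(word) for word in vocab])
```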
TFIDF: Term Frequency and Inverse Document Frequency
The bag of words gives us word frequencies, but the most frequent words are not always the most valuable; TFIDF helps identify the value of each word. There are two terms in TFIDF: one is Term Frequency and the other is Inverse Document Frequency.
Term Frequency
Term frequency is the frequency of a word in one document; it is simply the word's value in the document vector table:

Word: Divya and Rani went are stressed to a therapist download health chatbot
Document 1: 1 1 1 0 1 1 0 0 0 0 0 0
Document 2: 0 0 1 1 0 0 1 1 1 0 0 0
Document 3: 1 0 0 1 0 0 1 1 0 1 1 1

Document frequency is the number of documents in which a word occurs:

Word: Divya and Rani went are stressed to a therapist download health chatbot
Document Frequency: 2 1 2 2 1 1 2 2 1 1 1 1

In the above table you can observe that the words “Divya”, “Rani”, “went”, “to” and “a” have a document frequency of 2, as they occur in two documents. The rest of the terms occur in just one document.
For Inverse Document Frequency, put the document frequency in the denominator and the total number of documents in the numerator. Since our example corpus has 3 documents, the IDF values for each word are as follows:

Word: Divya and Rani went are stressed to a therapist download health chatbot
IDF: 3/2 3/1 3/2 3/2 3/1 3/1 3/2 3/2 3/1 3/1 3/1 3/1

The formula of TFIDF for any word W is: TFIDF(W) = TF(W) * log(IDF(W)). Here, log is to the base 10. Don't worry! You don't need to calculate the log values by yourself. Simply use the log function in the calculator and find out!
Now, let's multiply the IDF values with the TF values. Note that the TF values are for each document while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row of the document vector table:

Document 1: 1*log(3/2) 1*log(3) 1*log(3/2) 0*log(3/2) 1*log(3) 1*log(3) 0*log(3/2) 0*log(3/2) 0*log(3) 0*log(3) 0*log(3) 0*log(3)
Document 2: 0*log(3/2) 0*log(3) 1*log(3/2) 1*log(3/2) 0*log(3) 0*log(3) 1*log(3/2) 1*log(3/2) 1*log(3) 0*log(3) 0*log(3) 0*log(3)
Document 3: 1*log(3/2) 0*log(3) 0*log(3/2) 1*log(3/2) 0*log(3) 0*log(3) 1*log(3/2) 1*log(3/2) 0*log(3) 1*log(3) 1*log(3) 1*log(3)
For example, suppose a corpus has 10 documents and the word ‘and’ occurs in all 10 of them. Then IDF(and) = 10/10 = 1, and log(1) = 0, so ‘and’ contributes no value. On the other hand, suppose the word ‘pollution’ occurs in only 3 of the 10 documents: IDF(pollution) = 10/3 = 3.3333…, and log(3.3333) = 0.522, which shows that the word ‘pollution’ has considerable value in the corpus.
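Putting it together, here is a minimal sketch computing the TFIDF values for the example corpus (document texts as reconstructed above; the output order may differ from the tables):

```python
# A sketch computing TFIDF(W) = TF(W) * log10(N / document_frequency)
# for the example corpus (document texts as reconstructed above).
import math

docs = [
    "divya and rani are stressed".split(),
    "rani went to a therapist".split(),
    "divya went to download a health chatbot".split(),
]
N = len(docs)
vocab = sorted({word for doc in docs for word in doc})

for doc in docs:
    row = {}
    for word in vocab:
        tf = doc.count(word)                  # term frequency
        df = sum(word in d for d in docs)     # document frequency
        row[word] = round(tf * math.log10(N / df), 3)
    print(row)
```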
Applications of TFIDF
Document Classification: helps in classifying the type and genre of a document.
Topic Modelling: helps in predicting the topic for a corpus.
Information Retrieval System: helps to extract the important information out of a corpus.
Stop word filtering: helps in removing unnecessary words from a text body.
Here is a corpus for you to challenge yourself with. Use the knowledge you have gained in the sections above and try completing the whole exercise by yourself.
The Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now.
Accomplish the following challenges on the basis of the corpus given above. You can use the tools available online for these challenges.
Natural Language Processing class 10 is one of the topics of the Artificial Intelligence skill course; it is Unit 6 of the CBSE Artificial Intelligence curriculum for class 10.
NLP stands for Natural Language Processing. It is one of the subfields or domains of AI that helps machines understand, interpret, and manipulate human languages such as English or Hindi to analyze and derive their meaning.
[3] I am all about visual data like images and video. – Who am I?
Ans. Computer Vision
Follow this link to know more about the game: Mystery Animal
[6] Sagar is collecting data from social media platforms. He collected a large amount of data, but he wants specific information from it. Which NLP application would help him?
Ans. Automatic Summarization
[9] Which application of NLP assigns predefined categories to a document and organizes it to help customers find the information they want?
Ans. Text Classification
[12] I help humans keep notes of their important tasks, make calls for them, send messages and much more. Who am I?
Ans. Virtual Assistant or Chatbot
[18] The customer care section of various companies includes which type of chatbot?
Ans. The customer care section of most companies uses a script-bot.
[19] Virtual assistants like Siri, Cortana, Google Assistant etc. can be taken as which type of chatbot?
Ans. Virtual assistants like Siri, Cortana, Google Assistant etc. can be taken as smart-bots.
[29] Name the term used for any word, number or special character occurring in a sentence.
Ans. Tokens
[30] In which process is every word, number or special character considered separately?
Ans. Tokenization