
Part-B, Unit-6 Natural Language Processing

Q: What are the domains of A.I.?


Ans: There are three domains of AI: Data Science, Computer Vision and Natural Language Processing (NLP).
1. Data Science: It is all about applying mathematical and statistical principles to data. In simple words, Data
Science is the study of data, which can be of three types – audio, visual and textual.
2. Computer Vision: In simple words, it is identifying symbols and patterns in given objects (pictures) captured by a
camera, learning those patterns and using them to alert about or predict future objects.
3. Natural Language Processing (NLP): It is the sub-field of AI that focuses on the ability of a computer to
understand human language (commands), spoken or written, and to give an output by processing it (mimicking
human conversation).

Q: What are the applications of Natural Language Processing?


Ans: Some of the applications of Natural Language Processing that are used in real-life scenarios are:
1. Automatic Summarization
• Summarizing the meaning of documents and information.
• Extracting the key emotional information from the text to understand reactions (e.g. on social media).
2. Sentiment Analysis
• Definition: Identify sentiment among several posts or even in the same post where emotion is not always
explicitly expressed.
• Companies use it to identify opinions and sentiments to understand what customers think about their products
and services.
• Can be Positive, Negative or Neutral
3. Text classification
• Text classification makes it possible to assign predefined categories to a document and organize it to help you
find the information you need or simplify some activities.
• For example, an application of text categorization is spam filtering in email.
4. Virtual Assistants
• Nowadays Google Assistant, Cortana, Siri, Alexa, etc have become an integral part of our lives. Not only can we
talk to them but they also have the ability to make our lives easier.
• By accessing our data, they can help us in keeping notes of our tasks, making calls for us, sending messages, and a
lot more.
• With the help of speech recognition, these assistants can not only detect our speech but can also make sense of
it.

ChatBots
• A chatbot is a very common and popular model of NLP. It is a computer program that simulates and processes
human conversation (either written or spoken), allowing humans to interact with digital devices as if they were
communicating with a real person.
• Some of the popular chatbots are as follows:
o Mitsuku Bot
o Clever Bot
o Jabberwacky
o Haptik
o Rose
o Chatbot

Types of ChatBots
There are 2 types of chatbots around us: Script-bot and Smart-bot.
Script Bot
1. Script bots are easy to make
2. Script bots work around a script that is programmed in them
3. Mostly they are free and are easy to integrate into a messaging platform
4. No or little language processing skills
5. Limited functionality
6. Example: the bots which are deployed in the customer care section of various companies
Smart Bot
1. Smart bots are flexible and powerful
2. Smart bots work on bigger databases and other resources directly
3. Smart bots learn with more data
4. Coding is required to build them and take them on board
5. Wide functionality
6. Example: Google Assistant, Alexa, Cortana, Siri, etc.

Human Language VS Computer Language


Human Language
Humans communicate through language, which we process all the time. As a person speaks, the sound
travels and enters the listener's eardrum. This sound is then converted into neural impulses and
transported to the brain for processing. After processing, the brain understands the
meaning of the sound.
Computer Language

The computer understands the language of numbers. Everything that is sent to the machine has to be
converted to numbers. And while typing, if a single mistake is made, the computer throws an error and
does not process that part. The communications made by the machines are very basic and simple.

❖ Syntax: Syntax refers to the grammatical structure of a sentence.


❖ Semantics: It refers to the meaning of the sentence.
Semantics and Syntax with some examples:
1. Different syntax, same semantics: 2+3 = 3+2
o Here the way these statements are written is different, but their meanings are the same, that is 5.
2. Different semantics, same syntax: 2/3 (Python 2.7) ≠ 2/3 (Python 3)
o Here the statements written have the same syntax but their meanings are different. In Python 2.7, this
statement would result in 0 (integer division), while in Python 3 it would give an output of about 0.67 (see the sketch below).
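To see this difference concretely, here is a minimal sketch (assuming a Python 3 interpreter): the / operator performs true division, while // reproduces the integer division that Python 2.7 applied to 2/3.

```python
# Python 3: "/" is true division, "//" is floor (integer) division.
print(2 / 3)    # 0.6666666666666666 -> what Python 3 means by 2/3
print(2 // 3)   # 0                  -> what Python 2.7 meant by 2/3
```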

Q: What are the difficulties faced by machines in understanding human language?


Ans: The difficulties faced by machines in understanding human language are:
1. Arrangement of the words and meaning: There are rules in human language which provide structure to a language.
There are nouns, verbs, adverbs, adjectives. A word can be a noun at one time and an adjective some other time.
2. Multiple meanings of a word: In natural language, a word can have multiple meanings, and the meaning that fits is
chosen according to the context of the statement.
3. Perfect Syntax, no Meaning: Sometimes, a statement can have a perfectly correct syntax but it does not mean
anything.
For example, take a look at this statement:
Chickens feed extravagantly while the moon drinks tea.
This statement is correct grammatically but does not make any sense.

Data Processing
Q: How does NLP make it possible for machines to understand and speak just like humans?
Ans: We all know that the language of computers is numerical, so the very first step that comes to our mind is to convert
our language into numbers. This conversion, i.e. data processing, happens in the various steps given below:

Text Normalisation: In Text Normalisation, we go through several steps to normalise the text to a lower level. Text
Normalisation helps in cleaning up the textual data in such a way that its complexity becomes lower than that of the
actual data.

Steps of Text Normalisation are:


Step-1: Sentence Segmentation: In Sentence segmentation, the whole corpus (the whole textual data from all the
documents) is divided into sentences.
Example:
Before Sentence Segmentation
“You want to see the dreams with close eyes and achieve them? They’ll remain dreams, look for AIMs and
your eyes have to stay open for a change to be seen.”
After Sentence Segmentation
1. You want to see the dreams with close eyes and achieve them?
2. They’ll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen.
Step-2: Tokenisation: After segmenting the sentences, each sentence is then further divided into tokens.
A token is a term used for any word, number or special character occurring in a sentence.
Example:
Sentence: You want to see the dreams with close eyes and achieve them?
Tokens: You | want | to | see | the | dreams | with | close | eyes | and | achieve | them | ?
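As an illustration, both steps can be tried with the NLTK library; the sketch below is a minimal example and assumes NLTK is installed and its 'punkt' tokenizer data has been downloaded.

```python
import nltk
nltk.download("punkt", quiet=True)  # one-time download of the sentence/word tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

corpus = ("You want to see the dreams with close eyes and achieve them? "
          "They'll remain dreams, look for AIMs and your eyes have to stay "
          "open for a change to be seen.")

sentences = sent_tokenize(corpus)      # Step-1: sentence segmentation
print(sentences)

tokens = word_tokenize(sentences[0])   # Step-2: tokenisation of the first sentence
print(tokens)
# ['You', 'want', 'to', 'see', 'the', 'dreams', 'with', 'close', 'eyes', 'and', 'achieve', 'them', '?']
```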

Corpus: A corpus is a large and structured set of machine-readable texts that have been produced in a natural
communicative setting.
OR
A corpus can be defined as a collection of text documents. It can be thought of as just a bunch of text files in a directory,
often alongside many other directories of text files.

Step-3: Removing Stopwords, Special Characters and Numbers: In this step, the tokens which are not necessary are
removed from the token list. To make it easier for the computer to focus on meaningful terms, these words are removed.
Along with these words, a lot of times our corpus might have special characters and/or numbers.
Removal of special characters and/or numbers depends on the type of corpus that we are working on and whether we
should keep them in it or not.
For example: if you are working on a document containing email IDs, then you might not want to remove the special
characters and numbers whereas in some other textual data if these characters do not make sense, then you can remove
them along with the stopwords.

Stopwords: Stopwords are the words that occur very frequently in the corpus but do not add any value to it.
Examples: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to.

Example
1. You want to see the dreams with close eyes and achieve them?
o The removed words would be: to, the, and, ?
2. The outcome would be:
o You want see dreams with close eyes achieve them
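A minimal sketch of this step with NLTK's built-in English stopword list is given below; note that NLTK's list is larger than the handful of words removed by hand above, so its output differs slightly from the example.

```python
import string
import nltk
nltk.download("stopwords", quiet=True)  # one-time download of the stopword list
from nltk.corpus import stopwords

tokens = ['You', 'want', 'to', 'see', 'the', 'dreams', 'with',
          'close', 'eyes', 'and', 'achieve', 'them', '?']

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens
            if t.lower() not in stop_words       # drop stopwords
            and t not in string.punctuation]     # drop special characters
print(filtered)   # ['want', 'see', 'dreams', 'close', 'eyes', 'achieve']
```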
Step-4: Converting text to a common case: After the stopwords removal, we convert the whole text into the same case,
preferably lower case.
Step-5: Stemming:
• In this step, the words are reduced to their root words.
• Stemming is the process in which the affixes of words are removed and the words are converted to their base
form.
• The stemmed words (words which we get after removing the affixes) may or may not be meaningful.
• It just removes the affixes hence it is faster.
• A stem may or may not be equal to a dictionary word.

Word           Affix    Stem
healed         -ed      heal
healing        -ing     heal
healer         -er      heal
studies        -es      studi
studying       -ing     study
dreams         -s       dream
standardize    -ize     standard
simplified     -ed      simplifi
drives         -s       drive
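The table can be reproduced approximately with NLTK's PorterStemmer; this is only a sketch, and a different stemmer may strip affixes slightly differently from the table above.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["healed", "healing", "healer", "studies", "studying",
         "dreams", "standardize", "simplified", "drives"]
for w in words:
    print(w, "->", stemmer.stem(w))
# e.g. studies -> studi : the stem need not be a dictionary word
```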
Step-6: Lemmatization:
• It is an alternative process to stemming.
• It also removes the affixes from the words in the corpus.
• The only difference between lemmatization and stemming is that the output of lemmatization is a meaningful word.
• The final output is known as a lemma.
• It takes longer to execute than stemming.
• A lemma is the base (root) form of a word that exists in the dictionary.
Word           Affix    Lemma
healed         -ed      heal
healing        -ing     heal
healer         -er      heal
studies        -es      study
studying       -ing     study
dreams         -s       dream
standardize    -ize     standard
simplified     -ed      simplify
drives         -s       drive
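For comparison, a minimal lemmatization sketch with NLTK's WordNetLemmatizer is shown below; it assumes the 'wordnet' data has been downloaded, and the part of speech is passed explicitly because the lemma depends on it.

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the WordNet dictionary
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies", pos="n"))     # study
print(lemmatizer.lemmatize("healing", pos="v"))     # heal
print(lemmatizer.lemmatize("dreams", pos="n"))      # dream
print(lemmatizer.lemmatize("simplified", pos="v"))  # simplify
```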

Difference between stemming and lemmatization


Stemming
1. The stemmed words might not be meaningful.
2. Caring ➔ Car
Lemmatization
1. The lemma is a meaningful word.
2. Caring ➔ Care

Original        Stemmed       Lemmatized
denied          deni          deny
sciences        scienc        science
uses            use           use
scientific      scientif      scientific
methods         method        method
troubles        troubl        trouble
is, am, are     is, am, are   be
visibilities    visibl        visibility
adhere          adher         adhere
oxen            oxen          ox
indices         indic         index
swum            swum          swim
bettered        better        good

Q: Explain the most commonly used techniques in NLP for extracting information.
Ans: The most commonly used techniques in NLP for extracting information are:
1. Bag of Words (BoW)
2. Term Frequency and Inverse Document Frequency (TFIDF)
3. Natural Language Toolkit (NLTK)

Bag of Words Algorithm


In the Bag of Words model, we count the occurrences of each word and construct the vocabulary for the corpus.
Bag of Words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (the number of times each word has occurred in the whole corpus).
Q: Write the steps to implement the Bag of Words algorithm/model.
Ans: BoW is implemented through the following steps:
1. Text Normalisation: Collecting data and pre-processing it by removing the known stop words.
2. Create Dictionary: Making a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times each word from the unique
list of words has occurred.
4. Repeat this to create document vectors for all the documents.
Example:
Step 1: Collecting data and pre-processing it.
Raw Data
• Document 1: Aman and Anil are stressed
• Document 2: Aman went to a therapist
• Document 3: Anil went to download a health chatbot

Processed Data
• Document 1: [aman, and, anil, are, stressed]
• Document 2: [aman, went, to, a, therapist]
• Document 3: [anil, went, to, download, a, health, chatbot]

Note that no tokens have been removed in the stopwords removal step. This is because we have very little data and,
since the frequency of all the words is almost the same, no word can be said to have less value than another.

Step 2: Create Dictionary


Definition of Dictionary:
Dictionary in NLP means a list of all the unique words occurring in the corpus. If some words are repeated in different
documents, they are all written just once while creating the dictionary.

Dictionary:
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot


Although some words are repeated in different documents, they are written just once: while creating the dictionary, we
create a list of unique words.

Step 3: Create a document vector


Definition of Document Vector: The document Vector contains the frequency of each word of the vocabulary in a
particular document.

Q: How to make a document vector table?


Ans: In the document vector table, the vocabulary is written in the top row. Now, for each word in the document, if it
matches a word in the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1.
If a word does not occur in that document, put a 0 under it.

aman and anil are stressed went to a therapist download health chatbot

1 1 1 1 1 0 0 0 0 0 0 0

Step 4: Create the document vector table for all documents

aman and anil are stressed went to a therapist download health chatbot

1 1 1 1 1 0 0 0 0 0 0 0
1 0 0 0 0 1 1 1 1 0 0 0
0 0 1 0 0 1 1 1 0 1 1 1
In this table, the header row contains the vocabulary of the corpus and the three rows below it correspond to the three
documents. Take a look at this table and analyse the positioning of the 0s and 1s in it.
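The same table can be reproduced with a few lines of plain Python; this is a sketch that assumes the pre-processed documents listed in Step 1 and builds the vocabulary in order of first appearance.

```python
documents = [
    ["aman", "and", "anil", "are", "stressed"],                    # Document 1
    ["aman", "went", "to", "a", "therapist"],                      # Document 2
    ["anil", "went", "to", "download", "a", "health", "chatbot"],  # Document 3
]

# Step 2: dictionary (vocabulary) of unique words, in order of first appearance
vocabulary = []
for doc in documents:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3-4: one document vector (word counts) per document
vectors = [[doc.count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for row in vectors:
    print(row)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
```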
Finally, this gives us the document vector table for our corpus. However, these raw counts treat every word as equally
important. This leads us to the final step of our algorithm: TFIDF.

TFIDF
TFIDF stands for Term Frequency & Inverse Document Frequency.
Term Frequency
Term Frequency: Term frequency is the frequency of a word in one document.
Term frequency can easily be read from the document vector table, since that table records the frequency of each word
of the vocabulary in each document.

Example of Term Frequency:


aman and anil are stressed went to a therapist download health chatbot

1 1 1 1 1 0 0 0 0 0 0 0

1 0 0 0 0 1 1 1 1 0 0 0

0 0 1 0 0 1 1 1 0 1 1 1
Here, we can see that the frequency of each word in each document has been recorded in the table. These numbers
are nothing but the Term Frequencies!
Inverse Document Frequency
To understand IDF (Inverse Document Frequency) we should understand DF (Document Frequency) first.
DF (Document Frequency)
Definition of Document Frequency (DF): Document Frequency is the number of documents in which the word occurs
irrespective of how many times it has occurred in those documents.

Example of Document Frequency:


aman and anil are stressed went to a therapist download health chatbot

2 1 2 1 1 2 2 2 1 1 1 1
From the table we can observe:
• The document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2, as each of them occurs in two documents.
• The rest occur in just one document, hence their document frequency is 1.

IDF (Inverse Document Frequency)


Definition of Inverse Document Frequency (IDF): For inverse document frequency, we put the document frequency in
the denominator and the total number of documents in the numerator, i.e. IDF of a word = (total number of documents) /
(document frequency of the word).

Example of Inverse Document Frequency:


aman and anil are stressed went to a therapist download health chatbot

3/2 3/1 3/2 3/1 3/1 3/2 3/2 3/2 3/1 3/1 3/1 3/1

Formula of TFIDF
The formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log( IDF(W) )

We do not need to calculate the log values by hand; we can simply use the log (base 10) function on a calculator.
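Applying this formula to the three-document example gives the values worked out by hand in the tables below. The sketch that follows is a minimal one; it assumes the document vectors built earlier and log base 10, which is what the tables use.

```python
import math

vocabulary = ['aman', 'and', 'anil', 'are', 'stressed', 'went',
              'to', 'a', 'therapist', 'download', 'health', 'chatbot']
vectors = [                                  # term frequencies per document
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]
n_docs = len(vectors)

for row in vectors:
    tfidf_row = []
    for j, _word in enumerate(vocabulary):
        tf = row[j]                                # term frequency in this document
        df = sum(1 for r in vectors if r[j] > 0)   # document frequency of the word
        idf = n_docs / df                          # IDF = total documents / DF
        tfidf_row.append(round(tf * math.log10(idf), 3))
    print(tfidf_row)
# [0.176, 0.477, 0.176, 0.477, 0.477, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# [0.176, 0.0, 0.0, 0.0, 0.0, 0.176, 0.176, 0.176, 0.477, 0.0, 0.0, 0.0]
# [0.0, 0.0, 0.176, 0.0, 0.0, 0.176, 0.176, 0.176, 0.0, 0.477, 0.477, 0.477]
```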
Example of TFIDF:

aman and anil are stressed went to a therapist download health chatbot

1*log(3/2) 1*log(3) 1*log(3/2) 1*log(3) 1*log(3) 0*log(3/2) 0*log(3/2) 0*log(3/2) 0*log(3) 0*log(3) 0*log(3) 0*log(3)

1*log(3/2) 0*log(3) 0*log(3/2) 0*log(3) 0*log(3) 1*log(3/2) 1*log(3/2) 1*log(3/2) 1*log(3) 0*log(3) 0*log(3) 0*log(3)

0*log(3/2) 0*log(3) 1*log(3/2) 0*log(3) 0*log(3) 1*log(3/2) 1*log(3/2) 1*log(3/2) 0*log(3) 1*log(3) 1*log(3) 1*log(3)

Here, we can see that the IDF value for ‘aman’ is the same in each row, and a similar pattern is followed for all the words
in the vocabulary. After calculating all the values, we get:

aman and anil are stressed went to a therapist download health chatbot

0.176 0.477 0.176 0.477 0.477 0 0 0 0 0 0 0

0.176 0 0 0 0 0.176 0.176 0.176 0.477 0 0 0

0 0 0.176 0 0 0.176 0.176 0.176 0 0.477 0.477 0.477


Finally, the words have been converted to numbers. These numbers are the values of each word in each document.
Here, we can see that since we have very little data, even words like ‘are’ and ‘and’ have a high value. But as a word
occurs in more and more documents, its IDF drops towards 1 and the value of that word decreases. For example:
• Total Number of documents: 10
• Number of documents in which ‘and’ occurs: 10
Therefore, IDF(and) = 10/10 = 1, and log(1) = 0. Hence, the value of ‘and’ becomes 0.
On the other hand, suppose the number of documents in which ‘pollution’ occurs is 3.
IDF(pollution) = 10/3 = 3.3333…
log(3.3333) = 0.522, which shows that the word ‘pollution’ has considerable value in the corpus.

Important concepts to remember:


1. Words that occur in all the documents with high term frequencies have the least values and are considered to be
stopwords.
2. For a word to have a high TFIDF value, it needs a high term frequency but a low document frequency, which shows
that the word is important for one document but is not a common word across all
documents.
3. These values help the computer understand which words are to be considered while processing the natural
language. The higher the value, the more important the word is for a given corpus.

Q: What are the applications of TFIDF?


Ans: TFIDF is commonly used in the Natural Language Processing domain. Some of its applications are:
1. Document Classification – Helps in classifying the type and genre of a document.
2. Topic Modelling – It helps in predicting the topic for a corpus.
3. Information Retrieval System – To extract the important information out of a corpus.
4. Stop word filtering – Helps in removing the unnecessary words from a text body.

*********************
