
NLP

Definition
• Natural Language Processing, or NLP, is the sub-field of AI that is
focused on enabling computers to understand and process human
languages.
• NLP is a subfield of Linguistics, Computer Science, Information
Engineering, and Artificial Intelligence concerned with the
interactions between computers and human (natural) languages, in
particular how to program computers to process and analyse large
amounts of natural language data.
• NLP is used to analyze text, allowing machines to understand how
humans speak. This human-computer interaction enables real-world
applications like automatic text summarization, sentiment
analysis, topic extraction, named entity recognition, parts-of-speech
tagging, relationship extraction, stemming, and more.
• NLP is commonly used for text mining, machine translation,
and automated question answering.
Applications of Natural Language Processing
• Automatic Summarization: Information overload is a real problem when
we need to access a specific, important piece of information from a
huge knowledge base.
• Automatic summarization is relevant not only for summarizing the
meaning of documents and information, but also to understand the
emotional meanings within the information, such as in collecting data
from social media.
• Automatic summarization is especially relevant when used to provide
an overview of a news item or blog post, while avoiding redundancy
from multiple sources and maximizing the diversity of content obtained.
Sentiment Analysis
The goal of sentiment analysis is to identify sentiment among several posts or
even in the same post where emotion is not always explicitly expressed.
Companies use Natural Language Processing applications, such as sentiment
analysis, to identify opinions and sentiment online to help them understand
what customers think about their products and services (i.e., “I love the new
iPhone” and, a few lines later “But sometimes it doesn’t work well” where
the person is still talking about the iPhone) and overall indicators of their
reputation.
Beyond determining simple polarity, sentiment analysis understands
sentiment in context to help better understand what’s behind an expressed
opinion, which can be extremely relevant in understanding and driving
purchasing decisions.
Text classification
• Text classification makes it possible to assign predefined categories to
a document and organize it to help you find the information you need
or simplify some activities. For example, an application of text
categorization is spam filtering in email.
Virtual Assistants:
• Nowadays Google Assistant, Cortana, Siri, Alexa, etc. have become an
integral part of our lives. Not only can we talk to them, but they also
have the ability to make our lives easier.
• By accessing our data, they can help us in keeping notes of our tasks,
make calls for us, send messages and a lot more.
• With the help of speech recognition, these assistants can not only
detect our speech but also make sense of it. According to recent
research, many more advancements are expected in this field in the
near future.
Project Cycle: Cognitive Behavioural Therapy (CBT)
The Scenario
• The world is competitive nowadays. People face competition in even the tiniest tasks
and are expected to give their best at every point in time. When people are unable
to meet these expectations, they get stressed and could even go into depression.
• We get to hear a lot of cases where people are depressed due to reasons like peer
pressure, studies, family issues, relationships, etc. and they eventually get into
something that is bad for them as well as for others.
• So, to overcome this, cognitive behavioural therapy (CBT) is considered one of
the best methods to address stress, as it is easy to implement and
gives good results.
• This therapy includes understanding the behaviour and mindset of a person in their
normal life. With the help of CBT, therapists help people overcome their stress and
live a happy life.
Problem Scoping
• CBT is a technique used by most therapists to help patients overcome
stress and depression. But it has been observed that people do not
willingly seek the help of a psychiatrist. They try to avoid such
interactions as much as possible. Thus, there is a need to bridge the
gap between a person who needs help and the psychiatrist. Let us
look at various factors around this problem through the 4Ws problem
canvas.
Data Acquisition
• To understand the sentiments of people, we need to collect their
conversational data so the machine can interpret the words that they
use and understand their meaning.
• Such data can be collected from various means:
• 1. Surveys
• 2. Onsite Observations
• 3. Databases available on the internet
• 4. Interviews, etc.
Data Processing
• Humans interact with each other very easily. For us, the natural languages that we use are so
convenient that we speak them easily and understand them well too. But for computers, our
languages are very complex. As you have already gone through some of the complications in
human languages above, now it is time to see how Natural Language Processing makes it possible
for the machines to understand and speak in the Natural Languages just like humans.
• Since we all know that the language of computers is Numerical, the very first step that comes to
our mind is to convert our language to numbers. This conversion takes a few steps to happen. The
first step to it is Text Normalisation.
• Since human languages are complex, we first need to simplify them so that the
machine can understand them. Text Normalisation helps in cleaning up the textual
data in such a way that its complexity comes down to a level lower than that of
the original data.
Data Exploration
• Once the textual data has been collected, it needs to be processed
and cleaned so that an easier version can be sent to the machine.
Thus, the text is normalised through various steps and reduced to a
minimum vocabulary, since the machine does not require
grammatically correct statements but only the essence of them.

Text Normalisation
• In Text Normalisation, we undergo several steps to normalise the text
to a lower level. Before we begin, we need to understand that in this
section, we will be working on a collection of written text. That is, we
will be working on text from multiple documents; the whole textual data
from all the documents taken together is known as the corpus. Not only
will we go through all the steps of Text Normalisation, we will also
work them out on a corpus.
SENTENCE SEGMENTATION
• Under sentence segmentation, the whole corpus is divided into
sentences. Each sentence is treated as a separate piece of data, so the
whole corpus is reduced to a list of sentences.
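As an illustration, here is a minimal sentence-segmentation sketch in Python. It uses a simple punctuation rule on a small sample corpus made up for this example; real projects usually rely on a library tokenizer such as NLTK's sent_tokenize.

```python
import re

corpus = ("Raj and Vijay are best friends. They play together "
          "with other friends. Raj likes to play football.")

# Split after '.', '!' or '?' when followed by whitespace.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]
print(sentences)
# ['Raj and Vijay are best friends.', 'They play together with other friends.',
#  'Raj likes to play football.']
```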
Tokenisation
• After segmenting the sentences, each sentence is further divided into
tokens. A token is any word, number, or special character occurring in a
sentence. Under tokenisation, every word, number, and special character
is considered separately, and each of them becomes a separate token.
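Below is a minimal tokenisation sketch in Python on a hypothetical sentence, using a regular expression so that punctuation marks become their own tokens; a library tokenizer such as nltk.word_tokenize would handle more edge cases.

```python
import re

sentence = "Raj likes to play football, Vijay prefers online games!"

# \w+ matches runs of letters/digits; [^\w\s] matches single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['Raj', 'likes', 'to', 'play', 'football', ',', 'Vijay', 'prefers',
#  'online', 'games', '!']
```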
Removal of Stop Words
• In this step, the tokens which are not necessary are removed from the
token list. What can be the possible words which we might not
require?
• Stopwords are words which occur very frequently in the corpus
but do not add any value to it. Humans use grammar to make their
sentences meaningful for the other person to understand, but
grammatical words do not add any essence to the information being
transmitted through the statement; hence they count as stopwords.
Some examples of stopwords are: a, an, and, the, is, are, to, of, in, etc.
These words occur the most in any given corpus but talk very little or nothing about the
context or the meaning of it. Hence, to make it easier for the computer to focus on
meaningful terms, these words are removed.
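A minimal stop-word removal sketch, assuming a small hand-made stopword list; libraries such as NLTK ship much larger lists (nltk.corpus.stopwords).

```python
# A small hand-made stopword list for illustration only.
stopwords = {"a", "an", "the", "and", "is", "are", "to", "of", "in", "with"}

tokens = ["raj", "and", "vijay", "are", "best", "friends"]

# Keep only the tokens that are not stopwords.
filtered = [t for t in tokens if t not in stopwords]
print(filtered)
# ['raj', 'vijay', 'best', 'friends']
```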
STEMMING
• In this step, the remaining words are reduced to their root words. In
other words, stemming is the process in which the affixes of words
are removed and the words are converted to their base form.
• Stemming algorithms work by cutting off the end or the beginning of
the word, taking into account a list of common prefixes and suffixes
that can be found in an inflected word.
Lemmatization is an organized, step-by-step procedure for
obtaining the root form of a word; it makes use of
morphological analysis (word structure and grammar
relations). In addition, it requires detailed
dictionaries which the algorithm can look through to form the
lemma.
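For illustration, here is a small sketch comparing a stemmer and a lemmatizer using NLTK (assuming NLTK is installed and its WordNet data has been downloaded via nltk.download('wordnet')); exact stems vary between stemming algorithms, so outputs may not match the examples in this text exactly.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "jumped", "studies", "mice", "happily"]

for w in words:
    # The lemmatizer treats words as nouns by default; pass pos="v" for verbs.
    print(w, "-> stem:", stemmer.stem(w), "| lemma:", lemmatizer.lemmatize(w))
```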
With this we have normalised our text to tokens which are the simplest
form of words present in the corpus.
Knowledge Check

1. What do you mean by corpus?

2. Does the vocabulary of a corpus remain the same before and after text normalization?
Why?

3. Enlist the Steps of text normalization

4. Normalize the given text

Raj and Vijay are best friends, they play together with other friends. Raj likes to play
football but Vijay prefers to play online games. Raj wants to be a footballer, Vijay wants
to become an online gamer.
•"Jumped" stems to "jump"
•"Running" stems to "run"
•"Swimming" stems to "swim"
2.Stemming of Nouns:
•"Cats" stems to "cat"
•"Houses" stems to "house"
•"Apples" stems to "appl"
3.Stemming of Adjectives:
•"Faster" stems to "fast"
•"Brightest" stems to "bright"
•"Happier" stems to "happi"
4.Stemming of Adverbs:
•"Quickly" stems to "quick"
•"Badly" stems to "bad"
5.Stemming of Suffixes:
•"Unhappiness" stems to "unhappi"
•"Friendly" stems to "friend“
•Stemming of Words ending with y
•Study stems to Studi
•Happy stems to Happi
•Copy stems to copi
•Journey stems to journei
6.Stemming of Irregular Words (note that stemming may not handle irregular words well):
•"Mice" stems to "mice" (ideally, it should be "mouse")
•"Men" stems to "men" (ideally, it should be "man")
Here is a list of common English suffixes:
1. -s or -es: Plural marker (e.g., cats, dogs).
2. -ed: Past tense marker (e.g., walked, played).
3. -ing: Present participle or gerund marker (e.g., running, swimming).
4. -er: Comparative form (e.g., faster, smarter).
5. -est: Superlative form (e.g., fastest, smartest).
6. -ly: Adverb marker (e.g., quickly, happily).
7. -ful: Full of, characterized by (e.g., beautiful, helpful).
8. -less: Without (e.g., fearless, powerless).
9. -ment: State or quality (e.g., government, excitement).
10. -tion or -sion: Action or process (e.g., celebration, decision).
11. -able or -ible: Capable of, fit for (e.g., comfortable, invisible).
12. -ity or -ty: State or quality (e.g., authenticity, responsibility).
13. -ize or -ise: Form a verb (e.g., organize, realize).
14. -al: Relating to, pertaining to (e.g., cultural, natural).
15. -ish: Having the quality of (e.g., childish, selfish).
16. -ous: Full of, characterized by (e.g., dangerous).
1. With this we have normalised our text to tokens which are the simplest form of words
present in the corpus.
2. Now it is time to convert the tokens into numbers. For this, we would use the Bag of
Words algorithm
3. Here calling this algorithm “bag” of words symbolises that the sequence of sentences
or tokens does not matter in this case as all we need are the unique words and their
frequency in it.
4. Bag of Words is a Natural Language Processing model which helps in
extracting features out of text; these features can be helpful in machine learning
algorithms. In bag of words, we get the occurrences of each word and the
vocabulary for the corpus.
• Let us assume that we have a normalised corpus obtained after going through
all the steps of text processing. Now, as we put this text into the bag of
words algorithm, the algorithm returns the unique words out of the corpus and
their occurrences in it.
• The output is a list of words appearing in the corpus, and the number
corresponding to each word shows how many times that word has occurred in
the text body.
• Thus, we can say that the bag of words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (number of times it has occurred in
the whole corpus).
The step-by-step approach to implement bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it
2. Create Dictionary: Make a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the
word from the unique list of words has occurred.
4. Create document vectors for all the documents.
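As a sketch of these four steps, here is a minimal bag-of-words implementation in Python. The three documents are hypothetical, chosen so that the first one contains the words aman, and, anil, are, stressed mentioned in Step 3 below.

```python
# Hypothetical example documents; document 1 contains the words used in Step 3.
documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

# Step 1: text normalisation (here just lower-casing and splitting into tokens)
tokenised = [doc.lower().split() for doc in documents]

# Step 2: create the dictionary of unique words (the vocabulary)
vocabulary = sorted({word for doc in tokenised for word in doc})

# Steps 3 and 4: create a document vector for every document
vectors = [[doc.count(word) for word in vocabulary] for doc in tokenised]

print(vocabulary)
for row in vectors:
    print(row)
```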
Step 2: Create Dictionary
Go through all the steps and create a dictionary, i.e., list down all the
words which occur in all three documents:
Dictionary:

Note that even though some words are repeated in different documents, they are all written
just once, since while creating the dictionary we list only the unique words.
Step 3: Create document vector
In this step, the vocabulary is written in the top row.
Now, for each word in the document, if it matches with the vocabulary, put a 1 under it.
If the same word appears again, increment the previous value by 1.
And if the word does not occur in that document, put a 0 under it.
In the first document, we have the words: aman, and, anil, are, stressed. So, all
these words get a value of 1 and the rest of the words get a value of 0.
Step 4: Repeat for all documents (the same exercise has to be done for all the documents). Hence, the
table becomes:

In this table, the header row contains the vocabulary of the corpus and three rows correspond
to three different documents. Take a look at this table and analyse the positioning of 0s and 1s in
it. Finally, this gives us the document vector table for our corpus.
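The same document vector table can also be produced with a library. Below is a sketch using scikit-learn's CountVectorizer (assuming scikit-learn is installed), applied to the same hypothetical documents; note that by default CountVectorizer drops single-character tokens such as "a", so its vocabulary may differ slightly from a hand-built one.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

vectorizer = CountVectorizer()          # lower-cases and tokenises by default
matrix = vectorizer.fit_transform(documents)

# get_feature_names() in older scikit-learn versions
print(vectorizer.get_feature_names_out())  # the vocabulary (header row)
print(matrix.toarray())                    # one row of counts per document
```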
Inverse Document Frequency
Now, let us look at the other half of TF-IDF, which is Inverse Document Frequency. For this, let us first understand what
document frequency means. Document Frequency is the number of documents in which a word occurs, irrespective of how
many times it has occurred in those documents. Inverse Document Frequency is then obtained by dividing the total number of
documents by the document frequency of each word, usually with a logarithm applied to the ratio, as in the sketch below.
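A minimal sketch of computing document frequency and inverse document frequency for the same hypothetical documents, using the common formulation idf = log(N / df); other formulations (for example with smoothing) are also used in practice.

```python
import math

documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

tokenised = [doc.lower().split() for doc in documents]
vocabulary = sorted({word for doc in tokenised for word in doc})

n_docs = len(documents)
for word in vocabulary:
    # Document frequency: in how many documents does the word appear?
    df = sum(1 for doc in tokenised if word in doc)
    # Inverse document frequency, one common formulation.
    idf = math.log(n_docs / df)
    print(f"{word}: df={df}, idf={idf:.2f}")
```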
Modelling
• Once the text has been normalised, it is then fed to an NLP based AI
model.
• Note that in NLP, the data must first be pre-processed; only then is it
fed to the machine for modelling. Depending upon the type of
chatbot we are trying to make, there are many AI models available which
help us build the foundation of our project.
Evaluation
• The trained model is then evaluated, and its accuracy is measured
on the basis of the relevance of the answers which the
machine gives to the user's responses. To understand the efficiency of
the model, the answers suggested by the chatbot are compared to
the actual answers.
