
NLP

Definition
• Natural Language Processing, or NLP, is the sub-field of AI that is
focused on enabling computers to understand and process human
languages.
• NLP is a subfield of Linguistics, Computer Science, Information
Engineering, and Artificial Intelligence concerned with the
interactions between computers and human (natural) languages, in
particular how to program computers to process and analyse large
amounts of natural language data.
• NLP is used to analyze text, allowing machines to understand how
humans speak. This human-computer interaction enables real-world
applications like automatic text summarization, sentiment
analysis, topic extraction, named entity recognition, parts-of-speech
tagging, relationship extraction, stemming, and more.
• NLP is commonly used for text mining, machine translation,
and automated question answering.
Applications of Natural Language Processing
• Automatic Summarization: Information overload is a real problem when
we need to access a specific, important piece of information from a
huge knowledge base.
• Automatic summarization is relevant not only for summarizing the
meaning of documents and information, but also to understand the
emotional meanings within the information, such as in collecting data
from social media.
• Automatic summarization is especially relevant when used to provide
an overview of a news item or blog post, while avoiding redundancy
from multiple sources and maximizing the diversity of content obtained.
Sentiment Analysis
The goal of sentiment analysis is to identify sentiment among several posts or
even in the same post where emotion is not always explicitly expressed.
Companies use Natural Language Processing applications, such as sentiment
analysis, to identify opinions and sentiment online to help them understand
what customers think about their products and services (i.e., “I love the new
iPhone” and, a few lines later “But sometimes it doesn’t work well” where
the person is still talking about the iPhone) and overall indicators of their
reputation.
Beyond determining simple polarity, sentiment analysis understands
sentiment in context to help better understand what’s behind an expressed
opinion, which can be extremely relevant in understanding and driving
purchasing decisions.
Text classification
• Text classification makes it possible to assign predefined categories to
a document and organize it to help you find the information you need
or simplify some activities. For example, an application of text
categorization is spam filtering in email.
Virtual Assistants:
• Nowadays Google Assistant, Cortana, Siri, Alexa, etc. have become an
integral part of our lives. Not only can we talk to them, but they also
have the ability to make our lives easier.
• By accessing our data, they can help us in keeping notes of our tasks,
make calls for us, send messages and a lot more.
• With the help of speech recognition, these assistants can not only
detect our speech but also make sense of it. According to recent
research, many more advancements are expected in this field in the
near future.
Project Cycle: Cognitive Behavioural Therapy (CBT)
The Scenario
• The world is competitive nowadays. People face competition in even the tiniest tasks
and are expected to give their best at every point in time. When people are unable
to meet these expectations, they get stressed and could even go into depression.
• We get to hear a lot of cases where people are depressed due to reasons like peer
pressure, studies, family issues, relationships, etc. and they eventually get into
something that is bad for them as well as for others.
• So, to overcome this, cognitive behavioural therapy (CBT) is considered one of
the best methods to address stress, as it is easy to implement and
gives good results.
• This therapy includes understanding the behaviour and mindset of a person in their
normal life. With the help of CBT, therapists help people overcome their stress and
live a happy life.
Problem Scoping
• CBT is a technique used by most therapists to help patients overcome
stress and depression. But it has been observed that people do not
willingly seek the help of a psychiatrist. They try to avoid such
interactions as much as possible. Thus, there is a need to bridge the
gap between a person who needs help and the psychiatrist. Let us
look at various factors around this problem through the 4Ws problem
canvas.
Data Acquisition
• To understand the sentiments of people, we need to collect their
conversational data so the machine can interpret the words that they
use and understand their meaning.
• Such data can be collected from various means:
• 1. Surveys
• 2. Onsite Observations
• 3. Databases available on the internet
• 4. Interviews, etc.
Data Processing
• Humans interact with each other very easily. For us, the natural languages that we use are so
convenient that we speak them easily and understand them well too. But for computers, our
languages are very complex. As you have already gone through some of the complications in
human languages above, now it is time to see how Natural Language Processing makes it possible
for the machines to understand and speak in the Natural Languages just like humans.
• Since we all know that the language of computers is Numerical, the very first step that comes to
our mind is to convert our language to numbers. This conversion takes a few steps to happen. The
first step to it is Text Normalisation.
• Since human languages are complex, we first need to simplify them so that the
machine can understand them. Text Normalisation helps in cleaning up the textual
data in such a way that its complexity comes down to a level lower than that of
the original data.
Data Exploration
• Once the textual data has been collected, it needs to be processed
and cleaned so that an easier version can be sent to the machine.
Thus, the text is normalised through various steps and reduced to a
minimum vocabulary, since the machine does not require
grammatically correct statements but only the essence of them.

Text Normalisation
• In Text Normalisation, we undergo several steps to normalise the text
to a lower level. Before we begin, we need to understand that in this
section, we will be working on a collection of written text. That is, we
will be working on text from multiple documents; the whole textual data
from all the documents taken together is known as the corpus. Not only
will we go through all the steps of Text Normalisation, we will also
work them out on a corpus.
SENTENCE SEGMENTATION
• Under sentence segmentation, the whole corpus is divided into
sentences. Each sentence is treated as a separate piece of data, so the
whole corpus is reduced to a list of sentences.
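As an illustration, here is a minimal sentence-segmentation sketch in Python. It uses a simple punctuation rule on a small sample corpus made up for this example; real projects usually rely on a library tokenizer such as NLTK's sent_tokenize.

```python
import re

corpus = ("Raj and Vijay are best friends. They play together "
          "with other friends. Raj likes to play football.")

# Split after '.', '!' or '?' when followed by whitespace.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]
print(sentences)
# ['Raj and Vijay are best friends.', 'They play together with other friends.',
#  'Raj likes to play football.']
```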
Tokenisation
• After segmenting the sentences, each sentence is further divided into
tokens. A token is any word, number, or special character occurring in a
sentence. Under tokenisation, every word, number, and special character
is considered separately, and each of them becomes a separate token.
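Below is a minimal tokenisation sketch in Python on a hypothetical sentence, using a regular expression so that punctuation marks become their own tokens; a library tokenizer such as nltk.word_tokenize would handle more edge cases.

```python
import re

sentence = "Raj likes to play football, Vijay prefers online games!"

# \w+ matches runs of letters/digits; [^\w\s] matches single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['Raj', 'likes', 'to', 'play', 'football', ',', 'Vijay', 'prefers',
#  'online', 'games', '!']
```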
Removal of Stop Words
• In this step, the tokens which are not necessary are removed from the
token list. What can be the possible words which we might not
require?
• Stopwords are words which occur very frequently in the corpus
but do not add any value to it. Humans use grammar to make their
sentences meaningful for the other person to understand, but
grammatical words do not add any essence to the information being
transmitted through the statement; hence they count as stopwords.
Some examples of stopwords are: a, an, and, the, is, are, to, of, in, etc.
These words occur the most in any given corpus but talk very little or nothing about the
context or the meaning of it. Hence, to make it easier for the computer to focus on
meaningful terms, these words are removed.
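A minimal stop-word removal sketch, assuming a small hand-made stopword list; libraries such as NLTK ship much larger lists (nltk.corpus.stopwords).

```python
# A small hand-made stopword list for illustration only.
stopwords = {"a", "an", "the", "and", "is", "are", "to", "of", "in", "with"}

tokens = ["raj", "and", "vijay", "are", "best", "friends"]

# Keep only the tokens that are not stopwords.
filtered = [t for t in tokens if t not in stopwords]
print(filtered)
# ['raj', 'vijay', 'best', 'friends']
```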
STEMMING
• In this step, the remaining words are reduced to their root words. In
other words, stemming is the process in which the affixes of words
are removed and the words are converted to their base form.
• Stemming algorithms work by cutting off the end or the beginning of
the word, taking into account a list of common prefixes and suffixes
that can be found in an inflected word.
Lemmatization is an organized, step-by-step procedure for
obtaining the root form of a word; it makes use of
morphological analysis (word structure and grammar
relations). In addition, it requires detailed
dictionaries which the algorithm can look through to form the
lemma.
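For illustration, here is a small sketch comparing a stemmer and a lemmatizer using NLTK (assuming NLTK is installed and its WordNet data has been downloaded via nltk.download('wordnet')); exact stems vary between stemming algorithms, so outputs may not match the examples in this text exactly.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "jumped", "studies", "mice", "happily"]

for w in words:
    # The lemmatizer treats words as nouns by default; pass pos="v" for verbs.
    print(w, "-> stem:", stemmer.stem(w), "| lemma:", lemmatizer.lemmatize(w))
```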
With this we have normalised our text to tokens which are the simplest
form of words present in the corpus.
Knowledge Check

1. What do you mean by corpus?

2. Does the vocabulary of a corpus remain the same before and after text normalization?
Why?

3. Enlist the Steps of text normalization

4. Normalize the given text

Raj and Vijay are best friends, they play together with other friends. Raj likes to play
football but Vijay prefers to play online games. Raj wants to be a footballer, Vijay wants
to become an online gamer.
•"Jumped" stems to "jump"
•"Running" stems to "run"
•"Swimming" stems to "swim"
2.Stemming of Nouns:
•"Cats" stems to "cat"
•"Houses" stems to "house"
•"Apples" stems to "appl"
3.Stemming of Adjectives:
•"Faster" stems to "fast"
•"Brightest" stems to "bright"
•"Happier" stems to "happi"
4.Stemming of Adverbs:
•"Quickly" stems to "quick"
•"Badly" stems to "bad"
5.Stemming of Suffixes:
•"Unhappiness" stems to "unhappi"
•"Friendly" stems to "friend“
•Stemming of Words ending with y
•Study stems to Studi
•Happy stems to Happi
•Copy stems to copi
•Journey stems to journei
6.Stemming of Irregular Words (note that stemming may not handle irregular words well):
•"Mice" stems to "mice" (ideally, it should be "mouse")
•"Men" stems to "men" (ideally, it should be "man")
Here is a list of common English suffixes:
1. -s or -es: Plural marker (e.g., cats, dogs).
2. -ed: Past tense marker (e.g., walked, played).
3. -ing: Present participle or gerund marker (e.g., running, swimming).
4. -er: Comparative form (e.g., faster, smarter).
5. -est: Superlative form (e.g., fastest, smartest).
6. -ly: Adverb marker (e.g., quickly, happily).
7. -ful: Full of, characterized by (e.g., beautiful, helpful).
8. -less: Without (e.g., fearless, powerless).
9. -ment: State or quality (e.g., government, excitement).
10. -tion or -sion: Action or process (e.g., celebration, decision).
11. -able or -ible: Capable of, fit for (e.g., comfortable, invisible).
12. -ity or -ty: State or quality (e.g., authenticity, responsibility).
13. -ize or -ise: Form a verb (e.g., organize, realize).
14. -al: Relating to, pertaining to (e.g., cultural, natural).
15. -ish: Having the quality of (e.g., childish, selfish).
16. -ous: Full of, characterized by (e.g., dangerous).
1. With this we have normalised our text to tokens which are the simplest form of words
present in the corpus.
2. Now it is time to convert the tokens into numbers. For this, we would use the Bag of
Words algorithm
3. Here calling this algorithm “bag” of words symbolises that the sequence of sentences
or tokens does not matter in this case as all we need are the unique words and their
frequency in it.
4. Bag of Words is a Natural Language Processing model which helps in
extracting features out of text; these features can be helpful in machine learning
algorithms. In bag of words, we get the occurrences of each word and the
vocabulary for the corpus.
• Let us assume that we have a normalised corpus obtained after going through
all the steps of text processing. Now, as we put this text into the bag of
words algorithm, the algorithm returns the unique words out of the corpus and
their occurrences in it.
• The output is a list of words appearing in the corpus, and the number
corresponding to each word shows how many times that word has occurred in
the text body.
• Thus, we can say that the bag of words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (number of times it has occurred in
the whole corpus).
The step-by-step approach to implement bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it
2. Create Dictionary: Make a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the
word from the unique list of words has occurred.
4. Create document vectors for all the documents.
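As a sketch of these four steps, here is a minimal bag-of-words implementation in Python. The three documents are hypothetical, chosen so that the first one contains the words aman, and, anil, are, stressed mentioned in Step 3 below.

```python
# Hypothetical example documents; document 1 contains the words used in Step 3.
documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

# Step 1: text normalisation (here just lower-casing and splitting into tokens)
tokenised = [doc.lower().split() for doc in documents]

# Step 2: create the dictionary of unique words (the vocabulary)
vocabulary = sorted({word for doc in tokenised for word in doc})

# Steps 3 and 4: create a document vector for every document
vectors = [[doc.count(word) for word in vocabulary] for doc in tokenised]

print(vocabulary)
for row in vectors:
    print(row)
```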
Step 2: Create Dictionary
Go through all the steps and create a dictionary, i.e., list down all the
words which occur in all three documents:
Dictionary:

Note that even though some words are repeated in different documents, they are all written
just once, since while creating the dictionary we list only the unique words.
Step 3: Create document vector
In this step, the vocabulary is written in the top row.
Now, for each word in the document, if it matches with the vocabulary, put a 1 under it.
If the same word appears again, increment the previous value by 1.
And if the word does not occur in that document, put a 0 under it.
In the first document, we have the words: aman, and, anil, are, stressed. So, all
these words get a value of 1 and the rest of the words get a value of 0.
Step 4: Repeat for all documents (the same exercise has to be done for all the documents). Hence, the
table becomes:

In this table, the header row contains the vocabulary of the corpus and three rows correspond
to three different documents. Take a look at this table and analyse the positioning of 0s and 1s in
it. Finally, this gives us the document vector table for our corpus.
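The same document vector table can also be produced with a library. Below is a sketch using scikit-learn's CountVectorizer (assuming scikit-learn is installed), applied to the same hypothetical documents; note that by default CountVectorizer drops single-character tokens such as "a", so its vocabulary may differ slightly from a hand-built one.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

vectorizer = CountVectorizer()          # lower-cases and tokenises by default
matrix = vectorizer.fit_transform(documents)

# get_feature_names() in older scikit-learn versions
print(vectorizer.get_feature_names_out())  # the vocabulary (header row)
print(matrix.toarray())                    # one row of counts per document
```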
Inverse Document Frequency
Now, let us look at the other half of TF-IDF, which is Inverse Document Frequency. For this, let us first understand what
document frequency means. Document Frequency is the number of documents in which a word occurs, irrespective of how
many times it has occurred in those documents. Inverse Document Frequency is then obtained by dividing the total number of
documents by the document frequency of each word, usually with a logarithm applied to the ratio, as in the sketch below.
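A minimal sketch of computing document frequency and inverse document frequency for the same hypothetical documents, using the common formulation idf = log(N / df); other formulations (for example with smoothing) are also used in practice.

```python
import math

documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

tokenised = [doc.lower().split() for doc in documents]
vocabulary = sorted({word for doc in tokenised for word in doc})

n_docs = len(documents)
for word in vocabulary:
    # Document frequency: in how many documents does the word appear?
    df = sum(1 for doc in tokenised if word in doc)
    # Inverse document frequency, one common formulation.
    idf = math.log(n_docs / df)
    print(f"{word}: df={df}, idf={idf:.2f}")
```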
Modelling
• Once the text has been normalised, it is then fed to an NLP based AI
model.
• Note that in NLP, the data must first be pre-processed; only then is it
fed to the machine for modelling. Depending upon the type of
chatbot we are trying to make, there are many AI models available which
help us build the foundation of our project.
Evaluation
• The trained model is then evaluated, and its accuracy is measured
on the basis of the relevance of the answers which the
machine gives to the user's responses. To understand the efficiency of
the model, the answers suggested by the chatbot are compared to
the actual answers.
