Natural Language Processing (NLP)
Lecture 1: Introduction to Natural Language Processing
By Dr Javed Iqbal
Course Policy
• Assignments 10 %
• Quizzes 10 %
• Mid exam 20 %
• Course Project 20 % **
• End Exam 40 %
** Research Project (Development/Implementation, Article Writing, Presentation)
Introduction
According to industry estimates, more than 80% of the data
being generated is unstructured, for example
text (natural language), images, audio, and video.
Introduction
Data is getting generated as we
• Speak
• Write
• Tweet
• Use Social Media platforms
• Send messages on various messaging platforms
• Use e-commerce for shopping
• and so on
Introduction
The majority of this data exists in
textual (natural language) form.
Unstructured data
Unstructured data is the information that doesn't
reside in a traditional relational database.
Examples include
• Documents, blogs, social media feeds, pictures, and
videos
Why Analyze Unstructured Data?
• Most of the insight is locked in unstructured data.
• Text is the most common form and accounts for more than 50% of
unstructured data.
• Unlocking it helps every organization make better decisions.
Natural Language Processing
To produce significant and actionable insights
from natural language (text data), we use Natural
Language Processing coupled with machine learning
and deep learning.
Natural Language Processing
Definition
What is Natural Language Processing (NLP)?
Machines and algorithms cannot work directly with raw text or
characters, so text data must first be converted into a
machine-readable format (such as numbers or binary) before any kind
of analysis can be performed on it.
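One simple way to turn text into numbers is a bag-of-words count vector. The sketch below is only an illustration under simple assumptions (whitespace splitting, lowercase matching); the function names and the two sample sentences are made up for this example.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct lowercase word to an integer index (sorted for stable output)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(document, vocabulary):
    """Turn one document into a list of word counts, one slot per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

# Illustrative documents (not from any real dataset).
documents = ["the dog barked", "the dog saw the cat"]
vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(doc, vocabulary) for doc in documents]

print(vocabulary)  # {'barked': 0, 'cat': 1, 'dog': 2, 'saw': 3, 'the': 4}
print(vectors)     # [[1, 0, 1, 0, 1], [0, 1, 1, 1, 2]]
```

Each document becomes a fixed-length list of numbers that a machine learning algorithm can consume.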
Natural Language Processing
Definition
Natural language processing is an area of research in
computer science and artificial intelligence (AI) concerned
with processing natural languages such as English
Components of NLP
• NLP has two components:
Natural Language Understanding (NLU)
• Mapping the given natural language input into useful representations.
• Analyzing different aspects of the language.
Natural Language Generation (NLG)
• The process of producing meaningful phrases and sentences in natural
language from some internal representation.
• It involves:
Text planning − retrieving the relevant content from the knowledge base.
Sentence planning − choosing the required words, forming meaningful
phrases, and setting the tone of the sentence.
Text realization − mapping the sentence plan into sentence structure.
• NLU is generally harder than NLG.
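The three NLG stages can be sketched as three small functions chained together. This is a toy, template-based illustration only: the knowledge base, its field names, and the weather facts are all hypothetical, and real NLG systems are far more elaborate.

```python
# Hypothetical knowledge base: the facts and field names are made up for illustration.
KNOWLEDGE_BASE = {
    "Lahore": {"temperature_c": 31, "condition": "sunny"},
    "Murree": {"temperature_c": 12, "condition": "rainy"},
}

def plan_text(city):
    """Text planning: retrieve the relevant content from the knowledge base."""
    return {"city": city, **KNOWLEDGE_BASE[city]}

def plan_sentence(content):
    """Sentence planning: choose words and set the tone for the retrieved content."""
    tone_word = "pleasant" if content["condition"] == "sunny" else "gloomy"
    return {
        "subject": f"the weather in {content['city']}",
        "description": f"{tone_word} and {content['condition']}",
        "detail": f"a temperature of {content['temperature_c']} degrees Celsius",
    }

def realize(plan):
    """Text realization: map the sentence plan onto a concrete sentence structure."""
    sentence = f"{plan['subject']} is {plan['description']}, with {plan['detail']}."
    return sentence[0].upper() + sentence[1:]

def generate(city):
    return realize(plan_sentence(plan_text(city)))

print(generate("Lahore"))
# The weather in Lahore is pleasant and sunny, with a temperature of 31 degrees Celsius.
```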
Natural Language Processing
Goal
The goal of NLP is to make machines understand our
spoken and written languages.
Recent examples include conversational and voice-driven systems such as
ChatGPT, Siri, Alexa, and Google Assistant.
What we Learn
• You will learn how to use a wide range of NLP packages efficiently and
implement
• text classification,
• part-of-speech identification,
• topic modeling,
• text summarization,
• text generation,
• sentiment analysis,
• and many more applications of NLP
What we Learn
• ways of extracting text data along with web scraping
• how to clean and preprocess text data and ways to analyze
• explore the semantic as well as syntactic analysis of the text
• text normalization,
• advanced preprocessing methods,
• POS tagging,
• text similarity,
• text summarization,
• sentiment analysis,
• topic modeling,
• word2vec, seq2seq
What we Learn
Most Important for Implementation
• Working in Python with NLP packages/libraries,
e.g. NLTK, TextBlob, spaCy, Gensim, Stanford CoreNLP
• Implementing text preprocessing and feature engineering,
e.g. word embeddings
• Working with different datasets/corpora
• Implementing an end-to-end pipeline of the NLP life cycle, which includes
• framing the problem,
• finding the data,
• collecting the data,
• preprocessing the data,
• solving the problem using state-of-the-art techniques.
Natural language pipeline
A natural language processing system is often referred to as a
pipeline.
Major Steps in Text (Natural
Language) Analysis
• Data collection
• Text Preprocessing
• Text to feature
• Machine learning / Deep learning
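The four steps above can be sketched end to end in plain Python. This is a minimal illustration, assuming a tiny hand-written dataset and a from-scratch multinomial Naive Bayes classifier standing in for the "machine learning" step; the data, labels, and function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Step 1: data collection (a tiny hand-written dataset, purely illustrative).
training_data = [
    ("Win a free prize now", "spam"),
    ("Free money, claim your prize", "spam"),
    ("Meeting moved to Monday morning", "ham"),
    ("Please review the project report", "ham"),
]

# Step 2: text preprocessing (lowercase and keep alphabetic tokens only).
def preprocess(text):
    return re.findall(r"[a-z]+", text.lower())

# Step 3: text to features (bag-of-words counts per class).
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in training_data:
    class_counts[label] += 1
    word_counts[label].update(preprocess(text))
vocabulary = {word for counts in word_counts.values() for word in counts}

# Step 4: machine learning (multinomial Naive Bayes with add-one smoothing).
def predict(text):
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in preprocess(text):
            if word in vocabulary:  # ignore words never seen in training
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("claim your free prize"))      # spam
print(predict("project meeting on Monday"))  # ham
```

In practice each step would be handled by a library and far more data, but the order of the stages stays the same.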
Data Source Freely Available
A huge amount of data is freely available over the internet
• Start exploring multiple free data sources
• Free APIs from Twitter, Facebook, Amazon, etc.
• Wikipedia
• Government data (e.g. [Link])
• Census data (e.g. [Link])
• Health care claim data (e.g. [Link])
• A linked Word file contains links to and descriptions of different datasets
Other Data Source
Client Data (Own data that is already present)
• SQL databases
• Hadoop clusters
• Cloud storage
• Flat files
Web scraping
• Extracting content/data from websites, blogs, forums, and retail
websites (for reviews) using web scraping packages, with permission
from the respective sources
Many other sources exist, such as crime data, accident data, and economic data
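As a rough sketch of what web scraping involves, the example below extracts review text from an HTML page using only Python's standard `html.parser` module. The page content and the `review` class name are hypothetical; a real scraper would first download the page, with permission from the source.

```python
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Collect the text inside every <p class="review"> element."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "p" and ("class", "review") in attrs:
            self.in_review = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_review = False

    def handle_data(self, data):
        if self.in_review:
            self.reviews[-1] += data

# Hypothetical page content, hard-coded so the example needs no network access.
page = """
<html><body>
  <h1>Product reviews</h1>
  <p class="review">Great battery life.</p>
  <p class="note">Sponsored</p>
  <p class="review">Screen is <b>too dim</b> outdoors.</p>
</body></html>
"""

extractor = ReviewExtractor()
extractor.feed(page)
print(extractor.reviews)
# ['Great battery life.', 'Screen is too dim outdoors.']
```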
Linguistics Terminologies
• Phonetics and Phonology: The study of language sounds
• Ecology: The study of language conventions for punctuation, text mark-up and encoding
• Morphology: The study of meaningful components of words
• Syntax: The study of structural relationships among words
• Lexical semantics: The study of word meaning
• Compositional semantics: The study of the meaning of sentences
• Pragmatics: The study of the use of language to accomplish goals
• Discourse conventions: The study of conventions of dialogue
Steps or Stages in NLP
• Lexical Analysis − It involves identifying and analyzing the
structure of words. Lexical analysis is dividing the whole
chunk of text into paragraphs, sentences, and words.
• Syntactic Analysis (Parsing) − It involves analyzing the
words in a sentence for grammar and arranging them in a
manner that shows the relationships among the words. A
sentence such as “The school goes to boy” is rejected by
an English syntactic analyzer.
• Semantic Analysis − It draws the exact meaning or the
dictionary meaning from the text. The semantic analyzer
disregards phrases such as “hot ice-cream”.
Steps or Stages in NLP
• Discourse Integration − The meaning of any sentence
depends upon the meaning of the sentence just before
it. In addition, it also influences the meaning of
the sentence that immediately follows.
• Pragmatic Analysis − During this stage, what was said is
reinterpreted in terms of what was actually meant. It involves
deriving those aspects of language which require real
world knowledge.
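The first stage, lexical analysis, can be illustrated with a few regular expressions that split text into paragraphs, sentences, and words. This is a naive sketch: the splitting rules are simplifying assumptions (for instance, abbreviations such as "Dr." would be split incorrectly), and the sample text is made up.

```python
import re

def lexical_analysis(text):
    """Split text into paragraphs, then sentences, then words (nested lists)."""
    result = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # A sentence is assumed to end at '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        result.append([re.findall(r"[A-Za-z']+", s) for s in sentences])
    return result

text = "The boy goes to school. He likes it.\n\nSchool starts at eight."
for paragraph in lexical_analysis(text):
    print(paragraph)
# [['The', 'boy', 'goes', 'to', 'school'], ['He', 'likes', 'it']]
# [['School', 'starts', 'at', 'eight']]
```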
What Is a Corpus?
• A corpus (plural: corpora) is a collection of related documents that
contain natural language.
• A corpus may be composed of written language, spoken language,
or both. A spoken corpus is usually in the form of audio recordings.
• A corpus can be large or small, though corpora generally consist of
dozens or even hundreds of gigabytes of data across thousands of
documents.
• Some popular corpora are the British National Corpus (BNC), the
COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus.
Monolingual and Bilingual Corpora
• Monolingual corpora represent only one language, while bilingual
corpora represent two languages. Corpora covering more than two
languages are multilingual.
For example
• The European Corpus Initiative (ECI) corpus is multilingual, with 98
million words in Turkish, Japanese, Russian, Chinese, and other
languages.
Open or closed Corpus
• An open corpus is one which does not claim to contain all
data from a specific area while a closed corpus does claim to
contain all or nearly all data from a particular field.
• Historical corpora, for example, are closed, as no further
data can be added to the area they cover.
Taxonomy
NLP tasks first fall into two broad categories:
1. Analysis (analyzing existing text)
2. Generation (generating new text)
Analysis is then divided into three categories: syntactic
(language structure-based tasks), semantic (meaning-based tasks), and
pragmatic (open problems that are difficult to solve).
[Figure: Hierarchical taxonomy of different NLP tasks]
Web Link for further details
Most ubiquitous NLP tasks
Tokenization
Tokenization is the task of separating a text corpus into atomic units
(for example, words).
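A minimal tokenizer can be written with one regular expression. The sketch below treats words (including simple contractions) and punctuation marks as separate tokens; the pattern is an illustrative assumption, not how any particular library tokenizes.

```python
import re

def tokenize(text):
    """Return word tokens and punctuation marks as separate atomic units."""
    # \w+(?:'\w+)? matches a word, optionally with an apostrophe part ("isn't");
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize("Tree bark isn't food, is it?"))
# ['Tree', 'bark', "isn't", 'food', ',', 'is', 'it', '?']
```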
Word-sense Disambiguation (WSD):
• WSD is the task of identifying the correct meaning of a word.
For example, in the sentences
• The dog barked at the mailman
and
• Tree bark is sometimes used as a medicine
the word bark has two different meanings. WSD is critical for tasks such
as question answering.
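One classic idea for WSD is to pick the sense whose dictionary gloss shares the most words with the surrounding sentence (a simplified form of the Lesk approach). The sketch below assumes a tiny hand-written sense inventory; the glosses and stop-word list are invented for this example and are not taken from a real dictionary such as WordNet.

```python
# Hypothetical sense inventory: glosses written for this example only.
SENSES = {
    "bark": {
        "sound made by a dog": "the loud cry of a dog",
        "outer layer of a tree": "the tough outer covering of a tree trunk, "
                                 "used in traditional medicine",
    }
}
STOP_WORDS = {"the", "a", "an", "of", "at", "is", "as", "in", "used", "sometimes"}

def content_words(text):
    """Lowercase the words, strip trailing punctuation, and drop stop words."""
    return {w.strip(".,").lower() for w in text.split()} - STOP_WORDS

def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence) - {word}
    overlaps = {
        sense: len(context & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    return max(overlaps, key=overlaps.get)

print(disambiguate("bark", "The dog barked at the mailman"))
# sound made by a dog
print(disambiguate("bark", "Tree bark is sometimes used as a medicine"))
# outer layer of a tree
```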
Named Entity Recognition (NER):
• NER attempts to extract entities (for example, person, location, and organization) from a
given body of text or a text corpus.
For example, the sentence
• John gave Mary two apples at school on Monday
will be transformed to
• [John]name gave [Mary]name [two]number apples at [school]organization on
[Monday]time
NER is an important topic in fields such as information retrieval and knowledge representation.
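The bracketed output format used in the example can be reproduced with a simple lookup table (a gazetteer). This is only a toy sketch: the gazetteer below is hypothetical and covers just this one sentence, whereas practical NER systems learn to recognize entities from context.

```python
# Hypothetical gazetteer: a hand-made lookup table covering only this example.
GAZETTEER = {
    "john": "name",
    "mary": "name",
    "two": "number",
    "school": "organization",
    "monday": "time",
}

def tag_entities(sentence):
    """Wrap every token found in the gazetteer as [token]label."""
    tagged = []
    for token in sentence.split():
        label = GAZETTEER.get(token.lower())
        tagged.append(f"[{token}]{label}" if label else token)
    return " ".join(tagged)

print(tag_entities("John gave Mary two apples at school on Monday"))
# [John]name gave [Mary]name [two]number apples at [school]organization on [Monday]time
```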
Part-of-Speech (PoS) tagging
• PoS tagging is the task of assigning words to their respective parts of
speech.
The tag set can be either
• basic, with tags such as noun, verb, adjective, adverb, and
preposition
OR
• granular, with tags such as proper noun, common noun, phrasal verb,
verb, and so on.
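The difference between a granular and a basic tag set can be shown by mapping one onto the other. In the sketch below the granular tags follow the Penn Treebank naming convention, the sentence is tagged by hand (not by a trained tagger), and the coarse mapping is a simplifying assumption.

```python
# Granular tags follow the Penn Treebank convention; the mapping to basic
# tags is a simplification chosen for this example.
GRANULAR_TO_BASIC = {
    "NNP": "noun",         # proper noun, singular
    "NN": "noun",          # common noun, singular
    "NNS": "noun",         # common noun, plural
    "VBD": "verb",         # verb, past tense
    "CD": "numeral",       # cardinal number
    "IN": "preposition",   # preposition
}

# Hand-tagged example sentence (tags assigned manually for illustration).
granular = [("John", "NNP"), ("gave", "VBD"), ("Mary", "NNP"), ("two", "CD"),
            ("apples", "NNS"), ("at", "IN"), ("school", "NN")]

basic = [(word, GRANULAR_TO_BASIC[tag]) for word, tag in granular]
print(basic)
```

The granular tags distinguish proper from common nouns and singular from plural, while the basic tags collapse all of them into "noun".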
Sentence/Synopsis classification
Sentence or synopsis classification has many use cases such as
• Spam detection
• News article classification (for example, political, technology, and
sport)
• Product review ratings (that is, positive or negative).
Language generation
• Predict new text based on previous text.
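A very small way to predict new text from previous text is a bigram model: count which word follows which, then repeatedly append the most frequent follower. The sketch below is illustrative only; the training sentences are made up, and modern systems use neural networks rather than raw counts.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for every word, which words follow it in the training sentences."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers

def generate(followers, start, max_words=4):
    """Greedily append the most frequent next word until the length limit
    is reached or the last word has no known follower."""
    words = [start]
    while len(words) < max_words and followers.get(words[-1]):
        words.append(followers[words[-1]].most_common(1)[0][0])
    return " ".join(words)

# Tiny illustrative corpus (hypothetical sentences).
corpus = ["i like green tea", "i like green apples", "i like tea", "green tea is good"]
model = train_bigrams(corpus)
print(generate(model, "i"))
# i like green tea
```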
Question Answering (QA)
QA techniques are at the foundation of chatbots and virtual assistants
(VAs), for example, Google Assistant and Apple Siri.
Machine Translation (MT)
• MT is the task of transforming a sentence/phrase from a source
language (for example, German) to a target language (for example,
English).
Finally, to develop a system that can assist a human in day-to-day
tasks (for example, a VA or a chatbot), many of these tasks need to be
performed together.
Chatbot
• Chatbots have been adopted by many companies for customer
support.
• Chatbots can answer and resolve straightforward customer
concerns that do not need human intervention.
For example
• Changing a customer's monthly mobile plan.
Top Libraries of NLP in Python
• Natural Language Toolkit (NLTK)
Natural language Libraries
• NLTK: The Natural Language Toolkit, commonly called the mother of all NLP libraries.
• spaCy: A recently popular library that integrates with deep learning
frameworks, although it does not cover every NLP functionality.
• TextBlob: A favorite library among data scientists for
implementing NLP tasks. It is based on both NLTK and Pattern. However, TextBlob
is neither the fastest nor the most complete library.
• Gensim: A library for topic modeling and similarity detection, which can be
used to find patterns in large corpora of text.
• CoreNLP: Stanford CoreNLP is a Java toolkit that can be used from Python through a
wrapper. It provides robust, accurate, and optimized techniques for tagging,
parsing, and analyzing text in various languages.
• PyTorch, TensorFlow, Keras: Deep learning frameworks commonly used to build NLP models.
There are hundreds of NLP libraries.
NLP Course Output
By the end of the course you will be able to build
• Sentiment analysis: Understanding customers' emotions toward products offered by
the business.
• Topic modeling: Extracting the unique topics from a group of documents.
• Complaint classification, email classification, e-commerce product
classification, etc.
• Document categorization/management using different clustering
techniques.
• Resume shortlisting and job description matching using similarity
methods.
NLP Course Output
• Advanced feature engineering techniques (word2vec and fastText) to
capture context.
• Information/Document Retrieval Systems, for example, search engine.
• Chatbot, Q & A, and Voice-to-Text applications like Siri and Alexa
• Language detection and translation using neural networks.
• Text summarization using graph methods and advanced techniques
• Text generation/predicting the next sequence of words using deep
learning algorithms.
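As a taste of the similarity methods mentioned for resume shortlisting and job description matching, the sketch below ranks two resumes against a job description using cosine similarity over bag-of-words counts. The job text, the resumes, and the candidate names are all hypothetical, and word counts are only one of many possible text representations.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the bag-of-words count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical job description and resumes.
job = "python developer with nlp experience"
resumes = {
    "candidate_1": "experienced python developer nlp projects",
    "candidate_2": "graphic designer with photoshop experience",
}

ranked = sorted(resumes, key=lambda name: cosine_similarity(job, resumes[name]), reverse=True)
print(ranked)
# ['candidate_1', 'candidate_2']
```

Here candidate_1 shares three of five words with the job description and candidate_2 shares two, so candidate_1 ranks first.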
NLP Case Study
Virtual Assistants (VAs) such as
• Google Assistant
• Cortana
• Apple Siri
are largely NLP systems.
NLP Case Study
Suppose a user asks a Virtual Assistant (VA):
“Can you show me a good Italian restaurant nearby?”
The VA will perform various NLP tasks to process this query.
NLP Case Study
NLP Tasks Performed by VA
• Convert the sound to text (that is, speech-to-text).
• Understand the semantics of the request and formulate a structured
request (for example, cuisine = Italian, rating = 3-5, distance < 10 km).
• Search for restaurants, filtering by location and cuisine, and then
sort the restaurants by the ratings received.
NLP Case Study
NLP Tasks Performed by VA
• Calculate an overall rating for a restaurant by both the rating and text
description provided by each user.
• Finally, once the user is at the restaurant, the VA might assist the user
by translating various menu items from Italian to English.
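The structured request and the filter-and-sort step described in this case study can be sketched as follows. The restaurant records, field names, and thresholds are hypothetical; a real assistant would query a places database and derive the request from the spoken query.

```python
from dataclasses import dataclass

@dataclass
class Restaurant:
    name: str
    cuisine: str
    rating: float
    distance_km: float

# Hypothetical restaurant records, made up for this example.
RESTAURANTS = [
    Restaurant("Roma", "Italian", 4.6, 2.5),
    Restaurant("Napoli", "Italian", 3.9, 12.0),
    Restaurant("Spice Hut", "Indian", 4.8, 1.0),
    Restaurant("Trattoria", "Italian", 4.2, 6.0),
    Restaurant("Pasta Box", "Italian", 2.4, 0.8),
]

# Structured request assumed to be derived from
# "Can you show me a good Italian restaurant nearby?"
request = {"cuisine": "Italian", "min_rating": 3.0, "max_distance_km": 10.0}

def search(restaurants, request):
    """Filter by cuisine, rating, and distance, then sort by rating (best first)."""
    matches = [
        r for r in restaurants
        if r.cuisine == request["cuisine"]
        and r.rating >= request["min_rating"]
        and r.distance_km < request["max_distance_km"]
    ]
    return sorted(matches, key=lambda r: r.rating, reverse=True)

print([r.name for r in search(RESTAURANTS, request)])
# ['Roma', 'Trattoria']
```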
Other NLP Systems
• Searching for today's weather on Google
• Google Translate to find out how to say, "How are you?" in French
• ... and the list continues
Moral of the lesson
A good NLP system is one that performs many NLP tasks.
NLP Applications
• Finding appropriate documents on certain topics from a database of texts (for
example, finding relevant books in a library)
• Extracting information from messages or articles on certain topics (for example,
building a database of all stock transactions described in the news on a given
day)
• Translating documents from one language to another (for example, producing
automobile repair manuals in many different languages)
• Summarizing texts for certain purposes (for example, producing a 3-page
summary of a 1000-page government report)
NLP Applications
• Question-Answering Systems, where natural language is used to query a database
(for example, a query system to a personnel database)
• Automated Customer Service over the telephone (for example, to perform banking
transactions or order items from a catalogue)
• Tutoring Systems, where the machine interacts with a student (for example, an
automated mathematics tutoring system)
• Spoken Language Control of a machine (for example, voice control of a VCR or
computer)
Production-Level Applications
• A computer program in Canada accepts daily weather data and
automatically generates weather reports in English and French
• Over 1,000,000 translation requests daily are processed by the Babel Fish
system available through AltaVista
• A visitor to Cambridge can ask a computer about places to eat using only
spoken language. The system returns relevant information from a
database of facts about the restaurant scene.
Prototype-Level Applications
• Computers grade student essays in a manner indistinguishable from
human graders
• An automated reading tutor intervenes, through speech, when the reader
makes a mistake or asks for help
• A computer watches a video clip of a soccer game and produces a report
about what it has seen
• A computer predicts upcoming words and expands abbreviations to help
people with disabilities to communicate
Final Output of the Course
• Pick a technical paper and reproduce its results. Make sure the model
is reasonably technically demanding.
• Pick an existing algorithm or learning model and design a new
enhanced version.
• Apply an existing model to a new domain or an application. Make sure
to provide rigorous analysis and/or experiment with new model variations.
• Make a new dataset and conduct annotation studies. Make sure to
provide baseline results.
Speech Systems
Speech Systems (Siri)
Before Siri and Alexa, there was ELIZA
Sophia (robot)
The Voice Assistant Battle
Basic Process of NLU
• Input: spoken input
• Phonological / morphological analyser (for speech understanding), using
phonological & morphological rules
• Result: sequence of words, “He love +s Mary.”
• SYNTACTIC COMPONENT, using grammatical knowledge
• Result: syntactic structure (parse tree) of “He loves Mary”, indicating
relations (e.g., mod) between words
• SEMANTIC INTERPRETER, using semantic rules and lexical semantics
(thematic roles, selectional restrictions)
• Result: logical form, x loves(x, Mary)
• CONTEXTUAL REASONER, using pragmatic & world knowledge
• Result: meaning representation, loves(John, Mary)
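The last step of this process, going from the logical form with an open variable x to the meaning representation loves(John, Mary), can be mimicked in a few lines. This is a loose sketch, not a real semantic interpreter: the logical form is modelled as a Python function, and the discourse context that maps "He" to John is an assumption made for the example.

```python
# Logical form with an open variable x, modelled as a function of x.
def logical_form(x):
    return ("loves", x, "Mary")

# Hypothetical discourse context: earlier sentences are assumed to have
# mentioned John, so the pronoun "He" refers to him.
context = {"He": "John"}

def contextual_reasoner(form, subject, context):
    """Resolve the subject using context, then apply the logical form to it."""
    referent = context.get(subject, subject)
    return form(referent)

print(contextual_reasoner(logical_form, "He", context))
# ('loves', 'John', 'Mary')
```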