Introduction & Applications
Why Text?
Source: www.pinterest.com
Source: RECOMND
Source: lifeboat.com
Natural Language Processing
Fitting in CS taxonomy
Computers
NLP Tasks
Working towards
Natural language understanding
Raw speech signal / raw text
• Speech recognition
Sequence of words spoken / written
• Syntactic analysis
Structure of the sentence
• Semantic analysis
Partial representation of meaning of sentence
• Discourse & Pragmatic analysis
Final representation of meaning of sentence
Need for Language Technologies in Daily Life
Application Areas
➢Machine Translation
➢Information Retrieval
Selecting from a set of documents the ones that are relevant to a query
➢Text Categorization
Classifying text into fixed topic categories
➢Question Answering
➢Information Extraction
Converting unstructured text into structured data
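The information retrieval item above (selecting the documents relevant to a query) can be sketched with a simple TF-IDF score. This is a minimal illustration, not a production ranking method; the function names and the three sample documents are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def rank(query, documents):
    """Return (score, document_index) pairs, best match first."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, idx))
    # Sort by descending score, breaking ties by document order.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical mini-collection for illustration only.
DOCS = [
    "machine translation converts text between languages",
    "spam filtering classifies email messages",
    "statistical machine translation uses parallel text",
]

best_score, best_index = rank("email spam", DOCS)[0]
print(DOCS[best_index])
```

Only the second document shares terms with the query, so it is the only one with a non-zero score.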
Application Areas (cont.)
➢Spoken language control systems
➢Spelling and grammar checkers
➢Sentiment Analysis
➢Text-to-Speech & Speech recognition
➢Natural Language Dialogue Interfaces to Databases
Question Answering
Source: Google
Information Retrieval
Source: Google
Email Spam Filtering/Categorizing
Source: junkemailfilter.com
Text Categorization
• Assign a label to a document that represents its content (e.g., an ACM keyword or a Yahoo category).
• Example: decide whether a newspaper article is about politics, business, or sports.
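The politics / business / sports example can be sketched with a small multinomial Naive Bayes classifier. Naive Bayes is just one common choice and is an assumption here, not something the slides prescribe; the six training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Collect per-label word counts, per-label document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data for illustration only.
TRAIN = [
    ("the minister won the election vote", "politics"),
    ("parliament passed the budget bill", "politics"),
    ("the company reported quarterly profit", "business"),
    ("shares rose as the market rallied", "business"),
    ("the team won the final match", "sports"),
    ("the striker scored a late goal", "sports"),
]

model = train(TRAIN)
print(classify("the striker scored in the match", model))
```

With so little training data the result is only indicative; a real categorizer would need far more labelled text.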
Source: Medium
Machine Translation
• Multilingual Usage
• Machine-assisted human Translation
• Scope
Creating Language resources.
Source: www.localizer.co
Source: Google
Duplicate Question detection
Knowledge Extraction
Source: http://aritter.github.io
Information Extraction
Information extraction systems
• Find and understand the relevant parts of a text.
• Produce a structured representation of the relevant information in the text, in the form of:
  • entities,
  • relations between entities,
  • events in which the entities are involved.
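The idea of turning text into entities and relations can be sketched with hand-written patterns. This is a toy, pattern-based illustration under strong assumptions (entities are runs of capitalized words, and only two relation phrasings are known); real IE systems use far richer models. The relation names and the sample text are made up.

```python
import re

# Assumption: an entity is one or more capitalized words in a row.
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical relation patterns: (relation name, compiled regex).
PATTERNS = [
    ("founded_by", re.compile(rf"({ENTITY}) was founded by ({ENTITY})")),
    ("located_in", re.compile(rf"({ENTITY}) is located in ({ENTITY})")),
]


def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    triples = []
    for relation, pattern in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group(1), relation, match.group(2)))
    return triples


TEXT = "Microsoft was founded by Bill Gates. Infosys is located in Bangalore."

for triple in extract_relations(TEXT):
    print(triple)
```

Each triple is a structured record that could be stored in a database, which is the "unstructured text to structured data" step in miniature.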
Information Extraction
Source: cs.washington.edu
Applications of IE Systems
Semantic Web
• Linked Data
• Vocabularies / Domain Information
• Inference
• Query
Source: Google
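Linked data, inference, and query can be illustrated with a tiny in-memory triple store. This is a plain-Python sketch, not an RDF library; the predicate names "type" and "subClassOf" are ordinary strings chosen for the example, and the facts about Rex are hypothetical.

```python
# Hypothetical facts stored as (subject, predicate, object) triples.
TRIPLES = {
    ("Dog", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
    ("Rex", "type", "Dog"),
}


def query(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def infer(triples):
    """Apply two rules until nothing new is derived:
    (a subClassOf b), (b subClassOf c) => (a subClassOf c)
    (x type b),       (b subClassOf c) => (x type c)
    """
    inferred = set(triples)
    changed = True
    while changed:
        new = set()
        for (a, p1, b) in inferred:
            for (c, p2, d) in inferred:
                if b == c and p2 == "subClassOf" and p1 in ("subClassOf", "type"):
                    new.add((a, p1, d))
        changed = not new <= inferred
        inferred |= new
    return inferred


closed = infer(TRIPLES)
print(query(closed, s="Rex", p="type"))
```

After inference, the query for Rex returns the classes Dog, Mammal, and Animal, although only Dog was stated explicitly.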
TOOLS
• Apache OpenNLP: a Java machine learning toolkit for natural language processing.
• OpenCalais: tags the people, places, companies, facts, and events in content to increase its value, accessibility, and interoperability.
• DBpedia Spotlight: a tool for automatically annotating mentions of DBpedia resources in text.
• Natural Language Toolkit (NLTK): a suite of Python libraries and programs for NLP.
• General Architecture for Text Engineering (GATE)
• spaCy: a free, open-source Python library for NLP, designed for speed and accuracy.
• Stanford CoreNLP: a Java annotation pipeline framework that provides most of the common core NLP steps, from tokenization through to coreference resolution.
Aspects of Language Processing
• Phonology
• Word, lexicon: lexical analysis
  • Morphology, word segmentation
• Syntax
  • Sentence structure, phrase, grammar, …
• Semantics
  • Meaning
• Discourse analysis
  • Meaning of a text
  • Relationship between sentences
• Pragmatics
  • The study of meaning in different contexts of use
Phonology
Speech processing
• Humans process speech remarkably well.
• Speech interfaces can replace keyboards and monitors.
• Speech recognition converts acoustic signals to text.
• A phoneme is the smallest unit of speech that distinguishes one word from another in a language.
• Grapheme: a way of writing down a phoneme.
Delegate (de + leg + ate): take the legs from
Cashier (cashy + er): more wealthy
Source: www.pinterest.co.uk
Morphology
• Studies the structures and patterns in words.
• A word is a sequence of morphemes.
• Morpheme: the smallest meaningful unit in a word.
• Morphological analysis determines how words are formed from morphemes, e.g., dogs = dog + s.
• Inflectional morphology: the part of speech stays the same.
  • Buses = Bus + es
  • Carried = Carry + ed
• Derivational morphology: the part of speech changes.
  • Destruct + ion = Destruction (noun)
  • Beauty + ful = Beautiful (adjective)
• Affixes: prefixes and suffixes. Rules govern how they fuse with the stem.
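The inflectional examples above (dogs, buses, carried) can be reproduced with a few suffix rules and a stem lexicon. This is a minimal sketch: the rule table and the four-word lexicon are assumptions made for the illustration, and real morphological analyzers handle many more spelling changes.

```python
# Each rule: (surface ending, letters to restore on the stem, suffix label).
# Rules are tried in order, so the more specific ending "ied" comes first.
RULES = [
    ("ied", "y", "ed"),  # carried -> carry + ed
    ("es", "", "es"),    # buses   -> bus + es
    ("ed", "", "ed"),    # walked  -> walk + ed
    ("s", "", "s"),      # dogs    -> dog + s
]

# Hypothetical stem lexicon for illustration only.
LEXICON = {"dog", "bus", "carry", "walk"}


def analyze(word, lexicon):
    """Return (stem, suffix); suffix is None when the word is not inflected
    or cannot be analyzed with the rules above."""
    word = word.lower()
    if word in lexicon:
        return (word, None)
    for ending, restore, suffix in RULES:
        if word.endswith(ending):
            stem = word[: -len(ending)] + restore
            if stem in lexicon:
                return (stem, suffix)
    return (word, None)


for w in ["dogs", "buses", "carried"]:
    print(w, "->", analyze(w, LEXICON))
```

Checking the candidate stem against a lexicon is what prevents wrong splits such as treating "bus" as "bu" + "s".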
Parsing
• Analyze the structure of a sentence
The student put the book on the table
D N V D N P D N
[S [NP The student] [VP put [NP the book] [PP on [NP the table]]]]
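The analysis of "The student put the book on the table" can be sketched with a small top-down parser. The grammar and lexicon below are assumptions covering just this sentence (NP, VP, PP with tags D, N, V, P); they are not a general grammar of English.

```python
# Toy grammar: each non-terminal maps to its possible right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "N"]],
    "VP": [["V", "NP", "PP"], ["V", "NP"]],
    "PP": [["P", "NP"]],
}

# Toy lexicon: word -> part-of-speech tag.
LEXICON = {
    "the": "D", "student": "N", "book": "N", "table": "N",
    "put": "V", "on": "P",
}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for each way `symbol` matches from `start`."""
    if symbol not in GRAMMAR:  # a part-of-speech tag: match one word
        if start < len(tokens) and LEXICON.get(tokens[start]) == symbol:
            yield (symbol, tokens[start]), start + 1
        return
    for production in GRAMMAR[symbol]:
        for children, end in expand(production, tokens, start, []):
            yield (symbol, *children), end


def expand(production, tokens, pos, children):
    """Match the symbols of one production left to right, with backtracking."""
    if not production:
        yield children, pos
        return
    for child, nxt in parse(production[0], tokens, pos):
        yield from expand(production[1:], tokens, nxt, children + [child])


def parse_sentence(sentence):
    """Return all parse trees that cover the whole sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


trees = parse_sentence("The student put the book on the table")
print(trees[0])
```

The tree is a nested tuple whose first element is the node label, so the PP "on the table" appears as the last child of the VP.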
Semantic Analysis
• What do you mean?
• Words – Lexical Semantics
• Sentences – Compositional Semantics
• Converting the syntactic structures to semantic format – meaning
representation.
• Semantics: the meaning of a word or phrase within a sentence
Discourse Analysis
• The meaning of an individual sentence may depend on the sentences that precede it and may influence the meaning of the sentences that follow it.
• Issues related to discourse integration
  • Anaphora: resolving a pronoun's reference (co-reference resolution)
  • Ellipsis: incomplete sentences
• Anaphora example
  • I read the book by Dr. Kalam. It was great. ("It" refers to "the book".)
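The Dr. Kalam example can be sketched with a very simple heuristic: pick the most recent earlier mention whose type agrees with the pronoun. The mention list and its type labels are supplied by hand here (an assumption; a real system would have to detect and type the mentions itself).

```python
# Assumption: each pronoun is compatible with one coarse mention type.
PRONOUN_TYPES = {"it": "thing", "he": "person", "she": "person"}


def resolve(pronoun, mentions):
    """Return the nearest preceding mention whose type matches the pronoun.

    `mentions` is a list of (text, type) pairs in order of appearance.
    """
    wanted = PRONOUN_TYPES.get(pronoun.lower())
    for text, kind in reversed(mentions):
        if kind == wanted:
            return text
    return None


# Hand-annotated mentions from "I read the book by Dr. Kalam."
MENTIONS = [("I", "person"), ("the book", "thing"), ("Dr. Kalam", "person")]

print(resolve("It", MENTIONS))
```

Without the type check, the nearest mention would be "Dr. Kalam", which shows why agreement features matter for pronoun resolution.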
Discourse Structures: Ellipsis
The second sentence is not complete, but what it means can be inferred
from the first one.
Pragmatics
• Uses context of utterance
Challenges in NLP: Ambiguity
Morphology
Syntax Ambiguity
Teacher strikes idle kids
Reading 1 (N N V N): [S [NP Teacher strikes] [VP idle [NP kids]]]
Reading 2 (N V Adj N): [S [NP Teacher] [VP strikes [NP idle kids]]]
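The two readings of "Teacher strikes idle kids" can be counted with a CYK-style chart. The grammar below is an assumption written to cover only this headline: "strikes" may be a noun or a verb, and "idle" may be a verb or an adjective.

```python
from collections import defaultdict

# Toy grammar in (parent, left child, right child) form.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "N", "N"),
    ("NP", "Adj", "N"),
    ("VP", "V", "NP"),
]
# A single noun can also stand alone as a noun phrase.
UNARY_RULES = [("NP", "N")]

# Ambiguous lexicon: some words have more than one tag.
LEXICON = {
    "teacher": ["N"],
    "strikes": ["N", "V"],
    "idle": ["V", "Adj"],
    "kids": ["N"],
}


def count_parses(sentence, start_symbol="S"):
    """Count the distinct parse trees of the sentence under the toy grammar."""
    tokens = sentence.lower().split()
    n = len(tokens)
    # chart[i, j][X] = number of ways symbol X can span tokens i..j-1
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for tag in LEXICON.get(token, []):
            chart[i, i + 1][tag] += 1
        for parent, child in UNARY_RULES:
            chart[i, i + 1][parent] += chart[i, i + 1][child]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[i, j][parent] += chart[i, k][left] * chart[k, j][right]
    return chart[0, n][start_symbol]


print(count_parses("Teacher strikes idle kids"))
```

A count greater than one means the sentence is syntactically ambiguous under this grammar; choosing between the trees needs information beyond syntax.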
Attachment Ambiguity
Semantic Ambiguity
• Meaning of the words themselves can be misinterpreted.
• Example 1: The car hit the pole while it was moving.
• Possible interpretations:
  • The car, while moving, hit the pole.
  • The car hit the pole while the pole was moving.
• Example 2:
Semantic Ambiguity
Semantic ambiguity: “I saw the Prudential building flying into Boston”
Sample Ontology
Discourse Ambiguity
“We gave the monkeys the bananas because they were hungry”
“We gave the monkeys the bananas because they were over-ripe”
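In both sentences "they" could grammatically refer to the monkeys or the bananas; only world knowledge decides. A minimal sketch of that idea follows, using two small hand-written tables (both are assumptions made for this example, not a real knowledge base).

```python
# Hypothetical world knowledge: what kind of thing each property can describe.
PROPERTY_APPLIES_TO = {"hungry": "animal", "over-ripe": "fruit"}

# Hypothetical entity types.
ENTITY_KIND = {"monkeys": "animal", "bananas": "fruit"}


def resolve_they(candidates, property_word):
    """Return the candidates that the property can sensibly describe.

    If the property is unknown, every candidate is returned, i.e. the
    pronoun stays ambiguous.
    """
    wanted = PROPERTY_APPLIES_TO.get(property_word)
    if wanted is None:
        return list(candidates)
    return [c for c in candidates if ENTITY_KIND.get(c) == wanted]


CANDIDATES = ["monkeys", "bananas"]
print(resolve_they(CANDIDATES, "hungry"))
print(resolve_they(CANDIDATES, "over-ripe"))
```

The two sentences differ in a single word, yet that word flips the referent, which is why discourse ambiguity cannot be resolved from syntax alone.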
Pragmatics Ambiguity
Enabling Computing Techniques
• Stemming
  • Reduce words to a base form.
• Part-of-speech tagging
  • Determine for each word whether it is a noun, adjective, verb, …
• Parsing
  • Convert a sentence to a parse tree.
• WordNet: a lexical database with 206,941 word-sense pairs
• Word sense disambiguation
  • Bank (financial bank vs. riverbank)
• Semantic similarity metrics
• Vector representations of words and sentences
• Neural network based models
  • Word2Vec, GloVe, ELMo
• Pretrained models
  • BERT etc.
• Large language models
  • GPT, Llama etc.
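The "bank" example from the word sense disambiguation item can be sketched with a simplified Lesk-style overlap: choose the sense whose definition shares the most words with the sentence. The two glosses and the stopword list below are hand-written assumptions, not WordNet entries.

```python
# Small hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and", "or",
             "that", "is", "i", "my", "we"}

# Hypothetical sense inventory with hand-written glosses.
SENSES = {
    "bank": {
        "financial institution":
            "an institution that accepts deposits of money and makes loans",
        "river bank":
            "sloping land along the edge of a river or stream of water",
    }
}


def content_words(text):
    """Lowercase, strip simple punctuation, and drop stopwords."""
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS


def disambiguate(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "I went to the bank to deposit money"))
print(disambiguate("bank", "We sat on the bank of the river watching the water"))
```

Exact word overlap is brittle ("deposit" does not match "deposits"), which is one motivation for the stemming and vector-based similarity items in the list above.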
Conclusion
References
Books
• Dan Jurafsky and James H. Martin, Speech and Language Processing, Pearson Education