Lecture-#0
Natural Language Processing
A7707
Introduction
Syllabus Details
Introduction: What is Natural Language Processing (NLP), Origins of NLP, The Challenges of NLP, Phases of NLP, Language and Grammar. Finding the Structure of Words and Documents: Words and Their Components, Issues and Challenges, Morphological Models. Finding the Structure of Documents: Introduction, Sentence Boundary Detection, Topic Boundary Detection, Methods, Complexity of the Approaches, Performances of the Approaches, Features, Processing Stages.
Syntax: Parsing Natural Language, A Data-Driven Approach to Syntax, Stop
words, Correcting Words, Stemming, Lemmatization, Parts of Speech (POS)
Tagging, Representation of Syntactic Structure, Parsing Algorithms, Models for
Ambiguity Resolution in Parsing.
Syllabus Details
Semantic Parsing: Introduction, Semantic Interpretation: Structural Ambiguity, Entity and
Event Resolution, System Paradigms, Word Sense, Predicate-Argument Structure, Meaning
Representation.
Language Modelling: Introduction, n-Gram Models, Language Model Evaluation, Parameter Estimation, Types of Language Models: Class-Based Language Models, MaxEnt Language Models, Neural Network Language Models, Language-Specific Modeling Problems, Multilingual and Crosslingual Language Modeling.
Applications: Question Answering: History, Architectures, Question Analysis, Search and
Candidate Extraction, Automatic Summarization: Approaches to Summarization, Spoken
Dialog Systems: Speech Recognition and Understanding, Speech Generation, Dialog
Manager, Voice User Interface, Information Retrieval: Document Preprocessing, Monolingual
Information Retrieval
Syllabus Details Lab
1. a. Write a program to Tokenize Text to word using NLTK.
b. Write a program to Tokenize Text to Sentence using NLTK.
2. a. Write a program to remove numbers, punctuations, and whitespaces in a file.
b. Write a program to Count Word Frequency in a file.
3. Write a program to Tokenize and tag the following sentence using Morphological Analysis
in NLP.
4. a. Write a program to get Synonyms from WordNet.
b. Write a program to get Antonyms from WordNet.
5. a. Write a program to show the difference in the results of Stemming and Lemmatization.
b. Write a program to Lemmatizing Words Using WordNet.
6. a. Write a program to print all stop words in NLP.
b. Write a program to remove all stop words from a given text.
Syllabus Details Lab
7. Write a Python program to apply Collocation extraction word combinations in the text.
Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keeps in
mind,” “get ready,” and so on.
8. Write a Python program to extract relationships, which allows obtaining structured information from unstructured sources such as raw text. Strictly stated, it is identifying
relations (e.g., acquisition, spouse, employment) among named entities (e.g., people,
organizations, locations). For example, from the sentence “Mark and Emily married
yesterday,” we can extract the information that Mark is Emily’s husband.
9. Write a program to print POS and parse tree of a given Text.
10. Write a program to print bigram and Trigram of a given Text.
11. Implement a case study of NLP application.
Reference Books
Text Books:
Daniel M. Bikel and Imed Zitouni, Multilingual Natural Language Processing Applications: From Theory to Practice, IBM Press, 2013.
Tanveer Siddiqui and U.S. Tiwary, Natural Language Processing and Information Retrieval, Oxford University Press, 2008.
Reference Books:
Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd Edition, Prentice Hall, 2008.
James Allen, Natural Language Understanding, 2nd Edition, Benjamin/Cummings Publishing Company, 1995.
Courses Outcomes
A7707.1 Identify the structure of words and documents for text
preprocessing.
A7707.2 Choose an approach to parse the given text document.
A7707.3 Make use of semantic parsing to capture real meaning of text.
A7707.4 Select a language model to predict the probability of a sequence
of words.
A7707.5 Examine the various applications of NLP.
Introduction to Web and Text Analytics
Hours Per Week (L-T-P): 2-0-2
Hours Per Semester (L-T-P): 28-0-28
Credits (C): 3
Assessment Marks: CIE 30, SEE 70, Total 100
Introduction to AI for Natural Language Processing
Natural language processing (NLP) refers to the branch of computer
science—and more specifically, the branch of artificial intelligence —
concerned with giving computers the ability to understand text and spoken
words in much the same way human beings can.
Introduction to AI for Natural Language Processing
NLP combines computational linguistics—rule-based modeling of human
language—with statistical, machine learning, and deep learning models.
Together, these technologies enable computers to process human language
in the form of text or voice data and to ‘understand’ its full meaning,
complete with the speaker or writer’s intent and sentiment.
NLP drives computer programs that translate text from one language to
another, respond to spoken commands, and summarize large volumes of text
rapidly—even in real time.
Applications of AI for Natural Language Processing
Applications of AI for Natural Language Processing
1. Sentiment Analysis
Sentiment Analysis is also known as opinion mining.
It is used on the web to analyse the attitude, behaviour, and
emotional state of the sender.
This application assigns a value to the text (positive, negative, or neutral) and identifies the mood of the context (happy, sad, angry, etc.).
Applications of Text Analytics
2. Question Answering
Question Answering focuses on building systems that automatically answer
the questions asked by humans in a natural language.
3. Machine Translation
Machine translation is used to translate text or speech from one natural
language to another natural language.
Example: Google Translator
Applications of AI for Natural Language Processing
4. Spam Detection
Spam detection is used to detect unwanted e-mails getting to a user's inbox.
5. Spelling and Grammar Correction
Word processors such as Microsoft Word and PowerPoint provide built-in spelling and grammar correction.
Example: Grammarly software
Applications of AI for Natural Language Processing
6. Speech Recognition
Speech recognition is used for converting spoken words into text. It is used
in applications, such as mobile, home automation, video recovery,
dictating to Microsoft Word, voice biometrics, voice user interface, and so
on.
Example: Google Assistant, Siri
7. Chatbot
Implementing chatbots is one of the important applications of NLP. Many companies use chatbots to provide customer chat services.
Applications of AI for Natural Language Processing
8. Social media monitoring
It means tracking hashtags, keywords, and mentions relevant to your brand in
order to stay informed about your audience and industry.
Lecture-#1
Natural Language Processing
A7707
Introduction
Origins of Natural Language Processing
Origins of Natural Language Processing
Alan Turing is the father of Natural Language Processing. In his 1950 paper "Computing Machinery and Intelligence", he described a test for an intelligent machine that could understand and respond to natural human conversation.
NLP Terminology
Phases of NLP
1. Lexical and Morphological Analysis
•The first phase of NLP is Lexical Analysis. This phase scans the source text as a stream of characters and converts it into meaningful lexemes.
•It divides the whole text into paragraphs, sentences, and words.
2. Syntactic Analysis (Parsing)
•Syntactic Analysis is used to check grammar, word
arrangements, and shows the relationship among the words.
Example: Agra goes to the Poonam
•In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence is rejected by the syntactic analyzer.
Phases of NLP
3. Semantic Analysis
•Semantic analysis is concerned with the meaning representation. It
mainly focuses on the literal meaning of words, phrases, and
sentences.
4. Discourse Integration
•Discourse Integration depends upon the sentences that precede it and also invokes the meaning of the sentences that follow it.
•For example, the word "that" in the sentence "He wanted that"
depends upon the prior discourse context.
5. Pragmatic Analysis
•Pragmatic analysis is the fifth and last phase of NLP. It helps you to discover the intended effect by applying a set of rules that characterize cooperative dialogues.
For Example: "Open the door" is interpreted as a request instead
of an order.
The Challenges of NLP
NLP is a powerful tool with huge benefits, but there are still a number of
Natural Language Processing limitations and problems:
Contextual words and phrases and homonyms
Synonyms
Ambiguity
Errors in text or speech
Colloquialisms and slang
The Challenges of NLP
Contextual words and phrases and homonyms
1. The same words and phrases can have different meanings according to the context of a sentence, and many words – especially in English – have the exact same pronunciation but totally different meanings.
2.Homonyms – two or more words that are pronounced the same but have different
definitions – can be problematic for question answering and speech-to-text applications
because they aren’t written in text form. Usage of their and there, for example, is even a
common problem for humans.
For example:
I ran to the store because we ran out of milk.
Can I run something past you real quick?
The house is looking really run down.
These are easy for humans to understand because we read the context of the sentence and
we understand all of the different definitions. And, while NLP language models may have
learned all of the definitions, differentiating between them in context can present
problems.
The Challenges of NLP
Synonyms
Synonyms can lead to issues similar to contextual understanding because we use many
different words to express the same idea.
Furthermore, some of these words may convey exactly the same meaning, while others differ in degree (small, little, tiny, minute), and different people use synonyms to denote slightly different meanings within their personal vocabulary.
So, for building NLP systems, it’s important to include all of a word’s possible meanings and
all possible synonyms.
Text analysis models may still occasionally make mistakes, but the more relevant training
data they receive, the better they will be able to understand synonyms.
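As a small illustration of this point, NLTK's WordNet interface can list the synonyms recorded for a word. This is only a sketch; the word "small" and the printed set are illustrative.
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

# collect every lemma name from every synset of "small"
synonyms = set()
for syn in wordnet.synsets("small"):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
print(synonyms)   # e.g. {'small', 'little', 'minor', 'modest', ...}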
The Challenges of NLP
Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible
interpretations.
Lexical ambiguity: a word that could be used as a verb, noun, or adjective. Examples: light, fast.
Semantic ambiguity: the interpretation of a sentence in context. For example: I saw the boy
on the beach with my binoculars.
This could mean that I saw a boy through my binoculars or the boy had my binoculars with
him
Syntactic ambiguity: In the sentence above, this is what creates the confusion of meaning.
The phrase with my binoculars could modify the verb, “saw,” or the noun, “boy.”
Even for humans this sentence alone is difficult to interpret without the context of
surrounding text.
POS (part of speech) tagging is one NLP solution that can help solve the problem, somewhat.
The Challenges of NLP
Colloquialisms and slang
Informal phrases, expressions, idioms, and culture-specific lingo present a number of
problems for NLP – especially for models intended for broad use.
Unlike formal language, colloquialisms may have no "dictionary definition" at all, and
these expressions may even have different meanings in different geographic areas.
Furthermore, cultural slang is constantly morphing and expanding, so new words pop up
every day.
This is where training and regularly updating custom models can be helpful, although it
oftentimes requires quite a lot of data.
Language and Grammar
Difference between Natural language and Computer Language
Natural Language vs Computer Language
•Natural language has a very large vocabulary; computer language has a very limited vocabulary.
•Natural language is easily understood by humans; computer language is easily understood by machines.
•Natural language is ambiguous in nature; computer language is unambiguous.
How many languages are there in NLP?
NLP is usually used for chatbots, virtual assistants, and modern spam
detection.
But NLP isn't perfect: although there are over 7,000 languages spoken around the globe, most NLP systems cover only a handful of languages, such as English, Chinese, Urdu, Farsi, Arabic, French, and Spanish.
Language and Grammar
Grammar in NLP is a set of rules for constructing
sentences in a language used to understand and analyze
the structure of sentences in text data.
Mathematically, a grammar G can be written as a 4-tuple
(N, T, S, P) where,
N or VN = set of non-terminal symbols, or variables.
T or ∑ = set of terminal symbols.
S = Start symbol where S ∈ N
P = Production rules for terminals as well as non-terminals.
A rule has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.
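As a hedged illustration of the 4-tuple (N, T, S, P), a toy grammar can be written down with NLTK's CFG class. The non-terminals, terminal words, and sentence below are invented for this example only.
import nltk

# N = {S, NP, VP, V}; T = the quoted words; S = start symbol; P = the rules below
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> 'John' | 'Mary'
VP -> V NP
V  -> 'sees' | 'likes'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("John sees Mary".split()):
    print(tree)   # (S (NP John) (VP (V sees) (NP Mary)))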
Components of NLP
There are the following two components of NLP –
Components of NLP
Natural language understanding (NLU)
NLU enables machines to understand and interpret human language by extracting metadata from
content. It performs the following tasks:
Helps analyze different aspects of language.
Helps map the input in natural language into valid representations.
NLU is more difficult than NLG tasks owing to referential, lexical, and syntactic ambiguity.
Lexical ambiguity: This means that one word holds several meanings. For example, "The man is
looking for the match." The sentence is ambiguous as ‘match’ could mean different things such as a
partner or a competition.
Syntactic ambiguity: This refers to a sequence of words with more than one meaning. For example,
"The fish is ready to eat.” The ambiguity here is whether the fish is ready to eat its food or whether the
fish is ready for someone else to eat. This ambiguity can be resolved with the help of the part-of-speech
tagging technique.
Referential ambiguity: This involves a word or a phrase that could refer to two or more properties.
For example, Tom met Jerry and John. They went to the movies. Here, the pronoun ‘they’ causes
ambiguity as it isn’t clear who it refers to.
Components of NLP
Natural language generation (NLG)
NLG is a method of creating meaningful phrases and sentences (natural
language) from data. It comprises three stages: text planning, sentence
planning, and text realization.
Text planning: Retrieving applicable content.
Sentence planning: Forming meaningful phrases and setting the
sentence tone.
Text realization: Mapping sentence plans to sentence structures.
Chatbots, machine translation tools, analytics platforms, voice assistants,
sentiment analysis platforms, and AI-powered transcription tools are some
applications of NLG.
Components of NLP
NLU vs NLG
•NLU is the process of reading and interpreting language; NLG is the process of writing or generating language.
•NLU produces non-linguistic outputs from natural language inputs; NLG produces natural language outputs from non-linguistic inputs.
Lecture-#2
Natural Language Processing
A7707
Introduction
Text Tokenization
Corpus:
Text Tokenization
Tokens:
Text Tokenization
ngrams:
!pip install nltk
from nltk import ngrams

sentence = 'Hello everyone. Welcome to class. You are studying Web and Text Analytics article'
n = 2
x = ngrams(sentence.split(), n)
for grams in x:
    print(grams)

O/P:
('Hello', 'everyone.') ('everyone.', 'Welcome') ('Welcome', 'to') ('to', 'class.') ('class.', 'You') ('You', 'are') ('are', 'studying') ('studying', 'Web') ('Web', 'and') ('and', 'Text') ('Text', 'Analytics') ('Analytics', 'article')
Text Tokenization
Tokenization:
Types:
Sentence Tokenization
Word Tokenization
Sentence Tokenization
Sentences are separated by a full stop, so the process of sentence tokenization
finds all the full stops in a piece of text to split the data into sentences.
!pip install nltk
import nltk
Sentence Tokenization
PUNKT is an unsupervised trainable model, which means it can be
trained on unlabeled data (Data that has not been tagged with information
identifying its characteristics, properties, or categories is referred to as
unlabeled data.)
It contains a variety of libraries for various purposes like text
classification, parsing, stemming, tokenizing, etc.
punkt is the required package for tokenization. Hence you may
download it using nltk download manager or download it programmatically
using
nltk.download('punkt').
Sentence Tokenization
!pip install nltk
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to class. You are studying Web and Text Analytics article"
sent_tokenize(text)

OUTPUT:
['Hello everyone.', 'Welcome to class.', 'You are studying Web and Text Analytics article']
Word Tokenization
Word tokenization is the process of splitting a large sample of text into
words.
This is a requirement in natural language processing tasks where each word
needs to be captured and subjected to further analysis like classifying and
counting them for a particular sentiment etc.
Word Tokenization
from nltk.tokenize import word_tokenize
text = "Hello everyone. Welcome to Web and Text Analytic class"
word_tokenize(text)
O/P:
['Hello', 'everyone', '.', 'Welcome', 'to', 'Web', 'and', 'Text', 'Analytic', 'class']
Word Tokenization
Regular expression (RE): It is a set of characters or a pattern that is used
to find substrings in a given string.
Example:
Language
L = {ε, 00, 0000, 000000, ......}
RE
R = (00)*
Language
{google, googoogle, googoogoogle, googoogoogoogle, ...}
RE
R= g(oog)+le
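The two regular expressions above can be checked quickly with Python's re module; this is only an illustrative sketch.
import re

print(bool(re.fullmatch(r"(00)*", "0000")))           # True: even number of 0s
print(bool(re.fullmatch(r"(00)*", "000")))            # False: odd number of 0s
print(bool(re.fullmatch(r"g(oog)+le", "googoogle")))  # True: g + oog + oog + le
print(bool(re.fullmatch(r"g(oog)+le", "gole")))       # False: needs at least one 'oog'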
Word Tokenization
RegexpTokenizer:
from nltk.tokenize import RegexpTokenizer
tokenizer2 = RegexpTokenizer(r"[\w]+")
tokenizer2.tokenize(text)
O/P:
['Hello', 'everyone', 'Welcome', 'to', 'class', 'You', 'are', 'studying',
'Web', 'and', 'Text', 'Analytics', 'article']
Word Tokenization
RegexpTokenizer:
Word Tokenization
Punctuation:
# import string library function
import string
# Storing the sets of punctuation in variable result
result = string.punctuation
# Printing the punctuation values
print(result)
O/P:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Word Tokenization
WordPunctTokenizer:
WordPunctTokenizer is used for separating the punctuation from the words.
from nltk.tokenize import WordPunctTokenizer
tokenizer1 = WordPunctTokenizer()
tokenizer1.tokenize(text)
O/P:
['Hello', 'everyone', '.', 'Welcome', 'to', 'class', '.', 'You', 'are', 'studying',
'Web', 'and', 'Text', 'Analytics', 'article']
Stemming
Stemming:
Stemming is the process of reducing inflected or derived words to their root/base form (stem).
Stemming programs are commonly referred to as stemming algorithms or
stemmers.
A stemming algorithm reduces the words “chocolates”, “chocolatey”, and “choco”
to the root word, “chocolate” and “retrieval”, “retrieved”, “retrieves” reduce to the
stem “retrieve”.
Errors in Stemming: There are mainly two errors in stemming – Overstemming and Understemming.
Overstemming occurs when two words with different meanings are stemmed to the same root (the stemmer removes too much).
Understemming occurs when two words that should be reduced to the same root are stemmed to different roots (the stemmer removes too little).
Stemming
Stemming:
Stemming
Stemming:
Example (overstemming): university and universe are both stemmed to univers.
Example (understemming): "data" and "datum" are stemmed to dat and datu respectively, instead of a common stem.
Applications of stemming are:
Stemming is used in information retrieval systems like search engines.
It is used to determine domain vocabularies in domain analysis.
Stemming is desirable as it may reduce redundancy, since most of the time a word stem and its inflected/derived forms mean the same thing.
Stemming
Stemming:
# import these modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]
for w in words:
    print(w, " : ", ps.stem(w))

O/P:
program : program
programs : program
programmer : programm
programming : program
programmers : programm
Stemming
Stemming:
# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
for w in words:
    print(w, " : ", ps.stem(w))

O/P:
Programmers : programm
program : program
with : with
programming : program
languages : languag
Lemmatization
Lemmatization:
Lemmatization is the process of grouping together the different inflected
forms of a word so they can be analyzed as a single item.
Lemmatization is similar to stemming but it brings context to the words.
So it links words with similar meanings to one word.
Example:
-> rocks : rock
-> corpora : corpus
-> better : good
Applications of lemmatization are:
Used in comprehensive retrieval systems like search engines.
Used in compact indexing
Lemmatization
Lemmatization:
Lemmatization
Lemmatization:
# import these modules
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))

O/P:
rocks : rock
corpora : corpus
better : good
STEMMING vs LEMMATIZATION
Grammar
Grammar is the set of rules of a language governing the sounds, words, sentences, and other elements, as well as their combination and interpretation.
Context Free Grammar
Context free grammar is a formal grammar which is used to generate all
possible strings in a given formal language.
Context free grammar G can be defined by four tuples as:
G= (V, T, P, S)
Where
V describes a finite set of non-terminal symbols or variables
T describes a finite set of terminal symbols.
P describes a set of production rules
S is the start symbol.
Context Free Grammar
Example
The grammar ({A}, {a, b, c}, P, A), P : A → aA, A → abc.
The grammar ({S}, {a, b}, P, S), P: S → aSa, S → bSb, S → ε
The grammar ({S, F}, {0, 1}, P, S), P: S → 00S | 11F, F → 00F | ε
Context Free Grammar
Generation of Derivation Tree
A derivation tree or parse tree is an ordered rooted tree that graphically represents how a string is derived from a context-free grammar.
Representation Technique
Root vertex − Must be labeled by the start symbol.
Vertex − Labeled by a non-terminal symbol.
Leaves − Labeled by a terminal symbol or ε.
Leftmost derivation − A leftmost derivation is obtained by applying a production to the leftmost variable in each step.
Rightmost derivation − A rightmost derivation is obtained by applying a production to the rightmost variable in each step.
Context Free Grammar
Generation of Derivation Tree or parse tree
A → AA
A → (A)
A→a
Solution:
For the string "a(a)aa" the above grammar can generate two parse trees:
Context Free Grammar
A CFG is ambiguous if one or more terminal strings have multiple leftmost
derivations from the start symbol.
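A small sketch of the ambiguous grammar above in NLTK, parsing the string "a(a)aa" character by character (the character-level tokenization is an assumption made for this illustration). The parser returns more than one tree, which is exactly the ambiguity being described.
import nltk

grammar = nltk.CFG.fromstring("""
A -> A A
A -> '(' A ')'
A -> 'a'
""")

parser = nltk.ChartParser(grammar)
trees = list(parser.parse(list("a(a)aa")))
print(len(trees), "parse trees found")   # more than one tree -> ambiguous grammar
print(trees[0])
print(trees[1])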
Parts of Speech (POS) Tagging
Part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context.
Input: I am going.
Output:
with stop word
[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('.', '.')]
Without stop word
[('I', 'PRP'), ('going', 'VBG'), ('.', '.')]
Where
PRP – Personal pronoun
VBP – Verb, non-3rd person singular present
VBG – Verb, gerund/present participle
Parts of Speech (POS) Tagging
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
txt = "Everything is all about money."

# sent_tokenize is one of the instances of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
    # word tokenizer is used to find the words
    # and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if not w in stop_words]
    # using a Tagger, which is a part-of-speech
    # tagger or POS-tagger
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

O/P:
[('Everything', 'VBG'), ('money', 'NN'), ('.', '.')]
Parts of Speech (POS) Tagging
# with stop words
import nltk

# download required nltk packages
# required for tokenization
nltk.download('punkt')
# required for parts of speech tagging
nltk.download('averaged_perceptron_tagger')

# input text
sentence = """Today morning, I am Very Happy."""

# tokenize into words
tokens = nltk.word_tokenize(sentence)

# parts of speech tagging
tagged = nltk.pos_tag(tokens)

# print tagged tokens
print(tagged)

O/P:
[('Today', 'NN'), ('morning', 'NN'), (',', ','), ('I', 'PRP'), ('am', 'VBP'), ('Very', 'RB'), ('Happy', 'JJ'), ('.', '.')]
Parts of Speech (POS) Tagging
# Without stop words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
txt = "Today morning, I am Very Happy."

# sent_tokenize is one of the instances of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
    # word tokenizer is used to find the words
    # and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if not w in stop_words]
    # using a Tagger, which is a part-of-speech
    # tagger or POS-tagger
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

O/P:
[('Today', 'NN'), ('morning', 'NN'), (',', ','), ('I', 'PRP'), ('Very', 'RB'), ('Happy', 'JJ'), ('.', '.')]
Lecture-#3
Natural Language Processing
A7707
Introduction
Finding the Structure of Words and Documents
Finding the Structure of Words and Documents
•Human language is a complicated thing.
•We use it to express our thoughts, and through language, we receive information and infer its
meaning.
•Trying to understand language all together is not a viable approach.
•Linguists have developed whole disciplines that look at language from different perspectives
and at different levels of detail.
•The point of morphology, for instance, is to study the variable forms and functions of words,
•The syntax is concerned with the arrangement of words into phrases, clauses, and sentences.
•Word structure constraints due to pronunciation are described by phonology,
•The conventions for writing constitute the orthography of a language.
•The meaning of a linguistic expression is its semantics, and etymology and lexicology cover
especially the evolution of words and explain the semantic, morphological, and other links among
them.
•Words are perhaps the most intuitive units of language, yet they are in general tricky to define.
•Knowing how to work with them allows, in particular, the development of syntactic and
semantic abstractions and simplifies other advanced views on language.
•Here, first we explore how to identify words of distinct types in human languages, and how the
internal structure of words can be modelled in connection with the grammatical properties and
lexical concepts the words should represent.
Finding the Structure of Words and Documents
•The discovery of word structure is morphological parsing.
•In many languages, words are delimited in the orthography by whitespace and
punctuation.
•But in many other languages, the writing system leaves it up to the reader to tell words apart or determine their exact phonological forms.
Words and Their Components
•Words are defined in most languages as the smallest linguistic units that can form a complete utterance by themselves.
•The minimal parts of words that deliver aspects of meaning to them are called
morphemes.
Tokens
•Suppose, for a moment, that words in English are delimited only by whitespace
and punctuation (the marks, such as full stop, comma, and brackets)
•Example: Will you read the newspaper? Will you read it? I won’t read it.
Finding the Structure of Words and Documents
If we confront our assumption with insights from syntax, we notice two problematic cases here: the words newspaper and won’t.
•Being a compound word, newspaper has an interesting derivational structure.
•In writing, newspaper and the associated concept is distinguished
from the isolated news and paper.
•For reasons of generality, linguists prefer to analyze won’t as two syntactic words, or tokens,
each of which has its independent role and can be reverted to its normalized form.
•The structure of won’t could be parsed as will followed by not.
•In English, this kind of tokenization and normalization may apply to just a limited set of
cases, but in other languages, these phenomena have to be treated in a less trivial manner.
•In Arabic or Hebrew, certain tokens are concatenated in writing with the preceding or the
following ones, possibly changing their forms as well.
•The underlying lexical or syntactic units are thereby blurred into one compact string of letters
and no longer appear as distinct words.
•Tokens behaving in this way can be found in various languages and are often called clitics.
•In the writing systems of Chinese, Japanese, and Thai, whitespace is not used to separate
words.
Finding the Structure of Words and Documents
Lexemes
•By the term word, we often denote not just the one linguistic form in the given context but
also the concept behind the form and the set of alternative forms that can express it.
•Such sets are called lexemes or lexical items, and they constitute the lexicon of a
language.
•Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns,
adjectives, conjunctions, particles, or other parts of speech.
•The citation form of a lexeme, by which it is commonly identified, is also called its
lemma.
•When we convert a word into its other forms, such as turning the singular mouse into the
plural mice or mouses, we say we inflect the lexeme.
•When we transform a lexeme into another one that is morphologically related, regardless of
its lexical category, we say we derive the lexeme: for instance, the nouns receiver and
reception are derived from the verb to receive.
•Example: Did you see him? I didn’t see him. I didn’t see anyone.
•Example presents the problem of tokenization of didn’t and the investigation of the internal
structure of anyone.
Finding the Structure of Words and Documents
•In the paraphrase I saw no one, the lexeme to see would be inflected into the form
saw to reflect its grammatical function of expressing positive past tense.
•Likewise, him is the oblique case form of he or even of a more abstract lexeme
representing all personal pronouns.
•In the paraphrase, no one can be perceived as the minimal word synonymous with
nobody.
•The difficulty with the definition of what counts as a word need not pose a problem for
the syntactic description if we understand no one as two closely connected tokens
treated as one fixed element.
Morphemes
•Morphological theories differ on whether and how to associate the properties of word
forms with their structural components.
•These components are usually called segments or morphs.
•The morphs that by themselves represent some aspect of the meaning of a word are
called morphemes of some function.
•Human languages employ a variety of devices by which morphs and morphemes are
combined into word forms.
Finding the Structure of Words and Documents
Morphology
•Morphology is the domain of linguistics that analyses the internal structure of words.
•Morphological analysis – exploring the structure of words
•Words are built up of minimal meaningful elements called morphemes: played = play-ed
cats = cat-s
unfriendly = un-friend-ly
•Two types of morphemes:
i. Stems: play, cat, friend
ii. Affixes: -ed, -s, un-, -ly
•Two main types of affixes:
i. Prefixes precede the stem: un-
ii. Suffixes follow the stem: -ed, -s, -ly
•Stemming = find the stem by stripping off affixes:
play = play
replayed = re-play-ed
computerized = comput-er-ize-d
Finding the Structure of Words and Documents
Problems in morphological processing
•Inflectional morphology: inflected forms are constructed from base forms and inflectional affixes.
•Inflection relates different forms of the same word:
Lemma    Singular    Plural
cat      cat         cats
dog      dog         dogs
knife    knife       knives
sheep    sheep       sheep
mouse    mouse       mice
•Derivational morphology: words are constructed from roots (or stems) and derivational affixes:
inter+national = international
international+ize = internationalize
internationalize+ation = internationalization
Finding the Structure of Words and Documents
•The simplest morphological process concatenates morphs one by one, as in dis- agree-ment-s,
where agree is a free lexical morpheme and the other elements are bound grammatical
morphemes contributing some partial meaning to the whole word.
•in a more complex scheme, morphs can interact with each other, and their forms may become
subject to additional phonological and orthographic changes denoted as morphophonemic.
•The alternative forms of a morpheme are termed allomorphs.
Typology
•Morphological typology divides languages into groups by characterizing the prevalent
morphological phenomena in those languages.
•It can consider various criteria, and during the history of linguistics, different classifications
have been proposed.
•Let us outline the typology that is based on quantitative relations between words, their morphemes,
and their features:
•Isolating, or analytic, languages include no or relatively few words that would comprise more than one
morpheme (typical members are Chinese, Vietnamese, and Thai; analytic tendencies are also found in
English).
Finding the Structure of Words and Documents
•Synthetic languages can combine more morphemes in one word and are further
divided into agglutinative and fusional languages.
•Agglutinative languages have morphemes associated with only a single function
at a time (as in Korean, Japanese, Finnish, and Tamil, etc.)
•Fusional languages are defined by a feature-per-morpheme ratio higher than one (as in Arabic, Czech, Latin, Sanskrit, German, etc.).
•In accordance with the notions about word formation processes mentioned earlier, we can also distinguish concatenative and nonlinear languages:
•Concatenative languages link morphs and morphemes one after another.
•Nonlinear languages allow structural components to merge nonsequentially to apply tonal morphemes or change the consonantal or vocalic templates of words.
Finding the Structure of Words and Documents
Morphological Typology
•Morphological typology is a way of classifying the languages of the world that groups languages
according to their common morphological structures.
•The field organizes languages on the basis of how those
languages form words by combining morphemes.
•The morphological typology classifies languages into two broad classes of synthetic languages and
analytical languages.
•The synthetic class is then further sub classified as either agglutinative languages or fusional
languages.
•Analytic languages contain very little inflection, instead relying on features like word order and
auxiliary words to convey meaning.
•Synthetic languages, ones that are not analytic, are divided into
two categories: agglutinative and fusional languages.
•Agglutinative languages rely primarily on discrete particles (prefixes, suffixes, and infixes) for inflection, e.g., inter+national = international, international+ize = internationalize.
• While fusional languages "fuse" inflectional categories together, often allowing one word ending to
contain several categories, such that the original root can be difficult to extract (anybody, newspaper).
Lecture-#4
Natural Language Processing
A7707
Introduction
Finding the Structure of Words and Documents
Issues and Challenges
•Irregularity: word forms are not described by a prototypical linguistic model.
•Ambiguity: word forms can be understood in multiple ways out of the context of their discourse.
•Productivity: is the inventory of words in a language finite, or is it unlimited?
•Irregularity: Irregularity in language refers to situations where word forms do not follow the
typical or regular patterns that might be expected based on linguistic rules or models.
•Instead, irregular words have unique or idiosyncratic forms that don't conform to standard
morphological or phonological rules. Here are some key points regarding irregularity in language:
Morphological Irregularity: Irregularity often manifests in the way words are formed. For
example, in English, the past tense of regular verbs is typically formed by adding "-ed" (e.g.,
"walk" becomes "walked"). However, irregular verbs like "go" have unique past tense forms
("went") that do not follow this pattern.
Phonological Irregularity: Irregularity can also occur at the phonological level, affecting the
pronunciation of words. For instance, the pronunciation of some irregularly spelled words in
English may not align with their spelling (e.g., "colonel" is pronounced as "kernel").
Finding the Structure of Words and Documents
Ambiguity: Word structure can introduce ambiguity in language. The same word may have
multiple meanings depending on its context and the affixes used.
For example, the word "unhinged" can mean "mentally unstable" or "detached from its
hinges," illustrating how a single word can have different meanings based on its structure.
Lexical Ambiguity: Lexical ambiguity occurs when a single word has multiple meanings, and
the context does not make it clear which meaning is intended. Examples include:
"Bank" can mean a financial institution or the side of a river.
"Bat" can refer to a flying mammal or a piece of sports equipment.
Syntactic Ambiguity: Syntactic ambiguity arises from the structure or arrangement of words
within a sentence. Different interpretations of the sentence can result from alternative ways
of parsing it. Examples include:
"I saw the man with the telescope." (Did I use the telescope to see the man, or did the man
have the telescope?)
"The chicken is ready to eat." (Is the chicken ready to be eaten, or is it prepared to eat
something else?)
Finding the Structure of Words and Documents
Semantic Ambiguity: Semantic ambiguity occurs when a sentence or phrase is vague or
imprecise due to multiple possible interpretations of its meaning. Examples include:
"She saw the man with glasses." (Does "with glasses" refer to the man wearing glasses or the
glasses themselves?)
"He's a little chicken." (Does "chicken" mean cowardly or a small bird?)
Pragmatic Ambiguity: Pragmatic ambiguity arises from the context or implied meaning of a
statement. It can occur when a statement is technically clear but can be interpreted in different
ways based on the speaker's intention or the situation. Examples include:
"Could you pass the salt?" (Is this a request for someone to pass the salt, or is it a polite way of
saying, "I would like some salt, please"?)
"I am not saying he's a liar." (Is the speaker denying that the person is a liar, or are they implying
something indirectly?)
Homonymy: Homonyms are words that have the same spelling or pronunciation but different
meanings. This can lead to lexical ambiguity. Examples include:
"Bear" (referring to the animal) and "bear" (meaning to carry or endure).
"Tear" (referring to ripping something) and "tear" (referring to a drop of liquid from the eye).
Finding the Structure of Words and Documents
Productivity:
Not all word formations are equally productive in a language. Some affixes are more versatile
and can be applied to a wide range of words, while others have limited applicability.
Understanding which affixes are productive and which are not is important for linguistic
analysis and language learning.
Original word: happy
New word: unhappy (not happy)
Original word: do
New word: undo (reverse the action of doing)
Original word: likely
New word: unlikely (not likely)
Original word: fair
New word: unfair (not fair)
Original word: able
New word: unable (not able)
Original word: pleasant
New word: unpleasant (not pleasant)
Finding the Structure of Words and Documents
Morphological Models
Morphological models in linguistics are theoretical frameworks or approaches that are used to describe and
analyze the structure of words in a language.
These models help linguists understand how words are formed, how they inflect, and how their internal
components (morphemes) contribute to their meanings and functions.
There are several morphological models, each with its own way of categorizing and analyzing linguistic
elements. Here are some of the key morphological models:
•Dictionary Lookup
Dictionary lookup is a practical application of morphological models in the context of language and
linguistics.
While dictionary lookup itself doesn't explicitly employ a specific morphological model, it relies on the
underlying principles of morphology to provide users with information about words and their structures.
Finding the Structure of Words and Documents
Here's how dictionary lookup relates to morphological models:
Word Structure: Morphological models describe how words are structured, including the
identification of morphemes (the smallest meaningful units of language) and their
combinations to form words.
When you perform a dictionary lookup, the information you receive often includes
details about a word's morphological structure. For example, you may learn about the
word's root, affixes, and how they contribute to its meaning.
Inflection and Derivation: Morphological models distinguish between inflectional and
derivational morphemes.
Inflectional morphemes convey grammatical information such as tense, number, and
case, while derivational morphemes create new words or change word categories (e.g.,
noun to verb).
Dictionaries often indicate whether a word has undergone inflectional changes and may
provide derivational information about related words or word forms.
Finding the Structure of Words and Documents
Word Forms: Morphological models address the various forms that a word can take
based on its grammatical or semantic function.
Dictionaries typically list different word forms, including singular and plural forms for
nouns, different tenses for verbs, and comparative/superlative forms for adjectives.
These forms are derived following the principles of word formation outlined by
morphological models.
Lexical Ambiguity: Morphological models help explain the different meanings that a
word can have based on its morphological structure.
When you look up a word in a dictionary, you may encounter multiple definitions or
senses of the word, and understanding the morphological components can aid in
disambiguating these meanings.
Finding the Structure of Words and Documents
Finite-State Morphology
Finite-State Morphology (FSM) is a computational linguistic model used for describing and
generating complex word forms and morphological processes in natural languages. It
employs finite-state automata and transducers to represent the morphological rules and
processes. FSM is particularly useful for languages with relatively simple and regular
morphological systems.
Here's a simplified example of Finite-State Morphology in English, focusing on the
pluralization of nouns:
By finite-state morphological models, we mean those in which the specifications written
by human programmers are directly compiled into finite-state transducers.
The two most popular tools supporting this approach are XFST (Xerox Finite-State Tool) and LexTools.
Finite-state transducers are computational devices extending the power of finite-state automata.
Finding the Structure of Words and Documents
Example: Pluralization in English
In English, many nouns form their plurals by adding the suffix "-s" or "-es" to the singular form. Finite-State Morphology can be
used to model this process.
Singular Nouns: The FSM model starts with a list of singular nouns, represented as follows:
"cat" -> "cat"
"dog" -> "dog"
"book" -> "book"
"house" -> "house"
Pluralization Rules: The FSM model includes rules for pluralization, which are typically defined as finite-state transducers. In this
example, we have two rules:
Rule 1: Add "-s" to the singular form.
Rule 2: If the singular ends with a consonant followed by "y", drop the "y" and add "-ies" (e.g., "city" -> "cities").
Plural Nouns: The FSM applies the pluralization rules to the singular nouns to generate the plural forms:
Applying Rule 1 to "cat" -> "cats"
Applying Rule 1 to "dog" -> "dogs"
Applying Rule 1 to "book" -> "books"
Applying Rule 2 to "city" -> "cities"
Applying Rule 1 to "house" -> "houses"
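The two rules above can be sketched as an ordinary Python function. This is a toy illustration, not an XFST/LexTools transducer; the consonant-plus-"y" rule follows the "city" -> "cities" example and rewrites "y" as "ies".
def pluralize(noun: str) -> str:
    # Rule 2: consonant + "y" -> drop the "y" and add "-ies" (city -> cities)
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    # Rule 1: default case, add "-s"
    return noun + "s"

for w in ["cat", "dog", "book", "house", "city"]:
    print(w, "->", pluralize(w))   # cats, dogs, books, houses, cities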
Finding the Structure of Words and Documents
Input      Morphologically parsed output
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
goose      (goose +N +SG) or (goose +V)
gooses     goose +V +3SG
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)
Finding the Structure of Words and Documents
The FSM model captures the regularity of English pluralization by representing it as a set
of finite-state rules and applying these rules to generate plural forms.
While this example is highly simplified, FSM models can be extended to cover more
complex morphological processes in various languages.
FSM is commonly used in computational linguistics and natural language processing tasks,
such as stemming, lemmatization, and inflection generation, where it can efficiently handle
large sets of word forms and their associated morphological rules.
Finding the Structure of Words and Documents
Unification-Based Morphology
•Unification-Based Morphology (UBM) is a linguistic model that represents morphological
processes using unification, a concept from formal logic and artificial intelligence.
•UBM focuses on the idea that morphological processes can be described as a series of
feature unifications, where features of morphemes come together to form words.
•Morphological parsing P thus associates linear forms φ with alternatives of structured content ψ.
Finding the Structure of Words and Documents
unification can be monotonic (i.e., information-preserving), or it can allow inheritance of
default values and their overriding.
In either case, information in a model can be efficiently shared and reused by means of
inheritance hierarchies defined on the feature structure types.
Unification is the key operation by which feature structures can be merged into a
more informative feature structure.
Unification of feature structures can also fail, which means that the information in
them is mutually incompatible.
Morphological models of this kind are typically formulated as logic programs, and
unification is used to solve the system of constraints imposed by the model.
Advantages of this approach include better abstraction possibilities for developing a
morphological grammar as well as elimination of redundant information from it.
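As a small, hedged illustration of unification, NLTK's FeatStruct class merges compatible feature structures and fails (returns None) on incompatible ones. The feature names below are invented for this example.
import nltk

fs_noun = nltk.FeatStruct(POS='N', NUM='pl')
fs_cat  = nltk.FeatStruct(POS='N', LEMMA='cat')
print(fs_noun.unify(fs_cat))   # merged: [ LEMMA = 'cat', NUM = 'pl', POS = 'N' ]

fs_verb = nltk.FeatStruct(POS='V')
print(fs_noun.unify(fs_verb))  # None -> the information is mutually incompatible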
Finding the Structure of Words and Documents
Functional Morphology
Functional morphology defines its models using principles of functional
programming and type theory.
It treats morphological operations and processes as pure mathematical functions
and organizes the linguistic as well as abstract elements of a model into distinct types
of values and type classes.
Though functional morphology is not limited to modelling particular types of
morphologies in human languages, it is especially useful for fusional morphologies.
Linguistic notions like paradigms, rules and exceptions, grammatical categories and parameters, lexemes, morphemes, and morphs can be represented intuitively (without conscious reasoning; instinctively) and succinctly (in a brief and clearly expressed manner) in this approach.
Functional morphology implementations are intended to be reused as programming
libraries capable of handling the complete morphology of a language and to be
incorporated into various kinds of applications.
Finding the Structure of Words and Documents
Morphological parsing is just one usage of the system, the others being morphological generation, lexicon browsing, and so on.
We can describe inflection I, derivation D, and lookup L as functions of generic types.
Many functional morphology implementations are embedded in a general-purpose
programming language, which gives programmers more freedom with advanced
programming techniques and allows them to develop full-featured, real-world
applications for their models.
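A minimal sketch of the functional-morphology idea in Python: inflection is treated as a pure function from a lemma and grammatical parameters to a surface form. The toy paradigm below is an assumption for illustration, not an implementation from the textbook.
def inflect(lemma: str, number: str) -> str:
    # exceptions are stored as data, regular forms are computed by rule
    irregular = {("mouse", "PL"): "mice", ("goose", "PL"): "geese"}
    if (lemma, number) in irregular:
        return irregular[(lemma, number)]
    if number == "PL":
        return lemma + "es" if lemma.endswith(("s", "x", "ch", "sh")) else lemma + "s"
    return lemma

print(inflect("cat", "PL"))    # cats
print(inflect("mouse", "PL"))  # mice
print(inflect("dog", "SG"))    # dog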
Lecture-#5
Natural Language Processing
A7707
Introduction
Finding the Structure of Words and Documents
2.Finding structure of Documents
1.Introduction
•Finding the structure of documents is a crucial task in document analysis and information
retrieval. Understanding the structure of a document allows for effective organization,
navigation, summarization, and retrieval of information within the document. Here are some
common techniques and approaches for finding the structure of documents:
Document Preprocessing:
Tokenization: Divide the text into individual tokens (words, phrases, sentences) to identify
the basic units of text.
Sentence Segmentation: Split the text into sentences to analyze the document at a more
granular level.
Headings and Sections:
Text Analysis: Use natural language processing (NLP) techniques to analyze the document's
text and identify headings, subheadings, and section titles. These are often characterized by
font size, style, or markup tags (e.g., HTML <h1>, <h2>).
Table of Contents: Extract the table of contents if available in the document.
Finding the Structure of Words and Documents
1.Sentence Boundary Detection
Sentence Boundary Detection (SBD), also known as sentence segmentation or sentence
tokenization, is the process of identifying the boundaries between sentences within a block
of text. Accurate sentence boundary detection is a crucial preprocessing step in natural
language processing (NLP) tasks, as it allows you to analyze and process text at the sentence
level.
In written text in English and some other languages, the beginning of a sentence is usually
marked with an uppercase letter, and the end of a sentence is explicitly marked with a
period(.), a question mark(?), an exclamation mark(!), or another type of punctuation.
Finding the Structure of Words and Documents
Rule-Based Approaches:
Punctuation Rules: One of the simplest methods is to split sentences at punctuation marks such
as periods, question marks, and exclamation points. However, this method may lead to incorrect
splits in cases where periods are used in abbreviations or numbers.
Abbreviation Detection: To address the issue of abbreviations, you can create rules to detect
common abbreviations and avoid splitting sentences on periods within them.
Quotation Marks: Splitting sentences at sentence-final punctuation (e.g., ".", "?", or "!") that falls outside of quotation marks can be effective for many texts (a short rule-based sketch appears at the end of this list).
Machine Learning Approaches:
Supervised Learning: You can train a supervised machine learning model (e.g., a binary
classifier) to predict whether a punctuation mark indicates the end of a sentence. Features
might include the punctuation mark itself, surrounding words, and contextual information.
Conditional Random Fields (CRF): CRFs are a type of probabilistic graphical model that can be
used for sequence labeling tasks like SBD. They consider the entire sequence of tokens when
making segmentation decisions.
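Returning to the rule-based approach above, here is a minimal sketch that splits at sentence-final punctuation while skipping a small abbreviation list. The abbreviation set and the whitespace-based heuristic are illustrative assumptions only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # split after ".", "?" or "!" unless the token is a known abbreviation
        if token.endswith((".", "?", "!")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived late. Did you see him? Yes!"))
# ['Dr. Smith arrived late.', 'Did you see him?', 'Yes!']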
Finding the Structure of Words and Documents
Deep Learning Approaches:
Recurrent Neural Networks (RNNs): RNNs, particularly Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU)
variants, can be used to model sequences of words and predict sentence boundaries based on the context.
Transformer Models: State-of-the-art models like BERT and GPT can be fine-tuned for sentence boundary detection as
they have a strong understanding of contextual information.
Lexical Clues:
Capitalization: Sentences often start with a capital letter. You can use capitalization rules to detect sentence boundaries.
Numbers: Sentences often start after numbers, especially when followed by punctuation.
Language-Specific Approaches:
Languages may have specific grammatical rules that can be leveraged for SBD. For example, in languages like Japanese and
Chinese, sentence boundaries are not indicated by spaces but by specific characters.
Third-Party Tools:
There are many NLP libraries and tools (e.g., NLTK, spaCy) that include pre-trained models and functions for sentence
boundary detection. These tools often combine rule-based and machine learning techniques.
Evaluation:
Once you implement an SBD algorithm, it's important to evaluate its performance using metrics like precision, recall, and
F1-score on a labeled dataset to ensure its accuracy.
Handling Edge Cases:
Consider handling edge cases, such as abbreviations, ellipses, and non-standard punctuation, to improve the robustness of
your SBD algorithm.
Finding the Structure of Words and Documents
1.Topic Boundary Detection
Topic boundary detection is a natural language processing (NLP) task that involves
identifying points within a text where the topic or subject matter changes.
Detecting topic boundaries is essential for various NLP applications, such as text
summarization, content segmentation, document classification, and information
retrieval. Here are some techniques and approaches for topic boundary detection:
Keyword-Based Approaches:
Identify keywords or key phrases that indicate a change in topic. These keywords can be
domain-specific or general words that typically signal topic shifts (e.g., "however," "on
the other hand," "in conclusion").
Term Frequency and Inverse Document Frequency (TF-IDF):
Calculate TF-IDF scores for terms in the document. A sudden increase in the TF-IDF
score of a term may signal a topic boundary. A drop in TF-IDF values can also be
indicative of a topic shift.
Finding the Structure of Words and Documents
1.Topic Boundary Detection
Cosine Similarity:
Measure the cosine similarity between document segments or paragraphs. A significant drop in similarity between consecutive segments may indicate a topic change (a short sketch follows this list).
Supervised Learning:
Train a supervised machine learning model, such as a binary classifier, to predict topic
boundaries based on labeled data. Features for the model might include linguistic cues,
such as punctuation, sentence length, and word usage.
Clustering:
Use clustering algorithms (e.g., k-means clustering) to group similar paragraphs or
sentences together. Changes in cluster assignments can indicate topic boundaries.
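A minimal sketch of the TF-IDF and cosine-similarity ideas above using scikit-learn. The sample paragraphs and the 0.1 threshold are illustrative assumptions; a large drop in similarity between consecutive paragraphs is flagged as a possible topic boundary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "The football team won the league after a dramatic final match.",
    "The striker scored twice and the goalkeeper saved a penalty.",
    "Quarterly revenue grew by ten percent, beating market forecasts.",
    "The company plans to reinvest the profits in new data centres.",
]

tfidf = TfidfVectorizer().fit_transform(paragraphs)
for i in range(len(paragraphs) - 1):
    sim = cosine_similarity(tfidf[i], tfidf[i + 1])[0, 0]
    flag = "<-- possible topic boundary" if sim < 0.1 else ""
    print(f"paragraphs {i} and {i + 1}: similarity = {sim:.2f} {flag}")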
Lecture-#6
Natural Language Processing
A7707
Introduction
Finding the Structure of Words and Documents
2.2. Methods
Sentence segmentation and topic segmentation have been considered as a boundary
classification problem.
Given a boundary candidate (between two word tokens for sentence segmentation and between two sentences for topic segmentation), the goal is to predict whether or not the candidate is an actual boundary (sentence or topic boundary).
Formally, let x ∈ X be the vector of features (the observation) associated with a candidate and y ∈ Y be the label predicted for that candidate.
The label y can be b for boundary and b̄ for non-boundary.
Classification problem: given a set of training examples (x, y)_train, find a function that will assign the most accurate possible label y to unseen examples x_unseen.
Alternatively to the binary classification problem, it is possible to model boundary types using finer-grained categories.
Sentence segmentation in text can then be framed as a three-class problem: sentence boundary with an abbreviation (b_a), sentence boundary without an abbreviation (b_ā), and abbreviation not at a boundary (b̄_a).
Finding the Structure of Words and Documents
Generative Sequence Classification Methods:
Generative sequence classification methods are a class of machine learning techniques that
combine elements of sequence generation and classification tasks.
These methods aim to generate sequences while also assigning a class label or category to
each sequence.
This can be particularly useful in various natural language processing (NLP) and
bioinformatics tasks, among others.
Generative sequence classification methods find applications in various domains, including
natural language processing, bioinformatics, speech recognition, and more.
The choice of method depends on the specific task and dataset, as well as the available
computational resources and desired model performance.
Finding the Structure of Words and Documents
Generative Sequence Classification Methods:
Generative models: to find the conditional probability P(Y|X), they estimate the prior probability P(Y) and the likelihood P(X|Y) from the training data, and then use Bayes' theorem to calculate the posterior probability P(Y|X):
P(Y|X) = P(X|Y) · P(Y) / P(X)
Generative models are considered a class of statistical models that can generate new data
instances. These models are used in unsupervised machine learning as a means to perform
tasks such as:
Probability and Likelihood estimation,
Modeling data points
To describe the phenomenon in data,
To distinguish between classes based on these probabilities.
Because these models capture the joint probability of inputs and labels, generative models tackle a more general (and typically harder) task than analogous discriminative models.
Finding the Structure of Words and Documents
The generative approach focuses on the distribution of the individual classes in a dataset, and the learning algorithms tend to model the underlying patterns or distribution of the data points (e.g., a Gaussian distribution).
These models use the concept of joint probability: they model situations in which a given feature or input (x) and the desired output or label (y) occur together.
Generative models use probability estimates and likelihood to model data points and to differentiate between the class labels present in a dataset. Unlike discriminative models, they can also generate new data points.
Finding the Structure of Words and Documents
Training generative classifiers involves estimating a function f: X -> Y, or probability P(Y|X):
Assume some functional form for the probabilities, such as P(Y) and P(X|Y)
With the help of training data, estimate the parameters of P(X|Y) and P(Y)
Use Bayes' theorem to calculate the posterior probability P(Y|X)
Examples of Generative Models
Naïve Bayes
Bayesian networks
Markov random fields
Hidden Markov Models (HMMs)
Latent Dirichlet Allocation (LDA)
Generative Adversarial Networks (GANs)
Autoregressive Model
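As a concrete toy illustration of the first model in the list above, the sketch below trains a Naïve Bayes classifier on a handful of invented boundary-candidate feature vectors: it estimates P(y) and P(x|y) from the data and applies Bayes' theorem to obtain the posterior P(y|x).

```python
# Sketch: a Naive Bayes (generative) boundary classifier on toy features.
# Training data is invented; labels are "b" (boundary) and "nb" (non-boundary).
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

X_train = [
    {"ends_with_period": True,  "next_is_capitalised": True},
    {"ends_with_period": True,  "next_is_capitalised": False},
    {"ends_with_period": False, "next_is_capitalised": True},
    {"ends_with_period": True,  "next_is_capitalised": True},
]
y_train = ["b", "nb", "nb", "b"]

vec = DictVectorizer()
clf = BernoulliNB().fit(vec.fit_transform(X_train), y_train)

x_new = {"ends_with_period": True, "next_is_capitalised": True}
print(clf.predict(vec.transform([x_new])))        # predicted label
print(clf.predict_proba(vec.transform([x_new])))  # posterior P(y|x)
```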
Finding the Structure of Words and Documents
Discriminative Local Classification Methods:
Discriminative local classification methods are a class of machine learning techniques
that focus on making decisions or predictions for individual data points within a larger
dataset.
These methods aim to classify data points into specific categories or classes based on
their features while considering the local context around each data point.
The discriminative model refers to a class of models used in Statistical Classification,
mainly used for supervised machine learning.
These types of models are also known as conditional models since they learn the
boundaries between classes or labels in a dataset.
These discriminative local classification methods are used in a wide range of applications,
including image classification, text classification, anomaly detection, and more.
The choice of method depends on the nature of the data, the problem at hand, and the
trade-offs between model complexity and performance.
Finding the Structure of Words and Documents
Training discriminative classifiers (discriminant analysis) involves estimating a function f: X -> Y, or probability P(Y|X):
Assume some functional form for the probability, such as P(Y|X)
With the help of training data, estimate the parameters of P(Y|X)
Examples of Discriminative Models
Logistic regression
Support vector machines (SVMs)
Traditional neural networks
Nearest neighbor
Conditional Random Fields (CRFs)
Decision Trees and Random Forest
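For contrast with the generative sketch earlier, the toy example below fits one of the models listed above, logistic regression, on the same style of invented boundary features; it models the conditional probability P(y|x) directly rather than estimating P(y) and P(x|y).

```python
# Sketch: logistic regression (discriminative) models P(y|x) directly.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

X_train = [
    {"ends_with_period": True,  "next_is_capitalised": True},
    {"ends_with_period": True,  "next_is_capitalised": False},
    {"ends_with_period": False, "next_is_capitalised": True},
    {"ends_with_period": True,  "next_is_capitalised": True},
]
y_train = ["b", "nb", "nb", "b"]  # invented labels: boundary / non-boundary

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_train), y_train)

x_new = {"ends_with_period": True, "next_is_capitalised": True}
print(clf.predict(vec.transform([x_new])))
print(clf.predict_proba(vec.transform([x_new])))  # conditional P(y|x)
```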
Finding the Structure of Words and Documents
Discriminative Sequence Classification Methods:
In segmentation tasks, the sentence or topic decision for a given example (word, sentence, or paragraph) depends strongly on the decisions made for the examples in its vicinity.
Discriminative sequence classification methods are, in general, extensions of local discriminative models with an additional decoding stage that finds the best assignment of labels by taking neighbouring decisions into account.
Conditional random fields (CRFs) are extensions of maximum entropy models, SVMstruct is an extension of SVMs, and maximum-margin Markov networks (M3N) are extensions of HMMs.
CRFs are a class of log-linear models for labelling structures.
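The sketch below illustrates the idea with the third-party sklearn-crfsuite package (assumed installed via pip install sklearn-crfsuite); the token features, the single training sequence, and its labels are all invented for illustration.

```python
# Sketch: a CRF labels every token in a sequence jointly, so neighbouring
# boundary decisions influence each other (requires: pip install sklearn-crfsuite).
import sklearn_crfsuite

def token_features(tokens, i):
    return {
        "word": tokens[i].lower(),
        "ends_with_period": tokens[i].endswith("."),
        "next_is_capitalised": tokens[i + 1][:1].isupper() if i + 1 < len(tokens) else False,
    }

tokens = ["Dr.", "Smith", "arrived", "late", ".", "He", "apologised", "."]
labels = ["nb", "nb", "nb", "nb", "b", "nb", "nb", "b"]  # invented gold labels

X_train = [[token_features(tokens, i) for i in range(len(tokens))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train)[0])  # per-token boundary / non-boundary labels
```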
Finding the Structure of Words and Documents
Hybrid Approaches:
Hybrid approaches in Natural Language Processing (NLP) classifications refer to techniques that
combine multiple methods or models to improve the performance of NLP classification tasks.
These methods often leverage the strengths of different approaches to address the limitations
of individual models. Here are some common hybrid approaches in NLP classifications:
Ensemble Methods:
Ensemble methods combine predictions from multiple base models to make a final
classification decision.
Bagging (Bootstrap Aggregating) methods like Random Forests combine multiple decision
trees.
Boosting methods like AdaBoost and Gradient Boosting combine weak learners to create a
strong classifier.
These ensemble methods can improve classification accuracy and robustness.
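A minimal sketch of one such ensemble, a Random Forest (bagging over decision trees) trained on TF-IDF features, is shown below; the four training texts and their labels are invented.

```python
# Sketch: a Random Forest (bagging ensemble of decision trees) over TF-IDF features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "great match and a brilliant goal",         # invented "sports" examples
    "the team won the championship",
    "interest rates rose again this quarter",   # invented "finance" examples
    "markets fell after the inflation report",
]
labels = ["sports", "sports", "finance", "finance"]

model = make_pipeline(TfidfVectorizer(),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(texts, labels)
print(model.predict(["the striker scored in the final"]))
```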
Finding the Structure of Words and Documents
Hybrid Approaches:
Feature Combination:
Hybrid models can combine features from different sources or representations to enhance classification
performance.
For example, combining word embeddings with hand-crafted features or syntactic features can capture both
semantic and structural information.
Multi-Modal Fusion:
In some NLP tasks, information comes from multiple modalities, such as text, images, or audio.
Hybrid models can fuse information from these modalities to make better classifications. For instance,
combining textual and visual information for sentiment analysis of multimedia content.
Neural Network Ensembles:
Combining multiple neural networks with different architectures or initialization weights can lead to
improved performance.
Techniques like stacking or blending neural network predictions can be effective in hybrid approaches.
Transfer Learning and Pretraining:
Transfer learning models like BERT, GPT, and RoBERTa have become popular in NLP.
Hybrid models can fine-tune these pretrained models for specific classification tasks, achieving state-of-the-
art results with less training data.
Finding the Structure of Words and Documents
Hybrid Approaches:
Rule-Based and Machine Learning Hybrid:
Combining rule-based systems with machine learning models can leverage domain knowledge while
benefiting from data-driven approaches.
For instance, using regular expressions or handcrafted rules to preprocess text data before feeding it into a machine learning classifier (a short sketch follows at the end of this slide).
Hierarchical Approaches:
Hybrid models can use a hierarchy of classifiers, where higher-level classifiers make decisions based on the
outputs of lower-level classifiers.
This approach is useful for complex classification tasks, where different aspects of the problem can be tackled
independently.
Sequential and Tree-Based Models:
Hybrid models can combine sequential models like Recurrent Neural Networks (RNNs) with tree-based
models like Random Forests or Gradient Boosting.
This approach can be useful for problems that involve both sequential and non-sequential information.
Hybrid approaches in NLP classifications are often problem-specific and depend on the nature of the data and
the task. They offer flexibility and the potential for significant performance improvements by leveraging the
complementary strengths of different techniques and models.
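The toy sketch below illustrates the rule-based plus machine-learning combination mentioned above: regular-expression rules normalise URLs and digit runs before a TF-IDF + logistic regression classifier is trained. The rules, texts, and labels are all invented.

```python
# Sketch: rule-based preprocessing (regular expressions) feeding an ML classifier.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def clean(text):
    text = re.sub(r"https?://\S+", " URL ", text)  # rule: replace URLs with a placeholder
    text = re.sub(r"\d+", " NUM ", text)           # rule: replace digit runs
    return text.lower()

texts = ["Win $1000 now at http://spam.example!",
         "Meeting moved to room 204 tomorrow",
         "Claim your 500 free coins at http://bad.example",
         "Lunch at noon with the team"]
labels = ["spam", "ham", "spam", "ham"]  # invented labels

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit([clean(t) for t in texts], labels)
print(model.predict([clean("Get 999 coins at http://offer.example")]))
```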
Finding the Structure of Words and Documents
Extensions for Global Modeling for Sentence Segmentation:
In Natural Language Processing (NLP), sentence segmentation, also known as sentence
boundary detection or sentence tokenization, is the process of identifying and splitting a text into
individual sentences.
While basic sentence segmentation methods often rely on simple heuristics or punctuation
cues, more advanced techniques employ global modeling and extensions to improve accuracy
and handle complex cases.
These extensions for global modeling in sentence segmentation aim to handle various
linguistic complexities, including abbreviations, quotations, nested structures, and different
writing styles.
The choice of method and extension depends on the specific requirements of the NLP task and
the characteristics of the text data being processed.
Finding the Structure of Words and Documents
Complexity of the Approaches
In Natural Language Processing (NLP), the complexity of approaches refers to the
computational, algorithmic, and resource requirements associated with different methods and
models used for various NLP tasks.
The complexity can vary significantly depending on the specific approach, the size of the
dataset, the complexity of the language, and the available computational resources.
Model Architecture:
The choice of model architecture plays a significant role in complexity. For example, simple models like logistic regression
have lower computational demands compared to deep neural networks such as Transformers.
Model Size:
Larger models with more parameters tend to be more complex. Pretrained models like GPT-3 and BERT, which have
hundreds of millions or even billions of parameters, require substantial computational resources.
Training Data Size:
The amount of training data affects the complexity of model training. Larger datasets often require more computational
power and time to train effectively.
Hyperparameter Tuning:
Finding the optimal hyperparameters for a model can involve a time-consuming and computationally intensive search,
especially for complex models.
Finding the Structure of Words and Documents
Tokenization and Sequence Length:
Tokenization, or splitting text into smaller units (e.g., words or subwords), can impact model complexity. Handling long
sequences may require special techniques or models designed for long-range dependencies.
Data Preprocessing:
Data preprocessing steps such as text cleaning, feature extraction, and handling imbalanced datasets can add complexity to
an NLP pipeline.
Fine-Tuning:
Fine-tuning pretrained models for specific tasks can be complex, as it involves adjusting model weights on a task-specific
dataset while retaining knowledge from the pretrained model.
Decoding:
In sequence generation tasks like text summarization or machine translation, decoding the output sequence can be
computationally intensive, especially for beam search or sampling-based methods.
Parallelization:
Parallelizing model training and inference across multiple GPUs or TPUs can mitigate complexity by reducing the time
required for computation.
Resource Requirements:
Complex models often demand powerful hardware resources, including GPUs or TPUs, to achieve reasonable training and
inference times.
Finding the Structure of Words and Documents
Interpretability:
As models become more complex, their interpretability may decrease. Understanding and interpreting the decisions made
by complex models can be challenging.
Scaling Up:
To achieve state-of-the-art performance, researchers often resort to scaling up models, which can lead to increased
complexity and resource requirements.
Regularization:
To prevent overfitting in complex models, regularization techniques may be needed, adding complexity to the training
process.
Deployment Considerations:
Deploying complex NLP models in real-world applications may require optimizations for latency, memory usage, and
scalability.
Finding the Structure of Words and Documents
Performance of approaches:
The performance of approaches in Natural Language Processing (NLP) can vary significantly
depending on the specific task, the choice of algorithms, models, and data.
NLP tasks encompass a wide range of applications, and performance is typically evaluated
using various metrics tailored to each task.
Here are some key performance metrics and factors influencing the performance of NLP
approaches:
Accuracy: Accuracy measures the proportion of correctly predicted instances in classification
tasks. While accuracy is a common metric, it may not be appropriate for imbalanced datasets.
Precision, Recall, and F1-Score: Precision measures the fraction of true positives among
predicted positives, while recall measures the fraction of true positives among actual positives.
The F1-score is the harmonic mean of precision and recall, providing a balanced measure of
performance.
Mean Squared Error (MSE): MSE is often used to evaluate regression tasks, measuring the
average squared difference between predicted and actual values.
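A minimal sketch of computing these metrics with scikit-learn follows; the gold and predicted label vectors are invented for a binary boundary-detection task.

```python
# Sketch: accuracy, precision, recall, and F1 for a binary boundary-detection task
# (y_true / y_pred are invented for illustration).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["b", "nb", "b", "b", "nb", "nb", "b", "nb"]
y_pred = ["b", "nb", "nb", "b", "nb", "b", "b", "nb"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label="b", average="binary"
)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision, "recall:", recall, "F1:", f1)
```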
Finding the Structure of Words and Documents
Features:
In Natural Language Processing (NLP), features are essential components used to represent
text and speech data for various tasks, including classification, recognition, and generation.
Many features can be applied to both text and speech, as both modalities share certain
common characteristics. Here are some features that are commonly used for both text and
speech in NLP:
Bag of Words (BoW):
BoW represents text by creating a vocabulary of unique words in the corpus and counting their
occurrences in each document.
For speech, BoW can be applied by transcribing spoken language into text and then using text-
based BoW features.
Term Frequency-Inverse Document Frequency (TF-IDF):
TF-IDF represents the importance of a term in a document relative to a corpus.
It can be used for both text and transcribed speech data to capture the importance of
words or terms.
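The short sketch below contrasts raw Bag-of-Words counts with TF-IDF weights on a toy two-document corpus (the documents are invented).

```python
# Sketch: Bag-of-Words counts vs. TF-IDF weights on a toy two-document corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # shared vocabulary
print(counts.toarray())              # raw word counts per document

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))    # terms shared by all documents receive lower weight
```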
Finding the Structure of Words and Documents
Word Embeddings:
Word embeddings like Word2Vec and FastText represent words in a continuous vector space, capturing semantic relationships.
Word embeddings can be used for both text and transcribed speech data to capture word-level semantics.
n-grams:
n-grams represent sequences of n contiguous words or characters.
They are commonly used in text for capturing local context and can also be applied to transcribed speech data (a short sketch appears at the end of this slide).
Part-of-Speech (POS) Tags:
POS tags represent the grammatical category of each word in a sentence.
They are used in both text and speech processing to analyze syntactic properties.
Phonetic Features:
For speech data, phonetic features represent acoustic characteristics of speech sounds, such as pitch, duration, and
formants.
These features can be used for tasks like speech recognition and emotion detection.
Spectrogram Features:
Spectrogram features represent the time-frequency distribution of audio signals.
They are used for both speech and environmental sound analysis.
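The sketch below illustrates two of the features above, word n-grams and POS tags, using NLTK. The sentence is invented, and the names of the downloadable resources can differ between NLTK versions.

```python
# Sketch: word bigrams and POS tags with NLTK (models are downloaded once;
# resource names may differ across NLTK versions).
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk import word_tokenize, pos_tag
from nltk.util import ngrams

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(list(ngrams(tokens, 2)))  # bigrams: ("The", "quick"), ("quick", "brown"), ...
print(pos_tag(tokens))          # e.g. ("fox", "NN"), ("jumps", "VBZ")
```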
Finding the Structure of Words and Documents
Sentiment Lexicons:
Sentiment lexicons provide sentiment scores for words.
They are used in both text and speech sentiment analysis tasks.
Topic Models:
Topic modeling techniques like Latent Dirichlet Allocation (LDA) can be applied to both text and transcribed speech to
identify latent topics in the data.
Syntactic Dependency Features:
Syntactic dependency features represent grammatical relationships between words in a sentence.
They are used in both text parsing and spoken language understanding.
Emotion Features:
Emotion features capture emotional content in both text and speech data, helping with tasks like emotion detection or
sentiment analysis.
Text-to-Speech (TTS) Features:
TTS features represent acoustic properties of synthesized speech.
They are used in TTS systems for generating natural-sounding speech.
Prosodic Features:
Prosodic features capture suprasegmental aspects of speech, such as intonation, rhythm, and stress patterns.
They are used in various speech processing tasks, including emotion recognition and speech synthesis.
Finding the Structure of Words and Documents
Processing Stages:
Processing stages in Natural Language Processing (NLP) refer to the steps or phases involved in
the analysis and understanding of natural language text or speech data.
These stages are typically sequential and aim to transform raw language input into structured
information that can be used for various NLP tasks.
These processing stages form the foundation of many NLP applications, including text
classification, machine translation, sentiment analysis, question answering, and more. The choice
of stages and techniques depends on the specific NLP task and goals.
Text Acquisition:
The process begins with acquiring raw text or speech data from various sources, such as
documents, websites, audio recordings, or chat logs.
Finding the Structure of Words and Documents
Text Preprocessing: In this stage, the raw data is cleaned and standardized (a short sketch of several of these steps follows at the end of this slide). Common preprocessing steps include:
Tokenization: Splitting text into words or subword units (e.g., by spaces or punctuation).
Lowercasing: Converting all text to lowercase for uniformity.
Stop Word Removal: Eliminating common words (e.g., "and," "the") that do not carry significant
meaning.
Stemming or Lemmatization: Reducing words to their base or root form (e.g., "running" to "run").
Spell Correction: Fixing typographical errors and misspellings.
Noise Removal: Eliminating irrelevant characters or symbols.
Text Representation: In this stage, text or speech data is transformed into numerical or symbolic
representations that machine learning models can work with. Common representations include:
Bag of Words (BoW): Representing text as a matrix of word frequencies.
Term Frequency-Inverse Document Frequency (TF-IDF): Assigning weights to words based on their
importance in documents.
Word Embeddings: Representing words as continuous vectors in a vector space.
Phonemes: Representing speech as sequences of phonemes, the smallest sound units in a language.
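A minimal sketch chaining several of the preprocessing steps above with NLTK is given below. The sentence is invented, and note that the WordNet lemmatizer treats every word as a noun unless a POS tag is supplied.

```python
# Sketch: tokenize -> lowercase -> remove stop words -> lemmatize.
import nltk
for res in ("punkt", "stopwords", "wordnet"):
    nltk.download(res, quiet=True)

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "The runners were running quickly through the old city streets."
stop = set(stopwords.words("english"))

tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]  # tokenize, lowercase, drop punctuation
tokens = [t for t in tokens if t not in stop]                     # stop word removal
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]       # lemmatize (default: noun)
print(lemmas)  # e.g. ['runner', 'running', 'quickly', 'old', 'city', 'street']
# Supplying POS tags (e.g. pos="v" for "running") would further reduce verbs to "run".
```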
Finding the Structure of Words and Documents
Comparison: Generative vs. Discriminative Models
Purpose: Generative models model the data distribution; discriminative models model the conditional probability of labels given the data.
Use Cases: Generative - data generation, denoising, unsupervised learning; Discriminative - classification, supervised learning tasks.
Common Examples: Generative - Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs); Discriminative - Logistic Regression, Support Vector Machines, Deep Neural Networks.
Training Focus: Generative - maximize the likelihood of the observed data, capture data structure; Discriminative - learn the decision boundary, differentiate between classes.
Example Task: Generative - image generation, inpainting (e.g., GANs, VAEs); Discriminative - text classification, object detection (e.g., Deep Neural Networks).