Module1_NLP_BAD613B_Notes
Course objectives:
• Learn the importance of natural language modelling.
• Understand the applications of natural language processing.
• Study spelling, error detection and correction methods and parsing techniques in NLP.
• Illustrate the information retrieval models in natural language processing.
Module-1
Introduction: What is Natural Language Processing? Origins of NLP, Language and
Knowledge, The Challenges of NLP, Language and Grammar, Processing Indian Languages,
NLP Applications.
Language Modeling: Statistical Language Model - N-gram model (unigram, bigram),
Paninian Framework, Karaka theory.
Textbook 1: Ch. 1, Ch. 2.
Module-1
Textbook 1: Tanveer Siddiqui, U.S. Tiwary, “Natural Language Processing and Information
Retrieval”, Oxford University Press. Ch. 1, Ch. 2.
1. INTRODUCTION
There are two major approaches to natural language processing:
1. Rationalist approach
2. Empiricist approach
Rationalist approach: An early approach that assumes the existence of some language faculty in the human brain. Supporters of this approach argue that it is not possible to learn a complex thing like natural language from limited sensory inputs.
Empiricist approach: Empiricists do not believe in the existence of a language faculty. They believe in the existence of some general organization principles such as pattern recognition, generalization, and association. Learning of detailed structures takes place through the application of these principles on the sensory inputs available to the child.
Computational linguistics: is similar to theoretical- and psycho-linguistics, but uses different tools.
While theoretical linguistics is more about the structural rules of language, psycho-linguistics focuses on
how language is used and processed in the mind.
Theoretical linguistics explores the abstract rules and structures that govern language. It investigates
universal grammar, syntax, semantics, phonology, and morphology. Linguists create models to explain
how languages are structured and how meaning is encoded. Eg. Most languages have constructs like noun
and verb phrases. Theoretical linguists identify rules that describe and restrict the structure of languages
(grammar).
Psycho-linguistics focuses on the psychological and cognitive processes involved in language use. It
examines how individuals acquire, process, and produce language. Researchers study language
development in children and how the brain processes language in real-time. Eg. Studying how children
acquire language, such as learning to form questions ("What’s that?").
With the unprecedented amount of information now available on the web, NLP has become one of the leading techniques for processing and retrieving information.
Information retrieval includes a number of information processing applications such as information
extraction, text summarization, question answering, and so forth. It includes multiple modes of
information, including speech, images, and text.
Semantic Knowledge
• Finding out the correct meaning of a particular use of a word is necessary for finding the meaning of larger units.
• Eg. Kabir and Ayan are married.
Kabir and Suha are married.
• Syntactic structure and compositional semantics fail to explain these interpretations.
• This means that semantic analysis requires pragmatic knowledge besides semantic and syntactic
knowledge.
• Pragmatics helps us understand how meaning is influenced by context, social factors, and
speaker intentions.
Anaphoric Reference
• Pragmatic knowledge may be needed for resolving anaphoric references.
Example: The district administration refused to give the trade union
permission for the meeting because they feared violence. (a)
Phrase structure grammar consists of rules that generate natural language sentences and assign a
structural description to them. As an example, consider the following set of rules:
Transformation rules transform one phrase-marker (underlying) into another phrase-marker (derived).
These rules are applied on the terminal string generated by phrase structure rules, transforming one
representation into another, e.g., an active sentence into a passive one.
Consider the active sentence: “The police will catch the snatcher.”
The application of phrase structure rules will assign the structure shown in Fig 2 (a)
Morphophonemic Rule: Another transformational rule will then reorder 'en + catch' to 'catch + en' and
subsequently one of the morphophonemic rules will convert 'catch + en' to 'caught'.
Note: Long-distance dependency refers to syntactic phenomena where a verb and its subject or object can
be arbitrarily far apart. Wh-movement is a specific case of this type of dependency.
E.g.
"I wonder who John gave the book to" involves a long-distance dependency between "who" and the verb
"gave" (whose object it is). Even though "who" is not directly adjacent to the verb, the syntactic
relationship between them is still clear.
The problem in the specification of appropriate phrase structure rules occurs because these phenomena
cannot be localized at the surface structure level.
Paninian grammar provides a framework for Indian language models. These can be used for
computation of Indian languages. The grammar focuses on extraction of relations from a
sentence.
The first SysTran machine translation system was developed in 1969 for Russian-English translation.
SysTran also provided the first online machine translation service, called Babel Fish, which was used by
the AltaVista search engine for handling translation requests from users.
TAUM METEO: This is a natural language generation system used in Canada to generate weather reports. It accepts
daily weather data and generates weather reports in English and French.
SHRDLU: This is a natural language understanding system that simulates the actions of a robot in a blocks world
domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the
robot to manipulate the blocks, to describe the block configurations, and to explain its reasoning.
LUNAR: This was an early question answering system that answered questions about moon rocks.
The availability of vast amounts of electronic text has made it challenging to find relevant
information. Information retrieval (IR) systems aim to address this issue by providing efficient access to
relevant content. Unlike 'entropy' in communication theory, which measures uncertainty, information
here refers to the content or subject matter of text, not digital communication or data transmission. Words
serve as carriers of information, and text is seen as the message encoded in natural language.
An information retrieval system deals with organizing, storing, retrieving, and evaluating information that matches a query, and it works with
unstructured data. Retrieval is based on content, not structure, and systems typically return a ranked list
of relevant documents.
IR has been integrated into various systems, including database management systems, bibliographic
retrieval systems, question answering systems, and search engines. Approaches for accessing large text
collections fall into two categories: one builds topic hierarchies (e.g., Yahoo), requiring manual
classification of new documents, which is costly; the other ranks documents by relevance,
offering greater scalability and efficiency for large collections.
Major issues in designing and evaluating Information Retrieval (IR) systems include selecting
appropriate document representations. Current models often use keyword-based representation, which
suffers from problems like polysemy, homonymy, and synonymy, as well as ignoring semantic and
contextual information. Additionally, vague or inaccurate user queries lead to poor retrieval performance,
which can be addressed through query modification or relevance feedback.
2. LANGUAGE MODELLING
To create a general model of any language is a difficult task. There are two approaches for language
modelling.
2.1 Introduction
Our purpose is to understand and generate natural languages from a computational viewpoint.
1st approach: Try to understand every word and sentence of it, and then come to a conclusion (has not
succeeded).
2nd approach: To study the grammar of various languages, compare them, and if possible, arrive at
reasonable models that facilitate our understanding of the problem and designing of natural-language
tools.
Language Model: A model is a description of some complex entity or process. Natural language is a
complex entity and in order to process it through a computer-based program, we need to build a
representation (model) of it.
Two categories of language modelling approaches: grammar-based and statistical.
Grammar-based language model:
Eg. A sentence usually consists of a noun phrase and a verb phrase. The grammar-based approach attempts
to utilize this structure as well as the relationships between these structures.
Government and Binding (GB) theory renames surface structure and deep structure as s-level and d-level, adding phonetic and
logical forms as parallel levels of representation for analysis, as shown in the figure.
• 'Meaning' in a 'sound' form is represented as logical form (LF) and phonetic form (PF) in the above
figure.
• The GB is concerned with LF, rather than PF.
• The GB imagines that if we define rules for structural units at the deep level, it will be possible
to generate any language with fewer rules.
Components of GB
• Government and binding (GB) comprise a set of theories that map the structures from d-structure
to s-structure and to logical form (LF).
• A general transformational rule called 'Move 𝛼' is applied at d-structure level as well as at s-
structure level.
• In its simplest form, GB can be represented as shown below.
GB consists of 'a series of modules that contain constraints and principles' applied at various
levels of its representations and the transformation rule, Move α.
The GB considers all three levels of representations (d-, s-, and LF) as syntactic, and LF is also
related to meaning or semantic-interpretive mechanisms.
GB applies the same Move 𝛼 transformation to map d-levels to s-levels or s-levels to LF level.
LF level helps in quantifier scoping and also in handling various sentence constructions such as passive
or interrogative constructions.
Example:
Consider the sentence: “Two countries are visited by most travellers.”
Its two possible logical forms are:
LF1: [S two countries are visited by [NP most travellers]]
LF2 (applying Move α): [NP most travellers]_i [S two countries are visited by e_i]
• In LF1, the interpretation is that most travellers visit the same two countries (say, India and
China).
• In LF2, when we move [most travellers] outside the scope of the sentence, the interpretation can
be that most travellers visit two countries, which may be different for different travellers.
• One of the important concepts in GB is that of constraints. It is the part of the grammar which
prohibits certain combinations and movements; otherwise Move α can move anything to any
possible position.
• Thus, GB is basically the formulation of theories or principles which create constraints to
disallow the construction of ill-formed sentences.
The organization of GB is as given below:
X̄ Theory:
• The X̄ Theory (pronounced X-bar theory) is one of the central concepts in GB. Instead of defining
several phrase structures and the sentence structure with separate sets of rules, X̄ Theory defines
them both as maximal projections of some head.
• Noun phrase (NP), verb phrase (VP), adjective phrase (AP), and prepositional phrase (PP) are
maximal projections of noun (N), verb (V), adjective (A), and preposition (P) respectively, and
can be represented as head X of their corresponding phrases (where X = {N, V, A, P}).
• Even the sentence structure can be regarded as the maximal projection of inflection (INFL).
• The GB envisages projections at two levels:
• the projection of the head at the semi-phrasal level, denoted by X̄ (X-bar), and
• the maximal projection at the phrasal level, denoted by X̿ (X-double-bar).
Sub-categorization: It refers to the process of classifying words or phrases (typically verbs) according
to the types of arguments or complements they can take. It's a form of syntactic categorization that is
important for understanding the structure and meaning of sentences.
For example, different verbs in English can have different sub-categorization frames (also called
argument structures). A verb like "give" might take three arguments (subject, object, and indirect object),
while a verb like "arrive" might only take a subject and no objects.
"He gave her a book." ("gave" requires a subject, an indirect object, and a direct object)
In principle, any maximal projection can be the argument of a head, but sub-categorization is used as a
filter to permit various heads to select a certain subset of the range of maximal projections.
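As an illustration (a hypothetical sketch, not the textbook's notation), a sub-categorization lexicon can be represented in Python as a mapping from verbs to the argument frames they accept:

# Hypothetical sub-categorization lexicon: each verb lists the argument frames it allows.
SUBCAT = {
    "give":   [("SUBJ", "IOBJ", "OBJ")],            # He gave her a book.
    "arrive": [("SUBJ",)],                          # The train arrived.
    "eat":    [("SUBJ",), ("SUBJ", "OBJ")],         # He ate. / He ate an apple.
}

def licenses(verb, args):
    """Return True if the verb's sub-categorization frames allow this argument pattern."""
    return tuple(args) in SUBCAT.get(verb, [])

print(licenses("give", ["SUBJ", "IOBJ", "OBJ"]))    # True
print(licenses("arrive", ["SUBJ", "OBJ"]))          # False: 'arrive' takes no object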
Projection Principle:
Three syntactic representations:
1. Constituency Parsing (Tree Structure):
• Sentences are broken into hierarchical phrases or constituents (e.g., noun phrases, verb
phrases), represented as a tree structure.
2. Dependency Parsing (Directed Graph):
• Focuses on the direct relationships between words, where words are connected by directed
edges indicating syntactic dependencies.
3. Semantic Role Labelling (SRL):
• Identifies the semantic roles (e.g., agent, patient) of words in a sentence, focusing on the
meaning behind the syntactic structure.
The projection principle, a basic notion in GB, places a constraint on the three syntactic representations
and their mapping from one to the other.
The principle states that representations at all syntactic levels (i.e., d-level, s-level, and LF level) are
projections from the lexicon (collection or database of words and their associated linguistic information).
Thus, lexical properties of categorical structure (sub-categorization) must be observed at each level.
If 'the object' is not present at d-level, then another NP cannot take this position at s-level.
Example:
• At D-structure, each argument of a verb is assigned a thematic role (e.g., Agent, Theme, Goal,
etc.).
• In a sentence like "John gave Mary the book", the verb "gave" requires three arguments: Agent
(John), Recipient (Mary), and Theme (the book).
• If the object (Theme) is not present at the deep structure, it cannot be filled at the surface structure
(S-structure) by another NP (e.g., a different noun phrase).
• 'Sub-categorization' only places a restriction on syntactic categories which a head can accept.
• GB puts another restriction on the lexical heads through which it assigns certain roles to its
arguments.
• These roles are pre-assigned and cannot be violated at any syntactical level as per the projection
principle.
• These role assignments are called theta-roles and are related to 'semantic-selection'.
Agent is a special type of role which can be assigned by a head to outside arguments (external
arguments) whereas other roles are assigned within its domain (internal arguments).
For example, the verb 'eat' assigns the 'Agent' role to 'Mukesh' (an external argument, outside the VP).
Theta-Criterion states that 'each argument bears one and only one Ɵ-role, and each Ɵ-role is
assigned to one and only one argument'.
If any word or phrase (say α or β) falls within the scope of and is determined by a maximal projection,
we say that it is dominated by the maximal projection.
If there are two structures α and β related in such a way that 'every maximal projection dominating α
dominates β', we say that α C-commands β, and this is the necessary and sufficient condition (iff) for C-
command.
Government
α governs β iff:
• α C-commands β,
• α is an X (a head, e.g., noun, verb, preposition, adjective, or inflection), and
• every maximal projection dominating β dominates α.
Additional information
C-COMMAND
A c-command is a syntactic relationship in linguistics, particularly in the theory of syntax, where one node (word
or phrase) in a tree structure can "command" or "govern" another node in certain ways. In simpler terms, it's a rule
that helps determine which parts of a sentence can or cannot affect each other syntactically.
Simple Definition:
C-command occurs when one word or phrase in a sentence has a syntactic connection to another word or phrase,
typically by being higher in the syntactic tree (closer to the top).
Example 1:
In the sentence "John saw Mary,"
"John" c-commands "Mary" because "John" is higher up in the tree structure and can potentially affect "Mary"
syntactically.
Example 2:
In the sentence "She thinks that I am smart,"
The pronoun "She" c-commands "I" because "She" is higher in the syntactic tree, governing the phrase where "I"
occurs.
In essence, c-command helps explain which words in a sentence are connected in ways that allow for things like
pronoun interpretation or binding relations (e.g., which noun a pronoun refers to).
GOVERNMENT
Government is a special case of c-command. It refers to the syntactic relationship between a head (typically a verb, noun, or adjective) and its
dependent elements (such as objects or complements) within a sentence. It determines how certain words control
the form or case of other words in a sentence.
On the other hand, c-command is a syntactic relationship between two constituents in a sentence. A constituent A
c-commands another constituent B if the first constituent (A) is higher in the syntactic structure (usually in the tree)
and can potentially govern or affect the second constituent (B), provided no intervening nodes.
To put it together in context:
Government: This is a formal rule determining how certain words govern the case or form of other words in a
sentence (e.g., verbs can govern the object noun in accusative case in languages like Latin or German).
C-command: This is a structural relationship in which one constituent can influence another, typically affecting
operations like binding, scope, and sometimes government.
In short, government often operates within the structures of c-command, but c-command itself is a broader syntactic
relationship that is also relevant for other linguistic phenomena, such as binding theory, where one element can bind
another if it c-commands it.
Here are a few examples of government in syntax, showing how one word governs the form or case of another
word in a sentence:
1. Verb Government
In many languages, verbs can govern the case of their objects. Here's an example from Latin:
Latin: "Vidēre puellam" (to see the girl)
The verb "vidēre" (to see) governs the object "puellam" (the girl) by requiring it to be in the accusative case.
2. Preposition Government
Prepositions can also govern the case of their objects. Here's an example from German:
German: "Ich gehe in den Park" (I am going to the park)
The preposition "in" governs the accusative case of its object "den Park" (the park).
3. Adjective Government
Adjectives can govern the case, gender, or number of the noun they modify. Here's an example from Russian:
Russian: "Я вижу красивую девочку" (I see a beautiful girl)
The adjective "красивую" (beautiful) appears in the accusative case together with "девочку" (girl), the noun it modifies.
4. Noun Government
In some languages, nouns can govern the case of their arguments. In Russian, for example, some nouns govern a
particular case:
Russian: "Я горжусь успехом" (I am proud of the success)
Here the argument "успехом" (success) is governed and must appear in the instrumental case.
Summary:
Government involves syntactic relationships where a head (verb, preposition, adjective, etc.) dictates or determines
the form (such as case) of its dependent elements.
In these examples, verbs, prepositions, and adjectives have a "governing" influence on the cases of nouns or objects
in the sentence, which is a core part of the syntax in many languages.
GB identifies four types of empty categories: two are empty NP positions, called wh-trace and NP-trace,
and the remaining two are pronouns, called small 'pro' and big 'PRO'.
This division is based on two properties: anaphoric (+a or -a) and pronominal (+p or -p).
• wh-trace: -a, -p
• NP-trace: +a, -p
• small 'pro': -a, +p
• big 'PRO': +a, +p
The traces help ensure that the proper binding relationships are maintained between moved elements
(such as how pronouns or reflexives bind to their antecedents, even after movement).
Additional Information:
• +a (Anaphor): A form that must refer back to something mentioned earlier (i.e., it has an
antecedent). For example, "himself" in "John washed himself." The form "himself" is an anaphor
because it refers back to "John."
• -a (Non-Anaphor): A form that does not require an antecedent to complete its meaning. A regular
pronoun like "he" in "He went to the store" is not an anaphor because it doesn't explicitly need to
refer back to something within the same sentence or clause.
• +p (Pronominal): A form that can function as a pronoun, standing in for a noun or noun phrase.
For example, "she" in "She is my friend" is a pronominal because it refers to a specific person
(though not necessarily previously mentioned).
• -p (Non-Pronominal): A word or form that isn't used as a pronoun. It could be a noun or other
word that doesn't serve as a replacement for a noun phrase in a given context. For example, in
"John went to the store," "John" is not pronominal—it is a noun phrase.
Co-indexing
It is the indexing of the subject NP and AGR (agreement) at d-structure which are preserved by Move α
operations at s-structure.
When an NP-movement takes place, a trace of the movement is created: an indexed empty category (e) is
left at the position from which the movement began, co-indexed with the corresponding moved NP.
For defining constraints on movement, the theory identifies two kinds of positions in a sentence. Positions
assigned θ-roles are called θ-positions, while the others are called θ̄-positions (theta-bar positions).
In a similar way, core grammatical positions (where subject, object, indirect object, etc., are positioned)
are called A-positions (argument positions), and the rest are called Ā-positions (A-bar positions).
Binding theory:
Binding Theory is a syntactic theory that explains how pronouns and noun phrases are interpreted and
distributed in a sentence. It's concerned with the relationships between pronouns and their antecedents
(myself, herself, himself).
The empty category (e_i) and Mukesh (NP_i), bearing the same index, are bound. This theory gives a relationship between NPs (including
pronouns and reflexive pronouns). Binding theory can be stated as follows:
(a) An anaphor (+a) is bound in its governing category.
(b) A pronominal (+p) is free in its governing category.
(c) An R-expression (-a, -p) is free.
Example
A: Mukesh_i knows himself_i
B: Mukesh_i believes that Amrita_j knows him_i
C: Mukesh believes that Amrita_j knows Nupur_k (referring expression)
Note: There are many other types of constraints on Move α; it is not possible to explain all of them here.
In English, the long-distance movement of a complement clause can be explained by bounding theory if
NP and S are taken to be bounding nodes. The theory says that an application of Move α may not cross
more than one bounding node. The theory of control involves syntax, semantics, and pragmatics.
In GB, case theory deals with the distribution of NPs and mentions that each NP must be assigned a case.
In English, we have the nominative, objective, genitive, etc., cases, which are assigned to NPs at particular
positions. Indian languages are rich in case-markers, which are carried even during movements.
Example:
He is running ("He" is the subject of the sentence, performing the action. - nominative)
She sees him. ("Him" is the object of the verb "sees." - Objective)
The man's book. (The genitive case expresses possession or a relationship between nouns.)
Case filter: An NP that has phonetic content, or that is an argument, is ungrammatical if it is not case-
marked. Phonetic content here refers to some physical realization, as opposed to empty categories.
Thus, case filters restrict the movement of NP at a position which has no case assignment. It works in a
manner similar to that of the θ-criterion.
Summary of GB:
In short, GB presents a model of the language which has three levels of syntactic representation.
• It assumes phrase structures to be the maximal projection of some lexical head and in a similar
fashion, explains the structure of a sentence or a clause.
• It assigns various types of roles to these structures and allows them a broad kind of movement
called Move α.
• It then defines various types of constraints which restrict certain movements and justifies others.
Lexical Functional Grammar (LFG)
• LFG represents sentences at two syntactic levels: constituent structure (c-structure) and
functional structure (f-structure).
• Kaplan proposed a concrete form for the register names and values which became the functional
structures in LFG.
• Bresnan was more concerned with the problem of explaining certain linguistic issues, such as
active/passive and dative alternations, in the transformational approach. She proposed that such
issues can be dealt with by using lexical redundancy rules.
• The unification of these two diverse approaches (with a common concern) led to the development
of the LFG theory.
• The 'functional' part is derived from 'grammatical functions', such as subject and object, or roles
played by various arguments in a sentence.
• The 'lexical' part is derived from the fact that the lexical rules can be formulated to help define
the given structure of a sentence and some of the long-distance dependencies, which is difficult
in transformational grammars.
Since grammatical-functional roles cannot be derived directly from the phrase and sentence structure, functional
specifications are annotated on the nodes of the c-structure; when these annotated rules are applied to a
sentence, the result is its f-structure.
[
SUBJ: [ PERS: 3, NUM: SG ], // "She" is the subject, 3rd person, singular
PRED: "see", // The verb "saw" represents the predicate "see"
OBJ: [ NUM: PL, PRED: "star" ], // "stars" is the object, plural, and the predicate is "star"
LOC: [ PRED: "sky", DEF: + ] // "sky" is the location, with a definite determiner ("the")
]
The structure above is the f-structure of the example sentence; the corresponding c-structure (parse tree) is shown in the figure.
Example:
She saw stars in the sky
CFG rules to handle this sentence are:
S → NP VP
VP → V {NP} PP* {NP} {S'}
PP → P NP
NP → Det N {PP}
S' → Comp S
Where: S: sentence, NP: noun phrase, VP: verb phrase, PP: prepositional phrase, S': complement clause, Comp: complementizer, Det: determiner, V: verb, P: preposition, N: noun; { } marks an optional constituent and * marks zero or more occurrences.
• Here, the up arrow (↑) refers to the f-structure of the mother node, i.e., the node on the left-hand side of
the rule.
• The down arrow (↓) refers to the f-structure of the node under which the annotation is written.
• Hence, in Rule 1, the annotation (↑ SUBJ) = ↓ on the NP indicates that the f-structure of the first NP goes into
the f-structure of the subject of the sentence, while the annotation ↑ = ↓ on the VP indicates that the
f-structure of the VP node goes directly into the f-structure of the sentence.
Consistency In a given f-structure, a particular attribute may have at the most one value. Hence, while
unifying two f-structures, if the attribute Num has value SG in one and PL in the other, it will be rejected.
Completeness When an f-structure and all its subsidiary f-structures (as the value of any attribute of f-
structure can again contain other f-structures) contain all the functions that their predicates govern, then
and only then is the f-structure complete.
For example, since the predicate 'see <(↑ SUBJ) (↑ OBJ)>' contains an object among its governable functions,
a sentence like 'She saw' will be incomplete.
Coherence Coherence maps the completeness property in the reverse direction. It requires that all
governable functions of an f-structure, and all its subsidiary f-structures, must be governed by their
respective predicates. Hence, in the f-structure of a sentence, an object cannot be taken if its verb does
not allow that object. Thus, it will reject the sentence, 'I laughed a book.'
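A minimal Python sketch of how these three well-formedness conditions might be checked, representing f-structures as nested dictionaries (the tiny lexicon and attribute names are assumptions for illustration, not the textbook's notation):

# Governable grammatical functions a predicate requires (hypothetical lexicon)
GOVERNED = {"see": ["SUBJ", "OBJ"], "laugh": ["SUBJ"]}
GOVERNABLE = {"SUBJ", "OBJ", "IOBJ", "COMP"}

def unify(f1, f2):
    """Consistency: merge two f-structures; fail if an attribute would get two different values."""
    merged = dict(f1)
    for attr, value in f2.items():
        if attr in merged and merged[attr] != value:
            return None                              # clash, e.g. NUM: SG vs NUM: PL
        merged[attr] = value
    return merged

def is_complete(f):
    """Completeness: every function governed by the predicate must be present."""
    return all(g in f for g in GOVERNED.get(f.get("PRED"), []))

def is_coherent(f):
    """Coherence: every governable function present must be governed by the predicate."""
    allowed = set(GOVERNED.get(f.get("PRED"), []))
    return all(attr in allowed for attr in f if attr in GOVERNABLE)

print(unify({"NUM": "SG"}, {"NUM": "PL"}))                                  # None: consistency violation
print(is_complete({"PRED": "see", "SUBJ": {"PERS": 3}}))                    # False: 'She saw' lacks an OBJ
print(is_coherent({"PRED": "laugh", "SUBJ": {}, "OBJ": {"PRED": "book"}}))  # False: 'I laughed a book'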
Example:
The lexical entries of the various words in the sentence, and the corresponding c-structure, are shown in the textbook figure.
In LFG, the verb is converted to its participial form and the sub-categorization is changed directly (in the lexicon), rather than by a transformation.
Example
Active: तारा हँ सी
Taaraa hansii
Tara laughed
In LFG, unbounded movement and coordination is handled by the functional identity and by correlation
with the corresponding f-structure.
Unlike English, which is SVO (Subject-Verb-Object) ordered, Indian languages are typically SOV (Subject-Object-Verb) ordered
and inflectionally rich. The inflections provide important syntactic and semantic cues for language
analysis and understanding. The Paninian framework takes advantage of these features.
Note: Inflectional – refers to the changes a word undergoes to express different grammatical categories
such as tense, number, gender, case, mood, and aspect without altering the core meaning of the word.
Indian languages have traditionally used oral communication for knowledge propagation. In Hindi, we
can change the position of subject and object. For example:
In Hindi, some main verbs, e.g., give (देना) and take (लेना), also combine with other main verbs to
change the aspect and modality of the verb:
वह चला (vah chalaa): He moved.
वह चल दिया (vah chal diyaa): He moved (started the action); literally, 'he move gave'.
The nouns are followed by post-positions instead of prepositions. These generally remain separate
words in Hindi.
• The surface and the semantic levels are obvious. The other
two levels should not be confused with the levels of GB.
• Vibhakti literally means inflection, but here, it refers to word
(noun, verb, or other) groups based either on case endings, or
post-positions, or compound verbs, or main and auxiliary
verbs, etc.
• Karaka (pronounced Kaaraka) literally means Case, and in GB, we have already discussed case
theory, θ-theory, and sub-categorization, etc. Paninian Grammar has its own way of defining
Karaka relations.
Karaka Theory
Example: 'Mother feeds bread to the child in the courtyard with her hand.'
• 'maan' (mother) is the Karta; the Karta generally has the 'ne' or null (∅) case marker.
• rotii (bread) is the Karma. ('Karma' is similar to object and is the locus of the result of the activity)
• haath (hand) is the Karan. (noun group through which the goal is achieved), It has the marker
“dwara” (by) or “se”
• 'Sampradan' is the beneficiary of the activity, e.g., bachchi (child). It takes the marker "ko" (to) or "ke liye" (for).
• 'Apaadaan' denotes separation; its marker "se" (from) is attached to the part that serves as a reference
point (being stationary).
• aangan (courtyard) is the Adhikaran (is the locus (support in space or time) of Karta or Karma).
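For illustration, the karaka relations and typical Hindi markers described above might be represented as a simple table in Python (the marker lists are a sketch based on standard accounts of the Paninian framework, not an exhaustive inventory):

# Karaka relations with typical Hindi vibhakti (case) markers; "" means unmarked.
KARAKA_MARKERS = {
    "Karta":     ["ne", ""],         # agent / doer
    "Karma":     ["ko", ""],         # object, locus of the result
    "Karan":     ["se", "dwara"],    # instrument
    "Sampradan": ["ko", "ke liye"],  # beneficiary
    "Apaadaan":  ["se"],             # separation / source
    "Adhikaran": ["mein", "par"],    # locus in space or time
}

# Karaka analysis of the example sentence (mother feeds bread to the child
# in the courtyard with her hand):
analysis = {
    "maan (mother)":      "Karta",
    "rotii (bread)":      "Karma",
    "haath (hand)":       "Karan",
    "bachchii (child)":   "Sampradan",
    "aangan (courtyard)": "Adhikaran",
}
for word, karaka in analysis.items():
    print(f"{word:20s} -> {karaka:10s} markers: {KARAKA_MARKERS[karaka]}")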
Statistical Language Model: n-gram Model
An n-gram model is a statistical method that predicts the probability of the next word in a sequence based on the
previous n-1 words.
Why n-gram?
The goal of a statistical language model is to estimate the probability (likelihood) of a sentence. This is
achieved by decomposing sentence probability into a product of conditional probabilities using the chain
rule as follows:
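The chain rule decomposition referred to above is the standard one:

P(w_1 w_2 \ldots w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})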
So, in order to calculate sentence probability, we need to calculate the probability of a word, given the
sequence of words preceding it. This is not a simple task.
An n-gram model simplifies the task by approximating the probability of a word given all the previous
words by the conditional probability given previous n-1 words only.
P(w_i \mid h_i) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
Thus, an n-gram model calculates P(w \mid h) by modelling language as a Markov model of order n-1, i.e., by
looking at the previous n-1 words only.
A model that limits the history to the previous one word only is termed a bi-gram (n = 2) model.
A model that conditions the probability of a word on the previous two words is called a tri-gram (n = 3)
model.
Using the bi-gram and tri-gram estimates, the probability of a sentence can be calculated as shown below.
One pseudo-word <s> is introduced to mark the beginning of the sentence in bi-gram estimation, and
two pseudo-words <s1> and <s2> in tri-gram estimation.
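Written out for a sentence s = w_1 w_2 ... w_n, the bi-gram and tri-gram sentence probability estimates are:

P(s) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})  (bi-gram, with w_0 = <s>)
P(s) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})  (tri-gram, with w_{-1} = <s1>, w_0 = <s2>)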
How to estimate these probabilities?
1. Train n-gram model on training corpus.
2. Estimate n-gram parameters using the maximum likelihood estimation (MLE) technique, i.e.,
using relative frequencies (see the formula after this list).
o Count a particular n-gram in the training corpus and divide it by the sum of all n-grams
that share the same prefix.
3. The sum of all n-grams that share the first n-1 words is equal to the count of the common prefix
w_{i-n+1}, ..., w_{i-1}.
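In symbols, the MLE estimate described in steps 2 and 3 is:

P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_{i-1} w_i)}{C(w_{i-n+1} \ldots w_{i-1})}

and, for the bi-gram case, P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}.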
Example:
Training set:
1. The Arabian knights.
2. These are the fairy tales of the east.
3. The stories of the Arabian knights are translated in many languages.
Bi-gram model (MLE estimates computed from the training set):
P(the/<s>) =0.67 P(Arabian/the) = 0.4 P(knights /Arabian) =1.0
P(are/these) = 1.0 P(the/are) = 0.5 P(fairy/the) =0.2
P(tales/fairy) =1.0 P(of/tales) =1.0 P(the/of) =1.0
P(east/the) = 0.2 P(stories/the) =0.2 P(of/stories) =1.0
P(are/knights) =1.0 P(translated/are) =0.5 P(in /translated) =1.0
P(many/in) =1.0
P(languages/many) =1.0
Test sentence(s): The Arabian knights are the fairy tales of the east.
P(the/<s>) × P(Arabian/the) × P(knights/Arabian) × P(are/knights)
× P(the/are) × P(fairy/the) × P(tales/fairy) × P(of/tales) × P(the/of)
× P(east/the)
= 0.67 × 0.4 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
≈ 0.0054
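A minimal Python sketch of this bi-gram calculation, assuming the three training sentences listed above; it also shows the log-space computation mentioned under 'Limitations' below:

from collections import Counter
import math

# Training corpus: the three sentences of the worked example above
corpus = [
    "The Arabian knights",
    "These are the fairy tales of the east",
    "The stories of the Arabian knights are translated in many languages",
]

# Count bi-grams and their histories, lowercasing and adding a start marker <s>
history, bigram = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.lower().split()
    history.update(words[:-1])                    # a word counts as history only when something follows it
    bigram.update(zip(words[:-1], words[1:]))

def p(word, prev):
    """MLE bi-gram probability P(word | prev) = C(prev word) / C(prev)."""
    return bigram[(prev, word)] / history[prev] if history[prev] else 0.0

test = "The Arabian knights are the fairy tales of the east".lower().split()
prob, logprob = 1.0, 0.0
for prev, word in zip(["<s>"] + test[:-1], test):
    prob *= p(word, prev)
    logprob += math.log(p(word, prev))            # log space avoids numerical underflow
print(round(prob, 4))                             # approximately 0.0053 (0.0054 with rounded factors)
print(round(math.exp(logprob), 4))                # same value recovered from the log-space sum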
Limitations:
• Multiplying the probabilities might cause a numerical underflow, particularly in long sentences.
To avoid this, calculations are made in log space, where a calculation corresponds to adding log
of individual probabilities and taking antilog of the sum.
• The n-gram model faces data sparsity, assigning zero probability to unseen n-grams in the training
data, leading to many zero entries in the bigram matrix. This results from the assumption that a
word's probability depends solely on the preceding word(s), which isn't always true.
• Fails to capture long-distance dependencies in natural language sentences.
Solution:
• A number of smoothing techniques have been developed to handle the data sparseness problem.
• Smoothing in general refers to the task of re-evaluating zero-probability or low-probability n-
grams and assigning them non-zero values.
• Add-one (Laplace) smoothing adds a value of one to each n-gram frequency before normalizing them into
probabilities. Thus, the conditional probability is adjusted accordingly (see the formula after this list).
• Yet, it is not very effective, since it assigns the same probability to all missing n-grams, even though some
of them could be intuitively more plausible than others.
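For a bi-gram model with vocabulary size V, the add-one (Laplace) estimate referred to in the list above is:

P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}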
Example:
We want to calculate the probability of the bigram "I love" using Add-one smoothing.
• Unigrams:
• Bigrams:
• Vocabulary size V: There are 4 unique words: "I", "love", "programming", "coding".
Let’s say we want to calculate the probability for the bigram "I coding" (which doesn’t appear in the
training data):
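A minimal Python sketch of the add-one calculation, assuming a toy corpus over the same four-word vocabulary (the counts are illustrative, not those of the original example), showing how the unseen bigram "I coding" still receives a non-zero probability:

from collections import Counter

# Toy corpus over the vocabulary {"I", "love", "programming", "coding"} (assumed for illustration)
corpus = ["I love programming", "I love coding"]

unigram, bigram = Counter(), Counter()
for sent in corpus:
    words = sent.lower().split()
    unigram.update(words)
    bigram.update(zip(words[:-1], words[1:]))

V = len(unigram)                                  # vocabulary size = 4

def p_add_one(word, prev):
    """Add-one smoothed bi-gram probability: (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + V)

print(p_add_one("love", "i"))                     # seen bigram "I love":    (2 + 1) / (2 + 4) = 0.5
print(p_add_one("coding", "i"))                   # unseen bigram "I coding": (0 + 1) / (2 + 4) ≈ 0.167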
• Good-Turing smoothing improves probability estimates by adjusting for unseen n-grams based
on the frequency distribution of observed n-grams.
• It adjusts the frequency f of an n-gram using the count of n-grams having a frequency of
occurrence f+1. It converts the frequency of an n-gram from f to f* using the following
expression:
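The standard Good-Turing adjustment being referred to is:

f^{*} = (f + 1) \, \frac{n_{f+1}}{n_f}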
where n_f is the number of n-grams that occur exactly f times in the training corpus. As an example, consider
that the number of n-grams that occur 4 times is 25,108 and the number of n-grams that occur 5 times is
20,542. Then, the smoothed count for 4 will be:
f* = (4 + 1) × 20,542 / 25,108 ≈ 4.09