LANGUAGE MODELING

Prabhleen Juneja
Thapar Institute of Engineering & Technology
INTRODUCTION
 Natural language is a complex entity, and in order to process it
through a computer-based program, we need to build a model of
it.
 A language model is a description of a language that can be
used to check whether a string is a valid member of the
language or not.
 The process of creating language models is called language
modeling.
 There are two major approaches used for language modeling,
namely Grammar-Based Language Modeling and Statistical
Language Modeling.
GRAMMAR-BASED LANGUAGE
MODELING
 A grammar-based approach uses the grammar of a language to
create a model.
 It attempts to represent the syntactic structure of the language.
 The grammar consists of hand-coded rules defining the structure
and ordering of various constituents appearing in the linguistic
unit (phrase, sentence, etc.)
 Various computational grammars have been studied and
proposed, such as phrase structure grammar, transformational
grammar, functional grammar, government and binding,
dependency grammar, Paninian grammar, tree-adjoining
grammar, etc.
 The major limitation of grammar-based language modeling is
that these methods are language-dependent.
STATISTICAL LANGUAGE MODELING
 Statistical language modeling aims to build a statistical language model
that can estimate the distribution of natural language as accurately as
possible.
 A statistical language model (SLM) is a probability distribution P(S) over
strings S that attempts to reflect how frequently a string S occurs as a
sentence.
 In other words, a statistical language model attempts to identify a sentence
based on its probability measure.
 Statistical language models attempt to capture the regularities of natural
language using a large training corpus, for the purpose of improving the
performance of various NLP applications.
 The original (and still the most important) application of SLMs is
speech recognition, but SLMs also play a vital role in other
natural language applications as diverse as machine translation,
handwriting recognition, intelligent input methods, part-of-speech
tagging, and text-to-speech systems.
STATISTICAL LANGUAGE MODELING CONTD..

 SLMs are essential in any task in which we have to identify
words in noisy, ambiguous input.
 In speech recognition, for example, the input speech sounds
are very confusable and many words sound extremely similar.
 In speech recognition, the computer tries to match sounds
with word sequences.
 The language model provides context to distinguish between
words and phrases that sound similar.
 For example, in American English, the phrases "recognize
speech" and "wreck a nice beach" are pronounced almost the
same but mean very different things. These ambiguities are
easier to resolve when evidence from the language model is
incorporated.
STATISTICAL LANGUAGE MODELING CONTD..

 In OCR & handwriting recognition – more probable
sentences are more likely to be correct readings.
 In the movie Take the Money and Run, Woody Allen tries to
rob a bank with a sloppily written hold-up note that the teller
incorrectly reads as “I have a gub”.
 Any speech and language processing system could avoid
making this mistake by using the knowledge that the
sequence “I have a gun” is far more probable than “I have a
gub”, which contains the non-word “gub”.
 In machine translation – more likely sentences are probably
better translations.
STATISTICAL LANGUAGE MODELING CONTD..

 Suppose we are translating a Chinese source sentence and,
as part of the process, we have a set of potential rough English
translations:
 he briefed to reporters on the chief contents of the statement
 he briefed reporters on the chief contents of the statement
 he briefed to reporters on the main contents of the statement
 he briefed reporters on the main contents of the statement
 An SLM might tell us that, even after controlling for length,
briefed reporters is more likely than briefed to reporters, and
main contents is more likely than chief contents.
STATISTICAL LANGUAGE MODELING CONTD..

 In spelling correction, we need to find and correct spelling
errors like the following that accidentally result in real
English words:
 They are leaving in about fifteen minuets to go to her house.
 The design an construction of the system will take more than a
year.
 Since these errors involve real words, we can’t find them by just flagging
words that are not in the dictionary. But note that in about fifteen
minuets is a much less probable sequence than in about fifteen
minutes.
 A spellchecker can use a probability estimator both to detect these
errors and to suggest higher-probability corrections, as sketched below.
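
A minimal sketch of this idea in Python (added for illustration; score_sentence, best_correction, and the toy bigram model are hypothetical, not part of the slides):

# A minimal sketch: rank candidate corrections by bigram probability.
# The toy model below is made up for illustration.

def score_sentence(words, bigram_prob):
    """Multiply bigram probabilities over the sentence, padded with <s> and </s>."""
    prob = 1.0
    for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
        prob *= bigram_prob.get((prev, cur), 1e-8)   # tiny floor for unseen bigrams
    return prob

def best_correction(candidates, bigram_prob):
    """Return the candidate sentence the model considers most probable."""
    return max(candidates, key=lambda c: score_sentence(c.split(), bigram_prob))

toy_model = {("fifteen", "minutes"): 0.01, ("fifteen", "minuets"): 1e-7}
print(best_correction(["in about fifteen minuets", "in about fifteen minutes"], toy_model))
# -> "in about fifteen minutes"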
STATISTICAL LANGUAGE MODELING CONTD..

 Word prediction is also important for augmentative
communication systems that help the disabled.
 People who are unable to use speech or sign language to
communicate, like the physicist Stephen Hawking, can
communicate by using simple body movements to select
words from a menu that are spoken by the system.
 Word prediction can be used to suggest likely words for the
menu.
STATISTICAL LANGUAGE MODELING
 The goal of statistical models is to estimate the probability (likelihood) of a
sentence.
 This is achieved by decomposing the sentence probability into a product of
conditional probabilities using the chain rule, as follows:

P(S) = P(w_1 w_2 w_3 ... w_n)
     = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_n | w_1 w_2 ... w_{n-1})
     = ∏_{i=1}^{n} P(w_i | h_i)

where h_i is the history of the i-th word, defined as:

h_i = w_1 w_2 w_3 ... w_{i-1}

Thus, in order to calculate the probability of the sentence, we need to
calculate the probability of each word given the sequence of words
preceding it.
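
As a brief illustration of the chain rule (a sketch assuming some conditional-probability function is available; cond_prob is a hypothetical stand-in, not from the slides):

# Sketch: P(S) as a product of conditional probabilities (chain rule).
# cond_prob(word, history) is a hypothetical model supplied by the caller.

def sentence_probability(words, cond_prob):
    prob, history = 1.0, []
    for w in words:
        prob *= cond_prob(w, tuple(history))   # P(w_i | w_1 ... w_{i-1})
        history.append(w)
    return prob

# Dummy model that ignores the history entirely, just to exercise the function:
print(sentence_probability("i want chinese food".split(), lambda w, h: 0.1))   # 0.1 ** 4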
STATISTICAL LANGUAGE MODELING
 It is quite difficult to estimate the probabilities of words
following very long strings, for the following reasons:
 Language is very creative; new sentences are created all the
time, and we won’t be able to count all the sentences.
 Any particular context might never have occurred before in the
training corpus.
 An N-gram model simplifies this probability estimation task.
N-GRAM BASED LANGUAGE MODELING
 An n-gram model approximates the probability of a word given all previous
words by the conditional probability given the previous n-1 words
only:

P(w_i | h_i) ≈ P(w_i | w_{i-n+1} ... w_{i-1})

 A model that limits the history to the previous word only is called a bi-gram
model (n=2):

P(w_i | h_i) ≈ P(w_i | w_{i-1})

 A tri-gram model limits the history to the previous two words only (n=3):

P(w_i | h_i) ≈ P(w_i | w_{i-2} w_{i-1})

 A special pseudo-word <s> is introduced to mark the beginning of the
sentence in bigram estimation. Similarly, in tri-gram estimation, two
pseudo-words <s1> & <s2> are introduced.
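
A small sketch of this history truncation and padding (the helper name is illustrative, and a single repeated <s> symbol is used for padding rather than the <s1>/<s2> convention above):

# Sketch: keep only the previous n-1 words as history, padding with <s> pseudo-words.

def ngram_history(words, i, n):
    """Return the n-1 words preceding position i (0-based), padded with <s>."""
    padded = ["<s>"] * (n - 1) + list(words)
    return tuple(padded[i : i + n - 1])

words = "i want chinese food".split()
print(ngram_history(words, 0, 2))   # ('<s>',)      bigram history of the first word
print(ngram_history(words, 1, 3))   # ('<s>', 'i')  trigram history of the second word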
N-GRAM BASED LANGUAGE MODELING
 The probabilities of a word given the previous n-1 words are estimated by
training the n-gram model on a training corpus.
 The probabilities are estimated using maximum likelihood
estimation (MLE), i.e. using relative frequency.
 The count of a particular n-gram in the training corpus is
divided by the count of all the n-grams that share the same
prefix:

P(w_i | w_{i-n+1} ... w_{i-1}) = C(w_{i-n+1} ... w_{i-1} w_i) / ∑_w C(w_{i-n+1} ... w_{i-1} w)

 Since the sum of the counts of all n-grams that share the same n-1 word
prefix is equal to the count of that common prefix, this simplifies to:

P(w_i | w_{i-n+1} ... w_{i-1}) = C(w_{i-n+1} ... w_{i-1} w_i) / C(w_{i-n+1} ... w_{i-1})
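
A minimal sketch of this MLE recipe for a bigram model, assuming whitespace-tokenized sentences (the function name is illustrative):

from collections import Counter

# Sketch: MLE bigram probabilities as relative frequencies.

def train_bigram_mle(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])                   # prefix (context) counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))   # adjacent word pairs
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

probs = train_bigram_mle(["I am Sam", "Sam I am", "I do not like green ham"])
print(probs[("<s>", "I")])   # 2/3
print(probs[("Sam", "I")])   # 1/2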
N-GRAM MODEL- EXAMPLE
 Consider the following training corpus with three sentences:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green ham </s>
Compute the bi-gram probabilities and find the probability of the
test sentence “Sam I ham” using a bi-gram language model.
N-GRAM MODEL- EXAMPLE CONTD…
Bi-gram Probabilities:
P(I|<s>) = 2/3 = 0.67        P(am|I) = 2/3 = 0.67
P(Sam|am) = 1/2 = 0.5        P(</s>|Sam) = 1/2 = 0.5
P(Sam|<s>) = 1/3 = 0.33      P(I|Sam) = 1/2 = 0.5
P(</s>|am) = 1/2 = 0.5       P(do|I) = 1/3 = 0.33
P(not|do) = 1/1 = 1          P(like|not) = 1/1 = 1
P(green|like) = 1/1 = 1      P(ham|green) = 1/1 = 1
P(</s>|ham) = 1/1 = 1

P(Sam I ham) = P(Sam|<s>) P(I|Sam) P(ham|I) P(</s>|ham)
             = 0.33 × 0.5 × 0 × 1
             = 0
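
The same result can be checked in a few lines of Python (a standalone sketch listing only the bigram probabilities needed for the test sentence):

# Standalone check of the worked example: P(Sam I ham) under the unsmoothed bigram model.
probs = {("<s>", "Sam"): 1/3, ("Sam", "I"): 1/2, ("ham", "</s>"): 1.0}
# P(ham | I) is absent because "I ham" never occurs in the training corpus.

tokens = ["<s>", "Sam", "I", "ham", "</s>"]
p = 1.0
for pair in zip(tokens[:-1], tokens[1:]):
    p *= probs.get(pair, 0.0)   # unseen bigrams get probability 0 under MLE
print(p)                        # 0.0 -- a single zero bigram zeroes out the sentence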
N-GRAM MODEL- EXAMPLE 2
 The bigram counts from a piece of a bigram grammar from the
Berkeley Restaurant Project.
          I     want   to     eat   chinese  food   lunch  spend
I         5     827    0      9     0        0      0      2
want      2     0      608    1     6        6      5      1
to        2     0      4      686   2        0      6      211
eat       0     0      2      0     16       2      42     0
chinese   1     0      0      0     0        82     1      0
food      15    0      15     0     1        4      0      0
lunch     2     0      0      0     0        1      0      0
spend     1     0      1      0     0        0      0      0

 The unigram count of each word is:

i       want   to     eat    chinese  food   lunch  spend
2533    927    2417   746    158      1093   341    278
N-GRAM MODEL- EXAMPLE 2
 Some other useful probabilities are:
P(i|<s>) = 0.25
P(</s>|food) = 0.68

 Compute the probability of the sentence “I want chinese food”.
N-GRAM MODEL- EXAMPLE 2
 The bigram probabilities after normalization (dividing each row
by the corresponding unigram count) are shown below:

          i        want   to      eat     chinese  food    lunch   spend
i         0.002    0.33   0       0.0036  0        0       0       0.00079
want      0.0022   0      0.66    0.0011  0.0065   0.0065  0.0054  0.0011
to        0.00083  0      0.0017  0.28    0.00083  0       0.0025  0.087
eat       0        0      0.0027  0       0.021    0.0027  0.056   0
chinese   0.0063   0      0       0       0        0.52    0.0063  0
food      0.014    0      0.014   0       0.00092  0.0037  0       0
lunch     0.0059   0      0       0       0        0.0029  0       0
spend     0.0036   0      0.0036  0       0        0       0       0
N-GRAM MODEL- EXAMPLE 2
P(I want chinese food)
= P(I|<s>) P(want|I) P(chinese|want) P(food|chinese) P(</s>|food)
= 0.25 × 0.33 × 0.0065 × 0.52 × 0.68
= 0.000189618
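
The arithmetic can be verified directly (a trivial check; the probability values are read off the tables above):

# Multiply the bigram probabilities for <s> I want chinese food </s>.
p_i_start      = 0.25     # P(i | <s>)
p_want_i       = 0.33     # P(want | i)
p_chinese_want = 0.0065   # P(chinese | want)
p_food_chinese = 0.52     # P(food | chinese)
p_end_food     = 0.68     # P(</s> | food)
print(p_i_start * p_want_i * p_chinese_want * p_food_chinese * p_end_food)   # ~0.000190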
TRAINING & TEST SETS
 The probabilities of an N-gram model come from the corpus it
is trained on.
 In general, the parameters of a statistical model are trained on
some set of data (training data), and then we apply the models
to some new data (test data) in some task (such as speech
recognition) and see how well they work.
 This training-and-testing paradigm can also be used to
evaluate different N-gram architectures.
 If our test sentence is part of the training corpus, we will
mistakenly assign it an artificially high probability when it
occurs in the test set. We call this situation training on the test
set.
TRAINING & TEST SETS
 In addition to training and test sets, other divisions of data are
often useful.
 Sometimes we need an extra source of data to augment the
training set. Such extra data is called a held-out set, because
we hold it out from our training set when we train our N-gram
counts.
 The held-out corpus is then used to set some other parameters.
 Sometimes we need to have multiple test sets.
 This happens because we might use a particular test set so
often that we implicitly tune to its characteristics. Then we
would definitely need a fresh test set which is truly unseen. In
such cases, we call the initial test set the development test set,
or devset.
UNKNOWN WORDS: OPEN VS. CLOSED
VOCABULARY TASKS
 Sometimes we have a language task in which we know all the words that
can occur, and hence we know the vocabulary size V in advance.
 The closed vocabulary assumption is the assumption that we have such a
lexicon, and that the test set can only contain words from this lexicon. The
closed vocabulary task thus assumes there are no unknown words.
 But the number of unseen words grows constantly, so we can’t possibly
know in advance exactly how many there are, and we’d like our model to
do something reasonable with them.
 We call these unseen events unknown words, or out of vocabulary
(OOV) words. The percentage of OOV words that appear in the test set is
called the OOV rate.
 An open vocabulary system is one where we model these potential
unknown words in the test set by adding a pseudo-word called <UNK>.
UNKNOWN WORDS: OPEN VS. CLOSED
VOCABULARY TASKS
 We can train the probabilities of the unknown word model
<UNK> as follows:
 Choose a vocabulary (word list) which is fixed in advance.
 Convert any word in the training set that is not in this vocabulary (any
OOV word) to the unknown word token <UNK> in a text normalization step.
 Estimate the probabilities for <UNK> from its counts just like any other
regular word in the training set.
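
A minimal sketch of this open-vocabulary recipe (the keep-words-seen-at-least-twice rule for choosing the vocabulary is an illustrative assumption, not from the slides):

from collections import Counter

# Sketch: fix a vocabulary, then map every OOV word to <UNK>.

def build_vocab(sentences, min_count=2):
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(sentence, vocab):
    return " ".join(w if w in vocab else "<UNK>" for w in sentence.split())

train = ["I am Sam", "Sam I am", "I do not like green ham"]
vocab = build_vocab(train)                  # {'I', 'am', 'Sam'}
print(replace_oov("Sam I like", vocab))     # Sam I <UNK>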


EVALUATING N-GRAMS
 The best way to evaluate the performance of a language model is to
embed it in an application and measure the total performance of the
application. Such end-to-end evaluation is called extrinsic
evaluation.
 For example, for speech recognition, we can compare the
performance of two language models by running the speech
recognizer twice, once with each language model, and seeing which
gives the more accurate transcription.
 An end-to-end evaluation is often very expensive; evaluating a large
speech recognition test set, for example, takes hours or even days.
 An intrinsic evaluation metric is one which measures the quality of
a model independent of any application.
 Perplexity is the most common intrinsic evaluation metric for N-
gram language models.
EVALUATING N-GRAMS CONTD….
 The perplexity (sometimes called PP for short) of a language
model on a test set is a function of the probability that the
language model assigns to that test set.
 For a test set W = w_1 w_2 ... w_N, the perplexity is given by:

PP(W) = P(w_1 w_2 ... w_N)^{-1/N}
      = (1 / P(w_1 w_2 ... w_N))^{1/N}

 Perplexity can be seen as the degree of randomness or confusion of a model,
since it is inversely related to the probability the model assigns to the test set.
 Thus, a lower perplexity indicates a better language model.
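
A small sketch of the computation, working in log space to avoid numerical underflow (the per-word probability values are made up for illustration):

import math

# Sketch: perplexity from the probabilities the model assigns to each test word.

def perplexity(word_probs):
    """word_probs: P(w_i | history) for each of the N words in the test set."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)            # PP = P(w_1 ... w_N) ** (-1/N)

print(perplexity([0.2, 0.1, 0.25, 0.2]))      # ~5.62; lower perplexity = better model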
SMOOTHING
 There is a major problem with the maximum likelihood estimation
process. This is the problem of sparse data caused by the fact that
the maximum likelihood estimate is based on a particular set of
training data.
 For any N-gram that occurred a sufficient number of times, we
might have a good estimate of its probability. But because any
corpus is limited, some perfectly acceptable English word
sequences are bound to be missing from it.
 This missing data means that the N-gram matrix for any given
training corpus is bound to have a very large number of “zero
probability N-grams” that should really have some non-zero
probability.
 Furthermore, the MLE method also produces poor estimates when
the counts are non-zero but still small.
SMOOTHING
 Zero counts turn out to cause another huge problem. The
perplexity metric requires that we compute the probability of
each test sentence.
 But if a test sentence has an N-gram that never appeared in the
training set, the Maximum Likelihood estimate of the probability
for this N-gram, and hence for the whole test sentence, will be
zero.
 This means that in order to evaluate our language models, we
need to modify the MLE method to assign some non-zero
probability to any N-gram, even one that was never observed in
training.
 Smoothing refers to the modifications made to maximum likelihood
estimates to address the problem of poor estimates caused by
variability in small data sets.
ADD ONE / LAPLACE SMOOTHING
 Laplace smoothing adds one to each count. Since there are V
words in the vocabulary, and each one was incremented, we
also need to adjust the denominator to take into account the
extra V observations.
 For a unigram model:

P_Laplace(w_i) = (c_i + 1) / (N + V)

 For a bigram model:

P_Laplace(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V)
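
A minimal sketch of the add-one estimate (the count dictionaries and toy numbers are illustrative; the k parameter is included so the same sketch also covers the add-d variant discussed on the next slide):

# Sketch: add-one (Laplace) smoothed bigram probability from raw counts.

def laplace_bigram(w_prev, w, bigram_counts, context_counts, V, k=1.0):
    # k = 1 gives add-one smoothing; 0 < k < 1 gives the add-d variant.
    return (bigram_counts.get((w_prev, w), 0) + k) / (context_counts.get(w_prev, 0) + k * V)

# Toy numbers: C(<s> Sam) = 1, C(<s>) = 3, V = 9
print(laplace_bigram("<s>", "Sam", {("<s>", "Sam"): 1}, {"<s>": 3}, V=9))   # (1+1)/(3+9) = 1/6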
ADD ONE / LAPLACE SMOOTHING
 Laplace smoothing results in a sharp change in counts and
probabilities because too much probability mass is moved to all
the zeros.
 We could move a bit less mass by adding a fractional count
rather than 1 (add-d smoothing):

P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + d) / (C(w_{i-1}) + dV)

 But this approach requires a method for choosing d dynamically.
 It also results in an inappropriate discount for many counts, and
turns out to give counts with poor variances.
GOOD TURING DISCOUNTING
 A related way to view smoothing is as discounting (lowering)
some non-zero counts in order to get the probability mass that
will be assigned to the zero counts.
 The Good-Turing algorithm was first described by Good
(1953), who credits Turing with the original idea.
 The basic principle of Good-Turing smoothing is to re-estimate
the amount of probability mass to assign to N-grams with zero
counts by looking at the number of N-grams that occurred one
time.
 A word or N-gram (or any event) that occurs once is called a
singleton, or a hapax. Good-Turing smoothing relies on the assumption
that the frequency of singletons can be used as a re-estimate of
the frequency of zero-count bigrams.
GOOD TURING DISCOUNTING

 The Good-Turing algorithm is based on computing N_c, the
number of N-grams that occur c times.
 We refer to the number of N-grams that occur c times as the
frequency of frequency c.
 The MLE count for N_c is c. The Good-Turing estimate replaces
this with a smoothed count c*, computed as a function of N_{c+1}:

c* = (c + 1) × N_{c+1} / N_c
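
A short sketch of this re-estimation, applied to observed counts only (the function name and toy counts are illustrative; the toy counts mirror the worked example that follows):

from collections import Counter

# Sketch: Good-Turing smoothed counts c* = (c + 1) * N_{c+1} / N_c,
# where N_c is the number of distinct bigrams seen exactly c times.

def good_turing_counts(ngram_counts):
    freq_of_freq = Counter(ngram_counts.values())            # N_c
    return {c: (c + 1) * freq_of_freq.get(c + 1, 0) / n_c
            for c, n_c in freq_of_freq.items()}

# Toy counts: 8 distinct bigrams seen once, 2 seen twice (as in the example below).
counts = {("bigram", str(i)): 1 for i in range(8)}
counts[("<s>", "I")] = 2
counts[("I", "am")] = 2
print(good_turing_counts(counts))   # {1: 0.5, 2: 0.0} -> c* = 4/8 for c = 1, c* = 0 for c = 2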
EXAMPLE
 Consider the following training corpus with three sentences:
I am Sam
Sam I am
I do not like green ham
Compute the probability of the sentence “Sam I like” using
(i) unsmoothed bi-gram language model.
(ii) smoothed bi-gram language model using Laplace
smoothing.
(iii) smoothed bi-gram language model using Good-Turing
discounting.
EXAMPLE CONTD…..
(i) Unsmoothed bi-gram language model
P(Sam I like) = P(Sam|<s>) P(I|Sam) P(like|I)
              = 1/3 × 1/2 × 0/3
              = 0

(ii) Smoothed bi-gram model using Laplace smoothing:
V = 9 (<s>, I, am, Sam, do, not, like, green, ham)
P(Sam I like) = P(Sam|<s>) P(I|Sam) P(like|I)
              = (1+1)/(3+9) × (1+1)/(2+9) × (0+1)/(3+9)
              = 1/6 × 2/11 × 1/12
              = 1/396 ≈ 0.002525
EXAMPLE CONTD…..
(iii) Smoothed bi-gram model using Good-Turing discounting:
V = 9 (<s>, I, am, Sam, do, not, like, green, ham)
Total bi-grams possible = 9 × 9 = 81
Seen pairs (bigram tokens) = 12:
<s> I, I am, am Sam, <s> Sam, Sam I, I am, <s> I, I do, do
not, not like, like green, green ham
Seen distinct pairs = 10
Total unseen pairs = 81 - 10 = 71
EXAMPLE CONTD…..

c    N_c   Pairs                                     c* = (c + 1) N_{c+1} / N_c
0    71    -                                         1 × 8/71 = 8/71
1    8     am Sam, <s> Sam, Sam I, I do, do not,     2 × 2/8 = 4/8
           not like, like green, green ham
2    2     <s> I, I am                               3 × 0/2 = 0

P(Sam I like) = P(Sam|<s>) P(I|Sam) P(like|I)
              = (4/8 × 1/12) × (4/8 × 1/12) × (8/71 × 1/12)
              = 1/24 × 1/24 × 2/213
              ≈ 0.0000163
