Language Modeling
Prabhleen Juneja
Thapar Institute of Engineering & Technology
INTRODUCTION
Natural language is a complex entity, and in order to process it with a computer program we need to build a model of it.
A language model is a description of a language that can be
used to check whether a string is a valid member of the
language or not.
The process of creation of language models is called language
modeling.
There are two major approaches to language modeling, namely Grammar-Based Language Modeling and Statistical Language Modeling.
GRAMMAR-BASED LANGUAGE MODELING
A grammar-based approach uses the grammar of a language to
create a model.
It attempts to represent the syntactic structure of the language.
N-GRAM MODEL
A statistical n-gram language model assigns each word $w_i$ a probability conditioned on its history $h_i$, approximating the full history by only the previous $n-1$ words.
A model that limits the history to the previous one word only is called a bi-gram model (n = 2):
$P(w_i \mid h_i) \approx P(w_i \mid w_{i-1})$
A tri-gram model limits the history to the previous two words only (n = 3):
$P(w_i \mid h_i) \approx P(w_i \mid w_{i-2} w_{i-1})$
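Under the bi-gram approximation, the probability of a whole sentence factors (via the chain rule) into a product of bi-gram probabilities, with $w_0$ taken to be the start marker <s>; this is the form multiplied out in the examples below:
$P(w_1 w_2 \dots w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-1})$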
The n-gram probabilities are obtained from relative counts in the training corpus (the maximum likelihood estimate):
$P(w_i \mid w_{i-n+1} \dots w_{i-1}) = \dfrac{C(w_{i-n+1} \dots w_{i-1} w_i)}{C(w_{i-n+1} \dots w_{i-1})}$
The sum of the counts of all n-grams that share the same first $n-1$ words is equal to the count of that $(n-1)$-gram:
$\sum_{w} C(w_{i-n+1} \dots w_{i-1} w) = C(w_{i-n+1} \dots w_{i-1})$
which is why the $(n-1)$-gram count appears in the denominator.
N-GRAM MODEL- EXAMPLE
Consider the following training corpus with three sentences:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green ham </s>
Compute the bi-gram probabilities and find the probability of the test sentence “Sam I ham” using a bi-gram language model.
N-GRAM MODEL- EXAMPLE CONTD…
Bi-gram Probabilities:
P(I|<s>) = 2/3 = 0.67        P(am|I) = 2/3 = 0.67
P(Sam|am) = 1/2 = 0.5        P(</s>|Sam) = 1/2 = 0.5
P(Sam|<s>) = 1/3 = 0.33      P(I|Sam) = 1/2 = 0.5
P(</s>|am) = 1/2 = 0.5       P(do|I) = 1/3 = 0.33
P(not|do) = 1/1 = 1          P(like|not) = 1/1 = 1
P(green|like) = 1/1 = 1      P(ham|green) = 1/1 = 1
P(</s>|ham) = 1/1 = 1
For the test sentence: P(Sam I ham) = P(Sam|<s>) P(I|Sam) P(ham|I) P(</s>|ham) = 1/3 × 1/2 × 0 × 1 = 0, since the bi-gram “I ham” never occurs in the training corpus.
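The same computation can be sketched in a few lines of Python; this is a minimal illustration built on the corpus above and is not part of the original notes (the helper name bigram_prob is arbitrary):

from collections import Counter

# Training corpus with sentence-boundary markers, exactly as in the example above.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    """Unsmoothed maximum likelihood estimate of P(w | w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

print(bigram_prob("<s>", "I"))   # 2/3
print(bigram_prob("Sam", "I"))   # 1/2

# Test sentence "Sam I ham": the bi-gram (I, ham) is unseen, so the product is 0.
test = "<s> Sam I ham </s>".split()
prob = 1.0
for prev, w in zip(test, test[1:]):
    prob *= bigram_prob(prev, w)
print(prob)                      # 0.0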
PERPLEXITY
The perplexity of a language model on a test set $W = w_1 w_2 \dots w_N$ is the inverse probability of the test set, normalized by the number of words:
$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}$
Perplexity measures the degree of randomness or confusion of a model; it is inversely related to the probability the model assigns to the test set.
Thus, a lower value of perplexity indicates a better language model.
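As a minimal sketch (assuming the per-word probabilities of the test sequence have already been computed, e.g. from the bi-gram table above), perplexity can be computed in log space to avoid numerical underflow:

import math

def perplexity(word_probs):
    """Perplexity of a test sequence from its per-word probabilities P(w_i | history)."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)   # log P(w_1 ... w_N)
    return math.exp(-log_prob / n)

# Bi-gram probabilities of "<s> I am Sam </s>" from the table above.
print(perplexity([2/3, 2/3, 1/2, 1/2]))   # 9 ** 0.25 ≈ 1.73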
SMOOTHING
There is a major problem with the maximum likelihood estimation
process. This is the problem of sparse data caused by the fact that
the maximum likelihood estimate is based on a particular set of
training data.
For any N-gram that occurred a sufficient number of times, we
might have a good estimate of its probability. But because any
corpus is limited, some perfectly acceptable English word
sequences are bound to be missing from it.
This missing data means that the N-gram matrix for any given
training corpus is bound to have a very large number of “zero
probability N-grams” that should really have some non-zero
probability.
Furthermore, the MLE method also produces poor estimates when
the counts are non-zero but still small.
SMOOTHING
Zero counts turn out to cause another huge problem. The
perplexity metric requires that we compute the probability of
each test sentence.
But if a test sentence has an N-gram that never appeared in the
training set, the Maximum Likelihood estimate of the probability
for this N-gram, and hence for the whole test sentence, will be
zero.
This means that in order to evaluate our language models, we
need to modify the MLE method to assign some non-zero
probability to any N-gram, even one that was never observed in
training.
Smoothing refers to the modifications made to maximum likelihood estimates in order to address the problem of poor estimates caused by variability in small data sets.
ADD ONE / LAPLACE SMOOTHING
Laplace smoothing adds one to each count. Since there are V words in the vocabulary and each count is incremented, we also need to adjust the denominator to take the extra V observations into account.
For the unigram model,
$P_{\text{Laplace}}(w_i) = \dfrac{c_i + 1}{N + V}$
where $c_i$ is the count of $w_i$ and $N$ is the total number of word tokens.
For the bigram model,
$P_{\text{Laplace}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + V}$
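A minimal sketch of the smoothed estimate, using the counting convention of the worked example near the end of these notes (start marker <s> only, so V = 9); the counts and the helper name are illustrative:

from collections import Counter

def laplace_bigram_prob(bigrams, unigrams, vocab_size, w_prev, w):
    """Add-one (Laplace) smoothed estimate of P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

# Counts from the three-sentence corpus with <s> markers only.
unigrams = Counter({"<s>": 3, "I": 3, "am": 2, "Sam": 2, "do": 1,
                    "not": 1, "like": 1, "green": 1, "ham": 1})
bigrams = Counter({("<s>", "I"): 2, ("I", "am"): 2, ("am", "Sam"): 1,
                   ("<s>", "Sam"): 1, ("Sam", "I"): 1, ("I", "do"): 1,
                   ("do", "not"): 1, ("not", "like"): 1,
                   ("like", "green"): 1, ("green", "ham"): 1})
V = 9

print(laplace_bigram_prob(bigrams, unigrams, V, "<s>", "Sam"))  # (1+1)/(3+9) = 1/6
print(laplace_bigram_prob(bigrams, unigrams, V, "I", "ham"))    # (0+1)/(3+9) = 1/12: unseen, but non-zero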
ADD ONE / LAPLACE SMOOTHING
Laplace smoothing results in a sharp change in counts and probabilities because too much probability mass is moved to all the zeros.
We could move a bit less mass by adding a fractional count
rather than 1 (add-d smoothing)
$P_{\text{Add-}d}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1} w_i) + d}{c(w_{i-1}) + dV}$
But this approach requires a method for choosing d dynamically. It also results in an inappropriate discount for many counts and turns out to give counts with poor variances.
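Add-d smoothing is the same computation with a fractional constant; a minimal sketch (the default d = 0.1 is purely illustrative, and d = 1 recovers Laplace smoothing):

def add_d_bigram_prob(bigrams, unigrams, vocab_size, w_prev, w, d=0.1):
    """Add-d smoothed estimate of P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + d) / (unigrams[w_prev] + d * vocab_size)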
GOOD TURING DISCOUNTING
A related way to view smoothing is as discounting (lowering)
some non-zero counts in order to get the probability mass that
will be assigned to the zero counts.
The Good-Turing algorithm was first described by Good
(1953), who credits Turing with the original idea.
The basic principle of Good-Turing smoothing is to re-estimate
the amount of probability mass to assign to N-grams with zero
counts by looking at the number of N-grams that occurred one
time.
A word or N-gram (or any event) that occurs once is called a singleton, or a hapax legomenon. The Good-Turing intuition is to use the frequency of singletons as a re-estimate of the frequency of zero-count bigrams.
GOOD TURING DISCOUNTING
The discounted count $c^*$ for an N-gram that occurs $c$ times is
$c^* = (c+1) \dfrac{N_{c+1}}{N_c}$
where $N_c$ is the number of N-grams that occur exactly $c$ times in the training corpus.
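A minimal sketch of the count re-estimation, assuming a plain count-of-counts table (practical variants such as simple Good-Turing also smooth the $N_c$ values, which this sketch does not do):

from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed count c to its discounted count c* = (c + 1) * N_{c+1} / N_c."""
    count_of_counts = Counter(ngram_counts.values())        # N_c for each observed c
    return {c: (c + 1) * count_of_counts.get(c + 1, 0) / n_c
            for c, n_c in count_of_counts.items()}

# Bi-gram counts from the three-sentence corpus with <s> markers: N_1 = 8, N_2 = 2.
bigram_counts = {("<s>", "I"): 2, ("I", "am"): 2, ("am", "Sam"): 1, ("<s>", "Sam"): 1,
                 ("Sam", "I"): 1, ("I", "do"): 1, ("do", "not"): 1, ("not", "like"): 1,
                 ("like", "green"): 1, ("green", "ham"): 1}
print(good_turing_counts(bigram_counts))   # c = 1 maps to 2 * 2/8 = 0.5; c = 2 maps to 0 because N_3 = 0

The fact that the largest observed count maps to zero here is exactly the artifact that smoothing the $N_c$ values is meant to fix.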
EXAMPLE
Consider the following training corpus with three sentences:
I am Sam
Sam I am
I do not like green ham
Compute the probability of the sentence “Sam I like” using
(i) unsmoothed bi-gram language model.
(ii) smoothed bi-gram language model using Laplace
smoothing.
(iii) smoothed bi-gram language model using Good-Turing discounting.
EXAMPLE CONTD…..
(i) Unsmoothed bi-gram language model
P(Sam I like) = P(Sam|<s>) P(I|Sam) P(like|I)
= 1/3 × 1/2 × 0/3
= 0
(The bi-gram “I like” never occurs in the training corpus, so the unsmoothed probability is zero.)
EXAMPLE CONTD…..
(ii) Smoothed bi-gram model using Laplace smoothing:
V = 9 (<s>, I, am, Sam, do, not, like, green, ham)
Total bi-grams possible = 9 × 9 = 81
Seen bi-gram tokens = 12:
<s> I, I am, am Sam, <s> Sam, Sam I, I am, <s> I, I do, do not, not like, like green, green ham
Seen distinct bi-grams = 10
Total unseen bi-grams = 81 - 10 = 71
P(Sam I like) = (1+1)/(3+9) × (1+1)/(2+9) × (0+1)/(3+9) = 1/6 × 2/11 × 1/12 ≈ 0.00253
(Here c(<s>) = 3, c(Sam) = 2 and c(I) = 3.)
EXAMPLE CONTD…..
(iii) Smoothed bi-gram model using Good-Turing discounting:
N = 12 bi-gram tokens, N_1 = 8 (bi-grams seen once), N_2 = 2 (bi-grams seen twice: <s> I and I am)
For a bi-gram seen once, c* = (1+1) × N_2/N_1 = 2 × 2/8 = 4/8; each unseen bi-gram receives an equal share of the singleton mass N_1/N = 8/12 spread over the 71 unseen bi-grams.
P(Sam I like) = (4/8 × 1/12) × (4/8 × 1/12) × (8/71 × 1/12)
= 1/24 × 1/24 × 2/213
≈ 0.0000163
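A short Python sketch that reproduces the three estimates above, following the conventions of this example (start marker <s> only, V = 9, and the simple convention of dividing the Good-Turing discounted count by the total number of bi-gram tokens N); all helper names are illustrative:

from collections import Counter

sentences = ["I am Sam", "Sam I am", "I do not like green ham"]
tokenized = [("<s> " + s).split() for s in sentences]      # start marker only, no </s>

unigrams, bigrams = Counter(), Counter()
for toks in tokenized:
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

V = len(unigrams)                     # 9
N = sum(bigrams.values())             # 12 bi-gram tokens
Nc = Counter(bigrams.values())        # count of counts: N_1 = 8, N_2 = 2
unseen = V * V - len(bigrams)         # 81 - 10 = 71 unseen bi-grams

def p_mle(prev, w):
    return bigrams[(prev, w)] / unigrams[prev]

def p_laplace(prev, w):
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def p_good_turing(prev, w):
    c = bigrams[(prev, w)]
    if c == 0:
        return (Nc[1] / N) / unseen                  # singleton mass shared over unseen bi-grams
    return (c + 1) * Nc[c + 1] / Nc[c] / N           # c* / N, as in the example above

test = "<s> Sam I like".split()
for estimate in (p_mle, p_laplace, p_good_turing):
    prob = 1.0
    for prev, w in zip(test, test[1:]):
        prob *= estimate(prev, w)
    print(estimate.__name__, prob)    # 0.0, then ~0.00253, then ~0.0000163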