
Evaluating Language Models



Evaluating Language Models
• Does our language model prefer good sentences to bad ones?
– Does it assign a higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences?
• We train the parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
– A test set is an unseen dataset that is different from our training set and totally unused during training.

Evaluating Language Models
Extrinsic Evaluation
• Extrinsic evaluation of an N-gram language model is to use it in an application and measure how much the application improves.
• To compare two language models A and B:
– Use each language model in a task such as a spelling corrector or an MT system.
– Get an accuracy for A and for B:
• How many misspelled words were corrected properly?
• How many words were translated correctly?
– Compare the accuracies of A and B.
• The model that produces the better accuracy is the better model.
Evaluating Language Models
Intrinsic Evaluation
• An intrinsic evaluation metric is one that measures the quality of a model independently of any application.
• Given a corpus of text, to compare two different n-gram models:
– Divide the data into training and test sets,
– Train the parameters of both models on the training set, and
– Compare how well the two trained models fit the test set.
• Whichever model assigns a higher probability to the test set is the better model.
• In practice, a probability-based metric called perplexity is used instead of raw probability as our metric for evaluating language models.
Evaluating Language Models
Perplexity
• The best language model is one that best predicts an unseen test set
– Gives the highest P(test set)
• The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words.
• Minimizing perplexity is the same as maximizing probability.
• The perplexity PP for a test set W = w1 w2 … wN is:

PP(W) = P(w1 w2 … wN)^(-1/N)

which, by the chain rule, equals

PP(W) = ( Π i=1..N  1 / P(wi | w1 … wi-1) )^(1/N)

• The perplexity PP for bigrams:

PP(W) = ( Π i=1..N  1 / P(wi | wi-1) )^(1/N)
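As an illustration of the definitions above, here is a minimal Python sketch of bigram perplexity; the probability table, sentence markers, and function name are hypothetical stand-ins, not values from the slides.

```python
import math

# Hypothetical bigram probabilities P(wi | wi-1); a real model would
# estimate these from a training corpus.
bigram_prob = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33,
    ("want", "chinese"): 0.0065, ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}

def bigram_perplexity(words):
    """PP(W) = (product of 1/P(wi | wi-1))^(1/N) over the N test bigrams."""
    log_prob, n, prev = 0.0, 0, "<s>"
    for w in words + ["</s>"]:
        p = bigram_prob.get((prev, w), 0.0)
        if p == 0.0:
            return float("inf")        # a single unseen bigram makes PP infinite
        log_prob += math.log(p)
        n += 1
        prev = w
    return math.exp(-log_prob / n)     # the N-th root of 1 / P(W)

print(bigram_perplexity("i want chinese food".split()))  # ≈ 5.6
```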
Evaluating Language Models
Perplexity as branching factor
• Perplexity can be seen as the weighted average branching factor of a language.
– The branching factor of a language is the number of possible next words that can follow any word.
• Suppose a sentence consists of random digits.
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?

PP(W) = ( (1/10)^N )^(-1/N) = 10


Evaluating Language Models
Perplexity
• Lower perplexity = better model
• Example: models trained on 38 million words and tested on 1.5 million words of WSJ text:
– Unigram perplexity: 962, Bigram: 170, Trigram: 109



Generalization and Zeros
• The n-gram model, like many statistical models, is dependent on the training corpus.
– The probabilities often encode specific facts about a given training corpus.
• N-grams only work well for word prediction if the test corpus looks like the training corpus.
– In real life, it often doesn’t.
– We need to train robust models that generalize!
– One kind of generalization: getting rid of zeros!
• Things that don’t ever occur in the training set, but occur in the test set.

• Zeros, things that never occur in the training set but do occur in the test set, cause problems for two reasons (see the sketch below):
– First, we underestimate the probability of all sorts of words that might occur;
– Second, if the probability of any word in the test set is 0, the entire probability of the test set is 0.
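A small sketch of the zero problem, assuming a toy training corpus; the sentences and counts below are illustrative only.

```python
from collections import Counter

train = "i want chinese food . i want english food .".split()
test = "i want spanish food .".split()

bigram_counts = Counter(zip(train, train[1:]))
unigram_counts = Counter(train)

# Unsmoothed MLE: P(wi | wi-1) = C(wi-1 wi) / C(wi-1)
prob = 1.0
for prev, w in zip(test, test[1:]):
    if bigram_counts[(prev, w)] == 0:
        prob = 0.0        # "want spanish" never occurs in training ...
        break             # ... so the whole test set gets probability 0
    prob *= bigram_counts[(prev, w)] / unigram_counts[prev]

print(prob)  # 0.0
```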
Unknown Words
• We have to deal with words we haven’t seen before, which we call unknown words.
• We can model these potential unknown words in the test set by adding a pseudo-word called <UNK> into our training set too.

• One way to handle unknown words is:
– Replace words in the training data by <UNK> based on their frequency.
• For example, we can replace by <UNK> all words that occur fewer than n times in the training set, where n is some small number, or
• Equivalently, select a vocabulary size V in advance (say 50,000), choose the top V words by frequency, and replace the rest by <UNK>.
– Proceed to train the language model as before, treating <UNK> like a regular word (a sketch follows below).
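A minimal sketch of the frequency-based <UNK> replacement described above; the threshold and the toy corpus are illustrative, not from the slides.

```python
from collections import Counter

def replace_rare_with_unk(tokens, min_count=2):
    """Replace every token seen fewer than min_count times with <UNK>."""
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else "<UNK>" for w in tokens]

train = "the cat sat on the mat the dog sat".split()
print(replace_rare_with_unk(train))
# ['the', '<UNK>', 'sat', '<UNK>', 'the', '<UNK>', 'the', '<UNK>', 'sat']
```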
Smoothing
• To keep a language model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen.
• This modification is called smoothing (or discounting).
• There are many ways to do smoothing, and some of them are:
– Add-1 smoothing (Laplace smoothing)
– Add-k smoothing
– Backoff
Laplace Smoothing
• The simplest way to do smoothing is to add one to all the counts before we normalize them into probabilities.
– All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on.
• This algorithm is called Laplace smoothing (Add-1 smoothing).
• We pretend that we saw each word one more time than we did, and we just add one to all the counts!



Laplace Smoothing (Add-1 Smoothing)
Laplace Smoothing for Unigrams
• The unsmoothed maximum likelihood estimate of the unigram probability of the word wi is its count ci normalized by the total number of word tokens N:

P(wi) = ci / N

• Laplace smoothing adds one to each count. Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations:

PLaplace(wi) = (ci + 1) / (N + V)

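A quick sketch of the two unigram estimates above on a toy corpus; the corpus and the word "spanish" are stand-ins to show the unseen-word case (which in practice would be mapped to <UNK> first).

```python
from collections import Counter

train = "i want chinese food i want english food".split()
counts = Counter(train)
N = len(train)    # total word tokens (8)
V = len(counts)   # vocabulary size (5)

def p_mle(w):
    return counts[w] / N              # P(wi) = ci / N

def p_laplace(w):
    return (counts[w] + 1) / (N + V)  # PLaplace(wi) = (ci + 1) / (N + V)

print(p_mle("want"), p_laplace("want"))        # 0.25 vs 3/13 ≈ 0.23
print(p_mle("spanish"), p_laplace("spanish"))  # 0.0  vs 1/13 ≈ 0.08
```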


Laplace Smoothing for Bigrams
• The normal bigram probabilities are computed by normalizing each bigram count by the unigram count of the preceding word:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

• Add-one smoothed bigram probabilities:

PLaplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
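A sketch of the unsmoothed and add-one bigram estimates above, assuming counts from a toy corpus; none of the numbers below come from the Berkeley Restaurant Project data.

```python
from collections import Counter

train = "<s> i want chinese food </s> <s> i want english food </s>".split()
bigram_counts = Counter(zip(train, train[1:]))
unigram_counts = Counter(train)
V = len(unigram_counts)   # vocabulary size (7, including <s> and </s>)

def p_mle(prev, w):
    # P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def p_laplace(prev, w):
    # PLaplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_mle("want", "chinese"))      # 1/2 = 0.5
print(p_laplace("want", "chinese"))  # (1 + 1) / (2 + 7) ≈ 0.22
print(p_laplace("want", "spanish"))  # the unseen bigram now gets 1/9 ≈ 0.11
```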


Laplace-smoothed Bigrams
Corpus: Berkeley Restaurant Project Sentences



Laplace-smoothed Bigrams
Corpus: Berkeley Restaurant Project Sentences: Adjusted counts

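The adjusted counts referenced in the table above are the Laplace-smoothed probabilities converted back to count scale; in the notation of the previous slides, the standard reconstructed count is:

c*(wn-1 wn) = (C(wn-1 wn) + 1) × C(wn-1) / (C(wn-1) + V)

These reconstructed counts are what the discounts on the next slide compare against the original counts.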


Add-k Smoothing
• Add-one smoothing makes a very big change to the counts.
– C(want to) changed from 608 to 238!
– P(to|want) decreases from .66 in the unsmoothed case to .26 in the smoothed case.
– Looking at the discount d shows us how much the counts for each prefix word have been reduced:
• The discount for the bigram “want to” is .39, while the discount for “Chinese food” is .10, a factor of 10.

• The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros.
• One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events.
• Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?).
• This algorithm is called add-k smoothing (a sketch follows below).
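A sketch of add-k smoothing on the same toy bigram counts as the earlier Laplace sketch; the value of k is a tunable constant, not one taken from the slides.

```python
from collections import Counter

train = "<s> i want chinese food </s> <s> i want english food </s>".split()
bigram_counts = Counter(zip(train, train[1:]))
unigram_counts = Counter(train)
V = len(unigram_counts)

def p_add_k(prev, w, k=0.05):
    # PAdd-k(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + k * V)
    return (bigram_counts[(prev, w)] + k) / (unigram_counts[prev] + k * V)

# With a small k, far less probability mass moves to the unseen bigrams
print(p_add_k("want", "chinese"))  # (1 + 0.05) / (2 + 0.35) ≈ 0.45
print(p_add_k("want", "spanish"))  # 0.05 / 2.35 ≈ 0.02
```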
